Steve: Developing on the Edge - Cloud MapReduce
Thoughts on development, Web-services, technology and mountains.
Tue 19 Jan 2010
Cloud MapReduce

Someone forwarded me a link to an AWS Blog entry, Cloud MapReduce, which looks at Accenture's prototype MR engine built directly on top of the AWS stack: S3, SimpleDB, the message queue, etc.

The paper is worth a read. They are very pleased about how few LOC it took to get working, and how it can be 60X faster than Hadoop.

My initial thoughts:

  1. A lot of the speedup comes from not shuffling; in Hadoop, shuffling/sorting is optional. If you need to do it, and you do it in the right place, it pays off.
  2. Physical Hadoop clusters normally bottleneck at the disk IO rates. Perhaps they are comparing Hadoop-on-EC2-VM performance with their Cloud MR's.
  3. The LOC comparison is somewhat flawed, as it doesn't include the lines of code needed for S3 (Java-based, I believe), the EC2 infrastructure, the message queue, the database, etc.
  4. Those LOC are the lines that Accenture gets to maintain; from a maintenance cost perspective, the low number of lines is better.
  5. The SPOFs have not gone away, just moved. No, the namenode may not fail, but all I have to do is report your credit card stolen to the bank and your cluster goes offline.
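
To make the first point concrete, here is a minimal sketch in Python. It is not taken from the paper or from Hadoop; it is a toy word count where the grouping step either sorts the intermediate pairs (roughly what a sort-based shuffle gives you) or just buckets them by key in a hash table. All the function names are my own, purely for illustration.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def group_sorted(pairs):
    """Sort-based grouping: sort by key, then collect runs of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in run])
            for key, run in groupby(ordered, key=itemgetter(0))]


def group_hashed(pairs):
    """Hash-based grouping: no sort, keys come out in first-seen order."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return list(buckets.items())


def word_count(lines, group):
    """Run the map step, the chosen grouping step, then a summing reduce."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return [(key, sum(values)) for key, values in group(pairs)]


lines = ["the cloud the queue", "the cloud"]
print(word_count(lines, group_sorted))  # [('cloud', 2), ('queue', 1), ('the', 3)]
print(word_count(lines, group_hashed))  # [('the', 3), ('cloud', 2), ('queue', 1)]
```

Both versions give the same counts. Only the sorted one hands the reducer its keys in order, and that ordering is what you pay for when you shuffle and sort; if the job does not need it, skipping the sort is cheaper.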

It's interesting in that it shows that a base cloud "stack" is important, and that there is more to it than just VM hosting. You do benefit from a highly available, highly scalable filestore with direct (no need for a VM) remote access. Some kind of database is good, as is a message queue. Sometimes. And yes, you can write/rewrite applications that only work in this single environment. And once you do that, whoever owns the stack owns your code and your data.
