Tue 19 Jan 2010 | Cloud MapReduce
Someone forwards me a link to an AWS Blog entry, Cloud
MapReduce, which looks at Accenture's prototype MR engine built
directly on top of the AWS stack: S3, SimpleDB, the message queue,
etc.
The paper is worth a read. They are very pleased about how few LOC
it took to get working, and how it can be 60X faster than Hadoop.
My initial thoughts:
- A lot of the speedup comes from not shuffling; in Hadoop,
shuffling/sorting stuff is optional. If you need to do it, and you
do it in the right place, it pays off.
- Physical Hadoop clusters normally bottleneck at the disk IO
rates. Perhaps they are comparing Hadoop-on-EC2-VM performance with
their Cloud MRs.
- The LOC comparison is somewhat flawed as it doesn't include the
lines of code needed for S3 (Java based, I believe), EC2
infrastructure, the message queue, database, etc.
- Those LOC are the lines that Accenture gets to maintain; from a
maintenance cost perspective, the low number of lines is better.
- The SPOFs have not gone, just moved. No, the namenode may not
fail, but all I have to do is report your credit card stolen to the
bank and your cluster goes offline.
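
To make the shuffling point concrete, here is a toy in-memory sketch of
a job runner where the shuffle/sort step only happens if you ask for a
reduce. It is not Hadoop's API or Accenture's code; run_job, tokenize
and count are hypothetical names made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def run_job(records, mapper, reducer=None):
    """Toy MapReduce runner (hypothetical, in-memory only).

    With no reducer the job is map-only and the shuffle/sort
    step is skipped entirely.
    """
    mapped = list(chain.from_iterable(mapper(r) for r in records))
    if reducer is None:
        # Map-only: emit pairs in map order, no grouping, no sorting.
        return mapped
    # Shuffle: group every value under its key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Sort: hand keys to the reducer in sorted order.
    return [reducer(key, groups[key]) for key in sorted(groups)]


def tokenize(line):
    # Mapper: one (word, 1) pair per word in the line.
    return [(word, 1) for word in line.split()]


def count(key, values):
    # Reducer: total the counts collected for one word.
    return (key, sum(values))


lines = ["a b a", "b c"]
word_counts = run_job(lines, tokenize, count)  # pays for the shuffle
raw_pairs = run_job(lines, tokenize)           # map-only, no shuffle
```

The word count needs the grouping, so the shuffle earns its keep there;
the map-only run never touches it, which is the cheap path.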
It's interesting in that it shows that a base cloud "stack" is
important, and that there is more to it than just VM hosting. You do
benefit from a highly available, highly scalable filestore with direct (no
need for a VM) remote access. Some kind of database is good, as is
a message queue. Sometimes. And yes, you can write/rewrite
applications that only work in this single environment. And once
you do that, whoever owns the stack owns your code and your
data.