HDFS reliability
Jan 14, 2009 1:26:52 PM, steve

Tom White has put up a lovely document on HDFS reliability.

I'd back up what he says about configuration. Only change options (replication, block size) if you are prepared to become the first person to test a cluster with those options set. Hadoop really needs a tool to push a cluster through a set of configurations under semi-random control, to see what happens.
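
To make that concrete, here is a minimal sketch of a client overriding both settings for the files it writes. It assumes the Hadoop 0.x-era property names dfs.replication and dfs.block.size; check them against your version before trusting the example, and remember that every file written this way makes you the tester of that combination.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CustomBlockSettings {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // These override whatever hdfs-site.xml ships with.
        conf.setInt("dfs.replication", 3);                 // copies of each block
        conf.setLong("dfs.block.size", 64L * 1024 * 1024); // 64MB blocks

        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path(args[0]));
        out.writeUTF("written with per-client replication and block size");
        out.close();
      }
    }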

One thing I'd add is my rule about networks: programs work best in the network environment in which they were written. Sun has a well-managed DNS infrastructure and not many laptops, so Java doesn't like changing network addresses or DNS playing up. A lot of Linux software is written at home, hence proxy configuration is a nightmare that some apps simply don't support. And Hadoop?
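
On the Java side, part of the problem is that the JVM caches successful DNS lookups, potentially forever under a security manager. The sketch below just relaxes that cache through the standard networkaddress.cache.ttl security properties so address changes are eventually noticed; it's an illustration of the assumption, not a fix Hadoop applies.

    import java.net.InetAddress;
    import java.security.Security;

    public class DnsCacheCheck {
      public static void main(String[] args) throws Exception {
        // How long successful lookups are cached, in seconds;
        // -1 means "forever", which hurts on laptops and anywhere
        // addresses change underneath you.
        Security.setProperty("networkaddress.cache.ttl", "60");
        Security.setProperty("networkaddress.cache.negative.ttl", "10");

        String host = args.length > 0 ? args[0] : "localhost";
        InetAddress addr = InetAddress.getByName(host);
        System.out.println(host + " -> " + addr.getHostAddress());
      }
    }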

It assumes DNS is working, IP addresses are stable, and that nobody malicious is in the datacentre. There's probably an implicit assumption that all clocks are moving forward at the same rate; VM-hosted code can break lots of programs that way, though it's usually a sign that your VM server farm is overloaded. These assumptions all hold for the well-managed datacentres of Yahoo!, Facebook and the like. They don't all hold on, say, Amazon EC2, where anyone can run code on the same subnet for a few cents an hour, and so scan all your ports without paying for the bandwidth. They don't hold in my house either, which causes problems for me, but for nobody else.
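
The clock assumption is easy to test on a single box. This toy sketch (mine, not anything Hadoop ships with) samples the wall clock and complains if it ever runs backwards, which is exactly what VM-hosted guests do under an overloaded hypervisor.

    public class ClockSkewWatch {
      public static void main(String[] args) throws InterruptedException {
        long last = System.currentTimeMillis();
        for (int i = 0; i < 60; i++) {
          Thread.sleep(1000);
          long now = System.currentTimeMillis();
          if (now < last) {
            System.err.println("Wall clock went backwards by " + (last - now) + "ms");
          }
          last = now;
        }
      }
    }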

What could Hadoop do here? I think the network assumptions should be documented better, so we know what to set up and what to expect. I also think we (and it's a "we" :) could provide better diagnostics, to identify whether things are good and, if not, what's wrong.
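
The sort of diagnostic I have in mind is nothing clever. Here's a sketch, not anything that exists in Hadoop today: check that forward and reverse DNS agree for the local host, and warn if they don't, because that is one of the assumptions the cluster quietly relies on.

    import java.net.InetAddress;

    public class NetworkSanityCheck {
      public static void main(String[] args) throws Exception {
        InetAddress local = InetAddress.getLocalHost();
        String hostname = local.getCanonicalHostName();
        String address = local.getHostAddress();

        // Forward lookup: does our hostname resolve to our address?
        String forward = InetAddress.getByName(hostname).getHostAddress();
        // Reverse lookup: does our address map back to our hostname?
        String reverse = InetAddress.getByName(address).getCanonicalHostName();

        System.out.println("hostname: " + hostname);
        System.out.println("address:  " + address);
        System.out.println("forward:  " + forward);
        System.out.println("reverse:  " + reverse);

        if (!forward.equals(address) || !reverse.equalsIgnoreCase(hostname)) {
          System.err.println("WARNING: forward and reverse DNS disagree;"
              + " expect trouble from anything that assumes they match.");
        }
      }
    }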
