Mohs’ law and big data (Hadoop is hard)

I’ve spent more time than usual the past two weeks talking with people, and listening to people, about Hadoop. I’ve been administering Hadoop clusters for (part of) a living for about 4-5 years now, and I’ve gotten pretty good at answering questions people don’t have, or want, answers for.

In the past week or so I’ve heard one vendor claim that Hadoop gives you a free analytics environment with no need for expensive developers since it’s free software, and another vendor claim that you can just virtualize Hadoop by putting lots of datanodes on a single host and save lots of money. Easy peasy, right?

No. Just no.

I’m proposing we consider Mohs’ Law in this situation.

No, I’m not misspelling Moore’s Law, which observes that the number of transistors on a chip (and, loosely speaking, compute power) doubles roughly every 24 months. I’m suggesting a law that’s more of a diamond in the rough, if you don’t mind.

Hadoop is hard. 

The name comes from Friedrich Mohs, who devised a scale for describing the hardness of minerals about 200 years ago. And it’s a great pun. But it’s also a reminder that “yum install” does not a production application make.

But Rob, I can get Hadoop in 15 minutes!

It is pretty easy to get started with Hadoop. It’s even free of charge to get started (or even to go into production) with the platform itself. I recommend it. Go do it now. I’ll wait.

For starters, go grab the Cloudera QuickStart VM or the Hortonworks Sandbox VM from their respective websites. Pull it into your desktop virtualization platform of choice. Look at the docs. Run some of the tests. At that point you’re farther along than most people who promote Hadoop.

But at that point you don’t have a functioning business intelligence/data warehouse/analytics application environment, any more than installing Ubuntu 13.04 into VirtualBox gives you a production e-commerce site.

There’s still a lot of work to be done. Some of it is difficult, but a fair bit of it is just downright hard. Understand what you want to do and what data you can pull into your environment. Figure out what your customers/users/analysts need out of the data. Make sure you can validate the output. Automate all your tests. Go back to your data sources and make sure you’re getting all the data. Go back to your end users and make sure you’re giving them what they want. Lather, rinse, repeat.
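To make "validate the output" a little more concrete, here is a minimal Python sketch that reconciles per-day record counts between a source system and whatever landed in your cluster. The function name, the dict shapes, and the dates are all hypothetical, and none of this comes from a Hadoop tool; in real life you’d pull the counts from your own sources and your own job output.

```python
def reconcile_counts(source_counts, output_counts, tolerance=0):
    """Return {key: (expected, actual)} for keys whose counts disagree.

    Both arguments are plain dicts mapping a key (here, a date string)
    to a record count. A key missing from either side counts as zero.
    """
    mismatches = {}
    for key in sorted(set(source_counts) | set(output_counts)):
        expected = source_counts.get(key, 0)
        actual = output_counts.get(key, 0)
        if abs(expected - actual) > tolerance:
            mismatches[key] = (expected, actual)
    return mismatches


if __name__ == "__main__":
    # Hypothetical counts: what the source system says it sent,
    # versus what the analytics job says it processed.
    source = {"2013-05-01": 1200, "2013-05-02": 1150, "2013-05-03": 980}
    output = {"2013-05-01": 1200, "2013-05-02": 1100}

    for day, (expected, actual) in reconcile_counts(source, output).items():
        print(f"{day}: expected {expected}, got {actual}")
```

A check like this is cheap to run after every load, which is the point of automating it: you find out the day data goes missing, not the day an analyst asks why the chart looks funny.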

Rob’s Corollaries to Mohs’ Law

If you remember nothing else, think about an analytics environment the way you would a monitoring environment. I’ve supported both for almost a decade, and the take-home I’ll save you ten years on is this:

Make sure you’re measuring what you think you’re measuring.

Make sure you’re measuring what you need to be measuring.

These rules also apply to a lot of other technology… customer surveys, dating sites, and so forth. But it takes formidable effort to get these two corollaries right (without coronaries), and even if you do throw together something with Insta-analytics.com (probably not a real site, not meant as an endorsement), they won’t be able to tell you what you need or whether you’re getting it.

So where do we go from here?

First of all, if you’re interested in getting familiar with Hadoop, go grab one of the VMs mentioned above and give it a try. Simulate Pi Indiana-style. Grab a book and try some of the stuff it suggests.
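If you want a feel for what a pi-estimating job is doing before you run one on the VM, here is a rough sketch in plain Python. It is not Hadoop code, and the function names are made up for illustration. It assumes the simple dartboard (Monte Carlo) approach: each "task" counts random points that land inside a quarter circle, and a final sum plays the role of the reduce step. The example bundled with your distribution may sample its points differently.

```python
import random


def count_hits(num_samples, seed):
    """'Map' step: count random points in the unit square that fall
    inside the quarter circle of radius 1."""
    rng = random.Random(seed)  # per-task seed keeps runs repeatable
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_tasks=10, samples_per_task=10_000):
    """'Reduce' step: add up the per-task hits and scale to an estimate.

    The quarter circle covers pi/4 of the unit square, so
    pi is roughly 4 * hits / samples.
    """
    total_hits = sum(count_hits(samples_per_task, seed)
                     for seed in range(num_tasks))
    total_samples = num_tasks * samples_per_task
    return 4.0 * total_hits / total_samples


if __name__ == "__main__":
    print(estimate_pi())
```

More tasks or more samples per task tighten the estimate, which is the same knob you turn when you run the job on a real cluster.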

Then, go talk to the BI team in your company, or the analyst who does performance dashboards when she’s not writing code and designing employee event signage and chasing your kids out of the server closet, or whoever. Find out what they’re doing.

And finally, unless your vendor makes its livelihood supporting Hadoop, don’t take their take on Hadoop as gospel. Apocrypha maybe, mistranslation at worst, and probably not enough to go on.

Hey, I’m in Silicon Valley and want to learn more, what can I do?

Funny you should ask.

BayLISA is hosting a Hadoop meeting on Thursday, May 16, at Yahoo! in Sunnyvale. There’s a waiting list but it usually fades closer to the event. Come see Alan Gates of Hortonworks, Eric Sammer of Cloudera, and Ryan Orban of Nutanix talking about Hadoop innovations and how to get involved. (Disclaimer: I am president of BayLISA, but I don’t get any profit or direct benefit if people come to the meetups.)

There’s also a Hadoop User Group meetup on Wednesday, May 15, although it’s a bit more suited to advanced users who are already familiar with the technology. Their waitlist is also a fair bit longer. But check it out and see if it fits your needs.

If you’re not in Silicon Valley, check Meetup for local groups, or see if one of the Hadoop vendors has local meetings or events you can attend. If you find one, feel free to add it in the comments here so other people will know where to look.
