Monday, May 03, 2010

Mongo Good.

Been working on an interesting project lately where I was handed some 6 million XML files with no meaningful schema, and told to make a warehouse out of them.  Each XML file also happens to have several hundred elements in it.  With no schema it would be a nightmare to try and make one, never fully sure whether you'd allocated enough space, whether just because your sample of this field yielded integers means that it always has to be an integer .. stuff like that.

I tried a very simple "Elements and Attributes" schema that is capable of loading any XML file.  Basically each Element and each Attribute get a row in the database.  This works fine for smallish data sets, but mine quickly blew up into billions of rows and was no longer manageable.

Then I stumbled over the "No SQL" movement, where we quite literally throw out decades of knowledge about how to set up a SQL schema of tables and rows, joins and where clauses .. and replace it with something more document-driven, no schema. This sounded perfect, since that's exactly what I had.

Enter Mongo DB.  With a little help from Ruby, my loader script for 6 million XML files now looked like this:

Foreach XML file:
  Load it via REST web service
  Parse it into a JSON object
  mongo.insert(object) Done!

Done!  Now if I'm hiding the name of a software vendor somewhere in the middle of one of those objects, like say {"machine"=>{"software=>[ "instance"=>{"name"=>"Microsoft something" ... }]}}  and I want to find them?  All I need to say is  find("machine.software.instance.name" => /Microsoft/) and presto, I've got back my filtered list.  I can add a count() to get the total number of matches, or an each(...) or map(...) to process the list, or basically anything else I might want to do.  Of course I can also add an index on it to keep the performance high, too.

I'm finding this NoSQL stuff fascinating.  Takes a little while to mentally get around everything you've learned about normalizing a schema.  Instead of a dozen tables all crosslinked with each other, I've got one "collection" that is nested in a variety of ways.  But I quite literally spent 0 effort on schema design. 

I haven't even begun to touch on things like the built in sharding, which enables you to scale across multiple instances without having to worry about which server hold which portion of your data. Another player in this space, Cassandra (from the Facebook people), is most well known for being great at that. We looked at Cassandra but it had some schema setup required that I was not able to complete.  With Mongo I just inserted everything and now I'm free to query at will, no matter how complex the original key structure was.

I hope to write more here as I come up to speed on what is obviously the next big thing in the world of database design.

No comments: