satine.org

by Charles Ying

What You Need To Know About Amazon SimpleDB

Thursday, December 13th, 2007

Well, after being under NDA for so long, I’m glad to be able to say that Amazon SimpleDB has gone into limited beta. Congratulations to everyone on the SimpleDB team (the service was formerly called SDS); their several years of work have produced a brilliant piece of engineering.

What’s cool about SimpleDB

  • Really large data sets
  • Really fast
  • Highly Available – It’s Amazon. Running Erlang. Whoa.
  • On demand scaling – Like S3, EC2, with a sensible data metering pricing model
  • Schemaless – major cool factor for me here; items are little hash tables containing sets of key, value pairs

Considerations you’ll want to think about

  • Eventual Consistency – Data is not immediately propagated across all nodes. The latency is usually around a second, but with large data sets or heavy loads you may see more. On the plus side, your data isn’t lost!
  • Queries are lexicographical – You’ll need to store data in lexicographically ordered form (zero-pad your integers, add a positive offset to integer sets that include negatives, and convert dates into something like ISO 8601)
  • Search Indexes – SimpleDB query expressions don’t support text search, so you’ll have to construct your own inverted indexes to do “text search” properly. Building them on SimpleDB is actually a really great lightweight way to do this, and I’m sure many interesting indexing schemes will be possible.
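
To make the lexicographical point concrete, here is a minimal Python sketch of the encoding tricks listed above. It is not part of any SimpleDB library: the function names, the 10-digit width, and the offset of one billion are my own assumptions, so pick values that cover your data. It also assumes you pass timezone-aware datetimes.

```python
from datetime import datetime, timezone

# Hypothetical width and offset, chosen only for illustration.
INT_WIDTH = 10
INT_OFFSET = 1_000_000_000  # shifts values >= -1,000,000,000 to be non-negative


def encode_int(value: int) -> str:
    """Offset and zero-pad an integer so string order matches numeric order."""
    shifted = value + INT_OFFSET
    if not 0 <= shifted < 10 ** INT_WIDTH:
        raise ValueError("value outside the encodable range")
    return str(shifted).zfill(INT_WIDTH)


def decode_int(text: str) -> int:
    """Reverse encode_int."""
    return int(text) - INT_OFFSET


def encode_datetime(moment: datetime) -> str:
    """Fixed-width ISO 8601 in UTC, which sorts chronologically as a string."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

With these settings, encode_int(-5) gives "0999999995" and encode_int(42) gives "1000000042", so the negative value sorts first when compared as strings.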

Under the hood

According to the SimpleDB team, SimpleDB is built on top of Erlang. One of the developers, Jim Larson, and I worked together at Sendmail, and he was part of a team doing some amazing stuff with an Erlang message store way back in 2000.

While you don’t need to know Erlang to use SimpleDB, many people have visited here interested in its Erlang roots. If you are interested in learning Erlang, I can recommend Programming Erlang, written by Erlang’s creator – the best introduction you can find. I’ve associate-linked to it on Amazon, just for a little meta-fun.

The data model is simply:

  • Large collections of items organized into domains.
  • Items are little hash tables containing attributes of key, value pairs.
  • Attributes can be searched with various lexicographical queries.
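
If it helps to picture it, the model above can be mimicked in a few lines of Python. This is only an in-memory stand-in, not the SimpleDB API: the names domains, put, and query_range are hypothetical, and I'm assuming each attribute can hold a set of string values.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the data model; not the SimpleDB API.
# domain name -> item name -> attribute name -> set of string values
domains = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))


def put(domain: str, item: str, attribute: str, value: str) -> None:
    """Add one value to an attribute; an attribute may hold several values."""
    domains[domain][item][attribute].add(value)


def query_range(domain: str, attribute: str, low: str, high: str) -> list:
    """Return names of items with any value in [low, high], compared as strings."""
    return sorted(
        name
        for name, attributes in domains[domain].items()
        if any(low <= value <= high for value in attributes.get(attribute, ()))
    )


put("books", "item1", "year", "2007")
put("books", "item1", "tag", "erlang")
put("books", "item1", "tag", "databases")
put("books", "item2", "year", "1999")
matches = query_range("books", "year", "2000", "2010")  # ["item1"]
```

Note that the comparison is on strings, which is exactly why numbers and dates need to be stored in a lexicographically ordered form.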

Now you can easily build:

  • Search indexes
  • Log databases / analysis tools
  • Data mining stores
  • Tools for World Domination
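
As a taste of the first item, here is a minimal inverted index sketch in plain Python. It runs entirely in memory and makes no SimpleDB calls; the function names and the simple word-splitting rule are my own assumptions. One way to carry the idea over would be to store each word as an item whose attribute holds the matching document ids, but that mapping is a guess on my part, not a documented recipe.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(documents: dict) -> dict:
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))


docs = {
    "a": "Erlang message store",
    "b": "Amazon SimpleDB runs Erlang",
    "c": "Python module",
}
index = build_index(docs)
hits = search(index, "Erlang Amazon")  # {"b"}
```

Queries are treated as "all words must match"; ranking, stemming, and phrase search are left out to keep the sketch short.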

Further Reading

I also wrote a very basic Python module for SimpleDB to handle the XML and REST stuff (too bad it’s not JSON, at least for now), which I’ll release as soon as I figure out how much of the NDA is now lifted. A few other modules are floating around, so it shouldn’t be too long before they appear publicly.

Updates:

  • Added a link to Nick Christenson‘s paper on Sendmail’s Erlang message store – A great read for those of you building large scale messaging systems or anything in Erlang.
  • Added a link to Werner Vogels’ article on eventual consistency – a great background behind SimpleDB’s consistency design choice.
  • Whether or not SimpleDB and Dynamo are the same underlying technology has never been confirmed by an authoritative source. That’s all I’m allowed to say.
