Thursday, June 24, 2010

Distributed Data Structures

I'm trying to catch up with some of the fundamental ideas behind giant Internet systems such as Google, Amazon, Facebook, etc. Modern systems in these areas include things like: Map/Reduce, Hadoop, GFS, Project Voldemort, Dynamo, etc. So I'm slowly working my way through some of the underlying research papers on the subject.

One early paper in this area is: Scalable, Distributed Data Structures for Internet Service Construction. This project demonstrated many of the basic principles about distributed data structures:

In this paper, we bring scalable, available, and consistent data management capabilities to cluster platforms by designing and implementing a reusable, cluster-based storage layer, called a distributed data structure (DDS), specifically designed for the needs of Internet services. A DDS presents a conventional single site in-memory data structure interface to applications, and durably manages the data behind this interface by distributing and replicating it across the cluster.


The extremely important data structure that the paper discusses is the distributed hash table:

The API provides services with put(), get(), remove(), create(), and destroy() operations on hash tables. Each operation is atomic, and all services see the same coherent image of all existing hash tables through this API. Hash table names are strings, hash table keys are 64 bit integers, and hash table values are opaque byte arrays; operations affect hash table values in their entirety.

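To make that API concrete, here is a minimal client-side sketch of the interface the quote describes; the class name and the in-memory dictionary backing it are purely illustrative stand-ins for the real cluster-backed, replicated store, not the paper's code.

```python
# Illustrative sketch of the DDS hash table API described in the paper.
# The in-memory dict is a stand-in for the cluster-backed store; only the
# operation names and types come from the paper.

class DDSHashTableClient:
    def __init__(self):
        # hash table names are strings
        self._tables: dict[str, dict[int, bytes]] = {}

    def create(self, name: str) -> None:
        self._tables[name] = {}

    def destroy(self, name: str) -> None:
        del self._tables[name]

    def put(self, name: str, key: int, value: bytes) -> None:
        # keys are 64-bit integers, values are opaque byte arrays;
        # a put replaces the value in its entirety
        self._tables[name][key] = value

    def get(self, name: str, key: int) -> bytes | None:
        return self._tables[name].get(key)

    def remove(self, name: str, key: int) -> None:
        self._tables[name].pop(key, None)
```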

The basic implementation described in the paper uses a collection of "storage bricks", with a two-phase-commit update layered above the bricks, issuing updates to all replicas of a hash table partition in a consistent fashion. As we will see in later discussions, many other systems have relaxed these aspects of the implementation for a variety of reasons.

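As a rough illustration of that two-phase pattern (my own sketch, not the paper's implementation), a coordinator can stage an update on every replica brick of a partition, commit only if all of them vote yes, and abort otherwise; the Brick class and its prepare/commit/abort methods are assumed names.

```python
# Hypothetical two-phase-commit sketch for updating all replicas of a
# hash table partition. Brick and its methods are illustrative, not the
# paper's actual interfaces.

class Brick:
    """One storage brick holding a replica of a partition."""

    def __init__(self):
        self.store: dict[int, bytes] = {}
        self.pending: dict[int, bytes] = {}

    def prepare(self, key: int, value: bytes) -> bool:
        # Phase 1: stage the update and vote yes (True) or no (False).
        self.pending[key] = value
        return True

    def commit(self, key: int) -> None:
        # Phase 2: apply the staged update.
        self.store[key] = self.pending.pop(key)

    def abort(self, key: int) -> None:
        self.pending.pop(key, None)


def replicated_put(replicas: list[Brick], key: int, value: bytes) -> bool:
    """Apply the put only if every replica votes yes in phase one."""
    if all(brick.prepare(key, value) for brick in replicas):
        for brick in replicas:
            brick.commit(key)
        return True
    for brick in replicas:
        brick.abort(key)
    return False
```

Keeping every replica in lock-step like this is what gives the DDS its single-image consistency, and it is precisely the cost that later, more relaxed systems decided to trade away.
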
Perhaps the most interesting part of the paper is the metadata management, termed "metadata maps" in the paper:

The first map is called the data partitioning (DP) map. Given a hash table key, the DP map returns the name of the key's partition. The DP map thus controls the horizontal partitioning of data across the bricks.


Their implementation uses the trie data structure, one of the grand old ladies of search trees, to index the key space bit by bit from LSB to MSB:

the DP map is a trie over hash table keys; to find a key's partition, key bits are used to walk down the trie, starting from the least significant key bit until a leaf node is found. As the cluster grows, the DP trie subdivides in a "split" operation.

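As a rough sketch of that lookup (the node layout is my assumption, not the paper's code), the DP map can be pictured as a binary trie walked with the key's low-order bits until a leaf names the partition:

```python
# Illustrative DP-map trie lookup: consume the 64-bit key starting at the
# least significant bit until a leaf node names a partition. The node
# layout is assumed for illustration.

class TrieNode:
    def __init__(self, partition=None):
        self.partition = partition    # set only on leaf nodes
        self.children = [None, None]  # indexed by the next key bit (0 or 1)


def find_partition(root: TrieNode, key: int) -> str:
    """Return the partition name for a 64-bit hash table key."""
    node = root
    bit_index = 0
    while node.partition is None:       # assumes a well-formed trie in
        bit = (key >> bit_index) & 1    # which every key reaches a leaf
        node = node.children[bit]
        bit_index += 1
    return node.partition
```

A "split" can then replace a leaf with an internal node whose two children divide the old partition's keys on the next bit, which is how the map grows along with the cluster.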

The paper continues with a detailed discussion of implementation details, performance experiments, and a great section on how the distributed hash table ends up getting used to build higher-level services:

The hash table was a resounding success in simplifying the construction of interesting services, and these services inherited the scalability, availability, and data consistency of the hash table.


I think the primary reason this paper is still read today, a decade later, is that it is practical, concrete, and approachable, even if (as we'll see in future posts) many later systems have built upon this early work by relaxing or violating most of its basic assumptions to see what sort of systems resulted.

This paper is a great "first paper" for those trying to learn about the construction of modern Internet-scale services, and the UC Berkeley Computer Science Ninja project was an important early gathering of researchers in this area. In the last decade, many of those early researchers have founded research teams at other institutions to continue this work, and I'll return to some of those ideas later.
