Scalable Datastores
Rick Cattell

Last revised 7/30/2012


In recent years, a number of data storage systems have been developed with excellent horizontal scaling properties.  Most are commonly called "NoSQL" systems.  Horizontal scaling allows dozens or hundreds of machines to operate as a single database system, with performance improving approximately linearly with the number of machines.  This is interesting because traditional relational database systems have failed to scale well when their data is distributed over many servers (with the exception of read-mostly data warehousing).  I have been studying scalable datastores, and have written a paper comparing them:

    Datastore Comparison

This paper has now been published in ACM SIGMOD Record.  In the paper, I discuss scalable data stores and categorize them into four groups: key-value stores, document stores, extensible record stores, and scalable relational systems.

I also wrote a paper summarizing the most important elements of scalable database systems (in my opinion).  This paper is unpublished, but can be viewed here:

    Requirements for Scalability

Any input on these papers is welcomed.  You can contact me at rick(at)cattell.net.  I will post corrections and revisions on this web site (cattell.net/datastores).

If you'd like further reading on scalable SQL and NoSQL datastores, you can click on the links above to learn more about specific systems, or the links below for some other general references that I like:

You will find lots of claims about the performance and scalability of systems out there, but few apples-to-apples comparisons.  In my opinion, the best scalability benchmark today is the Yahoo Cloud Serving Benchmark.  With the help of Roberto Zicari, I interviewed two authors of the Yahoo benchmark paper:

  YCSB Interview

I'm trying to encourage others to run the Yahoo (YCSB) benchmark as well.  This will hopefully reduce some of the scalability hype and confusion out there.

I've also recently written a paper with Mike Stonebraker, weighing the important factors in making a datastore scalable:

    CACM paper

This paper has some discussion of SQL vs NoSQL scalability.  A revision of the paper has been published in Communications of the ACM (June 2011).