In recent years, a number of data storage systems have been
developed with excellent horizontal scaling properties. Most are
commonly called "NoSQL" systems. Horizontal scaling allows
dozens or hundreds of machines to operate as a single database
system, with performance improving approximately linearly with the
number of machines. This is interesting because traditional
relational database systems have failed to scale well when their data
is distributed over many servers (with the exception of
read-mostly data warehousing). I have been studying scalable
datastores, and have written a paper comparing them:
Datastore Comparison
This paper has now been published in ACM
SIGMOD Record. In the paper, I discuss scalable data
stores and categorize them into four groups:
- Key-value stores, including Voldemort, Riak, Redis, Membase,
Membrain, and Dynamo.
- Document stores, including CouchDB, MongoDB, Terrastore, and
SimpleDB.
- Extensible record stores, including BigTable, HBase, HyperTable,
and Cassandra.
- Scalable RDBMSs, including MySQL Cluster, VoltDB, Clustrix, and
ScaleDB.
I also wrote a paper summarizing the most important elements of
scalable database systems (in my opinion). This paper is
unpublished, but can be viewed here:
Requirements for Scalability
Any input on these papers is welcome. You can contact me at
rick(at)cattell.net. I will post corrections and revisions on
this web site (cattell.net/datastores).
If you'd like further reading on scalable SQL and NoSQL datastores,
you can click on the links above to learn more about specific
systems, or the links below for some other general references that I
like:
- NoSQL.mypopescu.com
is frequently updated with posts, videos, and articles on NoSQL
topics and systems.
- NoSQL-Database.org
has lots of good articles, upcoming events, and a complete list
of all the systems.
- NoSQLDatabases.com
has regular postings, jobs, upcoming events, and links.
- HighScalability.com
has good articles on scalability of databases and applications.
- Tim Anglade's NoSQL Tapes include great interviews with leading
NoSQL players.
- NoSQLWeekly.com offers
weekly newsletters on NoSQL systems from Rahul Chaudhary.
- Krishna Sankar's blog had lots of good NoSQL and cloud computing
posts, along with web references.
- Jonathan Ellis has broad knowledge of NoSQL systems and has good
posts on his blog. There are some interesting discussions
in the responses as well.
- Schooner has a
good blog on an important trend I don't cover in my paper:
effectively using solid state disks as a third level in the
RAM/disk hierarchy.
- ODBMS.org
has various papers and links on NoSQL systems as well as other
object stores.
You will find lots of claims about the performance and scalability
of systems out there, but few apples-to-apples comparisons. In
my opinion, the best scalability benchmark today is the Yahoo! Cloud
Serving Benchmark (YCSB). With the help of Roberto Zicari, I
interviewed two authors of the Yahoo benchmark paper:
YCSB Interview
I'm trying to encourage others to run the Yahoo (YCSB) benchmark as
well. This will hopefully reduce some of the scalability
hype and confusion out there.
I've also recently written a paper with Mike Stonebraker
weighing the important factors in making a datastore scalable:
CACM paper
This paper has some discussion of SQL vs NoSQL scalability. A
revision of the paper has been published in Communications of the ACM (June
2011).