In recent years, a number of data storage systems have been developed
with excellent horizontal scaling properties. Most are commonly called
"NoSQL" systems. Horizontal scaling allows dozens or hundreds of
machines to operate as a single database system, with performance
improving approximately linearly with the number of machines. This is
interesting because traditional relational database systems have failed
to scale well when their data is distributed over many servers (with the
exception of read-mostly data warehousing). I have been studying
scalable datastores, and have written a paper comparing them:
Datastore Comparison
This paper has now been published in ACM SIGMOD Record. In the paper,
I discuss scalable data stores and categorize them into four groups:
- Key-value stores,
including Voldemort, Riak, Redis, Membase, Membrain, and Dynamo.
- Document stores,
including CouchDB, MongoDB, Terrastore, and SimpleDB.
- Extensible record stores,
including BigTable, HBase, HyperTable, and Cassandra.
- Scalable RDBMSs,
including MySQL Cluster, VoltDB, Clustrix, and ScaleDB.
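To make the idea of horizontal scaling more concrete, here is a minimal sketch (not taken from the paper) of one common way a key-value store might spread data over several machines: hash each key and use the result to pick a node. The names `ShardedStore`, `node_for`, and `keys_per_node` are hypothetical, each "node" is just an in-memory dictionary, and replication, failure handling, and rebalancing are all left out.

```python
import hashlib
from collections import Counter


class ShardedStore:
    """Toy key-value store that spreads keys over several in-memory 'nodes'."""

    def __init__(self, num_nodes: int) -> None:
        if num_nodes < 1:
            raise ValueError("num_nodes must be at least 1")
        # Each dict stands in for one machine's local storage (an assumption
        # made for illustration; a real system would talk to remote servers).
        self.nodes = [dict() for _ in range(num_nodes)]

    def node_for(self, key: str) -> int:
        # Use a stable hash rather than Python's per-process salted hash(),
        # so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, key: str, value: object) -> None:
        self.nodes[self.node_for(key)][key] = value

    def get(self, key: str, default=None):
        return self.nodes[self.node_for(key)].get(key, default)


def keys_per_node(store: ShardedStore) -> Counter:
    """Count how many keys each node currently holds."""
    return Counter({i: len(node) for i, node in enumerate(store.nodes)})


if __name__ == "__main__":
    store = ShardedStore(num_nodes=4)
    for i in range(10_000):
        store.put(f"user:{i}", {"id": i})
    # Each node should end up with roughly a quarter of the keys.
    print(keys_per_node(store))
```

Because every read and write touches only the node that owns the key, adding nodes spreads both data and load, which is the intuition behind the roughly linear scaling described above. Simple modulo placement remaps most keys whenever the node count changes, so it should be read as an illustration rather than as how any of the listed systems actually partitions data.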
Any input on the paper is welcome, particularly from experts on
specific systems. You can contact me at rick(at)cattell.net. I will
post corrections and revisions of the paper on this web site
(cattell.net/datastores).
If you'd like further reading on scalable SQL and NoSQL datastores, you
can click on the links above to learn more
about specific systems, or the links below for some other general
references that I like:
- NoSQL.mypopescu.com is
frequently updated with posts, videos, and articles on NoSQL topics and
systems.
- NoSQL-Database.org has
lots of good articles, upcoming events, and a complete list of all
the systems.
- NoSQLDatabases.com
has regular postings, jobs, upcoming events, and links.
- HighScalability.com has
good articles on scalability of databases and applications.
- Tim Anglade's NoSQL Tapes include great interviews with leading
NoSQL players.
- NoSQLWeekly.com offers
weekly newsletters on NoSQL systems from Rahul Chaudhary.
- Krishna Sankar's blog had lots of good NoSQL and cloud computing
posts, along with web references.
- Jonathan Ellis has broad knowledge of NoSQL systems and has good
posts on his blog. There are some interesting discussions in the
responses as well.
- Schooner has a
good blog on an important trend I don't cover in my paper:
effectively using solid state disks as a third level in the RAM/disk
hierarchy.
- ODBMS.org
has various papers and links on NoSQL systems as well as other object
stores.
You will find lots of claims about the performance and scalability of
systems out there, but few apples-to-apples comparisons. In my
opinion, the best scalability benchmark today is the Yahoo! Cloud
Serving Benchmark (YCSB). With the help of Roberto Zicari, I
interviewed two authors of the Yahoo benchmark paper:
YCSB Interview
I'm trying to encourage others to run the Yahoo (YCSB) benchmark as
well. This will hopefully reduce some of the scalability
hype and confusion out there.
I've recently written a paper with Mike Stonebraker, weighing the
important factors in making a datastore scalable:
CACM paper
This paper has some discussion of SQL vs NoSQL scalability.
A revision of the paper has been published in Communications of the
ACM (June 2011).