for Database Scalability
Rick Cattell, July
This document is a summary of techniques to achieve a
highly-scalable database solution, particularly for large-scale applications
requiring 1000+ transactions per second, with millions of records. A number of
recent systems aspire to high scalability; this taxonomy provides a basis for
There are two dimensions to scalability:
- Horizontal scaling: supporting higher
transaction throughput by splitting the database and associated load over
many networked servers.
- Vertical scaling: supporting higher
transaction throughput by more effectively using a single server with a
multi-core CPU and lots of memory.
The highest scalability is achieved by combining these two
dimensions: distributing the load over many servers, and using each server very
There are two caveats I should mention in this document:
- This is a brief overview, with little justification of my claims. I have created a web site, cattell.net/datastores,
where I am including additional references, in case my assertions are not
- The “tricks” I suggest here do not produce a general-purpose database
solution. Specifically, they
are specialized to applications performing simple operations on a small
amount of data, where the working set can fit in the collective RAM or SSD
of the servers. Different
tricks can be used in other situations, e.g. for data warehousing where
the data is 99.99% read-only, allowing data to be replicated and
pre-optimized for queries.
There is no general-purpose solution to produce scalability for all kinds of database applications; read-write
transactions and complex queries that span data on many nodes will be more
There are two important techniques for achieving horizontal
scalability of databases:
- The database can be partitioned
(sharded) over servers, using a partitioning key that is available in
nearly every transaction (to avoid sending the transaction to every
server). Partitioning allows
the database transaction load to be split over multiple servers,
increasing the throughput of the system.
- Each partition can be replicated across
at least two physical servers.
Replication is useful for failure recovery, and it can also provide
scalability for frequently read data, because the read load can be split over
the replicas. But unlike
partitioning, replication increases the cost of writes, because every write must be applied to each replica, so it should be used only where that cost is justified. If you wait for
replication to another server before transaction commit, the servers
should be on the same LAN. You can do remote replication for disaster recovery,
but that must be asynchronous to avoid queuing incomplete transactions.
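The two techniques above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the class `PartitionMap`, its methods, and the round-robin placement are hypothetical names and choices of mine, not taken from any particular system. The partitioning key is hashed to pick one partition, each partition lives on two servers, writes go to every replica, and reads are spread over the replicas.

```python
import hashlib


class PartitionMap:
    """Routes a partitioning key to one partition and its replica servers."""

    def __init__(self, servers, num_partitions, replicas=2):
        if replicas > len(servers):
            raise ValueError("need at least as many servers as replicas")
        self.num_partitions = num_partitions
        # Assumption: partition p is placed on `replicas` consecutive servers.
        self.placement = {
            p: [servers[(p + i) % len(servers)] for i in range(replicas)]
            for p in range(num_partitions)
        }

    def partition_for(self, key):
        # A stable hash, so every client maps a key to the same partition.
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_partitions

    def servers_for(self, key):
        return self.placement[self.partition_for(key)]

    def write_targets(self, key):
        # A write must reach every replica; this is the extra write cost.
        return self.servers_for(key)

    def read_target(self, key, request_id):
        # Reads can be spread over the replicas of the partition.
        replicas = self.servers_for(key)
        return replicas[request_id % len(replicas)]
```

A transaction that carries the partitioning key is sent only to the servers returned by `servers_for`; a transaction without the key would have to be sent to every server.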
There are two secondary requirements that become necessary
with horizontal scalability, in order to make the solution practical:
- The system must be completely self-maintaining, automatically recovering from failures. This requires failure detection, automatic failover
(replacing a partition with a hot standby replica), and reasonably fast
replacement of the hot standby replica (since the system is vulnerable to
a failure until it is replaced).
- The database
must evolve without taking down nodes. The database must either be
schema-less, or it must be possible to update the schema on a node without
taking it down or breaking application code. It should also be possible to add or remove physical
nodes, automatically rebalancing the load over servers. Ideally, the database software can
be upgraded online as well.
Without these “continuous operation” features, it is not
practical to scale a system to dozens or hundreds of nodes: manual intervention
on failures would be too costly, and the downtime unacceptable. Human intervention should only be
required for “background” tasks, like replacing failed hardware or providing new
servers that the system can utilize.
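The failure-handling steps described above (failure detection, failover to the hot standby, and replacement of the standby) might look roughly like the following sketch. It assumes heartbeats with a fixed timeout and a pool of spare servers; the class `Cluster` and all of its methods are hypothetical, and a real system would also have to copy the data and deal with network partitions.

```python
class Cluster:
    """Tracks heartbeats and repairs partition placement after a failure."""

    def __init__(self, placement, spares, timeout=3.0):
        # placement: partition id -> [primary, hot standby]
        self.placement = {p: list(servers) for p, servers in placement.items()}
        self.spares = list(spares)
        self.timeout = timeout
        self.last_heartbeat = {}

    def heartbeat(self, server, now):
        self.last_heartbeat[server] = now

    def failed_servers(self, now):
        # Failure detection: a server silent for too long is presumed dead.
        return {s for s, seen in self.last_heartbeat.items()
                if now - seen > self.timeout}

    def recover(self, now):
        """Fail over and return the (partition, source, target) copies to start."""
        dead = self.failed_servers(now)
        copies = []
        for partition, servers in self.placement.items():
            survivors = [s for s in servers if s not in dead]
            if not survivors or len(survivors) == len(servers):
                continue  # healthy, or no live copy left (needs a backup restore)
            # Failover: the first surviving replica now acts as the primary.
            for spare in self.spares:
                if len(survivors) == len(servers):
                    break
                if spare not in dead and spare not in survivors:
                    # Replace the lost replica so the partition is protected again.
                    copies.append((partition, survivors[0], spare))
                    survivors.append(spare)
            self.placement[partition] = survivors
        return copies
```

Calling `recover` periodically keeps every partition at its full replica count without an operator; the copies it returns are the window during which the partition is still vulnerable to a second failure.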
Note that multiple partitions can be stored on each physical
server. This allows multiple CPU
cores to be utilized, and it reduces the cost of replacing a failed partition
(because there is less data to copy).
The main downside of many [smaller] partitions is that more operations
may span partitions, which makes those operations expensive.
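Having more partitions than servers also makes it simpler to rebalance when a node is added or removed, because whole partitions can be moved. The function below is a minimal sketch of that idea, assuming a plain dictionary from partition id to server name; `rebalance` is a hypothetical helper, and it moves partitions greedily from the busiest server to the idlest one until the loads differ by at most one.

```python
def rebalance(assignment, servers):
    """Return a new partition -> server map with near-equal load per server."""
    load = {s: [] for s in servers}
    orphans = []
    for partition in sorted(assignment):
        server = assignment[partition]
        if server in load:
            load[server].append(partition)
        else:
            orphans.append(partition)  # its server was removed from the cluster
    for partition in orphans:
        min(load.values(), key=len).append(partition)
    while True:
        busiest = max(load.values(), key=len)
        idlest = min(load.values(), key=len)
        if len(busiest) - len(idlest) <= 1:
            break
        # Move one whole partition; only its data has to be copied.
        idlest.append(busiest.pop())
    return {p: s for s, parts in load.items() for p in parts}
```

For example, with eight partitions spread evenly over two servers, adding a third server moves only two partitions; the other six stay where they are.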
There are two important techniques for vertical scalability:
- RAM instead of disk should be used to
store the database. Disk should only be used for backups and possibly for
logging: disk is the new tape drive!
If the database is too big to fit in the collective RAM and SSD of
all of the database servers you can afford, then a conventional
distributed DBMS may be a better choice. The key to the approach here is zero disk wait.
I will explain the advantages of this shortly.
- There must be minimal overhead for database durability. Durability can be guaranteed by logging to
disk and doing online backups in the background. You might even let the
system log to disk after the transaction completes, depending
on your comfort level. Alternatively, for many applications you can
achieve acceptable durability by completing the replication to another
server before returning from the transaction.
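As a rough illustration of keeping the database in RAM while logging to disk for durability, here is a minimal sketch. The class `LoggedStore`, the JSON-lines log format, and the `sync_before_ack` flag are assumptions made for the example. With `sync_before_ack=True` the log record is forced to disk before the write is acknowledged; with `False` the log reaches disk some time after the transaction completes, trading durability for lower overhead.

```python
import json
import os


class LoggedStore:
    """In-memory key-value store with an append-only log for durability."""

    def __init__(self, log_path, sync_before_ack=True):
        self.data = {}
        self.sync_before_ack = sync_before_ack
        if os.path.exists(log_path):
            # Recovery: replay the log to rebuild the in-memory state.
            with open(log_path, "r", encoding="utf-8") as f:
                for line in f:
                    try:
                        key, value = json.loads(line)
                    except json.JSONDecodeError:
                        break  # torn final record left by a crash
                    self.data[key] = value
        self.log = open(log_path, "a", encoding="utf-8")

    def put(self, key, value):
        self.log.write(json.dumps([key, value]) + "\n")
        if self.sync_before_ack:
            # Strict durability: force the record to disk before returning.
            self.log.flush()
            os.fsync(self.log.fileno())
        self.data[key] = value

    def get(self, key):
        # Reads never touch the disk.
        return self.data.get(key)

    def close(self):
        self.log.flush()
        os.fsync(self.log.fileno())
        self.log.close()
```

Reads are served entirely from memory, and the only disk activity is sequential appends to the log, which is the cheapest kind of disk write.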
When these vertical scaling techniques are used, the database
system becomes CPU-limited instead of disk-limited. In order to reap the performance benefits, the system must
then be optimized to minimize the work associated with each database
- The overhead
of network calls and associated parsing must be minimized, because this is now significant relative to
the actual transaction work (typically simple in-memory index lookup and
data fetch/store). A simple
call built on TCP/IP requires thousands of machine instructions in many
systems. The call overhead can be reduced through various means: using a
protocol specialized to the database operations, encoding data in the
server’s format, pre-compiling the expected operations to avoid parsing,
and so on.
- DBMS locking and latching should be avoided. I believe the best way to accomplish this is to serialize the
operations on each partition.
Since there are no “waits” for disk or locks in my proposed
approach, it is acceptable for transactions to be queued for a short time
while other transactions complete.
Alternatively, locking can be avoided by using MVCC, copying
instead of locking data; this has the benefit of maintaining history, but
at the cost of more memory usage, and more garbage collection overhead.
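Serializing the operations on a partition can be sketched as a single worker thread that owns the partition's data and executes queued requests one at a time. Everything named here (`Partition`, `call`, the `deposit` procedure) is hypothetical. The operations are registered in advance as ordinary functions, which stands in for pre-compiling the expected operations, so a request carries only a procedure name and arguments and no query text has to be parsed.

```python
import queue
import threading


class Partition:
    """Owns one partition's data; runs operations one at a time, lock-free."""

    def __init__(self, procedures):
        self.data = {}
        self.procedures = procedures  # name -> function(data, *args)
        self.inbox = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        while True:
            item = self.inbox.get()
            if item is None:
                break
            name, args, reply = item
            try:
                # Only this thread touches self.data, so no locks are needed.
                reply.put((True, self.procedures[name](self.data, *args)))
            except Exception as exc:
                reply.put((False, exc))

    def call(self, name, *args):
        # Callers queue briefly while earlier transactions complete.
        reply = queue.Queue(maxsize=1)
        self.inbox.put((name, args, reply))
        ok, result = reply.get()
        if not ok:
            raise result
        return result

    def stop(self):
        self.inbox.put(None)
        self.worker.join()


def deposit(data, account, amount):
    """Example pre-registered operation: a read-modify-write on one record."""
    data[account] = data.get(account, 0) + amount
    return data[account]
```

Many client threads can call `partition.call("deposit", "acct", 1)` concurrently and no update is lost, because the read-modify-write inside `deposit` never interleaves with another operation on the same partition.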
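To summarize in a table, my 8 requirements for scalability
are as follows:

                        Horizontal scaling              Vertical scaling
Solutions               Partitioning                    Database in RAM
                        Replication                     Low-overhead durability
Secondary requirements  Automatic failure recovery      Low-overhead server calls
                        Online evolution, no downtime   Avoid locking and latching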
The “2x2x2” in the title of this paper refers to this 2x2x2
matrix: two solutions and two secondary requirements for the two kinds of