2x2x2 Requirements for Database Scalability

Rick Cattell, July 2009
 
This document is a summary of techniques to achieve a highly scalable database solution, particularly for large-scale applications requiring 1000+ transactions per second, with millions of records. A number of recent systems aspire to high scalability; this taxonomy provides a basis for comparing them.
 
There are two dimensions to scalability:
 
 - Horizontal scaling: supporting higher transaction throughput by splitting the database and associated load over many networked servers.
 - Vertical scaling: supporting higher transaction throughput by more effectively using a single server with a multi-core CPU and lots of memory.
 
The highest scalability is achieved by combining these two
dimensions: distributing the load over many servers, and using each server very
effectively.
 
There are two caveats I should mention in this document:
 
 - This is a brief overview, with little justification of my claims. I have created a web site, cattell.net/datastores, where I am including additional references, in case my assertions are not self-evident.
 - The "tricks" I suggest here do not produce a general-purpose database solution. Specifically, they are specialized to applications performing simple operations on a small amount of data, where the working set can fit in the collective RAM or SSD of the servers. Different tricks can be used in other situations, e.g. for data warehousing, where the data is 99.99% read-only, allowing data to be replicated and pre-optimized for queries. There is no general-purpose solution that produces scalability for all kinds of database applications; read-write transactions and complex queries that span data on many nodes will always be more expensive.
 
 
Horizontal Scaling
 
There are two important techniques for achieving horizontal
scalability of databases:
 
 - The database can be partitioned (sharded) over servers, using a partitioning key that is available in nearly every transaction (to avoid sending the transaction to every server). Partitioning allows the database transaction load to be split over multiple servers, increasing the throughput of the system.
 - Each partition can be replicated across at least two physical servers. Replication is useful for failure recovery, and it can also provide scalability for frequently-read data, because the read load can be split over the replicas. Unlike partitioning, however, replication increases the cost of writes, so it should be used only where appropriate. And if you wait for replication to another server before transaction commit, the servers should be on the same LAN; you can do remote replication for disaster recovery, but that must be asynchronous to avoid queuing incomplete transactions. A sketch of partitioning and replica routing follows this list.
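 
To make the routing concrete, here is a minimal sketch of both techniques. It assumes a hypothetical cluster of four partitions, each replicated on two servers; the node names, the hash-based key-to-partition mapping, and the routing functions are illustrative, not taken from any particular system.

    import hashlib
    import random

    # Hypothetical cluster layout: four partitions, each replicated on two servers.
    REPLICAS = {
        0: ["node-a", "node-b"],
        1: ["node-b", "node-c"],
        2: ["node-c", "node-d"],
        3: ["node-d", "node-a"],
    }
    NUM_PARTITIONS = len(REPLICAS)

    def partition_for(key: str) -> int:
        """Map a partitioning key (e.g. a customer id) to a partition."""
        digest = hashlib.md5(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

    def route_write(key: str) -> list:
        """Writes must go to every replica of the key's partition."""
        return REPLICAS[partition_for(key)]

    def route_read(key: str) -> str:
        """Reads can go to any one replica, splitting the read load."""
        return random.choice(REPLICAS[partition_for(key)])

    # All operations keyed on "customer:42" touch only one partition's two
    # replicas, so the transaction never has to be broadcast to every server.
    print(partition_for("customer:42"), route_write("customer:42"), route_read("customer:42"))

A real system would also have to move partitions when the cluster grows, but the point here is that a single key determines the small set of servers a transaction must touch.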
 
There are two secondary requirements that become necessary
with horizontal scalability, in order to make the solution practical:
 
 - The system must be completely self-maintaining, automatically recovering from failures. This requires failure detection, automatic failover (replacing a failed partition with its hot standby replica), and reasonably fast replacement of the hot standby replica (since the system is vulnerable to a further failure until it is replaced). A minimal sketch of detection and failover follows this list.
 - The database must evolve without taking down nodes. The database must either be schema-less, or it must be possible to update the schema on a node without taking it down or breaking application code. It should also be possible to add or remove physical nodes, automatically rebalancing the load over the servers. Ideally, the database software can be upgraded online as well.
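 
The following is a minimal sketch, under simplifying assumptions, of what "self-maintaining" means for a single partition: heartbeat-based failure detection, promotion of the hot standby, and drafting a spare server as the new standby. The timeout value, node names, and data structures are hypothetical.

    import time

    HEARTBEAT_TIMEOUT = 5.0    # seconds of silence before declaring a node dead (assumed)

    # Hypothetical view of one partition: a primary serving writes and a hot standby.
    partition = {
        "primary": "node-a",
        "standby": "node-b",
        "last_heartbeat": {"node-a": time.time(), "node-b": time.time()},
    }

    def record_heartbeat(node: str) -> None:
        """Called whenever a heartbeat message arrives from a node."""
        partition["last_heartbeat"][node] = time.time()

    def check_and_failover(spare_pool: list) -> None:
        """Detect a dead primary, promote the standby automatically, and draft a
        spare server as the new standby so the partition is replicated again."""
        silence = time.time() - partition["last_heartbeat"][partition["primary"]]
        if silence > HEARTBEAT_TIMEOUT:
            failed = partition["primary"]
            partition["primary"] = partition["standby"]        # automatic failover
            partition["standby"] = spare_pool.pop()            # fast standby replacement
            partition["last_heartbeat"][partition["standby"]] = time.time()
            print(f"{failed} failed; promoted {partition['primary']}, "
                  f"copying data to new standby {partition['standby']}")

In a real system the same logic runs for every partition, and the copy of data to the new standby is the step that must complete "reasonably fast."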
 
Without these "continuous operation" features, it is not practical to scale a system to dozens or hundreds of nodes: manual intervention on failures would be too costly, and the downtime unacceptable. Human intervention should only be required for "background" tasks, such as replacing failed hardware and providing new servers that the system can utilize.
 
Note that multiple partitions can be stored on each physical server. This allows multiple CPU cores to be utilized, and it reduces the cost of replacing a failed partition (because there is less data to copy). The main downside of many smaller partitions is that more operations may span partitions, making those operations expensive.
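 
A back-of-the-envelope illustration of the recovery benefit, using made-up numbers (1 TB of data and an assumed copy throughput of 6 GB per minute):

    # Illustrative numbers only: smaller partitions mean less data to copy when
    # rebuilding the standby replica of a failed partition.
    data_gb = 1000                 # total database size (assumed)
    copy_rate_gb_per_min = 6       # assumed copy throughput between servers

    for partitions in (10, 100):
        partition_gb = data_gb / partitions
        minutes = partition_gb / copy_rate_gb_per_min
        print(f"{partitions} partitions: {partition_gb:.0f} GB each, "
              f"~{minutes:.0f} minutes to rebuild one standby")

The countervailing cost, not shown here, is that with more partitions a multi-record operation is more likely to cross partition boundaries.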
 
 
Vertical Scaling
 
There are two important techniques for vertical scalability:
 
 - RAM instead of disk should be used to store the database. Disk should be used only for backups and possibly for logging: disk is the new tape drive! If the database is too big to fit in the collective RAM and SSD of all of the database servers you can afford, then a conventional distributed DBMS may be a better choice. The key to the approach here is zero disk waits; I will explain the advantages of this shortly.
 - There must be minimal overhead for database durability. Durability can be guaranteed by logging to disk and doing online backups in the background. You might even let the system log to disk after the transaction completes, depending on your comfort level. Alternatively, for many applications you can achieve acceptable durability by completing the replication to another server before returning from the transaction. A sketch of such a write path follows this list.
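 
Here is a minimal sketch of that write path, assuming a hypothetical key-value store kept entirely in memory: the commit never waits on disk, the log is appended by a background thread after the transaction completes, and an optional replica_ack callable stands in for synchronous replication to a server on the same LAN. All names are illustrative.

    import queue
    import threading

    data = {}                          # the database itself, held entirely in RAM
    log_queue = queue.Queue()          # committed writes awaiting the background logger

    def background_logger(log_path: str) -> None:
        """Append committed writes to the disk log; no client ever waits on this."""
        with open(log_path, "a") as log:
            while True:
                key, value = log_queue.get()
                log.write(f"{key}\t{value}\n")
                log.flush()

    threading.Thread(target=background_logger, args=("store.log",), daemon=True).start()

    def put(key, value, replica_ack=None):
        """Commit a write with zero disk wait. If replica_ack is given (a callable
        that synchronously replicates to another server on the same LAN), durability
        comes from the replica; the disk log is still written, but only afterwards."""
        data[key] = value
        log_queue.put((key, value))    # logged after the transaction completes
        if replica_ack is not None:
            replica_ack(key, value)    # alternative: wait for replication instead
        return "committed"

Whether the background log alone is enough, or the replica acknowledgement is required before returning, is exactly the "comfort level" trade-off described above.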
 
When these vertical scaling techniques are used, the database
system becomes CPU-limited instead of disk-limited.  In order to reap the performance benefits, the system must
then be optimized to minimize the work associated with each database
transaction: 
 
 - The overhead of network calls and associated parsing must be minimized, because it is now significant relative to the actual transaction work (typically a simple in-memory index lookup and data fetch/store). A simple call built on TCP/IP requires thousands of machine instructions in many systems. The call overhead can be reduced through various means: using a protocol specialized to the database operations, encoding data in the server's format, pre-compiling the expected operations to avoid parsing, and so on.
 - Traditional DBMS locking and latching should be avoided. I believe the best way to accomplish this is to serialize the operations on each partition. Since there are no "waits" for disk or locks in my proposed approach, it is acceptable for transactions to be queued for a short time while other transactions complete. Alternatively, locking can be avoided by using MVCC, copying data instead of locking it; this has the benefit of maintaining history, but at the cost of more memory usage and more garbage-collection overhead. A sketch of per-partition serialization follows this list.
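 
The following is a minimal sketch of the serialization approach, not a production design: one worker thread owns each partition's in-memory data, so operations take no locks or latches, and callers simply wait briefly in a queue. The class and operation names are hypothetical.

    import queue
    import threading

    class Partition:
        """One worker thread owns this partition's data, so no locks or latches
        are ever taken; operations are simply executed one at a time."""

        def __init__(self):
            self.data = {}                      # this partition's in-memory records
            self.requests = queue.Queue()       # serialized transaction queue
            threading.Thread(target=self._run, daemon=True).start()

        def _run(self):
            while True:
                op, key, value, done = self.requests.get()
                if op == "put":
                    self.data[key] = value
                    done.put(True)
                elif op == "get":
                    done.put(self.data.get(key))

        def execute(self, op, key, value=None):
            """Queue an operation and wait briefly for its result."""
            done = queue.Queue(maxsize=1)
            self.requests.put((op, key, value, done))
            return done.get()

    p = Partition()
    p.execute("put", "k1", "v1")
    print(p.execute("get", "k1"))               # -> v1

With no disk or lock waits inside the worker loop, each operation normally completes in microseconds, so the queueing delay stays short; throughput scales by running one such partition per core.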
 
Summary
 
To summarize in a table, my 8 requirements for scalability
are as follows: 
 
 
  |                    | Primary solutions                 | Secondary requirements                                |
  | Horizontal Scaling | Replication; Partitioning         | Automatic recovery; Automatic evolution               |
  | Vertical Scaling   | Database in RAM; Fast durability  | Avoid locking and latching; Low-overhead server calls |
 
The "2x2x2" in the title of this paper refers to this 2x2x2 matrix: two solutions and two secondary requirements for the two kinds of scaling.