2x2x2 Requirements for Database Scalability

Rick Cattell, July 2009

This document is a summary of techniques to achieve a highly-scalable database solution, particularly for large-scale applications requiring 1000+ transactions per second, with millions of records. A number of recent systems aspire to high scalability; this taxonomy provides a basis for comparing them.

There are two dimensions to scalability:

Horizontal scaling: supporting higher transaction throughput by splitting the database and associated load over many networked servers.
Vertical scaling: supporting higher transaction throughput by more effectively using a single server with a multi-core CPU and lots of memory.

The highest scalability is achieved by combining these two dimensions: distributing the load over many servers, and using each server very effectively.

There are two caveats I should mention in this document:

This is a brief overview, with little justification of my claims. I have created a web site cattell.net/datastores where I am including additional references, in case my assertions are not self-evident.
The “tricks” I suggest here do not produce a general-purpose database solution. Specifically, they are specialized to applications performing simple operations on a small amount of data, where the working set can fit in the collective RAM or SSD of the servers. Different tricks can be used in other situations, e.g. for data warehousing where the data is 99.99% read-only, allowing data to be replicated and pre-optimized for queries. There is no general-purpose solution to produce scalability for all kinds of database applications; read-write transactions and complex queries that span data on many nodes will be more expensive.

Horizontal Scaling

There are two important techniques for achieving horizontal scalability of databases:

The database can be partitioned (sharded) over servers, using a partitioning key that is available in nearly every transaction (to avoid sending the transaction to every server). Partitioning allows the database transaction load to be split over multiple servers, increasing the throughput of the system.
Each partition can be replicated across at least two physical servers. Replication is useful for failure recovery, and it also can provide scalability for frequently-read data, because the load can be split over the replicas. But unlike partitioning, replication increases the cost of writes, so it should only be used as appropriate. And if you wait for replication to another server before transaction commit, the servers should be on the same LAN; you can do remote replication for disaster recovery, but that must be asynchronous to avoid queuing incomplete transactions.

There are two secondary requirements that become necessary with horizontal scalability, in order to make the solution practical:

The system must be completely self-maintaining, automatically recovering from failures. This requires failure detection, automatic failover (replacing a partition with a hot standby replica), and reasonably fast replacement of the hot standby replica (since the system is vulnerable to a failure until it is replaced).
The database must evolve without taking down nodes. The database must either be schema-less, or it must be possible to update the schema on a node without taking it down or breaking application code. It should also be possible to add or remove physical nodes, automatically rebalancing the load over servers. Ideally, the database software can be upgraded online as well.

Without these “continuous operation” features, it is not practical to scale a system to dozens or hundreds of nodes: manual intervention on failures would be too costly, and the downtime unacceptable. Human intervention should only be required for “background” tasks, like replacing failed hardware, providing new servers that the system can utilize.

Note that multiple partitions can be stored on each physical server. This allows multiple CPU cores to be utilized, and it reduces the cost of replacing a failed partition (because there is less data to copy). The main downside of many [smaller] partitions is if more operations span partitions, making them expensive.

Vertical Scaling

There are two important techniques for vertical scalability:

RAM instead of disk should be used to store the database. Disk should only be used for backups and possibly for logging: disk is the new tape drive! If the database is too big to fit in the collective RAM and SSD of all of the database servers you can afford, then a conventional distributed DBMS may be a better choice. The key to the approach here is zero disk wait. I will explain the advantages of this shortly.
There must be minimal overhead for database durability. Durability can be guaranteed by logging to disk and doing online backups in the background. You might even let the system log to disk after the transaction completes, depending on your comfort level. Alternatively, for many applications you can achieve acceptable durability by completing the replication to another server before returning from the transaction.

When these vertical scaling techniques are used, the database system becomes CPU-limited instead of disk-limited. In order to reap the performance benefits, the system must then be optimized to minimize the work associated with each database transaction:

The overhead of network calls and associated parsing must be minimized, because this is now significant relative to the actual transaction work (typically simple in-memory index lookup and data fetch/store). A simple call built on TCP/IP requires thousands of machine instructions in many systems. The call overhead can be reduced through various means: using a protocol specialized to the database operations, encoding data in the server’s format, pre-compiling the expected operations to avoid parsing, and so on.
Traditional DBMS locking and latching should be avoided. I believe the best way to accomplish this is to serialize the operations on each partition. Since there are no “waits” for disk or locks in my proposed approach, it is acceptable for transactions to be queued for a short time while other transactions complete. Alternatively, locking can be avoided by using MVCC, copying instead of locking data; this has the benefit of maintaining history, but at the cost of more memory usage, and more garbage collection overhead.

Summary

To summarize in a table, my 8 requirements for scalability are as follows:

Primary solutions

Secondary requirements

Horizontal Scaling

Replication

Partitioning

Automatic recovery

Automatic evolution

Vertical Scaling

Database in RAM

Fast durability

Avoid locking and latching

Low-overhead server calls

The “2x2x2” in the title of this paper refers to this 2x2x2 matrix: two solutions and two secondary requirements for the two kinds of scaling.