This article will be of particular interest to people who are curious about how powerful encryption can be implemented in high performance systems. It is also for people who just want to know a bit more about how strong encryption works and how all the pieces fit together.
Making a system secure is very difficult. Data breaches of large and widely used systems — for example, at Adobe, Anthem, eBay, Equifax, Home Depot, JP Morgan Chase, Marriott, Sony, Target, Uber, VeriSign, Yahoo, the US Office of Personnel Management, and even good security companies like RSA Security — have cost billions of dollars and compromised the privacy or identity of over a billion users.
Everyone says they want more secure systems, but people don’t like using systems that are slow or hard to use, and making a system more secure can add delays and reduce convenience. It is clearly bad when people avoid using a system that is secure, and instead choose a system that is less safe but faster or easier to use.
Because databases store the data that we most need to protect, they are a major piece of the security puzzle. How can we add strong encryption to a database without unduly compromising speed or usability? Looking at how various companies have done this can help.
Badger, a Fast DB
This article examines how the team at Dgraph Labs added data encryption to Badger, an extremely fast and reliable open-source database. Badger has over 7.3K stars on GitHub and is used in over 850 projects, including Uber’s Jaeger and IPFS. Most notable is UsenetExpress, which stores petabytes of data using Badger.
Badger is a key-value database (KVDB). A KVDB stores data in associative arrays (also variously called objects, dictionaries, maps, or hash tables).
KVDBs have become popular because they are among the simplest and fastest kinds of databases, and they are easy to learn because they use a data structure that is built into most programming languages.
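The operations a KVDB exposes are essentially those of an ordinary in-memory map. A minimal sketch in Go (not Badger's API, just the data structure the article refers to):

```go
package main

import "fmt"

func main() {
	// A key-value store exposes the same basic operations as a map:
	// put, get, and delete by key.
	store := map[string][]byte{}

	// Put
	store["user:42"] = []byte(`{"name":"Ada"}`)

	// Get
	if v, ok := store["user:42"]; ok {
		fmt.Printf("found: %s\n", v)
	}

	// Delete
	delete(store, "user:42")
	fmt.Println("remaining entries:", len(store))
}
```

A KVDB like Badger provides the same model, but persists the pairs to disk and adds transactions, durability, and (now) encryption.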
KVDBs are well suited for use in cloud computing, and Badger is particularly fast for cloud use because it is heavily optimized for typical cloud hardware, especially SSD storage. Badger provides great performance on such hardware, compared to other popular KVDBs.
The challenge for the team was to figure out how to implement encryption without compromise. We knew that this should be possible, because companies were able to (eventually) figure out how to add encryption to HTTP (creating HTTPS) without making it noticeably slower for the end user.
As we shall see, several of Badger’s existing implementation choices worked in our favor when encryption was added.
Implementation of Badger
Even without encryption, the implementation of Badger makes it safer and less prone to errors that could compromise security. It is written in the Go language, which was developed by a small team at Google that included Ken Thompson (UNIX, regular expressions, Plan 9, UTF-8). The goal of Go was to improve on the C language by adding memory safety, automatic garbage collection, safe concurrency, and support for multiprocessing across networked computers. Badger has quickly become the most active KVDB written in Go.
Badger is designed to be used inside of other projects. It is implemented as an embeddable library, rather than as a separate service. This increases both security and speed. And as mentioned before, Badger is open source, so it can be reviewed to find unsafe code or even modified to meet project needs.
Badger is also very reliable. It guarantees serializable snapshot isolation (SSI) by running concurrent ACID transactions using multiversion concurrency control (MVCC). Badger is resilient in the face of process, filesystem, and even hardware crashes.
Separating keys from values
An advanced feature of Badger’s implementation is splitting up the key-value pairs and storing the values separately from the keys. Keys are typically short strings, so storing them together in one place makes searching for a key significantly faster. Keys typically don’t change as often as values, so more extensive optimization of the key files (including the hash tables used to look up a key) can be done to improve performance even more.
On the other hand, values are typically larger and change more often than keys. For example, a value can be an entire document that is updated as the document is edited. Multiple values are stored in a value log file (vlog). Associated with each key is a pointer to its value, consisting of the vlog file id, the offset of the value in the file, and the size of the value. Note that as an optimization, values smaller than a user specified size (default 32 bytes) are stored with their associated keys rather than in a vlog file.
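The value pointer described above can be pictured as a small fixed-size record stored next to each key. The struct and field names below are illustrative, not Badger's actual internal names:

```go
package main

import "fmt"

// valuePointer sketches the record stored with each key; the field
// names here are hypothetical, chosen to match the description above.
type valuePointer struct {
	FileID uint32 // which vlog file holds the value
	Offset uint32 // byte offset of the value within that file
	Len    uint32 // size of the value in bytes
}

func main() {
	vp := valuePointer{FileID: 3, Offset: 8192, Len: 512}
	fmt.Printf("value lives in vlog %d at [%d, %d)\n",
		vp.FileID, vp.Offset, vp.Offset+vp.Len)
}
```

Reading a value is then a single seek into the named vlog file, which is what makes keeping keys and pointers together in memory so effective.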
Assuming 16 bytes per key and 16 bytes per value pointer, a single 64MB file can store two million key-value pairs. This is small enough to fit inside the RAM of virtually any computer, which means that often a key search can be done in high-speed memory, which can be orders of magnitude faster than searching secondary storage.
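The two-million figure follows directly from the assumed sizes, as this small calculation shows (16-byte keys and pointers are the article's assumption, not a fixed property of Badger):

```go
package main

import "fmt"

func main() {
	const (
		fileSize    = 64 << 20 // 64 MB key file
		keySize     = 16       // bytes per key (assumed average)
		pointerSize = 16       // bytes per value pointer (assumed)
	)
	pairs := fileSize / (keySize + pointerSize)
	fmt.Println(pairs) // 2097152, i.e. roughly two million key-value pairs
}
```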
In addition to being used by itself as a database, Badger can be used to implement more sophisticated databases. In particular, Dgraph Labs created Badger specifically for use in its flagship product, Dgraph, the first native GraphQL database. Dgraph is open source, and is highly scalable to a large number of servers either on fixed computers or in the cloud.
Layering Dgraph on top of Badger provides significant benefits, in particular separation of concerns. Dgraph uses Badger to store the user’s data, so Dgraph is not concerned with data storage at all (including encryption of stored data). Because Badger is a smaller, independent system, new features like encryption can be built, tested, and verified more easily and with more confidence. Badger transparently provides encryption to Dgraph, or to any other system that uses Badger to store data.
This is similar to the Internet Protocol Suite, which provides high level protocols like TCP on top of lower level protocols like IP. This allows other high level protocols (e.g. UDP, DCCP, SCTP, RSVP) to be built on top of IP. When improvements are made to IP (such as the move from IPv4 to IPv6), higher level protocols can all take advantage of them easily.
Advanced Encryption Standard
One of the easiest decisions to make when adding encryption to Badger was picking the highly regarded AES encryption algorithm, standardized by the US NIST and used by MongoDB, SQLite, and many other databases and systems. AES has become an industry standard because it is among the most thoroughly analyzed and widely deployed algorithms for encrypting data. Wide use is critical for encryption standards because it helps potential security flaws get found and fixed.
AES is symmetric; the same encryption key is used for both encrypting and decrypting data. Badger supports key rotation to further secure access to data. This allows Badger to be used in systems that need to meet the various data protection regulations and requirements, including GDPR, HIPAA, and PCI DSS.
To encrypt and decrypt data requires an encryption key. AES keys come in three sizes: 128, 192, and 256 bits. Which size you use is actually not very important, as a brute force crack of a 128-bit key would take the fastest computer in the world over a hundred-quadrillion (10 to the 17th power) years. Even if you could get 10 million supercomputers to work together to do the crack, it would still take longer than the current age of the universe.
Much more important is maintaining the security of the key to avoid key leak. Obviously, encryption keys must be stored securely by the user and should not be easy to guess. However, there are other sources of key leak. For example, “side channel” attacks have demonstrated that it is possible to crack even the most secure (256-bit) AES keys in a short period of time by measuring electromagnetic radiation coming from a computer, using a device that costs around $200 and can fit into a jacket pocket. Computers doing encryption must be physically secure to prevent this.
Even without physical access, if the same key is used too often there are known ways that attackers can determine the value of the key by analyzing large amounts of data encrypted using that key. In order to avoid this, the key must be changed regularly, which is referred to as key rotation. However, when the key is changed, existing encrypted data must be decrypted using the old key and then re-encrypted using the new key. These computations can severely reduce performance for encrypted data.
One Key to Rule Them All
Instead of using the AES encryption key directly to encrypt data, Badger uses two types of keys:
- A user-provided AES encryption key, called the master key, is used to encrypt data keys.
- Data keys are generated automatically by Badger, and used to encrypt and decrypt data on disk. Each encrypted data key is stored on disk along with the encrypted data.
By default, Badger rotates the data keys every ten days, but this can be changed by the user.
The length of the master key provided by the user must be 16, 24 or 32 bytes. This determines what version of AES encryption is used (AES-128, AES-192, or AES-256 respectively).
Note that you should never use a predictable string as your master key. If you have a password manager (such as 1Password or LastPass), you can use its built-in password generator to create a strong master key. Even if you don’t have a password manager, you can use a reputable password generator, such as the free online generator that 1Password provides.
You should rotate your master key on a regular schedule. Fortunately, because the master key is used to encrypt only the data keys, which even taken together are much smaller than the data stored in the database, the master key does not need to be rotated as often as the data keys to prevent key leak. Even better, when the master key is rotated, only the data keys need to be decrypted using the old master key and then re-encrypted using the new master key. This is much faster than re-encrypting all of the data on disk.
Avoiding Encrypted Duplicates
When data keys are used to encrypt data stored in the database, the same data will often be encrypted multiple times before the rotation period for the data key expires. Encrypting the same data with the same data key always generates the same encrypted text to be stored in the database. This increases the ability of an attacker to reverse engineer the original plaintext data.
To reduce the predictability of the original data, Badger incorporates a standard encryption technique that doesn’t use the data key directly to encrypt the data. Instead, Badger generates a random 16-byte array called an Initialization Vector (IV). The data key is used to encrypt the IV and then the encrypted IV is XORed with the user data, and the result is stored on disk. This means that even if the same block is encrypted multiple times, the random value of the IV ensures that the stored text will be different each time.
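The scheme described above can be sketched for a single 16-byte block using Go's standard crypto/aes package. (This is a simplified illustration of the technique, which is effectively one block of AES-CTR mode; it is not Badger's actual code, and a real implementation would handle data longer than one block.)

```go
package main

import (
	"crypto/aes"
	"fmt"
)

// encryptBlock applies the technique described above to one 16-byte
// block: the data key encrypts the IV, and the result is XORed with
// the plaintext.
func encryptBlock(key, iv, plaintext []byte) []byte {
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	keystream := make([]byte, aes.BlockSize)
	block.Encrypt(keystream, iv) // E(key, IV)
	out := make([]byte, len(plaintext))
	for i := range plaintext {
		out[i] = plaintext[i] ^ keystream[i] // XOR with the data
	}
	return out
}

func main() {
	key := make([]byte, 16) // demo key; use a random key in practice
	data := []byte("same 16-byte msg")
	iv1 := []byte("0123456789abcdef") // demo IVs; use random IVs in practice
	iv2 := []byte("fedcba9876543210")

	// The same data and key, but different IVs, yield different ciphertexts.
	c1 := encryptBlock(key, iv1, data)
	c2 := encryptBlock(key, iv2, data)
	fmt.Println("ciphertexts differ:", string(c1) != string(c2))

	// Decryption is the same operation: XOR with E(key, IV) again.
	fmt.Println("round trip:", string(encryptBlock(key, iv1, c1)))
}
```

Note that decryption also only *encrypts* the IV and XORs, which is why AES decryption never needs to run on the data itself.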
In addition to reducing data predictability, the use of IVs increases performance because an XOR operation is vastly faster than running the AES encryption algorithm directly on the data to be encrypted. The cost of this increased security and performance is the extra storage used for each 16-byte IV, which, if handled naively, can lead to data bloat.
Reducing the Storage Overhead of IVs
The files containing the keys are each divided into blocks, where each block is 4KB in size. Badger uses a unique IV to encrypt each of these blocks and stores the IV in plaintext at the end of the encrypted block. The storage overhead of a 16B IV over a 4KB block is 0.4%, which is reasonable.
Note that storing the IVs in plaintext is still secure. Assume that a cracker gets access to the IV and an encrypted block. To decrypt the block, they’d need to encrypt the IV with the data key (to XOR the encrypted data back to plaintext). But to get access to the data key, they’d need to decrypt it using the master key. If the master key is safe and secure, that won’t be possible, rendering the effort futile. Thus, knowing the IV is not sufficient to decrypt the data.
Avoiding the bloat of attaching a 16-byte IV to every value
Next, we need to encrypt the values stored in the vlog files. Each value is read individually from a vlog file, using the value pointer associated with each key, which stores the vlog file id, the offset in the file, and the size of the value. If we encrypt a whole block at a time, having multiple values in a block would cause a significant performance slowdown because considerable extra data would need to be decrypted to read an individual value.
As values are often accessed individually, we decided to encrypt each value individually. Because each encrypted value is required to have its own random IV, using one IV per value adds a 16B overhead for each value. For a vlog file that contains 10,000 entries, storing an IV with each value would require 160,000 bytes, which is significant overhead.
It isn’t quite as bad as that, because small values are stored directly in the key files in place of pointers, and each such value gets decrypted along with its key. But even with this optimization, unless all values are very small it is easy to imagine that having one IV per value could double the size of a vlog file. Luckily, we came up with a better solution.
To optimize the encryption of the vlog entries, Badger uses a unique technique. Instead of generating a 16-byte IV and storing it at the end of each value in the vlog, Badger generates a single 12-byte IV that is used for all values in a vlog file. Along with it, Badger attaches 4 bytes that contain the offset of the value in the vlog file, which together make up the required 16-byte IV. For decryption, the 4-byte vlog offsets are available from the (decrypted) value pointer stored with each key. The 12-byte IVs are unique for each vlog, and the 4-byte offsets are unique within a single vlog file, so each composite 16-byte IV is also unique.
This technique saves 16 bytes of space on disk for every value in a vlog file. Instead, this technique only requires 12 bytes per vlog file, regardless of the number of values in a file. That’s an even better savings than for the key files.
We added encryption to Badger using established and proven standards and best practices, implemented in a way that works well with the existing features and benefits of Badger. Most importantly, the addition of encryption did not appreciably slow Badger down, make it difficult or inconvenient to use, or significantly increase the storage used.
Badger is open source, and has been used as the storage engine for many large systems, and now it includes encryption for free. We hope this will make Badger attractive to even more projects, especially ones that need to store their data securely.