Adding Encryption to a Fast Database

Dgraph Labs

This article will be of particular interest to people who are curious about how powerful encryption can be implemented in high performance systems. It is also for people who just want to know a bit more about how strong encryption works and how all the pieces fit together.


Everyone says they want more secure systems, but people don’t like using systems that are slow or hard to use, and making a system more secure can add delays and reduce convenience. It is clearly bad when people avoid using a system that is secure, and instead choose a system that is less safe but faster or easier to use.

Because databases store the data that we most need to protect, they are a major piece of the security puzzle. How can we add strong encryption to a database without unduly compromising speed or usability? Looking at how various companies have done this can help.

Badger, a Fast DB

Badger is a key-value database (KVDB). A KVDB stores data in associative arrays (also variously called objects, dictionaries, maps, or hash tables).

associative array

KVDBs have become popular because they are one of the simplest and fastest kind of databases, and they are easy to learn because they use a data structure that is common in most programming languages.

KVDBs are well suited for use in cloud computing, and Badger is particularly fast for cloud use because it is heavily optimized for typical cloud hardware, especially SSD storage. Badger provides great performance on such hardware, compared to other popular KVDBs.

The challenge for the team was to figure out how to implement encryption without compromise. We knew that this should be possible, because companies were able to (eventually) figure out how to add encryption to http (creating https) without making it noticeably slower for the end user.

As we shall see, Badger’s advanced implementation positively affected how encryption was added to it.

Implementation of Badger

The Go Gopher

Even without encryption, the implementation of Badger makes it safer and less prone to errors that could compromise security. It is written in the Go language, which was developed by a small team at Google that included Ken Thompson (UNIX, regular expressions, Plan 9, UTF-8). The goal of Go was to improve the C language, by adding memory safety, automatic garbage collection, safe concurrency, and multiprocessing across networked computers. Badger has quickly become the most active KVDB written in Go.

Badger is designed to be used inside of other projects. It is implemented as an embeddable library, rather than as a separate service. This increases both security and speed. And as mentioned before, Badger is open source, so it can be reviewed to find unsafe code or even modified to meet project needs.

Badger is also very reliable. It guarantees serializable snapshot isolation (SSI) by running concurrent ACID transactions using multiversion concurrency control (MVCC). Badger is resilient in the face of process, filesystem, and even hardware crashes.

Separating keys from values

On the other hand, values are typically larger and change more often than keys. For example, a value can be an entire document that is updated as the document is edited. Multiple values are stored in a value log file (vlog). Associated with each key is a pointer to its value, consisting of the vlog file id, the offset of the value in the file, and the size of the value. Note that as an optimization, values smaller than a user specified size (default 32 bytes) are stored with their associated keys rather than in a vlog file.

key-value file and vlog file

Assuming 16 bytes per key and 16 bytes per value pointer, a single 64MB file can store two million key-value pairs. This is small enough to fit inside the RAM of virtually any computer, which means that often a key search can be done in high-speed memory, which can be orders of magnitude faster than searching secondary storage.

Layering Badger


Layering Dgraph on top of Badger provides significant benefits, in particular separation of concerns. Dgraph uses Badger to store the user’s data, so Dgraph is not concerned with data storage at all (including encryption of stored data). Because Badger is a smaller, independent system, new features like encryption can be built, tested, and verified more easily and with more confidence. Badger transparently provides encryption to Dgraph, or to any other system that uses Badger to store data.

This is similar to the Internet Protocol Suite, which provides high level protocols like TCP on top of lower level protocols like IP. This allows other high level protocols (e.g. UDP, DCCP, SCCP, RSVP) to be built on top of IP. When improvements are made to IP (such as IPv4 to IPv6), higher level protocols can all take advantage of them easily.

Advanced Encryption Standard

AES is symmetric; the same encryption key is used for both encrypting and decrypting data. Badger supports key rotation to further secure access to data. This allows Badger to be used in systems that need to meet the various data protection regulations and requirements, including GDPR, HIPAA, and PCI DSS.

Key Rotation

Physical security is important

Much more important is maintaining the security of the key to avoid key leak. Obviously, encryption keys must be stored securely by the user and should not be easy to guess. However, there are other sources of key leak. For example, “side channel” attacks have demonstrated that it is possible to crack even the most secure (256-bit) AES keys in a short period of time by measuring electromagnetic radiation coming from a computer, using a device that costs around $200 and can fit into a jacket pocket. Computers doing encryption must be physically secure to prevent this.

Even without physical access, if the same key is used too often there are known ways that attackers can determine the value of the key by analyzing large amounts of data encrypted using that key. In order to avoid this, the key must be changed regularly, which is referred to as key rotation. However, when the key is changed, existing encrypted data must be decrypted using the old key and then re-encrypted using the new key. These computations can severely reduce performance for encrypted data.

One Key to Rule Them All

  • A user-provided AES encryption key, called the master key, is used to encrypt data keys.
  • Data keys are generated automatically by Badger, and used to encrypt and decrypt data on disk. Each encrypted data key is stored on disk along with the encrypted data.
Master key and Data keys

By default, Badger rotates the data keys every ten days, but this can be changed by the user.

The length of the master key provided by the user must be 16, 24 or 32 bytes. This determines what version of AES encryption is used (AES-128, AES-192, or AES-256 respectively).

Note that you should never use a predictable string as your master key. If you have a password manager (such as 1Password, LastPass, etc.), you can use its built-in password generator to generate a strong encryption key. Even if you don’t have a password manager, you can use a reputable online password generator, such as 1Password to generate your master key.

You should rotate your master key on a regular schedule. Fortunately, because the master key is used to encrypt only data keys, which are (even all together) much smaller compared to the data stored in the database, the master key does not need to be rotated as often (as data keys) to prevent key leak. Even better, when the master key is rotated, only the data keys need to be decrypted using the old master key, and then re-encrypted using the new master key. This is much faster than re-encrypting all of the data on disk.

Avoiding Encrypted Duplicates

To reduce the predictability of the original data, Badger incorporates a standard encryption technique that doesn’t use the data key directly to encrypt the data. Instead, Badger generates a random 16-byte array called an Initialization Vector (IV). The data key is used to encrypt the IV and then the encrypted IV is XORed with the user data, and the result is stored on disk. This means that even if the same block is encrypted multiple times, the random value of the IV ensures that the stored text will be different each time.

Initialization Vectors

In addition to reducing data predictability, the use of IVs increases performance because an XOR operation is vastly faster than running the AES encryption algorithm directly on the data to be encrypted. The overhead of this increased security and performance is the extra storage used for each 16-byte IV. Because of this, a potential issue with using IVs is that it can lead to data bloat.

Reducing the Storage Overhead of IVs

Note that storing the IVs in plaintext is still secure. Assume that a cracker gets access to the IV and an encrypted block. To decrypt the block, they’d need to encrypt the IV with the data key (to XOR the encrypted data back to plaintext). But to get access to the data key, they’d need to decrypt it using the master key. If master key is safe and secure, that won’t be possible, rendering the effort futile. Thus, knowing the IV is not sufficient to decrypt the data.

Avoiding the bloat of attaching a 16-byte IV to every value

As values are often accessed individually, we decided to encrypt each value individually. Because each encrypted value is required to have its own random IV, using one IV per value adds a 16B overhead for each value. For a vlog file that contains 10,000 entries, storing an IV with each value would require 160,000 bytes, which is not good.

It isn’t quite as bad as that, because small values are stored directly in the key files, instead of pointers, and that value gets decrypted along with its key. But even with this optimization, unless all values are very small it is easy to imagine that having one IV per value could double the size of a vlog file. Luckily, we came up with a better solution.

To optimize the encryption of the vlog entries, Badger uses a unique technique. Instead of generating a 16-byte IV and storing it at the end of each value in the vlog, Badger generates a single 12-byte IV that is used for all values in a vlog file. Along with it, Badger attaches 4 bytes that contain the offset of the value in the vlog file, which together make up the required 16-byte IV. For decryption, the 4-byte vlog offsets are available from the (decrypted) value pointer stored with each key. The 12-byte IVs are unique for each vlog, and the 4-byte offsets are unique within a single vlog file, so each composite 16-byte IV is also unique.

Initialization Vectors for values in vlog files

This technique saves 16 bytes of space on disk for every value in a vlog file. Instead, this technique only requires 12 bytes per vlog file, regardless of the number of values in a file. That’s an even better savings than for the key files.


Badger is open source, and has been used as the storage engine for many large systems, and now it includes encryption for free. We hope Badger will make it easier for even more projects, especially ones that need to securely store encrypted data.

For more information about Badger, see the Dgraph blog entries about it, including this one about encryption in Badger.

Musician, computer scientist, artist.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store