Adding structure to NoSQL with dependent types

NoSQL databases (such as MongoDB) give developers much more freedom in how to store their data. But this freedom means that there is much more risk that the data gets into a state which can’t be read by the interfacing application. SomeSQL is a method I’m proposing for using NoSQL databases in a way that allows frequent changes to a live database schema while still being confident that nothing has been broken by the change.

We help our customers harness the power of data to deliver insight and value.

The problem

It’s likely database schemas will need to change over time, especially with iterative software development. Eventually fields will need to be added, removed, renamed etc.

There are two problems with migrating to a new schema: Updating the existing data in the database and updating the application to handle the new structure. Unlike relational databases, NoSQL databases will let all your documents (roughly equivalent to rows in relational databases) have a completely different structure. This means that updating the schema can be done incrementally instead of all at once.

Incremental schema migration fixes the problem with traditional migration in that the database can easily be used while the migration is going on. There is no need for mirrored databases etc. However incremental migration comes with its own problems. As the database itself has no concept of a schema, the schema and migration logic must be implemented in the application.  This means that the application can write invalid data to the database and the database won’t stop it, resulting in entries that the application can’t then read back.

Guaranteeing the application can understand anything read from the database and always write valid entries is extremely fiddly. Especially as incremental migration means that there may be multiple schemas in use at once. 

Modern software development processes focus on continuous small iterative changes. This means as requirements are better understood the database schema may need to change, possibly frequently. However, if every schema change has a large risk of introducing subtle bugs then lots more time will have to be spent in testing, slowing development down. This isn’t as much of a problem with relational databases as the database itself will reject invalidly formed data.

So, in an ideal world, if you want to do routine incremental schema changes with NoSQL in confidence, you will need the application to do the following:

  • After a schema change, tell you if and where the code is now broken
  • Guarantee the application will never write invalidly structured data to the database
  • Specify how the data is structured with at least as much power as relational databases. For example, that certain fields are present
  • Specify functions to convert data from one schema to the other and guarantee that the converted data matches the new schema.

The solution

The solution I’ve developed is called SomeSQL. SomeSQL is the combination of: a NoSQL database, lazy migration and dependent types. Note there are alternative solutions such as the support for schema validation in MongoDB.

Lazy migration

There are two approaches to migrating data in NoSQL databases to a new structure, eager migration and lazy migration.

Eager migration is the form of migration that relational databases use. All the data is read into the application space, updated to the new structure then written back. This can take a long time and requires a lot of work to be able to do without shutting down access to the database. The advantage of eager migration is that the application only has to support one version of the schema at once.

With lazy migration, data is migrated to the new schema only when the main application needs to. This means that dead data that isn’t needed anymore will never be updated.

For SomeSQL, every schema includes a version field which is incremented for every schema change. For every schema change the developer creates a migration function. The migration function takes data in the old schema format and any extra information (additional fields etc.) and outputs data in the new schema format. 

A master migration function can be written that can update any data from any schema to a later schema by checking the version field on the input data then applying all the individual migration functions to get to the desired version. In the simplest case, the migration takes place between reading data from the database and passing the data to the main application. This means that the main application only has to support the latest schema.

In the case of more complex migrations, which require additional information, for example updating a password hash, the migration is performed at the write stage after collecting the relevant extra information.

Dependent types

Dependent typing is type system for functional programming languages where the type of something in a program can depend on the run-time value of something else. 

One of the useful concepts that drops out of this basic idea is using types as mathematical proofs. The programmer can define types that prove specific runtime properties of the program. This takes advantage of the Curry–Howard correspondence

Some examples of properties that can be proven are: a list is sorted, a given value is, or is not, in a data structure. Functions can demand that certain proofs about their inputs are provided and if the program doesn’t provide them then compilation will fail. This means that the programmer can be certain that if it compiles their program will do what it’s supposed to do.

With the creation of the Idris programming language, dependent types are an idea that is finally emerging from academia and is usable in real applications. As an example of the power of dependent types, here is the type of a vector (similar to std::vector in C++) addition function in Idris.

add: Vect n e -> Vect m e -> Vect (n+m) e

It takes a vector of length n and a vector of length m and returns a new vector of length n+m. Thus, the output type depends on the length of the inputs at run time. If the implementation didn’t always return a vector of the right length the compiler would raise an error.

In the case of SomeSQL, dependent types are used to create proofs that the data matches some schema. These proofs are used to guarantee the migration functions migrate the data correctly. 

Sub-proofs validate that a certain field exists in the schema and has a given type, and can be used separately within the main application. So, in the part of an application that reads a name field, the function would need to be passed in a proof that the field exists. This sub-proof would be part of the main schema proof so would normally be available. 

However, if the schema changed so the name field didn’t exist, the name reading function wouldn’t get the proof it needed, and the compiler would raise an error. 

This means the developer can change the schema used by the application and know that:

  • Anywhere in the code that won’t work with the new schema will fail at compile time
  • If no compiler errors are raised, then the application will work with the new schema

This approach massively reduces the uncertainty and risk of changing schemas, making fast schema iterations practical. In effect the schema definition becomes the tests without having to actually write them.

To summarise, the process for updating to a new schema is:

  1. Create a proof for the new schema
  2. Write a migration function from the last schema to the new one
  3. Add the migration function to the migration logic
  4. Change every use of the old schema proof in the main logic to the new one
  5. Fix every instance where the code is incompatible with the new schema, i.e. where the compiler complains.

Further Work

SomeSQL was conceived with NoSQL databases in mind, however there are more kinds of database other than relational and NoSQL for example GraphQL. The SomeSQL pattern of moving schema validation to the application may be useful for these other database paradigms as well.

Acknowledgements

SomeSQL is the product of being encouraged by my manager to “do something innovative”. Thanks to Cambridge Consultants for giving me the opportunity to do this.
 

Author
Alex Skinner
Senior Engineer, Wireless and Digital Services