Recently, I spoke at NoSQL Matters in Barcelona about database history. As somebody with a history background, I was pretty excited to dig into the past, beyond the hype and marketing fluff, and look specifically at what technical problems each generation of database solved and where they in-turn fell short.
However, I got stuck at one moment in time I found utterly fascinating: the original development of relational databases. So much of the NoSQL movement feels like a rebellion against the “old timey” feeling relational databases. So I thought it would be fascinating to be a contrarian, to dig into what value relational databases have added to the world. Something everyone thinks is obvious but nobody really understands.
It’s very easy and popular to criticize relational databases. What folks don’t seem to do is go back and appreciate how revolutionary relational databases were when they came out. We forget what problems they solved. We forget how earlier databases fell short, and how relational databases solved the problems of the first generation of databases. In short, relational databases were the noSomething, and I aimed to find out what that something was.
And from that apply those lessons to today’s NoSQL databases. Are today’s databases repeating mistakes of the past? Or are they filling an important niche (or both?).
The First Databases
To appreciate how revolutionary the relational model is, we need to appreciate what the earliest, pre-relational databases looked like. These databases largely reflect what we would build if we were tasked by our boss to “build a database”. Maybe, if we’re tracking movie rentals, in our ignorance we might start out with a line-by-line csv listing of customers and what movies they’ve rented. Maybe something silly that looks like this:
Poor Man’s Database
We might also add an “index” to the back of the file to help us find specific records, in the same way we’d use an index in a book. Here the index is telling us that the “doug” record is 512 characters into the file while the “rick” record is 9212 characters into the file.
Now the earliest designers of these data storage systems would have had to face the problem, “How do I store movies as their own record?”. For example, we’re going to need to start storing how much a movie costs to rent and how many we have in stock. Should movies like “Top Gun” be treated as top-level records? Or should they be as parts-of (something owned by) the user records we already have? With movies as their own records, we don’t have to duplicate all the inventory and price data everywhere, but we’ll need to create another construct (an additional record type?) to specify the relationship.
Poor Man’s Database
The databases that aggregate videos into users are considered to be following the “hierarchical” model. The databases that break out movies into their own records with the ability to link records fall into the “navigational” model.
You can imagine, if we were interacting with this database, our first reaction would be to create a rather primitive interface – create something low-level that talked only in movies and users. Luckily, others saw past this and realized that patterns like the ones in our video rental and other data stores could be generalized–that we could abstract the notion of “record” and “ownership” into something much more powerful. Eventually this is exactly what Codasyl/DBTG (Database Task Group) did, creating a standard language for creating and interacting with navigational and hierarchical databases.
So reflecting on our data model, you could create a record in one of these databases by using the RECORD command:
Record Name is USER Location Mode is Calc Using username Duplicates are not allowed username Type character 25 address Type character 50 phonenumber Type character 10
And when we want to declare a relationship between two records, we can define a set that maps them in an owner-ownee relationship.
Set Name is USER-VIDEOS Order is Next Retention is Mandatory Owner is USER Member is VIDEO
This model reflects first building many databases and then attempting to generalize important abstractions. We can think of the early history of databases like our video rental example. Since the advent of storage media, folks needed to store all kinds of data. Eventually somebody did something laudable, reflected on all the patterns being used in data storage and management and created a generalized database that was codified in the DBTG language.
Whereas the first database’s abstractions grew out of patterns learned building from the bottom-up, the relational model did just the opposite. The relational model creates an amazingly powerful abstraction rooted in predicate calculus, and then expects the implementation details to follow (which as we all know they did).
When Codd wrote his paper, he criticized the DBTG databases of the day around the area of how the application interacted with the databases abstractions. Low-level abstractions leaked into user applications. Application logic became dependent on aspects of the databases: Specifically, he cites three criticisms:
- Access Dependencies: We often need to navigate from users -> videos to get at the videos. Application logic depends on how records are linked or aggregated. I must use one record type to get another record type.
- Order Dependencies: Applications depend on how data is physically stored in the database. Notice the “ORDER is NEXT” line in the Set declaration above. This specifies storage based on insert order, so retrieval will in turn happen on index order.
- Index Dependencies: When accessing indices, these databases required database indices be referred to explicitly by name.
Codd proposed to get around these limitations by focusing on a specific abstraction: relations. A relation is simply a tuple of elements. The ith element of each tuple is a member of some set, known as that element’s domain. Perhaps a given element’s domain is the set of users, user ids, possible movies to rent, etc. So for our videos, a tuple might look like:
(user=u, address=a, list of movies rented=l)
Or in other words
(doug, 1234 Bagby St, [Top Gun 3.99, Terminator 12.99])
One can interpret this relation to “mean” any number of things. We might equate this to a statement that “Doug rents Top Gun at 3.99 and Terminator at 12.99” or “Doug lives at 1234 Bagby St”.