Wednesday, December 28, 2011

Perspective is Everything: Why even the most intelligent software architects don't understand the Relational Model

A few weeks ago I stumbled on the article "A co-Relational Model of Data for Large Shared Data Banks" ("coRel" from here on) in the on-line version of ACM Queue (the Association for Computing Machinery's magazine). The article was authored by two Microsoft employees, Erik Meijer and Gavin Bierman.

The authors summarize their thesis thus: "Contrary to popular belief, SQL and noSQL are really just two sides of the same coin." The problem with this article is that the authors are asking entirely the wrong question. They are looking at the world with a very narrow and single-minded perspective of data - a perspective which was conventional wisdom up until the 1970s (academically) and the 1980s (commercially).

In technical terms, the authors are basically asking this question: If it is possible to implement a Network Model using a Key/Value [NoSQL] database, and it is possible to implement a Network Model using a Relational database, then can the two be queried and modified by the same declarative language? The answer to this question is a resounding yes. Unfortunately, the Network Model is not perspective neutral, which is why the Relational Model was invented.

Backing up a bit, allow me to explain what I mean. The Relational Model takes a perspective neutral approach and regards all entities, no matter how insignificant they may seem, as "first class citizens". Other data models, such as the Network Model and Hierarchical Model, lock the data into a given perspective and make certain entities "first class citizens" and others "second class citizens". For example, if you have ever organized your inbox e-mails or the documents on your computer into folders, you have probably chosen a certain hierarchy. Maybe you organized your folders by customer, so that when a customer asks a question you can quickly go to the right folder and find all the necessary information. But what if a project manager comes to you and starts asking questions about a particular project, and that project cuts across customers? What normally happens is that you start searching through each customer folder looking for e-mails or documents that pertain to the project. Most people will just copy those project documents (or create short-cuts to them) into another folder. We've all been through these searching and sorting exercises. When software developers are confronted with the same problem, they pretty much do the same thing - they reorganize (or refactor) the data.

However, if the data were Normalized (i.e. modeled relationally) to begin with, no such reorganization would be necessary.
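To make this concrete, here is a minimal sketch of what the folder example looks like when normalized. The tables and columns are hypothetical, purely for illustration:

-- Every entity is a first class citizen
CREATE TABLE CUSTOMER (CUSTOMER_ID INT PRIMARY KEY, NAME VARCHAR(100) NOT NULL);
CREATE TABLE PROJECT  (PROJECT_ID  INT PRIMARY KEY, NAME VARCHAR(100) NOT NULL);
CREATE TABLE DOCUMENT (
  DOCUMENT_ID INT PRIMARY KEY,
  TITLE       VARCHAR(200) NOT NULL,
  CUSTOMER_ID INT NOT NULL REFERENCES CUSTOMER (CUSTOMER_ID),
  PROJECT_ID  INT NOT NULL REFERENCES PROJECT (PROJECT_ID)
);

-- The "folder by customer" view of the world is just one query...
SELECT d.TITLE FROM DOCUMENT AS d WHERE d.CUSTOMER_ID = 42;

-- ...and the "folder by project" view is another; no copying or refactoring required
SELECT d.TITLE FROM DOCUMENT AS d WHERE d.PROJECT_ID = 7;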

The authors' (and the majority of software developers') myopia is apparent in their view of history, in the example data model they provide, and even in their treatment of computer science theory.

Let's talk about history first. coRel has this to say:


Codd's relational model and SQL allowed implementations from different vendors to be (near) perfect substitutes, and hence provided the conditions for perfect competition. Standardizing on the relational model and SQL created a secondary network effect around complementary producers such as educators, tool vendors, consultants, etc., all targeting the same underlying mathematical principles. Differences between actual relational database implementations and SQL dialects became to a large extent irrelevant.

While it's true that standardization around SQL led to wide adoption, such standardization had already emerged prior to the introduction of the relational model. Namely, CODASYL (the same body that created COBOL) developed a standard around the aforementioned Network Model through its Data Base Task Group (DBTG). Much of what you see in modern SQL standards actually comes from this standard - in particular the separation of DDL (data definition language) from DML (data manipulation language). However, CODASYL vendors (and there were a lot of them) were blindsided by Codd's relational model.

Interestingly, Codd's original language was not SQL, but rather Alpha. Also, the first major RDBMS vendors had competing query languages: Ingres used a language called QUEL, while Oracle and IBM used SQL. But because Ingres was always based on the relational model, it was able to simply slap on support for SQL. Ingres lives on to this day in the form of PostgreSQL. The other, non-relational vendors also live on to this day, but tend to serve particular niches (e.g. IBM's IMS is still heavily used in banking). There was nothing inherently special about SQL, and other relational languages are still around and continue to be invented. What is special is the underlying Relational Model.

For a better explanation of how the Relational Model entered the marketplace, here's a passage I scanned in from the article "The Commercialization of Database Management Systems, 1969-1983", found in the IEEE Annals of the History of Computing, Volume 31, Number 4, October-December 2009:



I am now going to talk about the sample data model used by coRel. The example is based on one taken from Amazon's SimpleDB. Here is what the original data looks like, as described by Amazon:


Here is how the authors model this using an object model (essentially a Network Model):



Here is how the authors model this using a Relational Model:



If you're an experienced relational modeller, you will observe that the model is not in BCNF (Boyce-Codd Normal Form), as two of the entities, Ratings and Keywords, have overlapping candidate keys.

Here is what the data model should look like in BCNF:



From the perspective of the Product Catalog application, this change seems somewhat academic. However, if we extend our normalized data model to include Tweets retrieved via Twitter keyword searches - which is important from the perspective of a marketer - things get more interesting. Here's the updated model:



What is important to note in this normalized model is that all entities are "first class citizens". As a Product Catalog application developer, I can ask questions or make changes to Keywords and Products without having to involve Tweets. Or, as a marketer, I can ask questions about Tweets and Keywords without having to involve Products.
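Since the diagrams aren't reproduced in this text, here is a rough DDL sketch of what such a normalized model might look like - the table and column names are my own guesses, not the authors':

-- Hypothetical sketch of the extended, normalized catalog model
CREATE TABLE PRODUCT (
  PRODUCT_ID INT PRIMARY KEY,
  TITLE      VARCHAR(200) NOT NULL
);
CREATE TABLE RATING (
  PRODUCT_ID INT NOT NULL REFERENCES PRODUCT (PRODUCT_ID),
  RATER      VARCHAR(100) NOT NULL,
  SCORE      INT NOT NULL,
  PRIMARY KEY (PRODUCT_ID, RATER)
);
CREATE TABLE KEYWORD (
  KEYWORD VARCHAR(100) PRIMARY KEY
);
CREATE TABLE PRODUCT_KEYWORD (            -- a Product Catalog concern
  PRODUCT_ID INT NOT NULL REFERENCES PRODUCT (PRODUCT_ID),
  KEYWORD    VARCHAR(100) NOT NULL REFERENCES KEYWORD (KEYWORD),
  PRIMARY KEY (PRODUCT_ID, KEYWORD)
);
CREATE TABLE TWEET (                      -- a marketing concern
  TWEET_ID BIGINT PRIMARY KEY,
  BODY     VARCHAR(140) NOT NULL
);
CREATE TABLE TWEET_KEYWORD (              -- ties TWEET to KEYWORD, never to PRODUCT
  TWEET_ID BIGINT NOT NULL REFERENCES TWEET (TWEET_ID),
  KEYWORD  VARCHAR(100) NOT NULL REFERENCES KEYWORD (KEYWORD),
  PRIMARY KEY (TWEET_ID, KEYWORD)
);

The marketer's queries join TWEET, TWEET_KEYWORD, and KEYWORD (and PRODUCT_KEYWORD only if needed); the catalog application never has to know that TWEET exists.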

When data is seen in a larger context with many different perspectives, the relational model makes sense. While it may be more efficient to model data for a particular perspective (i.e. the product catalog application) using a Network or Object model, the same model can be very inefficient for other perspectives and can lead to anomalies and contradictions in the data. This point is lost on many developers, since most only ever deal with a single perspective on the data. The following paragraph in coRel makes this very clear:

Summarizing what we have learned so far, we see that in order to use a relational database, starting with a natural hierarchical object model, the designer needs to normalize the data model into multiple types that no longer reflect the original intent; the application developer must reencode the original hierarchical structure by decorating the normalized data with extra metadata; and, finally, the database implementer has to speed up queries over the normalized data by building indexes that essentially re-create the original nested structure of the data as well.

See the problem? There is rarely any such thing as a 'natural' hierarchy. Perspective is everything, and depending on how we view an ontology, we can ascribe many different hierarchies to it.

Now, you may be wondering if it is possible to represent a relational model in a NoSQL database, such as a key/value store. The answer is: sort of. While it is possible to recreate the structure of the relational model, it is not possible to centralize the integrity of the relational model. This is not a trivial point. Referential integrity (and other forms of integrity, such as uniqueness, nullability, and value domain constraints) are what ensure the correctness of ad hoc queries. When such constraints are removed, it is up to the application developer to examine the underlying data and perform numerous tests to ensure its integrity. The end result is poor data integrity and poor data quality. I can speak quite frankly on this last point, as I see the difference between poorly constrained and well constrained data models all the time. Much as the second law of thermodynamics predicts, unconstrained data tends toward ever greater entropy over time.
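To make the point concrete, here is a minimal sketch (again with hypothetical tables) of the kinds of constraints an RDBMS lets you declare once and enforce centrally - the very things an application sitting on a key/value store has to re-implement and re-test in code:

-- Integrity rules declared once, enforced for every application and every ad hoc query
CREATE TABLE ACCOUNT (
  ACCOUNT_ID INT PRIMARY KEY,                              -- uniqueness
  EMAIL      VARCHAR(200) NOT NULL UNIQUE,                 -- alternate key, nullability
  STATUS     VARCHAR(10)  NOT NULL
             CHECK (STATUS IN ('ACTIVE', 'CLOSED'))        -- value domain
);
CREATE TABLE PAYMENT (
  PAYMENT_ID INT PRIMARY KEY,
  ACCOUNT_ID INT NOT NULL REFERENCES ACCOUNT (ACCOUNT_ID), -- referential integrity
  AMOUNT     DECIMAL(10,2) NOT NULL CHECK (AMOUNT > 0)     -- value domain
);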

In other words, the relational model allows us to manage information as a separate concern. You might even say this is the whole point of the relational model.

I now want to move on to the theoretical aspects of the paper, in particular the question of compositionality. CoRel's authors define compositionality as "the ability to arbitrarily combine complex values from simpler values without falling outside the system". They go on to say:

SQL is rife with noncompositional features. For example, the semantics of NULL is a big mess: why does adding the number 13 to a NULL value, 13+NULL, return NULL, but summing the same two values, SUM(13, NULL), returns 13?

A more precise definition of compositionality comes from Wikipedia, which states: "An important aspect of denotational semantics of programming languages is compositionality, by which the denotation of a program is constructed from denotations of its parts."

By that definition, SQL guarantees compositionality, since it doesn't have any side effects. Contrast this with most imperative programming languages (e.g. Java, C#, Python): they do not guarantee compositionality, since it's possible to write modules whose side effects impact other modules. But I don't think the authors were thinking along these lines. They're really arguing that SQL is inconsistent, and they point to the example with NULLs.

The reason there is this perception of non-compositionality is that sets and tuples are treated as primitives: you cannot build an aggregate function out of tuple-level (scalar) functions, and you cannot build a scalar function out of aggregate functions.
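The NULL example from the quote is really an instance of this scalar/aggregate split. A scratch query against a hypothetical single-column table T holding the values 13 and NULL shows the two contexts behaving differently by design:

-- Scalar context: if any operand is unknown, the whole expression is unknown
SELECT 13 + NULL;          -- returns NULL

-- Aggregate context: unknown values are skipped, known values are summed
-- (T is a hypothetical table whose column X contains the rows 13 and NULL)
SELECT SUM(X) FROM T;      -- returns 13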

NULLs are controversial to this day, and Codd even wanted to take things a step further and distinguish between "missing but applicable" and "missing and inapplicable" values.
Codd's basic argument for the inclusion of NULLs (and three-valued logic) can be summarized thus: in the real world, handling unknown values is inherently complex. Instead of thrusting that complexity back onto the user, the RDBMS should handle unknowns "correctly" - insofar as the behavior correctly models real world behavior. This can result in NULLs being counter-intuitive, but just because something is counter-intuitive doesn't mean it's wrong (think flat earth intuition, round earth reality).

For example, let's say you are a hotel manager and you want to know the average number of days a guest stays. Assume you have a database where front reception logs check-in and check-out times. Obviously, when a guest hasn't checked out yet, the check-out time is unknown (or doesn't exist), so that attribute would be NULL. When taking the average of check-out minus check-in, you will only be including rows where both the check-in and check-out times are known, since when a tuple contains an unknown element [unless stated otherwise] the expression over that tuple is itself unknown and is eliminated from the aggregate. This is exactly what you want. Putting it in a SQL query, it would look like this:

-- Rows where CheckOutTime (or CheckInTime) is NULL contribute nothing to the average
SELECT
  AVG(DATEDIFF(CheckOutTime, CheckInTime))
FROM
  BOOKINGS

As you can see, it's a very simple query to write and validate, and reflects the correct handling of NULLs. Quite the opposite of a "big mess".
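To see the arithmetic concretely, here is a small worked example with made-up data (the two-argument DATEDIFF is the MySQL form; SQL Server would use DATEDIFF(day, CheckInTime, CheckOutTime)):

CREATE TABLE BOOKINGS (
  BOOKING_ID   INT PRIMARY KEY,
  CheckInTime  DATE NOT NULL,
  CheckOutTime DATE NULL                                      -- unknown until the guest checks out
);
INSERT INTO BOOKINGS VALUES (1, '2011-10-01', '2011-10-04');  -- 3-day stay
INSERT INTO BOOKINGS VALUES (2, '2011-10-02', '2011-10-07');  -- 5-day stay
INSERT INTO BOOKINGS VALUES (3, '2011-10-06', NULL);          -- still checked in

-- The NULL row contributes nothing: the result is (3 + 5) / 2 = 4, not (3 + 5 + ?) / 3
SELECT AVG(DATEDIFF(CheckOutTime, CheckInTime)) FROM BOOKINGS;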

But the whole question of compositionality is also missing the point with SQL. SQL is not a computational language - it's a data retrieval language based on the relational model. When we're talking about data, compositionality is not our main concern - normalization is. Data which is not normalized is like a program which is non-compositional: nasty side effects can and will arise.
---
The funny thing about the relational model is that it is predicated on Relational Algebra which is completely orthogonal to the Universal Turing Machine. The former is about logic, and the latter is about flow. They are not in competition.

But when I hear people say things like "it's about time somebody built a better database than those stupid RDBMSs", it's akin to saying "it's about time somebody built a better Universal Turing Machine". It makes no sense, really.

Before I conclude this post, I want to share with you an excerpt from Joe Celko's "Thinking In Sets" which is very telling:

Many years ago, the INCITS H2 Database Standards Committee (née ANSI X3H2 Database Standards Committee) had a meeting in Rapid City, South Dakota. We had Mount Rushmore and Bjarne Stroustrup as special attractions. Mr. Stroustrup did his slide show with overhead transparencies (yes, this was before PowerPoint was ubiquitous!) about Bell Labs inventing C++ and OO programming, and we got to ask questions.

One of the questions was how we should put OO features into the working model of the next version of the SQL standard, which was known as SQL3 internally. His answer was that Bell Labs, with all their talent, had tried four different approaches to this problem and they came to the conclusion that it should not be done. OO was great for programming but deadly for data.


Summing up: while I have been critical of the coRel paper, there was clearly a lot of thought that went into it, and the authors come across as intelligent and as having a good pedigree. My point is that there is an institutional bias towards application-centric data modeling, which comes at the expense of perspective neutral data modeling - i.e. the Relational Model. It has been my experience that this bias has led to a great deal of friction between software or application architects and data architects, much to the frustration of both.

My greatest hope is that by educating students at an earlier age, this deep-rooted bias can be avoided. This, I will point out, is a long-running project of mine - a blog for another day.

Sunday, October 09, 2011

Keys, Data Synchronization, and the Relational Model

A common problem faced by database developers involves keeping schemas and data in sync between environments. The problem isn't even restricted to development. Often databases must be kept synchronized in operational scenarios. Data is often replicated for reasons of high availability and performance.
In this posting, I want to illustrate a real-world scenario that frequently comes up, and a modeling approach that ensures this scenario never turns into a problem. Even if you have not encountered this scenario, it's worth reading this blog in its entirety for tips on better data modeling, and why the relational model is still very relevant.

As anyone who has had to keep a calendar or contacts in sync between devices knows - or any developer accustomed to using version management systems like Visual SourceSafe (VSS) or Subversion - the problem really boils down to the following three requirements:

  1. If the data is in the source data store but not the target, then add it to the target.
  2. If the data is in the target but not in the source, then remove it from the target.
  3. If the data is both in the source and the target, overwrite the target's attributes with the source's attributes.

Simple, right? Indeed, everything is simple... IF you have identifiers which are consistent between the source and target data stores. For the purposes of this blog posting, I'm going to focus on the use case of syncing data between a development server and a test server, which is equivalent to syncing between a test server and a production server.
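To make the three rules concrete, here is a minimal sketch using SQL Server's MERGE statement, assuming both databases are reachable from one connection and share consistent keys (the database, table, and column names are purely illustrative):

MERGE INTO TestDb.dbo.EMPLOYEE AS t
USING DevDb.dbo.EMPLOYEE AS s
  ON t.ID = s.ID                      -- the keys must line up for any of this to work
WHEN MATCHED THEN                     -- rule 3: overwrite the target's attributes
  UPDATE SET FIRST_NAME = s.FIRST_NAME,
             LAST_NAME  = s.LAST_NAME
WHEN NOT MATCHED BY TARGET THEN       -- rule 1: in the source but not the target
  INSERT (ID, FIRST_NAME, LAST_NAME)
  VALUES (s.ID, s.FIRST_NAME, s.LAST_NAME)
WHEN NOT MATCHED BY SOURCE THEN       -- rule 2: in the target but not the source
  DELETE;

Most of the rest of this post is really about what belongs in that ON clause when the auto-generated keys can't be trusted to match.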

I am not going to get into the intricacies of syncing calendar or personal contact data - that scenario is actually quite different since there is never a universally agreed upon key or way of identifying individuals, so you really don't have a proper key to compare against. That's why most PIM synchronization software tends to take a "fuzzy" approach when syncing up contacts. Companies like Facebook and LinkedIn are in a privileged position to address this ongoing problem, as they are effectively becoming a de facto registry for contact information. But I digress...

For this posting, to make my examples more explicit I'll refer to the Microsoft technology stack, although all of the major RDBMS vendors support the same functionality and tools.

The crux of the problem when synchronizing most data comes down to the alignment of keys (or identifiers if you prefer).

A common conundrum that comes up when modeling data is defining the right primary key. For example, I can create an employee table and use the employee's social insurance number as my primary key. Or perhaps I don't have access to the social insurance number, and instead use telephone number plus first and last name as my primary key. In the latter case, people are known to change their name or their telephone number. The key is effectively out of our control, so in order to have a reliable key we end up creating what I refer to as a "technical key". Technical keys are often auto-generated from incrementing sequences, but they can just as easily be generated from a GUID. It is common to find auto-generated primary keys because not only is the data modeler in complete control of the key values, they are also compact, especially when referenced as foreign keys from other [dependent] tables.

Data synchronization problems begin here, simply because the technical key depends on the database or application that generated it. If you import the same data into a different database for the first time, it can and probably will receive different auto-generated keys. I am well aware that it is possible to load data into other databases while preserving keys, and if you're taking a "master/slave" approach to data synchronization you have nothing to worry about, since you're effectively just copying the data over. However, if you already have data loaded in the target table and you simply need to update a few columns, then you're going to run into problems.

A good solution to this problem is to define an alternate natural key. Basically, this is just a UNIQUE index on one or more columns - for our example above, name and telephone number. Given that natural keys can drift apart between environments, the developer should perform some analysis to assess and quantify that risk. Once you have determined the natural keys are in sync, you can [as an example] use Visual Studio 2010 Schema Compare to perform a schema comparison, which will generate a SQLCMD script (assuming you are promoting schema changes), and you can use Visual Studio 2010 Data Compare to generate a SQL script that performs all the UPDATEs, INSERTs, and DELETEs. In VS 2010 you can choose whether to use a table's primary key or one of the alternate natural keys you have defined.
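Declaring the alternate natural key is a one-liner; the column names below follow the telephone-and-name example and are, again, just illustrative:

-- Alternate natural key; the auto-generated primary key stays in place for foreign keys
CREATE UNIQUE INDEX UX_EMPLOYEE_NATURAL_KEY
  ON EMPLOYEE (TELEPHONE_NUMBER, FIRST_NAME, LAST_NAME);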

I should point out that if your table doesn't have any set of columns (apart from the primary key) which can be guaranteed to be unique, and the primary keys themselves are generated and therefore database server specific, you should try your best to rectify this situation as you now have a more fundamental problem, which I won't be addressing in this blog (hint: You need to start looking at the target's change history).

So far so good. But what if you have a dependent/child table whose natural key depends on the parent table? Let's say there is a table called EMPLOYEE_INVENTORY, which is a list of items that have been provisioned to each employee. The natural key for this table might be the composite of EMPLOYEE_ID and INVENTORY_ITEM_ID (while the primary key for EMPLOYEE_INVENTORY is an auto-generated key). To keep things simple, assume the primary key and natural key for INVENTORY_ITEM are one and the same: the universal SKU # of the inventory item. So in summary, the primary key for EMPLOYEE_INVENTORY ("ID") is a technical key, and the natural alternate key is "EMPLOYEE_ID" + "INVENTORY_ITEM_ID".
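To make the rest of the discussion easier to follow, here is a rough DDL sketch of the three tables (the column names and types are assumptions on my part):

CREATE TABLE dbo.EMPLOYEE (
  ID               INT NOT NULL PRIMARY KEY,          -- technical key (auto-generated in practice)
  TELEPHONE_NUMBER VARCHAR(20) NOT NULL,
  FIRST_NAME       VARCHAR(50) NOT NULL,
  LAST_NAME        VARCHAR(50) NOT NULL               -- natural key: telephone number + names (see the index above)
);

CREATE TABLE dbo.INVENTORY_ITEM (
  INVENTORY_ITEM_ID VARCHAR(20) NOT NULL PRIMARY KEY  -- the universal SKU #: technical and natural key in one
);

CREATE TABLE dbo.EMPLOYEE_INVENTORY (
  ID                 INT NOT NULL PRIMARY KEY,        -- technical key (auto-generated in practice)
  EMPLOYEE_ID        INT NOT NULL REFERENCES dbo.EMPLOYEE (ID),
  INVENTORY_ITEM_ID  VARCHAR(20) NOT NULL REFERENCES dbo.INVENTORY_ITEM (INVENTORY_ITEM_ID),
  LAST_ACTIVITY_DATE DATETIME NULL,
  QUANTITY           INT NOT NULL,
  CONSTRAINT UX_EMPLOYEE_INVENTORY_NK UNIQUE (EMPLOYEE_ID, INVENTORY_ITEM_ID)  -- natural alternate key
);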

Now we have a bit of a problem when it comes to keeping EMPLOYEE_INVENTORY in sync. Namely, half of its natural key is derived from the auto-generated primary key of its parent table (EMPLOYEE). Should we wish to sync based on natural keys, we're forced to develop code to perform lookups, comparisons, UPDATEs, INSERTs, and DELETEs - we can no longer rely on data syncing software like VS 2010 Data Compare to do this for us. While there is nothing inherently complicated about this, it will invariably take you a chunk of time to write this code, whether you do it in an ETL tool like SSIS or stick to procedural code. If you have another dependent/child table which in turn depends on EMPLOYEE_INVENTORY (e.g. EMPLOYEE_INVENTORY_LOG), things get more complicated still and you'll spend considerably more time completing the task.

The preferred approach is to create what is known as an UPDATABLE VIEW, which allows you to substitute the parent table's natural key for its technical primary key. Just to clarify, an UPDATABLE VIEW is exactly what its name says it is: a VIEW you can UPDATE. As you can imagine, there are limitations as to which VIEWs can be updated (clearly anything with an aggregate would not be updatable). For our scenario, though, the UPDATABLE VIEW might look like this (the view name is just illustrative):
CREATE VIEW dbo.EMPLOYEE_INVENTORY_SYNC  -- hypothetical name
WITH SCHEMABINDING
AS
SELECT
  e.TELEPHONE_NUMBER,    --Natural Key
  e.FIRST_NAME,          --Natural Key
  e.LAST_NAME,           --Natural Key
  ei.INVENTORY_ITEM_ID,  --Natural Key
  ei.LAST_ACTIVITY_DATE, --Attribute to sync
  ei.QUANTITY            --Attribute to sync
FROM
  dbo.EMPLOYEE AS e INNER JOIN
  dbo.EMPLOYEE_INVENTORY AS ei ON e.ID = ei.EMPLOYEE_ID



For SQL Server, if you want to make this VIEW UPDATABLE, there are two things you need to do. First, you need to define it WITH SCHEMABINDING (as shown above). Second, you need to define a UNIQUE CLUSTERED INDEX on the natural key columns of the view (i.e. "TELEPHONE_NUMBER", "FIRST_NAME", "LAST_NAME", "INVENTORY_ITEM_ID").
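In T-SQL, the second step might look like this (the index name, like the view name, is just illustrative):

-- A unique clustered index on the view's natural key columns
CREATE UNIQUE CLUSTERED INDEX UX_EMPLOYEE_INVENTORY_SYNC
  ON dbo.EMPLOYEE_INVENTORY_SYNC (TELEPHONE_NUMBER, FIRST_NAME, LAST_NAME, INVENTORY_ITEM_ID);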

You are now in a position where you can use data synchronization tools like VS 2010 Data Compare to automatically synchronize the data for you. Because we haven't had to write any procedural code, we can focus instead on problems which arise from the data itself (e.g. a natural key mismatch), as opposed to getting bogged down in throwaway code.

Problem solved.

********

What I want to illustrate in this blog is the power that comes from mindful modelling and adhering to relational principles.

In the last couple of years, NoSQL databases like MongoDB, Redis, Cassandra, and Google DataStore have flourished. Indeed, these databases provide significant advantages for application developers in terms of scalability. They also feel like an ORM layer, but with much greater efficiency, so they are very appealing to application developers. And not to be ignored, many of these new databases are open source and can be used for little or no money. Case in point: I'm planning on building a new hobby application using Google DataStore, since I can get up and running without paying a cent.

The downside of these modern database technologies is that they suffer from many of the same limitations that plagued pre-relational database developers. Namely, there is a "perspective lock-in". What this means is that once the application developer has modeled data for their application's use cases, it may be difficult for future applications to use the data for their own purposes. It will also be difficult to run any sort of ad hoc queries without first exporting the data to an analytical RDBMS.

This is not to say that NoSQL databases should never be used. It is to say that they should be chosen with eyes wide open and a clear understanding of the trade-offs involved. In fact, I believe the most compelling reason to use a NoSQL database is low cost (in particular, hardware costs). Let me repeat that: if you absolutely cannot afford to scale using an RDBMS, and you can't see yourself bootstrapping your way along, then go with a NoSQL database.

Put another way, virtually every NoSQL innovation I've seen has been absorbed into an RDBMS. Take, for example, binary large objects (BLOBs) like videos, music, and documents: Microsoft (and presumably others) now allows you to access these objects directly through the file system, while still allowing them to be managed transactionally through the RDBMS. Analytical databases like ParAccel and Vertica allow for petabyte scaling - and still offer full relational capabilities. Contrast this with Google DataStore, which doesn't even support basic JOINs or aggregations (GROUP BYs). This means you have to write that code on your own. Not only is that going to be cumbersome and error prone, it's also going to perform worse. This is why you're beginning to see "SQL layers" like Google's Tenzing being added to NoSQL databases to speed up commonly requested tasks. Even Facebook, which started the Cassandra project, still uses MySQL (combined with Memcached in a sharded configuration).

There's also the notion of BASE versus ACID, and the types of business models each can tolerate. But that's a discussion for another day...

And so I will end this blog entry.