Sunday, October 09, 2011

Keys, Data Synchronization, and the Relational Model

A common problem faced by database developers involves keeping schemas and data in sync between environments. The problem isn't even restricted to development. Often databases must be kept synchronized in operational scenarios. Data is often replicated for reasons of high availability and performance.
In this posting, I want to illustrate a real-world scenario that frequently comes up, and a modeling approach that ensures this scenario never turns into a problem. Even if you have not encountered this scenario, it's worth reading this blog in its entirety for tips on better data modeling, and why the relational model is still very relevant.

As anyone who has had to keep a calendar or contacts in sync between devices knows, and as any developer accustomed to version control systems like Visual SourceSafe (VSS) or Subversion will recognize, the problem really boils down to the following three requirements:

  1. If the data is in the source data store but not the target, then add it to the target.
  2. If the data is in the target but not in the source, then remove it from the target.
  3. If the data is in both the source and the target, overwrite the target's attributes with the source's attributes.
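
The three requirements above map directly onto SQL. As a minimal sketch, here they are expressed as a single T-SQL MERGE (the schema names source and target and the EMPLOYEE columns are illustrative, and this assumes for the moment that EMPLOYEE_ID identifies the same row in both stores):

```sql
-- Requirement 1: INSERT rows missing from the target.
-- Requirement 2: DELETE rows missing from the source.
-- Requirement 3: UPDATE rows present in both.
MERGE INTO target.EMPLOYEE AS t
USING source.EMPLOYEE AS s
   ON t.EMPLOYEE_ID = s.EMPLOYEE_ID
WHEN MATCHED THEN                     -- requirement 3
    UPDATE SET t.FIRST_NAME = s.FIRST_NAME,
               t.LAST_NAME  = s.LAST_NAME
WHEN NOT MATCHED BY TARGET THEN       -- requirement 1
    INSERT (EMPLOYEE_ID, FIRST_NAME, LAST_NAME)
    VALUES (s.EMPLOYEE_ID, s.FIRST_NAME, s.LAST_NAME)
WHEN NOT MATCHED BY SOURCE THEN       -- requirement 2
    DELETE;
```

Everything hinges on that ON clause: the join key must identify the same row on both sides.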

Simple, right? Indeed, everything is simple... IF you have identifiers that are consistent between the source and target data stores. For the purposes of this blog posting, I'm going to focus on the use case of syncing data between a development server and a test server, which is equivalent to syncing between a test server and a production server.

I am not going to get into the intricacies of syncing calendar or personal contact data - that scenario is actually quite different, since there is no universally agreed-upon key or way of identifying individuals, so you really don't have a proper key to compare against. That's why most PIM synchronization software tends to take a "fuzzy" approach when syncing up contacts. Companies like Facebook and LinkedIn are in a privileged position to address this ongoing problem, as they are effectively becoming a de facto registry for contact information. But I digress...

For this posting, to make my examples more explicit I'll refer to the Microsoft technology stack, although all of the major RDBMS vendors support the same functionality and tools.

The crux of the problem when synchronizing most data comes down to the alignment of keys (or identifiers if you prefer).

A common conundrum that comes up when modeling data is defining the right primary key. For example, I can create an employee table and use the employee's social insurance number as my primary key. Or perhaps I don't have access to the social insurance number, and instead use telephone number plus first and last name as my primary key. In the latter case, people are known to change their name or their telephone number; the key is effectively out of our control. In order to have a reliable key, we end up creating what I refer to as a "technical key". Technical keys are often auto-generated from incrementing sequences, but they can just as easily be generated from a GUID. Auto-generated primary keys are common because not only is the data modeler in complete control of the key values, but the keys are also compact, especially when referenced as foreign keys from other [dependent] tables.

Data synchronization problems begin here, simply because a technical key depends on the database or application that generated it. If you import the same data into a different database for the first time, it can and probably will be assigned a different auto-generated key. I am well aware that it is possible to load data into other databases while preserving keys, and if you're taking a "master-slave" approach to data synchronization you have nothing to worry about, since you're effectively just copying the data over. However, if you already have data loaded in the target table and you simply need to update a few columns, then you're going to run into problems.

A good solution to this problem is to define an alternate natural key. Basically, this is just a UNIQUE index on one or more columns; for our example above, it could comprise name and telephone number. Since natural key values can still differ between environments (people do change names and telephone numbers), the developer should perform some analysis to assess and quantify the risk that natural keys differ between source and target.
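As a concrete sketch, the alternate natural key is just a UNIQUE index sitting alongside the technical primary key (table and column names here are illustrative):

```sql
CREATE TABLE dbo.EMPLOYEE (
    ID               INT IDENTITY(1,1) PRIMARY KEY, -- technical key
    FIRST_NAME       NVARCHAR(50) NOT NULL,
    LAST_NAME        NVARCHAR(50) NOT NULL,
    TELEPHONE_NUMBER VARCHAR(20)  NOT NULL
);

-- The alternate natural key: a UNIQUE index on the identifying columns.
CREATE UNIQUE INDEX UX_EMPLOYEE_NATURAL
    ON dbo.EMPLOYEE (TELEPHONE_NUMBER, FIRST_NAME, LAST_NAME);
```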
Once you have determined the natural keys are in sync, you can [as an example] use Visual Studio 2010 Schema Compare to perform a schema comparison, which will generate a SQLCMD script (assuming you are promoting schema changes). You can also use Visual Studio 2010 Data Compare to generate a SQL script that performs all the UPDATEs, INSERTs, and DELETEs. In VS 2010 you can choose whether to use a table's primary key or one of the alternate natural keys you have defined.

I should point out that if your table doesn't have any set of columns (apart from the primary key) that can be guaranteed unique, and the primary keys themselves are generated and therefore specific to the database server, you should do your best to rectify this situation, as you now have a more fundamental problem, which I won't be addressing in this blog (hint: you need to start looking at the target's change history).

So far so good. But what if you have a dependent/child table whose natural key depends on the parent table? Let's say there is a table called EMPLOYEE_INVENTORY, which is a list of items that have been provisioned to the employee. The natural key for this table might be the composite of EMPLOYEE_ID and INVENTORY_ITEM_ID (while the Primary Key for EMPLOYEE_INVENTORY is an auto-generated key). To keep things simple, the Primary Key and Natural Key for INVENTORY_ITEM are one and the same: the universal SKU # of the inventory item. So in summary, the primary key for EMPLOYEE_INVENTORY ("ID") is a technical key, and its natural alternate key is "EMPLOYEE_ID" + "INVENTORY_ITEM_ID".
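In DDL terms, the running example might look like this (names and types are illustrative; dbo.EMPLOYEE is the parent table from earlier, with a technical key "ID"):

```sql
CREATE TABLE dbo.INVENTORY_ITEM (
    INVENTORY_ITEM_ID VARCHAR(20) PRIMARY KEY,   -- the universal SKU #
    DESCRIPTION       NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.EMPLOYEE_INVENTORY (
    ID                 INT IDENTITY(1,1) PRIMARY KEY,             -- technical key
    EMPLOYEE_ID        INT NOT NULL REFERENCES dbo.EMPLOYEE (ID), -- parent's technical key
    INVENTORY_ITEM_ID  VARCHAR(20) NOT NULL REFERENCES dbo.INVENTORY_ITEM,
    LAST_ACTIVITY_DATE DATETIME NOT NULL,
    QUANTITY           INT NOT NULL,
    -- the natural alternate key
    CONSTRAINT UX_EMPLOYEE_INVENTORY_NATURAL UNIQUE (EMPLOYEE_ID, INVENTORY_ITEM_ID)
);
```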

Now we have a bit of a problem when it comes to keeping EMPLOYEE_INVENTORY in sync: half of its natural key is derived from the auto-generated primary key in its parent table (EMPLOYEE). Should we wish to sync based on natural keys, we're forced to write code to perform the lookups, comparisons, UPDATEs, INSERTs, and DELETEs; we can no longer rely on data syncing software like VS 2010 Data Compare to do this for us. While there is nothing inherently complicated about this, it will invariably take you a chunk of time to write this code, whether you do it in an ETL tool like SSIS or stick to procedural code. If you have another dependent/child table which in turn depends on EMPLOYEE_INVENTORY (e.g. EMPLOYEE_INVENTORY_LOG), things get more complicated still, and you'll spend considerably more time completing the task.
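To see where the time goes, here is a sketch of what the hand-written version might look like as a single T-SQL MERGE that resolves the parent's technical key through the employee natural key on both sides (the source and target schema names are illustrative):

```sql
MERGE INTO target.EMPLOYEE_INVENTORY AS t
USING (SELECT te.ID AS EMPLOYEE_ID,          -- target's technical key,
              si.INVENTORY_ITEM_ID,          -- looked up via the natural key
              si.LAST_ACTIVITY_DATE, si.QUANTITY
       FROM source.EMPLOYEE_INVENTORY si
       JOIN source.EMPLOYEE se ON se.ID = si.EMPLOYEE_ID
       JOIN target.EMPLOYEE te
         ON te.TELEPHONE_NUMBER = se.TELEPHONE_NUMBER
        AND te.FIRST_NAME       = se.FIRST_NAME
        AND te.LAST_NAME        = se.LAST_NAME) AS s
   ON t.EMPLOYEE_ID      = s.EMPLOYEE_ID
  AND t.INVENTORY_ITEM_ID = s.INVENTORY_ITEM_ID
WHEN MATCHED THEN
    UPDATE SET t.LAST_ACTIVITY_DATE = s.LAST_ACTIVITY_DATE,
               t.QUANTITY           = s.QUANTITY
WHEN NOT MATCHED BY TARGET THEN
    INSERT (EMPLOYEE_ID, INVENTORY_ITEM_ID, LAST_ACTIVITY_DATE, QUANTITY)
    VALUES (s.EMPLOYEE_ID, s.INVENTORY_ITEM_ID, s.LAST_ACTIVITY_DATE, s.QUANTITY)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
```

And this is the simple case: every additional level of dependent table adds another layer of lookups.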

The preferred approach is to create what is known as an UPDATABLE VIEW, which will allow you to substitute the parent table's technical primary key with its natural key. Just to clarify, an UPDATABLE VIEW is exactly what its name says it is: a VIEW you can UPDATE. As you can imagine, there are limitations as to which VIEWs can be updated (clearly anything with an aggregate would not be updatable). For our scenario, though, the UPDATABLE VIEW might look like this (the view name is illustrative):

CREATE VIEW dbo.V_EMPLOYEE_INVENTORY WITH SCHEMABINDING AS
SELECT e.TELEPHONE_NUMBER,    --Natural Key
       e.FIRST_NAME,          --Natural Key
       e.LAST_NAME,           --Natural Key
       ei.INVENTORY_ITEM_ID,  --Natural Key
       ei.LAST_ACTIVITY_DATE, --Attribute to sync
       ei.QUANTITY            --Attribute to sync
FROM dbo.EMPLOYEE_INVENTORY ei
JOIN dbo.EMPLOYEE e ON e.ID = ei.EMPLOYEE_ID;

For SQL Server, if you want to make this VIEW UPDATABLE and usable as a sync target, there are two things you need to do. First, you need to define it with the SCHEMABINDING option. Second, you need to define a UNIQUE CLUSTERED INDEX on the natural key columns of the view (i.e. "TELEPHONE_NUMBER", "FIRST_NAME", "LAST_NAME", "INVENTORY_ITEM_ID").
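For example (the view name is hypothetical; any join view defined WITH SCHEMABINDING over these two tables would do):

```sql
CREATE UNIQUE CLUSTERED INDEX UX_V_EMPLOYEE_INVENTORY
    ON dbo.V_EMPLOYEE_INVENTORY
       (TELEPHONE_NUMBER, FIRST_NAME, LAST_NAME, INVENTORY_ITEM_ID);
```

Once the index exists, the view has a stable, environment-independent key that comparison tools can join on.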

You are now in a position where you can use data synchronization tools like VS 2010 Data Compare to automatically synchronize the data for you. Because we haven't had to write any procedural code, we can focus instead on problems that arise from the data itself (e.g. a natural key mismatch), as opposed to getting bogged down in throwaway code.

Problem solved.


What I want to illustrate in this blog is the power that comes from mindful modeling and adhering to relational principles.

In the last couple of years, NoSQL databases like MongoDB, Redis, Cassandra, and Google DataStore have flourished. Indeed, these databases provide significant advantages for application developers in terms of scalability. They also feel like an ORM layer, but with much greater efficiency, so they are very desirable to application developers. And not to be ignored, many of these new databases are open source and can be used for little or no money. Case in point: I'm planning on building a new hobby application using Google DataStore, since I can get up and going without paying a cent.

The downside of these modern database technologies is that they suffer from many of the same limitations that plagued pre-relational database developers. Namely, there is a "perspective lock-in". What this means is that once the application developer has modeled data for their application's use cases, it may be difficult for future applications to use the data for their own purposes. It will also be difficult to run any sort of ad hoc queries without first exporting the data to an analytical RDBMS.

This is not to say that NoSQL databases should not be used, but rather that they should be chosen with eyes wide open and a clear understanding of the trade-offs involved. In fact, I believe the most compelling reason to use a NoSQL database is low cost (in particular, hardware costs). Let me repeat that: if you absolutely cannot afford to scale using an RDBMS and you can't see yourself bootstrapping along, then go with a NoSQL database.

Put another way, virtually every NoSQL innovation I've seen has been absorbed into an RDBMS. Take for example binary large objects (BLOBs) like videos, music, and documents: Microsoft (and presumably others) now allows you to access these objects directly through the file system, while still allowing them to be managed transactionally through the RDBMS. Analytical databases like ParAccel and Vertica allow for petabyte scaling while retaining full relational capabilities. Contrast this with Google DataStore, which doesn't even support basic JOINs or aggregations (GROUP BYs); you have to write that code on your own. Not only is that cumbersome and error prone, it's also going to perform worse. This is why you're beginning to see "SQL layers" like Google Tenzing being added to NoSQL databases to speed up commonly requested tasks. Even Facebook, which started the Cassandra project, still uses MySQL (combined with Memcached in a sharded configuration).

There's also the notion of BASE versus ACID, and the types of business models each can tolerate. But that's a discussion for another day...

And so I will end this blog entry.
