Saturday, December 01, 2007

The Secret... to Getting Things Done... in Data Management, for Dummies and Idiots... in Three Simple Steps

I've decided to hop on the dumbing down bandwagon for once. But instead of just offering your typically mind-numbing advice, and providing a bunch of pointless examples to prove my point. I'm going to try to address the skeptics head-on.

So what is this "secret", and these "three simple steps" you might ask? I will get right to the point, and then substantiate my arguments for the skeptics. That said, as with anything there are always "exceptions that prove the rule", and if you can point those out (and I encourage you to do so), then I'll try to address those arguments as well. Without further ado, here is "the secret".

  1. Keep one, and only one copy of your data in an enterprise class RDBMS.
  2. Normalize your entities to third normal form, or Boyce-Codd normal form.
  3. Maintain operational data definitions for each data element, ensuring that the definitions. are understood to mean the same thing by all stakeholders.
That's it. Simple right? Honestly speaking, if you can pull off what I've just described you will have set the foundation for perfect or near-perfect data quality (which should be the goal of all data management professionals). Well, not so fast. This is actually a lot harder than it looks, even for a green field application, let alone for legacy systems. I will point out some of the pitfalls and some of the solutions to those gotchas, starting from easiest to hardest.

Problem #1: Keeping only a single copy of the data has the following two major drawbacks:
  1. Data warehouses need to be kept separate from operational data stores, for performance reasons, storage reasons, structural reasons, and auditability reasons.
  2. Putting all your "eggs in once basket" is inherently risky, and leads to a single point of failure.
These are both valid points, but I will argue that the technology has more or less arrived that invalidates (or is on its way to invalidating) this argument. Generally speaking, we should be striving to let our RDBMS do the data management work, and not our IT professionals. The second something has to be managed by a human being is the second that you can expect quality issues. All these tools you see, such as ETL, data discovery, data cleansing, etc. are basically a way of managing the copying of data. If you can get rid of the copying (or put it in a black box), then you can eliminate the management.

Getting back to reality: First, it is possible with all major RDBMSs (i.e. DB2, Oracle, and SQL Server) to limit a user's resource, and protect another user's resources. In other words, it's possible to provide guarantees of service to the operational systems by denying resources to the reporting systems. If resources becomes a regular problem it is now relatively straightforward [through clustering technology] to swap in more memory and processing power. Second, as far as storage is concerned: Storage Area Network [SAN] and Information Lifecycle Management [ILM], and table partitioning technologies have got to the point where, with proper configuration, tables can grow indefinitely without impacting operational performance. In fact, in Oracle 11g, you can now configure child tables to partition automatically based on parent table partition configuration (before 11g you would have to explictitly manage the child table partitions). While we're on the topic of 11g, it is also now possible to Flashback forever. Namely it's possible to see historical snapshots of the database as far back as you want to go. That said, I don't believe this feature has been optimized for timeseries reporting. So, this may be the last technology holdout to support the case of building a data warehouse. Nevertheless, this is something I'm sure this is a problem the RDBMS vendors can surely solve. Third, as far as structure is concerned, it is not necessary, and in fact can be counter-productive to de-normalize data into star/snowflake schemas. I'll address this in my rebuttal to "problem #2". Fourth, auditability is indeed a big part of the ETL process. But we wouldn't need audit data if we didn't need to move it around to begin with, so it's a moot point. If you want to audit access to information, this can easily be done with current trace tools built into all the major RDBMSs.

On the problem of "putting your eggs in one basket". While I can see that this has an appeal that can be appreciated by the business, it's really a facile argument. Simply put, the more databases you need to manage, and the more varieties of databases you have to manage, results in less depth of knowledge for each individual database, and weaker regimes for each database. If you had all your data in a single database, you could then spend more of your time understanding that database, and implement the following:
  • Routine catastrope drills (i.e. server re-builds, and restores)
  • Geographically distributed failover servers
  • Rolling upgrades (I know this is possible with Oracle)
  • Improved brown-out support
In theory, if you had a lean and mean DB regimes, and well trained DBAs you could recover from a fire in your data centre in a matter of hours, and experience nothing more than a brown-out. However, in reality, DBAs and other resources are spread too thin to be able to do this, and in times of this kind of catastrophe you would be lucky to recover your data and have your applications back on-line within a week.

Problem #2: Keeping data in 3NF or Boyce-Codd Normal form is difficult to work with, and performs poorly in a data warehouse.

These are also valid points. First, I will agree that people employed in writing BI reports or creating cubes (or whatever reporting tool they're using), prefer to work with data in a star or snowflake schema. Organizing data in this format makes it easy to separate qualitative information from quantitative information, and lowers the ramp-up time for new hires. However, whenever you transform data you immediately jeopardize its quality, and introduce the possibility of error. But more importantly you eliminate information and options in the process. The beauty of normalized data, is that it allows you to pivot from any angle your heart desires. Furthermore, things like key constraints constitute information unto themselves, and by eliminating that structure it's harder to make strong assertions about the data itself. For example, an order # may have a unique constraint in its original table, but when it's mashed up into a customer dimension, it's no longer explicitly defined that that field must be unique (unless someone had the presence of mind to define it that way, but once again human error creeps in). Second, as far as performance is concerned, I agree that generally speaking de-normalized data will respond quicker. However, it is possible to define partions, cluster indexes and logical indexes, so as to achieve the exact same "Big O" Order. Therefore the differences in performance are linear, and can be solved by adding more CPUs and memory, and thus overall scalability is not affected.

Problem #3: It is impossible to get people to agree on one version of "the truth".

While all my other arguments were focused on technology, this problem is clearly in the human realm, and I will submit that this problem with never ever be definitively solved. Ultimately it comes down to politics and human will. One person's definition of "customer" may make them look good, but make another person look bad. Perception is reality, and the very way in which we concoct definitions impacts the way the business itself is run. A good friend of mine [Jonathan Ezer] posited that there is no such thing as "The Truth", and it is inherently subjective. As an example, he showed how the former planet Pluto, is no longer a planet through a voting process. Yes, there is still a hunk of ice/rock floating up there in outer space, but that's not what the average person holds to be important. Fortunately, only scientists were allowed to vote, but if astrologers were invited, they would surely have rained on the scientists parade. Okay, so this sounds a little philosophical and airy fairy. But consider this: According to Peter Aiken, the IRS took 13 years to arrive at a single definition of the word "child", including 2.5 years to move through legislation. That said, what advice I can offer to break through these kinds of logjams, and which I've also heard from other experienced practitioners is to leverage definitions that have already been vetted by external organizations. Going back to the IRS example, we now have a thorough operational definition that provides us with a good toehold. Personally, I try to look for ISO or (since I'm Canadian) StatCan definitions of terms, if the word is not clear. Another great place is Wikipedia or its successor Citizendium. Apart from that, all I can suggest is to be as diplomatic and patient as possible.

Am I wrong? Am I crazy? Or does data management need to be complicated. Well, I guess the software vendors like it that way.



Unknown said...

Great article!

Would it be correct to summarize that you propose that the entire concept of a data warehouse to be a performance hack to manage large datasets. And as database vendors and DBAs become better at managing large datasets, the need for a specialized data warehouse goes away?
I just got past the idea that a star/snowflake schema is largely a performance hack, this makes me a bit dizzy.

If we can abandon our data warehouses, I see great practical value. In my experience I've never seen a warehouse get fully implemented. There are always pieces missing, the data that never got imported, the projects that never get funding. The users invariably end up going back to the source system some of the time anyway.
If one of the reasons for a warehouse is simply to offload the source system, and it only offloads part of the work, would all that money now be better spent just beefing up the source system?

And a point on the "eggs in one basket" It is now straightforward, though still a bit expensive, to build a single logical database with no single point of failure. Using clustering technologies, like Oracle RAC, and replication technologies, like Oracle DataGuard, a single logical database can be many eggs in many baskets, all handled transparently at the physical level by the database software and the DBAs.

Neil Hepburn said...

Thanks for the kind words.
There was a time where data management was as simple as I've described it. Somehow along the way, we've been conditioned to believe that the data warehouse is ideal. It's not. It's just a stop-gap solution.

You're also right about the "eggs in one basket" being a bit of an illusion. Physically this can be circumvented using thinks like RAC and DataGuard with a distributed SAN as you pointed out. But this has yet to become second nature for most architects, and we tend to design systems with only the short-term in mind. Ironically, this is exactly what happened to a client of mine, and I'll explain that in my next blog, which should be out within the next few days.