Saturday, September 19, 2009

What Does This Mean? Cost effective Metadata Management

It's been a long while since I've updated this blog. I've been very busy with a number of other projects. But with the wife and kids away this weekend I've found some time to get some ideas out that I've been sitting on for the past few weeks.

This blog post is on Metadata. If you're not sure what Metadata is, you can read an older post I published a couple years ago here. My goal is to go over the challenges of retrieving metadata, how they can be overcome, and how to prioritize one's efforts.

I have yet to walk into an organization that has any kind of Metadata Management Strategy in place. I occassionally hear about organizations - usually very large and mature - that do have a handle on their Metadata, but I myself have yet to walk into an organization where even 25% of the data elements are documented. Even most data warehouses lack good Metadata. The problem with Metadata is that it's not seen as being on any critical path, and his hence below the "priority threshold" and never gets done. Yes, some IT manager knows it should get done, but there's always something more important to do, no matter what's going on.

Building a business case to source metadata is nearly impossible, since it's uses are almost always unexpected. I recently read an article from one of those big research organizations explaining how to build a business case for Master Data Management. The article was mainly predicated on improving data quality for the purpose of optimizing Direct Mail campaigns (read: junk mail), with the goal of reducing the number of wasted mail-outs. I took a look at the companies interviewed for the article and as expected they're mainly just big software vendors. My experience with Metadata is that it is most useful when you least expect it, and most companies end up scrambling to figure out what their data means, or worse, suffer the consequences.

What to do? Well, there's two problems here really: First, how do I ensure I'm properly capturing Metadata on a go-forward basis; Second, how do I retroactively retrieve Metadata for legacy systems?

Addressing the first question I will describe a realistic approach which can be applied to organizations of any size (including a 1 person company), and which do NOT incur additional costs of development, or require the outlay for new systems and staff. This approach is in part based on a distilled version of the ISO 11179 standard for Metadata Registries. I'm a big fan of the ISO 11179 standard since it gets to the essence of the problem, and is not bogged down in tools and technologies. It allows for a wide degree of interpretation which can be as robust or simple as your needs require. Most importantly, it helps us formulation the important questions we should be asking about the data.

At the heart of ISO 11179 is the Data Element. For most organizations this is just a column in a table, but it can an XML attribute, an HTML meta tag, or a field in a VSAM file. The ISO 11179 approach breaks down data elements into what is referred to as a "Data Element Concept", and a "Value Domain" (VD). The Data Element Concept (DEC) describes the semantics of the data, which should be expressed in "plain English" terms. The DEC is not concerned with the nitty gritty representation of the information, just what it means to a layperson. For example, I might have a data element which contains a patient's recorded temperature, taken when they entered the hospital. So the DEC could be written as "patient's recorded temperature at time of hospital triage".

However, this is only part of the picture. How this data is encoded and reprented is also critical to our understanding - primarly to support data interoperability. This is where the Value Domain comes in. When descrbing our VD, we need to ask the following questions:

  1. Are there a discrete set of permitted values (known as an Enumerated Value Domain), or must the VD be described as range of values (known as a Described Value Domain)?
  2. If the VD is an Enumerated Value Domain, what are each of the Value Meanings behind each permitted value? List them all.
  3. If the VD is not Enumerated, but rather "Described", we need to know how the data is encoded and what its valid range of values are. Depending on the data in question, different encodings are possible. I'll give you some common examples: for dates and times, what timezone are we using, what our time measurement unit is (e.g. 24 hours, 12 hour, or POSIX time), and how accurate is the time? For monetary values, we should know what currency are we transacting in? For physical measures, what is our dimensionality, unit of measure, accuracy, and upper and lower limits? For text based fields, what characters or patterns are allowed or disallowed?
In our main example, the VD could be described as "temperature measured in celsius, integer values only".

It should also be recognized that a DEC and VD both derive from the same Conceptual Domain (CD). In our example, the Conceptual Domain would simply be "temperature".

Putting our DEC and VD together we get the following definition:

Patient's recorded temperature taken at time of hospital triage, measured in celsius, rounded to the nearest degree (integer values only)

We now have a reasonably comprehensive data definition which can be used to not only explain what the data means to the layperson, but it can also be used as tool to measure data quality.

Another thing you may notice about this definition is the order in which I've explained its meaning, and how I've added "rounded to the nearest degree" as an added measure of clarity. It's always important to start with the DEC, followed by the VD. In other words, start simple, and elaborate later. One of the challenges of formulating data definitions is finding the right balance between simplicity and accuracy. Too simple, and you don't always capture the essence of the data element, or its distinguishing features. To precise, and you confuse people. By choosing how you order your words, you can get the best of both worlds. You should be able to read the first clause and think "okay I get it". If you're doing more detailed analysis you can look more carefully at the second clause.

Is that it then? Is that all you need to capture? No, ideally you should be maintaining a Metadata Registry, which can contain all sorts of other details, such as data derivation information, and even semantic classification schemes (e.g. a hierarchy such as a phylogenetic tree). However, this requires having full time data stewards and heavy amounts of data governance. Most companies simply can't afford that. Furthermore, many companies are adopting JAD or Agile approaches to systems development, and tend to rely on generalist developers to create and maintain data models.

So if there's no centralized registry, where to store the Metadata? I believe the right approach is through tight-coupling of the Metadata the data schema whenever possible. After all, the schema definition constitutes metadata unto itself. This is especially true for relational databases, which form the cornerstone of most IT shops these days. In particular, every major RDBMS (including Microsoft Access) has a comment field for every column, and table. Furthermore, many RDBMSs like MySQL and Oracle, even allow you to query to this Metadata as just another table allowing for both search and reporting against company Metadata. You might say, the Metadata registry was there all along! That said, practically no developers will go out of their way to put data definitions in there. But if you show them how little effort is required, and what can be pulled out of it, and most importantly, how to go about formulating data definitions, then you would be amazed at how quickly you can start building a complete Metadata Registry, basically for free.

Similarly, XSD documents contain and tags which can be used to capture data definitions for data stored in XML. As with an RDBMS, the definitions can be thrown in there as the data is being modeled in XML. Unfortunately, what I've found is that because XML is primarily a data interchange format (as opposed to a data store), the beneficiary of the Metadata is someone else. Nevertheless, it's still important to capture these Metadata since you may get a call from an external party inquiring into the meaning of the data, so there are still selfish reasons for capturing it in the schema document.

As for other data stores where there is no room for documentation (e.g. VSAM files), there are different approaches you can take. One is to put the Metadata into a simple Excel spreadsheet, or Wiki, or Cobol Copybook. The question you need to ask yourself is: "Where will people most likely look for the Metadata?"

Taking things a step further, some vendors- notably IBM and Microsoft - have integrated their data management tools (i.e. RDBMS, ETL tool, and reporting/BI server) so that data lineage can be determined automatically. This makes it easy to determine where a data element on report or cube came from. Knowing this lineage greatly reduces your time for impact analysis when making a change to any system. In fact, I believe this is the only feasible way of addressing a "where is" search requirement for data elements.

Okay, I've answered the first big question on how to capture Metadata on a go-forward basis. Now to address the second question, which is how to capture Metadata from legacy systems.

The first step is to profile the data. The basic steps for data profiling involve:
  1. Obtaining a representative sample of the entire data set for the data element in question.
  2. Obtaining a representative sample of the recently produced data for the data element in question. This is important since the source systems may have changed over time, and so the definition may have changed over time. Depending on what your requirements are, you may only need a data definition for the most current data
  3. Look for outliers in both the entire and recent data set. This is often referred to as boundary analysis. It can either turn up the limits of the value domain, or possibly data quality issues. Either way you probably want to know this if you're enquiring into the meaning of the data.
The second step is to "follow the bread crumbs" to see where the data originated from. This is detective work so there is no doubt there is an art to this step as it may be difficult to obtain and analyze source code. I once saw a situation where one of the data loaders had no source code at all and was completely a blackbox (eventually this had to be rewritten for obvious reasons). The approach I normally take is to "triangulate" around the data, looking for all the clues I can. For example, I'll look at system documentation and business requirement documents. I'll also look at downstream applications to see how they're using the data. Often data is replicated, so sniffing around there can help too.

After data profiling and analyzing the data lineage, you may be in a position to formulate a data definition. At this point I recommend taking your best stab at writing the data definition. You now have an unconfirmed data definition. With this data definition in hand you now need to hunt down at least one of the following persons to confirm or clarify it:
  1. Those that are responsible for producing the data (e.g. customer service rep)
  2. Those that are responsible for consuming the data (e.g. floor manager)
  3. Those that are considered an authority on the data (e.g. business unit owner)
WARNING WARNING DANGER DANGER most people don't have time to answer this question, and e-mail communication won't always work here. It's much better to get the person on the phone, or even better, show up to their desk with coffee and donuts. Treat people with respect, and they will respect you.

One last thing I'd like to finish on. I mentioned earlier that Metadata is used when you least expect it. Well there's at least one unexpected situation which occurs predictably (huh!?!): Database archiving. I recently attended a talk by Jack Olsen, one of the world's leaders in database archiving. Jack pointed out that it is necessary to archive Metadata with data should we hope to interpret it correctly, long after the supporting systems have been retired (makes sense if you think about it). Legal archiving requirements are pushing retention dates farther and farther out. If data has been archived, but cannot be clearly interpreted, it is effectively lost (think missing Rosetta Stone). And if the data is lost, you automatically lose in court.

That's it for this post. In my next post I'd like to delve into data modeling.
Shameless plug: I'm currently looking for data management work. If you know of anyone looking for a data architect or other data/information expertise, please contact me at: (+1-416-315-5514).