Monday, December 18, 2006

Rationale for Data Governance

When you think of Canada’s infrastructure, you probably think of physical things: highways; railways; copper, fibre optic, and coaxial communications lines; gas lines, electrical lines, and power generation; water filtration and sewage treatment; solid waste management; airports; water ports; post offices; hospitals; schools; and so on. These things are all definitely part of our nation’s infrastructure and constitute a major part of the bedrock that people and businesses build upon. Take them away, and you’re forced to reinvent them on your own. The current situations in Afghanistan and Iraq are a constant reminder of how critical infrastructure is to a nation’s stability and prosperity, and how hard life is without it.

In addition to the infrastructures I’ve mentioned above, there are other, less tangible national infrastructures that are arguably even more important: constitutional; legal; political; monetary policy and banking; social services and welfare; crime prevention, safety, and policing; and so on. Most people are also aware of these infrastructures (although they may not refer to them as infrastructure) and understand their intrinsic value without question.

Yet there is another piece of our infrastructure that I haven’t mentioned. This piece allows our government to make informed, fact-based decisions at every level. Likewise, it allows both local and international businesses to make informed decisions about marketing and execution. What I’m talking about here is our Census and the other national surveys collected and managed by Statistics Canada.

“The total estimated cost of the 2006 census is $567 million spread over 7 years, employing more than 25,000 full and part-time census workers” (source: Wikipedia). Most people’s familiarity with the Census is through newspaper tidbits or soundbites like “Did you know that our population is blah blah blah now” or “Did you know that ethnic minorities constitute blah blah blah percent of our population now”. So for the average person, the Census may seem like a giant project to fuel watercooler talk. In fact, not only do the government and most large businesses rely on the Census to make fact-based decisions, but there is an entire industry built around interpreting and deriving new facts from current and past Census data. Take for example Environics Analytics, who apply the theory of “Geodemography” to derive market segments that allow businesses to target everyone from the thrifty “Single City Renters” to the more affluent “White Picket Fences” to the ultimate “Cosmopolitan Elite”. Another company, CAP Index, applies the theory of “Social Disorganization” to Census data to predict crime levels for homicide, burglary, auto theft, assault, and other forms of crime. There are also companies like MapInfo that will extrapolate Census data to predict what it may look like today, or tomorrow. Another company, Global Insight, will summarize or “flatten” Census data for comparison against other countries.

I believe that fact-based decision making is beginning to captivate the popular imagination. Books like Moneyball, which showed how the Oakland A’s competed with the richest teams in MLB simply by making fewer emotional decisions and shifting to a fact-based, data-driven approach, make arguments that even the non-sports buff can understand and appreciate. Another book, Good to Great, takes a fact-based approach to dispel myths about the “celebrity CEO” or the “cult of IT” as reasons for sustained corporate success. On the flip side, Malcolm Gladwell’s Blink may seem to contradict this thinking in that it exemplifies the “intuitive expert”. However, the picture of the “intuitive expert” that Gladwell paints is not just anyone: this person has a verifiable historical track record in making decisions pertaining to a [typically narrow] domain. Thus, if we can find an expert with a proven track record in making well-defined decisions (e.g. identifying forged art, or identifying marriages in decline), it is fair to let that person make intuitive decisions in that domain on the strength of that fact-based track record. If such an expert cannot be found, we must search elsewhere for our facts.

Intuition is essential in business, but the organization that knows where and when to apply intuition, and can justify that application through facts, will have a better chance of success than the organization that applies it in an unchecked manner. In other words, don’t get your “corporate heroes” who have had a generally successful track record mixed up with your “intuitive experts”. I’ve gone off on a bit of a tangent here, but the point I’m trying to make is that high-quality data is essential to both government and free enterprise for making strategic decisions.

Okay, so the Census is important. So what?!? Now, I can get to the point I really want to make: The Census is only possible through strict data governance.

Not surprisingly, most people can’t be bothered relinquishing private information, nor can they be bothered filling out forms that take over an hour to complete. So, back in 1918 the Canadian government passed the Statistics Act, which, as it stands today, makes not filling out the Census, or providing false information, punishable by up to a $1,000 fine and 6 months in prison (source: Statistics Canada). With such legislation in place, it is in one’s own best interest to complete the Census.

On the other side of the Census coin, consider Census data for Aboriginal communities (aka First Nations communities). By Statistics Canada’s own admission, there is a dearth of quality Census data for these communities. In fact, this lack of quality data has been cited specifically as a major obstacle to economic improvement in First Nations communities. As a consequence, the government of Canada is actively addressing this issue through the First Nations Fiscal and Statistical Management Act (FSMA), which was introduced in the House of Commons on December 2nd, 2002. The FSMA calls for a dedicated body referred to as the First Nations Statistical Institute (aka First Nations Statistics [FNS]). This body will be integrated with Statistics Canada, and over time I hope to see some improvement in the quality of Census data coming out of First Nations reserves. As a matter of fact, I was browsing through the business section of The Globe and Mail a few weeks ago and noticed four advertised positions for statisticians to work in the FNS, so clearly things are moving forward, although probably slowly given that it’s a government initiative.

I’ve met some [cynical] people who think that this improved data will make no difference to the lives of those on the reserve. While I can’t say that the improved data will make First Nations lives better, I can say with certainty that the lack of quality data is an obstacle to improvement. Furthermore, the Census can highlight communities that are working, which could serve as reference models. The Census data can also definitively show which communities need the most assistance. The data can free up many political logjams, as the conversation is allowed to move from highly politicized discussions of the towns themselves (which are typically ego driven) to more rational discussions about what Critical Success Factors (CSFs) or Key Performance Indicators (KPIs) to look for, and what the definitions of those CSFs and KPIs are. Since the data itself is currently dubious, those discussions quickly get derailed. But when the data does become reliable, we’ll have the bedrock for such a dialogue.

The corporate world is no different. The vast majority of businesses have poor or non-existent data governance. When arguments flare up over the meaning of data, someone is quick to point out flaws in the quality of the data itself, and will use this as a “hostage to fortune” to push their own agenda.

Okay, so I’ve talked about the [Canadian] government, and briefly mentioned corporations’ similar woes, but what about the wild west that is the internet?

Wikipedia, as many of you know by now, is the world’s largest “open source” encyclopaedia. Wikipedia is one of the best examples of the emerging collaborative culture that’s sweeping the internet (aka Web 2.0). I personally love Wikipedia. While I am not an expert in everything, for those things that I do feel I know more about than the average person, I’m astounded by the amount of detail, and even the accuracy, of the information on Wikipedia. However, I am also aware of its flaws. Some of the flaws are obvious, but believe it or not, the biggest issue with Wikipedia is rarely mentioned.

The most highly publicised flaw Wikipedia has been fingered for is its struggle to maintain the so-called “Neutral Point of View”, or NPOV as Wikipedia calls it for short. Highly politicized subjects tend to create a “tug-of-war” of facts and opinion. At the time of this writing, the list contained 4,994 English articles lacking neutrality out of a total of 1,528,201 articles, so approximately 0.3% of all articles are “disputed”. It’s actually quite interesting to see what they are. Topics range from the usual suspects, like “Abortion” and “Israeli-Palestinian violence”, to the more staid “SQL” or “Audi S3”. You can read about the NPOV here:
http://en.wikipedia.org/wiki/Wikipedia:NPOV_dispute

You can also get a list of all articles with questionable neutrality here:
http://en.wikipedia.org/wiki/Special:Whatlinkshere/Wikipedia:NPOV_dispute

However, to get a better idea of the actual likelihood of someone clicking on an NPOV article, I took a look at Wikipedia’s own statistics, namely the 100 most requested pages. Of the 100 most requested pages for the month of December 2006, 88 are articles (the other 12 are utility pages, like the search page). I took a look at each article and noticed that no fewer than 14 are “padlocked”, meaning that they can only be edited by veteran Wikipedia users – no newbies allowed. So, of the 88 most requested articles for Dec. 2006, roughly 16% are strictly governed. I suspect that all of these articles at one time had NPOV issues, since padlocking an article is the best way of stopping the tug-of-war.

So the Neutral Point of View is surely a serious issue, but because Wikipedia has flagged these articles, there is a cue for the reader to approach them with a more critical mindset.
Perhaps a bigger problem is Wikipedia vandals: people who make phoney edits just for the heck of it. Stephen Colbert famously encouraged his loyal viewers to do this, which they did. This not only resulted in Colbert being banned from Wikipedia, but also in a lock-down on his own page, ensuring that only vetted users could modify its contents. Furthermore, Colbert’s stunt was cited as a major reason for starting Citizendium.
Citizendium was started by Wikipedia co-founder Larry Sanger as a more strictly governed version of Wikipedia. I’ve quoted the following line from this CNET article on Citizendium, which I think is very telling of some of the issues faced by Wikipedia:

But unlike Wikipedia, Citizendium will have established volunteer editors and "constables," or administrators who enforce community rules. In essence, the service will observe a separation of church and state, with a representative republic, Sanger said.

Looks like data governance is only getting stronger here. So much for the wild west. I can’t say I’m really celebrating this, because I realize that Wikipedia wouldn’t be where it is if it had started off as a rigid members-only club.

However, I haven’t yet mentioned the biggest problem that Wikipedia faces. Namely: vocabulary. You can read more about this issue here: http://meta.wikimedia.org/wiki/Governance
For your convenience, I’ve quoted the most pertinent paragraph:

A common concern is that of our vocabulary, which necessarily expands to deal with professional jargons, but must be readable to more casual users or those to whom English is a second (or less familiar) language. This affects directly who can use, or contribute to, the wikipedia. It's extremely basic. Unless it's dealt with, you aren't able to read this material at all.

Interestingly enough, this particular issue maps almost directly back to the issue of Metadata (or lack thereof) that most corporations are still grappling with (see my previous post on Metadata). Once again, if Wikipedia is to properly address this issue, it must do so through improved governance. Software will hopefully provide workflows to ensure that policies are executed efficiently, but this is not an issue that can be solved by software alone.

Clearly, the quality of data hosted by Wikipedia is not at the “bet your business” level for all articles, and in order for it to get there, more governance, or a complete governance overhaul (such as what Citizendium is doing), is required. In spite of these issues, I would still categorize Wikipedia as a huge success in its own right, and a model that may appear very attractive to a maverick in the corporate world.

However, I should point out one major but non-obvious difference between Wikipedia’s data and corporate data. Wikipedia’s data is essentially “owned” by the same people who go in and physically modify the articles. In other words, the Data Owner is the Data Steward. People only add or change Wikipedia data because they themselves have a vested interest in those data. It is therefore in the Data Owner’s best interest to ensure the data being entered into Wikipedia is as accurate as possible. Furthermore, because all changes are made on a volunteer basis out of self-interest, there is no “Triple Constraint”; that is, no trade-offs are required between cost, quality, and time. Thus, the Data Owner can have [according to her own desires] the highest level of quality maintained at all times. Otherwise, there’s no point in making an edit to the article.

Enterprise data does not enjoy the same luxury as the Wikipedia model. Those who manage the data are usually in IT and are not the same people as the Owners of the data, who typically reside in other business units. Therefore, it is important to ensure that whoever makes changes to, or utilizes, data follows strict guidelines so that the Data Owner’s interests are met and quality is maintained. Left to their own motivations, people will take the path of least resistance, which over time will lead to degradations in Data Quality and Data Interoperability, as the “Triple Constraint” will force these issues. Taken further out, this leads to increased costs, both tangible (e.g. an increase in trouble-ticket resolution times) and intangible (irate customers).

To sum up, we know that good governance over our data not only makes it an asset, but allows it to be thought of as an investment. The Canadian Census did not happen by accident, but through a rigorous governance model, with real-world penalties such as fines and jail time. As a result, Canada enjoys many economic advantages over countries that do not have the same quality of data. On the other hand, it is possible to have acceptable data quality in weakly governed environments, but those environments really only thrive if they are controlled by their Data Owners, and those owners are not bound by the “Triple Constraint”. However, even in the most utopian of environments, there must still be some level of governance to ensure consistently high levels of Data Quality and Data Interoperability (a shared and understood vocabulary). If you want high levels of Data Quality and Data Interoperability, you cannot achieve them without creating policies, assigning roles, and implementing procedures. There are no Silver Bullets, and Data Governance is a necessity.

Wednesday, December 06, 2006

Metadata Defined

Regardless of whether you work in IT or some other department, if you are part of a large organization or enterprise (and sometimes even if you're not), Metadata is rapidly gaining much-needed attention.


Put simply, metadata is "data about data", or your "data definitions". However, Metadata really becomes valuable when those definitions are standardized across the enterprise and are precise enough to show the nuances between similar but semantically different data elements. For example, a "customer name" may seem similar to an "employee name", but these two data elements are semantically different and are likely not interoperable (if they are, the metadata would make this clear). Metadata also provides context to data so users or consumers of the data can answer basic questions like the following (a small illustrative sketch appears after the list):
1. What information assets do we have?
2. What does the information asset mean?
3. Where is the information asset located?
4. How did it get there?
5. How do I gain access?
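
To make this concrete, here is a minimal sketch (in Python, which I'm using purely for illustration) of the kind of record a data dictionary entry might hold in order to answer those five questions. Every field name and value below is something I've invented for the example, not any particular standard or product:

# A minimal, illustrative metadata record that answers the five questions above.
# The field names and values are assumptions made up for this example.
from dataclasses import dataclass

@dataclass
class DataElementMetadata:
    name: str          # what information asset do we have?
    definition: str    # what does the asset mean?
    location: str      # where is it located?
    lineage: str       # how did it get there?
    access_via: str    # how do I gain access?

customer_name = DataElementMetadata(
    name="customer_name",
    definition="Full legal name of a paying customer, captured at account opening.",
    location="CRM database, table CUSTOMER, column FULL_NAME",
    lineage="Entered by call-centre staff; copied nightly to the warehouse by ETL.",
    access_via="Read access granted by the CRM data steward.",
)

employee_name = DataElementMetadata(
    name="employee_name",
    definition="Full legal name of an employee, as recorded by Human Resources.",
    location="HR system, table EMPLOYEE, column LEGAL_NAME",
    lineage="Entered by HR at time of hiring.",
    access_via="Restricted; requires HR approval.",
)

# The two elements look similar, but the definitions make the semantic
# difference (and the lack of interoperability) explicit.
print(customer_name.definition)
print(employee_name.definition)

Even a toy record like this shows how a precise definition, a location, and a lineage statement head off the "customer name versus employee name" confusion mentioned above.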


Metadata is not a new thing; in fact, it has been around for as long as people have been storing and cataloguing information. The Royal Library of Alexandria (3rd century BCE) had an indexing system overseen by Demetrius of Phaleron (a student of Aristotle's school), which was likely one of the first comprehensive metadata repositories in existence. Fast-forwarding to modern times: when IT systems were first deployed in the 1960s, basic data dictionaries already existed to provide basic definitions of the structured data contained within the enterprise, even before the relational database was invented.


Your organization probably has some amount of metadata floating around. However, comprehensive Information Management Programs that maintain and fully leverage metadata are still rare and are typically only found in large financial institutions and governments (at the state/provincial or federal level). That notwithstanding, successful Information Management Programs that rigorously maintain metadata almost always show huge returns on investment, although those ROI figures are often difficult to nail down and predict. This raises the following questions:
1. Why is Metadata all of a sudden coming to the forefront now, and not before?
2. Why is it that we mainly see good metadata in large financial institutions and governments, and not as frequently in other verticals or in smaller organizations?
These are both excellent questions, and to understand the answers is to understand why you should at least be thinking about metadata as it relates to your own company. The more you understand about the value of metadata as seen through the prism of your own organization, the easier it will be for you to convince others of its inherent value.
To answer the first question, as to why metadata is so important now: within IT, Metadata can be thought of as being closer to the top of Maslow's pyramid than to the bottom. As is commonly known, Maslow hypothesized that lower human needs must be addressed before higher needs. For example, there is no point in worrying about self-esteem if you cannot find food and shelter. However, a cursory glance in any bookstore will reveal that most people in Western society are more concerned about self-esteem than about obtaining basic food and shelter (not that this is not a concern, just not something that directly occupies our thoughts). Thus, the higher needs always existed, but do not come to the forefront of our minds until the lower needs are satisfied. The same goes for Information Technology. Over the past 10 years alone we have seen the following major changes:
1. Workstation stability is significantly better. You probably don't see your computer crashing (i.e. "blue screen of death") as often as you did 10 years ago. Furthermore, applications are now being rolled out as intranet web applications that don't require installation, and thus do not require "house calls" to fix.
2. Server stability is significantly better. Most modern server applications run on a virtual environment (e.g. Java Virtual Machine or Microsoft’s Common Language Runtime). Thus failures for a particular user remain isolated, and rarely affect other users. Furthermore, most modern systems now come standard with failover technology or can be configured to be part of a grid or cluster. It is even possible to patch or upgrade databases while accepting transactions, with zero downtime!
3. System interfaces are more robust and flexible than ever before, and are typically standards based (e.g. ODBC, SOAP, etc.). Furthermore, most modern interfaces are designed to work over the internet which itself is a significantly more reliable network than previous proprietary point-to-point networks.
Therefore, as users of data we spend far less of our time calling the helpdesk about "blue screens of death" and crashed servers, and instead spend more of our time asking questions about the data itself. This in turn means that application support also spends more of its time investigating questions pertaining to the meaning and understanding of data. In other words, IT spends less of its time making systems "work", and more of its time investigating the informational aspects of change.

An added problem now is that practically all organizations face the infamous "spreadmart" issue. Namely, users are extracting data from managed IT systems into desktop Excel spreadsheets or MS Access databases and copying these spreadsheets and local databases throughout the organization without also copying the data definitions behind the data (i.e. the Metadata). This creates a massive "broken telephone" situation.
Metadata provides us with the tools to address these problems. Metadata is also the cornerstone to sound Enterprise Architecture, and allows us to manage complexity and change in a cost effective manner.
To answer the second question, as to why comprehensive Metadata is typically only found in large financial institutions and governments: first off, this is changing, so it would be more accurate to say that financial institutions and governments currently have the most mature metadata management practices. The reasons for this can be stated as follows:
1. These institutions are highly regulated and must be able to produce reports on short notice explaining every detail and provide traceability for the information they store and process.
2. Metadata requires strong governance and policy. Although there are a number of software products that can assist in the discovery of metadata, not to mention a large number of products designed to store metadata (i.e. metadata registries/repositories), Metadata requires sound governance through Data Stewardship to ensure that it is consistently managed. Conway's Law tells us that a system ends up mirroring the organization and process that produced it, and the same holds for Metadata: if we allow Metadata to be managed in an unfettered way, ignoring corporate standards, we will never be able to guarantee consistency across the enterprise. Financial institutions and governments tend to have very mature governance policies in place already and are experts at upholding governance. Furthermore, their cultures are more "command and control" in nature than those of other organizations, so there tends to be more buy-in for governance and a lower risk of dissent. That notwithstanding, it is possible for more nimble organizations to adapt governance to their environment and implement Information Management Programs [IMP] which can gain acceptance and provide a significant ROI while not requiring the full scope of IMPs found in more mature organizations. Furthermore, modern Metadata repositories can automate much of the governance through automatic role assignment and workflow.
3. Metadata management has been too expensive for most organizations to afford. Since Metadata software is not yet mainstream, the know-how to implement an Information Management Program is scarce, and the human capital required for Data Stewardship is considerable, only large organizations with huge IT budgets can realize the economies of scale that Metadata provides. Nevertheless, as awareness increases so will the know-how to deliver Metadata management, and this will be the driving factor in reducing costs - even more so than the drop in Metadata repository software license costs.
Many software vendors - particularly in the data warehousing and Business Intelligence market - already offer integrated Metadata products which provide some value. However, these offerings tend to come up short when attempting to harmonize data definitions across technology domains. For example, Cognos (a popular BI vendor) offers a Metadata repository to manage Metadata for data elements directly used by Cognos. But Cognos falls short of offering true enterprise "where is" search functionality (e.g. "where is all my customer data located?") since most information assets are located in other technology domains (e.g. mainframes or remote databases) that are out of reach of Cognos. In other words, if you're just concerned about Cognos reports, the Cognos Metadata Repository will serve you well, but if you're asking broader questions about the nature and location of information assets that do not touch Cognos, you will quickly hit a brick wall.
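To illustrate what I mean by an enterprise-wide "where is" search, here is a toy sketch in Python. The registry contents, the domain names, and the where_is helper are all hypothetical, invented for this example:

# A sketch of the enterprise-wide "where is" question described above.
# The registry contents and domain names are hypothetical.
from collections import defaultdict

# element name -> list of (technology domain, physical location)
registry = defaultdict(list)
registry["customer_name"] += [
    ("mainframe", "VSAM file CUST.MASTER, field CUSTNM"),
    ("warehouse", "DW.DIM_CUSTOMER.FULL_NAME"),
    ("bi_tool",   "Customer Revenue report, column 'Customer'"),
]
registry["customer_address"] += [
    ("mainframe", "VSAM file CUST.MASTER, field CUSTADDR"),
    ("crm",       "CRM.CUSTOMER.MAILING_ADDRESS"),
]

def where_is(term: str):
    """Return every known location of elements whose name contains `term`."""
    return {
        name: locations
        for name, locations in registry.items()
        if term.lower() in name.lower()
    }

# "Where is all my customer data located?" -- across every domain,
# not just the ones a single BI tool can see.
for name, locations in where_is("customer").items():
    for domain, location in locations:
        print(f"{name}: [{domain}] {location}")

The point is simply that the registry spans every technology domain, which is exactly what a single tool-centric repository cannot do.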
The software industry has developed a number of products to assist you in discovering Metadata, and in some instances even generating Metadata based on analysis of data flow, in particular through detailed analysis of ETL (Extract, Transform, Load) jobs. While these tools can certainly help answer questions regarding Legacy Systems, they are fundamentally just search engines and cannot actually manage the creation and maintenance of Metadata any more than Google can manage the information on the World Wide Web. Examples of these tools (or tools with this functionality) include the following; a toy sketch of the kind of lineage they derive appears after the list:
1. ASG's ASG-Rochade
2. Informatica's SuperGlue
3. Sypherlink's Harvester
4. Metatrieval's Metatrieve
5. Data Advantage Group's MetaCentre
6. IBM's WebSphere MetaStage (which will soon be incorporating Unicorn's Metadata repository)
7. CA's AllFusion Repository for Distributed Systems
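To give a flavour of the lineage these discovery tools derive, here is a toy sketch. Real products analyze actual ETL job definitions; the three job records below are hypothetical stand-ins:

# A toy sketch of lineage derived from ETL job definitions.
# The job records are invented for illustration.
etl_jobs = [
    {"job": "load_staging", "source": "CRM.CUSTOMER",    "target": "STG.CUSTOMER"},
    {"job": "load_dim",     "source": "STG.CUSTOMER",    "target": "DW.DIM_CUSTOMER"},
    {"job": "load_mart",    "source": "DW.DIM_CUSTOMER", "target": "MART.CUSTOMER_SALES"},
]

def upstream(target: str, jobs) -> list:
    """Walk the job graph backwards to list everything feeding `target`."""
    sources = []
    for job in jobs:
        if job["target"] == target:
            sources.append(job["source"])
            sources.extend(upstream(job["source"], jobs))
    return sources

# "How did the data in the sales mart get there?"
print(upstream("MART.CUSTOMER_SALES", etl_jobs))
# -> ['DW.DIM_CUSTOMER', 'STG.CUSTOMER', 'CRM.CUSTOMER']

Notice that this answers "how did it get there" but says nothing about what the data means; that is exactly the gap that still requires governed definitions.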
From a software perspective, what is really desired is a Metadata repository that can classify data in a precise enough way so as to ensure data interoperability. The most tried-and-true classification scheme is ISO 11179, although the OMG MOF classification scheme is gaining acceptance. The main difference between these schemes is that MOF is more generic and can be used to catalogue unstructured data (e.g. documents, images, e-mail, etc.) as well as structured data, whereas ISO 11179 was designed to address the taxonomy of structured data (i.e. data elements). There is also the Dublin Core standard, which is primarily a classification scheme for documents and is the most popular standard on the web for classifying HTML documents. R. Todd Stephens, director of Metadata Services at BellSouth, has in fact used the Dublin Core classification scheme with good success, although there are surely interoperability issues he must still face.
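To show roughly what the ISO 11179 approach buys you, here is a heavily simplified sketch of its decomposition of a data element into an object class, a property, and a value domain. This glosses over most of the standard, and the names and fields are illustrative assumptions only:

# A rough, simplified sketch of the ISO 11179 idea: a data element is named by
# an object class plus a property, and constrained by a value domain.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ValueDomain:
    datatype: str
    permissible_values: Optional[List[str]] = None  # None means unrestricted

@dataclass
class DataElement:
    object_class: str    # the thing being described, e.g. "Customer"
    property_name: str   # the characteristic of interest, e.g. "Country of Residence"
    value_domain: ValueDomain

    def name(self) -> str:
        return f"{self.object_class} {self.property_name}"

country_codes = ValueDomain(datatype="char(2)", permissible_values=["CA", "US", "MX"])
customer_country = DataElement("Customer", "Country of Residence", country_codes)
employee_country = DataElement("Employee", "Country of Residence", country_codes)

# Same property and value domain, different object classes: the values are
# comparable, but the metadata keeps the semantic distinction explicit.
print(customer_country.name(), "->", customer_country.value_domain.datatype)
print(employee_country.name(), "->", employee_country.value_domain.datatype)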

A good repository should also provide business context through a built-in Enterprise Architecture registry. Namely, a way of relating data elements to pertinent business entities. A sampling of these entities might be:
1. Business missions
2. Business users
3. Business calendar cycles
4. Business locations
5. Business rules
6. Business processes
7. Business transactions
8. Etc.
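As a small illustration of what that business context might look like once captured, here is a hypothetical sketch. The entity names and the impact-analysis helper are assumptions made up for the example:

# A hypothetical sketch of business context attached to a data element
# in an Enterprise Architecture registry.
ea_registry = {
    "customer_name": {
        "business_processes": ["Account Opening", "Monthly Billing"],
        "business_rules": ["Must match government-issued ID"],
        "business_locations": ["Toronto call centre", "Online portal"],
        "business_calendar_cycles": ["Month-end billing run"],
    },
}

def impact_of_change(element: str) -> list:
    """List the business processes touched if this data element changes."""
    context = ea_registry.get(element, {})
    return context.get("business_processes", [])

# Impact analysis: which processes would a change to 'customer_name' affect?
print(impact_of_change("customer_name"))

Even in this toy form, impact analysis becomes a lookup rather than an investigation, which is the cost argument made a little further on.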
If you're only interested in an Enterprise Architecture registry (hey, some people are), there are a number of standalone Enterprise Architecture registries which can do this. Most of these registries are centred on managing SOA services. To complicate matters, there are also Configuration Management Databases (CMDBs), which are registries for managing physical IT assets such as servers, network switches, workstations, etc.
I suspect that there will be growing convergence between Metadata Registries, SOA Registries, Enterprise Architecture Registries, and possibly CMDB Registries (there is relatively little overlap between CMDB entities and Enterprise Architecture entities, since computer hardware tends to be opaque as far as the business is concerned). For the time being the following Enterprise Architecture and SOA Registries are available, so you may want to keep an eye on them as they have a lot of overlap with Metadata Registries and may begin to offer Metadata functionality:
1. Troux Technology's Metis
2. IBM's Websphere Registry and Repository
3. BEA's Flashline
4. Infravio's (now Webmethods) X-Registry
5. Mercury Interactive’s (now HP) IT Governance Foundation
Metadata that can be placed within an Enterprise Architecture context helps reduce costs as it shortens the time for impact analysis, as well as shortening employee ramp-up times. Support costs (for investigations) are also greatly reduced.


It is important to note that a structured and normalized repository can provide - for lack of a better term - Introspective Business Intelligence. In the same way that you can apply Business Intelligence tools to derive new facts about what your business does (e.g. by combining customer profile data with product sales data, you may find that the majority of your repeat customers are between 26 and 28 years of age), a well-structured Metadata repository lets you derive or deduce facts about your business itself. For example, by combining IS/IT systems information with business calendar cycle information, you may be able to determine that the most quiescent time of the year is August, and plan maintenance activities accordingly. As another example, you may discover that certain data elements are in higher demand than previously thought, and should therefore be moved to higher-availability systems, from an Information Lifecycle Management [ILM] perspective.
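To make the "introspective" idea concrete, here is a toy example. The figures are invented and the scoring is deliberately naive:

# A toy illustration of "introspective BI": combining system metadata with
# business calendar metadata to pick a maintenance window. Figures are invented.
calendar_cycles = {          # relative business activity per month
    "June": 90, "July": 70, "August": 30, "September": 85,
}
system_usage = {             # reporting requests hitting the warehouse per month
    "June": 1200, "July": 900, "August": 250, "September": 1100,
}

# Score each month by combined business and system activity; pick the quietest.
quietest = min(calendar_cycles, key=lambda m: calendar_cycles[m] + system_usage[m])
print(f"Most quiescent month: {quietest}")   # -> August: plan maintenance here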
A great Metadata Repository will also understand Data Stewardship processes and roles and help automate these processes and manage these roles. Roles can be managed through built-in security, and processes can be managed through configurable workflows.
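As a rough sketch of the kind of workflow automation I have in mind, consider a proposed change to a data element definition moving through a simple approval process. The states, actions, and roles below are assumptions, not any particular product's model:

# A minimal sketch of an automated stewardship workflow: a proposed definition
# change moves through states, and only the assigned data steward may approve it.
ALLOWED = {
    ("draft", "submit"): "pending_review",
    ("pending_review", "approve"): "published",
    ("pending_review", "reject"): "draft",
}

def transition(state: str, action: str, actor_role: str) -> str:
    if action == "approve" and actor_role != "data_steward":
        raise PermissionError("Only the data steward may approve changes.")
    if (state, action) not in ALLOWED:
        raise ValueError(f"Cannot {action!r} from state {state!r}")
    return ALLOWED[(state, action)]

state = "draft"
state = transition(state, "submit", actor_role="analyst")        # -> pending_review
state = transition(state, "approve", actor_role="data_steward")  # -> published
print(state)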
Finally, an excellent Metadata Repository should be extensible and modifiable as it is impossible to predict where the enterprise will go and what business entities will appear in the coming years.
The following Metadata repositories exhibit these desired traits:
1. Whitemarsh's Metabase
2. CA's AllFusion Repository for Distributed Systems
3. Data Foundation's OneData Registry
4. Oracle's Enterprise Metadata Manager

I have spent some time discussing the background and purpose of Metadata, why it is more relevant now than ever before, and how it can be stored and organized. However, I have spent very little time discussing the governance of metadata through Data Stewardship. Data Stewardship is by far the most important aspect of Metadata, and without it you have no consistent way of managing your Metadata. In my next post I will discuss Data Governance.