Monday, December 18, 2006

Rationale for Data Governance

When you think of Canada’s infrastructure you probably think of physical things like highways; railways; copper, fibre optic, and coaxial communications lines; gas lines, electrical lines, and power generation; water filtration and sewage treatment; solid waste management; airports; water ports; post offices; hospitals; schools; and so on. These things are all definitely part of our nation’s infrastructure and constitute a major part of the bedrock upon which people and businesses build. Take away these things, and you’re forced to reinvent them on your own. The current situations in Afghanistan and Iraq are a constant reminder of how critical infrastructure is for a nation’s stability and prosperity, and how hard life is without it.

In addition to the infrastructures I’ve mentioned above, there are also other, less tangible national infrastructures that are arguably even more important: constitutional; legal; political; monetary policy and banking; social services & welfare; crime prevention, safety, & policing; and so on. Most people are also aware of these infrastructures (although they may not refer to them as infrastructure), and understand their intrinsic value without question.

Yet there is another piece of our infrastructure that I haven’t mentioned. This piece of our infrastructure allows our government to make informed fact-based decisions at all levels of government. Likewise, this piece of our national infrastructure also allows both local and international businesses to make informed decisions in the way of marketing and execution. What I’m talking about here is our Census and other national Surveys collected and managed by Statistics Canada.

“The total estimated cost of the 2006 census is $567 million spread over 7 years, employing more than 25,000 full and part-time census workers” (source: Wikipedia). Most people’s familiarity with the Census is through newspaper tidbits or soundbites like “Did you know that our population is blah blah blah now” or “Did you know that ethnic minorities constitute blah blah blah percent of our population now”. So for the average person the Census may seem like a giant project to fuel watercooler talk. In fact, not only do the government and most large businesses rely on the Census to make fact-based decisions, but there is an entire industry built around interpreting and deriving new facts from current and past Census data. Take for example Environics Analytics, which applies the theory of “Geodemography” to derive Market Segments that allow businesses to target segments from the thrifty “Single City Renters” to the more affluent “White Picket Fences” to the ultimate “Cosmopolitan Elite”. Another company, CAP Index, applies the theory of “Social Disorganization” to Census data to predict levels of homicide, burglary, auto-theft, assault, and other forms of crime. There are also companies like MapInfo that will extrapolate Census data to predict what Census data may look like today, or tomorrow. Another company, Global Insight, will summarize or “flatten” Census data for comparison against other countries.

I believe that fact-based decision making is beginning to captivate the popular imagination. Books like Moneyball, which showed how the Oakland A’s went to the top of the MLB simply by making fewer emotional decisions and shifting to a fact-based, data-driven approach, make arguments that even the non-sports buff can understand and appreciate. Another book, “Good to Great”, takes a fact-based approach to dispel myths about the “celebrity CEO” or the “cult of IT” as reasons for sustained corporate success. On the flip-side, Malcolm Gladwell’s Blink may seem to contradict this thinking in that it exemplifies the “intuitive expert”. However, the “intuitive expert” that Gladwell paints is not just anyone. This person has a historically verifiable track record in making decisions pertaining to a [typically narrow] domain. Thus, if we can find an expert who has a proven track record in making well-defined decisions (e.g. identifying forged art, or identifying marriages in decline), it is fair to let that person make intuitive decisions in this domain based on this person’s fact-based track record. If such an expert cannot be found, we must search elsewhere for our facts.

Intuition is essential in business, but the organization that knows where and when to apply intuition, and can justify that application through facts, will have a better chance of success than the organization that applies it in an unchecked manner. In other words, don’t get your “corporate heroes” who have had a generally successful track record mixed up with your “intuitive experts”. I’ve gone off on a bit of a tangent here, but the point I’m trying to make is that high quality data is essential to both the government and free enterprise for making strategic decisions.

Okay, so the Census is important. So what?!? Now, I can get to the point I really want to make: The Census is only possible through strict data governance.

Not surprisingly, most people can’t be bothered relinquishing private information, nor can they be bothered filling out forms that take over an hour to complete. So, back in 1918, the Canadian government passed the Statistics Act, which as it stands today makes NOT filling out the Census, or providing false information, punishable by up to a $1,000 fine and 6 months in prison (source: Statistics Canada). With such legislation in place, it is in one’s own best interests to complete the Census.

On the other side of the Census coin, consider the Census data for Aboriginal communities (aka First Nations communities). By Statistics Canada’s own admission, there is a dearth of quality Census data. In fact, this lack of quality data has been cited specifically as a major obstacle to economic improvement in First Nations communities. As a consequence, the government of Canada is actively addressing this issue through the First Nations Fiscal and Statistical Management Act (FSMA), which was introduced in the House of Commons on December 2nd, 2002. The FSMA calls for a dedicated body referred to as the First Nations Statistical Institute (aka First Nations Statistics [FSN]). This body will be integrated with Statistics Canada, and over time I hope to see some improvement in the quality of Census data coming out of First Nations Reserves. As a matter of fact, I was browsing through the business section of The Globe and Mail a few weeks ago and noticed four advertised positions for statisticians to work in the FSN, so clearly things are moving forward, although probably slowly given it’s a government initiative.

I’ve met some [cynical] people who think that this improved data will make no difference to the lives of those on the reserve. While I can’t say that the improved data will make First Nations lives better, I can say with certainty that the lack of quality data is an obstacle to improvement. Furthermore, the Census can highlight communities that are working, which could serve as reference models. The Census data can also definitively show which communities need the most assistance. The data can free up many political logjams, as the conversation is allowed to move from highly politicized discussions of the towns themselves (which are typically ego driven) to more rational discussions about which Critical Success Factors (CSFs) or Key Performance Indicators (KPIs) to look for, and how those CSFs and KPIs are defined. Since the data itself is currently dubious, those discussions quickly get derailed. But when the data does become reliable, we’ll have the bedrock for such a dialog.

The corporate world is no different. The vast majority of businesses have poor or non-existent data governance. When arguments flare up over the meaning of data, someone is quick to point out flaws in the quality of the data itself, and will use this as a “hostage to fortune” to push their own agenda.

Okay, so I’ve talked about the [Canadian] government, and briefly mentioned corporations’ similar woes, but what about the wild-west that is the internet?

Wikipedia, as many of you know by now, is the world’s largest “open source” encyclopaedia. Wikipedia is one of the best examples of the new emerging collaborative culture that’s sweeping the internet (aka Web 2.0). I personally love Wikipedia. While I am not an expert in everything, for those things that I do feel I know more than the average person about, I’m astounded by the amount of detail, and even accuracy, of information on Wikipedia. However, I am also aware of its flaws. Some of the flaws are obvious, but believe it or not, the biggest issue with Wikipedia is rarely mentioned.

The most highly publicised flaw Wikipedia has been fingered for is its difficulty maintaining the so-called “Neutral Point of View”, or NPOV as Wikipedia calls it for short. Highly politicized subjects tend to create a “tug-of-war” of facts and opinion. At the time of this writing, Wikipedia’s list of articles with disputed neutrality contained 4,994 English articles out of a total of 1,528,201 articles. So, approximately 0.3% of all articles are “disputed”. It’s actually quite interesting to see what they are. Topics range from the usual suspects, “Abortion” and “Israeli Palestinian violence”, to the more staid “SQL” or “Audi S3”. You can read about the NPOV here:

You can also get a list of all articles with questionable neutrality here:

However, to get a better idea of the actual likelihood of someone clicking on an NPOV article, I took a look at Wikipedia’s own statistics. Namely, the 100 most requested pages. Of the 100 most requested pages for the month of December 2006, 88 of these pages are articles (the other 12 are utility pages, like the search page). I took a look at each article and noticed that no fewer than 14 are “padlocked” meaning that they can only be edited by veteran Wikipedia users – no newbies allowed. So, of the 88 most requested articles for Dec. 2006, 16% are strictly governed. I suspect that all of these articles at one time had NPOV issues, since padlocking an article is the best way of stopping the tug-of-war.
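For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the two percentages quoted above from the counts given in this post. The helper name `share` and the variable names are my own, chosen purely for illustration; the numbers are the December 2006 figures cited in the text, not fresh data.

```python
def share(part: int, total: int) -> float:
    """Return `part` as a percentage of `total`."""
    return 100.0 * part / total


# Counts as quoted in this post (December 2006 snapshot).
npov_disputed = 4_994          # English articles flagged for neutrality
english_articles = 1_528_201   # total English articles
padlocked = 14                 # padlocked articles among the most requested
top_articles = 100 - 12        # 100 most requested pages minus 12 utility pages

npov_pct = share(npov_disputed, english_articles)
padlocked_pct = share(padlocked, top_articles)

print(f"Articles with disputed neutrality: {npov_pct:.2f}%")
print(f"Padlocked among most requested articles: {padlocked_pct:.1f}%")
```

The gap between the two figures is the interesting part: disputed articles are a tiny slice of Wikipedia as a whole, while governed articles are a much larger slice of what people actually read.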

So the Neutral Point of View is surely a serious issue, but because Wikipedia has flagged these articles, there is a cue for the reader to approach them with a more critical mindset.

Perhaps a bigger problem is Wikipedia vandals: people who make phoney edits just for the heck of it. Stephen Colbert famously encouraged his loyal viewers to do this, which they did. This not only resulted in Colbert being banned from Wikipedia, but also in a lock-down on his own page, ensuring that only vetted users could modify its contents. Furthermore, Colbert’s stunt was cited as a major reason for starting Citizendium.

Citizendium was started by Wikipedia co-founder Larry Sanger as a more strictly governed version of Wikipedia. I’ve quoted the following line from this CNET article on Citizendium, which I think is very telling of some of the issues faced by Wikipedia:

But unlike Wikipedia, Citizendium will have established volunteer editors and "constables," or administrators who enforce community rules. In essence, the service will observe a separation of church and state, with a representative republic, Sanger said.

Looks like data governance is only getting stronger here. So much for the wild-west. I can’t say I’m really celebrating this, because I realize that Wikipedia wouldn’t be where it is if it had started off as a rigid members-only club.

However, I haven’t yet mentioned the biggest problem that Wikipedia faces. Namely: vocabulary. You can read more about this issue here:
For your convenience, I’ve quoted the most pertinent paragraph:

A common concern is that of our vocabulary, which necessarily expands to deal with professional jargons, but must be readable to more casual users or those to whom English is a second (or less familiar) language. This affects directly who can use, or contribute to, the wikipedia. It's extremely basic. Unless it's dealt with, you aren't able to read this material at all.

Interestingly enough, this particular issue maps almost directly back to the issue of Metadata (or lack thereof) that most corporations are still grappling with (see my previous post on Metadata). Once again, if Wikipedia is to properly address this issue, it must do so through improved governance. Sure, software will hopefully create workflows to ensure that policies are being executed efficiently, but this is not an issue that can be solved by software alone.

Clearly, the quality of data hosted by Wikipedia is not at the “bet your business” level for all articles, and in order for it to get there, more governance, or a complete governance overhaul (such as what Citizendium is doing), is required. In spite of these issues, I would still categorize Wikipedia as a huge success in its own right, and a model that may appear very attractive to a maverick in the corporate world.

However, I should point out one major but non-obvious difference between Wikipedia’s data and corporate data. Wikipedia’s data is essentially “owned” by the same people who go in and physically modify the articles. In other words, the Data Owner is the Data Steward. People are only adding or changing Wikipedia data because they themselves have a vested interest in those data. It is therefore in the Data Owner’s best interest to ensure the data being entered into Wikipedia is as accurate as possible. Furthermore, because all changes are made on a volunteer basis out of self-interest, there is no “Triple Constraint”: no trade-offs are required between cost, quality, and time. Thus, the Data Owner can have [according to her own desires] the highest level of quality maintained at all times. Otherwise, there’s no point in making an edit to the article.

Enterprise data does not share the same luxury as the Wikipedia model. Those who manage the data are usually in IT and are not the same as the Owners of the data, who typically reside in other business units. Therefore, it is important to ensure that whoever makes changes to, or utilizes, data follows strict guidelines so that the Data Owner’s interests are met and quality is maintained. Left to their own motivations, people will take the path of least resistance, which over time will lead to degradations in Data Quality and Data Interoperability, as the “Triple Constraint” will force these issues. Taken further out, this will lead to increased costs, both tangible (e.g. increase in trouble ticket resolution times) and intangible (irate customers).

To sum up, we know that good governance over our data not only makes it an asset, but can even be thought of as an investment. The Canadian Census did not happen by accident, but through a rigorous governance model, with real world penalties such as fines and jail time. As a result, Canada enjoys many economic advantages over countries that do not have the same quality of data. On the other hand, it is possible to have acceptable data quality in weakly governed environments, but those environments really only thrive if they are controlled by their Data Owners, and those owners are not bound by the “Triple Constraint”. However, even in the most utopian of environments, there must still be some level of governance to ensure highly consistent levels of Data Quality and Data Interoperability (shared and understood vocabulary). If you want high levels of Data Quality and Data Interoperability, you cannot get them without creating policies, assigning roles, and implementing procedures. There are no Silver Bullets, and Data Governance is a necessity.
