Saturday, September 19, 2009

What Does This Mean? Cost effective Metadata Management

It's been a long while since I've updated this blog. I've been very busy with a number of other projects. But with the wife and kids away this weekend I've found some time to get some ideas out that I've been sitting on for the past few weeks.

This blog post is on Metadata. If you're not sure what Metadata is, you can read an older post I published a couple years ago here. My goal is to go over the challenges of retrieving metadata, how they can be overcome, and how to prioritize one's efforts.

I have yet to walk into an organization that has any kind of Metadata Management Strategy in place. I occassionally hear about organizations - usually very large and mature - that do have a handle on their Metadata, but I myself have yet to walk into an organization where even 25% of the data elements are documented. Even most data warehouses lack good Metadata. The problem with Metadata is that it's not seen as being on any critical path, and his hence below the "priority threshold" and never gets done. Yes, some IT manager knows it should get done, but there's always something more important to do, no matter what's going on.

Building a business case to source metadata is nearly impossible, since it's uses are almost always unexpected. I recently read an article from one of those big research organizations explaining how to build a business case for Master Data Management. The article was mainly predicated on improving data quality for the purpose of optimizing Direct Mail campaigns (read: junk mail), with the goal of reducing the number of wasted mail-outs. I took a look at the companies interviewed for the article and as expected they're mainly just big software vendors. My experience with Metadata is that it is most useful when you least expect it, and most companies end up scrambling to figure out what their data means, or worse, suffer the consequences.

What to do? Well, there's two problems here really: First, how do I ensure I'm properly capturing Metadata on a go-forward basis; Second, how do I retroactively retrieve Metadata for legacy systems?

Addressing the first question I will describe a realistic approach which can be applied to organizations of any size (including a 1 person company), and which do NOT incur additional costs of development, or require the outlay for new systems and staff. This approach is in part based on a distilled version of the ISO 11179 standard for Metadata Registries. I'm a big fan of the ISO 11179 standard since it gets to the essence of the problem, and is not bogged down in tools and technologies. It allows for a wide degree of interpretation which can be as robust or simple as your needs require. Most importantly, it helps us formulation the important questions we should be asking about the data.

At the heart of ISO 11179 is the Data Element. For most organizations this is just a column in a table, but it can an XML attribute, an HTML meta tag, or a field in a VSAM file. The ISO 11179 approach breaks down data elements into what is referred to as a "Data Element Concept", and a "Value Domain" (VD). The Data Element Concept (DEC) describes the semantics of the data, which should be expressed in "plain English" terms. The DEC is not concerned with the nitty gritty representation of the information, just what it means to a layperson. For example, I might have a data element which contains a patient's recorded temperature, taken when they entered the hospital. So the DEC could be written as "patient's recorded temperature at time of hospital triage".

However, this is only part of the picture. How this data is encoded and reprented is also critical to our understanding - primarly to support data interoperability. This is where the Value Domain comes in. When descrbing our VD, we need to ask the following questions:

  1. Are there a discrete set of permitted values (known as an Enumerated Value Domain), or must the VD be described as range of values (known as a Described Value Domain)?
  2. If the VD is an Enumerated Value Domain, what are each of the Value Meanings behind each permitted value? List them all.
  3. If the VD is not Enumerated, but rather "Described", we need to know how the data is encoded and what its valid range of values are. Depending on the data in question, different encodings are possible. I'll give you some common examples: for dates and times, what timezone are we using, what our time measurement unit is (e.g. 24 hours, 12 hour, or POSIX time), and how accurate is the time? For monetary values, we should know what currency are we transacting in? For physical measures, what is our dimensionality, unit of measure, accuracy, and upper and lower limits? For text based fields, what characters or patterns are allowed or disallowed?
In our main example, the VD could be described as "temperature measured in celsius, integer values only".

It should also be recognized that a DEC and VD both derive from the same Conceptual Domain (CD). In our example, the Conceptual Domain would simply be "temperature".

Putting our DEC and VD together we get the following definition:

Patient's recorded temperature taken at time of hospital triage, measured in celsius, rounded to the nearest degree (integer values only)

We now have a reasonably comprehensive data definition which can be used to not only explain what the data means to the layperson, but it can also be used as tool to measure data quality.

Another thing you may notice about this definition is the order in which I've explained its meaning, and how I've added "rounded to the nearest degree" as an added measure of clarity. It's always important to start with the DEC, followed by the VD. In other words, start simple, and elaborate later. One of the challenges of formulating data definitions is finding the right balance between simplicity and accuracy. Too simple, and you don't always capture the essence of the data element, or its distinguishing features. To precise, and you confuse people. By choosing how you order your words, you can get the best of both worlds. You should be able to read the first clause and think "okay I get it". If you're doing more detailed analysis you can look more carefully at the second clause.

Is that it then? Is that all you need to capture? No, ideally you should be maintaining a Metadata Registry, which can contain all sorts of other details, such as data derivation information, and even semantic classification schemes (e.g. a hierarchy such as a phylogenetic tree). However, this requires having full time data stewards and heavy amounts of data governance. Most companies simply can't afford that. Furthermore, many companies are adopting JAD or Agile approaches to systems development, and tend to rely on generalist developers to create and maintain data models.

So if there's no centralized registry, where to store the Metadata? I believe the right approach is through tight-coupling of the Metadata the data schema whenever possible. After all, the schema definition constitutes metadata unto itself. This is especially true for relational databases, which form the cornerstone of most IT shops these days. In particular, every major RDBMS (including Microsoft Access) has a comment field for every column, and table. Furthermore, many RDBMSs like MySQL and Oracle, even allow you to query to this Metadata as just another table allowing for both search and reporting against company Metadata. You might say, the Metadata registry was there all along! That said, practically no developers will go out of their way to put data definitions in there. But if you show them how little effort is required, and what can be pulled out of it, and most importantly, how to go about formulating data definitions, then you would be amazed at how quickly you can start building a complete Metadata Registry, basically for free.

Similarly, XSD documents contain and tags which can be used to capture data definitions for data stored in XML. As with an RDBMS, the definitions can be thrown in there as the data is being modeled in XML. Unfortunately, what I've found is that because XML is primarily a data interchange format (as opposed to a data store), the beneficiary of the Metadata is someone else. Nevertheless, it's still important to capture these Metadata since you may get a call from an external party inquiring into the meaning of the data, so there are still selfish reasons for capturing it in the schema document.

As for other data stores where there is no room for documentation (e.g. VSAM files), there are different approaches you can take. One is to put the Metadata into a simple Excel spreadsheet, or Wiki, or Cobol Copybook. The question you need to ask yourself is: "Where will people most likely look for the Metadata?"

Taking things a step further, some vendors- notably IBM and Microsoft - have integrated their data management tools (i.e. RDBMS, ETL tool, and reporting/BI server) so that data lineage can be determined automatically. This makes it easy to determine where a data element on report or cube came from. Knowing this lineage greatly reduces your time for impact analysis when making a change to any system. In fact, I believe this is the only feasible way of addressing a "where is" search requirement for data elements.

Okay, I've answered the first big question on how to capture Metadata on a go-forward basis. Now to address the second question, which is how to capture Metadata from legacy systems.

The first step is to profile the data. The basic steps for data profiling involve:
  1. Obtaining a representative sample of the entire data set for the data element in question.
  2. Obtaining a representative sample of the recently produced data for the data element in question. This is important since the source systems may have changed over time, and so the definition may have changed over time. Depending on what your requirements are, you may only need a data definition for the most current data
  3. Look for outliers in both the entire and recent data set. This is often referred to as boundary analysis. It can either turn up the limits of the value domain, or possibly data quality issues. Either way you probably want to know this if you're enquiring into the meaning of the data.
The second step is to "follow the bread crumbs" to see where the data originated from. This is detective work so there is no doubt there is an art to this step as it may be difficult to obtain and analyze source code. I once saw a situation where one of the data loaders had no source code at all and was completely a blackbox (eventually this had to be rewritten for obvious reasons). The approach I normally take is to "triangulate" around the data, looking for all the clues I can. For example, I'll look at system documentation and business requirement documents. I'll also look at downstream applications to see how they're using the data. Often data is replicated, so sniffing around there can help too.

After data profiling and analyzing the data lineage, you may be in a position to formulate a data definition. At this point I recommend taking your best stab at writing the data definition. You now have an unconfirmed data definition. With this data definition in hand you now need to hunt down at least one of the following persons to confirm or clarify it:
  1. Those that are responsible for producing the data (e.g. customer service rep)
  2. Those that are responsible for consuming the data (e.g. floor manager)
  3. Those that are considered an authority on the data (e.g. business unit owner)
WARNING WARNING DANGER DANGER most people don't have time to answer this question, and e-mail communication won't always work here. It's much better to get the person on the phone, or even better, show up to their desk with coffee and donuts. Treat people with respect, and they will respect you.

One last thing I'd like to finish on. I mentioned earlier that Metadata is used when you least expect it. Well there's at least one unexpected situation which occurs predictably (huh!?!): Database archiving. I recently attended a talk by Jack Olsen, one of the world's leaders in database archiving. Jack pointed out that it is necessary to archive Metadata with data should we hope to interpret it correctly, long after the supporting systems have been retired (makes sense if you think about it). Legal archiving requirements are pushing retention dates farther and farther out. If data has been archived, but cannot be clearly interpreted, it is effectively lost (think missing Rosetta Stone). And if the data is lost, you automatically lose in court.

That's it for this post. In my next post I'd like to delve into data modeling.
---
Shameless plug: I'm currently looking for data management work. If you know of anyone looking for a data architect or other data/information expertise, please contact me at: neil@hepburndata.com (+1-416-315-5514).

Thursday, June 05, 2008

Creative Destruction: Column based Data Warehouses. The next generation of data warehouse technologies has arrived.

I first heard about column-based data warehouses in the Feb. 25th print edition of Information Week. The article interviewed Michael Stonebraker (Stonebraker was one of the original architects of the relational database, and released the first low cost RDBMSs under the company Ingres) , and has fully embraced this new architecture which takes a fundamentally different approach to storing and retrieving structured tabular data.

It's hard to exaggerate the impact column-based database design is likely to have on the DBMS industry. You may have recalled the hype around object-oriented databases back in the 90s, but column-oriented databases are qualitatively different for the following two reasons:

  1. There is a pent-up demand for OLAP databases to reduce storage demands, and improve query performance.
  2. Column-based architecture is primarily a back-end change, and generally does not affect application or data architecture. It's a bit like going from a 32-bit OS to a 64-bit OS, but with significantly greater performance returns.

But what is the difference between traditional record-based DBMSs and column-based DBMSs? The difference is quite simple to understand, and according to Stonebraker can yield 50-fold performance increases over traditional record-based RDBMSs. While I have yet to test a column-based database myself, it's not hard to understand why such dramatic performance increases can be had. The reason is thus: Traditional databases (like the ones you're probably using now) treat records as a basic element of storage and retrieval. It's not possible to access an attribute of a record, without first retrieving the entire record. On the flip side, column-based architectures align data to the column (a.k.a. field or attribute). As such, when querying, only the data that is required to be analyzed, is retrieved. Furthermore, because data from a column perspective tends to be similar, column-based warehouses can also enjoy significant compression advantages. Many people [incorrectly] believe that compression always impedes access times. However, more often than not, compression can actually speed up retrieval times since there is less data to be moved from the data store through the bus.

This may seem a little academic, so let me illustrate with an example. Imagine I was a bank, and had a master customer table which stores the most pertinent attributes for each customer. For this example, lets say this table contains 100 columns. If I want to produce a simple report showing all customer accounts that have an excess of $10,000 in their savings account, along with their actual balance. Using current architectures, my query would retrieve all 100 columns for each customer that matched those search criteria, and then format the results to only show me the account number and balance. Let's say that there are 1,000,000 customers that match those criteria. As such, I'm effectively copying 100 columns for those million customer records into temporary storage (usually memory), so as to display only two columns (account number and account balance). A column-based data warehouse would instead take a different approach, and only retrieve the two columns which are required, ignoring the other 98 columns right from the get-go. In other words, I'm moving one fiftieth (1/50th) of the data. Furthermore, because account balances tend to be within a small range, and are compressed as a single unit, I am moving even less data through my bus. The end result is a dramatic performance increase. We're talking about the difference between a query executing in under an hour versus the same query taking more than a day to complete. With such huge discrepancies in performance, there are real business competitive advantages to be had here... for now.

I decided to take a quick look at the actual vendors selling these new column-based. I've summarized my key findings for six major vendors:
  1. Calpont's CNX Data Warehouse Platform is a drop-in solution for Oracle. Namely, the solution can be deployed as a standalone DBMS or as an optimization/acceleration layer into an existing DBMS environment. Currently Oracle is supported, with planned support for DB2.
  2. Infobright's Brighthouse boasts superior compression. Namely, Infobright claims it can achieve a 40:1 compression ratio. Contrast this with Oracle 11g, which uses record based storage, advertises a 2:1 compression ratio (although I've heard that depending on the data, it can actually achieve between a 3:1 and 4:1 ratio in tests). Furthermore, the company claims it leverages MySQL "making seamless use of the mature connectors, tools, and resources associated with this widely deployed open source database". From this it sounds like Brighthouse is effectively a forked version of MySQL. Also worth noting for my local readers is the fact that the business is based here in Toronto.
  3. ParAccel's Analytic Database boasts a Shared-nothing, MPP (massively parallel processing) architecture. MPP architectures are normally associated with data warehouse appliances, such as Netezza, but there's no reason why an appliance solution is required, so it's nice to finally see a vendor selling this technology.
  4. Sand Technology's SAND/DNA Analytics solution claims that no indexing or specialized schemas are required, and that this is a unique feature. I'm not sure about that last claim, but they are certainly emphasizing ease-of-use as a major selling feature.
  5. Sybase also has a columnar DB: Sybase IQ. Apart from being a recognized and stable vendor, Sybase IQ boasts the first petabyte benchmarks. Interestingly, their whitepaper discusses column-based encryption, which I've never seen before. As Sybase points out, this is ideal for data aggregators with mult-client services. However, I would say that Sybase's brand is probably their biggest advantage for the PHB crowd.
  6. Vertica is Michael Stonebraker's company. I particularly like their marketing materials as they provide real benchmarks on their web site here: http://www.vertica.com/benchmarks. They also emphasize that their technology runs on "green" grids of off-the-shelf servers, and they even have a hosted "cloud" solution. Personally, I'm a big fan of hosted solutions, and after all the catastrophes I've witnessed, I would argue that they're lower risk. But there are those that prefer to drive than fly since it gives them a sense of control over the situation, statistics be damned.
It will be interesting to see what shakes out over time. I suspect that every DB vendor is currently working on a column-based solution of their own, so waiting is certainly an option.

Monday, June 02, 2008

On a lighter note, how do we change the culture of software development

I just noticed this ad on the back of a recent print edition of Information Week. I'd seen these ads before (mainly on CNET's web site), but not in print form in a magazine geared towards data management professionals.

The ad campaign looks like it's targeting 12 year old boys, but I suppose it must be targeting the 12 year old boy within us all... or geeky software developers. But before I start shooting fish out of a barrel, I thought I share with you a very strange subtext to this mini-narrative. If you look closely into the crowd scene there is in fact one character who stands out. This of course is the busty bird woman I suppose our hero is fighting to impress. In case you missed it, here's the zoomed in version below.
If anyone has any theories as to why this particular bird woman was chosen (maybe she is in fact concerned about the cow-man in the magic bubble) - I'd really love to know!

But for me, what this ad is is a reminder of how haphazard IT decisions tend to be, and how many still view IT as a crafts-based practice. It's a bit like those sugar cereal commercials that would run during Saturday morning cartoons, as the parental pressure points are well understood.

Most developers tend to be intelligent, analytical, and creative types. If they're lucky, they will work for a dynamic software company that embraces their talents in all phases of product management and design. However, most are not so lucky and end up working for bureacratic IT departments with similar expectations. The developers role here, is to merely take requirements for a System Design Specification, code them in a development environment, and do a little unit testing to ensure the requirements are satisfied. That's all folks.

Yet so many developers I meet want to take on business analysis (well, the fun parts), project management (the fun parts), human factors (the fun parts). Sometimes these guys even want to take on data modeling (but usually treat the DB like a bit bucket for their in-memory data structures). These guys will also tell you about some new gizmo or technology which is on the cusp of solving all our problems, or will introduce us to the modern age. Of course all that Change Management, training, data interoperability, service management, risk management, quality management, and all that other stuff is just pointless busywork that gets in the way of true innovation.

But I actually relate to these guys and totally know where they're coming from. When they show up for their first week on the job, it's all sunshine and lollilops with so much optimism and hope in the air. I try my best to foster this passion and excitement, but also explain that large organizations tend to have role-based cultures, and don't always appreciate talented and knowledgable individuals. I describe career paths which include business analysts, project management, and application architecture. But it's never as exciting or free as what they have in mind.

I personally remember a time not so long ago where systems were being developed on unstable resource constrained platforms with low level languages (e.g. Windows 95 and C++). System stability was a major issue, and talented developers who could analyze a core dump file, or who knew the inner works of memory management, or who could create helper systems to better recover from instabilities were invaluable. To this day I would reckon there is still value to these talents. However, most people can't really describe these talents, so their importance is greatly diminished now that Windows is a stable OS, and that software development has been highly abstracted.

But these killer developers were more than just killer debuggers, they were full on rennaisance men and women, capable of tracing a low level _if_ statement all the way back up to a business rule, and even suggesting new business processes to accomodate hardware limitations (this still happens). These were the only people who had complete line-of-site to all aspects of the business, and these were the people who held the true power.

My point is, with no new problems for these people to solve, they now must bring the business back down to their level of tools and technologies. So, I reckon that it is perhaps a cultural issue we must solve first, before we can tackle the obvious problems at hand. In the meantime, the vendors will be selling more sugar cereal. Can't get enough of those Sugar Smacks!

Tuesday, May 20, 2008

Unstructured Data and the Hunt for the Elusive Customer Service KPI / Looking for Work

The culture of data and analytics is starting to take foot among the general population. Books like "Competing on Analytics" and "Supercrunchers" herald a new era of business whereby every little decision is vetted through intense fact-based scrutiny.

Well, it would seem to be that way. The truth is, most processes within most businesses (even the businesses discussed in the aforementioned books) are managed using crude or imprecise measures. The classic example for me is the new IT project. Such projects routinely introduce new business processes, data elements, and business rules. The success or failure of such projects is typically measured based on whether or not the project delivered on time and on budget. The operational ramifications are rarely considered. However, interestingly post-implementation costs [which account for at least 80% of the overall project's costs] are rarely measured. Most people would say that the main reason for this is that it would require periodic follow-up assessments (i.e. "busywork"), and that the original project team has long disbanded. I partially agree with this, but I would argue that there is a bigger issue here.

People (especially those in senior positions) are loathe to be measured. Ask a VP, director, or manager if she likes the idea of measuring her team's performance and she would say "yes!". Ask the same person if she would like to be measured, and you'll get a long winded answer as to why her performance can't be gauged using numeric measures, and should be based on a "360 review" with all manner of testimony and exhibits. Indeed, today's performance reviews are are more like going on trial, rather than a quantifiable measure of performance. It's okay to subject others to KPIs - just not me!

None of this should come as any surprise. But what is interesting is that the new generation of knowledge workers wants to be measured by KPIs, and they want better KPIs to be measured against. How do I know this? I know this because I routinely talk to people who are on the front line of customer service, and I ask them as many questions as I possibly can. Here is what I've learned folks: The generation that grew up with the Internet (usually under the age of 28), and has been asked to work with a computer and a telephone quickly realizes that their job is being measured by a few simple KPIs, and those KPIs can be quickly gamed. This is a generation that looks for inefficiencies on eBay, that has developed methods for finding free music and videos, that can quickly fact check for discrepancies when BS is suspected. These people are playing massive multi-player on-line games, are on every major social network, and are relentlessly logical and efficient when working with rules-based systems (i.e. companies).

It should come as no surprise that the new generation of customer services reps, and other front-line knowledge workers are quick to find the path of least resistance when approaching their job. This is their comfort zone.

The bad news is that the current KPIs that measure these people, are crude and blunt instruments. Taking customer service as an example. There are two basic KPIs which gets used: The first, "Average Handle Time", or the time it takes to get the customer off the phone. The second, "Number of Call-backs", or the degree to which the customer had his issue "resolved". That's basically it. Of course, most companies perform random audits to keep reps on their toes. While this "boogeyman" style of management keeps the train on its rails, it hardly provides any goals to aspire for. Even a callback can made be for any reason under the sun, and it may even be a satisfied customer calling back to spend more money (this contradiction between reality and metrics is often referred to a the "99 foot man paradox"). As for "Average Handle Time"? Well, I once heard a story about a bunch of call centre reps who had a nice scam going where they would simply hang-up on every incoming call (effectively deflecting the caller to other CSRs). Until they were caught, they were being held up as model CSRs for their lightening efficiency.

So how does one actually measure customer service so that someone "gaming" the KPI is forced to provide competent customer service at a reasonable cost? The answer depends a great deal on the company's missions. However, with today's technology there are a lot of options available, especially given that it's now standard to record each and every inbound call. Furthermore, the latest voice recognition software does an impressive job of recognizing the majority of words and phrases. Focusing on this data set alone, we can start pulling out some interesting KPIs. We would first need to convert these unstructured data to structured data sets. This is probably our most difficult task (on a that note, Bill Inmon's "Tapping into Unstructured Data" is one of the better books on this subject). Once we have a handle on our CSR/customer conversation data, we can start mining for certain terms that would indicate satisfaction or dissatisfaction. I'm aware that it's even possible to capture a customer's mood and emotion through vocal tone analysis. In theory, you should even be able to separate out the CSR's tone and words from the customer's tone and words, for even more fine grained analyses.

I am not saying that it would be easy to establish a KPI to objectively measure customer satisfaction. Rather, I am saying that it's not hard to improve upon our current KPIs.

But what's the point? First off, customer service is generally pretty bad these days. This may even have something to do with CSRs gaming the existing KPIs, especially "Average Handle Time". It also has to do with the complexity of services being offered these days, and the natural frustration that goes along with an excess of business rules that we can't possibly comprehended (there is a greater problem here, that I don't have time to get into in this post). My point in all of this is to say that we are headed towards a culture of analytics whether we like it or not, and if our KPIs have flaws in them, then they will be manipulated against us by our customers and employees. The world we live in is complex and nuanced, and one of our best tools at managing this is through simplification through the use of comprehensive indexes. We just need to get better at designing our KPIs, and ensure they provide us with maximum goal congruence.

---

On a completely unrelated note, I have just rolled off a big project, and am looking for new work. My area of expertise is in data management and enterprise architecture, but I'm also very nuts-and-bolts, and enjoy doing everything from: application support; software development; data modeling; requirements gathering; system sourcing and selection; process architecture; change management; external data procurement and aligntment; data warehousing & BI; metadata management; data governance policy; IT strategy; marketing analytics; and pretty much anything else technology or information related you can think of.

I'm based out of Toronto, but am willing to travel and work anywhere in the world where there's interesting work. I am incorporated, and prefer contract work, but would also consider full time work if there's a good fit.

If you know of anything that you think I might be interested in, feel free to contact me at: neil@hepburndata.com

Thursday, April 10, 2008

David Letterman can teach us a thing or two about BI

I've been working a lot these days with data visualizations and presentation reports. I must admit that I've learned a thing or two about how people approach data, both from the IT side and the business side. However, after looking at dozens and dozens of data visualizations and executive reports, I have realized that there are effectively two kinds of reports that you can present to an executive, and we should approach and understand them accordingly.

The first kind of report is what I describe as an "entertaining report". These reports rely heavily on data visualizations, and while they can and should convey information. Their primary purpose is to grab your attention (i.e. entertain you), over and above driving decisions. Since human beings are instinctively visual beasts, we have a soft spot for these types of reports. We rarely know what to do with the information we see in these fancy presentations, but we love it all the same as it speaks to us emotionally. The old saying "seeing is believing" is as true now as it has ever been.

The second kind of report is what I would describe as "decision driving". These reports tend to be bland ordered lists, with numbers. However, these reports are not only the most important in driving decisions, but due to their abstract nature (and lack of understanding of the decision-makers predicament) are very difficult to get right the first time. In fact, these reports tend to be an after-thought since we tend to occupy our imagination with the more wonderous data visualizations, and would rather avoid trying to understand the messy world that the decision making manager has to live in. In fact, I'm sure I've even seen some IT folk sneer at the decision makers for not appreciating their glorious art. I've probably sneered myself at one time.

Going one step further, if we ask ourselves how decisions typically get enacted in business, and look at how people take on decisions, we can see that there is s desire to streamline decision making. Managers are expected as part of their role to make decisions on a regular basis. However, because new decisions represent risk, this in turn leads to stress. So, if we can help managers make better decisions without increasing their stress, this is what we should strive for.

I believe that the top 10 (or bottom 10) list is an excellent framework for streamlining decision making. In fact it's so popular, that it is this tool we use to manage our own lives. I maintain my own to-do lists each day. If I need to go grocery shopping, I always have a shopping list in hand. If I need to get my personal spending down, I take a look at my biggest expenses and attack those in order. In other words, the top 10 list provides a framework for grouping decisions together, and therefore making each subsequent decision easier to tackle. Furthermore, since we already know that list items get easier as we go down the list, the entire set of decisions seems less daunting since we can get into a groove and track our progress.

But let's do a thought experiment to give you a better idea of where I'm coming from. Let's say you were the mayor of Toronto, and you had pledged to reduce crime. You might think to first get a grasp of where all the crime is happening. You hire a consultant to explain this to you. The consultant comes back one month later with an impressive heat-map of the City of Toronto showing in excruciating detail where all the crime hotspots are. As the city mayor you recognize all the neighbourhoods, and probably aren't too surprised by what you see. However, you will feel wiser seeing this map as you can now visualize where the crime is taking place (well you might think you can visualize it). Great! Now it's decision time. You need to make some hard spending decisions as to where you want to allocate social spending programs, improve community safety, and boost law enforcement. Is this map sufficient for you to sign into budget these decisions? Perhaps I as mayor could request more heat-maps showing different types of crime like homicide or grand larceny? Maybe an animated time-seriesed map showing the spread of crime might better help? Do you feel confident allocating millions of dollars based on moving blotches on a map, even knowing that those blotches are confiding the truth? Probably not.

I suspect at this point you will want to start generating good old fashioned lists. You might want to see: Top ten neighbourhoods, as ranked by a blended crime index. Or maybe, top ten neighbourhoods, as ranked by velocity of increase in crime over the past 4 years.

There is no shortage of these top 10 lists you could produce, but the whole time you're dealing with unamiguously ranked neighbourhoods, supported by hard numbers. As a compromise, I might say that you could add a simple bar chart visualization to help make some numeric comparisons a bit easier. Either way, you will need to boil things down to a list of some sort, since you will need to verbally articulate the decision you made. What sounds better: "I have allocated an increase spending to: Jane & Finch; Rexdale; and Regent Park, as they currently have the highest indexed crime per capita for the past 5 years standing, according to Statistics Canada". Or, would you rather say: "If you could have seen the map I saw, you would know to allocate funding to Jane & Finch; Rexdale; and Regent Park". Yes, if you're lucky, you might get to hold up the map, but then you would be forced to explain its legend. And if the three neighbourhoods are visually similar to a few other neighbourhoods from a heat-map persepective, then you might be in the awkward situation of squinting your eyes and saying "Well, in my opinion, this blotch looks slightly larger than that blotch". However, if there was a clear winner, then maybe the map wouldn't be so bad? But if there was a clear winner you could state that more clearly in verbal terms.

Show me a data visualization, and I'll show you a top 10 list which does a better job at driving decisions.

However, with all that said, you might be led to believe there are some exceptions to what I am saying. For example, experienced meteorologists are able to make reasonably accurate predictions by visually studying animated satellite imagry of weather patterns. Touche! But I'm not sure if I would even categorize this as BI, since the data never got beyond the video stage into a "fact-based" database. The same goes for military intelligence studying satellite imagry. Once again, the photos are being analyzed as-is for enemy presence.

What I am saying is that the name of the game is to figure out what the most ideal top 10 (actually top 5 might be better due to limits on our capacitative memory) lists are to drive decisions, and you will have saved everyone time, and make managers lives so much easier. However, the hard part is getting into the head of the decision maker. If you cannot understand what the decision maker is confronted with, you will just be throwing darts at a board. Who knows, maybe you'll get lucky.

Thank you David Letterman. You know us so well!

Saturday, December 01, 2007

The Secret... to Getting Things Done... in Data Management, for Dummies and Idiots... in Three Simple Steps

I've decided to hop on the dumbing down bandwagon for once. But instead of just offering your typically mind-numbing advice, and providing a bunch of pointless examples to prove my point. I'm going to try to address the skeptics head-on.

So what is this "secret", and these "three simple steps" you might ask? I will get right to the point, and then substantiate my arguments for the skeptics. That said, as with anything there are always "exceptions that prove the rule", and if you can point those out (and I encourage you to do so), then I'll try to address those arguments as well. Without further ado, here is "the secret".

  1. Keep one, and only one copy of your data in an enterprise class RDBMS.
  2. Normalize your entities to third normal form, or Boyce-Codd normal form.
  3. Maintain operational data definitions for each data element, ensuring that the definitions. are understood to mean the same thing by all stakeholders.
That's it. Simple right? Honestly speaking, if you can pull off what I've just described you will have set the foundation for perfect or near-perfect data quality (which should be the goal of all data management professionals). Well, not so fast. This is actually a lot harder than it looks, even for a green field application, let alone for legacy systems. I will point out some of the pitfalls and some of the solutions to those gotchas, starting from easiest to hardest.

Problem #1: Keeping only a single copy of the data has the following two major drawbacks:
  1. Data warehouses need to be kept separate from operational data stores, for performance reasons, storage reasons, structural reasons, and auditability reasons.
  2. Putting all your "eggs in once basket" is inherently risky, and leads to a single point of failure.
These are both valid points, but I will argue that the technology has more or less arrived that invalidates (or is on its way to invalidating) this argument. Generally speaking, we should be striving to let our RDBMS do the data management work, and not our IT professionals. The second something has to be managed by a human being is the second that you can expect quality issues. All these tools you see, such as ETL, data discovery, data cleansing, etc. are basically a way of managing the copying of data. If you can get rid of the copying (or put it in a black box), then you can eliminate the management.

Getting back to reality: First, it is possible with all major RDBMSs (i.e. DB2, Oracle, and SQL Server) to limit a user's resource, and protect another user's resources. In other words, it's possible to provide guarantees of service to the operational systems by denying resources to the reporting systems. If resources becomes a regular problem it is now relatively straightforward [through clustering technology] to swap in more memory and processing power. Second, as far as storage is concerned: Storage Area Network [SAN] and Information Lifecycle Management [ILM], and table partitioning technologies have got to the point where, with proper configuration, tables can grow indefinitely without impacting operational performance. In fact, in Oracle 11g, you can now configure child tables to partition automatically based on parent table partition configuration (before 11g you would have to explictitly manage the child table partitions). While we're on the topic of 11g, it is also now possible to Flashback forever. Namely it's possible to see historical snapshots of the database as far back as you want to go. That said, I don't believe this feature has been optimized for timeseries reporting. So, this may be the last technology holdout to support the case of building a data warehouse. Nevertheless, this is something I'm sure this is a problem the RDBMS vendors can surely solve. Third, as far as structure is concerned, it is not necessary, and in fact can be counter-productive to de-normalize data into star/snowflake schemas. I'll address this in my rebuttal to "problem #2". Fourth, auditability is indeed a big part of the ETL process. But we wouldn't need audit data if we didn't need to move it around to begin with, so it's a moot point. If you want to audit access to information, this can easily be done with current trace tools built into all the major RDBMSs.

On the problem of "putting your eggs in one basket". While I can see that this has an appeal that can be appreciated by the business, it's really a facile argument. Simply put, the more databases you need to manage, and the more varieties of databases you have to manage, results in less depth of knowledge for each individual database, and weaker regimes for each database. If you had all your data in a single database, you could then spend more of your time understanding that database, and implement the following:
  • Routine catastrope drills (i.e. server re-builds, and restores)
  • Geographically distributed failover servers
  • Rolling upgrades (I know this is possible with Oracle)
  • Improved brown-out support
In theory, if you had a lean and mean DB regimes, and well trained DBAs you could recover from a fire in your data centre in a matter of hours, and experience nothing more than a brown-out. However, in reality, DBAs and other resources are spread too thin to be able to do this, and in times of this kind of catastrophe you would be lucky to recover your data and have your applications back on-line within a week.

Problem #2: Keeping data in 3NF or Boyce-Codd Normal form is difficult to work with, and performs poorly in a data warehouse.

These are also valid points. First, I will agree that people employed in writing BI reports or creating cubes (or whatever reporting tool they're using), prefer to work with data in a star or snowflake schema. Organizing data in this format makes it easy to separate qualitative information from quantitative information, and lowers the ramp-up time for new hires. However, whenever you transform data you immediately jeopardize its quality, and introduce the possibility of error. But more importantly you eliminate information and options in the process. The beauty of normalized data, is that it allows you to pivot from any angle your heart desires. Furthermore, things like key constraints constitute information unto themselves, and by eliminating that structure it's harder to make strong assertions about the data itself. For example, an order # may have a unique constraint in its original table, but when it's mashed up into a customer dimension, it's no longer explicitly defined that that field must be unique (unless someone had the presence of mind to define it that way, but once again human error creeps in). Second, as far as performance is concerned, I agree that generally speaking de-normalized data will respond quicker. However, it is possible to define partions, cluster indexes and logical indexes, so as to achieve the exact same "Big O" Order. Therefore the differences in performance are linear, and can be solved by adding more CPUs and memory, and thus overall scalability is not affected.

Problem #3: It is impossible to get people to agree on one version of "the truth".

While all my other arguments were focused on technology, this problem is clearly in the human realm, and I will submit that this problem with never ever be definitively solved. Ultimately it comes down to politics and human will. One person's definition of "customer" may make them look good, but make another person look bad. Perception is reality, and the very way in which we concoct definitions impacts the way the business itself is run. A good friend of mine [Jonathan Ezer] posited that there is no such thing as "The Truth", and it is inherently subjective. As an example, he showed how the former planet Pluto, is no longer a planet through a voting process. Yes, there is still a hunk of ice/rock floating up there in outer space, but that's not what the average person holds to be important. Fortunately, only scientists were allowed to vote, but if astrologers were invited, they would surely have rained on the scientists parade. Okay, so this sounds a little philosophical and airy fairy. But consider this: According to Peter Aiken, the IRS took 13 years to arrive at a single definition of the word "child", including 2.5 years to move through legislation. That said, what advice I can offer to break through these kinds of logjams, and which I've also heard from other experienced practitioners is to leverage definitions that have already been vetted by external organizations. Going back to the IRS example, we now have a thorough operational definition that provides us with a good toehold. Personally, I try to look for ISO or (since I'm Canadian) StatCan definitions of terms, if the word is not clear. Another great place is Wikipedia or its successor Citizendium. Apart from that, all I can suggest is to be as diplomatic and patient as possible.

Am I wrong? Am I crazy? Or does data management need to be complicated. Well, I guess the software vendors like it that way.

-Neil.

Tuesday, August 14, 2007

Social Networks for the Enterprise

I've been meaning to put this entry down for weeks now, but in each passing week it seems I come to new realizations about social networks and the direction they are going. It's been about two weeks since I've had any new thoughts on the matter, and for now my thinking as settled down.

But before getting into my thoughts on social networks and how I feel they might be applied to the enterprise, I'd like to share a little story with you. A couple weeks ago I sat down for lunch at the local food-court down the road. An old neighbour of mine - Murray - who I hadn't seen in a while, saw me and sat down at my table. Within no time Murray (who is a senior manager for a large Canadian insurance company) started to vent about Facebook. Facebook as you may have heard, is the hottest social networking site out there, and in particular is most popular in Toronto with over 700,000 Torontonians, and growing. Murray brought up the fact that Facebook may have a huge valuation of over $10 billion, but is in fact costing companies significantly more than that in lost productivity. While I'd heard about companies (and even the Ontario government) banning access to Facebook, it really dawned on me as to what a time sucker this thing is. I was tempted to bring up the fact that maybe there were other issues surrounding employee managment, and would argue that this is the overarching problem. However, since that conversation I've read that about half of all corporations now ban access to Facebook.

With all this hype surrounding social networks it's inevitable that people are writing about how they might be applied to the Enterprise. So far, I haven't read anything that has really impressed me (I've just read this rah rah stuff on ZDNet about how it helped a bunch of people solve problems faster, but didn't explain how or why). So I thought I might build on a post I wrote in the past on Wikis in the enterprise, and relate this to social networks, but also [and more importantly] discuss the differences in social dynamics between consumer social networks and enterprise social networks. Actually, I would say that this is really what has not been discussed in enough detail by the general media: What is the nature of our relations in enterprises versus the nature of our relations with friends and family.


On that note let me first discuss what I am seeing happening on MySpace and Facebook. I am assuming you the reader have a somewhat cursory knowledge of these services. If you have no idea what these services are used for or why they are important, I suggest you do some research on these sites, and then return to this blog entry.

Moving on, MySpace was the first major social networking site to capture the popular imagination. There were sites before this (Six Degrees comes to mind), but MySpace became a hit for the following reasons:

  1. It was targeted to, and appealed directly to teenagers: Probably the most socially self-concious group that exists. This has changed somewhat due to concerns over sexual predators.
  2. It was completely open. Anyone could see anybody else's MySpace page without having to register or login.
  3. It was a platform unto itself. While building a MySpace page is mainly a "fill in the blanks" exercise, users are invited to add "widgets" and from there "pimp out" their MySpace page. Of course this spawned a widget cottage industry, which in turn makes the MySpace platform more desirable to its users.
Facebook on the otherhand succeeded mainly for these reasons:
  1. It was targeted to college students: Probably the second most self-concious group that exists.
  2. It was not so open, which made it more conducive to posting private details. Namely, users could feel more confident about posting personal photographs because the security measures were in place to ensure that only certain people ("friends") could see photos and other personal details.
  3. It included a news feed which allows you to see all your friends updates. This is perhaps the most powerful [and originally controversial] feature of Facebook, and the one feature that has generated the most stickiness.
  4. It also is a platform like MySpace. However, it's an arguably more powerful platform since the underlying capabilities of Facebook are more robust, especially the security.
Both MySpace and Facebook have their strengths and weaknesses, but in their curent state, I don't see either of them as being an ideal fit for the Enterprise. The other social network I didn't mention is LinkedIn, which I won't get into, but I also feel that this too is ill suited for the Enterprise.

In order to understand why this is, you have to ask yourself the following question: What is the nature of relationships in the Enterprise, and how are they different from relationships in mainstream social networks?

In a nutshell, I would say that the answer is thus: Normal social networks are typically defined by relationships that both parties willingly desire. In the Enterprise, relationships tend to be dictated by the Enterprise, and are thus of a utilitarian nature. While it's nice to work with people we're friends with, this isn't always going to be the case. However, if we can make these utilitarian relationships friendlier this is always a good thing. So, I would propose that any social network for the Enterprise be cognizant of the nature of the relationship, but also facilitate warmer connections. In that regard, divulging a certain amount of personal information is not a bad thing, but should be managed with a greater amount of astuteness, which should take its queue from what is normally discussed by the watercooler, or what would normally be posted on a cubicle wall (e.g. photos of spouse and kids).

So, fleshing out the nature of relationships I will describe the following types of Enterprise relationships that I am aware of, and how I think information should be managed with respect to these relationships. Since this is my fantasy, I will assume that the enterprise has Wikified itself, in the manner that I described a few months ago. The basic types of relationships, the types of information that should be accessible through those relationships, and how those information should be secured are as follows:
First: Operational versus Project relationships.
Operational relationships are ongoing and indefinite. Pretty much everyone has a relationship with HR. Furthermore, everyone has a relationship with the helpdesk. In some cases you will want to maintain personal relationships (HR is a good candidate here), and in other cases you will want to maintain a relationship with a proxy (the helpdesk is a good candidate here). For operational relationships, you don't really need to have very much insight into the documents and data that these entities rely on, and for the most part you would just have their contact details, and a few other things that these persons may make public. For example, HR could post (or link to) information about Insurance companies, company dress code policy, benefits, etc. But you don't need to know which IS/IT systems they are using to manage your benefits as this does not concern you.

Project relationships on the otherhand are temporal, but tend to require greater line-of-site to knowledge. So, as opposed to our relationship with HR, where we don't really need to know HOW they do their job, in the case of project relationships this line-of-sight is usually a good thing. As an example: If I'm on a project working with a team of: software developers; quality assurance professionals; business analysts; systems analysts; and project managers. It would save me time to be able to see what they're up to. Speaking in concrete terms, this means I would like to see what documents they are using (i.e. what Wikis they are frequently accessing), what databases they are connecting to, and generally what they are up to (I am also thinking of a Twitter RSS feed here - btw, Twitter on its own has the potential to be an extremely powerful management tool). I don't need to know everything about their life, just everything that they are doing NOW.

Second: Hierarchical relationships. The Enterprise always has been and always will be hierarchical in nature. Yes, we all aspire to the "flat" egalitarian Enterprise, but frankly speaking this simply goes against human nature. It will never happen as long as hairless apes run the world. However, we can manage it. Namely, it should be simple for our Enterprise social network to apply the correct security and privacy settings based on hierarchy. I should be able to see everything my subordinate is up to, but not so much as what my boss is up to. It's all right if she can see what I'm up to though. It sounds a bit cynical, but this is no different from out Enterprises curently function. As for peers, this gets a bit tricky and should be handled on a case by case basis.

Third: Intra-department versus inter-department versus inter Enterprise relationships. I don't have any hard and fast answers here, but this is definitely something that should be considered. Things of course get tricky when you're talking about relationships that go outside the Enterprise. Typically these would be vendor relationships, and typically from a knowledge management perspective, this is by default a one-way street. Namely, the Enterprise should collect information about the vendor, but be hesitant to share anything with them through a social network. While I can see a time where social networks cross over Enterprises, it's hard to say if this is a priority. To be sure, there is operational information that is routinely shared. For example, a shipping company would keep its customers informed about the status of packages and deliveries. But this hardly has anything to do with insight about any particular person within either Enterprise.

This is just a sketch of how a social network could be implemented in an Enterprise, and if nothing else some of the things that an Enterprise architect should be mindful of. At the very least, it should break down barriers of communication, and although I mentioned earlier that hierarchies are inevitable, they also can get in the way and ironically dehumanize us. As a simple start, if more large organizations had personal pages where people could add a few photos, say a few things about themselves, and post links to frequently referenced documents, it would make the place a lot less intimidating, and much easier for new hires or new transfers.
---
On a completely different note, I was contacted by Michael who writes the Data Governance Blog: http://datagovernanceblog.com/
Michael had some nice things to say about my own blog and I am very flattered and appreciative of that. Although I don't blog that often, one of my main goals has been to connect with likeminded individuals out there who see Enterprise Architecture and Data Management as a professional discipline, and who also understand that the discussion is not about Microsoft or Cognos or IBM or any other silver bullet manufacturer, but is something much more nuanced and sophisticated than any of these tech vendors would portray the problem as being. So, I am more than happy to hear from any others out there who see things the same way I do, or enjoys healthy debate.

For my next blog entry, I've got something a bit more abstract - but with real consequences planned. I am partially basing it on a lecture by my good friend Jonathan Ezer.