Saturday, December 01, 2007

The Secret... to Getting Things Done... in Data Management, for Dummies and Idiots... in Three Simple Steps

I've decided to hop on the dumbing-down bandwagon for once. But instead of just offering the typical mind-numbing advice and providing a bunch of pointless examples to prove my point, I'm going to try to address the skeptics head-on.

So what is this "secret", and these "three simple steps" you might ask? I will get right to the point, and then substantiate my arguments for the skeptics. That said, as with anything there are always "exceptions that prove the rule", and if you can point those out (and I encourage you to do so), then I'll try to address those arguments as well. Without further ado, here is "the secret".

  1. Keep one, and only one copy of your data in an enterprise class RDBMS.
  2. Normalize your entities to third normal form, or Boyce-Codd normal form.
  3. Maintain operational data definitions for each data element, ensuring that the definitions are understood to mean the same thing by all stakeholders.
That's it. Simple, right? Honestly speaking, if you can pull off what I've just described, you will have set the foundation for perfect or near-perfect data quality (which should be the goal of all data management professionals). Well, not so fast. This is actually a lot harder than it looks, even for a greenfield application, let alone for legacy systems. I will point out some of the pitfalls and some of the solutions to those gotchas, starting from easiest to hardest.
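To make the three steps a little more concrete, here is a toy sketch of what they look like in practice. I'm using Python's built-in sqlite3 module purely as a stand-in for an enterprise-class RDBMS, and every table, column, and definition below is a hypothetical example of mine, not a prescription:

import sqlite3

# Step 1: one, and only one, copy of the data, held in an RDBMS (sqlite3 stands in here).
db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")

# Step 2: entities normalized to third normal form. Every fact lives in exactly one
# place; orders reference customers and products rather than repeating their details.
db.executescript("""
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL
);
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    unit_price   NUMERIC NOT NULL
);
CREATE TABLE customer_order (
    order_id     INTEGER PRIMARY KEY,
    order_number TEXT NOT NULL UNIQUE,
    customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date   TEXT NOT NULL
);
CREATE TABLE order_line (
    order_id   INTEGER NOT NULL REFERENCES customer_order(order_id),
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    quantity   INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
);

-- Step 3: an operational definition for every data element, agreed to by all stakeholders.
CREATE TABLE data_element_definition (
    entity_name  TEXT NOT NULL,
    element_name TEXT NOT NULL,
    definition   TEXT NOT NULL,
    agreed_by    TEXT NOT NULL,
    PRIMARY KEY (entity_name, element_name)
);
""")

db.execute(
    "INSERT INTO data_element_definition VALUES (?, ?, ?, ?)",
    ("customer_order", "order_number",
     "The human-readable identifier printed on the customer's invoice; unique per order.",
     "Sales; Finance; IT"),
)
db.commit()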

Problem #1: Keeping only a single copy of the data has the following two major drawbacks:
  1. Data warehouses need to be kept separate from operational data stores, for performance reasons, storage reasons, structural reasons, and auditability reasons.
  2. Putting all your "eggs in one basket" is inherently risky, and leads to a single point of failure.
These are both valid points, but I will argue that the technology has more or less arrived that invalidates (or is on its way to invalidating) this argument. Generally speaking, we should be striving to let our RDBMS do the data management work, and not our IT professionals. The second something has to be managed by a human being is the second that you can expect quality issues. All these tools you see, such as ETL, data discovery, data cleansing, etc. are basically a way of managing the copying of data. If you can get rid of the copying (or put it in a black box), then you can eliminate the management.

Getting back to reality: First, it is possible with all major RDBMSs (i.e. DB2, Oracle, and SQL Server) to limit one user's resources and protect another's. In other words, it's possible to provide guarantees of service to the operational systems by denying resources to the reporting systems. If resource contention becomes a regular problem, it is now relatively straightforward [through clustering technology] to swap in more memory and processing power.

Second, as far as storage is concerned: Storage Area Network [SAN], Information Lifecycle Management [ILM], and table partitioning technologies have gotten to the point where, with proper configuration, tables can grow indefinitely without impacting operational performance. In fact, in Oracle 11g, you can now configure child tables to partition automatically based on the parent table's partition configuration (before 11g you would have to explicitly manage the child table partitions). While we're on the topic of 11g, it is also now possible to Flashback forever. Namely, it's possible to see historical snapshots of the database as far back as you want to go. That said, I don't believe this feature has been optimized for time-series reporting, so this may be the last technology holdout supporting the case for building a separate data warehouse. Nevertheless, this is a problem I'm sure the RDBMS vendors can solve.

Third, as far as structure is concerned, it is not necessary, and in fact can be counter-productive, to de-normalize data into star/snowflake schemas. I'll address this in my rebuttal to "problem #2".

Fourth, auditability is indeed a big part of the ETL process. But we wouldn't need audit data if we didn't need to move the data around to begin with, so it's a moot point. If you want to audit access to information, this can easily be done with the trace tools built into all the major RDBMSs.
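For the curious, here's a rough sketch of the 11g pieces I just mentioned (per-user resource limits, reference partitioning of a child table, and Flashback Data Archive), driven from Python through the cx_Oracle driver. The credentials, profile, tablespace, and table names are hypothetical placeholders, and the DDL is illustrative rather than a tuned production configuration:

import cx_Oracle  # assumes an Oracle 11g database and the cx_Oracle driver are available

conn = cx_Oracle.connect("demo_user", "demo_password", "dbhost/orcl")  # hypothetical connection
cur = conn.cursor()

ddl_statements = [
    # Resource limits: cap what the reporting account can consume (CPU_PER_CALL is in
    # hundredths of a second), protecting the operational workload.
    """CREATE PROFILE reporting_profile LIMIT
           SESSIONS_PER_USER 10
           CPU_PER_CALL 30000""",
    "ALTER USER report_user PROFILE reporting_profile",

    # Parent table, range-partitioned by order date.
    """CREATE TABLE orders (
           order_id   NUMBER PRIMARY KEY,
           order_date DATE NOT NULL
       ) PARTITION BY RANGE (order_date) (
           PARTITION p2006 VALUES LESS THAN (DATE '2007-01-01'),
           PARTITION p2007 VALUES LESS THAN (DATE '2008-01-01')
       )""",

    # 11g reference partitioning: the child inherits the parent's partitioning through
    # the foreign key, so its partitions no longer need to be managed by hand.
    """CREATE TABLE order_lines (
           line_id  NUMBER PRIMARY KEY,
           order_id NUMBER NOT NULL,
           CONSTRAINT fk_lines_orders FOREIGN KEY (order_id) REFERENCES orders (order_id)
       ) PARTITION BY REFERENCE (fk_lines_orders)""",

    # 11g Flashback Data Archive ("flashback forever"): keep row history for years,
    # assuming a tablespace called fda_ts already exists.
    "CREATE FLASHBACK ARCHIVE orders_fda TABLESPACE fda_ts RETENTION 10 YEAR",
    "ALTER TABLE orders FLASHBACK ARCHIVE orders_fda",
]

for ddl in ddl_statements:
    cur.execute(ddl)

# Historical query against the archived history.
cur.execute("SELECT COUNT(*) FROM orders AS OF TIMESTAMP "
            "TO_TIMESTAMP('2007-06-01', 'YYYY-MM-DD')")
print(cur.fetchone())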

On the problem of "putting your eggs in one basket": While I can see that this argument has an appeal the business can appreciate, it's really a facile one. Simply put, the more databases you need to manage, and the more varieties of databases you have to manage, the less depth of knowledge you have for each individual database, and the weaker your regime for each database. If you had all your data in a single database, you could then spend more of your time understanding that database, and implement the following:
  • Routine catastrophe drills (i.e. server re-builds and restores)
  • Geographically distributed failover servers
  • Rolling upgrades (I know this is possible with Oracle)
  • Improved brown-out support
In theory, if you had a lean and mean DB regime and well-trained DBAs, you could recover from a fire in your data centre in a matter of hours and experience nothing more than a brown-out. However, in reality, DBAs and other resources are spread too thin to be able to do this, and in a catastrophe of this kind you would be lucky to recover your data and have your applications back on-line within a week.

Problem #2: Data kept in 3NF or Boyce-Codd normal form is difficult to work with, and performs poorly in a data warehouse.

These are also valid points. First, I will agree that people employed in writing BI reports or creating cubes (or whatever reporting tool they're using) prefer to work with data in a star or snowflake schema. Organizing data in this format makes it easy to separate qualitative information from quantitative information, and lowers the ramp-up time for new hires. However, whenever you transform data you immediately jeopardize its quality and introduce the possibility of error. But more importantly, you eliminate information and options in the process. The beauty of normalized data is that it allows you to pivot from any angle your heart desires. Furthermore, things like key constraints constitute information unto themselves, and by eliminating that structure it's harder to make strong assertions about the data itself. For example, an order # may have a unique constraint in its original table, but when it's mashed up into a customer dimension, it's no longer explicitly defined that that field must be unique (unless someone had the presence of mind to define it that way, but once again human error creeps in).

Second, as far as performance is concerned, I agree that generally speaking de-normalized data will respond quicker. However, it is possible to define partitions, cluster indexes, and logical indexes so as to achieve the exact same "Big O" order. The differences in performance are therefore linear and can be solved by adding more CPUs and memory, so overall scalability is not affected.
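Here's a tiny, hypothetical illustration of the lost-constraint point (again using Python's sqlite3 module as a stand-in for a real RDBMS): the normalized order table rejects a duplicate order number outright, while a denormalized customer dimension carrying the same field accepts it silently, unless somebody remembered to re-declare the rule:

import sqlite3

db = sqlite3.connect(":memory:")

# Normalized source: the order number's uniqueness is declared once and enforced by the RDBMS.
db.execute("CREATE TABLE customer_order (order_id INTEGER PRIMARY KEY, "
           "order_number TEXT NOT NULL UNIQUE, customer_id INTEGER NOT NULL)")

# Denormalized target: the order number mashed into a customer dimension, with no unique
# constraint (nobody thought to re-declare it when the ETL was built).
db.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, "
           "customer_name TEXT, last_order_number TEXT)")

db.execute("INSERT INTO customer_order (order_number, customer_id) VALUES ('A-1001', 1)")
try:
    db.execute("INSERT INTO customer_order (order_number, customer_id) VALUES ('A-1001', 2)")
except sqlite3.IntegrityError as err:
    print("normalized table rejected the duplicate:", err)

# The dimension table takes the same duplicate without complaint.
db.execute("INSERT INTO dim_customer (customer_name, last_order_number) VALUES ('Acme', 'A-1001')")
db.execute("INSERT INTO dim_customer (customer_name, last_order_number) VALUES ('Apex', 'A-1001')")
dupes = db.execute("SELECT COUNT(*) FROM dim_customer "
                   "WHERE last_order_number = 'A-1001'").fetchone()[0]
print("rows in dim_customer claiming order A-1001:", dupes)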

Problem #3: It is impossible to get people to agree on one version of "the truth".

While all my other arguments were focused on technology, this problem is clearly in the human realm, and I will submit that it will never be definitively solved. Ultimately it comes down to politics and human will. One person's definition of "customer" may make them look good, but make another person look bad. Perception is reality, and the very way in which we concoct definitions impacts the way the business itself is run. A good friend of mine [Jonathan Ezer] posited that there is no such thing as "The Truth", and that it is inherently subjective. As an example, he pointed to Pluto, which is no longer a planet thanks to a voting process. Yes, there is still a hunk of ice/rock floating up there in outer space, but that's not what the average person holds to be important. Fortunately, only scientists were allowed to vote, but if astrologers had been invited, they would surely have rained on the scientists' parade. Okay, so this sounds a little philosophical and airy-fairy. But consider this: According to Peter Aiken, the IRS took 13 years to arrive at a single definition of the word "child", including 2.5 years to move through legislation. That said, the advice I can offer for breaking through these kinds of logjams, and which I've also heard from other experienced practitioners, is to leverage definitions that have already been vetted by external organizations. Going back to the IRS example, we now have a thorough operational definition that provides us with a good toehold. Personally, I try to look for ISO or (since I'm Canadian) StatCan definitions of terms if the word is not clear. Another great place is Wikipedia or its successor Citizendium. Apart from that, all I can suggest is to be as diplomatic and patient as possible.

Am I wrong? Am I crazy? Or does data management really need to be this complicated? Well, I guess the software vendors like it that way.

-Neil.

Tuesday, August 14, 2007

Social Networks for the Enterprise

I've been meaning to put this entry down for weeks now, but with each passing week it seems I come to new realizations about social networks and the direction they are going. It's been about two weeks since I've had any new thoughts on the matter, so for now my thinking has settled down.

But before getting into my thoughts on social networks and how I feel they might be applied to the enterprise, I'd like to share a little story with you. A couple weeks ago I sat down for lunch at the local food-court down the road. An old neighbour of mine - Murray - who I hadn't seen in a while, saw me and sat down at my table. Within no time Murray (who is a senior manager for a large Canadian insurance company) started to vent about Facebook. Facebook, as you may have heard, is the hottest social networking site out there, and is particularly popular in Toronto, with over 700,000 Torontonians on it and growing. Murray brought up the fact that Facebook may have a huge valuation of over $10 billion, but is in fact costing companies significantly more than that in lost productivity. While I'd heard about companies (and even the Ontario government) banning access to Facebook, it really dawned on me what a time sucker this thing is. I was tempted to bring up the fact that maybe there were other issues surrounding employee management, and to argue that this is the overarching problem. However, since that conversation I've read that about half of all corporations now ban access to Facebook.

With all this hype surrounding social networks it's inevitable that people are writing about how they might be applied to the Enterprise. So far, I haven't read anything that has really impressed me (I've just read the rah-rah stuff on ZDNet about how social networking helped a bunch of people solve problems faster, without explaining how or why). So I thought I might build on a post I wrote in the past on Wikis in the enterprise and relate it to social networks, but also [and more importantly] discuss the differences in social dynamics between consumer social networks and enterprise social networks. Actually, I would say that this is really what has not been discussed in enough detail by the general media: What is the nature of our relations in enterprises versus the nature of our relations with friends and family?


On that note let me first discuss what I am seeing happening on MySpace and Facebook. I am assuming you the reader have at least a cursory knowledge of these services. If you have no idea what these services are used for or why they are important, I suggest you do some research on these sites, and then return to this blog entry.

Moving on, MySpace was the first major social networking site to capture the popular imagination. There were sites before this (Six Degrees comes to mind), but MySpace became a hit for the following reasons:

  1. It was targeted to, and appealed directly to, teenagers: Probably the most socially self-conscious group that exists. This has changed somewhat due to concerns over sexual predators.
  2. It was completely open. Anyone could see anybody else's MySpace page without having to register or login.
  3. It was a platform unto itself. While building a MySpace page is mainly a "fill in the blanks" exercise, users are invited to add "widgets" and from there "pimp out" their MySpace page. Of course this spawned a widget cottage industry, which in turn makes the MySpace platform more desirable to its users.
Facebook on the other hand succeeded mainly for these reasons:
  1. It was targeted to college students: Probably the second most self-conscious group that exists.
  2. It was not so open, which made it more conducive to posting private details. Namely, users could feel more confident about posting personal photographs because the security measures were in place to ensure that only certain people ("friends") could see photos and other personal details.
  3. It included a news feed which allows you to see all your friends' updates. This is perhaps the most powerful [and originally controversial] feature of Facebook, and the one feature that has generated the most stickiness.
  4. It also is a platform like MySpace. However, it's an arguably more powerful platform since the underlying capabilities of Facebook are more robust, especially the security.
Both MySpace and Facebook have their strengths and weaknesses, but in their current state, I don't see either of them as being an ideal fit for the Enterprise. The other social network I didn't mention is LinkedIn, which I won't get into, but I feel that it too is ill suited for the Enterprise.

In order to understand why this is, you have to ask yourself the following question: What is the nature of relationships in the Enterprise, and how are they different from relationships in mainstream social networks?

In a nutshell, I would say that the answer is this: Normal social networks are typically defined by relationships that both parties willingly desire. In the Enterprise, relationships tend to be dictated by the Enterprise, and are thus of a utilitarian nature. While it's nice to work with people we're friends with, this isn't always going to be the case. However, if we can make these utilitarian relationships friendlier, that is always a good thing. So, I would propose that any social network for the Enterprise be cognizant of the nature of the relationship, but also facilitate warmer connections. In that regard, divulging a certain amount of personal information is not a bad thing, but it should be managed with a greater amount of astuteness, taking its cue from what is normally discussed around the watercooler, or what would normally be posted on a cubicle wall (e.g. photos of spouse and kids).

So, fleshing out the nature of relationships, I will describe the types of Enterprise relationships that I am aware of, and how I think information should be managed with respect to these relationships. Since this is my fantasy, I will assume that the enterprise has Wikified itself, in the manner that I described a few months ago. The basic types of relationships, the types of information that should be accessible through those relationships, and how that information should be secured are as follows:
First: Operational versus Project relationships.
Operational relationships are ongoing and indefinite. Pretty much everyone has a relationship with HR. Furthermore, everyone has a relationship with the helpdesk. In some cases you will want to maintain personal relationships (HR is a good candidate here), and in other cases you will want to maintain a relationship with a proxy (the helpdesk is a good candidate here). For operational relationships, you don't really need to have very much insight into the documents and data that these entities rely on, and for the most part you would just have their contact details, and a few other things that these persons may make public. For example, HR could post (or link to) information about Insurance companies, company dress code policy, benefits, etc. But you don't need to know which IS/IT systems they are using to manage your benefits as this does not concern you.

Project relationships on the other hand are temporary, but tend to require greater line-of-sight to knowledge. So, as opposed to our relationship with HR, where we don't really need to know HOW they do their job, in the case of project relationships this line-of-sight is usually a good thing. As an example: if I'm on a project working with a team of software developers, quality assurance professionals, business analysts, systems analysts, and project managers, it would save me time to be able to see what they're up to. Speaking in concrete terms, this means I would like to see what documents they are using (i.e. what Wikis they are frequently accessing), what databases they are connecting to, and generally what they are up to (I am also thinking of a Twitter RSS feed here - btw, Twitter on its own has the potential to be an extremely powerful management tool). I don't need to know everything about their life, just everything that they are doing NOW.

Second: Hierarchical relationships. The Enterprise always has been and always will be hierarchical in nature. Yes, we all aspire to the "flat" egalitarian Enterprise, but frankly speaking this simply goes against human nature. It will never happen as long as hairless apes run the world. However, we can manage it. Namely, it should be simple for our Enterprise social network to apply the correct security and privacy settings based on hierarchy. I should be able to see everything my subordinate is up to, but not so much what my boss is up to. It's all right if she can see what I'm up to though. It sounds a bit cynical, but this is no different from how our Enterprises currently function. As for peers, this gets a bit tricky and should be handled on a case-by-case basis.
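As a rough sketch of how the operational/project distinction and the hierarchy rules above might be encoded, here is a small, hypothetical Python example; the relationship names, visibility items, and rules are all of my own invention, not a reference to any real product:

from dataclasses import dataclass
from typing import Optional, Set

@dataclass
class Employee:
    name: str
    manager: Optional[str] = None  # direct manager's name, if any

# What each relationship type exposes by default (hypothetical categories).
VISIBILITY_BY_RELATIONSHIP = {
    "operational": {"contact details", "public postings"},
    "project": {"contact details", "public postings",
                "current documents", "status feed", "databases in use"},
}

def visible_items(viewer: Employee, target: Employee, relationship: str) -> Set[str]:
    """Return what the viewer may see of the target, given relationship type and hierarchy."""
    items = set(VISIBILITY_BY_RELATIONSHIP.get(relationship, set()))
    if target.manager == viewer.name:
        # The viewer is the target's boss: see everything the subordinate is up to.
        items |= {"current documents", "status feed"}
    if viewer.manager == target.name:
        # The target is the viewer's boss: not so much in the other direction.
        items -= {"current documents", "status feed"}
    return items

alice = Employee("Alice")                 # Bob's manager
bob = Employee("Bob", manager="Alice")
print(visible_items(alice, bob, "operational"))  # the boss still sees Bob's feed and documents
print(visible_items(bob, alice, "project"))      # Bob sees less of what his boss is up to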

Third: Intra-department versus inter-department versus inter-Enterprise relationships. I don't have any hard and fast answers here, but this is definitely something that should be considered. Things of course get tricky when you're talking about relationships that go outside the Enterprise. Typically these would be vendor relationships, and typically from a knowledge management perspective, this is by default a one-way street. Namely, the Enterprise should collect information about the vendor, but be hesitant to share anything with them through a social network. While I can see a time when social networks cross over Enterprises, it's hard to say if this is a priority. To be sure, there is operational information that is routinely shared. For example, a shipping company would keep its customers informed about the status of packages and deliveries. But this hardly has anything to do with insight about any particular person within either Enterprise.

This is just a sketch of how a social network could be implemented in an Enterprise, and if nothing else some of the things that an Enterprise architect should be mindful of. At the very least, it should break down barriers of communication, and although I mentioned earlier that hierarchies are inevitable, they also can get in the way and ironically dehumanize us. As a simple start, if more large organizations had personal pages where people could add a few photos, say a few things about themselves, and post links to frequently referenced documents, it would make the place a lot less intimidating, and much easier for new hires or new transfers.
---
On a completely different note, I was contacted by Michael who writes the Data Governance Blog: http://datagovernanceblog.com/
Michael had some nice things to say about my own blog and I am very flattered and appreciative of that. Although I don't blog that often, one of my main goals has been to connect with like-minded individuals out there who see Enterprise Architecture and Data Management as a professional discipline, and who also understand that the discussion is not about Microsoft or Cognos or IBM or any other silver bullet manufacturer, but is something much more nuanced and sophisticated than any of these tech vendors would portray the problem as being. So, I am more than happy to hear from anyone else out there who sees things the same way I do, or enjoys a healthy debate.

For my next blog entry, I've got something a bit more abstract - but with real consequences - planned. I am partially basing it on a lecture by my good friend Jonathan Ezer.

Thursday, May 17, 2007

SOA without IT governance = good luck

Before getting into my post, I wanted to mention an interview I read in this month's Wired with Eric Schmidt, CEO of Google. I want to share with you a small excerpt:

Wired: Google’s revenue and employee head count have tripled in the last two years. How do you keep from becoming too bureaucratic or too chaotic?

Schmidt: It’s a constant problem. We analyze this every day, and our conclusion is that the best model is still small teams running as fast as they can and tolerating a certain lack of cohesion. Attempting to provide too much order dries out the creativity. What’s needed in a properly functioning corporation is a balance between creativity and order.
But we’ve reined in certain things. For example, we don’t tolerate the kind of “Hey, I want to have my own database and have a good time” behavior that was effective for us in the past.

Very interesting... Of all the examples the CEO of Google could have come up with in terms of governance, he chose what is basically data governance. I think this is an excellent thing to mention when developers get in a hissy fit about how they're using data. Even the almighty Google adheres to a data governance policy, and the CEO is 100% supportive. Which leads into my blog post, about maintaining SOA services. Something tells me that Google probably does a decent job of governing their web services.

Now onto my point...

The SOA revolution is on in full force. It's the shiniest silver bullet to come around in a long time, and to be sure it has some real benefits that cannot be ignored. Unfortunately, I will be surprised if any companies out there that don't already have strong IT governance in place will be able to succeed in achieving their desired ROIs. Of course slick new technology doesn't need a business case, as most CIOs are shamed into implementing a SOA program even if there is no specific need - it simply becomes "common sense".

Before launching into my critique I must point out that I am a huge supporter of the SOA approach. Web services, like those offered by Google, Amazon.com, Yahoo!, eBay, and others (check out programmableweb.com for a comprehensive directory of web services) are without a doubt a standard that's here to stay. Developing future applications using a SOA model clearly makes a lot of sense.

From a corporate IT perspective, the SOA value proposition is two-fold: First, it allows for re-usability like never before. In this respect, SOA's direct antecedent is software components (e.g. COM components, or EJBs); Second, SOA makes building distributed systems a whole lot easier. In this respect SOA's direct antecedent is a mishmash of all sorts of technology (e.g. message passing, RPC [which ODBC uses], store-and-forward, etc.).

Now here's the rub. If you're going to switch to building things using a SOA approach, you're probably just going to start building services for new applications. Those applications in turn will be funded by projects, which will be managed by project managers whose responsibility is the success of the project, not the success of the IT infrastructure. As the PMI likes to remind us: "Never goldplate". Full disclosure: I am PMI certified. Okay, so what does this mean? It means that while it is possible to build re-usable services, in all likelihood they will be built for a specific application. Fair enough: when the next project comes around that needs something slightly different, we can just extend those services, and thereby extend their value. Not so fast! The project manager on the second project will likely have to decide: Is it cheaper to extend a live service, or just take the original source code and extend that instead, creating a brand-new service that is all but identical to the original? Well, in spite of all the best intentions, most PMs will quickly cost out the price of regression testing the current application(s) that use the existing interface (not to mention the logistical headache) and will take the path of least resistance by building a nearly identical new service interface. Eventually, over time, what you get is a balkanized set of services which IT will constantly talk about "re-factoring" or "consolidating", but in reality there's very little discretionary money to complete a major project like that. Instead, what will happen is that there will be some kind of required change that impacts all services. At that point IT will have to decide whether to consolidate or to fix each service one by one. More often than not, it will be the one-by-one fix that you will see. The cumulative cost of these fixes will greatly outweigh what it would have cost to consolidate the services in the first place, but it will remain a constant headache that cannot be solved without a major infrastructure overhaul, which some IT disaster may eventually justify.

You will of course point out that this type of IT sprawl is really just a lack of IT governance. Of course it is a lack of governance. The point is: The discipline required to manage the reusability of web services is no different from the discipline required to manage the reusability of data, which in turn requires metadata management, which in turn requires solid data governance, which in turn requires solid IT governance.

To sum up: Implementing a SOA strategy, without any success managing data [and hence metadata], is like boarding a ship with an incompetent navigator. Will you get to your destination? Sure, but it'll take you a lot longer, and cost you a lot more.

Friday, April 06, 2007

Why basic IT services must be commoditized before we move to the next level

I recently got an e-mail from Gord, a former colleague who had just come back from a job interview. Gord was lamenting the fact that, in spite of all the talk surrounding data stewardship, metadata, and business intelligence, the reality is that most companies are still interviewing people based on product-specific technical expertise.

While on one hand it is becoming more and more apparent that “IT failures” are less issues of the technology working and more issues of poor business alignment, companies are not specifically requesting these skills when hiring, or looking into this track record. So, while an interviewer interviewing for a DBA position could ask questions like:
“How do you ensure that the data modeling policy is being followed? How do you deal with non-compliance?” or;
“Have you ever worked in an environment with a focus on metadata management? Can you describe the challenges, and how you dealt with them?”

Instead, the main questions are these:

“Have you ever completed a major database upgrade project?”
“How do you configure Real Application Clusters in Oracle 10g?”
“Describe a robust back-up regime.”

Reading back the questions, it’s clear that the former set of questions is mushy and doesn’t have clear-cut right and wrong answers. The latter set of questions is point-blank, and while there may be different ways to answer them, the answers can be easily validated.

But there should be other observations from looking at these questions. The answers to the first set of questions should give you an idea of how business-minded the DBA is. The answers to the second set will give you no such insight, but they will tell you how competent the DBA is at physically managing the database.

For the time being, a DBA who can, say, perform a rolling upgrade with zero downtime is quite the hero indeed. But on the same note, why must that DBA be confined to a single company? Isn’t that ability applicable to EVERY company, regardless of their line of business? I mean, if you can ensure that behind the scenes your DB is running flawlessly, why are these trades haphazardly being reproduced in every IT department? And by the way, I’m not just singling out DBAs here; I would say that at least half of all IT roles have little or no direct linkage to business activity.

Of course, we hear about how these roles are becoming less relevant due to increasing automation, but I don’t buy this. Any system that is automated still requires people to monitor it, as well as people to fix it. Simply put, the DBA, the network technician, the Java/VB/C# developer, and all of those other roles which are not in themselves expressions of the business, are still necessary.
Where I believe the reluctance to change lies is in outsourcing and in sharing resources across companies.

Ironically, the companies that have a better understanding of and a greater need for things like data governance and business rules management are also the least likely to let go of control over these basic services. The main reasons for this are stability, security, and privacy, and an overall greater dependency on IT to automate their business.

On the flipside, smaller businesses, especially those that are growing quickly, are more likely to take a risk with a hosted solution. An IT debacle at a small transport company is far less likely to get pasted all over the following day’s headlines than a screw-up at a major bank. Another difference is that smaller businesses that go with a hosted solution may also see it as the less risky option. By sailing on the same ship as dozens or even hundreds or thousands of other businesses, there is comfort in knowing there is “safety in numbers”. As sound as this argument is, large companies are simply not as trusting and often see themselves as more important than any of the hosting vendor’s clients, and therefore more important than even the sum of its clients. In my own opinion I think there is some validity to this argument, so I’ll leave it at that.

With that said, there is still a gaping hole left to fill. Namely, the vast majority of hosting vendors are what are known as ASPs or Application Service Providers. ASPs typically offer hosted versions of popular workgroup applications, such as accounting applications, reservation systems, and so on. While these applications are all configurable, by comparison to a custom built application, they are extremely rigid and can quickly calcify even a small business’ operations by forcing the business to operate in a predetermined fashion.

At the other end of the extreme are what I would describe as “raw” hosted servers. These are companies which will host an Oracle DB for you, and possibly provide some basic DBA assistance. While these services are definitely a step in the right direction, there is little preventing you from shooting yourself in the foot.

So I say we are in need of higher level services that are easily configurable, but are abstracted to a business level.

The solution of course is more generic web services. We are beginning to see these pop-up, but from what I can tell, they are still in their infancy. I must admit, I have not done a recent survey, so I can’t tell you if the services I describe below exist yet, but if they do, I urge you first and foremost to bring them to my attention (free plugs), and secondly to review them yourself. Thus, the services I have in mind are:
1. A hosted business rules management (BRM) solution
2. A hosted business process management (BPM) solution, with business activity monitoring (BAM), workflow management, transaction management, and global scheduling
3. A hosted relational database solution, with ETL functionality, and metadata management
4. A hosted forms solution (I know a few of these already exist, and I use them)
5. A hosted reporting solution (although I can’t think of any, I’m sure there must be some of these out there)
6. A hosted user directory and identity management solution (I know that these also exist, but am not sure if they had these types of services in mind)

Now, as I just mentioned above, with the exception of hosted forms, and possibly a hosted reporting solution, I haven’t seen any of the other hosted solutions on the market. I would say the main reasons for this are:
1. The transactional throughput of web-based applications has yet to match traditional solutions.
2. It is not clear how these services would integrate with each other, and more importantly there are no standards for doing so.

For the first problem, computational resources, I’m not too concerned. If the above-mentioned solutions are deployed in a grid-computing fashion, and there are multiple applications running on them, then there are tremendous opportunities for optimization. This is simply a problem that will work itself out over time.

The second problem is much thornier. Standards take time to work out, and can often limit the flexibility of what can be done. A more likely outcome is that a large web services hosting company such as a Google or an Amazon.com may release a single packaged suite that encompasses all of these tools. We can see that both of these companies are already positioning themselves in this way, with Google taking a more desktop approach and Amazon.com taking a more back-office approach. However, I don’t see either of them taking this space head-on.

I myself look forward to the day where I can architect, build, deploy, and provide high-level support for a full-blown system for a client in Africa or Asia, Europe, or wherever the business may be, without ever having to worry about anything other than the business details to do my job.

One of the great blessings about a career in IT is that it gives you a window into practically every other business out there. The more we can get away from commoditized details and move towards the technical essence of the business, the more varied and interesting our jobs become. Let’s hope these solutions happen sooner than later.

BEGIN SHAMELESS PLUG

I have recently got involved with a pretty cool internet radio project that I’m proud to unofficially announce. It’s called TUN3R.com and to put it in a nutshell, it’s a next-generation internet radio portal. What makes it different from other radio/music portals is that all stations are laid out in an expansive grid which you can whiz around with a cross-hair to “tune in” immediately to any station.

The technology itself is very impressive, and what I like most is that it reminds me of the good ole’ days where you could just play around with a station tuner and randomly find things you were never expecting to hear. For now though, the product is not fully baked, and we’re going to be radically improving the searching, including the addition of a new type of search which to my knowledge has not been done yet, so that will really push the site to “11”.

Anyway, please check it out at: http://tun3r.com/

END SHAMELESS PLUG

Saturday, March 03, 2007

Rebooting Repositories: Are Wikis more viable for Metadata, CMDB, Document Management, and other forms of Enterprise Repositories?

I have been grappling with the general problem of knowledge repositories. It’s been driving me nuts. While metadata repositories have been around for a while, new breeds of repositories are emerging at an increasing rate. In particular CMDB repositories (Configuration Management Databases), Enterprise Architecture repositories, Business Rules repositories, and so on are turning up all the time. The problem of course is that:
a) The information in each of these repositories is related in some way, and those relationships are relevant, and are themselves information (i.e. derived facts)
b) The repositories all have their own data models which cannot be easily integrated.

One approach is to build your own repository, and extend it as needed. Another approach is to take an “anchor” repository and attempt to extend it. So, for example, taking a CMDB and extending it to include entities and attributes for a metadata repository. However, both of these approaches require a great deal of effort to build and maintain, and in the attempt to create a cohesive view of knowledge, we invariably get bogged down in the plumbing of the repository itself. An excellent article which examines this problem is Repositories Build or Buy by Malcolm Chisholm.

What I feel the problem really boils down to is rigid data models that cannot be dynamically changed, and in turn require integration projects. I am a big fan of the relational model, and to my knowledge, it is the only complete data model that exists. Data that has been properly normalized, constrained, and indexed can answer pretty much any question about itself.

While integrating any two relational data models (i.e. repositories) across heterogeneous systems is always possible, it is usually very difficult. While there are numerous reasons for this, I’ll point out two major issues that will never go away:

  1. Links in the relational model are represented as foreign key to primary key relationships. Such linkages presuppose a single system [the RDBMS] overseeing both the primary key entity and the foreign key entity. Contrast this to the world wide web, where anything can link to anything, and there is no single system enforcing referential integrity (see the short sketch after this list). While this would not be acceptable for a “bet-your-business” operational system, when it comes to knowledge management, I think it’s fair to relax the rules a bit in favour of agility.
  2. Most software developers treat the RDBMS as a “bit-bucket” to store data, and not as a system in its own right that is capable of managing data on its own. Furthermore, any entities used by the developers are thought of as “black boxes” to only be interfaced with through the developers’ [typically hidden] interfaces. As such, going in and adding even a single column is fraught with peril. In any enterprise, changing a data model in production is typically the riskiest operation you can perform, and carries the greatest cost due to the amount of analysis and regression testing required.
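Here is the sketch promised above, a hypothetical illustration of point 1 using Python's sqlite3 module: within a single RDBMS the foreign key link is policed, while a web-style link is just a string that nothing enforces, which is exactly why the auditing approach I describe further down becomes attractive:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.execute("CREATE TABLE server (server_id INTEGER PRIMARY KEY, hostname TEXT NOT NULL)")
db.execute("CREATE TABLE data_element (element_id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
           "server_id INTEGER REFERENCES server(server_id), "  # enforced, single-system link
           "wiki_url TEXT)")                                    # unenforced, web-style link
db.execute("INSERT INTO server VALUES (1, 'db-prod-01')")

# The enforced link: pointing at a server that doesn't exist is rejected outright.
try:
    db.execute("INSERT INTO data_element (name, server_id) VALUES ('credit_card_number', 99)")
except sqlite3.IntegrityError as err:
    print("foreign key rejected the broken link:", err)

# The web-style link: a dead URL is accepted without complaint, hence the need for
# after-the-fact auditing rather than up-front enforcement.
db.execute("INSERT INTO data_element (name, server_id, wiki_url) "
           "VALUES ('credit_card_number', 1, 'http://wiki.example.local/NoSuchPage')")
db.commit()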

There is also a major marketing problem with traditional repositories: The average person just sees them as obscure “black boxes” that are the domain of techy geeks. I have argued in the past that data governance and data stewardship coupled with a well structured metadata repository are necessary if you want to achieve data interoperability, and the purist in me will always believe this, but the pragmatist in me also knows that an 80/20 solution that can be sold and implemented is better than no solution at all.

Thus without further ado, I propose that as enterprise architects, IT service managers, and data managers, we seriously consider a Wiki approach to managing and integrating our knowledge. I.e. a Wikipedia for the enterprise. Now, I’m well aware that like anything that’s popular out there in the internet world, someone is trying to apply it to the corporate world. In other words, I don’t think the idea of an enterprise “Wiki” is anything new. However, I feel that people view Wikis in a very narrow light that does not do justice to its potential, and I’d like to point out some alternative ways in which we could marry Wikis to enterprise repositories, like a metadata repository or CMDB.

Wikis in a corporate sense are often thought of as a combination document management system cum message board. It’s a place where you could put a document about, say, a procedure for backing up a server, followed by a tape retention process. Users could go in and edit the Wiki any time the procedure itself changed. They could then record what they changed about the procedure in the edit notes. Anyone who is familiar with a document management system knows that this is nothing new, but for the uninitiated, a Wiki is more approachable and easier to digest. I used to work on developing integrations and add-ons for DOCSOpen (the most popular Document Management System of its time), and while I could argue the merits of a document management system (primarily its third party integrations), I would have no problem recommending a Wiki approach if a client was interested. But I digress…

I believe a Wiki could be extended to hold and maintain corporate documents, Metadata, CMDB data, and all other enterprise repository data, if the following shortcomings could be addressed:

  1. We need to have more powerful editing tools. The current way of editing a Wiki reminds me of when I used to write essays in university using LaTeX. It was always very precise, and you could get beautiful layouts, and once you knew your way around the mark-up language it was very easy to put together slick looking documents. But I had to create a Makefile just to “compile” my documents, and the idea of asking my peers to edit a LaTeX file was not feasible as it was just too techy for the average person. I was always a big supporter of LaTeX since it worked for me, but I acknowledged that it was basically useless for the average person until user friendly LaTeX editing tools came around.
  2. We need to have more experience and tools to create Directed Folksonomies. A Folksonomy is basically just a taxonomy that has been created by a user community. For example, you could create a classification system for comic books referring to various genres and subgenres. Of course the problem with a Folksonomy is that it expects the person doing the classifying to know what the various genres and subgenres are to begin with and that they are also using these classifications correctly. A Directed Folksonomy on the other hand simplifies this task for the classifier as it allows them to pick and choose the correct genre and subgenre, and ideally it should provide concise definitions of categories and subcategories. This leads me though to the third shortcoming of Wikis.

  3. We need more granular security. We need to ensure that select parts of Wikis can be edited or viewed by select users and in only select ways. We would also need to ensure that for Directed Folksonomies only select users (Data Stewards) could create and edit the Folksonomy definitions, but perhaps allow a greater number of people to tag information using those Folksonomies.

  4. We need autonomous agents that can modify sections of Wikis on their own. Taking a CMDB example, it would be nice to check a single server page to see which servers are currently up, and for how long. That same page could have multiple sections: some sections being edited by people; and other sections that are only edited by autonomous agents. By allowing both people and autonomous agents to edit the same page, we no longer calcify those data models, as each agent would always be aware that the full Wiki is not its dominion, and only a section of it is. Compare this to how RDBMS tables are currently treated, and what the consequences might be if we were to add even a single column to a single table.

  5. We need better audit tools and processes to ensure Wiki integrity. It would be nice for example to ensure that when a metadata element is pointing to a server entity/Wiki, it’s actually pointing to a server entity/Wiki and not just a dead link. While I don’t believe we need to have such integrity enforced by some overruling system (like an RDBMS), it would be nice to have spiders that could crawl enterprise Wikis as an a posteriori batch process that could point out issues that need to be rectified (a rough sketch of such a spider follows this list). I feel that this approach would work better, as it would behave the same way should there be a company merger, and could quickly and effectively assist in integrating both sets of knowledge, without ever actually preventing such an integration from happening due to overly strict a priori constraints.
  6. We need better reporting tools to allow BI-style reporting. Although I’m suggesting a Wiki approach to entering and maintaining knowledge, there’s no reason why we couldn’t suck this data into a data warehouse for reporting. Such reports could tell us where change is happening the fastest, and by whom, or which areas of knowledge are old and creaky. The possibilities are endless.
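And here is the spider sketch promised in point 5: a rough, assumption-laden example of an a posteriori link auditor that crawls a handful of enterprise Wiki pages (the seed URLs are placeholders), collects their links, and reports the ones that no longer resolve. A real implementation would also need authentication, throttling, and knowledge of the Wiki's own URL conventions:

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical starting points inside the enterprise Wiki.
SEED_PAGES = ["http://wiki.example.local/ServerInventory",
              "http://wiki.example.local/CreditCardDataType"]

class LinkCollector(HTMLParser):
    """Collects the href targets of anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and not value.startswith(("#", "mailto:")):
                    self.links.append(value)

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def audit(pages):
    """Return (source page, link target, reason) for every link that fails to resolve."""
    broken = []
    for page in pages:
        try:
            html = fetch(page)
        except (OSError, ValueError) as err:   # URLError/HTTPError are OSError subclasses
            broken.append((page, page, str(err)))
            continue
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(page, href)
            try:
                fetch(target)
            except (OSError, ValueError) as err:
                broken.append((page, target, str(err)))
    return broken

for source, target, reason in audit(SEED_PAGES):
    print("dead link on {}: {} ({})".format(source, target, reason))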

Before wrapping up, I’d like to paint a picture of a Wikified enterprise through a simple use case scenario:
You’ve only been with the retail company for a couple of weeks, and have only just got to know the DBAs and a few developers. You’ve been asked, as part of a SOX audit, to find out where the credit card data for the company’s corporate customers is physically stored, and to confirm [or deny] that the server is properly backed up and that the back-ups are encrypted.

In your past experience you would start making the rounds. You’d be calling people, waiting for responses, following up with more questions, requesting documentation, and not always getting it. In the end you’d get your answer, but it could take you the better part of a week just to track down the right people and get them to locate the correct documentation.

In my dream Wikified enterprise, things would instead go like this:


  1. You go to the Wiki portal where you search for “credit card”.

  2. You find various results for pages with credit cards, but near the top you find a “credit card” data type page. You decide to click on it.

  3. The page brings up a definition of the credit card type, and includes links to all the various entities that utilize this type. The information on these pages is structured within sections, and has been edited and maintained by knowledgeable data stewards.

  4. You click through all the entities that have the credit card attribute (there would be links from this page). You scan the definition of each entity until you find those entities that pertain to corporate customers. The definitions of the entities should provide this information, or at least provide links to other Wikis which would provide this information.

  5. You have now located all entities that hold corporate credit cards. You now need to determine where these entities are physically stored in production. The entities themselves would have links back to which DBMS they are stored in.

  6. You then click on the DBMS Wiki link for each entity to find out where they’re stored. From here you have located three DBMSs: a data warehouse; an ODS; and a third DBMS.

  7. For each DBMS Wiki, you click on its link to find out more about it. One of the sections has links to the physical hardware that the DBMS runs on. You then go to this server Wiki. You read up on the server’s security and make notes. The server Wiki has a link to the data centre Wiki. You then click on the Data Centre Wiki to read about it, where it’s located, and its security policies. You bookmark this Wiki, while taking notes.

  8. You go back to the DBMS Wiki to read about the back-up and retention policy, which is contained in its own section. You note that there is no mention of whether the back-up tapes are encrypted or not (even after checking the respective back-up server Wikis), and decide to call the person who last edited the back-up section. When you get in touch with this DBA, she informs you that the tapes are in fact not encrypted. Good to know. No problem: you quickly go into the Wiki and make a note that the tapes are not encrypted, citing where you got the information from.

  9. You now take the information that you have collected and produce a report. You look back and realize that it only took you a couple of hours to collect and digest all the information, including the call to the DBA. The report took another hour to draft and format, and you feel confident that, if challenged, you can corroborate your facts. A job well done, in a complex environment.


This to me is what the agile enterprise is all about. Clearly we’re still a ways off. However, I’m optimistic that we’ll be there sooner than we think.

There is one last thing I’d like to leave you with on this topic. I have been blabbing on about metadata for quite some time now. While most people “in the know” are quick to agree that metadata is essential to getting to the root of IT failures, it has yet to capture the popular imagination and I fear it never will. On the other hand, Wikis have captured the popular imagination. Both my parents know what they are, and can envision their use in many different ways. I mention metadata and I get blank stares. I mention Wikipedia, and I’m always in for a lively discussion. So, as a pragmatist, I feel that even though there are technology hurdles to clear in turning Wikis into robust enterprise repositories, that is a cinch in comparison to convincing people about the merits of metadata. So, the next time you start to talk about metadata or knowledge management, try instead starting off talking about “Wikipedia for the corporation”, and go from there. I bet you will have a much better chance of engaging the person you’re speaking with, whether they agree with you or not. Dialogue is only the beginning.

Tuesday, January 09, 2007

Implementing Data Governance

If you’ve ever purchased a home (or prepared a Last Will and Testament, negotiated a severance package, or gone through a divorce), it’s likely that you have hired a lawyer to guide you through the process. If you managed to find a good lawyer, hopefully she did a decent job of ensuring your best interests were being served. Hopefully she presented you with the key decisions that needed to be made, advised you on how to evaluate those decisions, and did not overwhelm you with the minutiae of the law.

You probably also hired a lawyer who was an expert in her field, whether it be real estate law, wills and trusts law, employment law, or family law. This person probably knew her subject pretty well, although she wouldn’t know your specific wishes until meeting with you. She would still have a good understanding of how your situation fits into the context of the law. To complete the required work, your lawyer may even rely upon other lawyers or legal workers to flesh out all the details and take care of some of the routine work. Continuing with our real estate example, your lawyer would then liaise with the Land Registry Office (that’s what they call the place where deeds are recorded in Ontario) to ensure that the ownership has been officially recognized and that there is a clear and unambiguous record stating your entitlement to the land. At the end of the process, assuming things went as planned, you would be responsible for and take ownership of your new home, while relinquishing ownership of your old home.

In the world of Data Governance things should work in a similar way. However, instead of the client taking ownership of a property, the client takes ownership of a particular Subject Area. Instead of working with a lawyer, the client would work with a Data Steward. Instead of storing your official records with the Land Registry Office, you would instead store pertinent information in a Metadata Registry (often referred to as a Metadata Repository). Finally, there would be a straightforward and unambiguous process to follow, as is the case when dealing with a real estate transaction. Now, I’ve probably gotten a little ahead of myself here.

Questions…
What exactly does it mean to take ownership of a Subject Area? What is a Subject Area for that matter? What is a Data Steward? And what is a Metadata Registry?

A Subject Area is an area of interest or subset of data concerning something of importance to the enterprise. This could be Customers or Employees or Products or Stores or Financial transactions. Taking ownership of a particular Subject Area not only entails taking ownership of the quality of the data, but just as importantly entails taking ownership of how that Subject Area’s Data Elements are defined (a Data Element is just a constrained piece of data, such as a credit card number or customer address. This is not to be confused with unconstrained or unstructured data, such as a paragraph in a blog). Establishing ownership of a particular Subject Area is probably the most difficult and politically charged aspect of data management. It is also the reason why data management professionals must have strong interpersonal skills to succeed. Data management is not as much about technology as it is about PR, marketing, persuasion, and negotiation.

Wikipedia defines a Data Steward as a role assigned to a person that is responsible for maintaining a Data Element in a Metadata Registry. While this is not incorrect, I feel that the role also encompasses the following:
  • Is assigned to, and has an in-depth understanding of, one or more Subject Areas (e.g. Customers, Products, etc.).
  • Has good negotiation skills: This is necessary when attempting to extract information from both internal and external colleagues. The accuracy of information is only as good as its source, and Data Stewards must both work hard and use finesse to get people on their side to provide the most accurate and detailed data definitions.
  • Liaises with other Data Stewards.
  • Understands the Enterprise’s method of data classification, e.g. the various value domains, value domain types, value domain type classes, etc. (yeah, it’s a bit like zoology in more ways than one).
  • Understands how to relate Data Elements to business and IT entities, e.g. how the customer credit card data element relates to business rules, business cycles, or even which physical servers it is stored on.
  • Is well networked with, and liaises with, Information Security to establish the security classification of data.
  • May require management skills to manage more junior Data Stewards.

A Metadata Registry is a registry or repository used to store not only your data element definitions, but also how those data elements relate to all other entities in your enterprise. The Metadata Registry need not be a sophisticated enterprise application (though many Metadata Registries will support ISO 11179 or OMG MOF data definitions, layered data models, Change Management user workflow, modelled business entities, integrations with ETL tools, BI tools, and IDEs, and more); it can be as simple as a spreadsheet or even a text document, as long as it has enough structure to show what each data element’s fully qualified definition is, and enough detail to distinguish between similar but different data elements. In my experience the vast majority of Metadata Registries are just simple Excel spreadsheets, or at the most a Microsoft Access DB.
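To show just how little is needed to get started, here is a hypothetical sketch of a bare-bones Metadata Registry written out as a plain CSV file (i.e. the simple spreadsheet end of the spectrum); the columns and the two sample rows are purely illustrative, not a standard:

import csv

COLUMNS = ["subject_area", "data_element", "definition",
           "data_owner", "data_steward", "related_systems", "security_class"]

SAMPLE_ROWS = [
    {"subject_area": "Customers",
     "data_element": "customer_credit_card_number",
     "definition": ("The primary account number of the credit card a corporate customer "
                    "has registered for billing; one active number per account."),
     "data_owner": "VP Sales",
     "data_steward": "J. Smith",
     "related_systems": "Order entry; Billing ODS; Data warehouse",
     "security_class": "Restricted"},
    {"subject_area": "Products",
     "data_element": "product_sku",
     "definition": "The stock-keeping unit that uniquely identifies a sellable product.",
     "data_owner": "Director of Merchandising",
     "data_steward": "A. Lee",
     "related_systems": "Inventory; Point of sale",
     "security_class": "Internal"},
]

with open("metadata_registry.csv", "w", newline="") as registry_file:
    writer = csv.DictWriter(registry_file, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(SAMPLE_ROWS)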

Alternatively, if you already have another type of repository, such as a Configuration Management Database (CMDB), it may be possible to extend this registry to include data element definitions. By doing so, you extend the value of your original repository while at the same time solving the problem of where to store your data element definitions. From what I can tell, this is one of the best strategies you can follow, as you are only building what you need while extending your IT value chain. By the way, I shouldn’t take credit for this; it’s really just my interpretation of a methodology and approach that has been proposed by Charles T. Betz and which is described in his book "Architecture and Patterns for IT Service Management, Resource Planning, and Governance: Making Shoes for the Cobbler's Children". You can also read his blog at: www.erp4it.com.

All right, so I’ve described what a Subject Area is, what Data Stewards are responsible for, and what a Metadata Registry is (you can also read more about Metadata in my earlier blog post: “Metadata Defined”). So, how does this all pertain to Data Governance? And how do I implement Data Governance?
When talking about any form of implementation, we are talking about the HOW and not the WHAT (I covered the WHAT in my last post). As we all know, there’s a million ways to skin a cat, so anyone who claims to know “the way” is fooling themselves. So, the steps described below are merely “a way”, but a way that I have seen work in other companies with great success. To be perfectly honest, the steps I’m outlining below are a composite of many approaches that I have observed, from which I’ve tried to draw out the common themes. So, in simple bullet form below, I am describing in ultra-simplistic detail HOW to implement data governance:

Determine which Subject Areas you would like to implement data governance for. It is recommended that you choose subject areas that have the greatest number of data contention issues, and therefore the greatest ROI for your governance of those Subject Areas. As this will provide you with the greatest opportunity for funding. Securing funding of course is often the hardest part in any new venture. However, given that you will be working with the most contentious data it’s difficult to wade into this slowly. Alternatively, if you can get funding for a pilot project that covers less contentious data then this might be the way to go from a risk perspective.
Put together a business case for project funding. This will involve determining the ROI for the initiative. For an outside consultant this is difficult, as you are not aware of the specific problems caused by the lack of Data Governance. What you can attempt is a Data Governance Audit. As a matter of fact, IBM has recently announced a Data Governance Service; it is essentially an 11-point audit, I suspect followed up with appropriate recommendations. Selling an audit is often tough to do, but if you are going to pitch one, the best time to do so is at the end of the budget year, when departments are looking for ways of spending their remaining cash (to ensure the same or a greater budget for next year). Additionally, there are many factoids published on a regular basis which you can use to justify Data Governance in general. For example, Accenture recently completed a study showing that middle managers waste two hours each day just searching for the right information, and once they get it, nearly half of it is useless.
Assuming you get funding, the next step is to determine the Data Owners of the respective Subject Areas. Determining who owns what data is probably the hardest part, as there will be people who don't want to take ownership of data, others who want to take more ownership than they should be entitled to, and then of course those messy grey-area situations where there may need to be multiple owners of the same data. The best way of convincing someone that they should own a Subject Area is to ask them how much they have to lose if the data quality is poor, or if someone else who doesn't have as vested an interest in the data takes ownership instead. Yes, it's a negative way of thinking, but sometimes you need to paint a picture of anarchy and chaos to mobilize people into owning their data. This is why it's especially important for you [the Data Management professional] to have excellent negotiation and persuasion skills.
After identifying and establishing Data Owners, it is time to round up the Data Stewards who will work on the Data Owners' behalf to ensure that the Data Element definitions are correct. It is generally a good idea to select people who already have a good knowledge of the data (as this will save time in sourcing and documenting Data Definitions), but who also have strong analytical skills. This could be a business analyst, a software developer, a DBA, a project manager, or potentially even a Customer Service Rep. Since Data Steward is just a role, it need not be a full-time job; in fact most Data Stewards spend less than half of their time on Data Stewardship.

While selecting your Data Stewards you will also need to consider the reporting structure, and depending on how far you want to go, you may want to consider multiple tiers of Data Stewards. I have witnessed organizations with three levels of Data Stewards, but they had over 150 Data Stewards, which is something that clearly does not happen overnight.
You now have clearly established in-scope Subject Areas with corresponding Data Owners. You also have Data Stewards assigned to work on behalf of those Data Owners who are Subject Matter Experts (SMEs) for those Subject Areas. Now you need to select your Metadata Registry. Software selection in general is never a trivial task, so I'm not going to pretend that selecting a Metadata Registry is any easier. There are best practices particular to selecting a Metadata Registry, which I'll cover in a later blog entry; for now I'll leave this one open and just assume that it will happen. One thing I can reveal is that if you're just getting started with Data Governance, you're best off keeping the Metadata Registry simple and easy to use. Something like an MS Access DB or even an Excel spreadsheet will do, and at a later point you can always migrate to a more robust solution. Alternatively, as I mentioned earlier, you can think about extending an existing repository such as a CMDB. However, if you choose this route, it will likely be a lot more difficult to migrate to a different solution, or to change what you have.
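As an aside, even a simple spreadsheet-style registry can be kept honest with a few lines of scripting. The sketch below assumes a hypothetical CSV export of the registry with "fully_qualified_name" and "definition" columns (the file name and column names are my assumptions, not a standard), and simply flags duplicate names and missing definitions:

import csv
from collections import Counter

def check_registry(path="metadata_registry.csv"):
    """Flag duplicate data element names and empty definitions in a
    hypothetical CSV export of a simple Metadata Registry."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))

    names = [row["fully_qualified_name"].strip() for row in rows]
    duplicates = [name for name, count in Counter(names).items() if count > 1]
    missing = [row["fully_qualified_name"] for row in rows
               if not row.get("definition", "").strip()]

    if duplicates:
        print("Duplicate data element names:", duplicates)
    if missing:
        print("Data elements with no definition:", missing)
    return not duplicates and not missing

if __name__ == "__main__":
    check_registry()

A Data Steward could run something like this before each review cycle; the point is that "simple" does not have to mean "unchecked".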
Your ducks are all lined up now! You just need to tie it all together with policy and procedure. This is where you need to put your Soft Systems thinking hat on and figure out the best way of ensuring people can do their jobs building, repairing, and decommissioning systems without too much policy and procedure getting in their way, while at the same time ensuring that the Data Owners' best interests are preserved. For this you will need to determine the following:
Which technology domains are in scope and which are out of scope. For example, you don't want to waste your time governing how people modify a tiny workgroup MS Access DB, nor do you necessarily want to govern how a hash file is maintained. My advice is to concentrate first on the data that is most likely to be shared across technology domains and business units. In particular I recommend focussing on RDBMS data elements (e.g. data contained within an Oracle, DB2 or MS SQL Server DB). To this day, the relational model is still the only complete data model: no other data model provides built-in guarantees of referential integrity, value constraints, and predictable access and update times (there is a small sketch after this list illustrating the first two). Even with XML, a document can point to any other document, even if that document doesn't exist. Additionally, the RDBMS has more adaptors than any other data store, so more people have the ability to easily connect to it. Finally, there is already a great deal of rigour around the RDBMS, from back-up regimes, to security regimes, to access regimes. So data stored within the RDBMS is already perceived as more of a hardened asset than data stored in other forms.
You will need to determine what the new procedures will be for:
i. Adding new Data Elements
ii. Changing existing Data Elements
iii. Removing or decommissioning Data Elements
How this procedure will work in a multi-tier environment, i.e. how the process would work when there is a Development environment, a Test environment and a Production environment (or however many tiers you happen to have).
How to integrate the procedure with existing Change Management procedures and processes, to ensure the correct approvals take place while also ensuring that no more people than necessary are required to make a change
Who will be involved when executing the aforementioned procedures. Creating a RACI chart is a good way of documenting this (a small example follows this list).
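A RACI chart can be as lightweight as the sketch below (R = Responsible, A = Accountable, C = Consulted, I = Informed). The roles and procedures shown are purely illustrative assumptions; your own chart will reflect your organization's roles.

# A purely illustrative RACI matrix; roles and procedures are assumptions.
raci = {
    "Add new Data Element": {
        "Data Steward": "R",
        "Data Owner": "A",
        "DBA": "C",
        "Information Security": "C",
        "Application Teams": "I",
    },
    "Change existing Data Element": {
        "Data Steward": "R",
        "Data Owner": "A",
        "DBA": "C",
        "Application Teams": "I",
    },
    "Decommission Data Element": {
        "Data Steward": "R",
        "Data Owner": "A",
        "DBA": "R",
        "Application Teams": "I",
    },
}

# Print the chart, one procedure at a time.
for procedure, assignments in raci.items():
    print(procedure)
    for role, code in assignments.items():
        print(f"  {role}: {code}")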

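On the point above about the relational model's built-in guarantees: here is a minimal sketch of what I mean by referential integrity and value constraints. It uses Python's bundled sqlite3 module purely for illustration, and the tables and columns are invented for the example, not taken from any real system.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce foreign keys

# Value constraint: credit_limit must be non-negative (CHECK constraint)
conn.execute("""
    CREATE TABLE customer (
        customer_id  INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        credit_limit NUMERIC NOT NULL CHECK (credit_limit >= 0)
    )""")

# Referential integrity: every order must point at an existing customer
conn.execute("""
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
    )""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp', 5000)")

try:
    # Rejected: customer 99 does not exist (referential integrity)
    conn.execute("INSERT INTO customer_order VALUES (1, 99)")
except sqlite3.IntegrityError as e:
    print("Rejected:", e)

try:
    # Rejected: negative credit limit violates the CHECK constraint
    conn.execute("INSERT INTO customer VALUES (2, 'Bad Data Inc', -100)")
except sqlite3.IntegrityError as e:
    print("Rejected:", e)

The database itself rejects the bad rows; no application code or human intervention is needed, which is exactly the property that makes RDBMS data elements the natural place to start governing.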

If you are working for a smaller organization, then the above methodology can be greatly simplified. You will be able to quickly and efficiently determine who the Data Owners and Data Stewards are, and coming to an agreement on how the procedures will work will also be a lot faster. As a smaller organization you would also be wise to keep your Metadata Registry very simple, such as an Excel spreadsheet or MS Access DB.

If you are working for a large organization, none of what I described above will come easily, but don't look for silver bullets. Yes, good software can perhaps automate some of the manual procedures through people workflows and tool integrations. Good software may even help in the initial discovery process of documenting your data flows (especially when ETL tools are involved). However, most of the hard work comes back to the right people being properly engaged to make well-informed decisions.

Going back to my first example, where I talked about a lawyer working with you to complete a Real Estate transaction: yes, you and the lawyer may be able to use technology to work more efficiently. For example, in the future (or maybe the present, by the time you read this) your lawyer will be able to submit your deed to the Land Registry on-line, saving some time. Or maybe you and your lawyer can collaborate on-line without you having to make the trek down to her office to sign documents. Regardless, I would still say that the most critical work being done is you making sound decisions, and your lawyer interpreting those decisions while ensuring that all the legal details are taken care of in your best interests. Until artificial intelligence makes some great strides (which I'm not seeing), we're going to depend on lawyers to help us make decisions where the law is concerned, and Data Stewards where data is concerned.