Wednesday, December 06, 2006

Metadata Defined

Whether you work in IT or in some other department, if you are part of a large organization or enterprise (and sometimes even if you're not), you have probably noticed that Metadata is rapidly gaining much-needed awareness.


Put simply, metadata is "data about data", or your "data definitions". However, Metadata really becomes valuable when those definitions are standardized across the enterprise and are precise enough to show the nuances between similar but semantically different data elements. For example, a "customer name" may seem similar to an "employee name", but these two data elements are semantically different and are likely not interoperable (if they are, the metadata would make this clear). Furthermore, Metadata also provides context to data so users or consumers of the data can answer basic questions like:
1. What information assets do we have?
2. What does the information asset mean?
3. Where is the information asset located?
4. How did it get there?
5. How do I gain access?
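
As a rough illustration, the five questions above map naturally onto the fields of a metadata record. The following is a minimal Python sketch; the class name, field names, and sample values are all hypothetical and are not taken from any particular repository product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElementRecord:
    """Hypothetical metadata record with one field per question above."""
    name: str            # 1. What information asset do we have?
    definition: str      # 2. What does it mean?
    location: str        # 3. Where is it located?
    lineage: str         # 4. How did it get there?
    access_contact: str  # 5. How do I gain access?


# Sample values are invented purely for illustration.
customer_name = DataElementRecord(
    name="customer name",
    definition="Legal name of a party that has purchased a product or service",
    location="CRM database, table CUSTOMER, column FULL_NAME",
    lineage="Entered by sales staff when the account is opened",
    access_contact="CRM data steward",
)

employee_name = DataElementRecord(
    name="employee name",
    definition="Name of a person employed by the organization, as on the payroll",
    location="HR database, table EMP, column EMP_NM",
    lineage="Loaded from the onboarding form by HR",
    access_contact="HR data steward",
)


def describe(record: DataElementRecord) -> str:
    """Render a record as a short human-readable summary."""
    return (
        f"{record.name}: {record.definition}\n"
        f"  located in: {record.location}\n"
        f"  origin: {record.lineage}\n"
        f"  access via: {record.access_contact}"
    )


print(describe(customer_name))
```

Even in this toy form, the two "name" elements carry different definitions, locations, and stewards, which is exactly the nuance that a bare column called NAME would hide.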


Metadata is not a new thing; in fact, it has been around as long as people have been storing and cataloging information. The Royal Library of Alexandria (3rd century BCE) had an indexing system overseen by Demetrius of Phaleron (a student of Aristotle), which was likely one of the first comprehensive metadata repositories to have existed. Fast forwarding to the modern era: when IT systems were first deployed in the 1960s, basic data dictionaries already existed to provide definitions of the structured data contained within the enterprise, even before the relational database was invented.


Your organization probably has some amount of metadata floating around. However, comprehensive Information Management Programs that maintain and fully leverage metadata are still rare and are typically only found in large financial institutions and governments (at the state/provincial level or federal level). That notwithstanding, successful Information Management Programs that rigorously maintain metadata almost always show huge returns on investment, although those ROI figures are often difficult to pin down or predict in advance. This raises the following questions:
1. Why is Metadata all of a sudden coming to the forefront now, and not before?
2. Why is it that we mainly see good metadata in large financial institutions and governments, and not as frequently in other verticals or in smaller organizations?
These are both excellent questions, and to understand the answer is to understand why you should at least be thinking about metadata as it relates to your own company. The more you understand about the value of metadata as seen through the prism of your own organization, the easier it will be for you to convince others of its inherent value.
To answer the first question, why Metadata is so important now: within IT, Metadata can be thought of as being closer to the top of Maslow's pyramid than to the bottom. As is commonly known, Maslow hypothesized that lower human needs must be addressed before higher needs. For example, there is no point in worrying about self-esteem if you cannot find food and shelter. However, a cursory glance in any bookstore will reveal that most people in Western society are more concerned about self-esteem than they are about obtaining basic food and shelter (not that food and shelter are unimportant, just that they do not directly occupy our thoughts). Thus, the higher needs have always existed, but they do not come to the forefront of our minds until the lower needs are satisfied. The same goes for Information Technology. Over the past 10 years alone we have seen the following major changes:
1. Workstation stability is significantly better. You probably don't see your computer crashing (i.e. "blue screen of death") as often as you did 10 years ago. Furthermore, applications are now being rolled out as intranet web applications that don't require installation, and thus do not require "house calls" to fix.
2. Server stability is significantly better. Most modern server applications run on a virtual environment (e.g. Java Virtual Machine or Microsoft’s Common Language Runtime). Thus failures for a particular user remain isolated, and rarely affect other users. Furthermore, most modern systems now come standard with failover technology or can be configured to be part of a grid or cluster. It is even possible to patch or upgrade databases while accepting transactions, with zero downtime!
3. System interfaces are more robust and flexible than ever before, and are typically standards based (e.g. ODBC, SOAP, etc.). Furthermore, most modern interfaces are designed to work over the internet which itself is a significantly more reliable network than previous proprietary point-to-point networks.
Therefore, as users of data we are spending far less of our time calling the helpdesk about "blue screens of death" and crashed servers, and are instead spending more of our time asking questions about the data itself. This in turn means that application support staff spend more of their time investigating questions pertaining to the meaning and understanding of data. In other words, IT spends less of its time making systems "work", and more of its time investigating the informational aspects of change.

An added problem now is that practically all organizations face the infamous "spreadmart" issue. Namely, users are extracting data from managed IT systems into desktop Excel spreadsheets or MS Access databases and copying these spreadsheets and local databases throughout the organization without also copying the data definitions behind the data (i.e. the Metadata). This creates a massive "broken telephone" situation.
Metadata provides us with the tools to address these problems. Metadata is also the cornerstone to sound Enterprise Architecture, and allows us to manage complexity and change in a cost effective manner.
To answer the second question as to why comprehensive Metadata is typically only found in large financial institutions and governments: First off, this is changing so it would be more accurate to say that financial institutions and governments in fact have the best metadata management practices. The reasons for this can be stated as follows:
1. These institutions are highly regulated and must be able to produce reports on short notice explaining every detail and provide traceability for the information they store and process.
2. Metadata requires strong governance and policy. Although there are a number of software products that can assist in the discovery of metadata, not to mention a large number of products designed to store metadata (i.e. metadata registries/repositories), Metadata requires sound governance through Data Stewardship to ensure that it is consistently managed. Conway's Law observes that organizations design systems that mirror their own communication structures; put loosely, "the process is the product", and this holds true for Metadata. If we allow Metadata to be managed in an unfettered way that ignores corporate standards, we will never be able to guarantee consistency across the enterprise. Financial institutions and governments tend to have very mature governance policies in place already and are experts at upholding governance. Furthermore, their cultures are more "command and control" in nature than other organizations, so there tends to be more buy-in for governance and a lower risk of dissent. That notwithstanding, it is possible for more nimble organizations to adapt governance to their environment and implement Information Management Programs [IMP] which can gain acceptance and provide a significant ROI while not requiring the full scope of IMPs found in more mature organizations. Furthermore, modern Metadata repositories can automate much of the governance through automatic role assignment and workflow.
3. Metadata management has been too expensive for most organizations to afford. Since Metadata software is not yet mainstream, and the know-how to implement an Information Management Program is scarce, not to mention the human capital required through Data Stewardship, only large organizations with huge IT budgets can realize the economies of scale that Metadata provides. Nevertheless, as awareness increases so will the know-how to deliver Metadata management, and this will be the driving factor to reduce costs - even more so than the drop in Metadata repository software license costs.
Many software vendors - particularly in the data warehousing and Business Intelligence market - already offer integrated Metadata products which provide some value. However, these offerings tend to come up short when attempting to harmonize data definitions across technology domains. For example, Cognos (a popular BI vendor) offers a Metadata repository to manage Metadata for data elements directly used by Cognos. But Cognos falls short of offering true enterprise "where is" search functionality (e.g. "where is all my customer data located?") since most information assets are located in other technology domains (e.g. mainframes or remote databases) that are out of reach of Cognos. In other words, if you're just concerned about Cognos reports, the Cognos Metadata Repository will serve you well, but if you're asking broader questions about the nature and location of information assets that do not touch Cognos, you will quickly hit a brick wall.
The software industry has developed a number of products to assist you in discovering Metadata, and in some instances even generating Metadata based on analysis of data flow, in particular through detailed analysis of ETL (Extract Transform Load) jobs. While these tools can certainly help answer questions regarding Legacy Systems, they are fundamentally just search engines and cannot actually manage the creation and maintenance of Metadata any more so than Google can manage information on The World Wide Web. Examples of these tools (or tools with this functionality) include:
1. ASG's ASG-Rochade
2. Informatica's SuperGlue
3. Sypherlink's Harvester
4. Metatrieval's Metatrieve
5. Data Advantage Group's MetaCentre
6. IBM's WebSphere MetaStage (which will soon be incorporating Unicorn's Metadata repository)
7. CA's AllFusion Repository for Distributed Systems
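
To make the idea of deriving Metadata from ETL jobs a little more concrete, here is a minimal Python sketch of a lineage graph that can answer a "where is" question across technology domains. It does not reflect how any of the products listed above work; the job list, the column names, and the function names are all invented for illustration.

```python
from collections import defaultdict, deque

# Hypothetical ETL job descriptions: each pair maps a source column
# to the target column it is loaded into.
ETL_JOBS = [
    ("mainframe.CUSTMAST.CUST_NM", "staging.customer.name"),
    ("staging.customer.name", "warehouse.dim_customer.customer_name"),
    ("warehouse.dim_customer.customer_name", "reports.sales_by_customer.customer"),
    ("hr.EMP.EMP_NM", "warehouse.dim_employee.employee_name"),
]


def build_lineage(jobs):
    """Build an adjacency list: source column -> list of target columns."""
    graph = defaultdict(list)
    for source, target in jobs:
        graph[source].append(target)
    return graph


def downstream(graph, start):
    """Return every column reachable from `start`, in breadth-first order."""
    found = []
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in graph.get(node, []):
            if target not in visited:
                visited.add(target)
                found.append(target)
                queue.append(target)
    return found


lineage = build_lineage(ETL_JOBS)
for column in downstream(lineage, "mainframe.CUSTMAST.CUST_NM"):
    print(column)
```

The same traversal doubles as a crude impact analysis: changing the mainframe column affects everything the walk returns. Note that the graph only records where data flows; it says nothing about what the data means, which is why discovery alone cannot replace managed definitions.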
From a software perspective, what is really desired is a Metadata repository that can classify data in a precise enough way so as to ensure data interoperability. The most tried-and-true classification scheme is ISO 11179, although the OMG MOF classification scheme is gaining acceptance. The main difference between these schemes is that MOF is more generic and can be used to catalogue unstructured data (e.g. documents, images, e-mail, etc.) as well as structured data, whereas ISO 11179 was designed to address the taxonomy of structured data (i.e. data elements). There is also the Dublin Core standard, which is primarily a classification scheme for documents, and is the most popular standard on the web for classifying HTML documents. R. Todd Stephens, director of Metadata Services at BellSouth, has in fact used the Dublin Core classification scheme with good success, although there are surely interoperability issues he must still face.
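
For readers who have not seen ISO 11179, the standard describes a data element as a data element concept (an object class plus a property) paired with a value domain. The Python sketch below is a loose simplification of that idea and not a conforming implementation; the class names, field names, and the `interoperable` rule are illustrative assumptions of mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElementConcept:
    object_class: str   # the kind of thing described, e.g. "Customer"
    property_name: str  # the characteristic described, e.g. "Name"


@dataclass(frozen=True)
class ValueDomain:
    datatype: str
    max_length: int


@dataclass(frozen=True)
class DataElement:
    concept: DataElementConcept
    domain: ValueDomain


# Both elements happen to share the same representation.
name_text = ValueDomain(datatype="string", max_length=100)

customer_name = DataElement(DataElementConcept("Customer", "Name"), name_text)
employee_name = DataElement(DataElementConcept("Employee", "Name"), name_text)


def same_representation(a: DataElement, b: DataElement) -> bool:
    """True when two elements are stored the same way."""
    return a.domain == b.domain


def interoperable(a: DataElement, b: DataElement) -> bool:
    """Simplified rule (an assumption): same meaning and same representation."""
    return a.concept == b.concept and a.domain == b.domain


print(same_representation(customer_name, employee_name))
print(interoperable(customer_name, employee_name))
```

The point of the split is that "customer name" and "employee name" can look identical in storage while still differing in object class, so a repository that records only datatype and length cannot tell them apart.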

A good repository should also provide business context through a built-in Enterprise Architecture registry, namely a way of relating data elements to pertinent business entities. A sampling of these entities might be:
1. Business missions
2. Business users
3. Business calendar cycles
4. Business locations
5. Business rules
6. Business processes
7. Business transactions
8. Etc.
If you're only interested in an Enterprise Architecture registry (hey, some people are), there are a number of standalone Enterprise Architecture registries which can do this. Most of these registries are centred on managing SOA services. To complicate matters there are also Configuration Management databases (CMDB) which are registries to manage physical IT assets such as servers, network switches, workstations, etc.
I suspect that there will be growing convergence between Metadata Registries, SOA Registries, Enterprise Architecture Registries, and possibly CMDB Registries (there is relatively little overlap between CMDB entities and Enterprise Architecture entities, since computer hardware tends to be opaque as far as the business is concerned). For the time being the following Enterprise Architecture and SOA Registries are available, so you may want to keep an eye on them as they have a lot of overlap with Metadata Registries and may begin to offer Metadata functionality:
1. Troux Technology's Metis
2. IBM's WebSphere Registry and Repository
3. BEA's Flashline
4. Infravio's (now webMethods) X-Registry
5. Mercury Interactive’s (now HP) IT Governance Foundation
Metadata that can be placed within an Enterprise Architecture context helps reduce costs as it shortens the time for impact analysis, as well as shortening employee ramp-up times. Support costs (for investigations) are also greatly reduced.


It is important to note that a structured and normalized repository can provide what might be called, for lack of a better term, Introspective Business Intelligence. You can apply Business Intelligence tools to derive new facts about what your business does (e.g. based on combining customer profile data with product sales data, you may find that the majority of your repeat customers are between 26 and 28 years of age). In the same way, with a well-structured Metadata repository you may derive or deduce facts about your business itself (e.g. based on combining IS/IT systems information with business calendar cycle information, you may be able to determine that the most quiescent time of the year is in August, and therefore plan maintenance activities accordingly). As another example, you may discover that certain data elements are in higher demand than previously considered, and should therefore be moved to higher availability systems, from an Information Lifecycle Management [ILM] perspective.
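
The August example can be sketched in a few lines of Python. Everything below is hypothetical: the business cycles, the months they are active, and the systems they depend on are invented sample data, and "load" is simply a count of cycle/system pairs active in a given month.

```python
# Hypothetical repository extract: each business calendar cycle lists the
# months (1-12) in which it is active and the systems it depends on.
BUSINESS_CYCLES = [
    {"cycle": "month-end close", "months": range(1, 13),
     "systems": ["general ledger"]},
    {"cycle": "quarterly reporting", "months": [1, 4, 7, 10],
     "systems": ["general ledger", "data warehouse"]},
    {"cycle": "tax filing", "months": [2, 3, 4],
     "systems": ["tax engine"]},
    {"cycle": "mid-year review", "months": [5, 6],
     "systems": ["HR system"]},
    {"cycle": "annual budgeting", "months": [9, 10, 11],
     "systems": ["planning tool", "general ledger"]},
    {"cycle": "holiday sales peak", "months": [11, 12],
     "systems": ["order entry", "data warehouse"]},
]


def monthly_load(cycles):
    """Count, per month, how many cycle/system pairs are active."""
    load = {month: 0 for month in range(1, 13)}
    for cycle in cycles:
        for month in cycle["months"]:
            load[month] += len(cycle["systems"])
    return load


def quietest_month(cycles):
    """Return the month number with the lowest load (earliest month on ties)."""
    load = monthly_load(cycles)
    return min(load, key=lambda month: (load[month], month))


print(quietest_month(BUSINESS_CYCLES))
```

With this sample data the lowest load falls in month 8, so a maintenance window would be scheduled for August. A real repository would of course weight systems by criticality rather than counting them equally.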
A great Metadata Repository will also understand Data Stewardship processes and roles and help automate these processes and manage these roles. Roles can be managed through built-in security, and processes can be managed through configurable workflows.
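
As a rough sketch of what a configurable stewardship workflow with role-based security might look like, consider the following Python fragment. The states, the roles, and the transition table are assumptions made for illustration and are not drawn from any specific repository product.

```python
# Hypothetical workflow configuration: each allowed state transition for a
# data definition, and the role that is permitted to perform it.
TRANSITIONS = {
    ("draft", "proposed"): "data analyst",
    ("proposed", "approved"): "data steward",
    ("proposed", "draft"): "data steward",
    ("approved", "retired"): "data steward",
}


def advance(current_state: str, new_state: str, role: str) -> str:
    """Move a definition to a new state if the workflow and role allow it."""
    required_role = TRANSITIONS.get((current_state, new_state))
    if required_role is None:
        raise ValueError(f"no transition from {current_state!r} to {new_state!r}")
    if role != required_role:
        raise PermissionError(
            f"{role!r} may not move a definition from "
            f"{current_state!r} to {new_state!r}"
        )
    return new_state


state = "draft"
state = advance(state, "proposed", role="data analyst")
state = advance(state, "approved", role="data steward")
print(state)
```

Because the rules live in a table rather than in code, an organization can tighten or relax its governance by editing the configuration alone.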
Finally, an excellent Metadata Repository should be extensible and modifiable as it is impossible to predict where the enterprise will go and what business entities will appear in the coming years.
The following Metadata repositories exhibit these desired traits:
1. Whitemarsh's Metabase
2. CA's AllFusion Repository for Distributed Systems
3. Data Foundation's OneData Registry
4. Oracle's Enterprise Metadata Manager

I have spent some time discussing the background and purpose of Metadata, why it is more relevant now than ever before, and how it can be stored and organized. However, I have spent very little time discussing the governance of metadata through Data Stewardship. Data Stewardship is by far the most important aspect of Metadata, and without it you have no consistent way of managing your Metadata. In my next post I will discuss Data Governance.

1 comment:

Unknown said...

Do you know Meta Integration® Model Bridge (MIMB) Solution?

Thanks,