Thursday, June 05, 2008

Creative Destruction: Column based Data Warehouses. The next generation of data warehouse technologies has arrived.

I first heard about column-based data warehouses in the Feb. 25th print edition of Information Week. The article interviewed Michael Stonebraker, one of the original architects of the relational database, who released the first low-cost RDBMSs under the company Ingres. He has fully embraced this new architecture, which takes a fundamentally different approach to storing and retrieving structured tabular data.

It's hard to exaggerate the impact column-based database design is likely to have on the DBMS industry. You may recall the hype around object-oriented databases back in the 90s, but column-oriented databases are qualitatively different for the following two reasons:

  1. There is a pent-up demand for OLAP databases to reduce storage demands, and improve query performance.
  2. Column-based architecture is primarily a back-end change, and generally does not affect application or data architecture. It's a bit like going from a 32-bit OS to a 64-bit OS, but with significantly greater performance returns.

But what is the difference between traditional record-based DBMSs and column-based DBMSs? The difference is quite simple to understand, and according to Stonebraker can yield 50-fold performance increases over traditional record-based RDBMSs. While I have yet to test a column-based database myself, it's not hard to understand why such dramatic performance increases can be had. The reason is this: traditional databases (like the ones you're probably using now) treat the record as the basic unit of storage and retrieval. It's not possible to access an attribute of a record without first retrieving the entire record. On the flip side, column-based architectures align data to the column (a.k.a. field or attribute). As such, when querying, only the data that actually needs to be analyzed is retrieved.

Furthermore, because the values within a single column tend to be similar, column-based warehouses can also enjoy significant compression advantages. Many people [incorrectly] believe that compression always impedes access times. However, more often than not, compression can actually speed up retrieval, since there is less data to be moved from the data store through the bus.
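To make the compression point concrete, here is a minimal sketch using run-length encoding, one simple scheme that benefits from similar values sitting next to each other. It is an illustration only: I am not claiming any of the vendors below use this particular scheme, and the `account_type` column and its values are made up.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical column: the account type of eight customers, stored contiguously
# the way a column-based layout would keep them.
account_type = ["SAV", "SAV", "SAV", "SAV", "CHQ", "CHQ", "CHQ", "SAV"]

encoded = run_length_encode(account_type)
print(encoded)  # [('SAV', 4), ('CHQ', 3), ('SAV', 1)]

# The encoding is lossless: decoding gives back the original column.
assert run_length_decode(encoded) == account_type
```

Eight stored values shrink to three pairs here. In a record-based layout the same values would be interleaved with every other attribute of each customer, so runs like these would rarely line up.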

This may seem a little academic, so let me illustrate with an example. Imagine I was a bank, and had a master customer table which stores the most pertinent attributes for each customer. For this example, let's say this table contains 100 columns. Suppose I want to produce a simple report showing all customer accounts that have in excess of $10,000 in their savings account, along with their actual balance. Using current architectures, my query would retrieve all 100 columns for each customer that matched those search criteria, and then format the results to only show me the account number and balance. Let's say that there are 1,000,000 customers that match those criteria. As such, I'm effectively copying 100 columns for those million customer records into temporary storage (usually memory), so as to display only two columns (account number and account balance). A column-based data warehouse would instead take a different approach, and only retrieve the two columns which are required, ignoring the other 98 columns right from the get-go. In other words, I'm moving one fiftieth (1/50th) of the data. Furthermore, because account balances tend to be within a small range, and are compressed as a single unit, I am moving even less data through my bus. The end result is a dramatic performance increase. We're talking about the difference between a query executing in under an hour versus the same query taking more than a day to complete. With such huge discrepancies in performance, there are real business competitive advantages to be had here... for now.
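The bank example can be sketched in a few lines of Python. This is a toy model rather than how any real DBMS is implemented: both "stores" are plain in-memory lists, both queries do a full scan with no indexes, the balances are made up, and "values read" is just a counter standing in for data moved through the bus.

```python
NUM_COLUMNS = 100  # width of the hypothetical master customer table


def build_row_store(num_customers):
    """One tuple per customer: (account_no, balance, then 98 filler attributes)."""
    rows = []
    for i in range(num_customers):
        balance = 5_000 + (i % 4) * 5_000  # made-up balances: 5k, 10k, 15k, 20k
        rows.append((i, balance) + (0,) * (NUM_COLUMNS - 2))
    return rows


def build_column_store(rows):
    """The same data pivoted so that each column is stored as its own list."""
    return [list(column) for column in zip(*rows)]


def query_row_store(rows, threshold):
    """Record-based scan: every record is fetched whole before it is filtered."""
    values_read = 0
    result = []
    for row in rows:
        values_read += len(row)
        if row[1] > threshold:
            result.append((row[0], row[1]))
    return result, values_read


def query_column_store(columns, threshold):
    """Column-based scan: only the two columns the report needs are touched."""
    account_no, balance = columns[0], columns[1]
    values_read = len(account_no) + len(balance)
    result = [(a, b) for a, b in zip(account_no, balance) if b > threshold]
    return result, values_read


rows = build_row_store(1_000)
columns = build_column_store(rows)

row_result, row_read = query_row_store(rows, 10_000)
col_result, col_read = query_column_store(columns, 10_000)

assert row_result == col_result  # same report either way
print(row_read, col_read, row_read // col_read)  # 100000 2000 50
```

The 50:1 ratio falls straight out of the table width (100 columns stored versus 2 columns needed) and holds regardless of how many customers there are. It says nothing about compression, which would reduce the column-based figure further.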

I decided to take a quick look at the actual vendors selling these new column-based databases. I've summarized my key findings for six major vendors:
  1. Calpont's CNX Data Warehouse Platform is a drop-in solution for Oracle. Namely, the solution can be deployed as a standalone DBMS or as an optimization/acceleration layer into an existing DBMS environment. Currently Oracle is supported, with planned support for DB2.
  2. Infobright's Brighthouse boasts superior compression. Namely, Infobright claims it can achieve a 40:1 compression ratio. Contrast this with Oracle 11g, which uses record-based storage and advertises a 2:1 compression ratio (although I've heard that, depending on the data, it can actually achieve between a 3:1 and 4:1 ratio in tests). Furthermore, the company claims it leverages MySQL "making seamless use of the mature connectors, tools, and resources associated with this widely deployed open source database". From this it sounds like Brighthouse is effectively a forked version of MySQL. Also worth noting for my local readers is the fact that the business is based here in Toronto.
  3. ParAccel's Analytic Database boasts a shared-nothing, MPP (massively parallel processing) architecture. MPP architectures are normally associated with data warehouse appliances, such as Netezza, but there's no reason why an appliance solution is required, so it's nice to finally see a vendor selling this technology.
  4. Sand Technology's SAND/DNA Analytics solution claims that no indexing or specialized schemas are required, and that this is a unique feature. I'm not sure about that last claim, but they are certainly emphasizing ease-of-use as a major selling feature.
  5. Sybase also has a columnar DB: Sybase IQ. Apart from being a recognized and stable vendor, Sybase IQ boasts the first petabyte benchmarks. Interestingly, their whitepaper discusses column-based encryption, which I've never seen before. As Sybase points out, this is ideal for data aggregators with multi-client services. However, I would say that Sybase's brand is probably their biggest advantage for the PHB crowd.
  6. Vertica is Michael Stonebraker's company. I particularly like their marketing materials as they provide real benchmarks on their web site here: http://www.vertica.com/benchmarks. They also emphasize that their technology runs on "green" grids of off-the-shelf servers, and they even have a hosted "cloud" solution. Personally, I'm a big fan of hosted solutions, and after all the catastrophes I've witnessed, I would argue that they're lower risk. But there are those who prefer to drive rather than fly since it gives them a sense of control over the situation, statistics be damned.
It will be interesting to see what shakes out over time. I suspect that every DB vendor is currently working on a column-based solution of their own, so waiting is certainly an option.

Monday, June 02, 2008

On a lighter note, how do we change the culture of software development

I just noticed this ad on the back of a recent print edition of Information Week. I'd seen these ads before (mainly on CNET's web site), but not in print form in a magazine geared towards data management professionals.

The ad campaign looks like it's targeting 12-year-old boys, but I suppose it must be targeting the 12-year-old boy within us all... or geeky software developers. But before I start shooting fish in a barrel, I thought I'd share with you a very strange subtext to this mini-narrative. If you look closely into the crowd scene there is in fact one character who stands out. This of course is the busty bird woman I suppose our hero is fighting to impress. In case you missed it, here's the zoomed-in version below.
If anyone has any theories as to why this particular bird woman was chosen (maybe she is in fact concerned about the cow-man in the magic bubble) - I'd really love to know!

But for me, this ad is a reminder of how haphazard IT decisions tend to be, and how many still view IT as a crafts-based practice. It's a bit like those sugar cereal commercials that would run during Saturday morning cartoons, as the parental pressure points are well understood.

Most developers tend to be intelligent, analytical, and creative types. If they're lucky, they will work for a dynamic software company that embraces their talents in all phases of product management and design. However, most are not so lucky and end up working for bureaucratic IT departments with similar expectations. The developer's role here is merely to take requirements from a System Design Specification, code them in a development environment, and do a little unit testing to ensure the requirements are satisfied. That's all folks.

Yet so many developers I meet want to take on business analysis (well, the fun parts), project management (the fun parts), human factors (the fun parts). Sometimes these guys even want to take on data modeling (but usually treat the DB like a bit bucket for their in-memory data structures). These guys will also tell you about some new gizmo or technology which is on the cusp of solving all our problems, or will introduce us to the modern age. Of course all that Change Management, training, data interoperability, service management, risk management, quality management, and all that other stuff is just pointless busywork that gets in the way of true innovation.

But I actually relate to these guys and totally know where they're coming from. When they show up for their first week on the job, it's all sunshine and lollipops with so much optimism and hope in the air. I try my best to foster this passion and excitement, but also explain that large organizations tend to have role-based cultures, and don't always appreciate talented and knowledgeable individuals. I describe career paths which include business analysis, project management, and application architecture. But it's never as exciting or free as what they have in mind.

I personally remember a time not so long ago when systems were being developed on unstable, resource-constrained platforms with low-level languages (e.g. Windows 95 and C++). System stability was a major issue, and talented developers who could analyze a core dump file, or who knew the inner workings of memory management, or who could create helper systems to better recover from instabilities were invaluable. To this day I would reckon there is still value to these talents. However, most people can't really describe these talents, so their importance is greatly diminished now that Windows is a stable OS and software development has been highly abstracted.

But these killer developers were more than just killer debuggers; they were full-on renaissance men and women, capable of tracing a low-level _if_ statement all the way back up to a business rule, and even suggesting new business processes to accommodate hardware limitations (this still happens). These were the only people who had complete line of sight to all aspects of the business, and these were the people who held the true power.

My point is, with no new problems for these people to solve, they now must bring the business back down to their level of tools and technologies. So, I reckon that it is perhaps a cultural issue we must solve first, before we can tackle the obvious problems at hand. In the meantime, the vendors will be selling more sugar cereal. Can't get enough of those Sugar Smacks!

Tuesday, May 20, 2008

Unstructured Data and the Hunt for the Elusive Customer Service KPI / Looking for Work

The culture of data and analytics is starting to take root among the general population. Books like "Competing on Analytics" and "Supercrunchers" herald a new era of business whereby every little decision is vetted through intense fact-based scrutiny.

Well, it would seem to be that way. The truth is, most processes within most businesses (even the businesses discussed in the aforementioned books) are managed using crude or imprecise measures. The classic example for me is the new IT project. Such projects routinely introduce new business processes, data elements, and business rules. The success or failure of such projects is typically measured based on whether or not the project delivered on time and on budget. The operational ramifications are rarely considered. Interestingly, post-implementation costs [which account for at least 80% of the overall project's costs] are rarely measured either. Most people would say that the main reason for this is that it would require periodic follow-up assessments (i.e. "busywork"), and that the original project team has long since disbanded. I partially agree with this, but I would argue that there is a bigger issue here.

People (especially those in senior positions) are loath to be measured. Ask a VP, director, or manager if she likes the idea of measuring her team's performance and she will say "yes!". Ask the same person if she would like to be measured, and you'll get a long-winded answer as to why her performance can't be gauged using numeric measures, and should be based on a "360 review" with all manner of testimony and exhibits. Indeed, today's performance reviews are more like going on trial than a quantifiable measure of performance. It's okay to subject others to KPIs - just not me!

None of this should come as any surprise. But what is interesting is that the new generation of knowledge workers wants to be measured by KPIs, and they want better KPIs to be measured against. How do I know this? I know this because I routinely talk to people who are on the front line of customer service, and I ask them as many questions as I possibly can. Here is what I've learned, folks: the generation that grew up with the Internet (usually under the age of 28) and has been asked to work with a computer and a telephone quickly realizes that their job is being measured by a few simple KPIs, and that those KPIs can be quickly gamed. This is a generation that looks for inefficiencies on eBay, that has developed methods for finding free music and videos, that can quickly fact-check for discrepancies when BS is suspected. These people are playing massively multiplayer online games, are on every major social network, and are relentlessly logical and efficient when working with rules-based systems (i.e. companies).

It should come as no surprise that the new generation of customer service reps and other front-line knowledge workers are quick to find the path of least resistance when approaching their job. This is their comfort zone.

The bad news is that the current KPIs that measure these people are crude and blunt instruments. Take customer service as an example. There are two basic KPIs that get used. The first is "Average Handle Time", or the time it takes to get the customer off the phone. The second is "Number of Call-backs", or the degree to which the customer had his issue "resolved". That's basically it. Of course, most companies perform random audits to keep reps on their toes. While this "boogeyman" style of management keeps the train on its rails, it hardly provides any goals to aspire to. Even a callback can be made for any reason under the sun, and it may even be a satisfied customer calling back to spend more money (this contradiction between reality and metrics is often referred to as the "99 foot man paradox"). As for "Average Handle Time"? Well, I once heard a story about a bunch of call centre reps who had a nice scam going where they would simply hang up on every incoming call (effectively deflecting the caller to other CSRs). Until they were caught, they were being held up as model CSRs for their lightning efficiency.
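The hang-up scam is easy to reproduce with a little arithmetic. Here is a rough sketch, using a made-up call log and two hypothetical reps, of how "Average Handle Time" on its own rewards exactly the wrong behaviour. The `issue_resolved` flag is an assumption for illustration; as noted above, most call centres only approximate it through call-backs and audits.

```python
from statistics import mean

# Hypothetical call log: (rep, handle_time_seconds, issue_resolved)
calls = [
    ("alice", 420, True),
    ("alice", 380, True),
    ("alice", 510, True),
    ("bob", 5, False),  # bob simply hangs up on every caller
    ("bob", 4, False),
    ("bob", 6, False),
]


def average_handle_time(calls, rep):
    """Mean seconds per call for one rep: the KPI being gamed."""
    return mean(seconds for name, seconds, _ in calls if name == rep)


def resolution_rate(calls, rep):
    """Share of a rep's calls where the issue was actually resolved."""
    outcomes = [resolved for name, _, resolved in calls if name == rep]
    return sum(outcomes) / len(outcomes)


for rep in ("alice", "bob"):
    aht = average_handle_time(calls, rep)
    rate = resolution_rate(calls, rep)
    print(f"{rep}: AHT {aht:.0f}s, resolved {rate:.0%}")
```

Ranked by Average Handle Time alone, bob is the model CSR. Put any measure of actual resolution next to it and the picture flips.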

So how does one actually measure customer service so that someone "gaming" the KPI is forced to provide competent customer service at a reasonable cost? The answer depends a great deal on the company's mission. However, with today's technology there are a lot of options available, especially given that it's now standard to record each and every inbound call. Furthermore, the latest voice recognition software does an impressive job of recognizing the majority of words and phrases. Focusing on this data set alone, we can start pulling out some interesting KPIs. We would first need to convert this unstructured data into structured data sets. This is probably our most difficult task (on that note, Bill Inmon's "Tapping into Unstructured Data" is one of the better books on this subject). Once we have a handle on our CSR/customer conversation data, we can start mining for certain terms that would indicate satisfaction or dissatisfaction. I'm aware that it's even possible to capture a customer's mood and emotion through vocal tone analysis. In theory, you should even be able to separate out the CSR's tone and words from the customer's tone and words, for even more fine-grained analyses.
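As a very rough sketch of the term-mining step, assume the calls have already been transcribed to text. The term lists and sample transcripts below are invented for illustration; a real system would need a carefully tuned vocabulary and would have to cope with transcription errors, negation ("not great"), and so on.

```python
import re
from collections import Counter

# Hypothetical term lists, chosen only to illustrate the idea.
SATISFIED_TERMS = {"thanks", "great", "perfect", "resolved"}
DISSATISFIED_TERMS = {"cancel", "frustrated", "supervisor", "ridiculous"}


def score_transcript(transcript):
    """Turn one free-text transcript into a small structured record."""
    words = Counter(re.findall(r"[a-z']+", transcript.lower()))
    positive = sum(words[term] for term in SATISFIED_TERMS)
    negative = sum(words[term] for term in DISSATISFIED_TERMS)
    return {"positive": positive, "negative": negative, "net": positive - negative}


# Made-up customer-side transcripts, keyed by a hypothetical call id.
transcripts = {
    "call-001": "Thanks, that's perfect. Great service.",
    "call-002": "This is ridiculous. I want a supervisor or I will cancel.",
}

for call_id, text in transcripts.items():
    print(call_id, score_transcript(text))
# call-001 {'positive': 3, 'negative': 0, 'net': 3}
# call-002 {'positive': 0, 'negative': 3, 'net': -3}
```

The output is one structured row per call, which is the point: once the conversation is reduced to columns like these, it can be averaged per rep and tracked over time like any other KPI.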

I am not saying that it would be easy to establish a KPI to objectively measure customer satisfaction. Rather, I am saying that it's not hard to improve upon our current KPIs.

But what's the point? First off, customer service is generally pretty bad these days. This may even have something to do with CSRs gaming the existing KPIs, especially "Average Handle Time". It also has to do with the complexity of services being offered these days, and the natural frustration that goes along with an excess of business rules that we can't possibly comprehend (there is a greater problem here that I don't have time to get into in this post). My point in all of this is to say that we are headed towards a culture of analytics whether we like it or not, and if our KPIs have flaws in them, then they will be manipulated against us by our customers and employees. The world we live in is complex and nuanced, and one of our best tools for managing this is simplification through the use of comprehensive indexes. We just need to get better at designing our KPIs, and ensure they provide us with maximum goal congruence.

---

On a completely unrelated note, I have just rolled off a big project, and am looking for new work. My area of expertise is in data management and enterprise architecture, but I'm also very nuts-and-bolts, and enjoy doing everything from: application support; software development; data modeling; requirements gathering; system sourcing and selection; process architecture; change management; external data procurement and alignment; data warehousing & BI; metadata management; data governance policy; IT strategy; marketing analytics; and pretty much anything else technology or information related you can think of.

I'm based out of Toronto, but am willing to travel and work anywhere in the world where there's interesting work. I am incorporated, and prefer contract work, but would also consider full time work if there's a good fit.

If you know of anything that you think I might be interested in, feel free to contact me at: neil@hepburndata.com

Thursday, April 10, 2008

David Letterman can teach us a thing or two about BI

I've been working a lot these days with data visualizations and presentation reports. I must admit that I've learned a thing or two about how people approach data, both from the IT side and the business side. However, after looking at dozens and dozens of data visualizations and executive reports, I have realized that there are effectively two kinds of reports that you can present to an executive, and we should approach and understand them accordingly.

The first kind of report is what I describe as an "entertaining report". These reports rely heavily on data visualizations, and while they can and should convey information, their primary purpose is to grab your attention (i.e. entertain you), over and above driving decisions. Since human beings are instinctively visual beasts, we have a soft spot for these types of reports. We rarely know what to do with the information we see in these fancy presentations, but we love it all the same as it speaks to us emotionally. The old saying "seeing is believing" is as true now as it has ever been.

The second kind of report is what I would describe as "decision driving". These reports tend to be bland ordered lists, with numbers. However, these reports are not only the most important in driving decisions, but due to their abstract nature (and our lack of understanding of the decision maker's predicament) are very difficult to get right the first time. In fact, these reports tend to be an afterthought, since we tend to occupy our imagination with the more wondrous data visualizations, and would rather avoid trying to understand the messy world that the decision-making manager has to live in. In fact, I'm sure I've even seen some IT folk sneer at the decision makers for not appreciating their glorious art. I've probably sneered myself at one time.

Going one step further, if we ask ourselves how decisions typically get enacted in business, and look at how people take on decisions, we can see that there is a desire to streamline decision making. Managers are expected as part of their role to make decisions on a regular basis. However, because new decisions represent risk, this in turn leads to stress. So, if we can help managers make better decisions without increasing their stress, this is what we should strive for.

I believe that the top 10 (or bottom 10) list is an excellent framework for streamlining decision making. In fact it's so popular, that it is this tool we use to manage our own lives. I maintain my own to-do lists each day. If I need to go grocery shopping, I always have a shopping list in hand. If I need to get my personal spending down, I take a look at my biggest expenses and attack those in order. In other words, the top 10 list provides a framework for grouping decisions together, and therefore making each subsequent decision easier to tackle. Furthermore, since we already know that list items get easier as we go down the list, the entire set of decisions seems less daunting since we can get into a groove and track our progress.

But let's do a thought experiment to give you a better idea of where I'm coming from. Let's say you were the mayor of Toronto, and you had pledged to reduce crime. You might think to first get a grasp of where all the crime is happening. You hire a consultant to explain this to you. The consultant comes back one month later with an impressive heat-map of the City of Toronto showing in excruciating detail where all the crime hotspots are. As the city mayor you recognize all the neighbourhoods, and probably aren't too surprised by what you see. However, you will feel wiser seeing this map as you can now visualize where the crime is taking place (well, you might think you can visualize it). Great! Now it's decision time. You need to make some hard spending decisions as to where you want to allocate social spending programs, improve community safety, and boost law enforcement. Is this map sufficient for you to sign these decisions into the budget? Perhaps you as mayor could request more heat-maps showing different types of crime like homicide or grand larceny? Maybe an animated time-series map showing the spread of crime might help more? Do you feel confident allocating millions of dollars based on moving blotches on a map, even knowing that those blotches are conveying the truth? Probably not.

I suspect at this point you will want to start generating good old fashioned lists. You might want to see: Top ten neighbourhoods, as ranked by a blended crime index. Or maybe, top ten neighbourhoods, as ranked by velocity of increase in crime over the past 4 years.
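Producing lists like these is mechanically trivial once the numbers are in a table, which is part of their appeal. Here is a small sketch, with placeholder neighbourhood names and entirely made-up figures (incidents per 1,000 residents), that ranks the same data two ways: by current rate and by increase over four years. A real "blended crime index" would weight several crime types; a single rate stands in for it here.

```python
# Made-up figures for placeholder neighbourhoods: incidents per 1,000 residents.
crime_rate = {
    "Neighbourhood A": {2004: 30.0, 2008: 45.0},
    "Neighbourhood B": {2004: 50.0, 2008: 52.0},
    "Neighbourhood C": {2004: 10.0, 2008: 28.0},
    "Neighbourhood D": {2004: 40.0, 2008: 38.0},
}


def top_n(scores, n):
    """Return the n highest-scoring (name, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


# List 1: ranked by the most recent rate.
current = {name: rates[2008] for name, rates in crime_rate.items()}

# List 2: ranked by how much the rate rose over the four years.
increase = {name: rates[2008] - rates[2004] for name, rates in crime_rate.items()}

print(top_n(current, 3))
# [('Neighbourhood B', 52.0), ('Neighbourhood A', 45.0), ('Neighbourhood D', 38.0)]
print(top_n(increase, 3))
# [('Neighbourhood C', 18.0), ('Neighbourhood A', 15.0), ('Neighbourhood B', 2.0)]
```

Notice that the two lists disagree about who is on top. Choosing which ranking matters is the decision maker's call, and that is exactly the conversation a list forces and a heat-map lets you avoid.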

There is no shortage of these top 10 lists you could produce, but the whole time you're dealing with unambiguously ranked neighbourhoods, supported by hard numbers. As a compromise, I might say that you could add a simple bar chart visualization to help make some numeric comparisons a bit easier. Either way, you will need to boil things down to a list of some sort, since you will need to verbally articulate the decision you made. What sounds better: "I have allocated increased spending to: Jane & Finch; Rexdale; and Regent Park, as they currently have the highest indexed crime per capita for the past 5 years standing, according to Statistics Canada". Or, would you rather say: "If you could have seen the map I saw, you would know to allocate funding to Jane & Finch; Rexdale; and Regent Park". Yes, if you're lucky, you might get to hold up the map, but then you would be forced to explain its legend. And if the three neighbourhoods are visually similar to a few other neighbourhoods from a heat-map perspective, then you might be in the awkward situation of squinting your eyes and saying "Well, in my opinion, this blotch looks slightly larger than that blotch". However, if there was a clear winner, then maybe the map wouldn't be so bad? But if there was a clear winner you could state that more clearly in verbal terms.

Show me a data visualization, and I'll show you a top 10 list which does a better job at driving decisions.

However, with all that said, you might be led to believe there are some exceptions to what I am saying. For example, experienced meteorologists are able to make reasonably accurate predictions by visually studying animated satellite imagery of weather patterns. Touché! But I'm not sure if I would even categorize this as BI, since the data never got beyond the video stage into a "fact-based" database. The same goes for military intelligence studying satellite imagery. Once again, the photos are being analyzed as-is for enemy presence.

What I am saying is that the name of the game is to figure out which top 10 lists (actually, top 5 might be better given the limits of our short-term memory) best drive decisions. Do that, and you will have saved everyone time and made managers' lives so much easier. However, the hard part is getting into the head of the decision maker. If you cannot understand what the decision maker is confronted with, you will just be throwing darts at a board. Who knows, maybe you'll get lucky.

Thank you David Letterman. You know us so well!