Thursday, June 05, 2008

Creative Destruction: Column based Data Warehouses. The next generation of data warehouse technologies has arrived.

I first heard about column-based data warehouses in the Feb. 25th print edition of Information Week. The article interviewed Michael Stonebraker (Stonebraker was one of the original architects of the relational database, and released the first low cost RDBMSs under the company Ingres) , and has fully embraced this new architecture which takes a fundamentally different approach to storing and retrieving structured tabular data.

It's hard to exaggerate the impact column-based database design is likely to have on the DBMS industry. You may have recalled the hype around object-oriented databases back in the 90s, but column-oriented databases are qualitatively different for the following two reasons:

  1. There is a pent-up demand for OLAP databases to reduce storage demands, and improve query performance.
  2. Column-based architecture is primarily a back-end change, and generally does not affect application or data architecture. It's a bit like going from a 32-bit OS to a 64-bit OS, but with significantly greater performance returns.

But what is the difference between traditional record-based DBMSs and column-based DBMSs? The difference is quite simple to understand, and according to Stonebraker can yield 50-fold performance increases over traditional record-based RDBMSs. While I have yet to test a column-based database myself, it's not hard to understand why such dramatic performance increases can be had. The reason is thus: Traditional databases (like the ones you're probably using now) treat records as a basic element of storage and retrieval. It's not possible to access an attribute of a record, without first retrieving the entire record. On the flip side, column-based architectures align data to the column (a.k.a. field or attribute). As such, when querying, only the data that is required to be analyzed, is retrieved. Furthermore, because data from a column perspective tends to be similar, column-based warehouses can also enjoy significant compression advantages. Many people [incorrectly] believe that compression always impedes access times. However, more often than not, compression can actually speed up retrieval times since there is less data to be moved from the data store through the bus.

This may seem a little academic, so let me illustrate with an example. Imagine I was a bank, and had a master customer table which stores the most pertinent attributes for each customer. For this example, lets say this table contains 100 columns. If I want to produce a simple report showing all customer accounts that have an excess of $10,000 in their savings account, along with their actual balance. Using current architectures, my query would retrieve all 100 columns for each customer that matched those search criteria, and then format the results to only show me the account number and balance. Let's say that there are 1,000,000 customers that match those criteria. As such, I'm effectively copying 100 columns for those million customer records into temporary storage (usually memory), so as to display only two columns (account number and account balance). A column-based data warehouse would instead take a different approach, and only retrieve the two columns which are required, ignoring the other 98 columns right from the get-go. In other words, I'm moving one fiftieth (1/50th) of the data. Furthermore, because account balances tend to be within a small range, and are compressed as a single unit, I am moving even less data through my bus. The end result is a dramatic performance increase. We're talking about the difference between a query executing in under an hour versus the same query taking more than a day to complete. With such huge discrepancies in performance, there are real business competitive advantages to be had here... for now.

I decided to take a quick look at the actual vendors selling these new column-based. I've summarized my key findings for six major vendors:
  1. Calpont's CNX Data Warehouse Platform is a drop-in solution for Oracle. Namely, the solution can be deployed as a standalone DBMS or as an optimization/acceleration layer into an existing DBMS environment. Currently Oracle is supported, with planned support for DB2.
  2. Infobright's Brighthouse boasts superior compression. Namely, Infobright claims it can achieve a 40:1 compression ratio. Contrast this with Oracle 11g, which uses record based storage, advertises a 2:1 compression ratio (although I've heard that depending on the data, it can actually achieve between a 3:1 and 4:1 ratio in tests). Furthermore, the company claims it leverages MySQL "making seamless use of the mature connectors, tools, and resources associated with this widely deployed open source database". From this it sounds like Brighthouse is effectively a forked version of MySQL. Also worth noting for my local readers is the fact that the business is based here in Toronto.
  3. ParAccel's Analytic Database boasts a Shared-nothing, MPP (massively parallel processing) architecture. MPP architectures are normally associated with data warehouse appliances, such as Netezza, but there's no reason why an appliance solution is required, so it's nice to finally see a vendor selling this technology.
  4. Sand Technology's SAND/DNA Analytics solution claims that no indexing or specialized schemas are required, and that this is a unique feature. I'm not sure about that last claim, but they are certainly emphasizing ease-of-use as a major selling feature.
  5. Sybase also has a columnar DB: Sybase IQ. Apart from being a recognized and stable vendor, Sybase IQ boasts the first petabyte benchmarks. Interestingly, their whitepaper discusses column-based encryption, which I've never seen before. As Sybase points out, this is ideal for data aggregators with mult-client services. However, I would say that Sybase's brand is probably their biggest advantage for the PHB crowd.
  6. Vertica is Michael Stonebraker's company. I particularly like their marketing materials as they provide real benchmarks on their web site here: http://www.vertica.com/benchmarks. They also emphasize that their technology runs on "green" grids of off-the-shelf servers, and they even have a hosted "cloud" solution. Personally, I'm a big fan of hosted solutions, and after all the catastrophes I've witnessed, I would argue that they're lower risk. But there are those that prefer to drive than fly since it gives them a sense of control over the situation, statistics be damned.
It will be interesting to see what shakes out over time. I suspect that every DB vendor is currently working on a column-based solution of their own, so waiting is certainly an option.

Monday, June 02, 2008

On a lighter note, how do we change the culture of software development

I just noticed this ad on the back of a recent print edition of Information Week. I'd seen these ads before (mainly on CNET's web site), but not in print form in a magazine geared towards data management professionals.

The ad campaign looks like it's targeting 12 year old boys, but I suppose it must be targeting the 12 year old boy within us all... or geeky software developers. But before I start shooting fish out of a barrel, I thought I share with you a very strange subtext to this mini-narrative. If you look closely into the crowd scene there is in fact one character who stands out. This of course is the busty bird woman I suppose our hero is fighting to impress. In case you missed it, here's the zoomed in version below.
If anyone has any theories as to why this particular bird woman was chosen (maybe she is in fact concerned about the cow-man in the magic bubble) - I'd really love to know!

But for me, what this ad is is a reminder of how haphazard IT decisions tend to be, and how many still view IT as a crafts-based practice. It's a bit like those sugar cereal commercials that would run during Saturday morning cartoons, as the parental pressure points are well understood.

Most developers tend to be intelligent, analytical, and creative types. If they're lucky, they will work for a dynamic software company that embraces their talents in all phases of product management and design. However, most are not so lucky and end up working for bureacratic IT departments with similar expectations. The developers role here, is to merely take requirements for a System Design Specification, code them in a development environment, and do a little unit testing to ensure the requirements are satisfied. That's all folks.

Yet so many developers I meet want to take on business analysis (well, the fun parts), project management (the fun parts), human factors (the fun parts). Sometimes these guys even want to take on data modeling (but usually treat the DB like a bit bucket for their in-memory data structures). These guys will also tell you about some new gizmo or technology which is on the cusp of solving all our problems, or will introduce us to the modern age. Of course all that Change Management, training, data interoperability, service management, risk management, quality management, and all that other stuff is just pointless busywork that gets in the way of true innovation.

But I actually relate to these guys and totally know where they're coming from. When they show up for their first week on the job, it's all sunshine and lollilops with so much optimism and hope in the air. I try my best to foster this passion and excitement, but also explain that large organizations tend to have role-based cultures, and don't always appreciate talented and knowledgable individuals. I describe career paths which include business analysts, project management, and application architecture. But it's never as exciting or free as what they have in mind.

I personally remember a time not so long ago where systems were being developed on unstable resource constrained platforms with low level languages (e.g. Windows 95 and C++). System stability was a major issue, and talented developers who could analyze a core dump file, or who knew the inner works of memory management, or who could create helper systems to better recover from instabilities were invaluable. To this day I would reckon there is still value to these talents. However, most people can't really describe these talents, so their importance is greatly diminished now that Windows is a stable OS, and that software development has been highly abstracted.

But these killer developers were more than just killer debuggers, they were full on rennaisance men and women, capable of tracing a low level _if_ statement all the way back up to a business rule, and even suggesting new business processes to accomodate hardware limitations (this still happens). These were the only people who had complete line-of-site to all aspects of the business, and these were the people who held the true power.

My point is, with no new problems for these people to solve, they now must bring the business back down to their level of tools and technologies. So, I reckon that it is perhaps a cultural issue we must solve first, before we can tackle the obvious problems at hand. In the meantime, the vendors will be selling more sugar cereal. Can't get enough of those Sugar Smacks!