I first heard about column-based data warehouses in the Feb. 25th print edition of Information Week. The article interviewed Michael Stonebraker (Stonebraker was one of the original architects of the relational database, and released the first low cost RDBMSs under the company Ingres) , and has fully embraced this new architecture which takes a fundamentally different approach to storing and retrieving structured tabular data.
It's hard to exaggerate the impact column-based database design is likely to have on the DBMS industry. You may have recalled the hype around object-oriented databases back in the 90s, but column-oriented databases are qualitatively different for the following two reasons:
- There is a pent-up demand for OLAP databases to reduce storage demands, and improve query performance.
- Column-based architecture is primarily a back-end change, and generally does not affect application or data architecture. It's a bit like going from a 32-bit OS to a 64-bit OS, but with significantly greater performance returns.
But what is the difference between traditional record-based DBMSs and column-based DBMSs? The difference is quite simple to understand, and according to Stonebraker can yield 50-fold performance increases over traditional record-based RDBMSs. While I have yet to test a column-based database myself, it's not hard to understand why such dramatic performance increases can be had. The reason is thus: Traditional databases (like the ones you're probably using now) treat records as a basic element of storage and retrieval. It's not possible to access an attribute of a record, without first retrieving the entire record. On the flip side, column-based architectures align data to the column (a.k.a. field or attribute). As such, when querying, only the data that is required to be analyzed, is retrieved. Furthermore, because data from a column perspective tends to be similar, column-based warehouses can also enjoy significant compression advantages. Many people [incorrectly] believe that compression always impedes access times. However, more often than not, compression can actually speed up retrieval times since there is less data to be moved from the data store through the bus.
This may seem a little academic, so let me illustrate with an example. Imagine I was a bank, and had a master customer table which stores the most pertinent attributes for each customer. For this example, lets say this table contains 100 columns. If I want to produce a simple report showing all customer accounts that have an excess of $10,000 in their savings account, along with their actual balance. Using current architectures, my query would retrieve all 100 columns for each customer that matched those search criteria, and then format the results to only show me the account number and balance. Let's say that there are 1,000,000 customers that match those criteria. As such, I'm effectively copying 100 columns for those million customer records into temporary storage (usually memory), so as to display only two columns (account number and account balance). A column-based data warehouse would instead take a different approach, and only retrieve the two columns which are required, ignoring the other 98 columns right from the get-go. In other words, I'm moving one fiftieth (1/50th) of the data. Furthermore, because account balances tend to be within a small range, and are compressed as a single unit, I am moving even less data through my bus. The end result is a dramatic performance increase. We're talking about the difference between a query executing in under an hour versus the same query taking more than a day to complete. With such huge discrepancies in performance, there are real business competitive advantages to be had here... for now.
I decided to take a quick look at the actual vendors selling these new column-based. I've summarized my key findings for six major vendors:
- Calpont's CNX Data Warehouse Platform is a drop-in solution for Oracle. Namely, the solution can be deployed as a standalone DBMS or as an optimization/acceleration layer into an existing DBMS environment. Currently Oracle is supported, with planned support for DB2.
- Infobright's Brighthouse boasts superior compression. Namely, Infobright claims it can achieve a 40:1 compression ratio. Contrast this with Oracle 11g, which uses record based storage, advertises a 2:1 compression ratio (although I've heard that depending on the data, it can actually achieve between a 3:1 and 4:1 ratio in tests). Furthermore, the company claims it leverages MySQL "making seamless use of the mature connectors, tools, and resources associated with this widely deployed open source database". From this it sounds like Brighthouse is effectively a forked version of MySQL. Also worth noting for my local readers is the fact that the business is based here in Toronto.
- ParAccel's Analytic Database boasts a Shared-nothing, MPP (massively parallel processing) architecture. MPP architectures are normally associated with data warehouse appliances, such as Netezza, but there's no reason why an appliance solution is required, so it's nice to finally see a vendor selling this technology.
- Sand Technology's SAND/DNA Analytics solution claims that no indexing or specialized schemas are required, and that this is a unique feature. I'm not sure about that last claim, but they are certainly emphasizing ease-of-use as a major selling feature.
- Sybase also has a columnar DB: Sybase IQ. Apart from being a recognized and stable vendor, Sybase IQ boasts the first petabyte benchmarks. Interestingly, their whitepaper discusses column-based encryption, which I've never seen before. As Sybase points out, this is ideal for data aggregators with mult-client services. However, I would say that Sybase's brand is probably their biggest advantage for the PHB crowd.
- Vertica is Michael Stonebraker's company. I particularly like their marketing materials as they provide real benchmarks on their web site here: http://www.vertica.com/benchmarks. They also emphasize that their technology runs on "green" grids of off-the-shelf servers, and they even have a hosted "cloud" solution. Personally, I'm a big fan of hosted solutions, and after all the catastrophes I've witnessed, I would argue that they're lower risk. But there are those that prefer to drive than fly since it gives them a sense of control over the situation, statistics be damned.