Saturday, August 26, 2017

A Modular Data System for Digital Advertising

This post is a follow-up to my original posting (and paper) titled “A Modular Approach to Solving the Data Variety Problem”.

In response to that posting, a LinkedIn commenter (Mark B.) asked how he might use a modular approach to build a data analysis system to handle the following [paraphrased] scenario:
“As a digital marketer, I would like to see how the variation in advertising images are related to responses by different audiences.”

Thank you for this question, Mark. Since you have identified two Subjects, Images and Advertisements, this is an ideal jumping-off point to illustrate the benefits of taking a modular approach to analytics.

To give you the short answer, using a modular approach we can ask and answer cross-subject questions that would normally be prohibitively expensive to answer:

  • “What images give me the best click-thru and conversion rates?”
  • “Do older images have the same click-thru rate as newer images?”
  • “Including the cost of image production, what is the overall cost of my Ad Campaigns?”
  • “Is there any relationship between the cost of an image and its click-thru and conversion rate by gender?”
  • “Do images with a positive sentiment perform better than those with a neutral or negative sentiment?”

How does a modular approach allow us to answer these questions so easily? It all comes down to leveraging the Dimensions and Measures already developed for each Subject on its own (i.e. Images and Ad Impressions), and then combining those Subjects into a unified multi-Subject Graph that can be easily queried.

Recapping my paper, if you take a modular approach to analytics, you can decompose your analyses into separate “Subjects” (tables), and then further decompose those Subjects into Subsets. Each of these sub-components can be developed independently of the others.  Once these components (stored as portable data files) are “docked in” to the main repository, they can be “lobbed” and “linked” together by users to form graphs that allow for cross-subject analyses.

Let’s first break this down into the two subjects at hand: Images and Ad Impressions.

Let’s now tackle the first Subject, “Images”. We may have a team responsible for developing reports to analyze Image statistics.  For example, this team may have developed a set of Dimensions and Measures that allows them to determine how much Images cost to produce, how old they are, and what type of sentiment they are intended to produce.  Since images would presumably be developed by different teams, each team would have its own report (represented as a table).  Since each team’s report would conform to a standard published schema, these reports could be combined to form a single cross-department report.  For example, “Team A” and “Team B” could combine their image reports into a single “Image” Subject table.
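As a rough sketch (the table and column names here are hypothetical, not taken from the paper), combining two teams’ conforming image reports into a single Image Subject could be as simple as stacking identically shaped tables:

-- Each team publishes an image report conforming to the same schema.
-- Stacking ("lobbing") them produces the combined Image Subject.
SELECT ImageID, TeamName, ProductionCost, CreatedDate, Sentiment
FROM   TeamA_ImageReport
UNION ALL
SELECT ImageID, TeamName, ProductionCost, CreatedDate, Sentiment
FROM   TeamB_ImageReport;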

Moving on to the second Subject, “Ad Impressions”: again, there may be multiple teams running multiple advertising campaigns across multiple advertising platforms over several months.  The teams responsible for managing these ad campaigns might even differ based on the Ad Campaign or the Digital Advertising Platform the ads are being served on.  Like the Image teams, these advertising teams would also have a set of Dimensions and Measures that would allow them to determine how often an ad was clicked on, how many conversions (e.g. goal actions) there were, what the dollar amount of those conversions is, and how these metrics break out by gender and other demographic & psychographic variables (which may be specific to the ad platform).  Again, since each team’s report would conform to a published schema, they could be combined to form a single report.  This combined “report” would constitute the “Ad Impression” Subject.

I have just described two different Subjects, each with its own set of Dimensions and Measures, and each composed of its own Sub-Sets of data.  Where the modular approach becomes relevant is that users can now locate these Sub-Sets, “lob” them into larger Subjects, and then “link” those Subjects to form graphs that allow for cross-Subject analyses.  Namely, we can now ask and answer the questions raised near the beginning of this post:

  • “What images give me the best click-thru and conversion rates?”
  • “Do older images have the same click-thru rate as newer images?”
  • “Including the cost of image production, what is the overall cost of my Ad Campaigns?”
  • “Is there any relationship between the cost of an image and its click-thru and conversion rate by gender?”
  • “Do images with a positive sentiment perform better than those with a neutral or negative sentiment?”

However, there is one piece missing from the picture: In order to make this possible, we would need to define a simple “bridge” table for connecting the image profiles to the ad impressions.  This bridge table would be developed and maintained by the team that has access to the information required to link the two subjects together.
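As a minimal sketch (the names and data types below are my own assumptions, not taken from the original design), such a bridge table might look like this:

-- Maps each ad placement to the image it displays.
-- Maintained by the team that knows which image each ad uses.
CREATE TABLE Image_AdImpression_Bridge (
    ImageID VARCHAR(50) NOT NULL,  -- links to the Image Subject
    AdID    VARCHAR(50) NOT NULL,  -- links to the Ad Impression Subject
    PRIMARY KEY (ImageID, AdID)
);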


The following diagram shows how sub-sets sharing the same schema (as depicted with their own colour) can be “lobbed” together to form larger subjects, and how subjects sharing a linking column can be SEMI-JOIN linked together to form a graph for cross-subject analytics.
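To give a flavour of the kind of cross-Subject query this enables, here is a hedged sketch (hypothetical table and column names) answering “What images give me the best click-thru rate?”. It aggregates the Ad Impression Subject to one row per image before joining, which avoids the “Fan Trap” duplication mentioned below:

-- Aggregate Ad Impressions per image first (fan-trap safe),
-- then join the result back to the Image Subject.
SELECT  i.ImageID,
        i.Sentiment,
        a.Clicks * 1.0 / a.Impressions AS ClickThruRate
FROM    Image AS i
JOIN   (SELECT b.ImageID,
               SUM(ai.Impressions) AS Impressions,
               SUM(ai.Clicks)      AS Clicks
        FROM   AdImpression AS ai
        JOIN   Image_AdImpression_Bridge AS b ON b.AdID = ai.AdID
        GROUP BY b.ImageID) AS a ON a.ImageID = i.ImageID
ORDER BY ClickThruRate DESC;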


Astute readers might point out that there is nothing preventing a determined analyst with access to the underlying data from answering the same questions. While it is true that the end result can be achieved through current approaches, these approaches tend to be prohibitively expensive. Here is what is different about the modular approach:

  • Users can integrate data through user-friendly graphical interfaces allowing them to vertically “lob” Sub-Sets into Subjects and then horizontally link those customized Subjects without fear of introducing duplicates through the common “Fan Trap” problem that bogs down most data integration efforts
  • Users can independently develop new Subjects and Subject Sub-Sets, and then “dock in” those Subjects and Sub-Sets in a self-serve manner, without relying on IT assistance, while still conforming to enterprise data governance rules, thus protecting Metadata Integrity and Data Integrity and allowing data to be safely located and integrated by other users
  • Users can “time travel” by choosing an older “AS-OF” date and time, and performing analyses across data that was current as of that date
  • Data files are portable and can be potentially moved to wherever they are needed for either analysis or downstream processing
    • An example file name, containing data from the first Ad Impression Subject Sub-Set (as shown in the above diagram), might be: “AdImpression_V1_CAMDAPMON_SF-G-2017-04_AS-OF 2017-08-26 153100.csv”

On top of all of this, other Subjects such as “Web Session” could be “docked in” to the larger repository, allowing Data Analysts to incorporate any Dimensions and Measures developed for the “Web Session” Subject (e.g. ‘Session Duration’) into analyses relating to Images and Ad Impressions.  For example, we could ask and answer the question “What images have not been used for the past 7 days of Web Sessions?”
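A hedged sketch of that last question (again with hypothetical table and column names, and SQL Server date functions), assuming the Web Session Subject records which image each session was served:

-- Images with no appearance in any Web Session over the last 7 days.
SELECT i.ImageID
FROM   Image AS i
WHERE  NOT EXISTS (
        SELECT 1
        FROM   WebSession AS ws
        WHERE  ws.ImageID = i.ImageID
          AND  ws.SessionStart >= DATEADD(DAY, -7, GETDATE())
       );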

This example provides a small glimpse into how a modular approach to data management opens up new analytical opportunities that would normally not survive cost/benefit analysis using current approaches.

Sunday, August 06, 2017

A Modular Approach to Solving The Data Variety Problem

I have been working on a paper in my spare time on the weekends for a number of months now.  My goal with this paper is to change the thinking around data and ultimately bridge the chasm between how IT and the Business think about data management and in particular Data Warehousing and Business Analytics.

I am publishing the full paper here as a PDF and will publish portions of this paper in piecemeal over the coming days and weeks beginning with the Executive Summary.

I encourage readers to share this paper and discuss the ideas contained within.  I also encourage readers to send their feedback.  Since I am a human being and am as sensitive to criticism of my work as the next person, I only ask that you couch any negative criticism in a way that is civil.
Based on feedback, I may create new versions of this paper which you can easily distinguish by the paper's AS-OF date.

Before I sign off,  I would like to thank Jane Roberts for her time in reviewing this paper and for her contributions. Thank you Jane!

Here is the paper:
A Modular Approach to Solving The Data Variety Problem AS-OF 2017-08-06

Executive Summary
Big Data has fully captured the popular imagination. Companies like Google, Facebook, Apple, Amazon, and Microsoft process petabytes of data daily. Limits that once seemed impossible are now the new normal. In spite of this, analysts and managers still struggle to answer unexpected questions at executive speed.

The reason is that while much attention has been given to these remarkable data volumes, a different but related problem has come into sharp focus: The Data Variety Problem. Namely, organizations continue to struggle to manage and query the ever increasing variety of data originating from sources including, but not restricted to: IT controlled systems (e.g. ERPs, PoS systems, subscriber billing systems, etc.); 3rd Party managed systems (e.g. cloud CRMs, cloud marketing DMPs); and Business controlled departmental tracking spreadsheets, grouping lookup tables, and adjustment tables.

The approach to locating, obtaining, and integrating these sources of data is highly manual. Case-in-point: in the Alteryx-commissioned study “Advanced Spreadsheet Users Survey”, published in December 2016, IDC discovered that “$60 billion [is] wasted in the U.S. every year by advanced spreadsheet users.” Yet this report only provides a small glimpse into the problem and misses the bigger opportunity: organizations urgently need a one-size-fits-all ‘happy path’ for consuming and producing an ever-accelerating increase in the variety of structured data.

In this paper, drawing on algebra, computer science, systems thinking, history, psychology, and 20 years of experience as a data practitioner, Neil Hepburn posits that the best long-term approach to addressing this problem is to embrace a modular data warehousing system. Neil goes on to describe how such a modular data warehousing system could be designed and built using readily available tools and technology, and what challenges must be overcome to realize this vision.


Wednesday, December 28, 2011

Perspective is Everything: Why even the most intelligent software architects don't understand the Relational Model

A few weeks ago I stumbled on this article "A co-Relational Model of Data for Large Shared Data Banks" ("coRel" hereon) in the on-line version of ACM Queue (this is the Association of Computing Machinery's magazine). The article was authored by two employees (Erik Meijer and Gavin Bierman) of Microsoft.

The authors summarize their thesis thus: "Contrary to popular belief, SQL and noSQL are really just two sides of the same coin." The problem with this article is that the authors are asking entirely the wrong question. They are looking at the world with a very narrow and single-minded perspective of data - a perspective which was conventional wisdom up until the 1970s (academically), and the 1980s (commercially).

In technical terms, the authors are basically asking this question: If it is possible to implement a Network Model using a Key/Value [NoSQL] database, and it is possible to implement a Network Model using a Relational database, then can the two be queried and modified by the same declarative language? The answer to this question is a resounding yes. Unfortunately, the Network Model is not perspective neutral, which is why the Relational Model was invented.

Backing up a bit, allow me to explain what I mean. The Relational Model takes a perspective-neutral approach, and regards all entities, no matter how insignificant they may seem, as "first class citizens". Other data models such as the Network Model and Hierarchical Model lock the data into a given perspective and make certain entities "first class citizens" and others "second class citizens". For example, if you have ever organized your inbox e-mails into folders, or documents on your computer into folders, you have probably chosen a certain hierarchy. Maybe you organized your folders by customer, so that when a customer asks a question, you can quickly go to the right folder and find all the necessary information. But what if a project manager comes to you and starts asking questions about a particular project, and that project cuts across customers? What normally happens here is you start searching through each customer folder looking for e-mails or documents that pertain to the project. Most people will just copy those project documents (or create shortcuts to them) in another folder. We've all been through these searching and sorting exercises. When software developers are confronted with the same problem, they pretty much do the same thing - they reorganize (or refactor) the data.

However, if the data were Normalized (i.e. modeled relationally) to begin with, no such reorganization would be necessary.
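To illustrate (with hypothetical tables, not anything from the article), a normalized model lets both the customer view and the project view be expressed as queries over the same rows, so nothing ever needs to be reshuffled:

-- Documents relate to customers and to projects independently.
CREATE TABLE Customer (CustomerID INT NOT NULL PRIMARY KEY, Name VARCHAR(100) NOT NULL);
CREATE TABLE Project  (ProjectID  INT NOT NULL PRIMARY KEY, Name VARCHAR(100) NOT NULL);
CREATE TABLE Document (
    DocumentID INT NOT NULL PRIMARY KEY,
    CustomerID INT NOT NULL REFERENCES Customer (CustomerID),
    ProjectID  INT NOT NULL REFERENCES Project (ProjectID),
    Title      VARCHAR(200) NOT NULL
);

-- "Folder by customer" is just one query...
SELECT Title FROM Document WHERE CustomerID = 42;

-- ...and "folder by project" is another, over the same data.
SELECT Title FROM Document WHERE ProjectID = 7;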

The authors' (and the majority of software developers') myopia is apparent in their view of history, the example data model they provide, and even in their treatment of computer science theory.

Let's talk about history first. coRel has this to say:


Codd's relational model and SQL allowed implementations from different vendors to be (near) perfect substitutes, and hence provided the conditions for perfect competition. Standardizing on the relational model and SQL created a secondary network effect around complementary producers such as educators, tool vendors, consultants, etc., all targeting the same underlying mathematical principles. Differences between actual relational database implementations and SQL dialects became to a large extent irrelevant.

While it's true that standardization around SQL led to wide adoption, such standardization had already emerged prior to the introduction of the relational model. Namely, CODASYL (the same body that created COBOL) developed a standard around the aforementioned Network Model, often referred to as the Data Base Task Group (DBTG) standard. Much of what you see in modern SQL standards actually comes from this standard - in particular the separation of DDL (data definition language) from DML (data manipulation language). However, CODASYL vendors (and there were a lot of them) were blindsided by Codd's relational model.

Interestingly, Codd's original language was not SQL, but rather Alpha. Also, the first two major RDBMS vendors had competing standards: Ingres used a language called Quel, while Oracle and IBM used SQL. But because Ingres was always based on the relational model, it was able to simply slap on support for SQL. Ingres lives on to this day in the form of PostgreSQL. The other non-RDBMS vendors also live on to this day, but tend to serve particular niches (e.g. IBM's IMS is still heavily used in banking). There was nothing inherently special about SQL, and other relational languages are still around and continue to be invented. What is special is the underlying Relational Model.

For a better explanation of why the Relational Model entered the marketplace, here's a passage I scanned in from the article "The Commercialization of Database Management Systems, 1969-1983", found in the IEEE Annals of the History of Computing, Volume 31, Number 4, October-December 2009.



I am now going to talk about the sample data model used by coRel. It is based on an example taken from Amazon's SimpleDB. Here is what the original data looks like, as described by Amazon:


Here is how the authors model this using an object model (essentially a Network Model):



Here is how the authors model this using a Relational Model:



If you're an experienced relational modeller, you will observe that the model is not in BCNF (Boyce-Codd Normal Form), as two of the entities (Ratings and Keywords) have overlapping candidate keys.

Here is what the data model should look like in BCNF:
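The original diagram does not reproduce here, but to give a rough idea of the shape such a BCNF decomposition might take (the table and column names below are entirely my own guesses, not the authors' model):

-- Each fact lives in exactly one place, keyed by its full determinant.
CREATE TABLE Item        (ItemID   VARCHAR(20)  NOT NULL PRIMARY KEY,
                          Title    VARCHAR(200) NOT NULL);
CREATE TABLE Rating      (ItemID   VARCHAR(20)  NOT NULL REFERENCES Item (ItemID),
                          RaterID  VARCHAR(20)  NOT NULL,
                          Rating   INT          NOT NULL,
                          PRIMARY KEY (ItemID, RaterID));
CREATE TABLE Keyword     (Keyword  VARCHAR(50)  NOT NULL PRIMARY KEY);
CREATE TABLE ItemKeyword (ItemID   VARCHAR(20)  NOT NULL REFERENCES Item (ItemID),
                          Keyword  VARCHAR(50)  NOT NULL REFERENCES Keyword (Keyword),
                          PRIMARY KEY (ItemID, Keyword));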



From the perspective of the Product Catalog application, this change seems somewhat academic. However, if we extend our normalized data model to include Tweets retrieved via Twitter keyword searches - which is important from the perspective of a marketer - things get more interesting. Here's the updated model:



What is important to note in this normalized model is that all entities are "first class citizens". As a Product Catalog application developer, I can ask questions or make changes to Keywords and Products without having to involve Tweets. Or, as a marketer, I can ask questions about Tweets and Keywords without having to involve Products.

When data is seen in a larger context with many different perspectives, the relational model makes sense. While it may be more efficient to model data for a particular perspective (i.e. the product catalog application) using a Network or Object model, that same model can be very inefficient for other perspectives and can lead to anomalies and contradictions in the data. This point is lost on many developers, since most only deal with a single perspective of the data. The following paragraph in coRel makes this very clear:

Summarizing what we have learned so far, we see that in order to use a relational database, starting with a natural hierarchical object model, the designer needs to normalize the data model into multiple types that no longer reflect the original intent; the application developer must reencode the original hierarchical structure by decorating the normalized data with extra metadata; and, finally, the database implementer has to speed up queries over the normalized data by building indexes that essentially re-create the original nested structure of the data as well.

See the problem? There is rarely such a thing as a 'natural' hierarchy. Perspective is everything, and depending on how we view an ontology, we can ascribe many different hierarchies.

Now, you may be wondering if it is possible to represent a relational model in a NoSQL database, such as a key/value store. The answer is: sort of. While it is possible to recreate the structure of the relational model, it is not possible to centralize the integrity of the relational model. This is not a trivial point. Referential integrity (and other forms of integrity, such as uniqueness, nullability, and value domain constraints) are what ensure the correctness of ad hoc queries. When such constraints are removed, it is up to the application developer to examine the underlying data and perform numerous tests to ensure its integrity. The end result is poor data integrity and poor data quality. I can speak quite frankly on this last point as I see the difference between poorly constrained and well-constrained data models all the time. Much as the second law of thermodynamics predicts: when data is left unconstrained, its entropy tends to infinity over time.
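As a small illustration (a hypothetical schema), these are the kinds of declarative constraints a relational database centralizes, so that every writer and every ad hoc query gets the same guarantees:

CREATE TABLE Room (RoomID INT NOT NULL PRIMARY KEY, RoomNumber VARCHAR(10) NOT NULL UNIQUE);

CREATE TABLE Booking (
    BookingID    INT NOT NULL PRIMARY KEY,                 -- uniqueness of the key
    RoomID       INT NOT NULL REFERENCES Room (RoomID),    -- referential integrity
    GuestName    VARCHAR(200) NOT NULL,                    -- nullability constraint
    CheckInTime  DATETIME NOT NULL,
    CheckOutTime DATETIME NULL,                            -- unknown until the guest checks out
    CHECK (CheckOutTime IS NULL OR CheckOutTime > CheckInTime)  -- value domain constraint
);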

In other words, the relational model allows us to manage information as a separate concern. You might even say this is the whole point of the relational model.

I now want to move on to the theoretical aspects of the paper, in particular the question of compositionality. CoRel's authors define compositionality as "the ability to arbitrarily combine complex values from simpler values without falling outside the system". They go on to say:

SQL is rife with noncompositional features. For example, the semantics of NULL is a big mess: why does adding the number 13 to a NULL value, 13+NULL, return NULL, but summing the same two values, SUM(13, NULL), returns 13?

A more precise definition of compositionality comes from Wikipedia, which states: "An important aspect of denotational semantics of programming languages is compositionality, by which the denotation of a program is constructed from denotations of its parts."

By this definition SQL guarantees compositionality, since it doesn't have any side effects. Contrast this with most concurrent programming languages (e.g. Java, C#, Python), which do not guarantee compositionality since it's possible to write modules that impact other modules.
But I don't think the authors were thinking along these lines. They're really arguing that SQL is inconsistent, and they point to the example with the NULLs.

The reason why there is this perception of non-compositionality is that sets and tuples are treated as primitives. You cannot make an aggregate function out of tuple functions, and you cannot make a scalar function out of aggregate functions.
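The distinction is easy to see directly (the VALUES-based derived table below, in SQL Server syntax, is just a stand-in for any column containing a 13 and a NULL):

-- Scalar (tuple-level) arithmetic propagates NULL...
SELECT 13 + NULL AS ScalarResult;          -- returns NULL

-- ...while aggregates operate on the set and simply skip NULLs.
SELECT SUM(x) AS AggregateResult           -- returns 13
FROM   (VALUES (13), (CAST(NULL AS INT))) AS t(x);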

NULLs are controversial to this day, and Codd even wanted to take things a step further and distinguish between "unknown but applicable" and "unknown but inapplicable".
Codd's basic argument for the inclusion of NULLs (and three-valued logic) can be summarized thus: In the real world, handling unknown values is inherently complex. Instead of thrusting that complexity back onto the user, the RDBMS should handle unknowns "correctly" - in so far as the behavior correctly models real-world behavior. This can result in NULLs being counter-intuitive, but just because something is counter-intuitive doesn't mean it's wrong (think flat earth intuition, round earth reality).

For example, let's say you are hotel manager and you want to know the average number of days a guest stayed for. Assuming you have a database where front reception can log check-in and check-out times. Obviously when you haven't checked out, the check out time is unknown (or doesn't exist), so that attribute would be NULL. When taking the average of checkin [minus] checkout, you will only be including rows where both the check-in and check-out times are known, since when tuple contains an unknown element [unless stated otherwise] the tuple as a whole cannot be known, and should be eliminated from the set. This is what you want. Putting it in a SQL query, it would look like this:

SELECT
AVG(DATEDIFF(CheckOutTime, CheckInTime)) -- rows with a NULL CheckOutTime yield NULL and are skipped by AVG
FROM
BOOKINGS

As you can see, it's a very simple query to write and validate, and reflects the correct handling of NULLs. Quite the opposite of a "big mess".

But the whole question of compositionality also misses the point with SQL. SQL is not a computational language - it's a data retrieval language based on the relational model. When we're talking about data, compositionality is not our main concern - normalization is. Data which is not normalized is like a program which is non-compositional: nasty side effects can and will arise.
---
The funny thing about the relational model is that it is predicated on Relational Algebra which is completely orthogonal to the Universal Turing Machine. The former is about logic, and the latter is about flow. They are not in competition.

But when I hear people say things like "it's about time somebody built a better database than those stupid RDBMSs", it's akin to saying "it's about time somebody built a better Universal Turing Machine". It makes no sense, really.

Before I conclude this post, I want to share with you an excerpt from Joe Celko's "Thinking In Sets" which is very telling:

Many years ago, the INCITS H2 Database Standards Committee (née ANSI X3H2 Database Standards Committee) had a meeting in Rapid City, South Dakota. We had Mount Rushmore and Bjarne Stroustrup as special attractions. Mr. Stroustrup did his slide show with overhead transparencies (yes, this was before PowerPoint was ubiquitous!) about Bell Labs inventing C++ and OO programming, and we got to ask questions.

One of the questions was how we should put OO features into the working model of the next version of the SQL standard, which was known as SQL3 internally. His answer was that Bell Labs, with all their talent, had tried four different approaches to this problem and they came to the conclusion that it should not be done. OO was great for programming but deadly for data.


Summing up: while I am being critical of the coRel paper, there was clearly a lot of thought that went into it, and the authors come across as being intelligent and having a good pedigree. My point is that there is an institutional bias towards application-centric data modeling, which comes at the expense of perspective-neutral data modeling - i.e. the Relational Model.
It has been my experience that this bias has led to a great deal of friction between software or application architects and data architects, much to the frustration of both.

My greatest hope is that by educating students at an earlier age, this deep-rooted bias can be avoided. This, I will point out, is a long-running project of mine - a blog for another day.

Sunday, October 09, 2011

Keys, Data Synchronization, and the Relational Model

A common problem faced by database developers involves keeping schemas and data in sync between environments. The problem isn't even restricted to development. Often databases must be kept synchronized in operational scenarios. Data is often replicated for reasons of high availability and performance.
In this posting, I want to illustrate a real-world scenario that frequently comes up, and a modeling approach that ensures this scenario never turns into a problem. Even if you have not encountered this scenario, it's worth reading this blog in its entirety for tips on better data modeling, and why the relational model is still very relevant.

As anyone who has had to keep a calendar or contacts in sync between devices knows - or, for developers, anyone accustomed to using version management systems like Visual Source Safe (VSS) or Subversion - the problem really boils down to the following three requirements:

  1. If the data is in the source data store but not the target, then add it to the target.
  2. If the data is in the target but not in the source, then remove it from the target.
  3. If the data is both in the source and the target, overwrite the target's attributes with the source's attributes.
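Since this post leans on the Microsoft stack, here is a hedged sketch of how those three rules map onto a single T-SQL MERGE statement (the table and column names, including DEPARTMENT, are illustrative; the ON clause is where the key-alignment problem discussed below comes in):

MERGE INTO Target_Employee AS t
USING Source_Employee AS s
    ON  s.TELEPHONE_NUMBER = t.TELEPHONE_NUMBER  -- match on the natural key
    AND s.FIRST_NAME = t.FIRST_NAME
    AND s.LAST_NAME  = t.LAST_NAME
WHEN NOT MATCHED BY TARGET THEN                  -- rule 1: in source only -> insert
    INSERT (TELEPHONE_NUMBER, FIRST_NAME, LAST_NAME, DEPARTMENT)
    VALUES (s.TELEPHONE_NUMBER, s.FIRST_NAME, s.LAST_NAME, s.DEPARTMENT)
WHEN NOT MATCHED BY SOURCE THEN                  -- rule 2: in target only -> delete
    DELETE
WHEN MATCHED THEN                                -- rule 3: in both -> overwrite attributes
    UPDATE SET t.DEPARTMENT = s.DEPARTMENT;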

Simple right? Indeed everything is simple... IF you have identifiers which are consistent between source and target data store. For the purposes of this blog posting, I'm going to focus on the use case of syncing data between a development server and a test server, which is equivalent to syncing between a test server and a production server.

I am not going to get into the intricacies of syncing calendar or personal contact data - that scenario is actually quite different since there is never a universally agreed upon key or way of identifying individuals, so you really don't have a proper key to compare against. That's why most PIM synchronization software tends to take a "fuzzy" approach when syncing up contacts. Companies like Facebook and LinkedIn are in a privileged position to address this ongoing problem, as they are effectively becoming a de facto registry for contact information. But I digress...

For this posting, to make my examples more explicit I'll refer to the Microsoft technology stack, although all of the major RDBMS vendors support the same functionality and tools.

The crux of the problem when synchronizing most data comes down to the alignment of keys (or identifiers if you prefer).

A common conundrum that comes up when modeling data is defining the right primary key. For example, I can create an employee table and use the employee's social insurance number as my primary key. Or perhaps I don't have access to the social insurance number, and instead use telephone number plus first and last name as my primary key. In the latter case, people are known to change their name or their telephone number; the key is effectively out of our control. In order to have a reliable key, we end up creating what I refer to as a "technical key". Technical keys are often auto-generated from incrementing sequences, but they can just as easily be generated from a GUID. It is common to find auto-generated primary keys because not only is the data modeler in complete control of the key values, but they are also compact, especially when referenced as a foreign key from other [dependent] tables. Data synchronization problems begin here, simply because the technical key depends on the database or application that generated it. If you import the same data into a different database for the first time, it can and probably will have a different auto-generated key. I am well aware that it is possible to load data into other databases while preserving keys, and if you're taking a "master slave" approach to data synchronization you have nothing to worry about, since you're effectively just copying the data over. However, if you already have data loaded in the target table and you simply need to update a few columns, then you're going to run into problems.

A good solution to this problem is to define an alternate natural key. Basically this is just a UNIQUE index on one or more columns. For our above example, this could be composed of name and telephone number. Given that natural keys can still differ between environments, the developer should perform some analysis to assess and quantify that risk.
Once you have determined the natural keys are in sync, you can [as an example] use Visual Studio 2010 Schema Compare to perform a schema comparison, which will generate a SQLCMD script (assuming you are promoting schema changes), and you can also use Visual Studio 2010 Data Compare to generate a SQL script to perform all the UPDATEs, INSERTs, and DELETEs. In VS 2010 you can choose whether to use a table's primary key, or one of the alternate natural keys you have defined.

I should point out that if your table doesn't have any set of columns (apart from the primary key) which can be guaranteed to be unique, and the primary keys themselves are generated and therefore database server specific, you should try your best to rectify this situation as you now have a more fundamental problem, which I won't be addressing in this blog (hint: You need to start looking at the target's change history).

So far so good. But what if you have a dependent/child table whose natural key depends on the parent table? Let's say there is a table called EMPLOYEE_INVENTORY which is a list of items that have been provisioned to the employee. The natural key for this table might be the composite of EMPLOYEE_ID and INVENTORY_ITEM_ID (and the primary key for EMPLOYEE_INVENTORY is an auto-generated key). To keep things simple, the primary key and natural key for INVENTORY_ITEM are one and the same: the universal SKU # of the inventory item. So in summary, the primary key for EMPLOYEE_INVENTORY, "ID", is a technical key, and the natural alternate key is "EMPLOYEE_ID" + "INVENTORY_ITEM_ID".

Now we have a bit of a problem when it comes to keeping EMPLOYEE_INVENTORY in sync. Namely, half of its natural key is derived from the auto-generated primary key in its parent table (EMPLOYEE). Should we wish to sync based on natural keys, we're forced to develop code to perform lookups, comparisons, UPDATEs, INSERTs, and DELETEs - we can no longer rely on data syncing software like VS 2010 Data Compare to do this for us. While there is nothing inherently complicated about this, it will invariably take you a chunk of time to write this code, whether you do it in an ETL tool like SSIS or stick to procedural code. If you have another dependent/child table which in turn depends on EMPLOYEE_INVENTORY (e.g. EMPLOYEE_INVENTORY_LOG), things get more complicated and you're spending considerably more time to complete the task.

The preferred approach is to create what is known as an UPDATABLE VIEW, which will allow you to substitute the parent table's natural key for its technical primary key. Just to clarify, an UPDATABLE VIEW is exactly what its name says it is: a VIEW you can UPDATE. As you can imagine, there are limitations as to which VIEWs can be updated (clearly anything with an aggregate would not be updatable). For our scenario, though, the UPDATABLE VIEW might look like this:
CREATE VIEW dbo.EMPLOYEE_INVENTORY_SYNC -- view name is illustrative
WITH SCHEMABINDING
AS
SELECT
e.TELEPHONE_NUMBER, --Natural Key
e.FIRST_NAME, --Natural Key
e.LAST_NAME, --Natural Key
ei.INVENTORY_ITEM_ID, --Natural Key
ei.LAST_ACTIVITY_DATE, --Attribute to sync
ei.QUANTITY --Attribute to sync
FROM
dbo.EMPLOYEE AS e INNER JOIN
dbo.EMPLOYEE_INVENTORY AS ei ON e.ID = ei.EMPLOYEE_ID



For SQL Server, if you want to make this VIEW UPDATABLE, there are two things you need to do. First, you need to define it with the "SCHEMABINDING" option (as shown above). Second, you need to define a UNIQUE CLUSTERED INDEX on the natural key columns of the view (i.e. "TELEPHONE_NUMBER", "FIRST_NAME", "LAST_NAME", "INVENTORY_ITEM_ID").
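That second step might look like this (the index and view names are illustrative):

CREATE UNIQUE CLUSTERED INDEX IX_EMPLOYEE_INVENTORY_SYNC_NK
ON dbo.EMPLOYEE_INVENTORY_SYNC (TELEPHONE_NUMBER, FIRST_NAME, LAST_NAME, INVENTORY_ITEM_ID);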

You are now in a position where you can use data synchronization tools like VS 2010 Data Compare to automatically synchronize the data for you. Because we haven't had to write any procedural code, we can focus instead on problems which arise from the data itself (e.g. a natural key mismatch), as opposed to getting bogged down in throwaway code.

Problem solved.

********

What I want to illustrate in this blog is the power that comes from mindful modelling and adhering to relational principles.

In the last couple of years, NoSQL databases like MongoDB, Redis, Cassandra, and Google DataStore have flourished. Indeed, these databases provide significant advantages for application developers in terms of scalability. They also feel like an ORM layer, but with much greater efficiency, so they are very desirable to application developers. And not to be ignored, many of these new databases are open source and can be used for little or no money. Case-in-point: I'm planning on building a new hobby application using Google DataStore since I can get up and running without paying a cent.

The downside of these modern database technologies is that they suffer from many of the same limitations that plagued pre-relational database developers. Namely, there is a "perspective lock-in". What this means is that once the application developer has modeled data for their application's use cases, it may be difficult for future applications to use the data for their own purposes. It will also be difficult to run any sort of ad hoc queries without first exporting the data to an analytical RDBMS.

This is not to say that NoSQL databases should not be used. This is to say that they should be chosen with eyes wide open and a clear understanding of the trade-offs involved. In fact, I believe the most compelling reason to use a NoSQL database is for low cost (in particular hardware costs). Let me repeat that: If you absolutely cannot afford to scale using an RDBMS and you can't see yourself bootstrapping yourself along, then go with a NoSQL database.

Put another way, virtually every NoSQL innovation I've seen has been absorbed into an RDBMS. Take for example binary large objects (BLOBs) like videos, music, and documents: Microsoft (and presumably others) now allows you to access these objects directly through the file system, while allowing them to be managed transactionally through the RDBMS. Analytical databases like ParAccel and Vertica allow for petabyte scaling - and still allow full relational capabilities. Contrast this with Google DataStore, which doesn't even support basic JOINs or aggregations (GROUP BYs). This means you have to write this code on your own. Not only is that going to be cumbersome and error prone, it's also going to perform worse. This is why you're beginning to see "SQL layers" like Google's Tenzing being added to NoSQL systems to speed up commonly requested tasks. Even Facebook, which started the Cassandra project, still uses MySQL (combined with Memcached in a sharded configuration).

There's also the notion of BASE versus ACID, and the types of business models each can tolerate. But that's a discussion for another day...

And so I will end this blog entry.

Monday, December 27, 2010

The Decline of Trust: Follow-up to Andrew Maguire post

Earlier this year I blogged about an alleged case of silver manipulation as raised by Andrew Maguire to the CFTC. If you haven't already, I urge you to read that post before continuing on with this post.

I ended my post with a challenge to any serious journalist to properly investigate Andrew Maguire's case. So far I have yet to see anything resembling a real investigation undertaken by the media establishment. While I cannot say for sure why this is, given what a juicy story the Andrew Maguire case appears to be, I am beginning to see an interesting pattern emerge that may partly explain it. I must say that I'm a bit surprised by my own findings, but they are enough for me to draw conclusions.

For a period of time the only newspaper that reported the Andrew Maguire story was the New York Post. You can read the first report here, which was published in March (shortly after Maguire blew the whistle). A follow-up report can be found here. However, a third story was published shortly after I posted my last blog entry, which in hindsight is most interesting. You can read it here. What's interesting about that last story is that one week after its original publication, a correction was issued which basically nullified the entire story. Namely, JPMorgan responded by saying there was no investigation to begin with. The correction is posted at the bottom of the story. The NYPost never countered.

Fast forward to October 8th of this year. Reuters reported that CFTC commissioner Bart Chilton had publicly announced that the CFTC was investigating silver manipulation. Oddly, the statement wasn't from the CFTC itself. Rather, Chilton, speaking of his own accord, felt the public deserved some answers. You can read that story here.

By the end of October, Reuters published a follow-up story reporting that the CFTC was in particular investigating JPMorgan and HSBC for silver manipulation. You can read that story here. This would appear to directly contradict the statements made to the NYPost back in May, that they were NOT being investigated for silver manipulation. JPMorgan declined to comment on the matter.

A day later, Reuters reported the first of at least five class action lawsuits filed by investors targeting JPMorgan and HSBC. Here is the article, and here is a scanned-in copy of the court filing as filed in the Southern District of New York. About a week later, a second class action lawsuit was filed. You can read the court filing here. A week after that, a third law firm posted a press release announcing they were investigating silver manipulation by JPMorgan and HSBC on behalf of their clients. As a follow-up, Reuters posted another article providing historical context to these cases. The article begins with the obligatory mention of the Hunt brothers, some mention of "gold bugs" and "conspiracies", but no mention of Andrew Maguire himself.

To be fair to the media establishment, on November 3rd Andrew Maguire was mentioned by name in this Wall Street Journal article describing a fourth lawsuit against JPMorgan and HSBC. You can read that article here. Apart from that brief mention in the WSJ, most newspapers either leave out mentioning Andrew Maguire by name, or they describe a "London-based metals trader".

For those that are curious about how the manipulation works, a fifth class action lawsuit was just launched on December 28th, which describes in detail HOW the manipulation works and how JPMorgan and HSBC profited from it. It also describes WHY they were in a unique position to do so.

Let's pause for a moment and ask some questions. Why is it that only the NYPost originally reported on the story? Why didn't NYPost continue to follow the story they claim to have broken after it developed? Why doesn't Reuters mention the name of Andrew Maguire? Why doesn't the WSJ and Reuters print or mention the original e-mail transcript of the correspondences by Andrew Maguire to the CFTC? Is this just a one-off oversight?

Perhaps.

But around the same time Bart Chilton's announcement was made, another interesting and related story came to light. Namely, The Washington Post reported that one of the two CFTC judges (Judge Painter) had announced his retirement and that he had issued an Order requesting that all seven of his open cases NOT be transferred to the other CFTC judge, Judge Levine. I have quoted the key paragraph of the one page order (which you can read here) below so you can see for yourself:

There are two administrative law judges at the Commodity Futures Trading Commission: myself and the Honorable Bruce Levine. On Judge Levine's first week on the job, nearly twenty years ago, he came into my office and stated that he had promised Wendy Gramm, then Chairwoman of the Commission, that we would never rule in a complainant's favor. A review of his rulings will confirm that he has fulfilled his vow. Judge Levine, in the cynical guise of enforcing the rules, forces pro se complainants to run a hostile procedural gauntlet until they lose hope, and either withdraw their complainant or settle for a pittance, regardless of the merits of the case. See Michael Schroeder, If You've Got a Beef With a Futures Broker, This Judge Isn't for You - In Eights [sic] Years at the CFTC, Levine Has Never Ruled in Favor of an Investor, Wall St. J., Dec. 13, 2000, at A1 (copy attached).

In the last sentence of the previous paragraph, Painter cites an article that appeared on the front page of the Wall Street Journal back in 2000. You can read that article following Painter's Order here. It is a well researched investigation into the conduct of Judge Levine and sure enough does raise some important and relevant questions. It's not entirely one sided, but it does put Levine on the defensive. You would not be unreasonable to think the Wall Street Journal would publish a follow-up article vindicating their original investigation. Right?

Wrong. The Wall Street Journal has taken an entirely different and rather underhanded tack. Namely, Sarah Lynch penned an article which is a blatant smear on Judge Painter, describing him as a mentally ill alcoholic who would sleep at work and who was a failure in his private life. Sandwiched in the middle of the article we find a very brief mention of Painter's Order against Levine, which is immediately followed by "Judge Levine declined to comment, but a former colleague defended Judge Levine's record and said he is fair." The article continues to smear Painter to the very end quoting a doctor that diagnosed him with "cognitive impairment, alcoholism and depression". The tone and content is more in line with something you might read in The National Enquirer or US Weekly. The article seems highly uncharacteristic and rather beneath a newspaper like the Wall Street Journal with such a distinguished history.

Reading Lynch's WSJ article after reading Schroeder's original investigative article and the recent Washington Post article left me feeling stunned. The only thing I will say in the article's defense is that it does help to explain why Painter kept this to himself for so long. I suspect his enemies were holding a grenade over his head, and clearly they've pulled the pin.

I am not the only person who noticed this miscarriage of journalism. Barry Ritholtz was equally disgusted and contacted Sarah Lynch for an explanation. Lynch's response was summarized as:

Lynch wrote back to note she did a story on the judge last Friday, but it ran on newswires but was not picked up by WSJ. (Reporters have no control over those editorial decision). The current article is a follow up to that prior piece.

This would all suggest that Lynch's article was taken out of context and something immoral is going on within the higher ranks of the WSJ with respect to reporting around the CFTC and the silver manipulation story.

******

I recently went to see the movie "Inside Job". The film documents the reasons that led up to the 2008 economic meltdown, the impact it has had so far, and why nothing has substantially changed in terms of policy and regulation. What surprised me was how interconnected the corruption is. There is no grand conspiracy of the sort you might think is required to fake the moon landing, or plan 9/11 as an inside job. Rather, the conspiracy is of incentives and motives which readers of Freakonomics will quickly recognize.

The movie describes the interconnectedness between the big investment banks (e.g. Goldman Sachs, JP Morgan, Morgan Stanley, etc.), the Federal Reserve, the government executive branch (both Bush and Obama), the government regulators (i.e. SEC and CFTC), the rating agencies (i.e. Moody's, S&P, and Fitch), and even the prestigious business schools (e.g. Harvard, Columbia, Wharton, etc.). On that last one, I have to admit I never realized that so many business professors were on the payrolls of these banks. Actually, they're usually not on a direct payroll, but rather work as consultants through intermediaries like Analysis Group. Inside Job argues that this influence on academia is so pervasive that most of these schools' curricula have morphed to reflect the will of the banks, and tend to take a very negative view of regulation.

In light of this, I would argue that the media establishment, which has become highly consolidated, is also susceptible to the same conspiracy of motives that has adversely influenced some of the world's most respected economics professors. In this blog post I have attempted to present evidence supporting my theory.

Newspapers like the Wall Street Journal are generally regarded as pillars of the fourth estate. I no longer believe the WSJ should be trusted and I'm sure others feel the same way. Since 2008, trust in governments, banks, rating agencies, business schools, and the media has been in steady decline. This is not good for any society.

I don't believe there is any silver bullet to resolve this. However, by continuing to shine as much light as possible on media biases, I believe it is possible for us, through Citizen Journalism and other forms of grassroots reporting, to get out of these dark days, hold the media to account, and in turn hold the government, banks, and corporations to account. It looks like an uphill battle, but I'm an optimist.

Monday, August 23, 2010

Twelve Reasons to use an ETL tool / Experience Report on ETL tools: Pentaho DI, SSIS, and PowerShell

Over the past couple of months I've been jumping between three different ETL tools (well, PowerShell is not exactly an ETL tool, but it has some overlapping functionality). The experience has given me new perspective on the strengths and weaknesses of each tool. I hope to share with you my experiences, opinions, and recommendations.

This report is not going to be as structured, or cover as many tools as something you might find from a Gartner report. My approach here is to get into specific experiences with the tools as well as discuss why I think ETL tools are important to begin with. My perspective is not intended to be a definitive decision making tool, but rather a useful component to the decision making process when choosing an ETL tool.

Before getting into my experiences I'll give you a brief overview of what ETL is, and how these tools fit into the ETL landscape.

ETL stands for "Extract Transform Load". ETL tools first started showing up in the mid-nineties as a response to the growing demand for data warehouses. The first major ETL tool was Informatica, and it continues to be one of the best ETL tools available on the market today. The main reason ETL tools were invented is that it has traditionally been time consuming and error prone to extract data from multiple source systems, merge those data together, and load them into a data warehouse for reporting and analytics.

To this day, the majority of techies in IT are unaware of what an ETL tool is. However those same people are often tasked with solving the very problems ETL tools are designed to solve. More experienced developers will achieve their goals through a combination of shell scripts and SQL scripts. Less experienced developers will simply stay in their "comfort zone" and use whatever programming language they happen to be the most familiar with, with little regard for software maintenance and support.

The task of writing a computer program to extract data, merge it (or perform some other transformation), and load those data into a target data warehouse is conceptually very simple for the average developer. However, it is not until you start running into the following problems that you realize the "follow your nose" approach doesn't work so well. Here are some of the issues you will likely run into when maintaining ETL jobs for a data warehouse (I've run into them all). In other words, here is what an ETL tool is designed for:

  1. Code readability: This is probably the biggest difference between an ETL tool and pretty much every other programming language out there. It is driven off of a visual metaphor. When I first saw an ETL job, it reminded me of one of those crazy Rube-Goldberg machines. However, once you acquaint yourself with the iconography, and what all the connecting lines mean, it becomes very easy to look at an ETL job and understand what it is doing, and how it works. Unfortunately, many developers scoff at this visual approach, after all "real coders write real code". I also wonder if many developers are afraid of developing ETL jobs as it lays their code bare for all to see. Managers on the other hand love this visual approach, as it allows them to scrutinize and partially understand how code is working. Furthermore, an ETL tool will visually show your data coursing through it from one step to the next, much like observing water flow through tributaries into rivers into lakes. When a full load takes several hours to complete, it is very reassuring to see what is actually going on with the data.
  2. Data Element mapping: Much of what you're doing in data warehousing involves mapping a source data element to a target schema. This is tedious work and it's potentially error prone - especially if you're lining up an INSERT statement with a SELECT statement. ETL tools make it very easy and safe to line up and map data elements in a fraction of the time you would normally take.
  3. Impact Analysis: Agility is often seen as a function of management. However, if it takes you a month to figure out what is being impacted as a result of a minor schema change (e.g. adding a column to a table), then you're going to get stuck in your tracks before you have the chance to yell "Scrum!". ETL tools make it easy to identify where data is being sourced from, and where it is being targeted. Some ETL tools will even produce impact reports for you. Keep in mind that this is only one aspect of impact analysis, but every little bit helps.
  4. Incremental loading: By default most developers will develop code to perform a full load on a data warehouse. This code is not only easy to develop, but is fairly reliable. Unfortunately, full loads can take a lot of time to complete, and as time goes on will take longer and longer. Switching to an incremental load is tricky since it involves figuring out what data is new, changed, or has been removed. Depending on your source system, your options may be limited. Nevertheless, ETL tools provide functionality to assist with different scenarios.
  5. Parallelism: Writing concurrent software is not for the "faint of heart". There is an inherent overhead required to manage semaphores and other forms of inter-process communication. Furthermore, the threat of deadlocks or livelocks in complex systems can often only be discovered through trial-and-error. There are modern frameworks which greatly simplify parallelism in languages like Java or Python. However it is hard to argue against the simplicity of the ETL approach when developing parallel data processing applications. Furthermore, some ETL tools (e.g. IBM DataStage Enterprise Edition) can even parallelize JOINs. This makes it possible to outperform an in-place JOIN within an RDBMS like Oracle (most SQL queries still run single threaded).
  6. Checkpointing and job recovery: Imagine you have just developed a script to perform a large data warehouse load. Let's say you run the script on a weekly basis, and the load takes 8 hours to run (you run it over night). One morning you arrive to find the script has failed 80% into the job. Some of the tables have been updated, but there are a few that haven't been updated. You can run the job from the beginning, but this is problematic since you need to back out all the data you have already inserted (otherwise you'll get duplicate data). Alternatively, you can pinpoint the point of failure and run from there. However, this requires a code change, which means you will need to test your modified code before getting it into production (and let's not forget the psychological pressure when people are waiting on fresh data). ETL tools make it relatively easy to checkpoint your code, and resume from a failed step. Some ETL tools have built-in checkpointing which means you don't have to even instrument your code for checkpointing. Furthermore, if code changes are required [due to a failed job], they are more often a change in configuration, rather than a real change in code, so it is generally safe to make a configuration change and resume the job with less risk than traditional scripting. That said, not all ETL tools support checkpointing. However, even ETL tools which don't explicitly support checkpointing can be more easily instrumented to checkpoint than traditional scripting or programming languages.
  7. Logging and monitoring: I've already touched upon this point with respect to monitoring. To reiterate, ETL tools allow you to see (seeing is believing) data flowing from one step to the next. It's powerful data visualization that is not unlike a floor plan or map visualization. Namely, it can communicate far more information than a bar chart or line chart ever will. As for logging, ETL tools by default centralize logs and implicitly log what you need to log (with configurable levels of granularity), without you having to instrument your code.
  8. Centralized Error Handling: Scripting languages don't provide centralized error handling (this has more to do with the OS and legacy applications), so you're forced to always explicitly check for errors and call the appropriate error handling routine. ETL tools often include generic error handling routines which are invoked regardless of where the error originated from or how it is raised.
  9. Slowly Changing Dimensions: Although I've listed this as #9, perhaps this should be #1 from a data quality perspective. If you don't implement a Type II (or higher) SCD policy, your precious data warehousing investment will most certainly degrade over time. Slowly Changing Dimensions are crucially important, as they allow you to show a consistent view of history. I could go on at greater length as to what an SCD is, but if you're not sure just read the Wikipedia article. At any rate, ETL tools have built-in SCD steps which allow you to define the technical key, the natural key, the effective begin and end date fields, and in some cases a version field and "current version" flag field (see the sketch after this list for the equivalent hand-written SQL). Without an ETL tool, SCDs are a major pain in the butt to manage, and developers will often not implement a Type II SCD policy (often because they aren't aware of what an SCD policy is), but also due to the added effort. Don't forget the classic cop-out: "Hey, if the business analyst doesn't ask for it, I'm not going to build it". SCDs are one of those classic business requirements that few people realize they need until it's too late.
  10. Pivoting: It is increasingly popular to see data being stored and logged in what are known as Entity Attribute Value (EAV) data models. I generally prefer to see explicit data models over generalized data models (which effectively treat the database as a "bit bucket"). But they're here to stay, and in some cases actually make sense. However, it is virtually impossible to analyze EAV data without pivoting it back into columns. This is another task which is well served through an ETL tool, and which can be very cumbersome to accomplish through traditional scripting. In short, an ETL tool allows you to specify the grouping ID (essentially the primary key for the row), the pivot key column, and the various pivot values you wish to map to columns. Furthermore, if you need to extract data out of some kind of BLOB or something like an XML or JSON document, this too is easily achieved through standard ETL steps. If you were to do this through scripting, you would probably have to rely on temporary tables and plenty of code. In other words, just like with Slowly Changing Dimensions, there are fewer Function Points to worry about with an ETL tool.
  11. In-memory batch manipulation: When manipulating data sets in bulk, many developers will rely on temporary tables and then perform batch operations against those temporary tables. While this works, and is sometimes even necessary, ETL tools allow you to more easily process large batches of data as part of a continuous in-memory pipeline of records. It is this notion of a "data pipeline" which distinguishes ETL from other programming paradigms, including scripting approaches (even those scripting languages that support piping operations). There's a toy illustration of the idea after this list.
  12. Distributed transaction control: While it is possible to create distributed transactions using a transaction coordinator, developers who eschew ETL tools will often also avoid other high-level tools, such as a distributed transaction coordinator. Some ETL tools include built-in integration with distributed transaction coordinators. However, in my own experience I have never needed two-phase commit in an ETL tool. I suppose if I needed to ensure that my data warehouse was always 100% consistent, even during load times, then it would make sense. Even so, I would urge caution when using transaction managers, since large transactions can easily lead to locking, not to mention that they perform horribly under full-load scenarios, and a single COMMIT statement can take hours (depending on how you've configured your indexes).
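
To make #6 concrete, here is a minimal hand-rolled checkpointing sketch in PowerShell (a tool I come back to later); the step names and checkpoint file path are invented. The point is that an ETL tool gives you this bookkeeping for free, whereas in a script you have to build and maintain it yourself.

  # Minimal hand-rolled checkpointing (hypothetical step names and file path)
  $checkpointFile = "C:\etl\load_checkpoint.txt"
  $done = @()
  if (Test-Path $checkpointFile) { $done = @(Get-Content $checkpointFile) }

  foreach ($step in @("LoadCustomerDim", "LoadProductDim", "LoadSalesFact")) {
      if ($done -contains $step) {
          Write-Host "Skipping $step (completed in a previous run)"
          continue
      }
      Write-Host "Running $step..."
      # ... the actual load logic for this step would be called here ...
      Add-Content $checkpointFile $step   # record success so a re-run resumes after this point
  }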
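
For #9, here is a rough in-memory sketch of the mechanics of a Type II SCD; the column names, keys, and dates are invented, and a real SCD step would of course work against the dimension table itself rather than an in-memory array.

  # Type II SCD sketch: expire the current row and add a new version (hypothetical columns)
  $dimension = @(
      [pscustomobject]@{ CustomerKey = 1; CustomerId = "C100"; City = "Toronto";
                         EffectiveFrom = [datetime]"2009-01-01"; EffectiveTo = [datetime]"9999-12-31"; IsCurrent = $true }
  )
  $incoming = [pscustomobject]@{ CustomerId = "C100"; City = "Ottawa" }   # natural key plus a changed attribute

  $current = $dimension | Where-Object { $_.CustomerId -eq $incoming.CustomerId -and $_.IsCurrent }
  if ($current -and $current.City -ne $incoming.City) {
      $current.EffectiveTo = (Get-Date).Date        # close off the old version
      $current.IsCurrent   = $false
      $dimension += [pscustomobject]@{              # the new version gets a new surrogate (technical) key
          CustomerKey = 2; CustomerId = $incoming.CustomerId; City = $incoming.City;
          EffectiveFrom = (Get-Date).Date; EffectiveTo = [datetime]"9999-12-31"; IsCurrent = $true
      }
  }
  $dimension | Format-Table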
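
For #10, a toy sketch of what a pivot step does with EAV rows; the data and column names are made up, and an ETL tool lets you express the same mapping declaratively.

  # Pivot EAV rows back into columns (hypothetical data)
  $eav = @(
      [pscustomobject]@{ EntityId = 1; Attribute = "Colour"; Value = "Red"  },
      [pscustomobject]@{ EntityId = 1; Attribute = "Size";   Value = "L"    },
      [pscustomobject]@{ EntityId = 2; Attribute = "Colour"; Value = "Blue" },
      [pscustomobject]@{ EntityId = 2; Attribute = "Size";   Value = "M"    }
  )

  $eav | Group-Object EntityId | ForEach-Object {
      $row = [ordered]@{ EntityId = $_.Name }   # the grouping ID (essentially the row's primary key)
      foreach ($pair in $_.Group) {
          $row[$pair.Attribute] = $pair.Value   # each pivot key value becomes a column
      }
      [pscustomobject]$row
  } | Format-Table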
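
And for #11, a toy illustration of the pipeline idea in scripting terms: each stage handles records as they stream through, rather than the whole batch being staged in a temporary table. The function names and figures are invented.

  # Toy "data pipeline": records flow through each stage one at a time, no temp table required
  function Read-Records { 1..5 | ForEach-Object { [pscustomobject]@{ Id = $_; Amount = $_ * 10 } } }
  filter Add-Tax      { $_ | Select-Object *, @{ Name = "Tax"; Expression = { $_.Amount * 0.13 } } }
  filter Select-Large { if ($_.Amount -ge 30) { $_ } }

  Read-Records | Add-Tax | Select-Large | Format-Table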

Before I get into my current experiences with PowerShell, SSIS, and Pentaho DI (PDI), I'd like to point out that there is also something called ELT (Extract, Load, Transform). ELT differs from ETL in that it relies on the target DBMS to perform the transformations, which ELT tools achieve through code generation. They are sometimes referred to as "code-generating ETL", versus most other ETL, which distinguishes itself with the moniker "engine-based ETL". ELT works like this:
  1. Extract data from the source database(s) using standard connectivity
  2. Load the data in bulk into the target database
  3. Generate and submit SQL DML statements within the target database to achieve the desired data transformations
Popular ELT tools include the open source Talend and Oracle Data Integrator (formerly Sunopsis). The strength of this approach is that the target DBMS can be leveraged to achieve superior performance. Furthermore, it is possible to override the code generation and hand-tune the generated code. A trivial sketch of step 3 follows below.
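
Purely to illustrate step 3, here is roughly what a scrap of generated, set-based DML might look like; the schema, table, and column names are invented and the SQL is deliberately simplistic.

  # Hedged sketch of "code-generating" ELT: emit set-based SQL and let the target DBMS do the work
  # (schema, table and column names are made up for illustration)
  $targetTable  = "dw.DimCustomer"
  $stagingTable = "stg.Customer"

  $dml = "INSERT INTO $targetTable (CustomerId, CustomerName, City) " +
         "SELECT s.CustomerId, s.CustomerName, s.City FROM $stagingTable s " +
         "LEFT JOIN $targetTable d ON d.CustomerId = s.CustomerId " +
         "WHERE d.CustomerId IS NULL;"

  # An ELT tool would now submit $dml to the target database
  # (e.g. via Invoke-Sqlcmd from the SQL Server snap-in, or the equivalent for your DBMS)
  Write-Output $dml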

If you target a robust DBMS (e.g. Oracle), the results can be impressive. However, since you are relying on the target DBMS, your options are more limited. Yet this is not the main reason I am wary of the ELT approach. My worry with ELT tools is that they can easily break the visual metaphor by relying too much on hand-generated code. Furthermore, during the transformation stages it is impossible to see what is going on (I prefer maximum transparency during long-running transformations). Over time, your ELT jobs can devolve into something resembling scripting.

I should point out that when I first started developing transformations and jobs using Ascential DataStage (now an IBM product), most of the data at my workplace was being read from and written to the same Oracle database. I was quickly able to run circles around most of the ETL developers there using bread-and-butter SQL queries. Indeed, Oracle has one of the best (possibly THE best) query optimizers out there, few of my peers understood how to take advantage of DataStage's parallel extender, and they were probably fairly average ETL developers anyway. But since that time I have come full circle and am now a firm believer in the visual metaphor. Yes, I may sacrifice some performance, but back then I didn't have the perspective that I'm giving you in the aforementioned 12 reasons to use an ETL tool.

PowerShell

Getting back to my recent experiences, I'll start with PowerShell. To be clear, PowerShell is NOT an ETL tool; it's a shell scripting tool. When I first discovered it (as part of MS SQL Server 2008) I was very impressed with what I saw. There was a time many years ago when I was a proficient UNIX shell script programmer. As any UNIX buff knows, mastery of the Bourne/Korn or C shell (and all the standard command line tools) is the fastest route to becoming a UNIX power user. Occasionally I'll still fire up Cygwin (I used to use the MKS Toolkit) and pipe stuff through sed or awk. Vi is still my favourite text editor.

I was therefore very eager to embrace PowerShell as a potential ETL disruptor (one should always be on the lookout for disruptive technologies). There's a lot to love about PowerShell. As much as I will diss Microsoft for its monopolistic practices, they are one of the few true innovators of the back office, and PowerShell is no exception. Let me quickly point out some of its more powerful features:
  1. Objects can be piped: Piping in UNIX is a way of life, but it can also be tedious, as you're forced to serialize and deserialize streams of text in order to achieve the desired result. PowerShell allows you to pipe .NET objects from one shell app to the next. For example, it's possible to pipe a result set or a directory listing as an object. This comes in handy quite a bit: it is very easy to retrieve a particular set of columns from a query's result set without having to sift through and parse a whole bunch of text (there's a short sketch after this list).
  2. Snap-ins can be used to move between shell contexts: Currently the only snap-ins I've used are for the OS shell and the MS SQL Server shell. The OS shell is the default, and listing "entities" will basically just list files and directories as you would normally expect, but while in SQL Server mode the shell lists databases and tables as though they were directories and files. While most DBAs are accustomed to running database commands using a query window, PowerShell goes far beyond what is easily possible using a basic SQL (or T-SQL) query interface.
  3. Parsing files is easy: Going back to object piping, it is also very easy to pipe the contents of a file and parse those contents as a file object. PowerShell has also reduced the number of steps required to open and read a file, making such operations second nature. Let's face it, processing text files is as common an activity in shell processing as ever (e.g. you need to parse a log file for an error message), so it's very much a pleasure to use PowerShell for these tasks.
  4. Great documentation: It's not often I compliment software documentation, but I feel it should be acknowledged that Microsoft has even rethought those dry UNIX "man" pages. Perhaps it's just one technical writer at MS making the difference. But whoever you are, hats off for having a good sense of humour and making the dry and technical a pleasure to read.
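
As a small taste of points 1 and 3 above, here is the sort of thing I mean; the server, database, query, and file path are of course made up.

  # Point 1: pipe objects, not text -- pick columns straight off a directory listing
  Get-ChildItem C:\Windows | Sort-Object Length -Descending | Select-Object Name, Length -First 5

  # The same idea applies to query results once the SQL Server snap-in is loaded
  Invoke-Sqlcmd -ServerInstance "MYSERVER" -Database "DW" `
      -Query "SELECT TOP 10 CustomerId, City FROM dbo.DimCustomer" |
      Select-Object CustomerId, City

  # Point 3: scan a log file for errors without any temporary files or explicit loops
  Get-Content C:\logs\etl.log | Select-String -Pattern "ERROR" | Select-Object -First 10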

With all that said, I've decided that PowerShell in its current form (I used version 2.0) has some major drawbacks from an ETL perspective. First off, by definition it's a "scripting language", so by definition it will never follow the visual metaphor. Depending on your perspective this may be a good thing, but I for one am a huge believer in code visualization. While it is possible to run a script and examine log output in real time, it's not easy to tell what is happening where when tasks are running in parallel, or what the relationships between those tasks are.

The second major issue I have with PowerShell is that it simply cannot perform the function of an ETL tool. Originally I had coded my ETL tasks as T-SQL stored procedures, which were basically wrappers for INSERT/SELECT statements. Unfortunately, under high volumes these queries caused database locking. Apart from using TEMP tables, my only other option was to perform my JOINs within an ETL tool. This all but makes PowerShell a complete showstopper from a data warehousing perspective when compared to an ETL tool. Nevertheless, I did for a while attempt to call my individual ETL transformations from PowerShell, treating PowerShell as a job controller.

The third issue with PowerShell is that I couldn't find a simple way to log everything. Depending on the type of error being raised, I had to capture it in any one of three ways. Furthermore, I basically had only one easy option for where to put the log: a text file.
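
To give a sense of what I mean, about the simplest out-of-the-box option is something along these lines (the log path is hypothetical); anything fancier means rolling your own logging.

  # About the only easy built-in option: transcribe console output to a text file
  Start-Transcript -Path "C:\logs\etl_run.log"
  try {
      Write-Output "Starting load..."
      # ... ETL calls would go here ...
  }
  catch {
      Write-Error $_    # errors raised in different ways still have to be caught and surfaced by hand
  }
  finally {
      Stop-Transcript
  }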

Overall I am very impressed with PowerShell from the point of view that Microsoft has elevated scripting to the next level. But as an ETL job controller, it simply cannot do as good a job as an ETL tool like Pentaho DI or SSIS.

SSIS

At first I wasn't overly impressed with SSIS. It struck me as over-architected and too reliant on scripting steps, thus undermining the all-important visual metaphor. I have since come to believe that its architecture is more elegant than meets the eye, but I still believe it could use a heck of a lot more built-in steps. For example, the lack of an INSERT/UPDATE step is a glaring omission. I also couldn't find any built-in step for hashing (which comes in very handy when comparing multiple columns for changes), or an explicit DELETE or UPDATE step (you must accomplish DELETEs and UPDATEs using an OLE DB transformation step). SSIS also lacks a single Slowly Changing Dimension step; instead, SSIS provides you with a wizard which generates multiple steps out of existing components.
I don't agree with Microsoft's approach here, since it makes SCDs less configurable and more reliant on moving parts. Contrast this with Pentaho DI, which includes all of these steps, as well as a single configurable Slowly Changing Dimension step. That said, it is possible to add third-party steps to SSIS, or simply code the step using the built-in scripting components. However, the problem with third-party add-ins is that they complicate deployments. And the problem with over-reliance on scripting steps is that they can't be explicitly configured; the logic must be coded. As I keep mentioning, scripting steps break the visual metaphor, so you have to either read the step's label or open it up to know what it does (ETL iconography makes it possible to see what is going on at a single glance).

Continuing my gripes with SSIS, there is some confusion as to which steps to use when it comes to database connectivity. Namely, SSIS 2008 offers both OLE DB and ADO.NET connectors. I'm currently using OLE DB, but should I be using ADO.NET now? I'm not too concerned about that dilemma. What drives me bananas is that they also include a SQL Server destination step. It turns out that this step only works properly if you're running SSIS on the same server as your SQL Server instance. For me this is maddening, and I'm not sure why Microsoft even bothered including this step - I couldn't discern any difference in performance from the OLE DB destination step.

Okay, that's the bad news. The good news is that SSIS has some pretty nifty features that make it worthwhile, especially if you're executing on a Microsoft strategy. I'll list out the strong points of SSIS, and what sets it apart from Pentaho:

  1. Checkpointing and transaction management: This is probably my favourite feature of SSIS when compared to Pentaho DI. Checkpointing is more or less what you would expect: if a job fails part way through, SSIS records (in a local XML file) the point of failure, as well as all of the job variable values. This allows an operator to investigate the cause of failure (typically by examining logs), take corrective action, and then restart the job from the point of failure. In the world of ETL, this scenario is not uncommon. On top of checkpointing, SSIS also makes it possible to contain multiple steps within a single transaction, so that if any step within the transaction fails, previous steps are rolled back. This is accomplished through tight integration with Microsoft's Distributed Transaction Coordinator - a component of the Windows operating system. I have experimented with this feature and got it to work successfully, although at first it was less than straightforward to configure. That said, I've never had a need for it. Also, this transaction management (from what I can tell) only works at the job level, and not at the transformation level, which is where it's more needed. One last thing I should point out about checkpointing: don't confuse job checkpointing with incremental loading. Incremental loading (as opposed to full refreshes) requires the ETL developer to identify what data has been INSERTed, UPDATEd, or DELETEd since the last time the job was run, which requires upfront analysis and cannot be addressed solely by a tool.
  2. Consolidated logging, and the ability to log to the Windows Event Log: All ETL tools produce a single log, and most allow for varying levels of verbosity from a single setting. SSIS goes one step further and nicely integrates with the Windows Event Viewer, which if you're a Microsoft shop is a HUGE benefit. As I mentioned earlier, logging was a sore spot for PowerShell, so the difference between SSIS and PowerShell is night and day.
  3. Consolidated error handling: Typically when a job fails (for any reason), you want your ETL job to fire off an e-mail to technical support. SSIS makes this very easy to do, as there is a single consolidated error handler which you can develop for. Both PowerShell and Pentaho DI appear to lack this feature.
  4. C# and VB.NET scripting steps: Most ETL tools include a scripting step. This makes it possible to perform arbitrary transformations, as well as other tasks like adding or removing rows from the pipeline, or even adding and removing columns. Pentaho DI relies on either Java or JavaScript for its main scripting step; personally, I'm fine with just the JavaScript step in Pentaho. SSIS allows you to script in either VB.NET or C# (C# is only supported in SSIS 2008 or later), which if you're already a .NET developer is a major benefit. Furthermore, it's significantly easier for that same code to be made into a standalone data flow component (which gets you back to the visual metaphor). My only warning with SSIS's scripting step is that I can easily see developers relying on it too heavily and performing the majority of their transformations in script. As I keep saying: the visual metaphor is the most powerful concept behind ETL, and it dramatically lowers post-implementation costs, such as support costs and the cost of impact analysis.
  5. Persistent look-up cache: This is new with SSIS 2008. This feature allows you to physically store data in a local binary file optimized for keyed look-ups. I've only gone so far as to test that this thing works and can be used in a look-up step. I'm not exactly sure what the primary business driver is behind this feature. It would be nice if Microsoft would elaborate on what the recommended use cases are and, just as importantly, where NOT to use this feature. My concern is that developers may use it in bulk look-up situations where a sort/merge/join would perform just as well, if not better. By comparison, Pentaho DI offers a "Stream lookup" step. I can't say which approach is better, though.
  6. Excellent SQL Server integration: This goes without saying. My only gripe here is the dedicated SQL Server destination step, which appears to work only if the job is running on the same server as the SQL Server database - I found that out the hard way. While I could go on about some of the subtle benefits of the integration (e.g. you can easily configure batch INSERTs into cluster-indexed tables to perform very well), the biggest advantage is the same advantage you get whenever you homogenize technology under a single vendor: by using SSIS and SQL Server together, if something goes wrong Microsoft can't blame some other vendor, nor can they say "well, it's really not designed to work for _that_ database." These games that vendors play are the main reason why IT departments prefer homogeneous architectures. There is also very tight MS Excel integration, although as of this writing it doesn't work when running SSIS in 64-bit mode. Maybe this is fixed with Office 2010, since that is the first version of Office to ship a 64-bit edition.
  7. Dynamic step configurations using expressions: SSIS includes a declarative expression language which allows you to dynamically configure most settings. The most obvious use of expressions is to configure the database connection string(s) based on a variable value (which is set at run-time). At a higher level, expressions make it much easier to develop generic packages/jobs which share the same code base but have different source data stores and target data marts. I feel Pentaho DI is lacking in this department.
  8. Built-in text analytics steps: This is one of the first things I noticed about SSIS. Basically there are two Data Flow (i.e. transformation) steps which support basic text analytics. The first is a keyword extractor: the step can be configured to extract relevant keywords from documents, and those keywords can then be used to classify documents, making them easier to search against and report on. The second step allows you to search a given block of text against a list of keywords retrieved from a configurable database query. This is very useful for things like sentiment analysis (i.e. determining whether a block of text includes words like "cool" or "fun" versus words like "sucks" or "fail"), but it could also be used to flag sensitive information, such as people's names (there's a toy sketch of the idea just below). While Pentaho DI does not include any built-in steps like these, it does have a regular expression step (which SSIS does not explicitly include); but if I had to choose, I prefer SSIS's text analytics functionality.
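
Just to make the keyword-list idea concrete, here is a toy sketch of the flavour of it; the keyword lists and sample text are invented, and in SSIS the list would come from the configured database query rather than being hard-coded.

  # Toy keyword-list sentiment flag (hard-coded lists purely for illustration)
  $positive = @("cool", "fun", "great")
  $negative = @("sucks", "fail", "awful")

  $text  = "The new ad looks great and the demo was fun"
  $words = $text.ToLower() -split "\W+"

  $score = 0
  foreach ($w in $words) {
      if ($positive -contains $w) { $score++ }
      if ($negative -contains $w) { $score-- }
  }
  if ($score -gt 0) { "positive" } elseif ($score -lt 0) { "negative" } else { "neutral" }
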
So as you can see, SSIS has some very powerful features. If you're already a Microsoft shop, already use MS SQL Server, and have a pool of developers and application support personnel trained on Microsoft's technology, I doubt you'll find a better ETL tool to meet your requirements. Even if you're not using SQL Server but are doing most of your development on a Microsoft stack, it's probably a very good fit. However, if you're on a non-Microsoft platform, there are many other options you should consider.

Pentaho Data Integration (PDI)

Which brings me to Pentaho DI, one of the most platform-neutral ETL tools out there. I am an unabashed fan of Pentaho, partly, I'll admit, because I have quite a bit of experience working with it. Before I get into what I like and dislike about Pentaho DI, I'll explain how it differs from other ETL tools at a very high level.

Pentaho DI is an "Open Source" ETL tool, based on the Kettle project. However, unlike Linux, which has many different flavours and supporting vendors, there is only one vendor, Pentaho, that supports the tool. This is not unlike MySQL, which was only supported through Sun and now Oracle. I'm fine with that.

Just to digress for a moment, years ago I read "The Cathedral and the Bazaar", which is basically a manifesto for open source development. It argues that people will self-select projects to work on, and that this will drive the future of software development. I don't agree with the overarching thesis, but there are merits to the argument. The problem with volunteer-based projects is that people tend to volunteer for selfish reasons. In the case of Linux, it is seen by many as a way of combating Microsoft's perceived tyranny. Shared adversaries are, after all, the strongest bonds between humans (I heard that from an anthropologist, and believe it to be true). When it comes to other forms of crowdsourcing, people often derive enjoyment from contributing. For example, I've contributed to Open Street Map, adding in streets and landmarks in my neighbourhood. If I had more time on my hands I'd probably be doing more of it. It's kind of fun, and I can see how millions of other people might find Open Street Map fun. And let's not forget how popular Wikipedia is. But the problem is that most of the work people get paid to do is not necessarily fun - even in the software world. Some things need to get done, even if nobody wants to do them. The other problem with open source projects is that until they reach a critical mass, competing projects can pop up, which not only dilutes the overall pool of talent but also reduces the likelihood of any one project succeeding. For example, if there had been many different open source operating systems at the time Linux began to emerge, all competing with each other, would we have Linux today? I'm not so sure. My point in all this is that I support companies like Pentaho backing open source projects, even if that means more corporate control over the project. The important thing is that the entire source code is available, and anyone can contribute to it. This puts Pentaho in the position of being a _service_ company, which is really the future of software any way you look at it.

Okay, moving on to Pentaho DI itself. First, the bad news: as with any piece of software, there are bugs. Now, if you have a support contract with Pentaho, this is never a problem, as they will provide you with a patch (and you will receive intermediate patch releases). It's also possible to install those patches yourself (or fix the bug yourself), but that involves building the application from source. That said, all of the bugs I've encountered can be worked around, and to be fair, I haven't found any bugs yet in the latest version (4.0).

Now on to the good news. Here is what I love about Pentaho DI:
  1. Plenty of built-in steps: Although I've had to resort to the JavaScript scripting step on more than one occasion (version 4.0 now includes a Java scripting step as well), most transformations can be built entirely from the built-in steps.
  2. Web ETL steps built in: PDI comes with several web extraction steps, including GET and POST lookups, WSDL lookups, and a step to check whether a web service is even available. There is also a step for decoding XML documents. For JSON documents, I just use the JavaScript step (it's not too hard to find JavaScript code to decode JSON). I mention all of these steps because I believe web ETL is a big part of the future of ETL, especially with so many data services now out there. Contrast this with SSIS, which is rather weak in this department.
  3. Single Type II Slowly Changing Dimension step: Unlike SSIS (which doesn't have a single SCD step, but rather uses a wizard to generate multiple steps), PDI includes a single Type II SCD step. I've used it quite a bit; it works very well and is simple and straightforward to configure.
  4. Normalized database repository: The other ETL tools I have used (SSIS and DataStage) allow you to save jobs as files, or within a database. However, I find those files to be relatively opaque and difficult to query (I suppose with the right tools it would be possible). PDI, on the other hand, stores all of its Jobs and Transformations in a normalized database. This makes it possible to query your ETL code using standard SQL queries. I have taken advantage of this by writing queries to produce simple impact analysis reports showing which tables are impacted by which jobs (there's an illustrative query after this list). Since impact analysis is a huge aspect of data Enterprise Architecture, this is a real benefit. Furthermore, because PDI jobs are stored within a centralized RDBMS, I find that it is much easier to deploy and troubleshoot jobs in production than it is with SSIS. To be fair, SSIS does allow packages to be deployed to SQL Server, but they're really just deployed as an opaque "BLOB" which can't be easily queried.
  5. Java based (runs on any platform): There was a time when Java applications were not considered appropriate for high-performance workloads such as an ETL tool. Well, to my surprise, Java has caught up pretty closely to natively compiled code, so the benefits of "write once, run anywhere" are starting to come into focus. Indeed, I have run PDI on both Windows and Linux machines without incident, and I can't tell the difference between the performance of SSIS and PDI.
  6. Clustering/cloud deployment: I honestly don't have any hands-on experience deploying PDI as part of a cluster or within a cloud, although I have heard some impressive benchmarks, with reports of over 200,000 records processed per second within a single transformation. Because it is possible to cluster PDI servers, individual ETL transformations can be scaled out across multiple server nodes. Furthermore, Pentaho is now beginning to offer cloud-based solutions. This is also something I have no hands-on experience with, but I like what I'm hearing. Contrast this with SSIS, which is not "cluster aware" and must be installed as "standalone" within a Windows cluster. To be fair, I believe part of the reason for SSIS's shortcoming is that it is more transaction aware than PDI, and with that come certain challenges. I also think it is possible to run SSIS as part of a cluster, but Microsoft does not recommend it.
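
To give a flavour of the impact-analysis queries I mentioned in point 4, here is an illustrative example. The repository table and column names are approximations from memory and vary by PDI version, so treat this strictly as a sketch and check your own repository schema; the connection details are invented.

  # Illustrative query against a PDI repository database (names are approximate - verify
  # against your own repository schema); connection details are hypothetical
  $query = "SELECT t.NAME AS transformation, s.NAME AS step " +
           "FROM   R_TRANSFORMATION t " +
           "JOIN   R_STEP s ON s.ID_TRANSFORMATION = t.ID_TRANSFORMATION " +
           "ORDER  BY t.NAME, s.NAME"

  Invoke-Sqlcmd -ServerInstance "REPOSERVER" -Database "pdi_repository" -Query $query | Format-Table

  # Narrowing this down to the steps that read or write a particular table means also joining
  # the step-attribute table, but the point stands: the repository is plain, queryable relational data.
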
Well, that's my experience report. I hope you learned something. Feel free to e-mail me or comment with any corrections. It's quite possible I've made an incorrect assumption, so if any PowerShell, SSIS, or PDI developers are reading this, don't hesitate to chime in. I'll happily update this post if I agree there's a problem with something I've written.