Monday, August 23, 2010

Twelve Reasons to use an ETL tool / Experience Report on ETL tools: Pentaho DI, SSIS, and PowerShell

Over the past couple months I've been jumping between three different ETL tools (well PowerShell is not exactly an ETL tool but has some overlapping functionality). The experience has given me new perspective on the strengths and weaknesses of each tool. I hope to share with you my experiences, opinions, and recommendations.

This report is not going to be as structured, or cover as many tools as something you might find from a Gartner report. My approach here is to get into specific experiences with the tools as well as discuss why I think ETL tools are important to begin with. My perspective is not intended to be a definitive decision making tool, but rather a useful component to the decision making process when choosing an ETL tool.

Before getting into my experiences I'll give you a brief overview of what ETL is, and how these tools fit into the ETL landscape.

ETL stands for "Extract Transform Load". ETL tools first started showing up in the mid nineties as a response to the growing demand of data warehouses. The first major ETL tool was Informatica, and continues to be one of the best ETL tools available on the market today. The main reason ETL tools were invented was that it has traditionally been time consuming and error prone to extract data from multiple source systems, merge those data together, and load into a data warehouse for reporting and analytics.

To this day, the majority of techies in IT are unaware of what an ETL tool is. However those same people are often tasked with solving the very problems ETL tools are designed to solve. More experienced developers will achieve their goals through a combination of shell scripts and SQL scripts. Less experienced developers will simply stay in their "comfort zone" and use whatever programming language they happen to be the most familiar with, with little regard for software maintenance and support.

The task itself of writing a computer program to extract data, merge it (or perform some other transformation), and loading those data into a target data warehouse is conceptually very simple for the average developer. However, it is not until you start running into the following problems do you realize that the "follow your nose" approach doesn't work so well. Here are some of the issues you will likely run into when maintaining ETL jobs for a data warehouse (I've run into them all). In other words, here is what an ETL tool is designed for:

  1. Code readability: This is probably the biggest difference between an ETL tool and pretty much every other programming language out there. It is driven off of a visual metaphor. When I first saw an ETL job, it reminded me of one of those crazy Rube-Goldberg machines. However, once you acquaint yourself with the iconography, and what all the connecting lines mean, it becomes very easy to look at an ETL job and understand what it is doing, and how it works. Unfortunately, many developers scoff at this visual approach, after all "real coders write real code". I also wonder if many developers are afraid of developing ETL jobs as it lays their code bare for all to see. Managers on the other hand love this visual approach, as it allows them to scrutinize and partially understand how code is working. Furthermore, an ETL tool will visually show your data coursing through it from one step to the next, much like observing water flow through tributaries into rivers into lakes. When a full load takes several hours to complete, it is very reassuring to see what is actually going on with the data.
  2. Data Element mapping: Much of what you're doing in data warehousing involves mapping a source data element to a target schema. This is tedious work and it's potentially error prone - especially if you're lining up an INSERT statement with a SELECT statement. ETL tools make it very easy and safe to line up and map data elements in a fraction of the time you would normally take.
  3. Impact Analysis: Agility is often seen as a function of management. However, if it takes you a month to figure out what has being impacted as a result of a minor schema change (e.g. adding a column to a table), then you're going to get stuck in your tracks before you have the chance to yell "Scrum!". ETL tools make it easy to identify where data is being sourced from, and where it is being targeted. Some ETL tools will even produce impact reports for you. Keep in mind that this only one aspect of impact analysis, but every little bit helps.
  4. Incremental loading: By default most developers will develop code to perform a full load on a data warehouse. This code is not only easy to develop, but is fairly reliable. Unfortunately, full loads can take a lot of time to complete, and as time goes on will take longer and longer. Switching to an incremental load is tricky since it involves figuring out what data is new, changed, or has been removed. Depending on your source system, your options may be limited. Nevertheless, ETL tools provide functionality to assist with different scenarios
  5. Parallelism: Writing concurrent software is not for the "faint of heart". There is an inherent overhead required to manage semaphores and other forms of inter-process communication. Furthermore, the threat of deadlocks or livelocks in complex systems can often only be discovered through trial-and-error. There are modern frameworks which greatly simplify parallelism in languages like Java or Python. However it is hard to argue against the simplicity of the ETL approach when developing parallel data processing applications. Furthermore, some ETL tools (e.g. IBM DataStage Enterprise Edition) can even parallelize JOINs. This makes it possible to outperform an in-place JOIN within an RDBMS like Oracle (most SQL queries still run single threaded).
  6. Checkpointing and job recovery: Imagine you have just developed a script to perform a large data warehouse load. Let's say you run the script on a weekly basis, and the load takes 8 hours to run (you run it over night). One morning you arrive to find the script has failed 80% into the job. Some of the tables have been updated, but there are a few that haven't been updated. You can run the job from the beginning, but this is problematic since you need to back out all the data you have already inserted (otherwise you'll get duplicate data). Alternatively, you can pinpoint the point of failure and run from there. However, this requires a code change, which means you will need to test your modified code before getting it into production (and let's not forget the psychological pressure when people are waiting on fresh data). ETL tools make it relatively easy to checkpoint your code, and resume from a failed step. Some ETL tools have built-in checkpointing which means you don't have to even instrument your code for checkpointing. Furthermore, if code changes are required [due to a failed job], they are more often a change in configuration, rather than a real change in code, so it is generally safe to make a configuration change and resume the job with less risk than traditional scripting. That said, not all ETL tools support checkpointing. However, even ETL tools which don't explicitly support checkpointing can be more easily instrumented to checkpoint than traditional scripting or programming languages.
  7. Logging and monitoring: I've already touched upon this point with respect to monitoring. To reiterate, ETL tools allow you to see (seeing is believing) data flowing from one step to the next. It's powerful data visualization that is not unlike a floor plan or map visualization. Namely, it can communicate far more information than a bar chart or line chart ever will. As for logging, ETL tools by default centralize logs and implicitly logs what you need to log (with configurable views of granularity), without having to instrument your code.
  8. Centralized Error Handling: Scripting languages don't provide centralized error handling (this has more to do with the OS and legacy applications), so you're forced to always explicitly check for errors and call the appropriate error handling routine. ETL tools often include generic error handling routines which are invoked regardless of where the error originated from or how it is raised.
  9. Slowly Changing Dimensions: Although I've listed this as #9, perhaps this should be #1 from a data quality perspective. If you don't implement a Type II (or higher) SCD policy, you're precious data warehousing investment will most certainly degrade over time. Slowly Changing Dimensions are crucially important, as they allow you to show a consistent view of history. I could go on at greater length as to what an SCD is, but if you're not sure just read the Wikipedia article. At any rate, ETL tools have built-in SCD steps which allow you to define the technical key, the natural key, the effective begin and end date fields, and in some cases a version field and "current version" flag field. Without an ETL tool SCDs are a major pain in the butt to manage, and developers will often not implement a Type II SCD policy (often because they aren't aware of what an SCD policy is), but also due to the added effort. Don't forget the classic cop-out: "Hey, if the business analyst doesn't ask for it, I'm not going to build it". SCDs are one of those classic business requirements that few people realize they need until it's too late.
  10. Pivoting. It is increasingly popular to see data being stored and logged in what are known as Entity Attribute Value data models. I generally prefer to see explicit data models over generalized data models (which effectively treat the database as "bit bucket"). But they're here to stay, and in some cases actually make sense. However, it is virtually impossible to analyze EAV data without pivoting it back into columns. This is another task which is well served through an ETL tool, and which can be very cumbersome if you intend to accomplish through traditional scripting. In short, an ETL tool allows you specify the grouping ID (essentially the primary key for the row), the pivot key column, and the various pivot values you wish to map to columns. Furthermore, if you need to extract data out of some kind of BLOB or something like an XML or JSON document, this too is easily achieved through standard ETL steps. If you were to do this through scripting, you would probably have to rely on temporary tables and plenty of code. In other words, just like with Slowly Changing Dimensions, there are fewer Function Points to worry about with an ETL tool
  11. In-memory batch manipulation: Often when manipulating data sets in bulk, many developers will rely on temporary tables, and from there perform batch operations against those temporary tables. While this works, and is sometimes even necessary, ETL tools allow you to more easily process large batches of data as part of a continuous in-memory pipeline of records. It is this notion of a "data pipeline" which also distinguishes ETL from other programming paradigms, such as scripting approaches (even those scripting languages that support piping operations.
  12. Distributed transaction control: While it is possible to create distributed transactions using a transaction coordinator, often developers who eschew ETL tools will also avoid using other high level tools, such as a distributed transaction co-ordinator. Some ETL tools include built-in integration with distributed transaction co-ordinators. However, in my own experience I have never had the need for two-phase-commit in an ETL tool. I suppose if I needed to ensure that my data warehouse was always 100% consistent, even during load times, then this makes sense. However, I would urge caution when using transaction managers, since large transactions can easily lead to locking, not to mention they perform horribly under full load scenarios, and a single COMMIT statement can take hours (depending on how you've configured indexes).

Before I get into my current experiences with PowerShell, SSIS, and Pentaho DI (PDI), I'd like to point out that there is also something called ELT (Extract Load Transform). ELT differs from ETL in that it tends to rely on the target DBMS to perform the transformations. ELT tools achieve this through code generation. Sometimes they are referred to as "code generating ETL" versus most other ETL which distinguishes itself with the moniker "engine-based ETL". ELT works like this:
  1. Extract data from source databases(s) using standard connectivity
  2. Load data in bulk into target database
  3. Generate and submit SQL DML statements within target database to achieve desired data transformations
Popular ELT tools include the open source TalenD and Oracle Data Integration (formerly Synoptic). The strength of this approach is that the target DBMS can be leveraged to achieve superior performance. Furthermore, it is possible to override the code generation and hand-tune the generated code.

If you target a robust DBMS (e.g. Oracle), the results can be impressive. However, since you are relying on the target DBMS, your options are more limited. Yet this is not the main reason I am wary of the ELT approach. My worry with ELT tools is that they can easily break the visual metaphor by relying too much on hand generated code. Furthermore, during the transformation stages it is impossible to see what is going on (I prefer maximum transparency during long running transformations). Over time, your ELT jobs can devolve into something resembling scripting.

I should point out that when I first started developing transformations and jobs using Ascential DataStage (now an IBM product), I found that most data was being read from and written to the same Oracle database, where I was working. At my workplace I was quickly able to run circles around most ETL developers using bread-and-butter SQL queries. Indeed, Oracle has one of the best (possibly THE best) query optimizer out there, and few of my peers understood how to take advantage of DataStage's parallel extender, and were probably fairly average ETL developers. But since that time I have come full circle and am a firm believer in the visual metaphor. Yes, I may sacrifice some performance, but at the time I was doing this I didn't have the perspective that I'm giving you in the aforementioned 12 reasons to use an ETL tool.


Getting back to my recent experiences, I'll start with PowerShell. To be clear, PowerShell is NOT an ETL tool. It's a shell scripting tool. When I first discovered it (as part of MS SQL Server 2008) I was very impressed with what I saw. There was a time many years ago when I was a proficient UNIX shell script programmer. As any UNIX buff knows, mastery of the Bourne/Korn or C Shell (and all the standard command line tools) is the fastest route to becoming a UNIX power user. Occassionally I'll still fire up Cygwin (I used to use MKS toolkit) and will pipe stuff through sed or awk. Vi is still my favourite text editor.

I was therefore very eager to embrace PowerShell as a potential ETL disruptor (one should always be on the lookout for disruptive technologies). There's a lot to love about PowerShell. As much as I will diss Microsoft for its monopolistic practices, they are one of the few true innovators of the back-office. PowerShell is no exception. Let me quickly point out some of the more powerful features of PowerShell:
  1. Objects can be piped: Piping in UNIX is a way of life, but it can also be tedious as you're forced to serialize and deserialize streams of text, in order to achieve the desired result. PowerShell allows you to pipe .NET objects from one shell app to the next. For example, it's possible to pipe a result set, or directory listing as an object. This comes in handy quite a bit. For example, it is very easy to retrieve a particular set of columns from a query's result set, without having to sift through and parse a whole bunch of text
  2. Snap-ins can be used to move between shell contexts: Currently the only snap-ins I've used are for the OS shell, and the MS SQL Server shell. The OS shell is the default shell, and listing "entities" will basically just list files and directories as you would normally expect. But while in SQLServer mode, the shell lists databases and tables as though they were directories and files. While most DBAs are accustomed to running database commands using a query shell window, PowerShell goes far beyond what is easily possible using a basic SQL (or T-SQL) query interface.
  3. Parsing files is easy: Going back to the object pipes, it is also very easy to pipe the contents of a file, and parse those contents as a file object. PowerShell has also reduced the number of steps required to open and read a file, making such operations second nature. Let's face it, processing text files is as common activity in shell processing as ever (e.g. you need to parse a log file for an error message), so it's very much a pleasure to use PowerShell for these tasks
  4. Great documentation: It's not often I compliment software documentation, but I feel it should be acknowledged that Microsoft even rethought those dry UNIX "man" pages. Perhaps it's just one technical writer at MS making this difference. But whoever you are, hats off for having a good sense of humour and making the dry and technical a pleasure to read.

With all that said, I've decided that PowerShell in its current form (I used version 2.0) has some major drawbacks from an ETL perspective. First off, by definition it's a "scripting language". So by definition it will never follow the visual metaphor. Depending on your perspective this is may be a good thing. But I for one am a huge believer in code visualization. While it is possible to run a script and examine log output in real time, it's not easy to tell what is happening where when tasks are running in parallel, and what the relationship between those tasks are.

The second major issue I have with PowerShell is that it simply cannot perform the function of what an ETL tool can do. Originally I had coded my ETL tasks as T-SQL Stored Procedures, which were basically wrappers for INSERT/SELECT statements. Unfortunately under high volumes, these queries caused database locking to occur. Apart from using TEMP tables, my only other option was to perform my JOINs within an ETL tool. This all but makes PowerShell a complete showstopper from a data warehousing perspective when compared to an ETL tool. Nevertheless, I did for a while attempt to call my individual ETL transformations within PowerShell, and instead treat PowerShell as a job controller.

The third issue with PowerShell is that I couldn't find a simple way to log everything. Depending on the type of error being raised I had to capture it in any one of three ways. Furthermore, I basically had only one easy option for where to put the log: into a text file.

Overall I am very impressed with PowerShell from the point of view that Microsoft has elevated scripting to the next level. But as an ETL job controller, it simply cannot do as good a job as an ETL tool like Pentaho DI or SSIS.


At first I wasn't overly impressed with SSIS. It struck me as overly architected, and too reliant on scripting steps, thus undermining the all important visual metaphor. I have come to believe that it's architecture is more elegant than meets the eye, but I still believe it could use a heck of a lot more built-in steps. For example, the lack of an INSERT/UPDATE step is a glaring omission. I also couldn't find any built-in step for hashing (which comes in very handy when comparing multiple columns for changes) or an explicit DELETE or UPDATE step (you must accomplish DELETEs and UPDATEs using an OLE DB transformation step). SSIS also lacks a Slowly Changing Dimension step. Instead, SSIS provides you with a wizard which will generate multiple steps for you out of existing components.
I don't agree with Microsoft's approach here since it makes SCDs less configurable, and more reliant on moving parts. Contrast this with Pentaho DI which includes all of these steps, as well as supporting a single configurable Slowly Changing Dimension step. That said, it is possible to add in third party steps, or simply code the step using the built-in scripting components. However, the problem with third-party add-ins is that they complicate deployments. And the problem with over reliance on scripting steps is that they can't be explicitly configured, rather logic must be coded. As I keep mentioning, scripting steps break with the visual metaphor, so you have to either read the step's label or open it up to know what it means (ETL iconography makes it easily possible to see what is going on in a single glance).

Continuing my gripes with SSIS, there is some confusion as to what steps to use when it comes to database connectivity. Namely, SSIS 2008 offers both OLE DB and ADO.NET connectors. I'm currently using OLE DB, but should I be using ADO.NET now? I'm not too concerned about that dilemma. What drives me bananas is that they also include a SQLServer destination step. It turns out that this step only works properly if you're running your SSIS server on the same server as your SQL Server. For myself this is maddening, and I'm not sure why Microsoft even bothered including this step - I couldn't discern any difference in performance from the OLE DB destination step.

Okay, that's the bad news. The good news is that SSIS has some pretty nifty features making it worthwhile, especially if you're executing on a Microsoft strategy. I'll list out the strong points of SSIS, and what sets it apart from Pentaho:

  1. Checkpointing and transaction management: This is probably my favourite feature of SSIS, when compared to Pentaho DI. Checkpointing is more or less what you would expect. If a job fails part way through, SSIS records (in a local XML file) the point of failure, as well as all of the job variable values. This allows an operator to investigate the cause of failure (typically by examining logs), take corrective action, and then restart the job from point of failure. In the world of ETL, this scenario is not uncommon. On top of checkpointing, SSIS also makes it possible to contain multiple steps as part of a single transaction, so that if any step within the transaction fails, previous steps are rolled back. This is accomplished through tight integration with Microsoft's Distributed Transaction Controller - a component of the Windows operating system. I have experimented with this feature, and have got it to work successfully, although at first it was less than straightforward to configure. That said, I've never had a need for this feature. Also, this transaction management (from what I can tell) only works at the job level, and not at the transformation level, which is where it's more needed. One last thing I should point out about checkpointing: Don't confuse job checkpointing with incremental loading. Incremental loading (as opposed full refreshes) requires the ETL developer to identify when data has been INSERTed, UPDATEd, or DELETEd since the last time the job was run, which requires upfront analysis and cannot be addressed solely by a tool.
  2. Consolidated logging, and ability to log to Windows Event Log. All ETL tools produce a single log. Most ETL tools allow for varying levels of verbosity, all from a single setting. SSIS goes one step further and nicely integrates with the Windows Event Viewer which if you're a Microsoft shop is HUGE benefit. As I mentioned earlier, logging was a sore spot for PowerShell. So the difference between SSIS and PowerShell is day-and-night.
  3. Consolidated error handling. Typically when a job fails (for any reason), you want your ETL job to fire off an e-mail to technical support. SSIS makes this very easy to do, as there is a single consolidated Error Handling job which you can develop for. Both PowerShell and Pentaho DI appear to lack this feature.
  4. C# and VB.NET scripting steps: Most ETL tools include a scripting step. This makes it possible to perform arbitrary transformations, as well as doing other tasks like adding or removing rows from the pipeline, or even adding and removing columns. Pentaho DI relies on either Java or JavaScript for its main scripting step. Personally, I'm fine with just the JavaScript step in Pentaho. SSIS allows you to script in either VB.NET or C# (C# is only supported in SSIS 2008 or later) which if you're already a .NET developer is a major benefit. Furthermore, it's significantly easier for that same code to me made into a standalone data flow step component (which gets you back to the visual metaphor). My only warning with SSIS's scripting step, is that I can easily see developers overly relying on it to perform the majority of their transformations. As I keep saying: the visual metaphor is the most powerful concept behind ETL, and dramatically lowers post-implementation costs, such as support costs, and the cost of impact analysis.
  5. Persistent look-up cache: This is new with SSIS 2008. This feature allows you to physically store data in a local binary file optimized for keyed look-ups. I've only gone so far as to test that this thing works, and can be used in a look-up step. I'm not exactly sure what the primary business driver is behind this feature. I would be nice if Microsoft would elaborate on what the recommend use cases are, and just as importantly, where NOT to use this feature. My concern is that developers may use this feature in bulk look-up situations, where a sort/merge/join would perform just as well if not better. Compare this to, Pentaho DI offers "stream lookup" step. Can't say which approach is better though.
  6. Excellent SQLServer integration: This goes without saying. My only gripe here is the dedicated SQLServer target step which appears only to work if the job is running on the same server as the SQLServer DB. I found out the hard way that it won't work otherwise. While I could go on about some of the subtle benefits of the SQLServer integration (e.g. you can easily configure batch INSERTs into cluster indexed tables to perform very well), the biggest advantage is the same advantage you get whenever you homogenize technology under a single vendor. Namely, by using SSIS and SQLServer together, if something goes wrong Microsoft can't blame some other vendor. Nor can they say "well, it's really not designed to work for _that_ database)." These games that vendors will play with you is the main reason why IT departments prefer homogeneous architectures. On the downside, there is also very tight MS Excel integration. As of this writing though, this integration doesn't work when running SSIS in 64-bit mode. Maybe with Office 10 this will be fixed now, since Office 10 is the first version of Office to ship a 64-bit version.
  7. Dynamic step configurations using expressions: SSIS includes a declarative expression language which allows you to dynamically configure most settings. The most obvious use of expressions is to configure the database connection string(s), based on a variable value (which is set at run-time). At a higher level, expressions make it much easier to develop generic packages/jobs which share the same code base, but have different source data stores and target data marts. I feel Pentaho DI is lacking in this department.
  8. Built-in text analytics steps. This is one of the first things I noticed about SSIS. Basically there are two Data Flow (i.e. transformation) steps which support basic text analytics. The first is a keyword extractor. Namely, the step can be configured to extract relevant keywords from documents. These keywords can then be used to classify documents, making them easier to search against and report on. The second text analytic step allows you to search a given set of text against a list of keywords retrieved from a configurable database query. This step is very useful for doing things like sentiment analysis (i.e. determining if a block of text includes words like "cool" or "fun" versus words like "sucks" or "fail"), but it could also be used to flag sensitive information, such as people's names. While, Pentaho DI does not include any built-in steps like these, it does have a regular expression step (which SSIS does not explicitly include), but if I had to choose, I prefer SSIS's text analytics functionality.
So as you can see SSIS has some very powerful features. If you're already a Microsoft shop and already use MS SQLServer, and have a pool of developers and application support personnel already trained on Microsoft's technology, I doubt you'll find a better SSIS tool to meet your requirements. Even if you're not using SQLServer but are doing most of your development on a Microsoft stack, it's probably a very good fit. However, if you're on non-Microsoft platform, there are many other options you should to consider.

Pentaho Data Integration (PDI)

Which brings me to Pentaho DI. One of the most platform neutral ETL tools out there. I am an unabashed fan of Pentaho, partly I'll admit, because I have quite a bit of experience working with it. Before I get into what it is I like and dislike about Pentaho DI I'll explain how it's different from other ETL tools at a very high level.

Pentaho DI is an "Open Source" ETL tool, based on the Kettle project. However, unlike Linux which has many different flavours and supporting vendors, there is only one vendor, Pentaho, that supports the tool. This is not unlike MySQL which was only being supported through SUN and now Oracle. I'm fine with that.

Just to digress for a moment, years ago I read "The Cathedral and the Bazaar" which is basically a manifesto for open source development. It argues that people will self select projects to work on, and this will drive the future of software development. I don't agree with the overarching thesis, but there are merits to this argument. The problem with volunteer based projects is that people tend to volunteer for selfish reasons. In the case of Linux, it is seen by many as a way of combating Microsoft's perceived tyranny. Shared adversaries are after all the strongest bonds between humans (I heard that from an anthropologist, and believe it to be true). When it comes to other forms of crowdsourcing, people often derive enjoyment from contributing. For example, I've contributed to Open Street Map adding in streets and landmarks in my neighbourhood. If I had more time on my hands I'd probably be doing more of it. It's kind of fun. I can also see how millions of other people might find Open Street Map fun. And let's not forget how popular Wikipedia is. But the problem is that most of the work people get paid to do is not necessarily fun - even in the software world. Some things need to get done, even if nobody wants to do them. The other problem with open source projects, is that until they reach a critical mass, competing projects can pop up, which not only dilutes the overall pool of talent but also reduces the likelihood of any one project succeeding. For example, if there were many different open source operating systems at the time Linux began to emerge, all competing with each other, would we have Linux today? I'm not so sure. My point in all this is that I support companies like Pentaho backing open source projects, even if that means more corporate control over the project. The important thing is that the entire source code is code is available, and anyone can contribute to that source code. This therefore puts Pentaho in the position of being a _service_ company, which is really the future of software any way you look at it.

Okay, moving on to Pentaho DI itself. First, the bad news. As with any piece of software, there are bugs. Now, if you have a support contract with Pentaho, this is never a problem as they will provide you with a patch (and you will receive intermediate patch releases). It's also possible to install those patches yourself (or fix the bug yourself), but that involves building the application from source files. That said, all of the bugs I've encountered can be worked around. And to be fair, I haven't found any bugs yet in the latest version (version 4.0).

Now on to the good news. Here is what I love about Pentaho DI:
  1. Plenty of built-in steps: Although I've had to resort to the JavaScript scripting step on more than one occasion (version 4.0 now includes a Java scripting step), most transformations can be entirely built with their built-in steps.
  2. Web ETL steps built-in: PDI comes with several web extraction steps, include GET and POST lookups, WSDL lookups, and a step to check if a web service is even available. There is also a step for decoding XML documents. For JSON documents, I just use the JavaScript step (it's not too hard to find JavaScript code to decode JSON documents). I mention all of these steps because I believe web ETL is a big part of the future of ETL. This is especially true now with so many data services out there. Contrast this to SSIS, which is rather weak in this department.
  3. Single Type II Slowly Changing Dimension step: Unlike SSIS (which doesn't have a single SCD step, but rather uses a wizard to generate multiple steps), PDI includes a single Type II SCD step. I've used it quite a bit, and it works very well and is simple and straightforward to configure.
  4. Normalized database repository: The other ETL tools I have used (SSIS and DataStage) will allow you to save jobs in the form of a file, or within a database. However, I find those files to be relatively opaque and difficult to query (I suppose with the right tools it would be possible to query them). PDI on the other hand stores all of its Jobs and Transformations in a normalized database. This makes it possible to query your ETL code using standard SQL queries. I have taken advantage of this by writing queries to produce simple impact analysis reports, showing which tables are impacted by which jobs. Since impact analysis is a huge aspect of data Enterprise Architecture, this is a real benefit. Furthermore, because PDI jobs are stored within a centralized RDBMS database, I find that it is much easier to deploy and troubleshoot jobs in production than it is with SSIS. To be fair, SSIS does allow packages to be deployed to SQLServer, but their really just deployed as an opaque "BLOB" which can't be easily queried.
  5. Java based (runs on any platform): There was a time where Java applications were not considered appropriate for applications requiring high performance, such as an ETL tool. Well, to my surprise, Java has caught up pretty closely to binary native compiled code. As such, the benefits of "write once, run anywhere" are starting to come into focus. Indeed, I have run PDI on both Windows and Linux machines without incident. I also can't tell the difference between the performance of SSIS versus PDI
  6. Clustering/Cloud deployment: I honestly don't have any hands-on experience deploying PDI as part of a cluster, or within a cloud. Although I have heard some impressive benchmarks, with reports of over 200,000 records processed per second within a single transformation. Because it is possible to cluster PDI servers, this means it is possible to scale up individual ETL transformations across multiple server nodes. Furthermore, Pentaho is now beginning to offer cloud based solutions. This is also something I have no hands-on experience with, but I like what I'm hearing. Contrast this with SSIS, which is not "cluster aware" and must be installed as "standalone" in Windows Cluster. To be fair, I believe part of the reason for SSIS's shortcoming is that it is more transaction aware than PDI, and with that comes certain challenges. I also think that it is possible to run SSIS as part of a cluster, but Microsoft does not recommend this.
Well, that's my experience report. Hope you learned something. Feel free to e-mail me or comment any corrections. It's quite possible I've made an incorrect assumption, so if any PowerShell, SSIS or PDI developers are reading this, don't hesitate to chime in. I'll happily update this blog if I agree there is a problem with something I've written.


Todd McDermid said...

That's an excellent commentary, Neil! I haven't gotten into POSH yet, and haven't played with PDI or other tools - but it's very interesting to hear from someone who has. I feel the same way about most of your gripes about SSIS - the tool I use most.
Most significantly, I completely agree with your sentiment of leaning towards the "visual" tool. Coding's great, but the visual representation is so accessible for the "next guy" who has to manage/extend/repair your ETL job...
Here are some comments about SSIS related to your observations. I agree with your comment of SSIS almost requiring use of scripting - it just doesn't have enough tools in the standard toolbox. I've started a list of available third party extensions, and coded a few myself including an SCD transform you probably tried. It's very useful to have the scripting steps - but I do agree that some ETL devs rely on those much too heavily, and use them as a hammer for every nail, screw, and staple. Checkpointing - although available - has been panned by my esteemed colleagues as something that's not really useful and reliable.
The persistent lookup cache is extremely useful in many scenarios. First - it can be used instead of JOIN transforms that would otherwise require expensive in-memory sorting of data, and can also be used for sources that can't connect with the Lookup component in "real time" (no persistent cache). Second - it can be used very effectively for incremental load scenarios where a surrogate key pipeline lookup cache can be incrementally updated, rather than repopulated with every load. Excel integration is also a touchy topic that is mostly touchy due to Excel alone, not SSIS. The Office 2010 provider will run in 64-bit mode... but can't be installed on the same box as the 32-bit, which causes some other difficulties. That said, once you figure out how to get a subset of your ETL running in 32-bit on a 64-bit server, using Excel as a source isn't that bad.
One other metric missing from your commentary that can be highly relevant to MS shops - SSIS is bundled (free) with SQL Server. That makes it at least attractive enough to crack the lid on and try out.
Once again - excellent post. I'll be referring back to this in the future, I'm sure, and will also direct the MS devs I know on the SSIS team towards it to add weight to various requests of mine. Thanks for taking the time to put this together.

Neil Hepburn said...

Thanks for your comments Todd!
I've bookmarked your list of available third party extensions. This is an excellent resource!

As for checkpointing. I have had my trials and tribulations, but I've got the hang of it now, and have found it to work quite well. Although I will agree that it is finicky. But if you can get it to work, it's worth the hassle. It's really just a matter of experimenting with it for 1/2 a day. But I do feel it could be made more robust.

There is also the concern of saving plaintext passwords in the checkpointing file, which can happen if you're dynamically setting your connection string using variables.

Finally, I agree with the point about costs. This reinforces my recommendation that if you're a SQLServer shop, it's a no brainer to use SSIS as your ETL tool.