Saturday, August 26, 2017

A Modular Data System for Digital Advertising

This post is a follow-up to my original posting (and paper) titled “A Modular Approach to Solving the Data Variety Problem”.

In response to that posting, a LinkedIn commenter (Mark B.) asked the following [paraphrased] question about how he might use a modular approach to build a data analysis system for this scenario:
“As a digital marketer, I would like to see how variations in advertising images are related to responses from different audiences.”

Thank you for this question, Mark. Since you have identified two Subjects, Images and Ad Impressions, this is an ideal jumping-off point to illustrate the benefits of taking a modular approach to analytics.

To give you the short answer: using a modular approach, we can ask and answer cross-Subject questions that would normally be prohibitively expensive to answer:

  • “What images give me the best click-thru and conversion rates?”
  • “Do older images have the same click-thru rate as newer images?”
  • “Including the cost of image production, what is the overall cost of my Ad Campaigns?”
  • “Is there any relationship between the cost of an image and its click-thru and conversion rate by gender?”
  • “Do images with a positive sentiment perform better than those with a neutral or negative sentiment?”

How does a modular approach allow us to answer these questions so easily? It all comes down to leveraging the Dimensions and Measures already developed for each Subject on its own (i.e. Images and Ad Impressions) and then combining those Subjects into a unified multi-Subject Graph that can be easily queried.

Recapping my paper: if you take a modular approach to analytics, you can decompose your analyses into separate “Subjects” (tables), and then further decompose those Subjects into Sub-Sets. Each of these sub-components can be developed independently of the others. Once these components (stored as portable data files) are “docked in” to the main repository, users can “lob” and “link” them together to form graphs that allow for cross-Subject analyses.
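To make this recap concrete, here is a minimal sketch in Python/pandas. All file, Subject, and column names here are hypothetical; “lobbing” is modeled as a vertical union (pd.concat) and “linking” as a key-based join (merge):

```python
import pandas as pd

# Hypothetical "docked in" portable Sub-Set files, organized by Subject.
repository = {
    "Image":        ["Image_TeamA.csv", "Image_TeamB.csv"],
    "AdImpression": ["AdImpression_Platform1.csv", "AdImpression_Platform2.csv"],
}

def lob(subject):
    """Vertically combine every Sub-Set file docked in under one Subject."""
    return pd.concat(
        [pd.read_csv(f) for f in repository[subject]], ignore_index=True
    )

def link(left, right, on):
    """Join two lobbed Subjects on a shared column to form a cross-Subject graph."""
    return left.merge(right, on=on)

# e.g. graph = link(lob("Image"), lob("AdImpression"), on="image_id")
```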

Let’s first break this down into the two subjects at hand: Images and Ad Impressions.

Let’s now tackle the first Subject, “Images”. We may have a team responsible for developing reports to analyze image statistics. For example, this team may have developed a set of Dimensions and Measures that allows them to determine how much images cost to produce, how old they are, and what type of sentiment they are intended to evoke. Since images would presumably be developed by different teams, each team would have its own report (represented as a table). Since each team’s report would conform to a standard published schema, the reports could be combined to form a single cross-department report. For example, “Team A” and “Team B” could combine their image reports into a single “Image” Subject table.
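Here is a minimal sketch of that combination, assuming a hypothetical published schema and made-up team reports:

```python
import pandas as pd

# Hypothetical published schema for the "Image" Subject.
IMAGE_SCHEMA = ["image_id", "team", "production_cost", "created_date", "sentiment"]

team_a = pd.DataFrame({
    "image_id":        ["IMG-001", "IMG-002"],
    "team":            ["A", "A"],
    "production_cost": [500.0, 750.0],
    "created_date":    pd.to_datetime(["2017-03-01", "2017-05-15"]),
    "sentiment":       ["positive", "neutral"],
})
team_b = pd.DataFrame({
    "image_id":        ["IMG-003"],
    "team":            ["B"],
    "production_cost": [300.0],
    "created_date":    pd.to_datetime(["2017-06-20"]),
    "sentiment":       ["negative"],
})

def lob_images(*subsets):
    """Combine per-team reports, rejecting any that break the published schema."""
    for subset in subsets:
        if list(subset.columns) != IMAGE_SCHEMA:
            raise ValueError("Sub-Set does not conform to the Image schema")
    return pd.concat(subsets, ignore_index=True)

images = lob_images(team_a, team_b)  # the single cross-department "Image" Subject
```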

Moving on to the second Subject, “Ad Impressions”: again, there may be multiple teams running multiple advertising campaigns across multiple advertising platforms over several months. The teams responsible for managing these ad campaigns might even differ based on the Ad Campaign or the Digital Advertising Platform the ads are served on. Like the Image teams, these advertising teams would have a set of Dimensions and Measures that would allow them to determine how often an ad was clicked on, how many conversions (e.g. goal actions) there were, what the dollar value of those conversions was, and how these metrics break out by gender and other demographic and psychographic variables (which may be specific to the ad platform). Again, since each team’s report would conform to a published schema, the reports could be combined into a single report, and this combined report would constitute the “Ad Impression” Subject.
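Here is a sketch of how such Measures might be computed once the teams’ reports have been combined; the column names and figures are invented for illustration:

```python
import pandas as pd

# Hypothetical Ad Impression rows, already combined from several teams/platforms.
ad_impressions = pd.DataFrame({
    "campaign":         ["Spring", "Spring", "Summer", "Summer"],
    "gender":           ["F", "M", "F", "M"],
    "impressions":      [1000, 800, 1200, 900],
    "clicks":           [50, 24, 84, 27],
    "conversions":      [5, 2, 9, 3],
    "conversion_value": [250.0, 90.0, 480.0, 120.0],
})

# Example Measures: click-thru and conversion rates, broken out by gender.
by_gender = ad_impressions.groupby("gender")[
    ["impressions", "clicks", "conversions", "conversion_value"]
].sum()
by_gender["click_thru_rate"] = by_gender["clicks"] / by_gender["impressions"]
by_gender["conversion_rate"] = by_gender["conversions"] / by_gender["clicks"]
print(by_gender)
```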

I have just described two different Subjects, each with its own set of Dimensions and Measures, and each composed of its own Sub-Sets of data. Where the modular approach becomes relevant is that users can now locate these Sub-Sets, “lob” them into larger Subjects, and then “link” those Subjects to form graphs that allow for cross-Subject analyses. Namely, we can now ask and answer the questions raised near the beginning of this post:

  • “What images give me the best click-thru and conversion rates?”
  • “Do older images have the same click-thru rate as newer images?”
  • “Including the cost of image production, what is the overall cost of my Ad Campaigns?”
  • “Is there any relationship between the cost of an image and its click-thru and conversion rate by gender?”
  • “Do images with a positive sentiment perform better than those with a neutral or negative sentiment?”

However, there is one piece missing from the picture: In order to make this possible, we would need to define a simple “bridge” table for connecting the image profiles to the ad impressions.  This bridge table would be developed and maintained by the team that has access to the information required to link the two subjects together.
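Here is a sketch of how such a bridge might work, with hypothetical ad and image identifiers; once the bridge is in place, questions like “What images give me the best click-thru rates?” reduce to a short graph walk:

```python
import pandas as pd

images = pd.DataFrame({
    "image_id":        ["IMG-001", "IMG-002"],
    "production_cost": [500.0, 750.0],
})
ad_impressions = pd.DataFrame({
    "ad_id":       ["AD-10", "AD-11", "AD-12"],
    "impressions": [1000, 800, 1200],
    "clicks":      [50, 24, 84],
})

# Hypothetical bridge table, maintained by whichever team knows
# which image was served in which ad placement.
bridge = pd.DataFrame({
    "ad_id":    ["AD-10", "AD-11", "AD-12"],
    "image_id": ["IMG-001", "IMG-001", "IMG-002"],
})

# Walk the graph: Images -> bridge -> Ad Impressions.
linked = images.merge(bridge, on="image_id").merge(ad_impressions, on="ad_id")

# e.g. "What images give me the best click-thru rates?"
totals = linked.groupby("image_id")[["clicks", "impressions"]].sum()
totals["click_thru_rate"] = totals["clicks"] / totals["impressions"]
print(totals.sort_values("click_thru_rate", ascending=False))
```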


The following diagram shows how sub-sets sharing the same schema (as depicted with their own colour) can be “lobbed” together to form larger subjects, and how subjects sharing a linking column can be SEMI-JOIN linked together to form a graph for cross-subject analytics.
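Since the diagram leans on SEMI-JOIN linking, here is one way to express a semi-join in pandas; the point is that the left-hand Subject is filtered, never multiplied, which is what keeps the “Fan Trap” (discussed below) at bay:

```python
import pandas as pd

images = pd.DataFrame({
    "image_id":        ["IMG-001", "IMG-002", "IMG-003"],
    "production_cost": [500.0, 750.0, 300.0],
})
# Many impressions can reference the same image (the classic fan-out).
ad_impressions = pd.DataFrame({
    "image_id": ["IMG-001", "IMG-001", "IMG-002"],
    "clicks":   [50, 24, 84],
})

# SEMI-JOIN: keep each image row at most once if it has matching impressions.
# Unlike an INNER JOIN, no left-hand row is ever duplicated, so Measures
# summed over the result (e.g. production_cost) cannot be inflated.
used_images = images[images["image_id"].isin(ad_impressions["image_id"])]
print(used_images)  # IMG-001 and IMG-002, one row each
```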


Astute readers might point out that there is nothing preventing a determined analyst with access to the underlying data from answering the same questions. While it is true that the same results can be achieved with current approaches, those approaches tend to be prohibitively expensive. Here is what is different about the modular approach:

  • Users can integrate data through user-friendly graphical interfaces, vertically “lobbing” Sub-Sets into Subjects and then horizontally linking those customized Subjects, without fear of introducing duplicates through the common “Fan Trap” problem that bogs down most data integration efforts (the semi-join linking shown above is what provides this protection)
  • Users can independently develop new Subjects and Subject Sub-Sets, and then “dock in” those Subjects and Sub-Sets in a self-serve manner, without relying on IT assistance, while still conforming to enterprise data governance rules. This protects both Metadata Integrity and Data Integrity, allowing data to be safely located and integrated by other users
  • Users can “time travel” by choosing an older “AS-OF” date and time, and performing analyses across data that was current as of that date
  • Data files are portable and can potentially be moved to wherever they are needed for either analysis or downstream processing
    • An example file name for the first Ad Impression Subject Sub-Set (as shown in the diagram above) might be: “AdImpression_V1_CAMDAPMON_SF-G-2017-04_AS-OF 2017-08-26 153100.csv”; the sketch after this list shows how such names might be parsed
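Below is a minimal sketch of how AS-OF-stamped file names might support the “time travel” and portability points above. The naming pattern follows the example file name; the parsing convention itself is my assumption:

```python
import re
from datetime import datetime

# Hypothetical Sub-Set files following the naming convention above.
files = [
    "AdImpression_V1_CAMDAPMON_SF-G-2017-04_AS-OF 2017-08-26 153100.csv",
    "AdImpression_V1_CAMDAPMON_SF-G-2017-04_AS-OF 2017-07-31 090000.csv",
]

AS_OF = re.compile(r"_AS-OF (\d{4}-\d{2}-\d{2} \d{6})\.csv$")

def as_of(name):
    """Extract the AS-OF timestamp embedded in a Sub-Set file name."""
    return datetime.strptime(AS_OF.search(name).group(1), "%Y-%m-%d %H%M%S")

# "Time travel": choose the newest file that was current as of a chosen moment.
cutoff = datetime(2017, 8, 1)
current = max((f for f in files if as_of(f) <= cutoff), key=as_of)
print(current)  # the 2017-07-31 version
```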

On top of all of this, other Subjects such as “Web Session” could be “docked in” to the larger repository, allowing Data Analysts to incorporate any Dimensions and Measures developed for the “Web Session” Subject (e.g. ‘Session Duration’) into analyses relating to Images and Ad Impressions. For example, we could ask and answer the question: “What images have not been used for the past 7 days of Web Sessions?”
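That question is a textbook anti-join; here is a sketch with a hypothetical “Web Session” table:

```python
import pandas as pd

images = pd.DataFrame({"image_id": ["IMG-001", "IMG-002", "IMG-003"]})

# Hypothetical "Web Session" Subject rows recording which image each session saw.
web_sessions = pd.DataFrame({
    "image_id":   ["IMG-001", "IMG-001", "IMG-002"],
    "session_ts": pd.to_datetime(["2017-08-25", "2017-08-20", "2017-08-10"]),
})

# "What images have not been used for the past 7 days of Web Sessions?"
now = pd.Timestamp("2017-08-26")
recent = web_sessions[web_sessions["session_ts"] >= now - pd.Timedelta(days=7)]
unused = images[~images["image_id"].isin(recent["image_id"])]
print(unused)  # IMG-002 and IMG-003
```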

This example provides a small glimpse into how a modular approach to data management opens up new analytical opportunities that would normally not survive a cost/benefit analysis under current approaches.

Sunday, August 06, 2017

A Modular Approach to Solving The Data Variety Problem

I have been working on a paper in my spare time on the weekends for a number of months now.  My goal with this paper is to change the thinking around data and ultimately bridge the chasm between how IT and the Business think about data management and in particular Data Warehousing and Business Analytics.

I am publishing the full paper here as a PDF and will publish portions of it piecemeal over the coming days and weeks, beginning with the Executive Summary.

I encourage readers to share this paper and discuss the ideas contained within.  I also encourage readers to send their feedback.  Since I am a human being and am as sensitive to criticism of my work as the next person, I only ask that you couch any negative criticism in a way that is civil.
Based on feedback, I may create new versions of this paper which you can easily distinguish by the paper's AS-OF date.

Before I sign off,  I would like to thank Jane Roberts for her time in reviewing this paper and for her contributions. Thank you Jane!

Here is the paper:
A Modular Approach to Solving The Data Variety Problem AS-OF 2017-08-06

Executive Summary
Big Data has fully captured the popular imagination. Companies like Google, Facebook, Apple, Amazon, and Microsoft process petabytes of data daily. Limits that once seemed impossible are now the new normal. In spite of this, analysts and managers still struggle to answer unexpected questions at executive speed.

The reason is that while much attention has been given to these remarkable data volumes, a different but related problem has come into sharp focus: The Data Variety Problem. Namely, organizations continue to struggle to manage and query the ever-increasing variety of data originating from sources including, but not restricted to: IT-controlled systems (e.g. ERPs, PoS systems, subscriber billing systems, etc.); 3rd-party-managed systems (e.g. cloud CRMs, cloud marketing DMPs); and Business-controlled departmental tracking spreadsheets, grouping lookup tables, and adjustment tables.

The approach to locating, obtaining, and integrating these sources of data is highly manual. Case in point: in the Alteryx-commissioned study “Advanced Spreadsheet Users Survey”, published in December 2016, IDC discovered that “$60 billion [is] wasted in the U.S. every year by advanced spreadsheet users.” Yet this report provides only a small glimpse into the problem and misses the bigger opportunity: organizations urgently need a one-size-fits-all ‘happy path’ for consuming and producing an ever-accelerating variety of structured data.

In this paper, drawing on algebra, computer science, systems thinking, history, psychology, and 20 years of experience as a data practitioner, Neil Hepburn posits that the best long-term approach to addressing this problem is to embrace a modular data warehousing system. Neil goes on to describe how such a modular data warehousing system could be designed and built using readily available tools and technology, and what challenges must be overcome to realize this vision.
