Sunday, May 30, 2021

Analytic Efficiency Part 3: Why Qlik's Associative Index is the Biggest Little Analytics Invention Over the Past 30 Years

The headline of this post is a quote borrowed from the Canadian author Witold Rybczynski who when asked by the New York Times what the greatest invention over the past millennium was, responded that it was the screwdriver (I would have chosen computer - by I'm not very good with home renovations), and went on to say that the Robertson screwdriver in particular was the "biggest little invention" over the past 100 years.



If you live outside Canada, or do not often handle with screws you may never have heard of the Robertson screwdriver or the story behind it. I will get into the details of that story near the bottom of this post. But for now all you need to know is that the Robertson screwdriver is widely used in Canada and only just recently is picking up traction in the US, even after being on the market for over 100 years. This is very odd since the Robertson screwdriver does the job of screwing much better than other screwdriver out there.

Qlik I would say does a better job at analytics than any other tool out there, but like the Robertson has also failed to take over the world (or at least its underlying innovation), which is also very odd.

I believe the reason for this can be summed up as a lack of connoisseurship. Most people cannot concisely explain what analytics is or how it differs from related concepts like statistics and 'data science'.

Analytics is in essence a game of '20 questions', an endless stream of why why why, which can be boiled down to yes/no questions. While most analytic tools allow the analyst to interactively see the 'yes' answer to any given question, only Qlik's Associative Index allows the analyst to get to the 'no' answer. This allows the analyst to keep asking questions without having to pivot to Data Wrangling and Data Prep work, and may thus sustain a "flow" state of mind.

It is for this reason I see Qlik's Associative Index like the Robertson screwdriver: They were both purpose built for their task, more so than the competitor's offering. Why is it then that both of these tools have not done better?

At this point you might be a bit confused. I hope by the end of this article this will make a bit more sense to you, and in the process I also hope to share with you how I have come to understand the meaning and essence of what "analytics" actually is.

This is the third and final post on this topic. In this post I wish to discuss three sub-topics:

  1. What history can teach us about the nature of tools
  2. The psychology of analytics – as I have seen it and experienced it
  3. How #1 and #2 can be understood through lens of Power BI vs. Qlik


What History Can Teach Us About the Nature of Tools

I have published this as a standalone article here.


Later in this post I will elaborate on my analogy to screwdrivers, but if you read this other article I published a few weeks back I hope to also convince you of how Qlik is more like “Wordstar” and the “mongolian horse” whereas Power BI is more like the “MS Word” and the “European horse”, respectively.


The Psychology of Analytics – In My Travels

Harkening back to the first article in this series, I explained how I see Analytics as a game of twenty questions -or even a debate -  that continually feeds back on itself: You ask a question which provides an answer that leads to another question and another answer and so on until you see no point in asking more questions. Is there a pattern to these questions that can help us choose a tool that fits these patterns?

The overarching pattern in analytics is that when asking and answering questions, we are constantly making comparisons to find what we’re looking for. The syllogism itself – which underpins analytical thinking - is the process of comparing the Major Premise (e.g. ‘All persons are mortal.’) to the Minor Premise (e.g. ‘Socrates is a person’) and comparing the two premises based on their middle term (e.g. ‘Person’) to see if they connect to trigger the conclusion.  All questions can be boiled down to ‘Yes’ or ‘No’ (this is known as the “law of excluded middle” in the so-called “laws of thought”).  All computer code basically works like this.

However, unlike computer code – which tends to follow the logical trajectory of “true” statements, while usually ignoring false or negative conditions  – analytics is equally concerned with the negative outcomes as positive ones.

The game of “20 Questions” is played out by making a series of comparisons where both the positive and negative answers guide us. Business Analytics is no different. 

Is there a pattern to the types of comparisons people make?

I have observed and participated in hundreds of analytical exercises – everything from quick ad hoc reporting to working alongside statisticians, to building predictive analytics systems, to building enterprise data warehouses – and having observed the types of questions business decision makers ask and answer (how they play the game of Analytics), I believe there are two main patterns which stand out which and are important to understand.

These two main analytical patterns roughly align with management level:

First, the “Drill-down and Compare” or simply “Drill” pattern: Managers and above are generally interested in making comparisons between aggregates.  Typically, a manager will start with the “big rocks” (e.g. Total Revenue) and then begin drilling down by whatever dimension (e.g. time, location, product category, etc.). When drilling down, the user is making comparisons between groupings.  These comparisons are typically visualized in bar charts, line charts for trending, or pie or treemap charts for share-of-total.  Most analytics at this level is concerned with either finding outliers (e.g. a spike in sales for a given date) and then drilling down to the details to learn more, but this may require a hand-off to other analysts which we will discuss in a moment. 

Managers are also interested in patterns, especially trends over time. Most of the time the manager analyst just wants to know that everything is running as usual – like a heartbeat monitor. If they look at the same report on a regular cadence (e.g. monthly) and they will grow accustomed to how charts should appear so if there is shift to the pattern, they will notice this and begin asking follow-up questions. These follow-up questions often get handed over to other analysts.

Managers and directors are for this reason fond of interactive dashboards and self-serve reports, and this is where most money is spent with respect to Business Analytics and Data Warehousing. Tools like Power BI, Qlik, and Tableau, not to mention underlying data warehousing and data prep tools from vendors like Oracle, Amazon, Google, Informatica, Databricks, Cloudera, TeraData, Microsoft and so forth. This BI+DW (Business Intelligence + Data Warehouse) spending accounts for a significant percentage of overall IT spend.

While managers and directors appreciate interactive dashboards, executives and board members on the other hand often prefer more digested insights. People working at this level expect all questions to already be answered. If there is an outlier or pattern anomaly the report will usually provide an explanation upfront.  Tools used to provide these types of reports are often just presentation tools like MS PowerPoint or graphic design tools like Adobe Photoshop.  The costs associated with these reports are mostly on the human resources who develop these custom reports and infographics. Analytics at this level ultimately plays out in an executive board room with a presentation followed by a dialectic discussion and possibly debate.

Getting back to managers and directors, this is where most money is spent on analytical tools. But as mentioned above managers will routinely find outliers and pattern anomalies through a “Drill-down” analysis. But uncovering the deeper reason for the anomaly or change in pattern requires shifting gears. Managers will often either hand-off the follow-up questions to other analysts or they themselves will switch gears and begin using… Excel.

Recalling our game of ‘twenty questions’, most tools do well when the answer to a question is ‘yes’, since we can keep drilling down into that selection. But when the answer is ‘no’ most of these tools run into a problem because they are not designed to examine the excluded values or ‘negative space’ as it were. 
What is negative space?

The poet John Keats coined a similar term “Negative Capability” to describe negative space as a source of mystique and wonder that poets and artists could explore and contemplate, often romantically. 
In the world of science, filling negative space drives all endeavours. It is the reason you keep hearing about “dark matter” and “dark energy” and why we are reminded that Einstein’s theory of relativity must ultimately be false (but like Newton's laws, is still useful).

Fortunately, however, the material nature of most organizations is recorded in structured datasets. These datasets are are finite and therefore it is physically possible to probe the ‘negative space’ when asking questions. However, most analytical tools – and data warehouses - are not purpose built for designed for this task. This means that humans are doing most of this work.

This brings me to the second pattern, the “Contrast and Compare” or simply “Contrast” pattern: The Contrast pattern normally occurs following the Drill pattern and is concerned with exploring and contrasting negative space. The Contrast pattern tends to follow one or both processes:
  1. Isolate relevant sub-population (e.g. Toronto and Ottawa) within parent grouping (e.g. Ontario)
  2. Contrast sub-population to their complement: e.g. Average Home Price in Ottawa + Toronto vs. rest of Ontario
    1. Allows us to see the difference between housing prices in urban metropolitan regions versus rest of Ontario
OR:
  1. Take two populations (e.g. Rentals during winter versus rentals during summer)
  2. Determine what is in the first population that is not in the second and vice versa
    1. Allows us to appropriately target summer-only customers or winter-only customers or all-season customers with specific and relevant ad campaigns

When it is easy to isolate sub-populations by simply filtering on a small number of fields then we need not go beyond the “Drill” pattern and many analysts may use dashboards in this way. Although they will only be able to see differences in the aggregations (e.g. total number of sales by Month) but this can still be revealing.  For example, a telecom analyst who suspects that outdoor weather precipitation (e.g. rain) might be causing a spike in call volumes might filter on all the known dates over past 12 months it rained or snowed. The analyst would then filter on the compliment – all days where it did NOT rain or snow - and then compare these sub-populations with respect to standard measures like Call Volume and Average Handle Time to understand the differences the change in weather pattern might generate. The answers to these questions could lead to cost saving improvements in staffing and revenue generating improvements in sales.

Another common variant of the “Contrast” pattern is when the analyst is handed two sets of data and simply told to find the deltas and then report on those deltas based on a of comparison key.  For example, an analyst might be given two lists: A list of customers from ten years ago and a list of customers from present-day today. The analyst might be asked to produce a report comparing the loyal customers who appear on both lists to newer customers who are only on the present-day list to the previous customers who have since left. The report could lead to a reduction of costly customer “churn”.

There is another word – “Parallax” – that also describes this pattern. In optics, the parallax effect is when an object that is nearer to us tends to move faster than an object that is far away. By exploiting this property, it is possible to measure the distance between distant objects without traveling those distances. The philosopher/comedian Slavoj Žižek uses the word parallax to explain how “truth” can only be teased out through the types of comparisons that reveal differences. More practically, when we are making assessments about the world, like reading the news or researching a new car or other big-ticket purchase, we are attempting to find contradictions between perspectives and distill from these contradictions what is going on. For example, we might read some good reviews of a new TV and then one or two bad reviews and by focusing on the differences we might find out that the good reviewer overlooked something that the bad reviewer complained about, and we might go on see that the bad review was simply complaining about a bad installation experience they had. Or maybe the good reviewer only cared about price and was not attuned to picture quality as the bad reviewer.  It is from these contradictions, this parallax effect, this contrast, that we can uncover useful insights.

While this yes/no approach might seem binary or simplistic, and to be sure Analytics can be wielded cynically (as it all too often is), but with enough questions you can begin to see a spectrum and other patterns emerging. With enough questions deep insights and epiphanies begin to emerge. 

The visual that comes to mind for the Contrast pattern is the Venn diagram, but the Contrast process tends to be table-centric because the analyst is mostly relying on either Excel or a possibly a Data Prep tool like SQL, Python, Power Query, or Alteryx to perform the detailed comparisons between sets.
What we see then is analysts spending most of their time in Data Prep tools (especially Excel) pulling sub-populations through either delta comparisons (often using the ‘VLOOKUP’ function in Excel) or by filtering on subsets of data and then using a delta comparison (again using VLOOKUP) to extract the compliment of the sub-population they were filtering on (e.g. the compliment of all dates there were precipitation).

Most Data Prep in the business world is done using Excel using the VLOOKUP function, but much of it is done in SQL as well using the NOT EXISTS or NOT IN clause. Data Prep can also be done in more business-friendly date prep tools like Power Query (built into Power BI) which has an “anti-join” operation as part of its “merge” step.

It should be pointed out though that Data Prep is essentially programming and requires a shift in thinking. When doing Data Prep the analyst/developer can certainly enter a flow state, but they are no longer in the Analytical flow state playing 20 questions. Instead their mind is in the world of syntax and code structure – even using the VLOOKUP may require a reorganization of columns which can be time consuming and distracting.

An analogy I use to explain the relationship between Data Prep (which is more like programming) and Analytics (which is more like a game of 20 questions) is based on this quote from Abraham Lincoln: “If you give me six hours to chop down a tree I will spend four hours sharpening my axe.”  You can think of Data Prep as axe sharpening and analytics as wood chopping.  The ratio of 2:1 between Data Prep and Analytics would not be bad, but what I see is something closer to 5:1 or even 10:1. Ideally, we want to be spending more of our time chopping and less of it sharpening if possible. The reasons are that with pure analysis (chopping) you can more easily achieve a flow state while also collaborating with others. Data Prep (sharpening) on the other hand is a more solitary activity and while you can attain a flow state your focus is directed at answering technical rather than business questions.

This is the reason Qlik stands out, because it is possible to contrast data sets interactively and this is because Qlik’s indexing technology – the Associative Engine – was built from the ground up to support the selection of excluded values. With every other Business Intelligence tool I have used, negative values are an afterthought and not easily accessed.

Using our example from above, if we want to compare days with precipitation to days where there was no precipitation in Qlik we can interactively select the complement of dates, we can even restrict our complement to be within a given subset (e.g. for a given year and region). From here we can contrast customer patterns for days when there was precipitation to days where there was not.

Furthermore, QlikView (not Qlik Sense unfortunately) allows users to save the subset of data that has been selected. This means that after selecting days with precipitation (or we could choose no precipitation), we can then reduce the dataset to what is associated to those dates. From there we could choose another dimension – say “temperature” - to partition the data by in order and inspect that contrast. This would allow us to see the contrast of above freezing or below freezing to see if a pattern emerges.  We could then go back up a level to the original dataset and segregate the original dataset using temperature to see if the same patterns hold as when contrasting within the precipitation subset.

By contrasting these patterns with respect to precipitation and temperature, as an analyst we can begin to see the relationship between these dimensions and other dimensions. Depending on what differences emerge between the subsets, we can choose to partition, contrast, and then take a subsequent subset.  We can keep playing this game of twenty questions until we have answered our questions or at least satisfied our curiousity. 

Alternatively, if we want to compare multiple Customer lists to see what customers are only in one and not in another, we can do so interactively, and again we can easily isolate subsets of customers and compare them looking for patterns.

It is of course possible to accomplish any of these tasks using other tools like Excel or Power BI (via Power Query) or SQL or Python, but one cannot do so interactively.

I should point out that it is also necessary while performing a “Contrast and compare” activity that on some occasions going back and “sharpening the axe” through Data Prep is required. But in many circumstances this is not necessary, and if the data has been well prepped you can ask and answer many questions interactively by leveraging the “excluded values” supported by Qlik’ Associative Index.
In my own experience I have leveraged this power to significant effect and there have been occasions where I was able to sustain a “flow state” for up to an hour of constant Q&A through a series of dimension selections, because I was able to easily contrast selections to one another and see what was exposed and what was hidden by doing so.  On these occasions I would often find “needles in haystacks” and the feeling was not unlike debugging a computer program, but I felt like was debugging the business. And debug I did. Some of these insights did lead to noticeable changes in how the business was run.

Taking the two patterns together – “Drill” pattern and “Contrast” pattern – there is an analogy that I find quite useful for thinking of these as part of a cycle.
Namely, there is a concept known as the OODA loop (credit goes to Alistair Croll for introducing me to this) which was developed by a US Air Force Colonel John Boyd. The OODA loop was developed by Boyd for the purpose of fighter jet combat. Boyd realized that the rules of combat had changed with heavy supersonic jets and that a different mode of thinking was required to use them effectively in combat.

OODA stands for:
  1. Observe
  2. Orient
  3. Decide
  4. Act
In the context of fighter jet combat the pilot might “observe” an opponent and then instead of engaging them in the moment would begin “orienting” their plane and mindset such that the pilot has the best available information to “decide” and “act”. The decision might be to attack or just as likely to move to another defensive position.  If the pilot moves to a new position, they would repeat the OODA loop cycle until either defeating the opponent or retreating.

The most important step in the OODA loop – and what distinguishes it from other business cycles like the “plan-do-check-act” lifecycle is the emphasis on orientation. It is the “orientation” step that brings to bear the subjective experience and talents of the pilot to maximize their strength.

Relating this back to the Analytics patterns I described above I see the “Drill” pattern as like the “Observe” step and the “Contrast” pattern similar to the “Orient” step.

It stands to reason that if we can make the Orient step more efficient – which is the bottleneck of the OODA loop – then we can make analytics more efficient.

There was another tidbit of wisdom that Alistair Croll also mentioned when discussing the OODA loop which relates back to my earlier points regarding horses and word processors: Once you cross a threshold of efficiency the game itself can begin to change.

The example Croll used was the steam engine that James Watt invented.  Watt’s new steam engine, design developed in the late 1700s, is generally regarded as one of the key factors that catalyzed the industrial revolution during the 19th century. However, when it was first introduced some people speculated that it would cause a depression for coal miners because only a quarter of the coal was required to produce the same amount of “horse power” (a unit of measure that Watt also invented).  Instead, because the Watt’s steam engine had crossed an efficiency threshold it could be used for all sorts of businesses that otherwise would not be able to afford the cost. As a result coal mining began to boom like never before.

The rest as they say is history.

Difference Between Qlik and Power BI and Excel When Contrasting Data

In this section I will be contrasting the tools themselves to show how they differ when used to contrast data.

In this example we will continue with the Call Centre data, although I have created a new fictional dataset with more attributes to compare. 

Differences in Qlik, Power BI, and Excel Explained

The main difference between Qlik and Power BI when it comes to contrasting data is this: 

Qlik can be more efficient for the user than Power BI (or Excel) for contrasting data because Qlik allows the user to contrast data sets interactively by clicking buttons which can be performed while in a “flow” state of consciousness and focus on the business questions. 

Power BI on the other hand can do some of this under some circumstances but in other circumstances requires the user to build new charts and in some cases it may require a change to the data model itself. Because different types of solutions may be required – some taking longer than others – it’s harder to achieve a “flow” state of consciousness while playing the game of 20 questions.

Excel – through the VLOOKUP function - is more consistent in its approach to reconciliation than Power BI although it requires the Excel user to effectively perform some basic Data Prep. Because of this consistency users can get into a “flow” state and focus on business questions, but unlike with Qlik – which keeps the analyst in a Business Analytics flow state - Excel forces data analyst down to a Data Prep flow state, distracting from our game of 20 questions.

Let’s first explore these differences between tools using a very simple example and then move to a more realistic example in the next section.

For each of these demos we are comparing two lists [of calls] to the difference between the two sets. We are using two pairs of datasets for this:

  1. Related one-to-one
  2. Related many-to-many

Let’s see how the “contrast” exercise appears for each tool, starting with Qlik then Power BI and finally Excel.




In this example with Qlik need only click the “select all” icon (highlighted in yellow) as denoted by the double-checkmark icon.  Doing so instantly reveals excluded values (3 missing from CCA and 3 missing from CCB).

If the lists were not so compact and instead had thousands (or millions) of values, we could click the “select excluded” button to instantly see the excluded values. Furthermore we could simply select these excluded values to explore their distribution with respect to other metrics.

The following screen caps show that there is no difference with the relationship is many-to-many:




As you can see there is no difference in behaviour. The user doesn’t have to think and can follow the same “happy path”.

Although it is not necessary in this example, if we wanted to analyze the subset of calls that are excluded, in QlikView we can perform a “reduction” which effectively removes everything else except for our selections. The screen cap below shows how this option is selected and how the data appears following selection.


Here is how it appears.




Now let’s see how this compares with Excel.

We use the VLOOKUP function which performs a lookup against another table and then retrieves a corresponding row value based on a match with the key column value.

Here is how we have implemented the function as used in the screencaps below:
=VLOOKUP(A4,CCB!$A$2:$B$15,1, FALSE) 
This shows how the one-to-one linked tables appear when contrasted in Excel using the VLOOKUP function:



Here we go again with the many-to-many example:




Speaking for myself, I do not find the function intuitive to use and if I haven’t used it for a while will forget to hit ‘F4’ to make the second argument absolute so that it does not change from row to row when you auto-fill the cells below. But most Excel jockeys have no problem with these quirks and can wield it relatively quickly. Given the initial difficulty with VLOOKUP and how popular it is, I’ve often thought of VLOOKUP as a kind of analytical rite-of-passage.

Regardless of how proficient you are performing a VLOOKUP, it is essentially a Data Prep task and you may need to reorganize the column order in the data itself before you can begin to use the VLOOKUP function in order to place the key field in the correct column order.

With Qlik the user can interactively select excluded records with a single click (or maybe 2 at most) thereby reserving one’s capacitive memory (we can keep track of about 7 things in our head at once – known as Miller’s magic number). This allows the user to focus on allowing business questions to guide selections (including selecting excluded calls) allowing for a “flow” state of consciousness thus enabling optimal human performance.

Contrast this with Excel where the user must change gears and switch into Data Prep mode – even if only for a few seconds – this breaks the steam of consciousness pertaining to the business questions and forces the analyst to context switch to a Data Prep exercise.  If the Excel analyst is practiced and quick and can complete the VLOOKUP in under 10 seconds, then they can maintain their train of thought then they may be able to return to a flow state. But every VLOOKUP and Data Prep task potentially breaks this flow and thus undermines the desired flow state.

While the Qlik and Excel approaches are quite different in that Qlik can perform the task interactively maintaining business question-mode thinking whereas Excel requires changing gears into Data Prep-mode thinking, but what they both have in common that they do not share with Power BI is consistency. Namely, regardless of the data cardinality, whether the data is linked one-to-one or one-to-many or many-to-many, the operations the analyst performs are the same. It is this consistency in Excel that allows an experienced user to almost work subconsciously when performing the Data Prep tasks thus keeping the train of thought on the business questions.

Having seen Qlik and Excel, let’s move to our third and final example: Power BI.
The two screen caps below show two pairs of Power BI Slicers respectively.  The screencap showing the first pair show how they appear without user selections.
In the second screencap, we can see what happens when the user selects ‘(Blank)’: Power BI reveals the missing Call IDs from the other table.

In this example we can see the user need only click ‘(Blank)’ to see the missing values. In Power BI, any time there is a one-to-one or one-to-many relationship defined, the user can always click ‘(blank)’ on the “one” side to see the missing values on the other side. Just like in Qlik the user need only click a single button to reveal the deltas.






The one-to-one example is very similar to Qlik and can be performed with a single click, thus enabling the desired “flow” state of consciousness.

But what if the relationship is one-to-many or many-to-many? The differences between the relationships are shown below.




Due to the nature of Power BI’s Vertipaq indexing engine, columns that are in a table on the “many” side of any relationship will not allow users to simple select ‘(Blank)’ as we saw in the example above. Instead, the user must develop a Table Chart and configure it to show blank values. 

The example shown below shows the delta between the two lists by building a simple Table Chart and then sorting the smallest values to the top to see the deltas (highlighted). You can also see that the Slicers no longer have the option of selecting ‘(Blank)’.




As noted above this chart requires an extra step of configuration that is normally not required: I had to check the “Show Items with No Data” option.




Reflecting on this we can see that for the many-to-many relationship it is a different and more complicated process we need to follow when reconciling across relationships that have a “many” sided cardinality on one or both sides.

It should also be pointed out that it makes no difference if the link direction is Both or Single direction or which direction the link is pointing since the only way to answer must be calculated through a chart that we must build that requires a more time-consuming dense calculation - by explicitly evaluating all possible combinations - as opposed to a quick sparse calculation like when we selected ‘(Blank)’ in the earlier example.  Qlik on the other hand is only making quick sparse calculations as it leverages its Bi-directional Associative Index (BAI) to directly pinpoint deltas.  This performance hit in Power BI adds another level of friction when it comes to reconciliations. But the main difference between the tools is that Qlik provides a consistent, quick, and interactive reconciliation experience whereas Power BI requires changing gears to build tables, and then waiting additional time for those densely calculated tables to render.

There is another approach we can take with Power BI: If we look to Power BI’s built-in Data Prep tool Power Query, here we can see a more consistent “happy path” to reconciliation much like Excel.

Namely, we can use Power Query’s ANTI-JOIN feature which can be accessed through the “Merge” step with the Join Kind set to ‘Left Anti (rows only in first)’ as shown below.


After completing both a ‘left anti’ and ‘right anti’, the screencap below shows the resulting delta datasets that were correctly revealed.



As it turns out this ANTI-JOIN approach is closer to a ‘happy path’ due to its consistency in procedure – since the cardinality and link direction of relationships is irrelevant -  rather than through navigating Slicers and Tables in a pre-built Tabular Model and performing different operations depending on the cardinality of relationships.  Furthermore, this approach to reconciliation can be easily automated and the resulting data can be selected and analyzed by the user. This is like how a Qlik user can “Select Excluded” values and analyze the compliment set.

Thus, if your focus is on contrasting data and you need a surefire tool, it is better to use the Power Query within Power BI as opposed to Power BI proper. 

In summary, the name of the game is not only to meet business requirements (any Turing-complete software can do that), but rather the game of Analytics should be about putting the user in a “20 questions” flow state of consciousness for efficient business problem solving.

Both Excel and Power Query do provide a “happy path” when it comes to contrasting data, but that happy path requires context shifting from business analytical thinking to Data Prep tool thinking. Qlik on the other hand is well designed to perform interactive reconciliations at a record level and with some training users can achieve a flow state of consciousness.

In the next and final section, we will compare Qlik and Power BI using a more realistic example.

Data and App Documents

Here is the data we used. It also includes the Excel VLOOKUP examples.

Here is the QlikView application document.
Here is the Power BI application document.

Differences in Qlik and Power BI in Practice


In this example we will analyze a telecom’s (e.g. Bell, Rogers, AT&T, Sprint, etc.) Call Centre data in order to answer the question: During February 2020 which Customers placed calls to the Call Centre answered by non-authorized Customer Service Representative and did not generate any Work Orders?

We can break this question down into a series of yes/no questions:

Question: Did the customer place a call in February 2020?
Answer: Yes.

Question: Was there an authorized Call Centre Representative on this call?
Answer: No.

Question: Does the call have an associated Work Order?
Answer: No.

With Qlik and Power BI we can ask these questions for our entire population of linked customers, calls, and work orders.

With Qlik we can answer these questions quickly and interactively while maintaining a “flow” state of mind, but with Power BI we must change gears and build and configure dense pivot tables which leads to mental context switching.

Before getting into the heart of this example I should point out that I am using fictional data that I generated using a normal random number generator. This data set lacks the novelty we find in the real world. I must apologize that the “insights” revealed are not on the level we would get from a real-world data set. However, there is also a lesson here: Publicly available data sets tend to be relatively simple in structure (e.g. most open data sets are just denormalized “fat” tables), whereas real business systems produce large normalized models with multiple fact tables.  It is when we have multiple fact tables that both Power BI and Qlik really begin to shine and is also where we can see the most contrast. For privacy and confidentiality reasons most organizations understandably don’t want to reveal these data sets to the general public.  As a side note this is one aspect I love about my job: I get to see these databases across so many industries in all their glory. But I cannot share those databases for obvious reasons.  It’s for this reason Microsoft often uses the “Adventure Works” database for demonstration. But even that is a fictional dataset lacking real world novelty.

But if you can suggest any datasets that have multiple fact tables and that are useful for teaching, please let me know?
In the meantime I’m going to keep looking for or possibly developing on my own a real data set that can be used for teaching analytics. One thing I have found since I started writing this article is an App called “TrueEntropy” that generates random data using your Smart Phone’s camera. I haven’t had a chance to really dig into this yet, but it looks like an improvement over the standard random number generator I used for this example.

Sorry for all the preamble, let’s get into the examples now.
The data model we are going to be exploring is larger (in terms of the numbers of attributes and entities) than the data model we explored in parts one, and two of this three-part blog series.

The original data model had two tables: MyCalls and MyWorkOrders.  This new data model extends this by adding in:

  1. MyCustomers: The customers who placed the calls and whom the work orders were created for
  2. MasterCal: A master calendar for allowing for drill down within a period hierarchy (i.e. Year -> Quarter -> Month -> Week -> Date)
  3. MyCSRs: The Customer Service Representative (CSR) working for the telecom. The CSR answers customer calls and interacts with customers resulting in work orders for the customer


As an aside: You may have noticed There is also a table called “WorkOrderTypeAND” which I originally included to demonstrate a feature that is specific to QlikView; It allows the analyst to make ‘and’ based user selections (as opposed to ‘or’ based user selections) and is not found in Power BI (nor Qlik Sense).  While this feature does allow us to ask more questions (e.g. Who has Television AND NOT Internet), it is not inherent to the point I am making here. For this reason, I have since decided not to leave discussion out of this blog post, but if you’re curious about how it works you can download and experiment with the demo to see its functionality.

Here is how the model appears in QlikView:



Here is how the same model (minus the ‘AND’ table) appears in Power BI:




Now let’s get into the business questions.

To recap, our main question is: During February 2020 which Customers placed calls to the Call Centre without an authorized/known Customer Service Representative and did not generate any Work Orders?

For context, in our example there was a botched Chatbot roll-out issue in February 2020 that resulted in customers (who learned of the flaw through social media) creating non-authorized Work Orders using a special promo code. We want to know who penetrated the vulnerability but did NOT take advantage of the flaw. When we find these Customers, we would like to thank them and speak with them to help uncover other security issues communicated through social media.

To get there we need to find Calls with no CSR and from there find Customers with no Work Orders in February 2020.

We are going to break this down into two exploratory (practice) questions to show how those can be answered and then combine what we have learned to answer the third and final question During February 2020 which Customers placed calls to the Call Centre without an authorized/known Customer Service Representative and did not generate any Work Orders?

By breaking down our big question into smaller exploratory questions we can begin to get a sense of what is going in the data before diving into our third an final question.

Our two exploratory questions are:
  1. What is the distribution of calls with no authorized CSR?
  2. What Customer did not have any Work Orders February (of any year)?


Question #1: What is the distribution of calls with no authorized CSR?
In Qlik, to answer this question we:
  1. Select [All] in “CSR Name” 
  2. Select ‘Excluded’ in “Call ID”



In Power BI, we simply:
  1. Select ‘(Blank)’ in “CSR Name”



As can be seen in the above example, both Qlik and Power BI allow for interactive selection here with Power BI being better than Qlik in this instance given that you only need to click one button in PBI versus two clicks in Qlik.

Now let’s move to our next question.

Question #2: Who are the customers with no Work Orders for February (of any year)?

In Qlik we can answer this question by making three selections in real time:
  1. Select ‘February’ in “Month”
  2. Select [All] in “Work Order ID”
  3. Selected [Excluded] “Customer Names”
The answer is revealed as ‘Noah Q. Moore’.



In Power BI however, because of the one-to-many relationship, we cannot rely on Slicers alone. Instead we must take extra time to construct a Table (remembering to set the “Show items with no data” setting), wait for the table to render, then sort the table from smallest to largest (or we could have filtered on blank “Num Workers”). The resulting set cannot be selected as a Slicer selection, so the user is tightly limited in the follow-up questions they may ask.



Having answered both of our exploratory questions we are now ready to begin answering our third and final question.

Question #3: During February 2020 which Customers placed calls to the Call Centre without an authorized/known Customer Service Representative and did not generate any Work Orders?

Or stated differently: Who are my most honest customers; The customers who when calling without a legitimate CSR in February 2020 never created a Work Order?

To answer this question in Qlik we can do so through 2 simple iterations:

Iteration 1:
  1. Find all calls without a CSR. Two clicks:
    1. Select [All] CSRs
    2. Select [Excluded] Calls
  2. Filter on year and month:
    1. Select on ‘Feb’ in “Month” and 
    2. Select ‘2020’ in “Year”
  3. Isolate this filtered subset by selecting from main menu File->Reduce Data->Keep Possible Values


Iteration 2 (using subset derived for ‘Possible Values’ in Iteration 1):

  1. Find all Customers with no Work Order
    1. Select [All] in “Work Order ID”
    2. Select [Excluded] in “Customer Name”


FINISHED. We have answered our question. The five customers are:
  1. Brent Z. Skinner
  2. Charlotte E. Snow
  3. Hilel A. Maddox
  4. Malachi Perez
  5. Noah Q. Moore

If we want to see these Customers in their original context (as opposed to in the isolated Dataset), we can refresh the data which reverses the data reduction while keeping the selections.




How does Power BI compare?

Power BI can certainly answer this question but requires more Chart/Table building. In this scenario, it is not enough to simply list all the Customers and sort by Num Work Orders, because we need to be sure we are only seeing Customers with Calls. Even if we “Select All” Calls our Table will continue to show Customers who did not receive any calls in Feb 2020 because we have not explicitly filtered our Customer list by the number of Calls.

This means we must add a Visual Filter on the Table itself where “Num Calls” ‘is greater than 0’.
Given that we are already filtering on “Num Calls” we can also filter on “Num Work Orders” ‘is empty’ to create a cleaner looking table.


Even with our clean table, it is still not easily possible to convert this list into a Slicer selection and then explore and ask follow-up questions about these five (5) Customers further. Whereas in QlikView we have the Customers selected and can easily continue our game of ’20 Questions’.

Normally in this situation the Power BI developer would likely go up to the Data Prep level and use either Power Query or perhaps Excel to answer this question.

Going back to our sharpening and chopping analogy taken from Abraham Lincoln this means that in Qlik you are chopping more than sharpening and in Power BI you are sharpening more than chopping.
The ideal tool in my view behaves more like a musical instrument that you play like a piano.

Data and App Documents

Here are the data files:


Here are the application documents:

Conclusions


Given that Analytics involves exploring both positive and negative space and Qlik’s Associative Index explicitly supports “Excluded Values”, and given that most other BI tools - like Power BI – do not inherently support Excluded Values, then why has Power BI and other tools like Tableau for that matter managed to outperform Qlik from a sales perspective?

More to the point, why is ability to query “Excluded Values” – a core element of Analytics - not even even brought up as a software requirement?
The simplest answer to this question is that the buyers of these tools lack connoisseurship.
I can break this answer down to three specific problems:

  1. IT tends to favour homogenous technology platforms and deal with as few vendors as possible. “One throat to choke” as they say.
  2. The “drill” pattern tends to be visual in nature and is what most senior managers – and laypersons - can quickly grasp. These are the same people who control spending.
    1. The “contrast” pattern on the other hand is more obscure in nature and does not lend itself to flashy sales demos
  3. The “contrast” pattern as I described it above is generally regarded as a Data Prep (as opposed to Business Intelligence/Data Viz) activity
    1. Qlik’s answer to the “contrast” defies traditional product categories
      1. There are no recognizable competitors to Qlik that also have the ability to query “Excluded Values”

The first problem of IT favouring homogeneous technology over “best of breed” vendors is not specific to Qlik and is  fairly well understood and discussed problem. I have nothing to add that discussion here that has not already been said.

The second problem of how Business Intelligence/Data Viz tools are often sold based on slick demo is also nothing new. Although I will add some colour to this: On more than one occasion I have seen a senior executive extol the virtues of a Business Intelligence tool for the simple reason that they used a report built with this tool. Even if that report is nothing more than a rendered image file, the executive will continue to lavish praise on the tool which quickly creates buying consensus.

Oh, but you may say that there are research firms and consulting companies that executives can look to for guidance. Unfortunately, these companies tend to all be deeply compromised in one way or another. Without naming names, think of the most prestigious business & technology research company and advisory consulting company you can think of, and I can show you how they don’t hold a candle to Consumers Reports.  The main problem is these companies have deep relationships with the vendors they are supposed to referee.  We can see elements of this in journalism whereby many esteemed newspapers and media outlets will be more generous to certain institutions or individuals in order to maintain “access” to those institutions and individuals.  Consumers Reports on the other hand is entirely supported by a large consumer base. Even Consumers Reports is not perfect, but it is significantly more trustworthy than any of the research and consulting firms in the B2B world when it comes to evaluating technology.

The third problem can best be understood through this famous quote from Henry Ford: “If I had asked people what they wanted; they would have said faster horses.”
Putting our analytical hat on, if we contrast Ford and Qlik in this light we can see that Ford has the clear upper hand:

  1. Their cars will work on any roadway (unlike the closed railway lines)
  2. The purchaser of the product is also the user and a slick demo is not enough – the vehicle must pass a test drive
  3. While Ford was an early manufacturer of automobiles, he was not the first or only manufacturer and the presence of competitors created consumer demand

As a result, was Ford less of a “think different” underdog and more of a monopolistic bully.  There is another lesson to be learned from Ford, a lesson that many Canadians are aware of:

During the early days of automobile production Ford clashed with an inventor based here in Canada: P.L. Robertson the inventor of the Robertson screwdriver, also known as the ‘square’ screwdriver. You can read the story here.

To quickly summarize, in 1908 the Canadian P.L. Robertson invented a new type of screw head and screwdriver that was square shaped which allowed for maximal torque while minimizing stripping. This allowed the Robertson screws to be more securely fastened. Henry Ford realized the importance of this invention and demanded Robertson relinquish the patent rights to Ford for a single lump sum. Robertson refused and so Henry Ford went to a competitor, Henry Phillips, who had invented a similar but inferior screwdriver, the Phillips screwdriver. Phillips did maintain some patent rights though and thus the still inferior flat-head screwdriver continued and continues to proliferate.

If we all switched to Robertson screws, we would need fewer screwdrivers and screws, and construction quality would improve.

And yet, if you asked an American they would at first be oblivious and then confused as to why the United States would be committed to a demonstrably inferior product. The simple answer is the same reason why we use Microsoft: Vendor lock-in through network effects. But the fact that Robertson remains in use also tells us that it is possible to survive and even thrive based on the merit of the tool itself.

I come back to the word connoisseurship, and for reasons of environment and circumstance connoisseurship differs among people and organization. 
How does change come about then and why should anyone care?

For me the answer is simply emotion through experience: The experience of awe and wonderment and feelings of mastery and contentment from achieving flow states.

If you have ever discovered a powerful tool and have wielded it to great affect, then you will know these feelings I am talking about and you will yearn for these feelings.

But if you have never had such a breakthrough and not experienced awe and a sense of mastery then everything I am saying will sound like noise. But I think if you have learned to ride a bicycle or swim or ski down a hill, then maybe you can relate?

I have thus decided that my North Star going forward is to figure out ways I – Neil Hepburn - can bring about a feeling of awe and wonderment for Analytics. I may or may not use Qlik to help me here and have not decided. But I have no doubt that Qlik has helped me understand the essence and spirit of Analytics and I am going to challenge myself to see if I can bring about that experience of awe to others.

If this is something you are interested in hearing more about, you are welcome to drop me a line.

Saturday, May 15, 2021

What History Can Teach Us About the Nature of Tools

A recent study  published out of The University of East Anglia has shown – what many have long suspected - that the human brain treats well designed tools (like a spoon or a hammer) as an extension of the human body. More importantly, this study shows us that well-designed tools are interpreted by the brain differently than poorly designed tools which the brain simply treats as an external object. For more context, this Twitter thread does a good job of summarizing the study’s findings.

To understand the implications of this insight we can look to history.

History teaches us that tools and systems with consistency, quickness, and interactivity can be more easily mastered, and this tool mastery leads to psychological flow states of consciousness that can bring about rapid change.

This insight is relevant to our current moment because what I am seeing in the world of business data & analytics is a never-ending tension between tools that users love and can produce great outcomes, and inferior tools which users are forced to use out because of vendor lock-in homogeneous architecture constraints.  I will return to this theme in an upcoming article (Part 3 of my Analytical Efficiency series), but in this article will explore this topic head-on.

How does history teach us about tools and mastery? 

Let me tell you some stories …

During my undergrad years at university in the early-mid 1990s – majoring in Computer Science – the computer workstations we were provisioned all ran on a UNIX based OS (Solaris). Whether you used one of the newer “X-Window” terminals or one of the older text-based terminals you would have to learn one of the command lines and associated text editor.

I recall being quite frustrated by this at first. I had my own PC, an 80386 SX. While I could use this PC for preparing assignments in WordPerfect - a popular “WYSIWYG” (‘What you see is what you get’) word processor at the time - I could not use my home PC for preparing computer science work (i.e. writing code) because I needed to have access to shared tools and datasets. Since all the Comp Sci workstations ran on Solaris I was suggested by a UNIX-savvy peer to use a text editor called “VI” (an abbreviation of “visual”). 

Using VI at first felt like psychological torture. There was nothing intuitive about it mainly because VI requires the user to toggle between two modes: “Insert mode” where you can freely enter text as you would do normally; and “command mode” which is a bit trickier to explain but as its name implies allows you to navigate and manipulate blocks of text using keyboard commands with ability to program new commands. Many of the commands in “command mode” are activated by hitting a single character on your keyboard and the most commonly used commands (for scrolling through the text) were characters on the home keys themselves. For example: ‘j’ moves the cursor up a line and ‘k’ moves you down, while ‘h’ moves the cursor left and ‘l’ moves right.

Over time I became more proficient with “VI” and found that I could work much efficiently than the old Borland IDE text editor I had been using before. But what about my other subjects? What about WordPerfect?

By 2nd or 3rd year a friend showed me how he prepared all his essays using “VI” but compiled them into PostScript files which could then be printed out and handed in.  To do this I had to use another tool called “LaTeX”. This tool was a compiler for a markup language called “TeX” which is like HTML but is more of a typesetting system with precision around page layout. In other words, a possible replacement for WordPerfect. But this also required learning LaTeX and TeX which I never mastered but did learn enough of what I needed like font changes, indented paragraphs, tables, number lists, bullet lists, and so forth. On top of this I would also borrow “code” from others to add more impressive flourishes. I also followed my friend’s advice and created a little “makefile” which I used to compile my LaTeX files into PostScript files that I could view before printing which adequately compensated for the lack of WYSIWYG functionality and provided quick feedback.

By the end of my final year in university I was able to write and edit term papers using “VI” faster than I could in WordPerfect while at the same time the quality of the printed page looked like it had come from a professional publisher.

Today, I still use “VI” occasionally, but not for word processing – formats like “LaTeX” are too obscure for the business world and Microsoft Office (or compatible copycats) are the order of the day now. I don’t have a big problem with this, but there is a part of me that yearns for that feeling of mastery I once felt when editing text. It’s probably the reason I do use “VI” when I can – it’s not just more efficient, it feels good.

George R. R. Martin, author of the Game of Thrones series apparently also feels good when using older and simpler tech like “VI” except in his case he uses another tool called WordStar. WordStar is an older word processor that only runs on MS-DOS and has not been updated since the 1990s. Like “VI” it is oriented around the keyboard and home keys in particular making it easy to navigate and manipulate a document without taking your hands off the keyboard. I heard from this podcast episode that Robert Sawyer – a popular science fiction writer- also uses WordStar for the same reason. Namely, he prefers a tool that minimizes friction between the thoughts in his head and the words that get put down.  By using a streamlined tool like WordStar, Sawyer explains that he can more easily keep his train of thought and remain “in the zone” while writing.

What Sawyer is referring to is known as the “flow state” and is now generally regarded to be the ideal state of human performance when we are both happiest and most masterful in our work. If you have ever biked, skateboarded, or skied down a big hill you will know this feeling.

Why then is it that tools like MS Word (which Sawyer eschews in favour of WordStar) do not work like WordStar? Robert Sawyer’s specific answer to this question is that MS Word did not evolve around writers and creatives but rather the main users of the tool are secretaries and assistants who appreciate the extra features that allow them to perform advanced typesetting and other sophisticated tasks with relatively little training.

In other words, the professional writers depend on tool consistency – even if there are fewer features – since mastery is of writing itself is paramount.  Executive assistants and other staff involved in document preparation are less concerned with the content of writing itself and rather the formatting, presentation, and distribution of documents – something modern word processors like MS Word are replete with.

This tension between tools like VI and WordStar suited for power users versus tools like WordPerfect and MS Word for more casual users.

As a side-note many creative writers often use another tool called “Scrivener” which is more purpose built for story writing (e.g. there are features to keep track of plots and characters) than say MS Word, but I wouldn’t be surprised if folks George R. R. Martin stick to Wordstar for the simple reason that they can wield it with more efficiency.

A user-friendly tool can spread quickly and aids efficiency. But occasionally in history we see other tools appear that are not so user-friendly but when mastered can change the world.

An example that comes mind is the horse. A horse being an animal is not normally thought of as technology. Starting in the bronze age around 5,000 years ago horses domesticated on the Eurasian Steppe led to massive waves of migration and change – it is why so many languages fall under the umbrella of “Indo-European”. To get a clearer picture of what made the horse such a formidable tool and why it remained such a powerful tool on the Eurasian Steppe for so long I want to point out a few things about Genghis Kahn’s Mongol Empire during the 13th century.

The Mongolian horse differed from the stockier European horse – like those horses The Normans favoured. Similar to the Mongol’s, the Norman’s were also known for striking fear through their armoured knights and horses who were not only large and powerful but had been bred for combat and could not be easily spooked. 

The Mongolian horse on the other hand was smaller and not as powerful as the Norman’s European horse. But the Mongolian horse did have some advantages: The European horse required grain feed in order to thrive which made it more expensive to feed. This meant that the European horse also needed to be on or near a farm where the grain could be supplied. During times of war the European horse was highly effective in the battlefield but was expensive and challenging to maintain because keeping supply lines safe was crucial for success due to the horse’s dependence on grain from farms.  On the other hand, the Mongolian horse could feed off the grass in the steppe or any meadow. Even during winter, the Mongolian horse can punch through a top layer of frost and ice to get at the grass (unlike cows which could not penetrate the frost and would starve).  A Mongolian soldier did not need to worry about supply lines for the horses and would have additional horses (each soldier usually kept four horses) that could be eaten if necessary. This provided more agility during times of war.

Because of this self-contained sustainability of the Mongolian horse in conjunction with the grass food source the steppe provides, the main advantage of the Mongolian horse is that it allowed for a nomadic/itinerant culture to sustain itself through a very horse-centric lifestyle. If you are born into a nomadic horse culture then it is said you are “born on the saddle”. In other words, through the consistency of being able to take horses anywhere and everywhere, the mounted archer quickly develops mastery and can take their mastery to higher levels than a European soldier whose experience with horses would be more sporadic. This is the true power of all great steppe cultures including the Xiongnu and Scythians that came before the Mongols (even perhaps as far back as the late 4th millennia BCE during the bronze age when the Yamnaya culture spread out from the Pontic-Caspian Steppe and seeded all Indo-European cultures). Growing up as a steppe nomad like a Mongol you would begin as a mounted archer as young as 5 or 6 years old, or maybe younger. A child would then start on a sheep with a small bow and hunt small animals like rabbits and squirrels. Once they get into their early teens they move to a slightly larger bow on a small horse and begin hunting larger animals like foxes. Once adulthood is achieved the grown adult moves on to a full-grown Mongolian horse with the full composite bow and begins hunting large game like deer.  The Mongolians would also play these “encircling” games on horseback whereby a group of mounted archers cooperates to form huge circles – as large as 10 kilometres in diameter and then begin to spiral and close the circle by herding the animals towards its centre eventually trapping all the animals in the middle. As is now well known these mounted archers could punch far above their weight and inflicted great terror during their raids which eventually allowed them to form empires like the Mongolian and Kahn Empires which at the time was the largest empire ever formed. 

Women also played a key role in these steppe nomad cultures since the horse was very much a leveler of human strength, like the gun is today (which is also why gunpowder eventually displaced horses as the primary war technology). Those “Amazonian women” you may have heard about - possibly through the classical that the Greeks who wrote about them or maybe you saw the Wonder Woman movie - were likely Scythian steppe warrior women who were mounted archer warriors that were part of a much older tradition that goes back to the bronze age. In the Xiongnu culture (predecessors to the Mongols), women routinely held the highest status often more so than men which we know from the items they were buried with (they prized large ornate belt buckles), starkly in contrast to neighbouring civilizations at the time. Technology with consistency can be a great equalizer too.


[An acrobatic archer competing in the World Nomad Games]

The word “Centaur” – the Greek mythological half-person half-horse creature - comes to mind here. 

Centaur is also the word used to describe “Advanced Chess” players. What is Advanced Chess? Advanced Chess was invented by chess grandmaster Garry Kasparov after losing to Deep Blue, a supercomputer developed by IBM. Advanced Chess allows humans to use computers to aid their decision making (e.g. by testing out a move to see what may happen) but do not rely entirely on the computer’s judgement.  This allows the human player to focus on higher order strategy while reducing accidental errors. Centaurs tend to beat both humans (without computers) and computers (without humans) at chess.

These improved chess results are in large part because the human can consistently focus on strategy, as opposed to checking for errors or flaws. The thought process becomes more consistent and more efficient.

New user-friendly tools that can be used by many and spread far and quickly and change people and by extension change the world, like how the horse and text editors/word processors changed the world through their very transmission. Over time tools begin to accrete additional features and functions. These extras can be useful for certain scenarios, but when they clutter the original purpose, they can sometimes undermine mastery. Other times if the tool has been extended that does not disrupt the consistency of the original experience then the changes are usually welcomed.

In conclusion, tools that are highly consistent, quick, and interactive in their usage that allow for a wide range of possibilities for the user can be more easily mastered than tools which may have more features but are less consistent in how those features are put together. It is primarily through this consistency that one can achieve mastery and through mastery that one can achieve a flow state of consciousness. And let us not forget that flow states are most often associated with happiness and contentment up there with spending time with friends and family. 

This is the reason people love their favourite tools.

Postscript: History is messy, and no analogy is perfect, but analogies are useful and should not always be thrown out due to a perceived contradiction. In that spirit, here are some details I left out that add colour and do not directly contradict my argument, but could be taken out of context to confuse some readers: 

  • I am using the word ‘nomadic’ to refer to an itinerant agro-pastoralist migration pattern as opposed to the more random wandering pattern that the early hunter-gatherers exhibited
  • Although the Mongols did dominate through their mastery of horses and archery, the horse was not the only piece of defining technology; The invention of stirrups which provides stability for mounted archers and the invention of the composite bow allowed for a smaller bow to produce more power were also part of this “mounted archer” package.  Furthermore, the defining feature of Genghis Kahn’s Mongolian empire (in contrast to other steppe nomads) was his ability to federate large numbers of disparate tribes and cultures for a common goal, and leverage that diversity to incorporate state-of-the art warcraft like siege technology
  • Although the Norman’s were feared in large part because of their superior cavalry it must be pointed out that the most famous and iconic Norman battle of all – The Battle of Hastings – was decided more through circumstance and chance than by cavalry: The English succumbed to a one-two-punch knock-out blow, first from Viking Norwegian invaders led by Harald in the north - who were defeated in the Battle of Fulford -  quickly followed by a separate Norman invasion led by William in the south that converged at Hastings. But perhaps this is in line with my original point – the Normans lacked the mastery of the Mongols when it came to horses
  • Technologically, mounted archers “born on the saddle” were not new: The Xiongnu - possible ancestors of the Huns - also formed a large empire centuries earlier using the power of mounted archers “born on the saddle” and like Genghis Kahn had also federated tribes – but had got caught up in a civil war. Like the Mongols, the Xiongnu and Huns were highly disruptive and nearly brought down the Han Dynasty and Roman Empire, respectively.  However, the reputation of the Mongols is better known these days and thus why I am using the Mongol example to illustrate the overwhelming power of mounted archers “born on the saddle” and how technology when fitted properly to humans can have highly disruptive consequences in a very short period of time.


Sunday, February 07, 2021

Analytic Efficiency Part 2: User Filtering and Indexing Engine Differences

Recap

In the last article we looked at the differences between Power BI and Qlik’s indexing engines with respect to their Link-based (as opposed to Merged Cube-based) indexing technologies.

The reason why we performed this experiment is that after reviewing a Qlik patent (granted in April 2020) it became clear to me that Power BI and Qlik’s technologies, while appearing similar from the outside - because they both support complex link-based architectures – would be similar internally.

In fact, the two indexing technologies are quite distinct and this difference is most noticeable in the behaviour and performance of user filtering: Power BI Slicers versus QlikView List Boxes (or Qlik Sense Filters).

As such, I ran an experiment to see if this difference also created trade-offs with respect to traditional Business Intelligence operations such as calculating aggregate revenue across two linked tables.

What I discovered was that the performance – when an apples-to-apples comparison could be made – that the performance was about the same (although at first it looked like Qlik’s performance was 4x better, but this was due to a flaw in my experiment pointed out by a reader).

That said, I was expecting Power BI to outperform Qlik in this experiment (and did find a scenario where this was certainly the case) but was still surprised to see the performance was about the same when I ran the experiment with the main use case: High cardinality key linking.

This is a significant finding I believe because Qlik’s “Power of Gray” (a term used in Qlik’s sales & marketing materials) offers a massive benefit to business analysts (think Excel jockeys) who know how to leverage this “Power of Gray”.

This one minute video demonstrates Qlik's "Power of Gray".



My next blog post will get into this “Power of Gray” and why it is so powerful and how it can be leveraged to cross analytic efficiency thresholds leading to improved business outcomes. 


Introduction

In this article we are going to explore both Power BI and Qlik’s filter controls – Slicers and List Boxes – as well as the underlying indexing technology which explains why the behaviour of this filtering is so different between the tools and why the “Power of Gray” comes easily to Qlik and why we do not see it in Power BI.


The purpose of this article is not to provide a complete description of the indexing and calculation technologies. Rather this article is intended to complement other materials (such as those we are linking to in this article) to show using an example how these engines differ. 


But before we get into that I want to go over some additional performance measures I have taken that will help illustrate the qualitative differences between the engines. Namely, the differences that analysts would notice when using both tools.

Power BI Slicer versus QlikView List Box/Qlik Sense Filter


In Power BI, the “Slicer” object - which allows users to filter fields and drill into hierarchies - is comparable to QlikView’s “List Box” object, which is effectively the same as Qlik Sense’s “Filter” object. There are some functional differences between QlikView’s List Box and Qlik Sense’s Filter (e.g. QlikView’s List Box supports an ‘AND/NOT’ selection mode), but for the features we are discussing in this post, they are identical.

Hereon we will only refer to Power BI’s “Slicer” and QlikView’s “List Box”.

With that out of the way I need to explain how Power BI and Qlik differ with respect to Slicer versus List Box and why it is not possible to make an apples-to-apples comparison with respect to performance.

Power BI and QlikView both allow users to filter on field values (e.g. selecting the value ‘West CC’ in “Call Centre Name” field using PBI Slicer or Qlik List Box). But here are the two main reasons why we cannot compare their performance:

  1. Power BI Slicer shows possible values (based on other applied filters) versus Qlik which shows all values including values that are excluded by other field selections
    1. In our example, this means that if we select Call Centre l‘West CC’ in Power BI we are only showing Work Order IDs that are linked to the ‘West CC’ call centre. Qlik on the other hand displays all Work Order IDs, including excluded work orders (which are shown in gray as opposed to possible Work Order IDs which are shown in white)
      1. This means the QlikView List Box is showing 5x as many Work Order IDs than Power BI Slicer
  2. Power BI Slicer must always be sorted whereas it is possible to turn off sorting in QlikView’s List Box
    1. Sorting has a greater time complexity than retrieval and listing and is particularly noticeable when the number of values to be sorted is greater than 8 million.
    2. Without sorting, the time complexity would be linear – O(n).  But because sorting has a linearithmic time complexity of O(n log n) and therefore is non-linear, the effort to perform sorting overwhelms any indexing efficiencies. This becomes noticeable even with more than approximately 1 million unique values in the field you are displaying in the filer (since the ‘log n’ multiplier starts to become a significant multiplier)


Putting #1 and #2 together, a performance comparison between Power BI Slicer and Qlik List Box cannot be apples-to-apples because Qlik and Power BI are dealing with different populations of distinct values while at the same time it is not possible in Power BI to separate the sort operation (with greater time complexity) from the retrieval and display of values.  

In other words, Qlik must retrieve and display more unique values than Power BI while sorting that larger population.

With all that said I still went ahead and took some measurements to illustrate some of these differences and to get a sense of how the tools behaved.

Before I get into the measurements, it is important to be reminded that most of these measurements are really just measuring a sort operation, and that because we are selecting from 140 million rows, even a single Call Centre has 35 million rows (far more than the 8 million row threshold I mentioned earlier, where we start to see a steep decline in performance).

As an aside, it is for a similar reason why DBAs and Data Developers often choose to limit partitions to 8 million rows and why Microsoft Analysis Services Tabular partitions its own data [by default] into 8 million row “segments” and Microsoft 365 Power BI uses 1 million row “segments”.


Slicer versus List Box Observations and Measurements


The first thing worth noting is that Power BI itself ran out of memory attempting to display the Slicer when no values were selected.


 


To be fair, if this were Azure Analysis Services, this may have been successful due to the larger 8M row Segment Size (as opposed to Power BI’s 1M Segment Size). So AAS would give us more options here while still staying on the Power BI platform.

With QlikView it did take approximately a minute, but we did successfully see all the values (as shown below) on the same laptop as Power BI.

 


Fortunately, when I began making selections in Power BI the Slicer was processed successfully as you can see below when I selected the “West CC” call centre.




 

Overall, the median query performance for Power BI was approximately 19 seconds per selection, as shown below.

 




Here is the underlying JSON Performance data and a .pbix I used to analyze the data.

PowerBIPerformanceData_v1_Slicer.json

V1 Slicer Performance Analysis.pbix


With Qlik, I took different types of measurements (which are not possible with Power BI).

The table below shows measurements for an unsorted List Box with the same “Worker Order ID” field.

As shown in the table below we get a median response time of approximately 2 seconds.

Tool

Size

Selection

List Box Refresh Time (ms)

Qlik

Huge

Central CC

2,000

Qlik

Huge

East CC

5,390

Qlik

Huge

North CC

2,300

Qlik

Huge

South CC

2,140

Qlik

Huge

West CC

2,080

Median

 

 

2140


But if you compare what is shown in the relatively quick unsorted display that QlikView is showing to what was shown earlier for Power BI, you can easily see the difference: Power BI is only showing the possible values, all sorted within the same block.  But in QlikView, it’s effectively just highlighting the possible values among the originally sorted population. 



 

You would have to scroll or glance through this list to see all the possible values which is obviously not practical with millions of distinct values. Although In some situations this is a helpful for making comparisons (typically when the list is short and can act like a visual heatmap), but in an example like this you would probably want to turn on both State and Text sorting which dramatically impacts performance. 

Another thing you will notice if you look carefully is that it appears as though ‘10_1’ and ‘10_2’ (which are visible in Power BI) are missing from Qlik.  If you were to scroll down the list you would find them.  The reason they are not displayed is that what is being displayed in QlikView is the original sort order based from the load script.  Often Qlik developers will pre-sort data before loading into Qlik for this reason – I did no such thing in this experiment, which is why some values appear to be missing when in fact they are just farther down the list.


I also took measurements when I turned on “State” filtering (whereby ‘possible’ values are listed above ‘excluded’ values) and I took another set of measurements when both “State” and “Text A->Z” filtering was enabled. The below table shows these in comparison with Power BI.

Tool

Test

Slicer/List Box Refresh Time (seconds)

QlikView

No filtering, all values

2

QlikView

Selection State filtering, all values

15

Power BI

Text filtering, possible values

19

QlikView

Selection State and Text filtering, all values

25


Again, I must emphasize that the measurements shown above are not for apples-to-apples comparison between Power BI and Qlik, but they do reveal contours to the underlying indexing technologies which are key to the differences between the tools and to the business user experience, and ultimately to user outcomes.

In my next post I will explain why the differences in indexing technology between the two tools make an impact to Analytic Efficiency. But first let’s dive into the example that shows the mechanics of the indexing engines themselves.


Index Engine Walk-Thru

Before getting into this example, I will refer you back to my previous blog post which describes the data model. But to quickly recap, there are two tables: “My Calls” and “My Work Orders” and they are linked on a “Call ID”.  Users select “Call Centre Name” from a Slicer or List Box and then see the total “Work Order Amount” aggregated from the “My Work Orders” table based on possible Work Order rows filtered indirectly by the “My Calls” table.

Now let us begin with our example walk thru...


Given our example of two tables: Calls and Work Orders, what is happening in Power BI when we select "Call Centre" = 'North CC' from the "MyCalls" table (using a Slicer) and then are presented with "SUM Work Order Amount" (from the "MyWorkOrders" table) in the form of a KPI Card?


Power BI (Vertipaq) Indexing Engine Walk-thru Example

Based on my reading of Microsoft and SQLBI's explainer presentations combined with my own experiences developing Power BI and Azure Analysis Services, here is what I believe is happening once a user makes the selection (feel free to correct me):


1. The cache is checked to see if this value has already been calculated.  If it has been, then no searching or calculation is required, and the result can be shown instantly (constant time).

2. If the value has not been explicitly calculated by the cache, it may be possible to use previously cached results to speed up calculation time.  For example, if the entire population of Work Order Amounts has a population of $5 and we can detect the presence of $5 while scanning the indexes, we can short-circuit the calculation because we already know it's not possible to find a smaller value.  For our example here, only the MAX and MIN can be deduced in this manner, and we can see this in the test results.

Based on reader feedback to my last post, I suspect this is only happening when the DAX expressions are explicitly bundled together using the EVALUATE and SUMMARIZECOLUMNS functions taking advantage of “DAX Fusion” optimizations.

However, the Report developer would need to be aware of this, and if the Report has separate KPI Cards, then no such optimizations would take place (as of this time of writing).

3. Assuming the cache cannot help us (which is the true subject of our test), then the first thing we need to do is determine the rows in MyCalls that have been selected.

4. Power BI partitions data by default into 1 million row segments (Analysis Services partitions into 8 million row segments).  Each segment in turn has its own Dictionary which allows the column values to be optimally compressed depending on the number of distinct values within the given segment.  Since the MyCalls table has 100 million rows, we can expect approximately 100 segments (I say approximately because segment size is in fact 1,048,576 or 2^20 rows, since segments must be a power of 2).  Based on simple calculation (100 million / 2^2 = 95.3), I estimate 96 segments and will use that number going forward.

Each of these 96 segments has its own Dictionary.  The reason for having multiple dictionaries is that within in each segment only a subset of values is likely to occur. This allows for better compression and better compression allows for faster reads. Thus, each dictionary is consulted to see if it includes 'North CC'.  Now if the segments had been aligned to this column we would only need to check a subset of segments.  Depending on how the MyCalls table has been sorted, it's possible that Power BI is searching segments that only pertain to the 'North CC' call centre.  This would mean querying approximately 20 segments (as opposed to all 96 segments).

5. Once all rows have been read (whether those rows are all lumped into a subset of segments or are spread across all segments), Power BI now needs to determine what linking keys are included in the selection scope.  In our example the "Link Call ID" which is the field that links the two tables together is a whole number meaning that Power BI can more efficiently store as a compressed symbol without requiring a dictionary to decode the value which is more efficient.  

6. Using the coded "Link Call ID" values, Power BI then determines how these linking values are mapped to the linked "MyWorkOrders" table using the "Relationship Index".  In our example with 'North CC' being selected, the Relationship Index will point to a subset of segments in the "MyWorkOrders" table.  Hopefully the tables are aligned, but it's possible they are not. Given that we have approximately 140 million work orders, we would have at least 134 segments.  This would mean Power BI would need to scan at least a fifth of those segments (27 segments).  That said, I suspect all MyWorkOrders segments would need to be scanned.  The reason is that the MyWorkOrders table does not have a "Call Centre Name" column and only a "Link Call ID" column which does not share the same distribution as the Call Centre names.  In all likelihood all 134 segments comprising all 140 million rows would be scanned.

7. As mentioned earlier, integer and whole number values are not by default "hash" encoded into a dictionary and thus do not require an additional lookup, so the values of the "Work Order Amount" can be determined directly from the compressed MyWorkOrders table.

8. From here Power BI is in the final stretch and using the "Relationship Index" can scan all the "Work Order Amount" column indexes for all rows in each of the segments that have the referenced "Link Call ID".

9. Power BI now calculates the Sum total of Work Order amount from these underlying values.


This leads us to the reasons why Power BI cannot easily query and calculate excluded rows (i.e the complement of rows):

  1. Only text and floating-point values are dictionary encoded in Power BI (whereas in Qlik there is no exception, all values are dictionary encoded since all values are of a 'dual' type and potentially hold both numeric and text values)
  2. PBI Dictionaries can be fragmented across multiple segments, requiring multiple dictionaries to be queried with results merged - a costly operation
  3. Through incremental partition refreshes, PBI dictionaries can contain values that are no longer referenced by any dictionary. Power BI even allows administrators/operators to perform a "defrag" on tables using XMLA commands.
  4. Power BI also allows for a DirectQuery mode that directly queries an RDMBS using generated SQL queries.  DirectQuery is already challenged by latency to begin with, it would be even more challenging and time consuming to add additional queries to calculate the complement of a given column's values.


So how does Qlik (QlikView and Qlik Sense) work in this scenario then?


Qlik Indexing Engine Walk-thru Example


Based on my reading of Qlik's patent US: 10,628,401 B2 and experience working with QlikView and Qlik Sense, here is what I believe is happening once the user makes the selection.

I must warn you: Qlik’s engine is not as intuitive as Power BI’s Vertipaq engine and does not follow a “Columnar DBMS” architecture.   For this reason, you can reference this document which contains the underlying example data I am referring to here.

Qlik vs Power BI example.xlsx

Here is how it works:

1. Like Power BI, the cache is checked to see if this value has already been calculated.  If it has been, then no searching or calculation is required, and the result can be shown instantly (constant time).

2. Qlik does not appear to use intermediate cache results (e.g. like the Max of the total population).  If the result has not explicitly been calculated before it is entirely calculated without using past queries.  Thus, caching in Power BI appears to be more sophisticated in this regard. Although I could be wrong here.  But if you read the addendum to my previous blog post I present some evidence to support this.

3. Qlik takes the 'North CC' value from the user selection and looks it up first in the “MyCalls” BTI. BTI stands for"Bi-directional Table Index" which is a dictionary that points back to the row position of values.  Already we can see that Qlik is starting with the dictionary before even looking at rows.

MyCalls Table

Call ID

Link Call ID

Call Duration

Call Centre Name

1

1

4

North CC

2

2

9

North CC

3

3

4

East CC

4

4

1

East CC

5

5

2

West CC

6

6

1

Central CC

7

7

4

South CC

8

8

9

North CC


4. The BTI points to all rows containing the value 'North CC' using what is known as a bitmap index.  For example, if our MyCalls table had eight rows and the 'North CC' calls were on rows 1,2, and 8 then the bitmap index would look like this: 11000001, which in turn could be encoded into a single byte as 0xC3 in hex or 193 in decimal.

I want to pause for a moment and point out that it is here we can see a radical difference between how Power BI and Qlik index their data with respect to dictionaries.

In Power BI the dictionary is subordinate to the table rows and is not used for columns that contain whole and fixed decimal numbers, whereas in Qlik the dictionary is at the centre and the table rows are subordinate to the dictionary even when the column values are exclusively integers.

In other words, Power BI is more row-centric versus Qlik which is more dictionary-centric.  It is this dictionary-centric approach to indexing which allows Qlik to efficiently query and present excluded values and thus present to the user those excluded values in Filter boxes (in Qlik Sense) or List Boxes (in QlikView).  This is what Qlik's marketing refers to as "the power of grey".

5. Moving on with our example, once Qlik has determined which rows have values - using the BTI's bitmap index - it now needs to resolve "Link Call ID" values for the corresponding table.

6. Qlik then uses a second index known as an "Inverted Index" or "II" to look for the value symbol references for the "Linked Call ID" column. Using an example of our table with eight rows, our inverted index might look like this: 1;2;3;4;5;6;7;8

Each value in this single-dimension array corresponds to a given symbol value in the "MyCalls" table.

Given that we are interested in rows 1,2, and 8, we would need to reference Work Orders from the MyWorkOrders table with "Link Call ID" in (1,2,8).

7. We now take those three values and determine how the symbol value is mapped from the MyCalls table into the MyWorkOrders table.  This is done through an index Qlik calls a "Bidirectional Associative Index" or "BAI".


8. Using the "BAI" we map the index position of the "Link Call ID" values (1,2,8) for the "Link Call ID" to the corresponding index positions. In our example, some (about 25%) of MyCalls records have no corresponding value in the MyWorkOrders table.  So those "Link Call IDs" would map to the value '-1' to indicate the value does not exist in the linked table.  The other values would have their values mapped.  The following table shows how the values could be mapped for all Call IDs, assuming Call IDs 2 and 6 do not correspond to any "Linked Call Id" in the Work Order table:

MyCalls to MyWorkOrders BAI

MyCalls BTI Index

MyWorkOrders BIT Index

1

1

2

-1

3

2

4

3

5

4

6

-1

7

5

8

6


At this point we should point out that Qlik would also build a corresponding index to go in the opposite directly. While that index would not be used in this example (since are going from MyCalls to MyWorkoders and not the other way around), it is worth showing how the other BAI would be constructed.  Here is how it would look (keep in mind there are only six distinct "Linked Call ID" values in the MyWorkOrders table:

MyWorkOrders to MyCalls BAI

MyWorkOrders BIT Index

MyCalls BTI Index

1

1

2

3

3

4

4

5

5

7

6

8

Please note: The above table is shown as an example of reverse linkage. However, for the remainder of this walk-thru we will NOT be referencing this BAI index. But we will be referring back to the MyCalls to MyWorkOrders BAI index.


It is important to note here that these link mappings are not linking from table to table, but rather from BTI to BTI.

It is through this BAI that Qlik is directly linking Dictionaries with one another.

It is as if the tables themselves don't really exist in any explicit format, the position of values within tables is essentially an attribute of the dictionary.


8.  Now we have the BTI index positions of the selected "Link Call ID" values, Qlik can determine using what rows those values are found on using the BTI index for the given Call IDs.

In our example this means that Qlik would be looking up values (1,6) in the BTI for the "Linked Call ID" in the MyWorkOrders table.

MyWorkOrders BTI

BTI Index

Link Call ID Value

Value Position Bitmap

1

1

100000000000

2

3

010000000000

3

4

001101000000

4

5

000010100100

5

7

000000011010

6

8

000000000001


Using our example if we assume that the MyWorkOrders table has twelve (12) rows our bitmap indexes for "Linked Call ID" 1 and 6 would look like this:

1:   100000000001

6:   000000000001


Qlik then performs a bit-wise OR to produce the following bitmap index

1+6= 100000000001


A minor caveat: I mentioned earlier that a bitmap index is used by the Dictionary BTI to determine what rows the value falls on.  However when the value is very sparse (I am making a conjecture here), I suspect the bitmaps are further compressed (probably using an offset and run-length-encoding). To be clear while I don't know for sure how the bitmap indexes are managed for sparse value distributions, it doesn't really matter all that much for the purpose of this example.

MyWorkOrders Table

Work Order ID

Link Call ID

Work Order Amount

1

1

$15

2

3

$10

3

4

$20

4

4

$15

5

5

$5

6

4

$10

7

5

$20

8

7

$5

9

7

$15

10

5

$10

11

7

$5

12

8

$15

 

10. Using the BTI for "Linked Call ID" Qlik locates all rows that we need to query "Work Order Amount" from using the "Inverted Index" or "II" for "Work Order Amount" might look like this: 

1,$5

2,$10

3,$15

4,$20

3;2;4;3;1;2;4;3;3;2;1;3.

If we refer to the merged bitmap index above (using bitwise arithmetic), we can now use the II to determine that the values for "Work Order Amount" are: 3 and 3.

From the above example, we can see the BTI for Work Order Amount might look like this:


11. Qlik can now determine that since both references are to row 3 in the BTI, it can more efficiently calculate the SUM of Work Order Amount by multiplying the value by the number of instances.  This would be 2 * $15 = $30.

12. Qlik now presents back $30 as the total back to the user.