Essential Complexity: 2020

Thursday, October 08, 2020

Snowflake’s Market Valuation and Network Effects

TL/DR: Snowflake’s $63 billion valuation shortly after their IPO is primarily based on user network effects rather than its database innovations. Tech journalists and other analysts don’t want to sound cynical by reminding people of “lock in” and would rather discuss innovative features. But technical innovation is a distraction here. Whether you are an investor or a customer, it behooves us to understand the subtle nature of network effects and lock in or else we are doomed to lurch from one locked-in vendor silo to the next. But there is reason for hope.

A few weeks ago I was listening to a podcast interview with Bill Gates.

At one point the conversation turned to Bill’s relationship with Warren Buffett. Gates recalled an early conversation he had with Buffet where he mentioned that the ‘Oracle of Omaha’ was struggling to understand how Microsoft could compete with IBM in the long run. This was in 1991 and IBM at that time was still considered the juggernaut of IT and the conventional wisdom was still “nobody got fired for buying IBM”.

While Gates didn’t come out and say it explicitly during the interview, the reason of course has to do with network effects or more specifically what we now refer to as user or vendor “lock in”.

For the same reason that Microsoft was able to compete with and eventually displace IBM, was the reason that Microsoft was nearly broken up in the late 1990s. Once a critical mass of users and applications is reached, it is nigh impossible to break that network without some kind of technology disruption occurring, in which case a new technology or vendor takes over and the cycle repeats itself.

Network effects, lock-in, and barriers to entry for competitors are something many people are aware of but the tech journalists and other analysts don’t always remind us of this when I think they should.

My introduction to the network effect concept came in the 1980s when I learned that the inferior VHS videotape standard had displaced the superior BetaMax format that Sony had developed. The story goes something like this (I am recollecting from memory, apologies for any errors ): Sony had developed two video cassette technologies: VHS and BetaMax. VHS was deemed a prototype and not worth protecting IP. BetaMax was the quality product Sony was aiming for. So VHS was out there as a viable standard but initially had no backing to speak of. When the adult film industry realized they could sell movies directly to customers through home video, their efforts were stymied when Sony refused to allow its BetaMax to be used for distributing “lewd and unsavoury” content. With no access to BetaMax the adult film industry simply took the path of least resistance and distributed through VHS. Since there were no IP restrictions for VHS, this opened the standard to other companies like RCA, Philips, Panasonic and other home electronics manufacturers to manufacture and sell their own VCRs.

Within a few short years VHS became the de-facto standard for home video.

As an epilogue I do recall BetaMax hanging on for a while by emphasizing the quality aspect of their invention. However, even that was displaced by the LaserDisc format which could provide both quality and features that neither VHS nor BetaMax could provide. By the mid nineties there were two home video “networks”: Low-end VHS tapes for the masses and high-end LaserDiscs - which offered superior picture and sound quality, directory commentary, and elaborate liner notes - for film buffs. The esteemed Criterion Collection was effectively borne out of the network effects from LaserDisc format.

By the late nineties the DVD format began to pick up steam and because DVDs could offer LaserDisc features at VHS prices, both of the latter standards would soon succumb to the same fate at the BetaMax format.

An episode of The Simpson’s has a funny joke about this where we see a garbage dump with piles of VHS tapes, LaserDiscs, DVDs, and an empty area with a sign that says “Reserved for Blu-Ray”.

I suspect (even hope) many readers just skimmed through the last two paragraphs. As anyone who cares or follows the technology industry knows, network effects are the name of the game in the modern world of technology.

In a sense it has been like this for a while: in the “olden days” you could be shrewd inventor like Morse or Edison and work the levers of the patent system to protect your invention and build up a network effect.

Alternatively, you could be like IBM or Apple and develop a strong reputation and brand that keeps customers loyal while scaring people away from competitors products.

In ancient times (and even still today) innovation was limited to what could taught and passed on. Hero of Alexandria invented the steam engine nearly 2,000 years before it would be re-invented. He wrote down instructions on how to make one, but anyone who took the time to build one probably kept the knowledge to themselves. We even romanticize such inventors as “wizards” when many of them were really just knowledge hoarders protecting their secrets.

While I have my reservations about Edison, IBM, and even Apple, I think building businesses on patents and strong reputations is not necessarily a bad thing for the simple fact that this dynamic propels innovation. The inventors in the pre-modern world would also have to compete.

On the other hand, businesses that are able to exploit network effects to grow quickly are also worthy of praise. These companies which often come out of nowhere can quickly shake up otherwise calcified industries. Everyone loves an underdog, especially when it beats the champ.

But taken to extremes all of these things: patents; brands; and network effects can end up stifling competition or worse, creating institutions that society depends on but which effectively undemocratic. Companies like Facebook, Google, Microsoft, Amazon, Apple, and I’ll put Western Union in there for good measure all leverage network effects to great success.

Do you think it’s possible to out-innovate Google with a better search engine? Maybe it’s possible, but you also have to realize that Google has the world’s largest database of user search history which is now the main driver of search results. Good luck in getting that, but maybe it’s possible?

Apple has hundreds of thousands of developers locked into its App ecosystem (which only competes with Google).

Facebook I don’t need to explain, they don’t even bother boasting about the size of their network any longer and basically act like the Saudi Arabia of social networking.

Amazon in some ways has the most genuine competition to deal with. But when you compare it to its next best competitor Walmart, Amazon is miles ahead in terms of its integration between supply chains, warehouses, and delivery chains, not to mention the fact that Amazon also has its own cloud infrastructure “AWS” it can cheaply tap into for all of its digital needs.

The trillions of dollars these companies are worth are based mostly, in my opinion, on these network effects more so than any particular technology or innovation you can point to. We live in a world of that is increasingly substituting glorious free market competition with the less splendid monopolistic competition.

Whenever I see talk of a new start-up, I feel like the honest question to ask is: How does this company combat existing network effects that are working against them and how can they bring about network effects that lock their customers in going forward?

When I phrase it like this, it sounds a bit anti-capitalist or crude to even wish for lock-in.

It is so obvious that this is what is happening and yet I rarely see this level of frankness when new technologies are discussed.

In other words, I believe the way we talk about technology and innovation is dangerously infused with magical thinking that ultimately plays into this dynamic.

This now brings me to Snowflake, the recently public IPOed cloud database company.

As of October 2nd 2020 on the NYSE, “SNOW” closed at $227 with a combined Market Cap of $63 billion.

That is an extraordinary sum of money for a relatively new technology that was founded in 2012 (8 years old).

Compare this to other companies in the business data analytics on October 2nd:

Domo (DOMO) $1.1 billion (founded in 2010)
TeraData (TDC): $2.4 billion (founded in 1979)
Looker was recently sold to Google for $2.6 billion (founded in 2012)
Cloudera (CLDR): $3.28 billion market cap (founded in 2008)
Databricks, still private but last valued at $6.2 billion (founded in 2013)
Alteryx (AYX): $8.1 billion market cap (founded in 2010)
Tableau was recently sold to Salesforce.com for $15.3 billion (founded in 2003)
Oracle (ORCL) closed with a market cap of $177.1 billion (founded in 1977)
SAP SE (SAP) closed with a market cap of $189.4 billion (founded in 1972)

Based on this smattering of companies, it would seem like Snowflake is an outlier in the data & analytics space. It’s a newer company, but is valued at least 4 times as much as it’s closest peer (Tableau). On the other end of the spectrum Snowflake is one third of the value of Oracle and SAP, both highly established businesses with deep customer bases.

I can point out many companies that would struggle to keep operating without Oracle and SAP, so their hefty valuations make sense to me. I can also look at a company like Tableau and appreciate its value, mainly based on the strong brand image it was able to cultivate and the loyalty this has generated.

With Snowflake I am not seeing - at a ground level - the level of business process lock-in that Oracle and SAP enjoy. Nor am I seeing the level of brand recognition that Tableau has among business users (although that has ironically been boosted by the IPO itself).

Furthermore, the vast majority of IT departments (and businesses in general) tend to prefer homogenous architectures. This means that they would rather purchase from a single vendor with “one throat to choke” than deal with multiple vendors and the friction that comes with cross-vendor integrations. This means that if SAP forms the centre of business functions and processes then you would buy SAP products even if they are not quite as good as the competitor’s product. I’m not saying this is always the right thing to do, but it often is the most efficient approach, and more to the point it’s a more easily defensible position, a la “Nobody ever got fired for buying IBM”.

So how does Snowflake overcome these huge barriers and go on to be valued at the $63 billion?The answer of course is network effects.

But how exactly?

Before I answer that question, I’ll just explain briefly how Snowflake works and how it is different from traditional databases.

Most databases combine both the data (storage) and the compute power (VM [virtual machine]) into a single service. While modern data warehouses do a decent job of scaling large numbers of users and workloads, they do not scale elastically. What this means is you need to reserve ahead of time the amount of compute hardware for the database. This means you are often paying for unused compute power and it can also mean that you might hit an upper limit if too many users are querying the data warehouse at the same time. It’s not a common thing but it happens, and quite frankly it is rare to see problems that cannot easily be fixed by simply adding more hardware, which is can be accomplished with a few clicks on cloud platforms. You can even automate this scaling. There are also benefits to this constraint in that costs are more predictable and easier to budget for.

Nevertheless, this user isolation concern has been known for decades and techies tend to gravitate towards these types of problems, even if they are mostly theoretical. Originating from Google and beginning around 2010, the concept of a “Data Lake” began to emerge as the solution to this problem. The idea was (and still is) that you run programs on clusters of VMs (i.e. one ore more cloud compute nodes) and read and write from a specialized file system known as HDFS (Hadoop File System, [originally the Google File System]) that is designed for both resilience (i.e. protection from data loss) and performance. This means that data developers and data analysts alike can blend data from multiple sources (essentially just flat files like CSV files, but also binary optimized flat flies like ORC and Parquet formatted files), and then write back out their results as other flat files. However, the problem with this approach is that there has never been a consensus for how developers manage the Metadata that surrounds these flat files. It’s a bit like asking hundreds authors to fill up a library with books but without any indexing system. Eventually you will be left with a library teeming with books but with no ability to find what you’re looking for or even understand who wrote what and why.

Metadata management is a crucial service that databases provide (often in sub-system known as the “information schema”) so users can quickly understand how tables are related and what assumptions we can make of their their contents.

The way in which Data Lake centric platforms solve this problem is to create a managed Metadata service that would provide this service. If you have heard of “Presto” or “Hive” or “Impala”, these software systems piggyback on Data Lakes and provide a Metadata service layer which allows data developers to work with Data Lakes while maintaining the crucial Metadata in a centralized location available to all other users. For example, Databricks and Cloudera both allow their users to create “Hive” databases, whereby all the data is stored in a Data Lake and the computation costs are charged back to the user.

I would argue that this approach is as good if not better than the Snowflake DB approach. The reason is that when you are querying Snowflake data you need to use Snowflake’s own database compute engine (regardless of the fact that it is perfectly isolated). To be fair to Snowflake, they have built an engine that performs well under the vast majority of situations.

But the Data Lake approach has no constraints when it comes to computation or how files are being accessed. You also don’t need to go through the Metadata (e.g. Hive) layer if you don’t want to. For example, you could develop specialized file formats and corresponding compute engines that encode and decode those files for specialized applications that don’t fit neatly into pre-existing patterns. Again, to be fair to Snowflake they can and have extended their platform to allow this type of customization too, but you have to wait for Snowflake to do this.

Let’s get back to this extraordinary market valuation of $63 billion. If there are established alternatives to Snowflake as we just discussed, and those alternatives are already baked into the eco-systems of major cloud vendors (e.g. Databricks and Cloudera are already baked into Microsoft Azure), and it is undesirable to use vendors outside of the dominant platform, then why would so many people abandon homogenous architectures and complicate the IT landscape with Snowflake?

The answer I believe comes down to something much simpler: Convenient user access. Not unlike VHS videotapes.

A pattern I often see play out in the modern enterprise goes something like this: Technologists (researches, consultants, early adopters, IT staff, etc.) hear about some bleeding edge new technology that “all the cool kids” are using. This is often associated with some kind of hype cycle, like “big data” or “machine learning” or “blockchain”. Then there is a rush to evaluate and even adopt the technology. Some “proof of concept” or trial is produced that might show some potential. Usually there is a one-time data dump/extract provided along with a few business users who are onboarded with the specific purpose of evaluating the potential of the technology.

But then when the proof-of-concept is over and the business users themselves are left to their own devices, they rarely go back and use or even ask for the new technology. Occasionally a user might mention it or even ask if they can use it, but more often than not they will go back to data sources they readily have access to and tools they are comfortable with.

With all of that said, doesn’t Snowflake also have the same problem, and isn’t it made worse by the fact that Snowflake does not (yet) have the advantage of being sold under the banner of a big cloud provider like Microsoft GCP or Amazon AWS or Google GCP?

Yes and no.

Yes, because there is much of the same type of friction you normally run into when sharing data within the organization or enterprise.

No, because there is significantly less friction when you are collaborating with people outside of the organization or enterprise.

It is this cross-organization data collaboration where Snowflake really stands out and where I can see justifying it’s astronomical Market Cap of $63 billion.

Allow me to elaborate…

The challenge with most modern data platforms is they do not work out-of-the-box with most data tools (e.g. Excel). Instead, the data developer or data user must adjust to a new way of working that is often more awkward than the old way. For example, take that “Data Lake” I was mentioning earlier. To get data into our out of a Data Lake requires special desktop tools (e.g. Azure Storage Explorer) and command line tools that most business analysts don’t have access to. Since these platforms were designed by technologists for “data scientists”, they often lack many simple features most business users take for granted and often feel clunky. For example, if you want to open a file from Azure Data Lake Store, you can’t just browse using Windows File Explorer and then pop open a data file in Excel. Instead you first need Azure Storage Explorer (which may not be available as a standard Enterprise application) and then even if you get it you need to download the file locally first before being able to peer into its contents.

The point being, no matter what modern Data & Analytics platform you are using, there will be some change to the way you work. Thus in this scenario it doesn’t matter if you are using a Microsoft Data Lake or a Snowflake DB.

Now here is where things get interesting. Once you pivot to using a modern Data & Analytics platform, whether it’s Azure Databricks or Azure Synapse or Google Big Query or Amazon RedShift, if you decide at some point that you need to collaborate with groups or individuals who are outside of your organization (or perhaps even in other departments), you start running into brick walls.

For example, let’s say you are an insurance company that wants to share data to related to fraud for the purpose of catching serial fraudsters (this is a real thing btw). Basically your only option is to find some agreed upon data store that is deployed specifically for this very purpose and that requires building data pipelines to populate and maintain. Even keeping a single table refreshed daily would be a big ordeal given all of the concerns that normally are associated with moving data over the public Internet.

But you might think “Hey could we not just provide some kind of guest account into the Data Lake or whatever the database platform we are using?” The answer is “Yes, but.” It’s not so simple is for security reasons. Namely, most cloud Data Platforms are really just one of many components that are managed under a common cloud service for a given “tenant”. For this reason the big cloud providers (Microsoft, Amazon, Google) tend to put up many barriers to external users from logging in easily. By default there are network firewalls and what are known as “conditional access policies” that will take into account your device, location and other factors and require you to perform “multi-factor authentication”, and that’s even if you can get past these policies to begin with. These cloud platforms are designed - by purpose - to minimize friction internally while introducing much friction externally. That external friction is done in the name of security but it also has the effect of nudging people into using more of the cloud vendor’s tools.

This is where Snowflake really shines: It is designed to allow data sharing across customers while not compromising security. While this might sound like something that only applies to cross-organization (internet) sharing and collaboration, it can also apply to intra-organization (intranet) sharing and collaboration. This is because all of those Cloud Data Platforms I just listed tend to be very modular and isolated in their design. These days most organizations have several Data Lakes and other Big Data platforms, but those Data Lakes are often just silos used by only a small number of persons. Contrast this with Snowflake which only behaves like a silo if you want to it, but the moment you want to share data with another Snowflake user, whether they belong to your organization or not, you have the simple and easy option to do so. To be clear, it is possible to lock Snowflake down and make a silo if you want it to be, but it’s also much easier to change those settings when you don’t. The architecture never commits you to the extreme form of isolation that the other cloud platforms do.

While we’re on the topic, I should also point out that Snowflake has another advantage over Data Lakes (which is the alternative that most closely matches Snowflake’s approach): With Snowflake you have fine grained control over what specific rows and columns you want to share, whereas in the Data Lake world, the most granular object is a flat file (table), which tends to made up of many rows and columns. Yes, it is possible to create flat files that are a custom materialized “view” into the data. But you would need to have a data pipeline that keeps that document refreshed on a regular basis. The “custom view” feature in Snowflake (same as RDBMS VIEWs) allows data developers to rapidly and securely collaborate without requiring any “accidental complexity” from data pipeline scripting to creep in.

So what are the implications of all of this?

For Snowflake, if they can get to a certain critical mass they could get to the tipping point where their service simply becomes a cost of doing business. At that point their $63 billion market cap will make sense and many investors might even be wishing they had bought sooner.

This is what Dropbox (NASDAQ:DBX) has been trying to do for a while. Their current market cap is $8 billion down from a high of nearly $17 billion in June 2018.

But Dropbox is more of a consumer technology, and consumers are fickle and don’t really depend on storage services the same way, because very few people maintain their own software systems that manage data.

Snowflake is not consumer data storage service like Dropbox (btw, I am a fan of Dropbox and wish them well) because Snowflake is much more plugged in to Enterprise IT. Snowflake has more “lock in” potential.

I’m sure Snowflake knows this and is using its IPO money to pursue this “winner take all” end game.

But what about the big cloud vendors, surely they are working on something to combat Snowflake?

There are two main challenges they are up against:

They definitely need to make their existing data platforms more thoroughly elastic. Platforms like Azure Databricks works more like this, but deployment and administration is not as streamlined as it could be when comparing against Snowflake.

This can be challenging but it’s very much a tractable problem.

They might need to figure out a way to make cross-organization sharing easier while not compromising security for the other components in the same cloud platform. This is a much harder problem to solve because of all the trade-offs involved. A solution that might benefit data sharing in a data warehouse or data lake might create vulnerabilities or complexities elsewhere.

My advice to these giants would be to play to their own strengths and reduce the friction within their own service perimeters. Earlier in this post I mentioned how Microsoft Azure business users must use Azure Storage Explorer to access data in Azure Data Lake Stores. I should also point out that Azure Data Lake Store itself only supports up to 32 users or user groups from a security standpoint. As a result it’s now common to see dozens of Data Lakes and other data platforms - all silos within the same cloud vendor - sprinkled throughout various departments.

If Microsoft, Google, and Amazon were to focus on these friction points they wouldn’t need to build a competitor to Snowflake.

Instead what I see is a half baked approach whereby business users are nudged towards more business friendly platforms like Microsoft 365 (formerly Office 365) and Google Workspace (formerly G Suite) while leaving the real power tools for just the IT department.

The so-called Data Scientist is basically stuck between these two worlds and often will just resort to running local Data Prep tools on their laptop like Python or Alteryx or KNIME or whatever they can get their hands on.

To sum up, it is important to recognize that technically innovative features can distract us from the market value of a given business. Tech journalists spend most of their time discussing technical innovations as if this is what drives these businesses, when instead it is through subtle network effects that can arise out the absence or inclusion of user friction.

What does this mean for the future of technology and business?

Are we doomed to hop from one locked in platform to another for ever and ever?

The good news is that as this [pay walled] article in The Economist points out most people now realize this is a problem and governments are beginning to regulate the tech industry in much the same way other “winner take all” industries like utilities and banking have been regulated in the past.

The EU in particular appears to be working on legislation that if passed would open up companies like Google, Apple, and Facebook to more competition.

But as The Economist also points out “the devil is in the details” and getting regulation right is tricky (and worth the effort).

This means we should also be looking for solutions within technology itself.

History does not tolerate lock-in forever and eventually something better emerges out of these constraints. It is that “something better” that I am working on and I hope others are working on this same problem too. Much progress has been made due to the open source software movement, both for commercial and non-commercial users. But open source software is not enough because most of what I just described is ironically based on large swathes of open source software and yet here we are.

But, I can see a way forward from these silos.

How do you think we will break past these network effects and allow a more naked form of innovation to be the arbiter of success?

Tuesday, April 28, 2020

The Power of Diversity: Why I Love the Humble and Organic CSV File

These days when we hear the word 'diversity', different things might come to mind for different people. Diversity - with modern connotations of immigration or affirmative action has infused the word with with notions of multiculturalism, gender equality, and other forms of social levelling. I am supportive of that, but what I want to discuss here is diversity from a more abstract perspective.

Lest you feel that it's nigh impossible to discuss diversity without taking a political position, I will come out and say it right now: I am pro diversity. Anti-diversity people can stop reading right now. That's the end of this post for you. Have a nice day.

If you're still reading, let's take a step back and look first at diversity from a cosmic perspective...

Life itself here on earth emerged from a process known as Abiogenesis. There's a lot we don't know about this process, but what we do know is that it came about through the power of diversity. Abiogenesis involved mixing of complex molecules (proteins, possibly delivered through meteors from space), shifting temperatures and shifting movement (through tectonic plate shifting), shifting gravitational patterns (through the orbit of the moon), shifting light and radiation patterns (through the orbit of the earth). Even the part of the universe we inhabit has been shown to have more atomic diversity than the other parts of the universe. For example, phosphorus, a key building block for life happens to be available in our next of the cosmic woods but not most other places in the cosmos. Without it we wouldn't exist. It turns out, this element is not evenly spread around the universe, and we happen to be luckily placed in this cosmic soup of diverse elements. We also need the right amount of radiation too so mutations may occur, allowing new forms of diversity to emerge and further evolve.

Given enough time and space, our existence is both inevitable and miraculous depending on how you look at it. What sets earth apart from most other planets in the visible universe is not only that we have the right conditions to allow life to take hold once it arrives, but just as importantly the right amount of diversity so life may spring from chaos, and not just thrive in a simple state, but evolve to every increasing levels of complexity.

Turning to the present moment, COVID-19 is likely also the product of diversity. Most epidemiologists agree that COVID-19 - like the SARS virus before it - emerged from animal markets in Wuhan China through natural evolution. There is a long history connecting plagues to animal domestication. It has something to do with animals being brought to mingle together in ways that would never occur in the wild. Again novel diversity.

This is why Europeans who first emigrated to North America brought disease that wiped out most of the indigenous population. These diseases evolved in European farms through similar processes that led to COVID-19. Conversely, Europeans that emigrated to South America were on the receiving end, constantly battling disease that evolved out of the tropical jungle's diverse ecosystem.

Bio-diversity may have also led to the extinction of Neanderthals. Our old cousins may have been wiped out or partially decimated by diseases that originated from Homo Sapiens homeland in Africa.

While it may seem that diversity of the kind I speak of here emerged from the "forces of nature" and is outside the purview of human life, I want to draw your attention to another example of how diversity shaped the trajectory of human civilization.

If you want to hear the full story, just listen to my podcast "Cradle of Analytics", and jump to episode 10 "Meet the Flintstones". I'll summarize the key points for here:
Interest/usury which is the backbone of Lending, Capitalism, and Free Markets was not invented by anyone on record. It was not invented by Adam Smith. It was not invented by the Greeks. It wasn't even really invented by the Sumerians who first practiced it over 5,000 years ago.
Rather, it appears have emerged out of a soup of diversity in the plains of Mesopotamia between the Tigres and Euphrates sometime around 8,000 years ago by who we now refer to as the Ubaidian Culture.

We believe that interest emerged during this period through the discovery of clay "counting tokens". These "counting tokens" with symbols depicting animals and other assets were then placed in clay envelops (known as 'bulla') about the size and shape of a softball and fire kilned, preserving them to this day.

We cannot say for sure why or how these tokens were created and used, but there is something of a consensus among archeologists, anthropologists, historians, and economists that these counting tokens were the original contracts and they were for lending with interest.

It sounds almost impossible even anachronistic that people were taking out interest based loans nearly 8,000 years ago, but if you follow me I'll show this may have happened:

Imagine you are a sheep herder who owns a flock of sheep. You use the sheep for their wool which you provide to your family and extended family (your tribe). One day something happens, and a pride of lions destroys a quarter of your flock (there used to be lions in Mesopotamia at this time - but recreational hunters later wiped them out). The situation is dire but there is still hope. You have the skills to get your flock back up to its original size, you just need a couple of rams for breeding and a few extra ewes to fill the gap. You also know of another sheep herder that is less than a days walk away. Borrowing sheep in times of need is quite common and surely you would lend a few sheep to someone you know who is in need?

But as you are walking to the other sheep herder's town you realize that you don't really know this person that well and he is from another tribe and worships a different god. He might not feel the need to do you any favours. The fear of rejection is going through your mind as you keep walking. You need to figure out how to sweeten the deal for this guy. Simply asking for help as an act of charity might be a bridge too far. So you are thinking "how can I repay this person to make it worth their while, when I have nothing to give up front"? And then you are hit with an idea: You realize that once you can get the rams to breed more lambs you may have a small surplus of sheep, and you can pledge that surplus back to the lender. Win win. Bingo!

You might not be a mathematician, but you know how to count and how to add and subtract numbers for the purpose of keeping track of your sheep. So through your basic knowledge of quantities you come up with a number of lambs that you feel is a fair amount. After all, it is your skill as a sheep herder that is allowing these sheep to safely multiply (and now you're even better cause you have since learned how to deal with hazards from lions).

You eventually make it to the foreign sheep herder before sundown and manage to meet him in person. You tell him of your predicament and then quickly explain that you realize you're not from the same family nor worship the same god but would like to make a deal that will make it worth his time. You then explain to the foreign sheep herder that you will borrow 50 sheep from his flock for a year including at least two rams, and then you will repay him the entire flock of 50 sheep plus an additional 10 lambs.
The foreign sheep herder - who already has far more sheep than he or his tribe actually needs - considers this offer. In his mind he might think to himself "Well I've always been proud of the fact that I own the most sheep in my town, but there is another herder who lives within a 2 day walk from me that I have heard has the most sheep of any herder. I have always envied and fantasized about being that person. I bet that I could be that person if I allow for this loan and it works out for me. Heck, if I pull off a few more of these loans I could become the most powerful sheep herder in all of Mesopotamia."
And so one man's fantasy of salvation is married up with another man's fantasy of domination.
One last thing is required to seal the deal: A contract.

Fortunately, being near the Tigres (or Euphrates) there is no shortage of soft clay. While you were on your way to meet the sheep herder you scooped up some clay and form it into a small discs - one for each sheep that you would repay. Using a reed you etch a small symbol on each token. Sixty (60) tokens in all. It took about an hour to do this.
On your instruction, your lender takes the tokens, bakes them in a kiln, and then takes the fired tokens and places them in a larger clay envelope. You then imprint a symbol known to represent your tribe, this symbol is a representation of the god your tribe worships and who protects you. Your god is now a party to the contract. Don't mess this up or you could bring a much bigger catastrophe to your tribe. Before sealing this contract you make it clear what will happen if the sheep cannot be repaid: You will give your life to this man. You will allow him to enslave you. Hopefully this will not happen.
The contract has been signed and so the deal has been made. Interest/usury has now emerged as a new concept.

If the more powerful sheep herder had been an enemy, our hero would not have bothered to make the journey. It would have been too dangerous. But if the more powerful sheep herder had been part of the same tribe, our hero would not have felt the need to make such a deal and would have made an appeal to charity. And if there was no clay to scoop up and shape along the way, none of this would work.

It is only through the ability of these individuals - each with their own selfish needs - to collaborate from which a new invention emerges and thrives and multiplies: Interest based contracts.
A lot of things needed to happen in order for this contract to emerge, but most importantly a diverse population competing for resources is at the core of this organic invention.

You can also look at interest as a kind of 'virus' just like COVID-19. This is why interest/usury is expressly forbidden by all Abrahamic religions (unless it's for people in another religion - just don't charge people in your own religion). The oldest critique against interest was discovered in India and dated to the second millenium BCE. India (and by extension China) never really had interest for most of their history until nearly the 20th century.

What do I think of interest? Well I can see why the Indian Vedics were suspicious of it. I think it's rocket fuel that has a force multiplier effect that has no equal, and should be treated like the powerful rocket fuel it is. I think it's both useful and dangerous. As a rule of thumb, keeping interest below 10% can be sustainable, but interest above 20% is difficult to sustain. That's just the beginning of what is needed to regulate interest and I won't get into all of that here.

Viruses are also useful. We now increasingly use viruses to combat genetic diseases like Sickle Cell Anemia. And it was only through the study of extremely weird and obscure viruses that emerged out of novel pools of diversity that we ever even came to discovering that it's possible for a virus to re-program our genetic code to cure disease. It's actually one of the most mind-blowing inventions of the 21st century.

And this now brings us to why I love CSV (comma separated values) Files for storing structured data more than any other format out there...
In short, it's because CSV is an organic standard that was selected from a primordial soup of diversity.

There are many file formats out there for structured data. If you are on the business side, it's probably Microsoft Excel.
Or maybe it's 'CSV'

If you are the technology side it's probably something like XML or JSON or YAML or Pickle, or if you work with 'big data', either Parquet or Avro or ORC.
Or maybe it's CSV.

What do I mean by CSV being organic?
What I mean is that the CSV format was selected and refined through a process of diversity similar to how interest/usury was sparked out of Mesopotamia's diversity and similar to how powerful viruses are selected from the novel mingling of animals.

From best I can tell, there is no inventor of the format as it is today - it merely emerged as a standard that was retroactively made to appear as a top-down engineered standard after the fact.
You can tell this from looking at this list of competing file formats and from the CSV RFC introduction. CSV is the only standard with no clear inventor on this file formats list. It's just there with a retroactively defined RFC (RFCs are standards developed by the IEEE, a governing body of the Internet).
This article provides a brief history of how the standard may have originated and points out that it emerged around the same time as the Relational Database (RDBMS). There were lots of other standards at this time too. But the CSV has been embraced by folks on both the Business side and Technology side of most organizations, so there is something about it that has universal appeal. I think it's because it's both compact and easily readable and almost reminds of lists we might otherwise read in a book or newspaper. I also think its popularity has to do with the fact that it closely aligns to the Relational Model - a perspective neutral approach to data modelling, which I'll explain later.

No other data file format has been embraced by both the Business side and Technology side of organizations like the CSV File.

There is power here.

Why is it not even more popular then?

There are a couple main reasons why CSVs frustrate technologists.
I'm going to explain those reasons below, but will indent the text so you can skip over these reasons if you don't really care about what technologists think.

First, because CSVs were never designed as a standard in the first place, the encoders do not always encode information consistently. A common situation is when you have a comma in the text value itself. This in turn has led to another organic invention of simply requiring that values that contain commas must be enclosed in double quote characters. Furthermore if the string contains both a quote and a comma, the quotes within the text value must be 'escaped' by preceding the character with another double quote.
If I wanted to encode this text in CSV: Hello, "world"
I would encode it like this: "Hello, ""world"""
That's basically the only rule, it's all you need to know.
But as simple as that rule is, many programs that export to CSV neglect to enforce this rule (or at least used to - most programs no longer make this mistake).
Second and more recently (well since around 2010 when Hadoop came on to the scene) there has been another criticism of CSV: CSV files rely on newline characters as record delimiters while permitting those same newline characters to be enclosed in quotes (just like commas).
Most systems don't have a problem with this, but in the world of Hadoop there are libraries known as "Sedars" (Serial Decoders), and the most popular Sedar for CSV files has made it a rule that CSVs cannot contain newline characters in string values. The reason is that Hadoop operates on a "divide and conquer" paradigm (known as map/reduce), and wants to split data off into chunks as efficiently as possible with the smallest chunk being a single record. So if you can assume that all chunks are separated by newline characters, end of story, then you can more efficiently split up files for sub-processing by respective worker nodes. But if you need to perform the additional step of looking for enclosure characters then it is harder to achieve the same level of processing efficiency.
So if you ask me, it's just a matter of correcting the Sedars and respecting the CSV format's simplicity. In other words, I should be able to choose a Sedar that doesn't have this limitation. But for reasons I'm not aware of, this doesn't seem to be in the offing.

While pushback against CSVs mainly comes from technologists, business users will often prefer the Excel workbook format over CSVs. Why is that?
The reason comes down to Metadata. CSVs just contain the data values with little or no information about how those values should be interpreted. The only information that does exist (if it is even written out) is the header row. Excel workbooks also allow you to bundle together multiple worksheets, which would require separate CSV files.

So I have two responses to the Excel advocate: First, the CSV does have a sufficient amount of Metadata for most list taking purposes and is extremely lightweight. Case-in-point: I maintain several key lists just on the notepad app on my phone. I'll share a portion of one with you here right now (you could even paste this into a text document, save as a CSV, and boom you have a database you can load into a BI application like Power BI or Qlik or Tableau or just Excel):
-----
Restaurant,City
El Coyote,Los Angeles
Rutt's Cafe,Los Angeles
Denver Biscuit Co.,Denver
Bubby's,NYC
H Bar,Toronto
Peppercorn,Las Vegas
The Argo,Vancouver
-----
I keep similar lists for books and movies.
I could use an App I suppose or some services, but them I'm locked into their tools making it tricky to blend that information with other data I have. Those Apps also don't allow me to visualize and analyze the data in my own way.

The other reason why Business users might push back on using CSV is that Excel allows them to easily blend Metadata such as text formatting (e.g. underline, bold, italics) and formulas (e.g. A1+B1).

But Metadata is just data, and if Microsoft wanted to, they could separate out all this information into one or more additional CSV files that are then bundled with the underlying data CSV files, in a single folder and ZIP file while presenting everything to the user in a neatly integrated form. Microsoft does do something like this behind the scenes already. Although they are doing so in proprietary formats, rather they are doing what could have been accomplished in zipped CSV files in proprietary formats instead.

These proprietary formats like .xlsx and .xls and .xlsm are all well and good if you are working entirely within the Microsoft Office ecosystem, but even if you move to Microsoft's Azure platform, the Excel workbook is persona-non-grata. So it's limited there.

Microsoft's decision to store structured data in proprietary formats was probably made during a time where the performance of opening and saving an Excel document was a major concern. This is the same thinking that led many mainframe developers to store dates using a 2-digit year as opposed to the full 4-digit year.
Since the whole 'Y2K' problem (which is still ongoing - just look at your credit card's expiry date) went down, very few people now would see it being sensible to store a 2-digit year given that we are routinely storing movies, pictures, and other forms of unstructured data in volumes that dwarf most structured data stores.
Storing all structured data and Metadata in CSV format would surely annoy some technologists. But it would also provide a way of flattening away today's database bureaucracy which does far more damage than any minor performance issue.
But if Microsoft were to do that, they might lose their grip on spreadsheets. So they don't much motivation to see this happen. It would only benefit the vast majority of people, but just not them.
Oh well.

Only the humble CSV files seems to bridge both the Business and Technology worlds. It is the closest thing we have to a lingua franca for structured data.

But it goes deeper than just this encoding. CSVs also nudge us towards 'perspective neutral' data models (i.e. normalized models) rather than lock us into hierarchical perspectives.

Take for example the JSON format which is popular among developers and is also a text format similar to CSV. Yet, unlike CSV it prefers a hierarchical representation of data.
Using my above example of Restaurant and City which has one row per relation, a JSON file format might decide to encode this as a parent-child relationship with the parent being the City and the child being the Restaurant, especially if there were many restaurants listed under a single city.
To look up any Restaurant I would first need to traverse the parent City in order to get to it. This is also known as "pointer chasing" and is the reason we shifted from Network and Hierarchical databases (like IDS and IMS) to Relational databases starting in the 1970s and throughout the 1980s and 1990s. The problem with hierarchical data models is that they tend to lock you into a certain perspective whereas the relational model (which CSVs align to) work more like the human mind, where we can jump from list to list to list without any preconceived hierarchy.

Because most data developers now prefer the relational model, I routinely see JSON documents that are encoded in such a way as to not be hierarchical and simply be lists just like CSV files. Ironic, but expected. CSVs on the other hand are much more compact and readable than any JSON document for this purpose.

It's also for this reason that spreadsheet tools like Excel cannot directly open JSON documents. In order to analyze a JSON document in a tool like Power BI, it is necessary to "flatten" the hierarchies into explicit rows and columns. Defenders of hierarchical formats like JSON point out that they are useful because they allow us to blend a number of related entities into a single document. It's similar to the reason why Business users like to bundle multiple worksheets into a single .xlsx workbook.
My response to this is: Why not just store the CSVs in a single folder? You can then easily move the folder as a ZIP or RAR file, and its contents can more easily be analyzed as rows and columns, and more easily combined in novel ways.

As a data architect, the CSV feels like that Mesopotamian clay in hand. I can employ the power of relational algebra to combine and analyze CSVs at the speed of thought - pivoting from entity to entity as my stream of consciousness takes me. Tools like Power BI and Qlik Sense allow me to rapidly connect these sets together like snapping together Lego blocks.
Give me a set of CSVs and some interesting questions at 9 AM, and I'll have your answers and insights by noon.
But give me a hierarchical format, and I'm mucking around with with an unwanted homework assignment of picking apart hierarchies and flattening them into CSVs that can be manipulated like my beloved Lego blocks.

Why does any of this matter?!?
Well this blog post is really just warming up for something much bigger: A rethink of how we manage and access all structured information.

And what I will attempt to demonstrate in the near future is that it is possible to manage ALL structured information through the humble CSV, if we have the collective will to make this happen. Think of CSVs as universal well worn data Lego blocks.

And why does that matter?
Most information problems can be boiled down to a combination of human and technology bureaucracies leading to data gate-keeping and information asymmetry. In many cases gate-keeping is desired and with merit. But in most cases I come across, data gatekeeping is enabled through a by-product of technological bureaucracies built on an alliance of information system vendors who build the systems, consultants who implement these systems, and custodians that operate these systems.
Show me a software vendor working together with consultants and custodians, and I'll show you a locked-in service layer that often results in confusion and bureaucracy to those on the wrong side of the service.

Perhaps that's the way it has always been with Information Technology. Cyber Crud has been with us for a very long time, just ask Ted Nelson.
I hope to show you that this is not how it has to be.

Stay tuned...

Monday, March 30, 2020

Does this Patent (which expires today) Explain Qlik and Power BI’s Opposing Philosophies?

Over the past five years or so I have taken a deeper interest in philosophy. More to the point, I have become especially curious as to WHY certain philosophies are adopted over others.

If you have had the chance to listen to my podcast “Cradle of Analytics”, you may know that I spent ten episodes (and nearly nine hours) explaining the “Origins of Analytics”, and why I believe that so-called “deductive-analytical thinking” is at the root of Western culture and philosophy. While the capability to thinking deductive-analytically comes from writing, the appetite for this mode of thinking can be traced back to Mesopotamian financial interest based contracts from as far back as 5,900 BCE, starting with the Ubaidian culture, followed by Sumerian culture who developed temples into early banking institutions. The Phoenicians would ultimately transmit this knowledge of usury to the Greeks in the 9^th century BCE.

In my view, Western culture and philosophy has more-or-less been adopted as a “meta-philosophy” that subconsciously affects the core of our very thinking. It is part of our identity now. I argue in the podcast that this appetite for “deductive analytical” thinking (or “digital thinking” as opposed to “analog thinking”) emanated from Mesopotamia and was eventually codified by Aristotle in the 4^th century BCE, and that this codification of deductive logic (the “syllogism” as described in Aristotle’s “Prior Analytics”) quickly led to the development of the “axiomatic method” (as codified by Euclid), and would eventually lead to the scientific revolution as best exemplified by Johannes Kepler (superseding Galileo Galilei) and best codified by Charles Sanders Pierce (superseding William Whewell). This in turn has led us to humankind’s ultimate product of thinking efficiency: The invention of the Turing complete programmable computer. This invention will eventually automate all rules-based tasks – if we allow them to, and possibly allow humanity to expand across the universe in a scalable manner – if we have the will.

In other words, I believe that there is a vein of history that you can trace from today back nearly 8,000 years and see that there has been for some time evidence of “digital thinking”. In fact, all Western religions (Judaism, Christianity, Islam) run off this same basic operating system of so-called “sovereign laws” which are structured as ‘If A then B’. Digital code runs deep in our collective imagination even if we don’t realize this consciously.

Digital thinking is often wrong – but it is always efficient.

I should point out that non-Western “analog” thinking tends to be based on – you guessed it – thinking through analogies. Charles Sanders Pierce coined the term “abduction” which describes a specific application of analogy thinking to solve problems. Abduction is very difficult to explain and is often conflated with deduction, and philosophers and psychologists argue over what it is to this very day.

The point is, digital thinking tends to be more top-down and is efficient because it leverages what we might call “artificial intelligence” (which is to say formulas and rules that have already been tried and tested), but can sometimes lead you astray, often with disastrous consequences. Whereas “analog” thinking leans more on abduction and is where genius resides, but is less powerful without the aid of formulas and rules to build upon.

I could go on here, but I don’t want to digress too much.

If you are specifically curious about the origins of Western philosophy, just listen to the podcast.

The reason why I bring this up and the main point I am making here and in my podcast is that I have come to see top-down thinking and bottom-up thinking as philosophies unto themselves and we often don’t realize when we are embracing one philosophy and not the other. In this blog post I will be explaining and demonstrating that Microsoft Power BI has adopted a more “top down” philosophy whereas Qlik offers a more “bottom up” philosophy, and why I believe the “bottom up” philosophy is often ignored even though I believe it offers advantages and benefits over Microsoft’s “top down” philosophy.

Put another way, I might just label these two philosophies (which can both to a large extent be followed within Power BI):

The Tao of Microsoft Power BI data modelling
The Tao of Qlik data modelling

But before I show you the example which illustrates the differences I should go over some history…

To understand the relationship between Power BI and Qlik, you need to back to the origins of Qlik and Power BI. Here is my abridged version of Qlik and Power BI’s history (I apologize if I have made any errors here):

In 1993 the first version of QlikView (originally called “QuikView” after the acronym: Quality; Understanding; Interaction; and Knowledge) was released. QlikTech (the owning company) was [and to a large extent still is] based in Sweden. The first two versions of QlikView were built on Microsoft Excel.

What differentiated QlikView from other BI tools was and is its colour coded filter boxes (known as “List Boxes” in QlikView and “Filters” in Qlik Sense). I refer to this capability as Qlik’s “State-Aware User-Experience” or “State-Aware UX”.

In 1997 version 3 of QlikView is released. But it has now been entirely rewritten in compiled “C” language for maximum efficiency.

This version of Qlik also codified two other aspects of Qlik that would become both indispensable and difficult to explain:

A SEMI-JOIN link indexing system for efficient semi-joins

The main benefit of this is that this allows us to easily integrate multiple fact tables without running into a “fan trap” that leads to duplicates which normally occur when JOINing multiple Fact tables to form “cubes”
The other benefit of this is that this enables self-serve Data Prep, since it is possible to naïvely (or playfully) integrate new entities in a modular way that is safe and does not corrupt existing entities through merge/join set operations which can lead to “fan trap” duplicates

An associative storage and query engine. This associative engine stores every unique data element value exactly once with all tables and columns pointing back to a reference of this value – regardless of whether the value is a number, a date or a string.

The benefit of this associative index is to both:

Improve data compression; and
Facilitate full text and numeric range searching, so that even if you don’t know where a specific value might be located in the data model (e.g. your last name), you can easily search against all rows and columns instantly from a single search bar through this “associative index”. It’s really quite a useful feature

It should be noted that Qlik refers to ALL of its innovations under the banner “Associative” even though this is the only feature that is strictly associative (based on the academic definition of “associative”)

To summarize, here are Qlik’s salient features that distinguished it from its competitors at this time (and to a large extent even to the present day):

State-Aware User Experience
Linking SEMI-JOIN Model
Associative full text searching

Even with these benefits Qlik did not at this time displace incumbent BI (Business Intelligence) platforms like Cognos, Business Objects, Microstrategy, and Microsoft Analysis Services, in the same way the automobile displaced the horse.

One possible explanation for this is that most BI platforms are never used for business analytics to begin with and are merely glorified data extraction engines, whereby users will simply:

Find a report (or reports) they need
Download all reports into an Excel friendly format (e.g. CSV, TAB, XLS, XLSX)
Open the CSV or Excel documents in a new Excel workbook and begin a process of manual Data Prep usually entailing: Cut-copy-paste; VLOOKUP (to integrate data); and Pivot Tables (to summarize data)
Present the final output in the form of a beautified Excel document or PowerPoint/PDF Report

That notwithstanding, there is a need for interactive dashboards that go above and beyond glorified data extraction engines, so you would think that Qlik would be more successful during this time with so little competition.

Unfortunately, this is not how technology and innovation works.

Henry Ford is said to have once said “If I asked people what they wanted, they would ask for a faster horse.” With Ford here implying that people would not ask for an automobile because they could not conceive of it even though they would clearly benefit from its efficiencies. History has been kind to this quote (Ford himself was not a very nice man).

With something like an automobile it is easy to see its benefits and why one would want an automobile over a horse. But most innovations are not as apparent as most inventions are often just a cog in a bigger machine. Unless you can overturn the entire machine itself (and not just improve parts of it), then you are always in a vulnerable position as an innovator.

This unfortunately is where Qlik has always been; A set of clever – even brilliant - inventions but not enough to overturn how the entire eco-system that Business Intelligence software thrives in. This eco-system includes databases, ETL tools, Semantic Layers, and Data Visualization tools, not to mention the IT professionals who have staked their careers in learning these tools and how they interact with one another.

I will continue to explain Qlik’s history, but first I should point out when I first came to learn about Qlik and how I saw its history unfold before my own eyes.

It was in 2005 while on a vacation in Florida. I reconnected with a friend of my wife who had recently become very interested in Business Intelligence. She had heard that I had also worked in BI and wanted to get my opinion about some software she and a colleague had developed some dashboards with.

At the time I was somewhat new to BI but had read Ralph Kimball’s “Data Warehouse Toolkit” and had learned the Cognos stack. I understood the ins-and-outs of “cubes”, “dimensions”, “measures”, and detail-drill through reports. In my mind I felt somewhat confident that I knew what “Business Intelligence” was and how it could help companies.

But what I saw in that demo that morning over brunch blew my mind. The first thing I noticed was the speed. Everything was instantaneous with no lag. She would click in a chart and everything on the page would appear to magically update itself to be consistent with her selections. I then noticed how she would casually go to any field and filter on it. It felt like the world of Dimensions and Measures I had been accustomed to had been flattened and that hierarchies were something arbitrary, and that one could merely follow one’s own train of thought to answer any business question.

I was blown away and could not stop thinking about Qlik.

After returning from my vacation I continued to think about the demo I saw. It just didn’t line up with what I knew about BI through Cognos and I would mention it to people who I thought were BI experts and they never heard of it.

When I searched Google for reasons Qlik was not more popular, I eventually found one that made some sense to me: Qlik’s indexing technology required all data be loaded into RAM (computer memory) and that RAM was scarce so Qlik was not a realistic technology for many large companies with lots of data. A big problem was that most computers ran on 32-bit operating systems that could only reference 4 GB of RAM in total and only 2 GB of RAM for any given program.

But around 2008 something started to change. What happened was this was around the time that Microsoft “Windows 7” came out and Windows 7 was selling more 64-bit versions than 32-bit versions. In turn this meant that 64-bit versions of applications could now address up to 2 TB (terabytes) of RAM which at the time was more than most companies had in their data warehouses. As a result, Qlik was now positioned to take on large volumes of data and began to surge in popularity.

This would be the beginning of a new golden era for Qlik.

By 2009 Microsoft caught wind of this disruptive trend and had announced a new Business Intelligence platform was under development. This platform was code named “Project Gemini” and was slated for an initial release by the end of 2010. I followed this news closely and recall Project Gemini having the most lead time of any Microsoft project to date. In other words, it felt like Microsoft was firing a shot across the bow against Qlik.

I was excited by this news. I liked the idea that Microsoft was embracing Qlik’s innovations and looked forward to having an alternative to Qlik with all the same benefits. I had experimented with other tools like Tableau and Microstrategy and others but because they were all built on cube/OLAP based platforms (even Tableau), they lacked the flexibility, responsiveness, and User Experience that Qlik had.

By 2010 Microsoft had released the first deliverable from Project Gemini: PowerPivot, a plug-in for Excel.

After experimenting with PowerPivot for while I could see its potential. I was relieved to see Microsoft had essentially copied Qlik’s SEMI-JOIN linking model which Microsoft dubbed “The Tabular Model” and they also introduced a new language to go along with the Tabular Model called DAX (Data Analysis Expressions), replacing the older MDX (Multi-Dimensional Expressions) language which had been built for the “Multi-Dimensional Model” which was now seeming obsolete. Currently, MDX is no longer supported by Microsoft in their cloud products – it’s Tabular and DAX all the way now.

I warned BI sales people that Microsoft was just warming up and had laid the foundation for disrupting the BI industry given their huge platform leverage – more so with their Office suite than the Windows platform.

I am aware of Microsoft’s history as the “fast follower” who gobbles up disruptive technologies like Qlik in the same way Star Trek’s “Borg” assimilates new life forms into its Cube Colonies. Microsoft is after all a creature of capitalism, and capitalism is all about leverage, and Microsoft’s platforms give it huge amounts of leverage. It’s not even close to an even playing field in the software industry. Case-in-point: There was a big law suit that ended in 1999 with Microsoft being ordered to be broken up after destroying Netscape (and countless other small companies). But then George W. Bush was elected and that judge’s decision was overruled and basically canceled. This was still a black-eye for Microsoft and they behave more gently than in those days – some say Google would not exist had this not been the case – but few people know anything about Business Intelligence (it’s not quite a consumer technology), so I figured Qlik would be Microsoft’s crosshairs.

To be clear, I am speculating here. I don’t have any documentary evidence to present here that Microsoft was indeed “following” Qlik; This is an educated guess.

Regardless of their motivations, Microsoft had developed a tool that for once presented a formidable competitive product to Qlik.

It should be noted that Tableau was also disruptive (their golden era would arrive around 2012) as they had developed a very user friendly “drag-and-drop” interface for self-serve BI. They also invested heavily in engaging data visualizations. It has been said that Narcissus was seduced by the image and in this same way Tableau was very seductive; it had superior aesthetic to its competition. Tableau also wisely invested in academia by giving its licenses for free to all college and university students. But Tableau eventually would see its own growth stunted for it lacked the depth of tools like Qlik and Power BI, and many users would often hit brick walls concerning scalability and performance with Tableau.

Enough about Tableau.

It would not be until 2015 that Power BI was released. The main reason for it taking so long between the release of PowerPivot and Power BI is that Microsoft unfortunately lost time on a fruitless detour. Namely, in response to the criticisms of PowerPivot being too much of an Excel tool and not like a “real BI tool”, Microsoft’s answer to this problem was to invent “PowerView”. PowerView was a server based product that would finally allow IT departments to roll out “Enterprise Business Intelligence” solutions through the users web browser. The technology was built an another Microsoft technology called “SilverLight”. Silverlight ended up being shut down as it was deemed by the wider technology community to be something of a withered rump following a keynote speech by Steve Jobs where Jobs stated that he would not support Adobe Flash on the iPhone as it was seen by Apple as a battery hog. Instead, Jobs went on, Apple would throw its support behind the open HTML5 standard which could be made more energy efficient to run. That was then death knell for Adobe Flash and would be the death knell for Microsoft Silverlight, and in turn PowerView.

In 2015 Power BI was released – built entirely on the open HTML5 standard that Steve Jobs was hyping for the iPhone.

In 2016 I was a consultant working for a large business and technology consulting firm. We were technology advisors and perhaps I was lucky, but I can genuinely say that the people I was working for were happy for me to recommend Qlik as the best Business Intelligence software with Power BI being the next best alternative starting in 2015.

Nevertheless, I couldn’t help but notice that even if I could convince a business department that Qlik was the way to go, I would inevitably get blocked by an IT department that would claim they had decided some other tool was already decided upon and that there was no room for Qlik. I would explain the reasons why Qlik was able to meet business scenarios better than the tool they had decided on. Sometimes I would even be able to persuade the person I was speaking with, but they would inevitably send me an e-mail explaining the decision was out of their hands.

Some of my colleagues at the consulting firm I worked for took notice of Qlik and admired it, but if their careers were deeply intertwined with Microsoft or another large vendor, they would continue to cast aspersions of doubt around Qlik’s ability to do this or that (usually scaling to beyond 2 Terabytes). But as soon as Power BI started to gain traction, the Microsoft consultants were quick to forget their skepticism of Qlik and were almost instantly True Believers of Power BI.

It was at this point that I could see the power of Microsoft’s leverage working from all sides.

Part of me was happy to see that Qlik had a strong competitor as it gave me more leverage when negotiating license terms with Qlik. I had also hoped that others would see the underlying relationship between Qlik and Power BI. This has barely happened.

I mentioned earlier that “Narcissus was seduced by the image”. If you are lucky to get a captive audience and show them a 10 minute demo of some BI dashboard, no one in the audience will be able to tell you anything about the underlying data model and how it works. It’s invisible.

And when it comes to features like State-Aware UX and Full-text associative searching, unless you have come to appreciate those benefits you will lack the connoisseurship required to appreciate these things.

To use an analogy, when it comes to Business Analytics, most people’s level of connoisseurship would barely allow them to tell the difference between an overcooked Salisbury Steak and a perfectly cooked Chateaubriand.

If this analogy comes across as snobby and out-of-touch, well that’s partly my point. The finer things in life are typically out of the reach of most people – not as much for lack of money, but mostly for lack of education. Yes, the two are related, but it is also possible to have one without the other, and in this context, connoisseurship is more important than money, and connoisseurship (refined education) can be achieved without a lot money if you are passionate about the subject at hand.

This finally brings me to the point of this blog: I have recently concluded that while Microsoft may have been a “fast follower” of Qlik, the Microsoft Power BI culture lacks the connoisseurship of Qlik and has – in my opinion – subconsciously embraced a philosophy that is more top-down and cube-like than what it is truly capable of.

What are these two philosophies exactly?

Qlik’s philosophy tilts bottom-up, and believes that all Business Intelligence should work like the human mind – where we can hop from concept to concept with no preconceived structures

The example they often site is how our thought process works if we lose something like our keys:

We scan through a list of places we have been (e.g. outside walking, driving, going to the mall, etc.)
Then we drill-down on subjects that we think are more relevant; We might start to list out all the places in the shopping mall we visited
Eventually after jumping from list to list, we eventually find what we were looking for “I left my keys in the shoe store in the mall when I put them down to pay for the shoes”

Microsoft’s philosophy tilts top-down and embraces the perspective of the CEO

Time should be spent up-front planning out Dimensions and Hierarchies that in turn reflect a “single version of the truth”

We always start with Big Rocks (e.g. Years and Countries) that can be broken into smaller rocks (Dates and Towns)
Outliers can be discovered by drilling progressively from Big Rocks down to Little Rocks
A key outlier will eventually be discovered that can be used to drive better results

I should point out that Qlik’s philosophy allows for an embrace of both the bottom-up and the top-down. But Microsoft’s top-down philosophy is more difficult to invert.

Put another way “Splitters can be lumped more easily than lumpers can split”

https://en.wikipedia.org/wiki/Prefactoring

So in my view, Qlik is more of a “Splitter” philosophy with Microsoft embracing more a “Lumper” philosophy. As you shall see, it is possible to make Power BI act like Qlik. Well almost.

To explain more of I mean here and how I came to this conclusion, I have created an example that I believe clearly illustrates these two philosophies.

In this example I have a small dataset that includes 2019 populations for three countries:

Australia;
Canada; and
United States

This dataset includes both national totals as well as State/Province totals for each State or Province. Although I have not included any non-state districts or territories. So the country total is not exactly the same as the total for all provinces and states.

I have instrumented this dataset such that it supports “natural linking”. Namely, I have named each column/field such that I am indicating to both Qlik and Power BI how the tables ought to be linked.

In the Qlik’s case, linking is linking. There is no option in telling it how to link apart from the fields to connect.

But with Power BI, a link may be either “Single” direction or “Both” directions (also known as bi-directional links). So it is here we can see Microsoft showing its hand when it makes one link “Single” and other “Both”

I have pasted below snippets of the raw data so you can see how the fields are laid out:

Dim_Country

Fact_Country

Dim_StatProv

Fact_StatProv

Once I load the data into Power BI as is, Power BI automatically links up the entities based on a combination of looking at Field names and profiling the underlying data cardinality.

Here is Power BI’s automatically generated the model:

I have highlighted the key fields so you can see clearly how the tables are linked.

Notice how the link between Dim_Country and Dim_StatProv flows in one direction, top-down from Country to State/Province.

Now if you look at the Qlik model you can see it’s more-or-less the same, with the main difference being that all links are bi-directional (all links in Qlik are always bi-directional – there is no Single direction linking in Qlik).

If you just look at these diagrams it will not be clear how this link direction actually impacts the User Experience, so I have built the same dashboard in both Power BI and Qlik Sense so we can get a better sense of how this subtle difference changes the user experience.

To give the dashboards something to analyze I have also created the following three Measures:

Total Country Population

Defined as: SUM([Country Population])

Total State/Province Population

Defined as: SUM([StatProv Population])

State Total Percent of Country Population

Defined as: SUM([StatProv Population]) / SUM([Country Population])

I should also remind you that “Total Country Population” does not equal “Total State/Province Population” because I have left out any regions that are not strictly states or provinces. For example, DC (District of Columbia) and the Yukon Territories are not included in the State/Province populations but are included as part of “Country Population”.

Let us take a look and compare the Power BI and Qlik dashboards respectively, starting with this Power BI Report (dashboard):

In this Power BI Report we can see everything looks quite normal and the numbers all appear to make sense except for the bar chart on the lower right (where I have placed a large red ‘X’) which shows the States/Provinces ranked as a percentage of their country’s population.

We can see California is ranked at the top of the list, when the correct answer should be Ontario. The reason why we get California is that the Single Direction link is preventing Power BI from linking the State’s population back to its parent population, so it is comparing against the population of all three countries. This phenomenon is known as a “chasm trap” and Power BI has stepped right into this trap so to speak.

That said, there may be ways of getting around chasm trap this using advanced DAX expressions; I don’t want to mislead readers in thinking there is only one way to solve this problem. My point is to show how the default assumptions Power BI makes when creating links can lead to misleading totals when using Measures that aggregate across linked fact tables.

For comparison let’s look at Qlik’s dashboard:

In this Qlik dashboard we can see that Ontario is ranked at the top of the list with 38.72% of Canada’s population. I have placed a large green checkmark to make it easy to find. You can also see that most of the States/Provinces on this list are from Australia or Canada as you would expect. Whereas in the Power BI Report, the top of this is mainly made up of large US states.

The reason why Qlik gets the correct answer here is that its links are all bi-directional (“Both” directions) with no “Single” links at all.

Now let’s look at what I really wanted to demonstrate here which is the interface for making field selections (generally known as “filtering”).

In this first example Power BI this is how it appears when a country is selected in the Country Slicer:

As you can see, the interface is intuitive. We can see what country (Canada) is selected as well as the alternate countries that are possible to select. Furthermore in the “State/Province” Slicer we can see all the provinces that are specific to Canada while all of the other States in both USA and Australia are completely hidden.
We can also see that when we select a single country that our chart in the lower right corner is now producing the correct results, since have effectively eliminated the “chasm trap” by explicitly filtering on a single country.

If my objective is to drill down from the top (e.g. Countries) down to the bottom (e.g. States/Provinces, or perhaps lower down), then this presentation and UX is decent.

Let’s compare this to Qlik’s dashboard with the same selection:

As we can see the same country “Canada” has been selected and we can clearly see the 10 provinces of Canada with a white background. But where things differ is that we can now see a list of the excluded States from the other countries. While our analysis is not strictly concerned with these regions, they are available for selection should we choose to change our course of questioning.

I’ve already shown you a couple minor points of contrast between Power BI and Qlik, but now I want to take you to the most salient point of contrast between the two tools. Namely, what happens when we select a “State/Province”.

Here is how this appears in Power BI when we select the province of Ontario:

In my view, there are two issues with this selection. First, I can see at the bottom right that my percentage calculation is off (as was the case earlier when no selections were made). Ontario is only showing as 3.70% of the total population when it should be 38.72%.

What is new and most frustrating about this is that when we look at the Country Name Slicer we cannot easily see what country Ontario belongs to. Yes, I know it’s obviously Canada, and most people know their own states or provinces. But I can imagine there are some Americans who might not know if a non-American State/Province is Canadian or Australian, and this dashboard will not tell us that unless we exhaustively go back to the top and click on each country one by one.

Let us finally compare with the Qlik example:

As we can see the chart in the bottom-right is producing the correct result of 38.72%.

More significantly, we can clearly see that Ontario belongs Canada to by looking at the “Country Name” Filter box.

The reason why this is significant is that Qlik is allowing us to go back up the hierarchy but choose any path we would like. In other words, Qlik is basically telling us the hierarchy need not be traversed as a hierarchy but however we would like.

And this gets to the essence of the “tao of Qlik”: Qlik is allowing us to ignore the typical user constraints of hierarchies and instead invites our minds to wander.

To a cynic, this idea of “wandering” through data and breaking down hierarchies might come across as some post-modern hippy dippy mumbo jumbo. But in my experience this flexibility has led to tangible benefits.

You see Analytics is basically a game of “20 questions” (if you have never heard of “20 questions” – it started as a British gameshow and these days is a fun game you can play over a dinner table with just two people). The more questions you can ask the faster you can drill down to the final answer. I have used Qlik’s flexibility to both solve business problems that were unsolved as well as develop very flexible dashboards where I was told after demoing the dashboard “How did you do this so quickly? Was it you or was it the tool?” where I respond: “I appreciate the complement, and I am proud of my work, but I would not have been able to complete it so quickly without this Qlik tool and its unique features.”

I am a critic and skeptic at heart and the idea of rallying behind a for-profit corporation is not how I see myself. I am somewhat uncomfortable lavishing this praise on Qlik. The point of this post is to show that my objections to other tools are not based on some “Coke versus Pepsi” preference, but are rather rooted in fundamental philosophical differences, and I wish that other tools and vendors would get this point and start competing on the same philosophical ground that Qlik has laid out.

In that spirit, let’s go back to Power BI and modify its linked model to make that Single Direction link bi-directional to see how this changes things.

First here is a picture of the modified model so you can see the difference:

Now let’s take a look at the dashboard when there are no selections.

Here we can immediately see in the lower-right bar chart that that the Percent calculation is correct and Ontario is at the top of the list.

Now let’s take a look at scenario where we select a State/Province:

Here too, we can see something of an improvement as Power BI is telling us that Ontario belongs to Canada. But we cannot see the alternative countries now, whereas before when there was a Single direction link, we could.

So by replacing that Single direction link with a Both direction link, we have gone from the “tao of Microsoft” to the “tao of Qlik”.

But there is still a bit of a User Experience trade-off with Power BI.

Why is that?

I believe the answer comes down to this patent held by Qlik: US Patent # 6037938 with the inventor Hågan Wolke

https://patents.google.com/patent/US6037938

The patent is called “Method and a device for displaying information about a large number of independent data elements”. Here is the abstract of the patent:

“A method of displaying a large number of interdependent data elements in a computer screen area that is small compared with the number of data elements to be displayed is disclosed. Each data element is defined by a data element type and a data element value and has an associated status value. The method comprises the steps of displaying the data element types of the data elements as a scrollable list in the computer screen area, and, for each data element type which defines more than one different data element which has a predetermined status value, displaying a predetermined indication thereof in association with the data element type in the scrollable list; and sorting, in response to a change of the status value of at least one of the data elements, the data element types in the scrollable list according to the status values of the data elements defined by the data element types. An article of manufacture including a computer-readable medium having stored thereon a computer program for carrying out the method is also disclosed.”

Can I say for sure this patent is the reason Microsoft is not displaying excluded values in a similar fashion to Qlik? No I cannot say that and I highly doubt Microsoft would ever confirm my suspicion for legal reasons.

But if I am correct in my belief, given that this patent expires today, March 30^th 2020, then anyone, including Microsoft should be able to copy this invention without legal repercussions in the US. I am not a lawyer so do not take what I am saying here as legal advice in any way shape or form. Also I don’t know what other jurisdictions the patent has been filed under and how that might complicate Intellectual Property claims. Also, I might be barking up the completely wrong tree – this could all be explained by underlying constraints to Microsoft’s technology that I am not aware of.

To conclude, I wish Qlik the very best and feel they deserve the fruits of their invention. But given that they never really got the Business Intelligence world at large to understand the benefits of their bottom-up philosophy to Business Intelligence this might be a good thing for the world at this juncture.

Still I feel bad for Qlik and I will continue to root for them.

But life’s not fair.

Essential Complexity

Thursday, October 08, 2020

Snowflake’s Market Valuation and Network Effects

Tuesday, April 28, 2020

The Power of Diversity: Why I Love the Humble and Organic CSV File

Monday, March 30, 2020

Does this Patent (which expires today) Explain Qlik and Power BI’s Opposing Philosophies?

My Web Site

Click the icon below to add Essential Complexity to your RSS reader (like My Yahoo!)

Blog Archive

About Me