Tuesday, April 28, 2020

The Power of Diversity: Why I Love the Humble and Organic CSV File

These days when we hear the word 'diversity', different things might come to mind for different people. Diversity - with modern connotations of immigration or affirmative action has infused the word with with notions of multiculturalism, gender equality, and other forms of social levelling.  I am supportive of that, but what I want to discuss here is diversity from a more abstract perspective.

Lest you feel that it's nigh impossible to discuss diversity without taking a political position, I will come out and say it right now: I am pro diversity. Anti-diversity people can stop reading right now. That's the end of this post for you. Have a nice day.

If you're still reading, let's take a step back and look first at diversity from a cosmic perspective...

Life itself here on earth emerged from a process known as Abiogenesis.  There's a lot we don't know about this process, but what we do know is that it came about through the power of diversity.  Abiogenesis involved mixing of complex molecules (proteins, possibly delivered through meteors from space), shifting temperatures and shifting movement (through tectonic plate shifting), shifting gravitational patterns (through the orbit of the moon), shifting light and radiation patterns (through the orbit of the earth).  Even the part of the universe we inhabit has been shown to have more atomic diversity than the other parts of the universe.  For example, phosphorus, a key building block for life happens to be available in our next of the cosmic woods but not most other places in the cosmos.  Without it we wouldn't exist.  It turns out, this element is not evenly spread around the universe, and we happen to be luckily placed in this cosmic soup of diverse elements.  We also need the right amount of radiation too so mutations may occur, allowing new forms of diversity to emerge and further evolve.

Given enough time and space, our existence is both inevitable and miraculous depending on how you look at it.  What sets earth apart from most other planets in the visible universe is not only that we have the right conditions to allow life to take hold once it arrives, but just as importantly the right amount of diversity so life may spring from chaos, and not just thrive in a simple state, but evolve to every increasing levels of complexity.

Turning to the present moment, COVID-19 is likely also the product of diversity.  Most epidemiologists agree that COVID-19 - like the SARS virus before it - emerged from animal markets in Wuhan China through natural evolution.  There is a long history connecting plagues to animal domestication. It has something to do with animals being brought to mingle together in ways that would never occur in the wild. Again novel diversity.

This is why Europeans who first emigrated to North America brought disease that wiped out most of the indigenous population. These diseases evolved in European farms through similar processes that led to COVID-19.  Conversely, Europeans that emigrated to South America were on the receiving end, constantly battling disease that evolved out of the tropical jungle's diverse ecosystem.

Bio-diversity may have also led to the extinction of Neanderthals. Our old cousins may have been wiped out or partially decimated by diseases that originated from Homo Sapiens homeland in Africa.

While it may seem that diversity of the kind I speak of here emerged from the "forces of nature" and is outside the purview of human life, I want to draw your attention to another example of how diversity shaped the trajectory of human civilization.

If you want to hear the full story, just listen to my podcast "Cradle of Analytics", and jump to episode 10 "Meet the Flintstones".  I'll summarize the key points for here:
Interest/usury which is the backbone of Lending, Capitalism, and Free Markets was not invented by anyone on record. It was not invented by Adam Smith. It was not invented by the Greeks.  It wasn't even really invented by the Sumerians who first practiced it over 5,000 years ago.
Rather, it appears have emerged out of a soup of diversity in the plains of Mesopotamia between the Tigres and Euphrates sometime around 8,000 years ago by who we now refer to as the Ubaidian Culture.

We believe that interest emerged during this period through the discovery of clay "counting tokens".  These "counting tokens" with symbols depicting animals and other assets were then placed in clay envelops (known as 'bulla') about the size and shape of a softball and fire kilned, preserving them to this day.

We cannot say for sure why or how these tokens were created and used, but there is something of a consensus among archeologists, anthropologists, historians, and economists that these counting tokens were the original contracts and they were for lending with interest.

It sounds almost impossible even anachronistic that people were taking out interest based loans nearly 8,000 years ago, but if you follow me I'll show this may have happened:

Imagine you are a sheep herder who owns a flock of sheep.  You use the sheep for their wool which you provide to your family and extended family (your tribe).  One day something happens, and a pride of lions destroys a quarter of your flock (there used to be lions in Mesopotamia at this time - but recreational hunters later wiped them out). The situation is dire but there is still hope. You have the skills to get your flock back up to its original size, you just need a couple of rams for breeding and a few extra ewes to fill the gap.  You also know of another sheep herder that is less than a days walk away.  Borrowing sheep in times of need is quite common and surely you would lend a few sheep to someone you know who is in need?

But as you are walking to the other sheep herder's town you realize that you don't really know this person that well and he is from another tribe and worships a different god. He might not feel the need to do you any favours. The fear of rejection is going through your mind as you keep walking.  You need to figure out how to sweeten the deal for this guy. Simply asking for help as an act of charity might be a bridge too far. So you are thinking "how can I repay this person to make it worth their while, when I have nothing to give up front"?  And then you are hit with an idea: You realize that once you can get the rams to breed more lambs you may have a small surplus of sheep, and you can pledge that surplus back to the lender. Win win. Bingo!

You might not be a mathematician, but you know how to count and how to add and subtract numbers for the purpose of keeping track of your sheep.  So through your basic knowledge of quantities you come up with a number of lambs that you feel is a fair amount.  After all, it is your skill as a sheep herder that is allowing these sheep to safely multiply (and now you're even better cause you have since learned how to deal with hazards from lions).

You eventually make it to the foreign sheep herder before sundown and manage to meet him in person.  You tell him of your predicament and then quickly explain that you realize you're not from the same family nor worship the same god but would like to make a deal that will make it worth his time.  You then explain to the foreign sheep herder that you will borrow 50 sheep from his flock for a year including at least two rams, and then you will repay him the entire flock of 50 sheep plus an additional 10 lambs.
The foreign sheep herder - who already has far more sheep than he or his tribe actually needs - considers this offer.  In his mind he might think to himself "Well I've always been proud of the fact that I own the most sheep in my town, but there is another herder who lives within a 2 day walk from me that I have heard has the most sheep of any herder. I have always envied and fantasized about being that person. I bet that I could be that person if I allow for this loan and it works out for me.  Heck, if I pull off a few more of these loans I could become the most powerful sheep herder in all of Mesopotamia."
And so one man's fantasy of salvation is married up with another man's fantasy of domination.
One last thing is required to seal the deal: A contract.

Fortunately, being near the Tigres (or Euphrates) there is no shortage of soft clay.  While you were on your way to meet the sheep herder you scooped up some clay and form it into a small discs - one for each sheep that you would repay. Using a reed you etch a small symbol on each token. Sixty (60) tokens in all. It took about an hour to do this.
On your instruction, your lender takes the tokens, bakes them in a kiln, and then takes the fired tokens and places them in a larger clay envelope.  You then imprint a symbol known to represent your tribe, this symbol is a representation of the god your tribe worships and who protects you.  Your god is now a party to the contract.  Don't mess this up or you could bring a much bigger catastrophe to your tribe. Before sealing this contract you make it clear what will happen if the sheep cannot be repaid: You will give your life to this man.  You will allow him to enslave you.  Hopefully this will not happen.
The contract has been signed and so the deal has been made.  Interest/usury has now emerged as a new concept.

If the more powerful sheep herder had been an enemy, our hero would not have bothered to make the journey. It would have been too dangerous.  But if the more powerful sheep herder had been part of the same tribe, our hero would not have felt the need to make such a deal and would have made an appeal to charity.  And if there was no clay to scoop up and shape along the way, none of this would work.

It is only through the ability of these individuals - each with their own selfish needs - to collaborate from which a new invention emerges and thrives and multiplies: Interest based contracts.
A lot of things needed to happen in order for this contract to emerge, but most importantly a diverse population competing for resources is at the core of this organic invention.

You can also look at interest as a kind of 'virus' just like COVID-19.  This is why interest/usury is expressly forbidden by all Abrahamic religions (unless it's for people in another religion - just don't charge people in your own religion).  The oldest critique against interest was discovered in India and dated to the second millenium BCE.  India (and by extension China) never really had interest for most of their history until nearly the 20th century.

What do I think of interest?  Well I can see why the Indian Vedics were suspicious of it. I think it's rocket fuel that has a force multiplier effect that has no equal, and should be treated like the powerful rocket fuel it is.  I think it's both useful and dangerous.  As a rule of thumb, keeping interest below 10% can be sustainable, but interest above 20% is difficult to sustain.  That's just the beginning of what is needed to regulate interest and I won't get into all of that here.

Viruses are also useful.  We now increasingly use viruses to combat genetic diseases like Sickle Cell Anemia.  And it was only through the study of extremely weird and obscure viruses that emerged out of novel pools of diversity that we ever even came to discovering that it's possible for a virus to re-program our genetic code to cure disease.  It's actually one of the most mind-blowing inventions of the 21st century.

And this now brings us to why I love CSV (comma separated values) Files for storing structured data more than any other format out there...
In short, it's because CSV is an organic standard that was selected from a primordial soup of diversity.

There are many file formats out there for structured data.  If you are on the business side, it's probably Microsoft Excel.
Or maybe it's 'CSV'

If you are the technology side it's probably something like XML or JSON or YAML or Pickle, or if you work with 'big data', either Parquet or Avro or ORC.
Or maybe it's CSV.

What do I mean by CSV being organic?
What I mean is that the CSV format was selected and refined through a process of diversity similar to how interest/usury was sparked out of Mesopotamia's diversity and similar to how powerful viruses are selected from the novel mingling of animals.

From best I can tell, there is no inventor of the format as it is today - it merely emerged as a standard that was retroactively made to appear as a top-down engineered standard after the fact.
You can tell this from looking at this list of competing file formats and from the CSV RFC introduction.  CSV is the only standard with no clear inventor on this file formats list.  It's just there with a retroactively defined RFC (RFCs are standards developed by the IEEE, a governing body of the Internet).
This article provides a brief history of how the standard may have originated and points out that it emerged around the same time as the Relational Database (RDBMS).  There were lots of other standards at this time too. But the CSV has been embraced by folks on both the Business side and Technology side of most organizations, so there is something about it that has universal appeal.  I think it's because it's both compact and easily readable and almost reminds of lists we might otherwise read in a book or newspaper.  I also think its popularity has to do with the fact that it closely aligns to the Relational Model - a perspective neutral approach to data modelling, which I'll explain later.

No other data file format has been embraced by both the Business side and Technology side of organizations like the CSV File.

There is power here.

Why is it not even more popular then?

There are a couple main reasons why CSVs frustrate technologists.
I'm going to explain those reasons below, but will indent the text so you can skip over these reasons if you don't really care about what technologists think.

First, because CSVs were never designed as a standard in the first place, the encoders do not always encode information consistently.  A common situation is when you have a comma in the text value itself.  This in turn has led to another organic invention of simply requiring that values that contain commas must be enclosed in double quote characters.  Furthermore if the string contains both a quote and a comma, the quotes within the text value must be 'escaped' by preceding the character with another double quote.
If I wanted to encode this text in CSV: Hello, "world"
I would encode it like this: "Hello, ""world"""
That's basically the only rule, it's all you need to know.
But as simple as that rule is, many programs that export to CSV neglect to enforce this rule (or at least used to - most programs no longer make this mistake).
Second and more recently (well since around 2010 when Hadoop came on to the scene) there has been another criticism of CSV: CSV files rely on newline characters as record delimiters while permitting those same newline characters to be enclosed in quotes (just like commas).
Most systems don't have a problem with this, but in the world of Hadoop there are libraries known as "Sedars" (Serial Decoders), and the most popular Sedar for CSV files has made it a rule that CSVs cannot contain newline characters in string values. The reason is that Hadoop operates on a "divide and conquer" paradigm (known as map/reduce), and wants to split data off into chunks as efficiently as possible with the smallest chunk being a single record.  So if you can assume that all chunks are separated by newline characters, end of story, then you can more efficiently split up files for sub-processing by respective worker nodes.  But if you need to perform the additional step of looking for enclosure characters then it is harder to achieve the same level of processing efficiency.
So if you ask me, it's just a matter of correcting the Sedars and respecting the CSV format's simplicity. In other words, I should be able to choose a Sedar that doesn't have this limitation.  But for reasons I'm not aware of, this doesn't seem to be in the offing.

While pushback against CSVs mainly comes from technologists, business users will often prefer the Excel workbook format over CSVs. Why is that?
The reason comes down to Metadata.  CSVs just contain the data values with little or no information about how those values should be interpreted.  The only information that does exist (if it is even written out) is the header row.  Excel workbooks also allow you to bundle together multiple worksheets, which would require separate CSV files.

So I have two responses to the Excel advocate:  First, the CSV does have a sufficient amount of Metadata for most list taking purposes and is extremely lightweight.  Case-in-point: I maintain several key lists just on the notepad app on my phone.  I'll share a portion of one with you here right now (you could even paste this into a text document, save as a CSV, and boom you have a database you can load into a BI application like Power BI or Qlik or Tableau or just Excel):
-----
Restaurant,City
El Coyote,Los Angeles
Rutt's Cafe,Los Angeles
Denver Biscuit Co.,Denver
Bubby's,NYC
H Bar,Toronto
Peppercorn,Las Vegas
The Argo,Vancouver
-----
I keep similar lists for books and movies.
I could use an App I suppose or some services, but them I'm locked into their tools making it tricky to blend that information with other data I have.  Those Apps also don't allow me to visualize and analyze the data in my own way.

The other reason why Business users might push back on using CSV is that Excel allows them to easily blend Metadata such as text formatting (e.g. underline, bold, italics) and formulas (e.g. A1+B1).

But Metadata is just data, and if Microsoft wanted to, they could separate out all this information into one or more additional CSV files that are then bundled with the underlying data CSV files, in  a single folder and ZIP file while presenting everything to the user in a neatly integrated form. Microsoft does do something like this behind the scenes already. Although they are doing so in proprietary formats, rather they are doing what could have been accomplished in zipped CSV files in proprietary formats instead.

These proprietary formats like .xlsx and .xls and .xlsm are all well and good if you are working entirely within the Microsoft Office ecosystem, but even if you move to Microsoft's Azure platform, the Excel workbook is persona-non-grata.  So it's limited there.

Microsoft's decision to store structured data in proprietary formats was probably made during a time where the performance of opening and saving an Excel document was a major concern.  This is the same thinking that led many mainframe developers to store dates using a 2-digit year as opposed to the full 4-digit year. 
Since the whole 'Y2K' problem (which is still ongoing - just look at your credit card's expiry date) went down, very few people now would see it being sensible to store a 2-digit year given that we are routinely storing movies, pictures, and other forms of unstructured data in volumes that dwarf most structured data stores.
Storing all structured data and Metadata in CSV format would surely annoy some technologists.  But it would also provide a way of flattening away today's database bureaucracy which does far more damage than any minor performance issue.
But if Microsoft were to do that, they might lose their grip on spreadsheets. So they don't much motivation to see this happen.  It would only benefit the vast majority of people, but just not them.
Oh well.

Only the humble CSV files seems to bridge both the Business and Technology worlds.  It is the closest thing we have to a lingua franca for structured data.

But it goes deeper than just this encoding.  CSVs also nudge us towards 'perspective neutral' data models (i.e. normalized models) rather than lock us into hierarchical perspectives.

Take for example the JSON format which is popular among developers and is also a text format similar to CSV. Yet, unlike CSV it prefers a hierarchical representation of data.
Using my above example of Restaurant and City which has one row per relation, a JSON file format might decide to encode this as a parent-child relationship with the parent being the City and the child being the Restaurant, especially if there were many restaurants listed under a single city.
To look up any Restaurant I would first need to traverse the parent City in order to get to it.  This is also known as "pointer chasing" and is the reason we shifted from Network and Hierarchical databases (like IDS and IMS) to Relational databases starting in the 1970s and throughout the 1980s and 1990s.  The problem with hierarchical data models is that they tend to lock you into a certain perspective whereas the relational model (which CSVs align to) work more like the human mind, where we can jump from list to list to list without any preconceived hierarchy.

Because most data developers now prefer the relational model, I routinely see JSON documents that are encoded in such a way as to not be hierarchical and simply be lists just like CSV files. Ironic, but expected.  CSVs on the other hand are much more compact and readable than any JSON document for this purpose.

It's also for this reason that spreadsheet tools like Excel cannot directly open JSON documents.  In order to analyze a JSON document in a tool like Power BI, it is necessary to "flatten" the hierarchies into explicit rows and columns.  Defenders of hierarchical formats like JSON point out that they are useful because they allow us to blend a number of related entities into a single document. It's similar to the reason why Business users like to bundle multiple worksheets into a single .xlsx workbook.
My response to this is: Why not just store the CSVs in a single folder? You can then easily move the folder as a ZIP or RAR file, and its contents can more easily be analyzed as rows and columns, and more easily combined in novel ways.

As a data architect, the CSV feels like that Mesopotamian clay in hand.  I can employ the power of relational algebra to combine and analyze CSVs at the speed of thought - pivoting from entity to entity as my stream of consciousness takes me.  Tools like Power BI and Qlik Sense allow me to rapidly connect these sets together like snapping together Lego blocks.
Give me a set of CSVs and some interesting questions at 9 AM, and I'll have your answers and insights by noon.
But give me a hierarchical format, and I'm mucking around with with an unwanted homework assignment of picking apart hierarchies and flattening them into CSVs that can be manipulated like my beloved Lego blocks.

Why does any of this matter?!?
Well this blog post is really just warming up for something much bigger: A rethink of how we manage and access all structured information.

And what I will attempt to demonstrate in the near future is that it is possible to manage ALL structured information through the humble CSV, if we have the collective will to make this happen. Think of CSVs as universal well worn data Lego blocks.

And why does that matter?
Most information problems can be boiled down to a combination of human and technology bureaucracies leading to data gate-keeping and information asymmetry.  In many cases gate-keeping is desired and with merit.  But in most cases I come across, data gatekeeping is enabled through a by-product of technological bureaucracies built on an alliance of information system vendors who build the systems, consultants who implement these systems, and custodians that operate these systems.
Show me a software vendor working together with consultants and custodians, and I'll show you a locked-in service layer that often results in confusion and bureaucracy to those on the wrong side of the service.

Perhaps that's the way it has always been with Information Technology. Cyber Crud has been with us for a very long time, just ask Ted Nelson.
I hope to show you that this is not how it has to be.

Stay tuned...


No comments: