Sunday, October 23, 2022

Happy Paths: Why I am Looking Forward to Azure Synapse Gen3

Before starting I should mention that my main contract ends in December this year (2022) and I am looking for contract work for the 2023 year. I am a lifelong learner and data & enterprise architect with 28 years of experience. I also love tinkering with new technologies.

I can be contacted at:  neilhepburnjob@gmail.com or on LinkedIn at: www.linkedin.com/in/costie

On with the article… 

If you have been paying attention in the data and analytics space you may have noticed a shift towards a concept often referred to as “Data Lakehouse”. The technology essentially allows a structured database to be scaled with no limits by perfectly isolating compute from storage. This solves old problems and opens up a host of new possibilities including AI and ML.
If you think it’s a flash in the pan or something akin to the Data Lakes which have proliferated since the release of Hadoop going back to the late 2000s, then you should take another look at what is happening now.

There are three technology trends that are converging for the purpose of solving some big problems that have plagued data management since, well, the invention of the DBMS going all the way back to Charles Bachman’s Integrated Data Store from the early 1960s.

What are these technology trends and what are these problems you ask?

The trends are:
  1. Data Lakehouse (which we have already mentioned)
  2. Analytic Workspace
  3. Infrastructure as Code

The challenges are mostly around bureaucracy and ‘happy paths’.
The hard technical problems themselves have been solved since at least the early 2010s - anyone who genuinely needs “Big Data” capabilities for whatever purpose can obtain these technologies to solve their problem. Companies like Google and Meta simply couldn’t exist if they hadn’t been solved.
So what are the challenges I speak of then?
It all comes down to a lack of happy paths (or too many dilemmas or choices).

We need:
  1. A better ‘happy path’ for deploying Analytic Workspaces
    1. Currently the act of deploying a new Analytic Workspace is at best a dreaded “Change Request” or is at worst a big one-off project
  2. A better ‘happy path’ for onboarding users into those Analytic Workspaces
    1. Currently it can be difficult and time consuming to onboard a new user into a “big data” environment. 
      1. Often fiddly desktop software configurations are required, and if the user has special needs the result is a long uphill trek. Those who get to the top of the hill are often praised and seen as brave bureaucracy warriors - and as a warning to others to avoid the same journey
  3. A better ‘happy path’ for securely obtaining, collaborating, and sharing data sets
    1. This might be the hardest challenge of all. Sure you might get some copy of the data. But good luck on getting it refreshed or better yet ensuring you are working with an authoritative source respected by the business data owners (who are often disinclined to share their most precious assets and knowledge)

So why am I looking forward to Azure Synapse Gen3 then?
Well, before I answer that question I should point out that there are plenty of other alternatives out there (including Azure Synapse Gen2).
I will list these alternatives and then explain what I think of them.
Oh and I should point out - I have no idea what Azure Synapse Gen3 will actually entail. Everything I am writing here is based on pure speculation and conjecture. But my hypothesizing is informed by these factors:
  1. Microsoft has a history of getting things right on their 3rd attempt and learning from their mistakes (and the shortcomings of others)
  2. Microsoft understands corporate governance (i.e. bureaucracy) in ways that other big players like Google and Amazon seem to lack (or see as beneath them)
  3. All of the big players are aggressively investing in Data Lakehouse technologies:
    1. Meta (Facebook) is the pioneer here and has been investing in Apache Hive, Presto, and related open source technologies since 2010 and this stack continues to improve
    2. Amazon is investing in Apache Hudi (and possibly Apache Iceberg)
    3. Google is investing in Google BigLake
    4. Snowflake is investing in, well, Snowflake (and Sigma) and probably other stuff I’ve yet to be made aware of
      1. I won’t mention Oracle here except to say that I tend to think of Snowflake as Oracle 2.0 (I have a lot of respect for Oracle - and many feelings of cognitive dissonance which extend to Snowflake)
    5. Microsoft (along with Databricks) has for quite some time been investing in Databricks’ Delta Lake

I think it’s entirely possible that those other vendors will have something that eclipses Azure Synapse rendering it obsolete. The possibility of disruption is always around the corner.
But let me explain why Synapse Gen2 is quite impressive but also slightly lacking:

  1. The Synapse Workspace is a browser-based environment. Once configured (along with the requisite AAD security groups), users may be onboarded to a Synapse Workspace by simply adding them to a single AAD group.
  2. In one simple request, a Synapse Gen2 Workspace gives the user:
    1. Access to an MPP RDBMS (i.e. a super powerful SQL database optimized for analytical workloads)
    2. Access to a Data Lake and Data Lakehouse (Delta Lake)
    3. Access to Azure Data Factory (EL/TL) including Data Flow (ETL) and Wrangling Dataflows (business Data Prep based on Power BI’s Power Query ‘M’ language)
      1. These are truly best-of-breed tools which come with a “deep bench”
    4. Access to Spark Python Notebooks along with horizontally scalable clusters which can be scaled to virtually any size (assuming you have the $$$)
    5. Access to Power BI
      1. Another best-of-breed BI tool (the only tool that is better is Qlik Sense - as I have written about in the past. Both Power BI and Qlik share the more flexible “linked models” [as opposed to cube-based models like those Tableau and MicroStrategy rely on])
  3. Azure tools like Azure Data Factory, Azure SQL Database, and Azure Databricks all have a committed (some may say zealous - which is a positive here) developer base - that’s a good thing
    1. Unless you have a bit of a religion going with your technology, you will find yourself bowled over by people who are religious about their technology (in the tech industry this has always been the case, but it’s more explicit these days)

So what’s wrong with Azure Synapse Gen2 then?
  1. Deploying a new Synapse Workspace is a bit complicated and requires a lot of decisions around whether to use the SQL database or the Delta Lake
    1. Dilemmas are the enemy of the Happy Path
  2. Onboarding new users can be made easy if you have set up the AAD groups correctly, but I think there is room for improvement here. Again there are more choices than I think are necessary
  3. It’s difficult to share data with external parties
  4. It’s not obvious whether we should be using the “serverless” Data Lakehouse SQL database (i.e. Delta Lake) or the more mature MPP Dedicated SQL Pool
    1. This is in my view the biggest challenge of all for Microsoft to solve and the one I have highest hopes for

So what am I expecting for Gen3 then?
  1. Quickstart templates (maybe as Azure CLI scripts or HashiCorp Terraform scripts) for common Azure Synapse Workspace patterns
    1. One thing that would be great is to have as inputs the Data Lake folders and Delta Lake tables that users should have access to, along with the appropriate permissions
    2. Another thing would be some way of better managing all the AAD groups that need to be created to accommodate the various roles within the Workspace (e.g. Data Engineer, Data Scientist, etc).
  2. A simpler onboarding experience for new users
    1. If we could do away with the need for Power BI Desktop (and any other lingering desktop software requirements) that would be great
      1. Hey I like Windows - but many Data Scientists work on Macs these days
  3. A better solution for sharing data (like Databricks’ Delta Sharing)
  4. A single unified DBMS based on Delta Lake - no secondary copies of data in MPP Dedicated SQL Pools (“single version of the truth”)
    1. This is the biggest challenge of all
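
To make the first wish on that list a little more concrete, here is a minimal Python sketch of what the inputs to such a quickstart template might look like. Everything in it is hypothetical: the `RoleSpec` and `WorkspaceTemplate` classes, the `sg-<workspace>-<role>` group naming convention, and the permission labels are my own illustration, not anything Azure or Terraform actually provides. It only shows the shape of the idea: declare roles, folders, and tables once, and derive the AAD groups and grants from that.

```python
from dataclasses import dataclass, field

# Hypothetical permission labels; these are not Azure role names.
READ = "read"
WRITE = "write"


@dataclass
class RoleSpec:
    """One workspace role, assumed to be backed by exactly one AAD group."""
    name: str
    # Maps a Data Lake folder or Delta Lake table name to a permission label.
    grants: dict[str, str] = field(default_factory=dict)


@dataclass
class WorkspaceTemplate:
    workspace: str
    roles: list[RoleSpec]

    def group_name(self, role: RoleSpec) -> str:
        # Hypothetical naming convention: sg-<workspace>-<role>.
        return f"sg-{self.workspace}-{role.name}".lower().replace(" ", "-")

    def plan(self) -> list[tuple[str, str, str]]:
        """Flatten the template into (group, resource, permission) rows."""
        return [
            (self.group_name(role), resource, permission)
            for role in self.roles
            for resource, permission in sorted(role.grants.items())
        ]


template = WorkspaceTemplate(
    workspace="sales",
    roles=[
        RoleSpec("Data Engineer", {"/lake/raw/sales": WRITE, "delta.sales_orders": WRITE}),
        RoleSpec("Data Scientist", {"delta.sales_orders": READ}),
    ],
)

for row in template.plan():
    print(row)
```

A real template would hand those rows to whatever deployment tool is in use; the point is that a single declaration removes most of the per-workspace decisions.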

On that last point, I have a feeling MS is already moving in this direction.
I believe this because they have already built out something they are calling “HyperSpace Indexes”.

Backing up a bit, in case you forgot what a Data Lakehouse is, it is basically the pure separation of compute from storage. At its core, everything must be managed through documents and trusted actors.
It’s a great idea, but comes with some trade-offs.

Sure, it’s possible for a single vendor like Databricks or Microsoft to ensure data consistency by coordinating within themselves. But I am a wee bit skeptical that we have achieved this goal when it comes to multiple vendors writing to the same table at the same time at high frequency. Yes, it’s possible to take advantage of HDFS file locking and whatnot, but I have yet to see a good demonstration that is as “bet your business” as what a traditional SQL RDBMS (like Azure SQL Database, Oracle, or PostgreSQL) can provide.
In a sense this has always been the dirty little secret of NoSQL databases: NoSQL DBMSs lack managed integrity controls and rely on trusted client applications to manage integrity for them. This gives better performance, but you can’t manage data as a separate concern. Similar challenges follow the Data Lakehouse.
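
To illustrate what “managed through documents and trusted actors” means in practice, here is a minimal Python sketch of an optimistic commit log on a local folder. It is loosely modelled on the general idea of a table format that records each change as a numbered commit file; it is not the actual Delta Lake protocol, and the function names `try_commit` and `latest_version` are made up for this example. The sketch assumes the storage layer can create a file atomically only if it does not already exist, and that every writer plays by the same rule.

```python
import json
import os
import tempfile


def latest_version(log_dir: str) -> int:
    """Return the highest committed version number, or -1 for an empty log."""
    versions = [
        int(name.split(".")[0])
        for name in os.listdir(log_dir)
        if name.endswith(".json")
    ]
    return max(versions, default=-1)


def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Try to publish commit `version`; return False if another writer got there first."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes the create fail if the file already exists,
        # so at most one writer can claim a given version number.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        json.dump(actions, f)
    return True


log_dir = tempfile.mkdtemp()

# Two writers both read the log and both aim for the same next version.
next_version = latest_version(log_dir) + 1
writer_a = try_commit(log_dir, next_version, [{"add": "part-a.parquet"}])
writer_b = try_commit(log_dir, next_version, [{"add": "part-b.parquet"}])

print(writer_a, writer_b)  # True False
```

The losing writer has to re-read the log and retry at the next version. Nothing in the storage itself stops a careless client from writing data files without committing, or from skipping the check altogether, which is exactly the “trusted actors” trade-off.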

To repeat: One of the primary goals of data management is to be able to manage data as a separate concern. So the Delta Lake needs to up its game and Microsoft is doing this.
Those HyperSpace indexes, however, muddy the waters a bit.
Yes, they are indexes, and yes they will improve query performance.
But there is no guarantee that other vendors will maintain these indexes because they are not part of the core Delta Lake protocol.
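
Here is a toy Python sketch of that concern. It uses an in-memory list as the “table” and a dictionary as the “index”, with one writer that maintains the index and one that only knows the core table format. All of the names are hypothetical, and it says nothing about how HyperSpace itself detects or repairs a stale index; it only shows why a side structure outside the shared protocol can drift.

```python
# Toy table (a list of rows) and toy side index (customer -> row positions).
table = []
index = {}


def indexed_append(row):
    """Writer that knows about the side index and keeps it in sync."""
    index.setdefault(row["customer"], []).append(len(table))
    table.append(row)


def plain_append(row):
    """Writer that follows only the core table format and ignores the index."""
    table.append(row)


def lookup_via_index(customer):
    """Fast path: trust the index."""
    return [table[i] for i in index.get(customer, [])]


def lookup_via_scan(customer):
    """Slow path: scan every row."""
    return [row for row in table if row["customer"] == customer]


indexed_append({"customer": "acme", "amount": 10})
plain_append({"customer": "acme", "amount": 25})

# The index only knows about the first row; the scan sees both.
print(len(lookup_via_index("acme")), len(lookup_via_scan("acme")))  # 1 2
```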

Nevertheless they do point us in the right direction (assuming there is only one choice).
Competition is good, but dilemmas… not so much.

But what about all those other technologies I just mentioned?
Well the honest answer is I have done some hands-on evaluation of those tools, and in the case of the open source Apache stack I have very much lived in that world for quite some time.
I’ve also dabbled a bit with Snowflake and Sigma and am very impressed with their data sharing capabilities.
Quite frankly I could see Snowflake winning this game (if it is a winner-take-all game) based solely off their approach to data sharing.
And for the record, I would be more than happy to work with Snowflake and Sigma.

I suspect, though, that all five of the vendors I mentioned above will continue to push their own vision and tech stack, and as a realist I would be happy to work with any and all of these technology stacks. I would even love the opportunity to make them work together (with the appropriate ‘happy paths’).

As a reminder, I am currently seeking work for the 2023 year starting in January.
My contract ends in December, and I would love to hear from you if you are curious about this stuff or want to hire me on contract.
I can be contacted at:  neilhepburnjob@gmail.com or on LinkedIn at: www.linkedin.com/in/costie