We all know that data lineage is a complex and challenging topic. In a future blog, I will address lineage types, lineage consumers, and the value they are seeking. For now, I am sidestepping those important topics to drill into something I've been thinking about and studying for a long time: the fundamental approaches to lineage creation and maintenance.
There are several reasons why I am compelled to address it:
I continue to meet people who don't understand how to frame the challenge and put it in context. They are at the beginning of the learning curve and are often hopeful that there is some magic 'easy button'.
There are a lot of exciting changes under way that warrant investigation.
I find it fascinating and am convinced that a fundamental shift in approach is needed to drive the value of data and a culture of analytics to a new level of effectiveness.
What do I mean by lineage creation and maintenance? For the purposes of this blog, it means the process of understanding that things are related. What things?
Most of the time we think about data fields/files, columns/tables, reports/dashboards. And we think about how they are manipulated using some form of processing. We also think about how all of these are strung together to form long 'chains' of dependencies in pipelines and orchestrations.
The challenge today is to think more broadly about what these 'things' could or should be. I am not going to address this here, but it's important to realize that we need lineage/relationships between all data and data-related assets, including business terms, metric definitions, policies, quality rules, access controls, algorithms, and so on. My focus is on the fundamental approaches used to understand what physical data assets exist and the processing that relates them to one another.
Before I go any further, I must also stipulate that it's critical to consider not just how these things are discovered and created the first time, but how they are maintained on an ongoing basis. There are a lot of one-off approaches, including 'brute force', that can be used to create lineage the first time. The complex and difficult challenge is to have the lineage intelligently updated as the data landscape and processing change daily across an enterprise.
The fundamental value proposition of lineage is increased productivity. Those responsible for the maintenance of data feeds and pipelines, as well as those responsible for issue remediation, can use lineage to identify dependent data and processing more quickly. There is also a value pitch for increasing the trust and confidence of data consumers by showing them where data comes from. However, I believe that is a red herring that I will address in a future blog. For now, I will explore how I define the two fundamental approaches to data lineage creation and maintenance.
Non-Deterministic Lineage Creation
I define this as a post-processing effort to discern the existence of data assets and discover how the assets are related through processing logic. Put more directly, it's an effort to examine code and parse out what is in it, including how it's related. This has been the dominant approach for nearly 50 years and, in my opinion, was born out of the work of Thomas McCabe in the 1970s to measure the complexity of COBOL programs. His work produced control-flow graphs with nodes and edges as a visual representation of complexity.
Fast forward and we still see the same basic approach, with lineage tools either connecting to a 'source' or being fed files and then trying to tease out the nodes and edges. The big difference is the explosion of 'sources'. These can span hundreds of technologies, including a wide range of coding languages, pipeline environments, data storage/streaming representations, and API endpoints. Of course, the other big change is the complexity of the modern application and data architecture. This places a premium on the need to understand 'cross-system' lineage.
For non-deterministic lineage to work well, it requires a significant investment in building parsers that very deeply understand specific environments and languages. Attempting to stitch together a representation of cross-system lineage requires an even deeper investment in a proprietary approach to mapping to maintain reasonable accuracy.
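To make the parser problem concrete, here is a minimal sketch of the non-deterministic approach in Python. It is not how any particular vendor's tool works; the regular expressions, the `infer_edges` function, and the table names are all assumptions I made up for illustration. It guesses source-to-target edges from a single SQL statement and shows how quickly a naive heuristic goes wrong.

```python
import re

# Naive heuristics (assumed for illustration): the table after INSERT INTO is
# the target, and anything after FROM or JOIN is a source table.
TARGET_RE = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def infer_edges(sql):
    """Guess (source, target) lineage edges from one SQL statement."""
    target = TARGET_RE.search(sql)
    if target is None:
        return set()
    return {(src.lower(), target.group(1).lower()) for src in SOURCE_RE.findall(sql)}


# A simple statement: the heuristic finds both source tables.
simple = (
    "INSERT INTO mart.sales "
    "SELECT * FROM staging.orders o JOIN staging.customers c ON o.cid = c.id"
)
print(sorted(infer_edges(simple)))

# A slightly less simple statement: EXTRACT(YEAR FROM order_date) makes the
# heuristic report the column 'order_date' as if it were a source table.
tricky = (
    "INSERT INTO mart.sales "
    "SELECT EXTRACT(YEAR FROM order_date) FROM staging.orders"
)
print(sorted(infer_edges(tricky)))
```

Closing gaps like this one, statement type by statement type, for every language and platform in the landscape, is where the parser investment goes.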
Typically, it's been VC-backed vendors who have taken on this challenge, and over the years several have come and gone. They have done some good work but have generally struggled to meet false expectations that lineage would be an easy 'push button' exercise that yields an accurate representation of the entire data landscape.
A realist will see the fatal flaw. By definition, the non-deterministic approach is an attempt to extract highly accurate meaning from a wide range of code and technical artifacts without a premeditated way to do that. Even with the most powerful ML/AI, it will always be an after-the-fact approach to divine understanding with significant gaps and inaccuracies.
Deterministic Lineage
I define deterministic lineage as the use of lineage markup embedded in processing logic to construct a view of lineage. In other words, lineage is premeditated: the developers and maintainers of pipelines proactively insert lineage indicators alongside the code that performs data movement and manipulation. Because the lineage markup uses a consistent 'dialect', no matter the coding environment, producing a visual representation of lineage is a much lighter lift. All that's needed is a parser that looks for the markup in everything it's handed.
One way of looking at it is that the burden of responsibility and effort is shifted from the 'magic' of parsers to the developer. At first, that may seem like a bad trade-off (after all, who wants more work to do?), but let's examine the benefits.
Precise and accurate lineage - Lineage will represent exactly what is embedded in the code. If it's wrong, it's within your power and control to change it, as opposed to relying on a vendor to enhance their 'black box'.
Control of granularity - You can decide what level of granularity you want to represent in your lineage ranging from high-level representation down to field-level transformations and mapping.
Support for all technologies - If a language can include comments (and nearly all can), then markup can be inserted. This means all your tools could embed lineage, and you don't have to wait for parser support when you adopt a new one.
Consumption flexibility - The markup can be extracted and used in a wide array of visual tools. This could also include storage in a graph database or being mashed up with other technologies.
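The idea above can be sketched in a few lines of Python. The `@lineage:` dialect, the `extract_edges` and `downstream` functions, and the asset names are all hypothetical; I am not aware of a standard markup, so treat this as one possible shape rather than a specification. The point is that the same small parser works on any file type, because it only looks for the markup and ignores the surrounding code.

```python
import re
from collections import defaultdict

# Hypothetical markup dialect (an assumption, not a standard):
#   @lineage: source_a, source_b -> target
# It can sit inside any comment syntax, since only the marker is matched.
MARKUP_RE = re.compile(
    r"@lineage:\s*(?P<sources>[^>\n]+?)\s*->\s*(?P<target>[\w.]+)"
)


def extract_edges(text):
    """Return (source, target) edges declared by markup found in the text."""
    edges = []
    for match in MARKUP_RE.finditer(text):
        target = match.group("target")
        for source in match.group("sources").split(","):
            edges.append((source.strip(), target))
    return edges


def downstream(edges, start):
    """Return every asset that depends, directly or indirectly, on start."""
    children = defaultdict(set)
    for source, target in edges:
        children[source].add(target)
    seen, stack = set(), [start]
    while stack:
        for nxt in children[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


sql_snippet = """
-- @lineage: staging.orders, staging.customers -> mart.sales
INSERT INTO mart.sales SELECT o.id, c.name FROM staging.orders o
JOIN staging.customers c ON o.cid = c.id;
"""

python_snippet = """
# @lineage: mart.sales -> reports.daily_revenue
df.to_csv("daily_revenue.csv")
"""

edges = extract_edges(sql_snippet) + extract_edges(python_snippet)
print(edges)
print(downstream(edges, "staging.orders"))
```

The edge list is deliberately plain, so it could just as easily be loaded into a graph database or handed to a visualization tool. The `downstream` helper illustrates the productivity case: given an asset with an issue, find everything that depends on it.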
Current Industry State of Play
There are several interesting pockets of active lineage investment.
Catalog vendors and a few lineage specialists are primarily pursuing a non-deterministic lineage approach by continuing to make their parsers 'smarter'.
DataOps-centric vendors such as DataOps.io, Databricks, dbt, Ataccama, etc., which control and orchestrate an entire processing stack, are building lineage into their tooling. That is great if you can limit your organization to the use of their tooling, but it breaks down if you can't. I don't consider this to be true deterministic lineage as I define it above, because it fails the openness test.
The only thing I see that comes close to a deterministic approach is the OpenLineage framework (https://openlineage.io/). It includes an API in which 'emitters' push metadata from assets to a metadata collection API that can send it anywhere, including catalogs. It's very promising and something that I will continue to track, but its sponsorship is concentrated around the Marquez project and the vendor behind it, which gives it a whiff of being self-serving. I would like to see some major corporations leading it and a broader array of larger vendors embracing it.
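To give a feel for the emitter idea, here is a rough Python sketch that builds the kind of run event a pipeline step might push to a collection endpoint. It does not use the OpenLineage client library. The field names reflect my reading of the OpenLineage specification and should be checked against it; the `build_run_event` function, the namespace, the producer URL, and the asset names are hypothetical, and the sketch is not guaranteed to be a complete or valid event.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, namespace="example_namespace"):
    """Build a dict shaped like a lineage run event (illustrative only)."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Hypothetical identifier for the code that produced this event.
        "producer": "https://example.com/hypothetical-pipeline",
    }


event = build_run_event(
    job_name="load_sales",
    inputs=["staging.orders", "staging.customers"],
    outputs=["mart.sales"],
)

# In a real pipeline this JSON would be sent to the collection endpoint;
# here it is only printed, so the sketch makes no network calls.
payload = json.dumps(event, indent=2)
print(payload)
```

The contrast with the parser approach is that the pipeline states its own inputs and outputs at run time, so nothing has to be inferred from the code after the fact.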
Final Thoughts & Wrap-Up
Deterministic and non-deterministic lineage are not mutually exclusive. You can easily imagine a man/machine partnership where parsers do a great deal of heavy lifting for well-known environments (intra-system) and humans provide the vital markup for linkage (cross-system) and specialized processing. Lineage is an area that needs fundamental change. If we can achieve that, I am convinced it can drive new levels of productivity, fueled by data.
I hope you found this thought-provoking. You can subscribe to my blog site and reach out to me using my contact page or via LinkedIn.
Interesting perspective. I think this is the route being followed by MANTA - they have connectors that physically scan code for 50+ environments and the ability to add your own tags for lineage in other environments.... Sounds similar to what is described here