In my last blog, I explored a new perspective on how to create and maintain lineage. I also artfully sidestepped any definition or discussion of lineage classifications.
I owe you. So in this blog, I step forward to close that gap and share how I think about lineage. I don’t claim it's the ‘only’ or ‘best’ framework, but if it helps you even a bit it’s worth the time it took me to write it down.
It was born out of my frustration and struggles while trying to help others find the boundary between business and technical lineage.
Spoiler alert: As much as I would like to give you an ‘easy button’ answer, that is not possible. My goal is always to be as practical and pragmatic as possible, but data lineage is a tough nut, so I start by laying a contextual foundation and then share my classification framework.
What’s the point of lineage? It's complex and requires constant time and attention, so why bother, and what value do people expect to derive from it?
I posit that there are two primary drivers for lineage:
Increased Productivity – Lineage promises to reduce the amount of time required to identify and understand dependencies when we need to investigate, maintain, and remediate issues.
Increased Assurance/Trust – Lineage promises to help explain the point of origin of data, including who provided it and how it was manipulated, all of which can drive up confidence and trust.
To better understand these two drivers, I will explore them within the context of specific roles.
Data Consumer
A non-technical business consumer of data in the form of reports, spreadsheets, dashboards, etc.
Business Analyst & Visualization Developers
A functional domain expert specialized in translating business requirements and delivering consumable reports, spreadsheets, dashboards, etc. This includes the use of business-specific calculations and aggregations.
DataOps
A team of data analysts, data engineers, and data quality experts responsible for the creation and maintenance of data pipelines needed by business analysts and visualization developers. These pipelines include cross-data source mappings and transformations.
DevOps
A team of operations engineers and platform support engineers who are responsible for the creation and maintenance of the sequenced execution of dependent pipelines. This includes monitoring, alerting, and execution issue remediation.
The table below shows the key lineage-related activities and uses for each role.
Simple Lineage Classification
After studying the table above, it should be clear that each role has different motivations for using lineage. Often these differences are classified as business vs. technical lineage.
Business Lineage – Lineage that supports data consumers, business analysts, and visualization developers.
Technical Lineage – Lineage that supports DataOps and DevOps.
It’s easy to accept this logical mapping between roles and these two classifications. The problem is that it’s too vague to guide an actual implementation.
To address this, I drilled down to think about each role's requirements.
Improved Lineage Classification
The requirements above make it even clearer that a binary classification of either business or technical lineage is too simplistic and not that helpful.
A more useful set of classifications or types is:
Notice that lineage is a continuum from left to right with an increasing level of technical detail and complexity. There are also two sets of overlaps. One is where the lineage types intersect with each other. The other is in how the roles use lineage.
This adds complexity, but it can’t be avoided and represents the reality of roles that have a broad spectrum of responsibilities and skill levels in most organizations.
Lineage Types and Views
One way to think about lineage, which may help provide clarity, is to think of the types as views that expose the appropriate detail for the appropriate roles when they need them.
Of course, this assumes that all lineage levels are captured and maintained. It also assumes that the tools being used support a variety of user roles and views of lineage.
The ideal vehicle for this is the enterprise catalog, which serves as the authoritative system of reference. It is uniquely capable of representing lineage at all levels and across all systems, pipelines, reporting, and analytic tools.
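To make the idea of lineage types as views a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the edge names, the numeric detail levels (1 for least technical, 4 for most technical), and the mapping of roles to levels are hypothetical and are not the lineage types of this framework or the data model of any catalog product. It only shows the principle that lineage is captured once, at every level, and each role gets a filtered view of it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    source: str
    target: str
    detail_level: int  # assumed scale: 1 = least technical, 4 = most technical


# Hypothetical lineage, captured once at every level of detail.
ALL_LINEAGE = [
    LineageEdge("crm.customers", "customer_dashboard", 1),
    LineageEdge("crm.customers", "report.customer_lifetime_value", 2),
    LineageEdge("crm.customers", "pipeline.clean_customers", 3),
    LineageEdge("pipeline.clean_customers", "warehouse.dim_customer", 3),
    LineageEdge("job.nightly_load", "pipeline.clean_customers", 4),
]

# Assumed mapping of each role to the detail levels it needs to see.
# Neighboring roles share a level, which mirrors the overlap between roles.
LEVELS_BY_ROLE = {
    "data_consumer": {1},
    "business_analyst": {1, 2},
    "dataops": {2, 3},
    "devops": {3, 4},
}


def lineage_view(role: str) -> list[LineageEdge]:
    """Return only the edges whose detail level is relevant to the role."""
    levels = LEVELS_BY_ROLE[role]
    return [edge for edge in ALL_LINEAGE if edge.detail_level in levels]


# Print the view each role would see.
for role in LEVELS_BY_ROLE:
    print(role, [f"{e.source} -> {e.target}" for e in lineage_view(role)])
```

The point of the sketch is the shape, not the details: one underlying store of lineage, and a per-role filter on top of it. In a real catalog the filter would likely be driven by configuration rather than a hard-coded dictionary.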
Recommended Lineage to Role Mapping Exercise
If you find yourself needing to crisp up how your team defines lineage and plans for its use, I suggest working together to complete the following matrix.
The goal is to define two things at the intersection of each role and lineage type.
Lineage requirements/use when consuming, creating, maintaining, or remediating issues. You can and should decompose each of these, but try to also stay focused on the requirements for each scenario.
The attributes needed to support each requirement. Think about what information each role needs for each scenario and list them. These represent the foundation of each lineage 'view' that needs to be supported. These will turn into the configuration of custom fields, templates, and uses of automated and manual lineage in your catalog platform.
Finally, aggregate all the attributes by role and across roles to study the overlaps. Use what you find to drive a deeper level of planning, where you decide how to prioritize rolling the attributes out in phases.
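If it helps, the aggregation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lineage type names (`type_a`, `type_b`, `type_c`) and every attribute name are placeholders I made up, and counting how many roles need an attribute is just one possible way to rank a phased rollout. Your own matrix will have your own roles, types, and attributes.

```python
from collections import Counter

# Hypothetical matrix: (role, lineage type) -> attributes that role needs.
MATRIX = {
    ("data_consumer", "type_a"): {"owner", "source_system"},
    ("business_analyst", "type_a"): {"owner", "calculation"},
    ("business_analyst", "type_b"): {"calculation", "source_table"},
    ("dataops", "type_b"): {"source_table", "transformation"},
    ("dataops", "type_c"): {"transformation", "pipeline"},
    ("devops", "type_c"): {"pipeline", "schedule"},
}


def attributes_by_role(matrix):
    """Collect every attribute a role needs across all lineage types."""
    by_role = {}
    for (role, _lineage_type), attrs in matrix.items():
        by_role.setdefault(role, set()).update(attrs)
    return by_role


def shared_attributes(by_role):
    """Return attributes needed by more than one role, with the role count."""
    counts = Counter(attr for attrs in by_role.values() for attr in attrs)
    return {attr: n for attr, n in counts.items() if n > 1}


by_role = attributes_by_role(MATRIX)
shared = shared_attributes(by_role)

# One possible ordering for a phased rollout: most widely shared first.
for attr, n in sorted(shared.items(), key=lambda item: (-item[1], item[0])):
    print(f"{attr}: needed by {n} roles")
```

Attributes that show up for several roles are natural candidates for an early phase, since one piece of configuration work serves more than one view.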
Wrapping it up
I know this is a lot of work. I’m sorry. I would like nothing better than to tell you that you could flip a switch and achieve the mythical business and technical lineage, but it just doesn’t happen that way.
In order to manage expectations, understand the value delivered for specific roles, and plan an orderly rollout, this deeper analysis and planning has to be done.
You don’t have to use this framework, but you will need one to pull that off.