Data reliability starts with data lineage
Technological advances in the data field have produced solutions that should have made it simple to create reliable data products. But any data professional knows that rapid growth, frequent changes and increasing demands make data stacks chaotic. In such a dynamic environment chaos is normal. We shouldn't fight it; we just need better tools to handle it. In our view, the first step to making sense of the mess is to understand it clearly: visibility is key for reliability and governance.
What is data lineage?
The term data lineage is not new, and was probably coined decades ago, when understanding the flow of data in the first relational databases was already a challenge. When thinking of data lineage, most people picture a data flow diagram of arrows and components, but that is just its visual representation.
Data lineage is the journey of a dataset through the data stack: its sources, its transformations and its destinations. It is built by mapping the dependencies between the components of the dataset's flow.
Sounds simple, but in reality this is a challenge that keeps growing as our data stacks become larger and more complex, and as more systems and people are involved in building and maintaining them.
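To make the idea of mapping dependencies concrete, here is a minimal sketch in Python. The dataset names and the UPSTREAM mapping are hypothetical and only illustrate one possible way to record lineage: each dataset lists the datasets it is built from, and sources and destinations fall out of that mapping.

```python
# Hypothetical lineage: each dataset maps to the datasets it is built from.
UPSTREAM = {
    "raw_orders": [],
    "raw_customers": [],
    "clean_orders": ["raw_orders"],
    "customer_orders": ["clean_orders", "raw_customers"],
    "revenue_dashboard": ["customer_orders"],
}


def invert(upstream):
    """Build the downstream view: dataset -> datasets that read from it."""
    downstream = {name: [] for name in upstream}
    for dataset, parents in upstream.items():
        for parent in parents:
            downstream.setdefault(parent, []).append(dataset)
    return downstream


DOWNSTREAM = invert(UPSTREAM)

# Sources have no upstream dependencies; destinations have no consumers.
sources = sorted(d for d, parents in UPSTREAM.items() if not parents)
destinations = sorted(d for d, children in DOWNSTREAM.items() if not children)

print("sources:", sources)
print("destinations:", destinations)
```

In a real stack this mapping would be collected automatically rather than typed by hand, which is exactly where the difficulty lies.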
Effective data lineage
So why the diagrams? Because in order to be useful, data lineage needs to be clear and accessible, and visualization is a great tool for that. But it is not enough. In 2010 I saw a data lineage graph for the first time, as a printout on the wall of the DBA team. It was huge and detailed, but it was covered in corrections and remarks that had been added over time, in far too many different pen and marker colors. You can imagine how hard it was to figure anything out from it, and that was the good scenario, in which people were responsible enough to document their changes on the poster.
In order for data lineage to be useful, it has to be accessible (visualization is a good start), automated, and updated in real time.
Cool, but why?
There are many use cases in which data lineage can be leveraged to increase reliability in data professionals' day-to-day work:
Confident changes - When making changes in data collection, transformation or usage, data lineage can be used to make sure the change will not impact existing flows. This is useful for preventing data reliability problems.
Impact analysis - When a data reliability issue is detected, the lineage can be used to understand which downstream datasets are impacted.
Root cause detection - When a problem is detected, the lineage can be used to understand which upstream datasets could be the cause of it. Sometimes the cause is a change in the lineage itself (e.g. a table was dropped or a dataset was renamed), and looking at previous versions of the graph can help in detecting the root cause and the required fix.
Knowledge sharing - As the data team grows and the data stack evolves, it gets harder to maintain a knowledge base of available datasets, their usage and dependencies. These also change frequently, so creating manual documentation, especially visual documentation, becomes an impossible mission. An automated, up-to-date visual data lineage removes the need to document by hand, and is especially useful for sharing knowledge among team members and onboarding new ones.
Data operations - An important aspect of data operations is visibility of how data is being utilized. This enables day-to-day maintenance and governance, and makes significant processes like migrations and performance optimization easier.
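Impact analysis and root cause detection from the list above both come down to walking the lineage graph, downstream in one case and upstream in the other, while a change in the lineage itself can be found by comparing two snapshots of it. The following is a minimal sketch in Python under stated assumptions: the dataset names, the edge-set representation and the helper names reachable and diff are all hypothetical, not taken from any particular tool.

```python
from collections import deque

# Hypothetical lineage snapshot as (upstream, downstream) edges.
EDGES = {
    ("raw_orders", "clean_orders"),
    ("clean_orders", "customer_orders"),
    ("raw_customers", "customer_orders"),
    ("customer_orders", "revenue_dashboard"),
}


def reachable(edges, start, direction="downstream"):
    """Return every dataset reachable from start in the given direction."""
    neighbors = {}
    for up, down in edges:
        src, dst = (up, down) if direction == "downstream" else (down, up)
        neighbors.setdefault(src, set()).add(dst)
    seen, queue = set(), deque([start])
    while queue:  # breadth-first walk over the dependency edges
        for nxt in neighbors.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def diff(old_edges, new_edges):
    """Compare two lineage snapshots: (removed edges, added edges)."""
    return old_edges - new_edges, new_edges - old_edges


# Impact analysis: what is affected if clean_orders is broken?
impacted = reachable(EDGES, "clean_orders")

# Root cause detection: which upstream datasets could explain a bad dashboard?
suspects = reachable(EDGES, "revenue_dashboard", direction="upstream")

# Lineage change: a later snapshot in which raw_customers no longer feeds anything.
LATER = EDGES - {("raw_customers", "customer_orders")}
removed, added = diff(EDGES, LATER)

print(sorted(impacted), sorted(suspects), removed, added)
```

The same downstream walk covers the confident-changes case: running it before modifying a dataset lists everything the change could touch.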
We believe data lineage will power an increase in the reliability of data stacks, and that it is the foundation on which solutions for detecting data issues and providing visibility will be built. This will enable data professionals to navigate the built-in chaos of modern data stacks and manage data operations efficiently and reliably.