MLOps Blog

Data Lineage in Machine Learning: Methods and Best Practices

7 min read
Samadrita Ghosh
19th April, 2023

Data is often called an organization’s most treasured asset. However, this recognition is recent, so relatively few people have experience handling data and leveraging it to create more value.

As managers become more data-fluent, many organizations are adopting the practice of tracking data lineage, which has become a steady driver of data efficiency.


What is data lineage?

Data lineage is the story behind the data. It tracks data from its creation point to its points of consumption. This flow involves input and output points, the transformation and modeling processes the data has undergone, record analyses, visualizations, and several other processes, all of which are constantly tracked and updated.

The objective of data lineage is to observe the entire lifecycle of data such that the pipeline can be upgraded and leveraged for optimal performance.

Data lineage | Source: Dremio

Data lineage vs. data provenance 

Data lineage is often confused with data provenance, as the difference is subtle and easy to miss. Both serve optimization through tracking and observation, but at a high level, data lineage is a subset of data provenance.

While data lineage focuses specifically on the data by tracking its journey including the origins, destinations, transformations, and processes it has undergone, data provenance additionally tracks all the systems and processes that play a part in influencing the data.

In short, data lineage concerns itself with data about the data (metadata), while data provenance additionally covers information about the systems and processes that influence the data.

Why is data lineage necessary?

Knowing the journey data has taken, and the circumstances it has been through, offers considerable control over its present state, which helps in choosing the best routes for data solutions.

However, what makes data lineage absolutely necessary is the growing competition in data efficiency and expertise across organizations and industries. A few years ago, smart data and statistical solutions were good-to-have insights that offered a competitive edge. Today, most organizations want to optimize their data assets, and control over and knowledge about data is gradually becoming a competitive advantage.

Here are a few reasons to consider Data Lineage as one of the key contributors to data efficiency:

  • Data gathering

Data is a dynamic asset that keeps evolving, and relying on static data can harm business decisions and outcomes. Constantly gathering, updating, and validating data is therefore a critical process, and data lineage is very useful here because of its tracking abilities. Upgrading data is not just about fetching new data but also about understanding the relevance of older datasets. Data lineage can help combine new data with older, still-relevant data so that data consumers such as developers, business teams, and stakeholders derive the maximum value from the data assets.

  • Data governance

The metadata recorded through data lineage automatically offers details that can be used in compliance audits or to improve the security of the data pipeline. It also helps in understanding the structure and process of the dataflow, leaving ample room for improvement. Tracking the metadata consistently also reduces technical debt and, therefore, the overall cost of risk management and compliance.

  • Standardized migration

In practice, data frequently needs to be migrated from one system to another. During migration, several details such as location, formatting, storage, bandwidth, and security concerns must be known beforehand. Data lineage is perfect for such a setting, since it provides in-depth details almost instantly and saves both cost and time. With the available metadata, it’s also possible to automate and standardize migration processes depending on the parameters of the destination and source systems.

  • Rich business insights

Data lineage is essential to maintaining the integrity of data, which is crucial to businesses. By constantly tracking details about the data, it’s possible to instantly update and alert users in case of data discrepancy, irrelevance, or staleness. Several departments, such as development, sales, marketing, and business operations, depend on data to improve their processes. Fresh and healthy data can bring immense speed and value to business decisions.

Who benefits from data lineage and how?

Since several departments in an organization rely on data lineage, let’s take a closer look at who depends on it and why.

  • ETL Users/Developers

Every organization that deals with data has an ETL function that looks after the Extract, Transform, and Load process: extracting data from a source, applying the required transformations, and then sharing it with the destination system. ETL developers deal with heavy volumes of data, and data lineage comes in handy for detecting bugs in the ETL process and for creating detailed reports of each transfer.
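To make this concrete, here is a minimal sketch of an ETL step that records enough metadata to produce a transfer report. Everything in it (the `run_etl` helper, the report fields, the in-memory source and loader) is illustrative, not taken from any particular ETL tool:

```python
import hashlib
import json
from datetime import datetime, timezone

def checksum(rows):
    """Stable fingerprint of a batch, so source and destination can be compared."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def run_etl(source_name, dest_name, extract, transform, load):
    """Run one ETL pass and return a lineage report for it."""
    started = datetime.now(timezone.utc).isoformat()
    raw = extract()
    transformed = [transform(row) for row in raw]
    load(transformed)
    return {
        "source": source_name,
        "destination": dest_name,
        "started_at": started,
        "rows_in": len(raw),
        "rows_out": len(transformed),
        "input_checksum": checksum(raw),
        "output_checksum": checksum(transformed),
    }

# Illustrative usage with in-memory stand-ins for real systems.
report = run_etl(
    source_name="orders_api",
    dest_name="warehouse.orders",
    extract=lambda: [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "7"}],
    transform=lambda row: {**row, "amount": float(row["amount"])},
    load=lambda rows: None,  # a real loader would write to the destination
)
print(json.dumps(report, indent=2))
```

A report like this, emitted per transfer, is exactly the kind of metadata that makes bugs in an ETL process traceable after the fact.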

  • Security Teams

Security experts and developers are constantly devising ways to harden the data pipeline’s vulnerable endpoints. Data lineage consistently surfaces critical information about these endpoints, which helps in experimenting with different permutations and combinations. Records of vulnerabilities, failures, and the processes that caused them offer further scope for security improvements.

  • Business Teams

Business teams work with multiple reports, and with the help of data lineage, they can easily navigate to the source of the reports and validate the data whenever necessary.

  • Data Stewards

A data steward is an individual responsible for the quality of an organization’s data assets. A data steward is therefore expected to know the ins and outs of the data they govern, and data lineage makes this process much more accurate, transparent, and user-friendly.

Methods of data lineage

Here are a few ways to perform Data Lineage tracing:

  • Lineage through Data Tagging

This method works on the assumption that a transformation tool is consistently involved with the data and tags it after executing each transformation. To track the tags, it’s important to know their format so they can be spotted across pipelines. This method is only reliable in closed systems where only the known transformation tool is deployed.
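As a minimal sketch of the idea: the tag format below (a tool name, step name, and UUID) is an assumption for illustration, since the real format depends on the transformation tool in use:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class TaggedData:
    """Data bundled with the tags left behind by each transformation."""
    records: list
    tags: list = field(default_factory=list)

def apply_transformation(data: TaggedData, step_name: str, fn) -> TaggedData:
    """Transform the records and append a tag in the known, fixed format."""
    out = TaggedData(records=[fn(r) for r in data.records], tags=list(data.tags))
    out.tags.append({
        "tool": "demo-transformer",   # hypothetical tool name
        "step": step_name,
        "tag_id": str(uuid.uuid4()),
    })
    return out

data = TaggedData(records=[1, 2, 3])
data = apply_transformation(data, "double", lambda x: x * 2)
data = apply_transformation(data, "increment", lambda x: x + 1)
# Reading the tags back reconstructs the lineage: double -> increment.
print([t["step"] for t in data.tags])
```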

  • Self-contained Lineage

This is when lineage is traced in a closed environment completely controlled by the organization. Almost every component of the data stack (data lakes, storage tools, data management, and processing logic) is part of the environment, so the lineage is restricted to the boundaries of that environment and is unaware of processes outside it.

  • Parsing

Data lineage through parsing reads the code or transformation logic to understand how the data reached its current state. By processing the transformation logic, it can trace back to previous states and complete end-to-end lineage tracing. Parsing-based lineage is not technology-agnostic, since it needs to understand the programming languages and transformation tools deployed on the data. This directly restricts the flexibility of the process, despite it being one of the most advanced lineage-tracing techniques.
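As a toy illustration of the principle (real parsing-based tools build full syntax trees and resolve column-level dependencies across many SQL dialects), the sketch below pulls source and target table names out of a simple SQL statement with regular expressions:

```python
import re

def parse_lineage(sql: str):
    """Extract (target, sources) from a simple INSERT ... SELECT statement.

    A toy parser: real parsing-based lineage resolves full syntax trees
    and column-level dependencies, not just table names.
    """
    target = re.search(r"INSERT\s+INTO\s+(\w+(?:\.\w+)?)", sql, re.IGNORECASE)
    sources = re.findall(r"(?:FROM|JOIN)\s+(\w+(?:\.\w+)?)", sql, re.IGNORECASE)
    return (target.group(1) if target else None), sources

sql = """
INSERT INTO marts.daily_revenue
SELECT o.day, SUM(o.amount)
FROM raw.orders o
JOIN raw.customers c ON o.customer_id = c.id
GROUP BY o.day
"""
target, sources = parse_lineage(sql)
print(target, "<-", sources)  # marts.daily_revenue <- ['raw.orders', 'raw.customers']
```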

  • Pattern-based lineage

This method of data lineage doesn’t work with the code responsible for data transformations. Instead, it only observes the data and looks for patterns to trace the lineage, making it completely technology- and algorithm-agnostic. It is not a very reliable method, because it tends to miss patterns that are deep-rooted in the code.
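One simple heuristic of this kind is value overlap: if an output column’s values are almost all found in some input column, the two are probably linked. A rough sketch, with made-up tables:

```python
def column_overlap(source_col, target_col):
    """Fraction of target values that also appear in the source column."""
    source_vals = set(source_col)
    target_vals = [v for v in target_col if v is not None]
    if not target_vals:
        return 0.0
    return sum(v in source_vals for v in target_vals) / len(target_vals)

def infer_lineage(source_tables, target_table, threshold=0.9):
    """Guess which source column each target column came from, by value overlap."""
    edges = []
    for t_name, t_col in target_table.items():
        for s_table, cols in source_tables.items():
            for s_name, s_col in cols.items():
                if column_overlap(s_col, t_col) >= threshold:
                    edges.append((f"{s_table}.{s_name}", t_name))
    return edges

sources = {
    "orders": {"id": [1, 2, 3], "amount": [10.5, 7.0, 3.2]},
    "customers": {"email": ["a@x.com", "b@x.com", "c@x.com"]},
}
target = {"order_id": [1, 2, 3], "customer_email": ["a@x.com", "c@x.com"]}
print(infer_lineage(sources, target))
# [('orders.id', 'order_id'), ('customers.email', 'customer_email')]
```

Note the weakness the method inherits: if a transformation reshapes values (say, hashing emails), the overlap disappears even though the lineage is real.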


Data lineage across the pipeline

To capture end-to-end lineage, data has to be tracked across every stage and process in the data pipeline. Here are the stages across which data lineage is performed:

  • Data Gathering Stage

The data gathering or ingestion stage is where data enters the core system. Data lineage can be used to track the vitals of the source and destination systems to validate the accuracy of the data, mappings, and transformations. Tracking the systems closely also makes it easier to identify bugs.

  • Data Processing Stage

Data processing takes up a huge share of the work of building data solutions. It involves multiple transformations, filters, data types, tables, and storage locations. Recording metadata from each step doesn’t just help with compliance and production speed; it also makes the development process richer and more productive, enabling developers to analyze the causes behind the success or failure of processes in greater detail (see the sketch below).
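One lightweight way to record metadata from each step is a decorator that logs row counts and timing around every transformation. This is a sketch, not any particular tool’s API; the in-memory `LINEAGE_LOG` stands in for a real metadata store:

```python
import functools
import time

LINEAGE_LOG = []  # in production this would go to a metadata store

def traced(step_name):
    """Decorator that records metadata about each transformation step."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(rows):
            start = time.time()
            result = fn(rows)
            LINEAGE_LOG.append({
                "step": step_name,
                "rows_in": len(rows),
                "rows_out": len(result),
                "duration_s": round(time.time() - start, 4),
            })
            return result
        return inner
    return wrap

@traced("drop_nulls")
def drop_nulls(rows):
    return [r for r in rows if r is not None]

@traced("square")
def square(rows):
    return [r * r for r in rows]

out = square(drop_nulls([1, None, 2, 3]))
for entry in LINEAGE_LOG:
    print(entry)
```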

  • Data Storing and Access Stage

Organizations usually deploy large data lakes to store their data. Data lineage can be used to track the access permissions, vitals of endpoints, and data transactions. This will increase the degree of automation of security and compliance, which is a huge bonus given the size and complexity of data lakes.

  • Data Querying Stage

Users raise multiple data queries with a range of functions like joins and filters. Some functions can be heavy on the processors and therefore less efficient. Data lineage can observe the queries to track and validate the processes and the different versions of data resulting from them. It also helps in optimizing the queries and provides reports, including instances of optimal solutions.
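A rough sketch of query observation: intercept each query, extract the tables it touches (reusing the toy regex idea from the parsing section), and keep counts so heavy or frequent queries stand out:

```python
import re
from collections import Counter

query_log = []
table_hits = Counter()

def observe_query(sql: str, duration_s: float):
    """Record a query, the tables it touched, and how long it took."""
    tables = re.findall(r"(?:FROM|JOIN)\s+(\w+(?:\.\w+)?)", sql, re.IGNORECASE)
    query_log.append({"sql": sql.strip(), "tables": tables, "duration_s": duration_s})
    table_hits.update(tables)

observe_query("SELECT * FROM raw.orders JOIN raw.customers ON o.cid = c.id", 2.4)
observe_query("SELECT day, SUM(amount) FROM raw.orders GROUP BY day", 0.3)

# Hot tables are candidates for materialized views or pre-joined versions.
print(table_hits.most_common())
# [('raw.orders', 2), ('raw.customers', 1)]
```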

Best practices of data lineage

Data lineage is an evolving discipline and the processes are improving at great speed. Here are a few fundamental best practices that can keep the momentum going:

  • Automation

Until now, the general practice in organizations has been to record lineage manually. Given the dynamic and fast-paced nature of production, manual tracking is no longer feasible. Best-in-class data catalogs are also recommended to boost automation: they integrate AI and ML to combine metadata sourced from multiple systems into a logical flow of lineage, and they can extract and draw conclusions from that metadata.

  • Metadata validation

Data is always susceptible to errors, which is why it’s important to include the owners of the different processes and tools in lineage tracing. The owners are closest to, and most aware of, the details generated by their applications, and can point out bugs or errors in the records or processes.

  • Inclusion of metadata source

Including the data generated by the different processes that process, transform, or transfer the data is vital to tracing lineage accurately. Therefore, the metadata these processes create should be pulled into lineage tracking.

  • Progressive extraction and validation

To map the lineage most accurately, it’s recommended to record metadata stage by stage, following the order of the data pipeline. This creates a well-defined timeline and organizes the huge log of metadata into a much more readable format. Progressive validation of this data also becomes easier: high-level connections can be verified first, and once they’re clear, the deeper intricacies can be validated level by level (see the sketch below). The progressive approach maintains a logical pattern and reduces errors while reading or extracting the data.
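A sketch of what progressive validation can look like, assuming the metadata log is ordered by pipeline stage (the stage records here are illustrative): verify that consecutive stages connect at all before drilling into field-level detail.

```python
# Metadata recorded stage by stage, in pipeline order (illustrative records).
stages = [
    {"stage": "ingestion", "outputs": {"raw.orders", "raw.customers"}},
    {"stage": "processing", "inputs": {"raw.orders", "raw.customers"},
     "outputs": {"clean.orders"}},
    {"stage": "serving", "inputs": {"clean.orders"}, "outputs": {"marts.revenue"}},
]

def validate_high_level(stages):
    """Level one: every stage's inputs must come from the previous stage's outputs."""
    problems = []
    for prev, curr in zip(stages, stages[1:]):
        missing = curr["inputs"] - prev["outputs"]
        if missing:
            problems.append((curr["stage"], missing))
    return problems

# Verify the coarse connections first; only then validate deeper details
# (schemas, row counts, checksums) stage by stage.
print(validate_high_level(stages) or "high-level lineage is consistent")
```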

Data lineage tools

Data lineage, though a relatively new discipline, has been evolving in the background for years. Initially, the tools were all version control systems, which eventually expanded into the larger discipline of data lineage. Let’s take a tour through the generations of version control tools to understand how they gave us data lineage.

Generations of Data Lineage Platforms:

  • 1st Generation: Version tracking was primitive, entirely manual, and mostly managed by one person who controlled access to the documents via “locks”.
  • 2nd Generation: The 2nd generation was a huge improvement since it allowed social collaboration, enabling multiple users, usually in-house, to work on the same code. Its one fatal drawback was inefficient code merging: developers had to merge code externally before making their final commits.
  • 3rd Generation: The 3rd generation addressed the drawbacks of the 2nd-generation tools and allowed developers, not just in-house but across the globe, to collaborate and merge their respective versions after committing, resolving differences at later stages. The worldwide network and easy merging enabled huge scaling, especially in the open-source community.
  • 4th Generation: The final and present generation of version control is part of data lineage platforms like Pachyderm. It improves on the 3rd generation by extending version control to the rather open-ended process of producing AI solutions. The job of the 4th generation is to keep track of all the processes and tools involved in the system, such as the cloud, storage, data versions, and algorithms, while maintaining immutability. Overall, it tracks the end-to-end flow of the data pipeline.

Tools/Platforms for Data Lineage:

Here are a few popular picks for Data Lineage tools.

Talend Data Catalog is a one-stop source for details on the processes that have acted on your data. It can search, extract, govern, and secure metadata from multiple sources, and it can automatically crawl those sources.

IBM DataStage combines analytics, cloud environments, governance, and DataOps in a single platform built for AI development. It delivers high-quality data through a container-based architecture.

Datameer offers a visual no-code experience for building data pipelines and allows collaboration for data experts to discover, access, model, and transfer data. The high-quality user experience along with great tech support makes Datameer a strong contender.

Neptune offers a single platform and dashboard to keep track of metadata and data logs, with simple features such as namespaces and basic logging methods for organizing and combining several types of ML metadata. Neptune’s user interface also displays data version control files; users need only specify the type they want to log. The overall visual experience includes charts, images, and tables, making the platform extremely user-friendly for data tracking.
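For flavor, logging lineage-style metadata with the neptune Python client looks roughly like the following. The project name and file paths are placeholders, and the exact API may differ between client versions, so treat this as a sketch rather than a reference:

```python
import neptune

# Placeholder project; a real run also needs valid credentials.
run = neptune.init_run(project="my-workspace/my-project")

# Namespaces ("data/...") group related metadata together.
run["data/source"] = "s3://bucket/orders.csv"       # hypothetical path
run["data/rows"] = 10_000
run["data/version"].track_files("data/orders.csv")  # track file-based data versions

run.stop()
```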


Future of data lineage

Over the last few years, several new disciplines have grown to the point where their impact reaches across industries. Technologies such as 5G, edge computing, the Internet of Things, and of course artificial intelligence are set to generate enormous volumes of invaluable data. To leverage this volume, a well-defined tracking system is the need of the future, and laying the foundation of that infrastructure is the responsibility of the present.

Once these technologies expand further along with the cloud, data is bound to be exposed to external agents like physical systems connected to the IoT, edge servers, and the cloud in general. Transformations on data will take place at distant and disconnected locations, which are often vulnerable endpoints with the potential to put the data pipeline in jeopardy.

Data lineage will become a competitive advantage for early adopters by securely governing a huge data landscape that is growing rapidly even today. With lineage in place, it will be much easier to track errors and vulnerabilities so that the systems responsible can be upgraded quickly. Furthermore, the extended benefits of adopting data lineage, such as reduced cost, high scalability, compliance, and high data and process quality, are expected to become must-haves for data-driven industries.
