Let Robots Do the Hard Work: Data Governance in the Age of AI
This talk explores how GenAI transforms data governance from compliance to innovation, driving data discovery, access, and automation at Metaphor
Metaphor leads with an innovative approach, offering critical insights into data flow and transformations.
Tracking data lineage is a MUST for modern organizations. Metaphor leads with an innovative approach, offering critical insights into data flow and transformations. Read on to see how we do it.
Data lineage is vital for navigating the complex data paths within an organization. It moves beyond a broad overview with table-level lineage (TLL) to detailed insights with column-level lineage (CLL), revealing transformations at the column level. This granularity enables our customers to pinpoint data ownership, trace issues back to their source, and identify who relies on the data downstream. Enhanced data governance and decision-making are the direct benefits.
Metaphor's journey towards building robust data lineage begins with seamless integration with diverse data systems. Using a set of connectors, Metaphor extracts metadata from various sources, including databases, data warehouses, and dashboards. While some systems offer TLL information, the absence of CLL poses a challenge. To bridge this gap, Metaphor uses connectors to extract query logs associated with data assets and employs backend SQL parsing to build the CLL graph.
Metaphor's quest for excellence in lineage tracking led to a series of iterations and adaptations. Initially, we used sql-metadata as the parser, and the focus was primarily on TLL. However, recognizing the importance of CLL, we transitioned to the open-source sqllineage library and forked our version with added capabilities and various bug fixes. It served as our backend parser for a while.
Despite its capabilities, sqllineage is limited by its underlying sqlparse library which is SQL dialect agnostic. Its complex lineage-building logic makes bug-fixing difficult. We knew we needed a change. This time around our tool of choice is SQLGlot. It is a dialect-aware SQL parser and is easier to use.
Our first attempt to rewrite the parser was inspired by what DataHub did. We tried to optimize the whole SQL expression before using SQLGlot to parse it. But it didn’t work well. We ended up missing important details, and our lineage was incomplete for some important test cases.
So, we tried something else. We manually went through the Abstract Syntax Tree (AST) in our second attempt. It worked for some tests, but it wasn’t scalable. Fixing bugs was a nightmare, and we needed something better.
That's when we had a breakthrough. We realized we could use SQLGlot’s built-in function for lineage traversal. We decided to focus on preparing the SQL expression instead of trying to do everything ourselves.
Here's what we did:
This new approach worked much better. It simplified our process and made it more efficient. Modifying the lineage parser according to our customer's requirements becomes achievable.
The integration of SQLGlot enabled Metaphor to build a dialect-aware lineage parser with high accuracy and coverage. It passed rigorous testing benchmarks and rolled out to all customers. By actively addressing parsing errors and customer feedback, Metaphor ensures continuous enhancement of data lineage.
In the end, SQLGlot turned out to be a great choice for us. By letting SQLGlot handle the hard part and focusing on preparing the SQL expression, we found a good balance between simplicity and functionality. This new way of doing things sets us up for a more sustainable and efficient future in lineage parsing.
As we move forward, one thing is clear: embracing change opens doors for innovation in data management. With SQLGlot on our side, we’re excited about what the future holds for lineage parsing.
The Metaphor Metadata Platform represents the next evolution of the Data Catalog - it combines best in class Technical Metadata (learnt from building DataHub at LinkedIn) with Behavioral and Social Metadata. It supercharges an organization’s ability to democratize data with state of the art capabilities for Data Governance, Data Literacy and Data Enablement, and provides an extremely intuitive user interface that turns even the most non-technical user into a fan of the catalog. See Metaphor in action today!