Simplifying Lineage Parsing with SQLGlot

Metaphor leads with an innovative approach, offering critical insights into data flow and transformations.

Software Engineer
Founding Engineer
 min. read
April 11, 2024
Simplifying Lineage Parsing with SQLGlot

Tracking data lineage is a MUST for modern organizations. Metaphor leads with an innovative approach, offering critical insights into data flow and transformations. Read on to see how we do it.

The Significance of Data Lineage

Data lineage is vital for navigating the complex data paths within an organization. It moves beyond a broad overview with table-level lineage (TLL) to detailed insights with column-level lineage (CLL), revealing transformations at the column level. This granularity enables our customers to pinpoint data ownership, trace issues back to their source, and identify who relies on the data downstream. Enhanced data governance and decision-making are the direct benefits.

Metaphor's Approach to Lineage

Metaphor's journey towards building robust data lineage begins with seamless integration with diverse data systems. Using a set of connectors, Metaphor extracts metadata from various sources, including databases, data warehouses, and dashboards. While some systems offer TLL information, the absence of CLL poses a challenge. To bridge this gap, Metaphor uses connectors to extract query logs associated with data assets and employs backend SQL parsing to build the CLL graph.

Evolutionary Path

Metaphor's quest for excellence in lineage tracking led to a series of iterations and adaptations. Initially, we used sql-metadata as the parser, and the focus was primarily on TLL. However, recognizing the importance of CLL, we transitioned to the open-source sqllineage library and forked our version with added capabilities and various bug fixes. It served as our backend parser for a while. 

Despite its capabilities, sqllineage is limited by its underlying sqlparse library which is SQL dialect agnostic. Its complex lineage-building logic makes bug-fixing difficult. We knew we needed a change. This time around our tool of choice is SQLGlot. It is a dialect-aware SQL parser and is easier to use.

Building Lineage Parser

First try: see how others are doing it

Our first attempt to rewrite the parser was inspired by what DataHub did. We tried to optimize the whole SQL expression before using SQLGlot to parse it. But it didn’t work well. We ended up missing important details, and our lineage was incomplete for some important test cases.

Second try: DIY everything

So, we tried something else. We manually went through the Abstract Syntax Tree (AST) in our second attempt. It worked for some tests, but it wasn’t scalable. Fixing bugs was a nightmare, and we needed something better.

Third try: only focus on things that matter

That's when we had a breakthrough. We realized we could use SQLGlot’s built-in function for lineage traversal. We decided to focus on preparing the SQL expression instead of trying to do everything ourselves.

Here's what we did:

  • We carefully looked for output columns in the SQL expression.
  • We figured out which columns were in the tables by looking at the expression. For example, if we had a query like SELECT * FROM table WHERE foo = bar, we knew foo and bar were columns in table.
  • We dealt with SQLGlot's problem of not handling columns that didn’t exist in the SQL expression. We came up with a solution. If SQLGlot didn’t recognize a column that we knew existed, we ran the lineage parsing again without telling it anything about the table columns.

This new approach worked much better. It simplified our process and made it more efficient. Modifying the lineage parser according to our customer's requirements becomes achievable.

Achieving Excellence

The integration of SQLGlot enabled Metaphor to build a dialect-aware lineage parser with high accuracy and coverage. It passed rigorous testing benchmarks and rolled out to all customers. By actively addressing parsing errors and customer feedback, Metaphor ensures continuous enhancement of data lineage.

Final thoughts

In the end, SQLGlot turned out to be a great choice for us. By letting SQLGlot handle the hard part and focusing on preparing the SQL expression, we found a good balance between simplicity and functionality. This new way of doing things sets us up for a more sustainable and efficient future in lineage parsing.

As we move forward, one thing is clear: embracing change opens doors for innovation in data management. With SQLGlot on our side, we’re excited about what the future holds for lineage parsing.

About Metaphor

The Metaphor Metadata Platform represents the next evolution of the Data Catalog - it combines best in class Technical Metadata (learnt from building DataHub at LinkedIn) with Behavioral and Social Metadata. It supercharges an organization’s ability to democratize data with state of the art capabilities for Data Governance, Data Literacy and Data Enablement, and provides an extremely intuitive user interface that turns even the most non-technical user into a fan of the catalog. See Metaphor in action today!