Recently there has been a lot of buzz—and confusion—in the data community on the topic of metadata management. You may have heard of it in various contexts such as data discovery, data provenance, data observability, data governance, data privacy (GDPR, CCPA, etc.), and DataOps/MLOps. Perhaps you spotted its shadow in the new Data OS or Data Mesh paradigms championed by numerous thought leaders. Or maybe it reared its head from Gartner's Active Metadata Management or Forrester's Enterprise Data Fabric market guides. There also seems to be an incessant stream of news in and around this area, dominated by new startups entering the fray and existing ones attaining astronomical valuations. However, when you take a closer look, every SaaS offering appears the same—yet they all claim to be doing something different.
Some of you may be wondering, "Isn't metadata management a solved problem? I vividly remember using one of those enterprise products a decade ago. What's the fuss?" while others simply throw up their hands and protest, "Why does it have to get so complicated? Why can't there be a single tool that solves it all?"
The reality is that metadata management is not a new problem, but it has taken on a whole new dimension since the introduction of the Modern Data Stack. What used to be a small, esoteric issue that only concerned the core data team has exploded into complex organizational challenges that plague companies large and small. In other words, if you have adopted the Modern Data Stack to democratize the creation, processing, and analysis of data, you need a companion tool to prevent your data ecosystem from turning into an unwieldy beast through decentralized, organic growth. You need a Modern Metadata Platform (MMP).
So what exactly is an MMP? How is it different from traditional metadata management solutions? To answer these questions, it helps to first describe the metadata platform we built at LinkedIn: DataHub, a system that had many of the key ingredients of an MMP.
LinkedIn was one of the pioneers in Big Data and data democratization. Its open data culture fostered the creation of several advanced data infrastructures such as Kafka, Pinot, and Gobblin, as well as innovative data products like Economic Graph, PYMK, and LinkedIn Salary. However, as the data function grew and became more democratized, the very openness that empowered citizen data scientists also made the data ecosystem less usable over time. It reached a point where there were literally millions of datasets in the system, and no one had any clue what they were, how they were computed, where to find the right data, or even whom to ask when there was a question.
In 2016, the metadata team was officially formed and tasked with solving the data search & discovery problem for LinkedIn. The team created WhereHows, a simple data discovery portal, and subsequently open-sourced it as the very first OSS project of its kind. Come 2017, the team's focus shifted to data privacy with the imminent enforcement of GDPR. Not only did the metadata we collected for search & discovery form the bedrock of LinkedIn's GDPR compliance, but our humble little web app also became the go-to tool for PII tagging, data masking, access requests, and data stewardship lifecycle management. To handle the scale and reliability that GDPR enforcement at LinkedIn demanded, the team evolved the simple monolith into a first-class three-tier architecture powered by microservices and modern storage systems.
The exercise taught us two important lessons. First, it's more efficient to build a common platform that supports multiple use cases than to build point solutions. One of the main reasons we could solve GDPR for LinkedIn within the strict deadline was that we could leverage what we had already built. Many of the backend infrastructure needs were common to search & discovery and compliance, and could therefore be consolidated into the platform. The second lesson was the power of centralizing the metadata, which unlocked use cases that were otherwise difficult to solve. For example, by joining the data catalog with PII tags, the obfuscation system was able to mask out sensitive columns when the data was accessed. By tracing data stewardship along the lineage, we were able to find out whom to contact when a deprecation or migration was planned. By comparing the PII tags against the output of an ML-based data profiling system, we could identify the fields that were likely to be tagged incorrectly. And that was just the beginning. Imagine how much more value we could create by collecting an even richer set of metadata!
And collect metadata we did. Over the following 18 months, we integrated with more than 40 teams and projects, gathered more than 200 types of metadata, and turned the whole thing into a true metadata platform. Since GDPR, DataHub has powered numerous new use cases at LinkedIn, including data provenance, data governance, data integration, MLOps, and API development. This is the essence of an MMP: a platform that integrates, processes, and serves rich metadata at scale to tackle the many complex organizational data challenges.
I hope that I've succeeded in proselytizing the virtue of a metadata platform by now. However, you may still have doubts about the need for a modern version. "Why aren't traditional metadata management solutions good enough?" The simple reason is scale and complexity.
Prior to the advent of the Modern Data Stack, the data ecosystem was a lot simpler. Most companies adopted a single end-to-end solution that Extracts, Loads, and Transforms data. Some were even equipped with Business Intelligence (BI) capability to provide a one-stop-shop experience. Life was easy with metadata since it was mostly generated and consumed in a single system. In fact, many solutions offered a data catalog and metadata management software right out of the box.
Fast forward a few years, and the old data stack has been turned on its head. Companies started flocking to vendors such as Snowflake, Databricks, Looker, and Fivetran for specialized SaaS solutions. The more courageous teams deployed open-source solutions like Spark, Presto, and Airflow. Even the cloud vendors jumped on the bandwagon by rolling out a hodgepodge of data services. The once fairly uniform data infrastructure quickly became a collection of products, each storing or producing its own specialized metadata in a silo. Centralizing and standardizing the metadata is no longer a walk in the park. To further complicate the matter, many companies create their own proprietary metadata, be it stored in a spreadsheet, a YAML file, or some form of registry or service. This metadata often brings a unique business angle to the data and makes it meaningful. After all, what's the point of storing all that data in the first place if not to improve the business?
Not only has the metadata become more complex and heterogeneous, but its scale has also grown significantly. Imagine a world where every version of a table schema is captured and stored, along with every column, every dashboard, every dataset in the lake, every query, every job run, every access history, etc. Very quickly, metadata starts to look and smell like a Big Data problem. Oh, did I forget to mention that you also need to traverse the metadata graph made up of 10s of millions of vertices and 100s of millions of edges? Still think you can hold all that "measly" metadata in a MySQL or PostgreSQL database?
So why do you need an MMP? Because your metadata can be as large and as complex as your data, and it deserves to be treated with the same respect.
Simply put, a great metadata platform looks awfully similar to a great data platform. What you'd typically expect from a great data platform (scalability, reliability, extensibility, and rich APIs) also applies to a great metadata platform. Additionally, to make an MMP truly useful, it should be dead simple to integrate new metadata sources, and the platform should provide full visibility into the integration process.
Now that we've established the potential scale of metadata, it's important to design a system capable of storing and serving at the same scale. Fortunately, this is largely a solved problem thanks to the technologies developed for the consumer Internet. NoSQL databases that can practically scale to infinity are readily available from various cloud vendors. Not a fan of NoSQL? No problem. Many NewSQL vendors will be happy to provide distributed SQL databases that keep scaling to your heart's content.
Another interesting aspect of scalability is indexing. While most database systems are perfectly capable of scaling secondary indexes, they still get bogged down easily by complex joins that involve a large number of tables/collections or by multi-level self joins. This is exactly where graph databases excel: traversing multiple hops that involve millions of relationships in the blink of an eye. Similarly, most secondary indexes quickly become useless for free-text searches fancier than exact keyword matches.
So what's the solution? To support a broad spectrum of query patterns at scale, you'll need to employ specialized data systems, such as graph databases and search engines, instead of abusing your transactional databases. Unfortunately, this introduces the extra complexity of synchronizing content across the data systems, along with various consistency challenges. Let's hope one day there will be a true multi-model database that magically satisfies all the scalability requirements.
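To make the multi-hop point concrete, here is a minimal sketch of a downstream lineage query written as a breadth-first traversal. Everything in it is an assumption for illustration: the dataset names, the in-memory `LINEAGE` dictionary, and the `downstream` helper are hypothetical, and a real MMP would hand this query to a graph database rather than a Python dict.

```python
from collections import deque

# Hypothetical lineage edges: upstream dataset -> datasets derived from it.
LINEAGE = {
    "raw.events": ["staging.events_clean"],
    "staging.events_clean": ["mart.daily_active_users", "mart.sessions"],
    "mart.daily_active_users": ["dashboard.engagement"],
    "mart.sessions": ["dashboard.engagement"],
}


def downstream(start, max_hops):
    """Return {dataset: hop_count} for everything within max_hops of start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        hops = seen[node]
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for child in LINEAGE.get(node, []):
            if child not in seen:  # first visit is the shortest path in BFS
                seen[child] = hops + 1
                queue.append(child)
    del seen[start]  # the starting dataset is not its own descendant
    return seen
```

In a relational store, each additional hop in this query would typically become another self join on the edge table, which is the pattern the previous paragraphs warn about.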
The serving layer of an MMP also needs to be scalable. While employees clicking around in a web app won't generate a significant amount of traffic, programmatic access to the metadata, e.g., by a data pipeline, can easily bring down an under-provisioned MMP. Luckily, scaling this part of the system is a lot simpler once we have a scalable storage backend. If the serving layer is kept stateless, we can keep scaling it by throwing more machines at the problem, assuming the storage can keep up.
Reliability and scalability often go hand in hand. Both are achieved by essentially adding more machines. Similar to scalability, building reliable infrastructure is also considered a solved problem in the age of cloud computing. What we need to focus more on is “data reliability” for MMP.
Data reliability extends beyond just the primary storage. As mentioned in the previous section, MMP needs additional specialized data systems, e.g., search engines and graph databases, to cope with the scale and complexity. It is therefore critical to synchronize the data across all these systems. A change in the primary storage should be replicated to other data systems in near real-time. Furthermore, there must be a way to easily bootstrap new indexes from the primary storage without introducing any downtime.
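As a rough illustration of these two requirements, the sketch below keeps an ordered change log in the primary store, lets a secondary index catch up incrementally, and lets a brand-new index bootstrap from a snapshot. The class names, the token-based "search index", and the offset bookkeeping are all hypothetical simplifications, not how DataHub or any particular product implements replication.

```python
class PrimaryStore:
    """Source of truth: current values plus an ordered change log."""

    def __init__(self):
        self.records = {}     # key -> latest value
        self.change_log = []  # list of (offset, key, value)

    def put(self, key, value):
        self.records[key] = value
        self.change_log.append((len(self.change_log), key, value))


class SearchIndex:
    """Toy secondary index: lowercase token -> set of keys."""

    def __init__(self):
        self.tokens = {}      # token -> keys whose value contains it
        self.indexed = {}     # key -> tokens currently indexed for that key
        self.next_offset = 0  # first change-log offset not yet applied

    def _upsert(self, key, value):
        for tok in self.indexed.get(key, ()):  # drop stale tokens first
            self.tokens[tok].discard(key)
        toks = set(value.lower().split())
        for tok in toks:
            self.tokens.setdefault(tok, set()).add(key)
        self.indexed[key] = toks

    def catch_up(self, store):
        """Incremental sync: apply only changes not seen yet."""
        for offset, key, value in store.change_log[self.next_offset:]:
            self._upsert(key, value)
            self.next_offset = offset + 1

    def bootstrap(self, store):
        """Build a new index from a snapshot, then resume from the log tail."""
        offset = len(store.change_log)
        for key, value in list(store.records.items()):
            self._upsert(key, value)
        self.next_offset = offset

    def search(self, token):
        return sorted(self.tokens.get(token.lower(), set()))


# Usage: one index follows the log, another is bootstrapped later.
store = PrimaryStore()
store.put("ds1", "User login events")
live = SearchIndex()
live.catch_up(store)
store.put("ds1", "User signup events")
live.catch_up(store)
fresh = SearchIndex()
fresh.bootstrap(store)
```

Because the existing index keeps serving while the new one is built on the side, a bootstrap like this does not have to take anything offline. A production system would also need to handle writes that arrive during the snapshot, which this single-threaded sketch ignores.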
Another aspect to consider is the audit history of metadata changes, especially metadata authored by humans. The change history is often as important as the latest value. For example, one of the most commonly seen causes for broken pipelines is a change in the table schema. Knowing what the exact changes were can help reduce the time-to-detect and time-to-solution. As a result, a good MMP should capture all the changes and provide an easy way to access them.
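A minimal sketch of what capturing and accessing schema changes could look like is shown below. The `SchemaHistory` class, its field names, and the diff format are assumptions made for this example; they are not taken from any specific metadata product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class SchemaVersion:
    columns: dict        # column name -> type name
    changed_by: str      # who made the change
    changed_at: datetime


@dataclass
class SchemaHistory:
    """Append-only history of one table's schema (hypothetical model)."""
    versions: list = field(default_factory=list)

    def record(self, columns, changed_by):
        # Copy the dict so later edits by the caller cannot rewrite history.
        self.versions.append(
            SchemaVersion(dict(columns), changed_by, datetime.now(timezone.utc))
        )

    def diff(self, old, new):
        """Describe what changed between two versions, given by index."""
        before = self.versions[old].columns
        after = self.versions[new].columns
        shared = before.keys() & after.keys()
        return {
            "added": sorted(after.keys() - before.keys()),
            "removed": sorted(before.keys() - after.keys()),
            "retyped": sorted(c for c in shared if before[c] != after[c]),
        }


# Usage: two schema versions authored by different people.
history = SchemaHistory()
history.record({"id": "bigint", "email": "string"}, changed_by="alice")
history.record({"id": "string", "signup_ts": "timestamp"}, changed_by="bob")
change = history.diff(0, 1)
```

With a history like this, the owner of a broken pipeline can see that `email` was removed and `id` changed type, and who made the change, instead of inferring it from a failing job.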
Making APIs extensible brings flexibility, customizability, and longevity to the platform. This often boils down to adopting an extensible data model for the APIs. Extensibility is especially important for an MMP, given the gamut of rich metadata it captures and the ever-evolving landscape of the data ecosystem.
One common approach is to add a "dictionary" field to hold arbitrary key-value pairs, or a string/byte array field to store a serialized complex object. While this does bring near-unlimited extensibility to the model, the "schemaless" nature of these fields also makes the model extremely difficult to use and offers no clear way to evolve it in a backward-compatible fashion. It's also very difficult to index the content of these "black-box" fields once they're stored, which leads to poor query performance.
Another way is to use strongly typed data models that evolve in a backward-compatible way. This is not very different from evolving the schema of a table or an Avro/Parquet file without breaking old readers. Some data formats, such as Protocol Buffers (as of proto3), take it one step further by dropping required fields altogether, so that adding or removing a field stays both backward and forward compatible.
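The idea can be sketched with plain Python dataclasses. The `OwnershipV1` and `OwnershipV2` models and the `decode` helper are hypothetical and only mimic what formats like Avro or Protocol Buffers do natively: new fields are optional with a default, and readers ignore fields they don't know about.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class OwnershipV1:
    owner: str


@dataclass
class OwnershipV2:
    owner: str
    team: Optional[str] = None  # added later; optional with a default


def decode(model, payload):
    """Build a typed model from a dict.

    Unknown keys are dropped (forward compatibility: an old reader can
    consume new data) and missing optional keys fall back to defaults
    (backward compatibility: a new reader can consume old data).
    """
    known = {f.name for f in fields(model)}
    return model(**{k: v for k, v in payload.items() if k in known})


old_payload = {"owner": "alice"}
new_payload = {"owner": "bob", "team": "growth"}

new_reader_old_data = decode(OwnershipV2, old_payload)
old_reader_new_data = decode(OwnershipV1, new_payload)
```

Unlike a free-form dictionary field, every attribute here has a name and a type that tooling can validate and index, while the model can still grow over time.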
In many cases, the platform and its API are almost synonymous. It’s the APIs that enable other applications, processes, or technologies to be developed on top of the platform. For a great MMP, APIs are not just prerequisites but the keys to unlocking the many metadata-powered use cases.
However, unlike many platforms where a simple web API would suffice, a great MMP must provide multiple "modalities" of APIs:
Ease of integration is no doubt a vital ingredient of a great MMP. After all, without bringing in metadata from various sources, a metadata platform simply becomes another one of the silos it's supposed to be breaking down.
Many would immediately link the word "integration" to a REST API. Clearly, sending HTTP requests to an API is so commonplace it would be strange to think that there's a better way to exchange data with a service. However, there are several issues with REST APIs that make them a less ideal choice for MMP:
There is actually another way to exchange data with a service without the above drawbacks: through an intermediary buffer. In the case of DataHub, we used Kafka as the buffer. It offered an evolvable schema, but its auditability and backfilling ability were limited by finite log retention. Kafka also added another layer of complexity to the system and is not trivial to maintain.
A better alternative is using cloud storage (S3, GCS, etc.) as the buffer. Most cloud storage systems support object versioning and audit logging, making debugging and backfilling a breeze. The reader and writer can also agree on a file schema that evolves over time. This approach enjoys the other benefits offered by Kafka as well, such as producer-consumer decoupling, high scalability, fault tolerance, and durability, while offloading the engineering and operational costs to the cloud vendors.
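The pattern can be sketched as follows, with a temporary local directory standing in for the bucket. The file layout (`<source>/<sequence>.jsonl`), the `publish` and `consume` helpers, and the event fields are all assumptions for illustration; a real integration would use the cloud vendor's SDK and an agreed-upon file schema.

```python
import json
import tempfile
from pathlib import Path


def publish(bucket: Path, source: str, seq: int, events: list) -> Path:
    """Producer side: write one batch as an immutable, ordered file."""
    path = bucket / source / f"{seq:08d}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(json.dumps(e) for e in events))
    return path


def consume(bucket: Path, source: str, after_seq: int = -1):
    """Consumer side: yield (seq, event) for batches newer than after_seq."""
    for path in sorted((bucket / source).glob("*.jsonl")):
        seq = int(path.stem)
        if seq <= after_seq:
            continue  # already processed according to the checkpoint
        for line in path.read_text().splitlines():
            yield seq, json.loads(line)


# Usage: a temporary directory plays the role of the bucket.
with tempfile.TemporaryDirectory() as tmp:
    bucket = Path(tmp)
    publish(bucket, "warehouse", 0, [{"table": "orders", "columns": 12}])
    publish(bucket, "warehouse", 1, [{"table": "orders", "columns": 13}])
    # Backfill: replay everything from the beginning.
    replayed = list(consume(bucket, "warehouse"))
    # Incremental: resume after the last checkpointed batch.
    incremental = list(consume(bucket, "warehouse", after_seq=0))
```

Because the batch files are never deleted or overwritten, backfilling is just a matter of resetting the consumer's checkpoint, and the files themselves double as an audit trail of what each source sent.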
In this post, we discussed what the Modern Metadata Platform is, why it is needed in the first place, and how to build a great MMP. We truly believe that MMP is an essential addition to the Modern Data Stack and will be adopted by more and more companies as they move up the data maturity ladder.
We're extremely passionate about this subject and welcome any feedback and inquiries. Feel free to reach out to me directly on LinkedIn or Twitter. We'll also keep sharing more of our thoughts in this and other related areas through our official LinkedIn Page and Twitter account.