Data Documentation Neglect: The Hidden AI Killer Lurking in Your Organization
Skip the painful guesswork in data documentation—Metaphor’s AI-powered platform keeps you compliant, collaborative, and innovative.
This talk explores how GenAI transforms data governance from compliance to innovation, driving data discovery, access, and automation at Metaphor.
Event Transcript
Hello, good afternoon everyone, and welcome to the Data Science Speaker Series at the University of San Francisco. These talks are sponsored by the Data Institute, which is the umbrella organization for all things data science and data engineering at the University of San Francisco.
I am delighted to welcome you all to today’s event, where we will explore the exciting intersection of data governance, AI, and innovation. We are thrilled to have with us today Mars Lan, a distinguished leader in the tech industry and the co-founder and CTO of Metaphor Data. Mars has had a remarkable career, holding pivotal roles at leading tech companies like LinkedIn and Google. At LinkedIn, he spearheaded the development of their metadata infrastructure, which played a critical role in projects like GDPR compliance. This led to the creation and open sourcing of DataHub, a widely adopted metadata platform. Before his work at LinkedIn, Mars was a senior software engineer at Google, where he contributed to the development of Google Assistant and Google Cloud. Mars holds a PhD in Computer Science from UCLA, alongside degrees from the University of Auckland and the New York Institute of Technology.
We are also pleased to have Rishi Moohan, a talented machine learning engineer at Metaphor, with us today. Many of us already know Rishi, who is currently pursuing the Data Engineering concentration as part of his MSDS program here at USF. He holds a bachelor’s degree in industrial engineering from the University of Illinois Urbana-Champaign, and we are excited to have him share his insights.
Our discussion today will explore how generative AI is revolutionizing data governance, transforming it from a mere compliance framework into a dynamic enabler of innovation and value creation. We’ll hear about how AI tools at Metaphor are automating data management, unlocking hidden insights, and fostering a data-driven culture. Additionally, Rishi will share his journey from a USF intern to machine learning engineer at Metaphor, highlighting some exciting projects and potential opportunities for some of you future Metaphor hopefuls.
So, thank you all for being here. Please join me in welcoming Mars and Rishi!
Mars:
Hi everyone, glad to be back again. I was here two years ago—most of you wouldn’t remember that, and if you do, then maybe something’s wrong with you! But yeah, happy to be back and to give a different talk from what I gave two years ago.
First, let me clarify that the bio Elizabeth just read was written by ChatGPT, not me! So, I didn’t call myself a distinguished leader in the tech industry—that was all AI. Anyway, no need to repeat what’s been said, so let me dive into today’s topic.
Let me give a quick plug to our company, Metaphor. We’re an early-stage startup. And what does that mean? It means we’re young (less than four years old) and we’re small (fewer than 20 people). We also move very, very fast. Rishi can probably attest to that from his practicum experience. The learning curve is steep. During his internship, he had to learn a brand-new programming language in a short time, and the impact he made was significant. I’ve interned three times myself at two different big tech companies, and I can tell you, the work I did in those internships can’t compare to what we’re doing at Metaphor.
We’re also a remote-first company. We started during the pandemic, never had an office, and probably never will. We have people in the Bay Area, but we’re spread around the world. We try to do in-person offsites every three to six months, and maybe that will happen during your internship if you join us.
Now, let me quickly talk about our tech stack, because that might be interesting to some of you. As I mentioned, we’re a big TypeScript shop—both our front-end and back-end are written in TypeScript. We also use Python extensively for AI applications. We do development primarily in AWS, where we leverage serverless infrastructure. We also use modern technologies like OpenAI, LangChain, and vector indexes to drive our AI capabilities.
As for our team, the people at Metaphor come from companies like LinkedIn, Google, Amazon, and Facebook. We’re also backed by well-known investors. Over the past two years, we’ve hired USF interns—two each year—and one of those interns eventually joins us full-time. Rishi, who you’ll hear from later, is a perfect example of this.
Now, what exactly is Metaphor? In short, we’re a data catalog for every team. Our platform is designed to make data governance accessible and manageable for companies of all sizes. Let me take a moment to explain data governance. I know some of you raised your hands when asked if you understand data governance, but let me break it down further.
Data governance is essentially keeping your metadata clean. Notice that I said metadata, not data. Data governance isn’t about ensuring data consistency or removing invalid entries—that’s data quality. Data governance is about making sure your metadata, the data about your data, is organized and accessible. For example, you want to know who created a dataset, what the different columns mean, where the data comes from, and when it was last updated. That’s metadata.
And why is metadata important? Let me give you an example. In data science, it’s common to encounter datasets in the form of Excel files or CSV files. You might look at these files and see a bunch of columns with names like "ID," "Name," "Date," and "Value." But what do those columns actually mean? You’ll likely have questions—who created this data? What do the columns represent? Where did the data come from? When was it last updated? Without answers to these questions, the dataset is pretty much useless.
That’s where data governance comes in—it helps you keep track of this metadata. In the real world, there are different types of metadata. First, there’s technical metadata, which describes the structure of the data, such as the schema, column names, and data types. Then there’s business metadata, which explains why the data exists in the first place, such as business use cases and metrics. Finally, there’s behavioral metadata, which tells you who is using the data, how it’s being used, and what conversations are happening around it.
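As a rough illustration of the three kinds of metadata described above, a single catalog entry might be modeled along these lines. This is a minimal sketch, not Metaphor's actual schema: every class name, field name, and sample value below is hypothetical and chosen only to mirror the questions raised in the talk (who created the data, what the columns mean, where it came from, when it was last updated).

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class TechnicalMetadata:
    """Structure of the data: column names mapped to data types."""
    columns: dict[str, str]


@dataclass
class BusinessMetadata:
    """Why the data exists: column meanings and business use cases."""
    column_descriptions: dict[str, str] = field(default_factory=dict)
    use_cases: list[str] = field(default_factory=list)


@dataclass
class BehavioralMetadata:
    """Who is using the data and how heavily."""
    recent_users: list[str] = field(default_factory=list)
    queries_last_30_days: int = 0


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry combining all three metadata types."""
    name: str
    owner: str          # who created the dataset
    source: str         # where the data comes from
    last_updated: date  # when it was last updated
    technical: TechnicalMetadata
    business: BusinessMetadata = field(default_factory=BusinessMetadata)
    behavioral: BehavioralMetadata = field(default_factory=BehavioralMetadata)


def undocumented_columns(record: DatasetRecord) -> list[str]:
    """Return columns that have a type but no business description."""
    described = record.business.column_descriptions
    return [col for col in record.technical.columns if not described.get(col)]


# Illustrative values only; they are not taken from any real dataset.
record = DatasetRecord(
    name="sales.csv",
    owner="analytics-team",
    source="point-of-sale export",
    last_updated=date(2024, 1, 15),
    technical=TechnicalMetadata(
        columns={"ID": "int", "Name": "str", "Date": "date", "Value": "float"}
    ),
    business=BusinessMetadata(
        column_descriptions={"ID": "Unique order identifier"}
    ),
)
print(undocumented_columns(record))
```

Keeping technical, business, and behavioral metadata in separate structures makes it easy to see which part of the record is missing, which is the gap that documentation work has to fill.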
Collecting and managing this metadata is hard work, but it’s essential to ensure your data remains useful. And that’s where Metaphor’s AI comes in—we automate this process so that you don’t have to manually track every single piece of metadata.
I’m going to hand it over to Rishi now, who will talk about some of the AI models we’re using at Metaphor.
Rishi:
Hi, everyone! It’s great to be back at USF, where I’m currently a data engineering student. I’ve been working with Metaphor for a while now, and during my practicum, I worked on some cool AI features that I’ll briefly describe today.
At Metaphor, we focus on leveraging embeddings to power AI-driven insights. Embeddings allow us to condense text into vectors, which helps us retrieve similar data quickly and efficiently. For example, if a user has a question about a specific dataset, instead of going through thousands of tables manually, we can retrieve the relevant data using embeddings.
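The retrieval idea described above can be sketched in a few lines. This is an assumption-heavy toy, not Metaphor's implementation: the `embed` function is a bag-of-words stand-in for a real embedding model, and the table names and descriptions are made up. What it does show faithfully is the general technique of turning text into vectors and ranking candidates by cosine similarity.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical table descriptions, used only for illustration.
TABLE_DOCS = {
    "sales.orders": "customer orders with order date and total value",
    "hr.employees": "employee names titles and hire date",
    "web.page_views": "page views per visitor session on the website",
}


def search(question: str, docs: dict[str, str], top_k: int = 1) -> list[str]:
    """Return the names of the top_k tables most similar to the question."""
    query_vec = embed(question)
    ranked = sorted(
        docs,
        key=lambda name: cosine(query_vec, embed(docs[name])),
        reverse=True,
    )
    return ranked[:top_k]


print(search("total value of customer orders", TABLE_DOCS))
```

In a production system the vectors would typically come from a learned embedding model and be stored in a vector index, so that similar meanings match even when the wording differs.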
One of the projects I worked on was integrating documentation tools like Notion, Confluence, and SharePoint into Metaphor. These platforms often store important metadata about datasets—who owns them, what they’re used for, and what business processes they support. By pulling this information into Metaphor, we’re able to offer a more complete picture of the data landscape within an organization.
Another important feature we’re working on is retrieval-augmented generation (RAG). Essentially, we use RAG to ensure that when the AI generates answers, it’s pulling from relevant business documentation rather than just relying on general knowledge. This keeps the answers grounded in the context of the specific organization’s data, ensuring accuracy and relevance.
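A minimal sketch of the RAG flow described above might look like the following. Everything here is assumed for illustration: the sample documents are invented, retrieval uses simple word overlap instead of embeddings, and `generate` is a placeholder for whatever language model call an organization uses. The point is only the shape of the technique: retrieve relevant passages first, then ask the model to answer from them.

```python
import re
from typing import Callable

# Hypothetical business documentation snippets.
DOCS = [
    "Revenue is reported in USD and excludes refunds.",
    "The orders table is refreshed nightly at 02:00 UTC.",
    "Employee records are owned by the HR analytics team.",
]


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by word overlap with the question (toy retriever)."""
    q = tokens(question)
    ranked = sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble a prompt that grounds the model in retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


def answer(question: str, docs: list[str], generate: Callable[[str], str]) -> str:
    """Retrieve, build the grounded prompt, then delegate to the model."""
    passages = retrieve(question, docs)
    return generate(build_prompt(question, passages))


# A stub model that echoes the prompt, so the sketch runs without an API.
print(answer("When is the orders table refreshed?", DOCS, lambda prompt: prompt))
```

Passing the model in as a function keeps the retrieval and prompt-building steps testable on their own, independent of any particular model provider.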
Mars already touched on how we automate a lot of the metadata collection process, but let me emphasize the importance of getting clean, high-quality data. We often say, "garbage in, garbage out." If we don’t bring in quality data, we won’t get quality insights. That’s why we’re so focused on making sure the data we work with is clean and well-organized.
And yes, as Mars mentioned, we love AI, but unfortunately, it doesn’t bake cookies—yet. Maybe one day!
Mars:
Thanks, Rishi. You did a great job explaining the AI side of things. One of the challenges we face in data governance is dealing with the human element. While AI can automate a lot of the work, there’s still a need for human oversight. Someone has to review and validate the AI’s suggestions to make sure they’re accurate.
Another exciting project we’re working on is integrating our AI directly into tools like Slack and Teams. Imagine being able to ask a question in Slack and having our AI jump in with the answer—no need to leave your workflow or switch tools.
Finally, we’re also working on automating column descriptions in data catalogs. With thousands of datasets, manually documenting each one can be a huge burden. By automating this process, we’re reducing the workload on data engineers and analysts, allowing them to focus on more high-value tasks.
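One way to combine automated column descriptions with the human review mentioned earlier is to treat every AI draft as a pending suggestion that is published only after approval. The sketch below is an assumption about how such a workflow could be structured, not a description of Metaphor's product: the class names, statuses, and the `draft_description` stub (which stands in for a model call) are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Suggestion:
    """An AI-drafted column description awaiting human review."""
    table: str
    column: str
    text: str
    status: Status = Status.PENDING


def draft_description(table: str, column: str) -> str:
    """Stand-in for an AI model call; returns a placeholder draft."""
    return f"Draft description for column '{column}' in table '{table}'."


def propose(table: str, columns: list[str], existing: dict[str, str]) -> list[Suggestion]:
    """Draft suggestions only for columns that lack a description."""
    return [
        Suggestion(table, col, draft_description(table, col))
        for col in columns
        if not existing.get(col)
    ]


def publish(suggestions: list[Suggestion], existing: dict[str, str]) -> dict[str, str]:
    """Return updated descriptions, applying only approved suggestions."""
    updated = dict(existing)
    for s in suggestions:
        if s.status is Status.APPROVED:
            updated[s.column] = s.text
    return updated


existing = {"id": "Primary key"}
suggestions = propose("orders", ["id", "value", "date"], existing)
suggestions[0].status = Status.APPROVED  # a reviewer accepts the first draft
print(publish(suggestions, existing))
```

Because unreviewed drafts never reach the catalog, the automation reduces the writing workload while the reviewer keeps the final say.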
Q&A Session
Audience Question: How real is the idea of removing the human element from data governance?
Mars: That’s a great question. In the current state of technology, you can’t completely remove the human element. AI can assist and automate a lot of the work, but there’s still a need for human validation. Think of AI as a tool that speeds up the process—it’s there to help, not to replace.
Audience Question: How do you deal with unstructured data at Metaphor?
Mars: We primarily work with structured and semi-structured data, like JSON and Parquet. We don’t currently support fully unstructured data such as images or video files. That said, text-based unstructured data is something we do handle, and we’re exploring ways to expand our capabilities in the future.
Audience Question: How does Metaphor manage access control for sensitive data?
Mars: We only deal with metadata, not the actual data itself. But for organizations with strict access controls, we’re working on a feature that will allow users to query datasets through their own credentials. This way, access permissions are enforced by the underlying system, and we don’t need to worry about managing access control ourselves.
Mars: Well, that brings us to the end of our presentation. No doubt, you’re hearing the term AI everywhere—in class, outside of class, in the news—and today was no different. But I’ll say it one more time: AI is transforming the way we work, and it’s an exciting time to be in this field. Thank you all for your time, and if you have any more questions, feel free to stick around.
The Metaphor Metadata Platform represents the next evolution of the Data Catalog: it combines best-in-class Technical Metadata (learnt from building DataHub at LinkedIn) with Behavioral and Social Metadata. It supercharges an organization’s ability to democratize data with state-of-the-art capabilities for Data Governance, Data Literacy, and Data Enablement, and provides an extremely intuitive user interface that turns even the most non-technical user into a fan of the catalog.