Data Documentation Neglect: The Hidden AI Killer Lurking in Your Organization
Skip the painful guesswork in data documentation—Metaphor’s AI-powered platform keeps you compliant, collaborative, and innovative.
Check out this engaging seminar on data, metadata, and GenAI, exploring their significance, challenges, and innovative solutions in today's tech landscape.
In this talk, our CTO Mars Lan shares his experience of solving various Data Governance challenges at LinkedIn and Metaphor using metadata, and how he leveraged Generative AI to supercharge the process.
All right, so like I said, today is probably a little different from your typical seminar. I’ll cover a topic that I’ve been working on, but I’ll stay high-level rather than diving into an academic talk. I'll make sure everyone can follow along, and I'm happy to take any questions at the end.
Today, I’m going to talk about three things only. If you walk away from this seminar having learned more than one of these three things, then it’ll be a success. The first thing is data, the second is metadata, and the third, which is probably the hottest topic right now, is generative AI.
So, who am I? I graduated from the University of Auckland with both a bachelor’s degree and a master’s degree. Then, I went to the US and did my PhD at UCLA. After that, I worked at Google, followed by LinkedIn, and most recently, I started a company focused on these three things.
Data is a simple word, but everyone probably has a different idea about what data is. Why is it important? A famous article from 2017 talks about how oil is no longer the most valuable resource—data is. Companies like Google and Facebook, the biggest in the world, are data-driven. Data is the lifeblood of these businesses; without data, they couldn’t have grown to where they are today.
Just to show that data is more important than oil, here’s a Google search trend from 2014 to now. The term "petrol" was overtaken by "data science" in searches about 10 years ago.
The official definition of data is long-winded: a collection of discrete or continuous values, basic units of meaning, etc. But for a layperson, data is often thought of as a spreadsheet with columns and values recorded in them.
Moving on to the next topic: metadata. What is metadata? Using a reference from the movie "Inception," where they talk about going deeper, metadata is data about data. Let that sink in—data about data.
Think of a spreadsheet. When you receive it, the first thing you ask is, where did this data come from? Who created it? What does each column mean? What do the values represent? When was this data created and last updated? How was it computed? The answers to these questions are metadata. It gives data its meaning. Without metadata, a spreadsheet is just a collection of values with no context.
You can classify metadata into three types. The first type is technical metadata, which defines what the data is. The second type is business metadata, explaining why the data exists and its purpose. Finally, there’s usage metadata, which looks at who uses the data, how, and where. These three types of metadata help give data its meaning and value.
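The three types described above can be pictured as three groups of fields attached to one dataset. This is only a minimal sketch: the class names, field names, and example values are all invented for illustration and do not describe any particular product's schema.

```python
from dataclasses import dataclass, field


@dataclass
class TechnicalMetadata:
    # What the data is: column names mapped to their types.
    columns: dict[str, str]


@dataclass
class BusinessMetadata:
    # Why the data exists and who is responsible for it.
    description: str
    owner: str


@dataclass
class UsageMetadata:
    # Who uses the data and how often (hypothetical fields).
    frequent_users: list[str] = field(default_factory=list)
    queries_last_month: int = 0


@dataclass
class DatasetMetadata:
    name: str
    technical: TechnicalMetadata
    business: BusinessMetadata
    usage: UsageMetadata


# Hypothetical example values, invented for illustration.
sales = DatasetMetadata(
    name="monthly_sales",
    technical=TechnicalMetadata(columns={"region": "string", "revenue": "float"}),
    business=BusinessMetadata(
        description="Revenue per region per month", owner="finance-team"
    ),
    usage=UsageMetadata(frequent_users=["analyst_a"], queries_last_month=42),
)
```

With a record like this, the questions from the spreadsheet example (what does each column mean, who owns it, who uses it) become simple attribute lookups such as `sales.business.owner`.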
Why is metadata important? In a typical enterprise setup, data consumers, like analysts and business users, need to know what data they have, how to find it, how to use it, and whether they can trust it. Metadata answers these questions, enabling them to make important decisions.
Data producers, like data engineers, spend a lot of time supporting others who ask questions about the data. Metadata can enable these producers to help others use data productively. Leaders, like COOs and heads of data, want to ensure data is used effectively and provides value. They need to know the ROI of their data initiatives.
As companies mature, they go through a value pyramid. At the bottom, they collect and store data. Next, they ensure data governance to use data reliably for business decisions. Then, they build reports and make predictions based on data. At the top is where AI and ML come in. Metadata sits in the data governance layer, helping people manage and trust the data.
Looking at macro trends, why has this problem become important? First, data democratization means enabling as many people as possible to make data-driven decisions. In the past, CEOs made decisions based on gut feelings. Today, decisions should be backed by data. This shift requires everyone in the company to have some level of data literacy.
New job titles have emerged, like data scientists and analysts, specializing in using data. Big Data technology advancements have also played a role, making it easier to process and query data. The advent of cloud computing has made data storage nearly free, encouraging more data creation.
Governments have also started regulating data use with laws like GDPR and CCPA, ensuring companies respect privacy and use data responsibly. These trends create challenges in finding, trusting, and managing data complexity.
How do we solve these challenges? Through centralized metadata management. This involves extracting metadata from various sources, centralizing it, and using it to drive different use cases.
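As a rough sketch of the extract, centralize, and use steps just described, the Python below merges metadata from two sources into one catalog and then answers a simple question from it. The extractor functions, the record layout, and the dataset names are hypothetical stand-ins, not real connectors.

```python
from collections import defaultdict


# Hypothetical extractors: each stands in for a connector to one source system.
def extract_from_warehouse():
    return [{"dataset": "monthly_sales", "columns": ["region", "revenue"]}]


def extract_from_dashboard_tool():
    return [{"dataset": "monthly_sales", "used_by": ["Revenue dashboard"]}]


def build_catalog(extractors):
    """Merge the records from every extractor into one entry per dataset."""
    catalog = defaultdict(dict)
    for extract in extractors:
        for record in extract():
            name = record["dataset"]
            details = {k: v for k, v in record.items() if k != "dataset"}
            catalog[name].update(details)
    return dict(catalog)


def find_datasets_with_column(catalog, column):
    """One example use case: search the central catalog by column name."""
    return [
        name for name, meta in catalog.items() if column in meta.get("columns", [])
    ]


catalog = build_catalog([extract_from_warehouse, extract_from_dashboard_tool])
```

Calling `find_datasets_with_column(catalog, "revenue")` then searches one central place instead of asking each source system separately.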
Now, onto the third topic: generative AI. This is a hot topic lately. Generative AI can create things like this picture, generated by an AI model from a description. The AI model can generate surprisingly accurate results, although there can still be mistakes.
Many of you have probably used ChatGPT. It can provide good answers and even have human-like conversations. At the heart of generative AI is something called a large language model (LLM). LLMs predict text based on patterns learned from vast amounts of data.
A large language model works by understanding text contextually. For example, if you’re reading a Sherlock Holmes novel, the LLM can predict what happens next by understanding the entire text. This allows AI like ChatGPT to provide coherent and relevant responses.
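To make "predicting text from learned patterns" concrete, here is a deliberately tiny sketch. It is not how an LLM works internally: it only counts which word follows which in a made-up sentence, whereas a real model is a neural network that takes much longer context into account. The function names and the corpus are assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up training text, invented for illustration.
corpus = "the detective examined the letter and the detective smiled"
model = train_bigram_model(corpus)
```

Here `predict_next(model, "the")` picks "detective" because that word follows "the" most often in the training text. The `None` case for unseen words is where a real model would still produce something plausible-sounding, which is the root of the hallucination problem.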
However, LLMs can hallucinate, meaning they sometimes generate incorrect information. To reduce this, we use techniques like Retrieval-Augmented Generation (RAG). Instead of relying solely on the LLM, RAG combines it with a database of verified information, which helps produce more accurate responses.
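The shape of RAG can be sketched in a few lines: first retrieve the most relevant verified text, then place it in the prompt that is sent to the LLM. In this sketch, retrieval is a simple word-overlap score standing in for a real embedding search, no LLM is actually called, and the documents and function names are invented for illustration.

```python
def tokenize(text):
    """Lower-case the words and strip basic punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, documents):
    """Combine the retrieved context with the question (the 'augmented' part)."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical verified documents, invented for illustration.
documents = [
    "The orders table is refreshed every night by the billing pipeline.",
    "The customers table is owned by the CRM team.",
]
prompt = build_prompt("Who owns the customers table?", documents)
```

The resulting `prompt` would then be passed to whichever LLM is in use. Because the answer is already in the context, the model has less room to make one up.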
Using LLMs and generative AI, we can improve tasks like summarization and documentation. For example, AI can summarize long email threads or create documentation from semi-structured data. This offloads tedious tasks from humans to AI, making processes more efficient.
In enterprise settings, quick and accurate answers are crucial. Generative AI can provide natural language answers to questions, reducing the need to sift through lengthy documentation. It can also proactively suggest related questions, enhancing the user's understanding.
Finally, while LLMs are powerful, they are expensive to train. RAG is a lighter approach that augments LLMs with external information, making it more accessible. This method allows organizations to leverage AI capabilities without the high cost of training models from scratch.
That's all from me today. Hopefully, you learned a thing or two. I'm happy to take any questions.
Thank you for explaining so well. I have a question regarding the last topic. How do you create those embeddings for retrieval-augmented generation?
You use an embedding model to turn each piece of text into a set of numbers representing the text’s meaning, and you store these embeddings in a vector database. When you ask a question, the same embedding model turns the question into an embedding, and the database returns the stored text whose embeddings are closest to it. That is how the relevant information gets retrieved.
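A toy version of that answer is sketched below. The `embed` function here just counts words from a fixed vocabulary, which is a crude stand-in for a learned embedding model, and the "vector database" is a plain Python list searched by cosine similarity. All names and documents are assumptions for illustration.

```python
import math
from collections import Counter


def words_of(text):
    """Lower-case the words and strip basic punctuation."""
    return [word.strip(".,?!").lower() for word in text.split()]


def embed(text, vocabulary):
    """Toy embedding: one number per vocabulary word (its count in the text)."""
    counts = Counter(words_of(text))
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical documents, invented for illustration.
documents = [
    "Revenue is reported per region each month.",
    "Customer emails are stored in the contacts table.",
]
vocabulary = sorted({word for doc in documents for word in words_of(doc)})

# The "vector database": each document stored next to its embedding.
vector_store = [(doc, embed(doc, vocabulary)) for doc in documents]


def search(question):
    """Embed the question the same way and return the closest stored document."""
    query = embed(question, vocabulary)
    best = max(vector_store, key=lambda item: cosine_similarity(query, item[1]))
    return best[0]
```

The key point carried over from the answer is that documents and questions go through the same `embed` function, so their vectors are comparable.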
Yes, large language models are expensive to train, costing tens or hundreds of millions of dollars. RAG is a cost-effective way to enhance LLMs with external data, making it easier for companies to adopt AI.
Vector databases efficiently handle embeddings, quickly finding the closest points in high-dimensional space. These databases are becoming more common, with traditional database vendors starting to offer vector database capabilities.
I tried ChatGPT and noticed it sometimes changes answers when asked if it’s sure. Is that because of hallucinations?
Yes, that’s the hallucination part. The model tries to predict the next word based on patterns, but if it lacks information, it can generate incorrect responses.
Regarding large language models, how do they determine relationships between words?
They use statistical models to learn relationships from training data. For example, "All Blacks" and "Black Caps" may be close in one dimension but far apart in another. Embedding models specialize in computing these relationships.
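The "close in one dimension but far apart in another" idea can be illustrated with two hand-written dimensions. The numbers below are invented, not taken from any real embedding model, and real models learn hundreds or thousands of dimensions that do not come with readable labels like these.

```python
# Hypothetical hand-written "embeddings" with two labeled dimensions.
embeddings = {
    "All Blacks": {"new_zealand_team": 0.9, "rugby_vs_cricket": 0.9},
    "Black Caps": {"new_zealand_team": 0.9, "rugby_vs_cricket": -0.9},
}


def distance_along(dimension, first, second):
    """Absolute difference between two terms along a single dimension."""
    return abs(embeddings[first][dimension] - embeddings[second][dimension])


same_country = distance_along("new_zealand_team", "All Blacks", "Black Caps")
different_sport = distance_along("rugby_vs_cricket", "All Blacks", "Black Caps")
```

Here `same_country` is zero because both are New Zealand national teams, while `different_sport` is large because one plays rugby and the other cricket.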
Can large language models reference the data they generate answers from?
Currently, LLMs don’t have built-in metadata for citing the sources behind their output. This could be an interesting area for future development.
Some people online asked for your slides. Are you happy to share them?
Yes, definitely. I’ll share the slides.
Any other questions? What programming language should we use for data science work?
Python is a popular choice due to its extensive libraries and ease of use for data science applications.
If you offer this as a cloud service to enterprises, how do you deliver it to private data inside organizations?
Services like Azure provide private AI solutions, ensuring data security and compliance.
When AI reads enterprise private data, does it become part of the global knowledge database, or can it be isolated?
Data privacy is a significant concern. Solutions should ensure that private data remains isolated and not disclosed.
Thank you for joining. If you have more questions or need further clarification, feel free to ask.
The Metaphor Metadata Platform represents the next evolution of the Data Catalog - it combines best-in-class Technical Metadata (learnt from building DataHub at LinkedIn) with Behavioral and Social Metadata. It supercharges an organization’s ability to democratize data with state-of-the-art capabilities for Data Governance, Data Literacy and Data Enablement, and provides an extremely intuitive user interface that turns even the most non-technical user into a fan of the catalog. See Metaphor in action today!