Data Documentation Neglect: The Hidden AI Killer Lurking in Your Organization
Skip the painful guesswork in data documentation—Metaphor’s AI-powered platform keeps you compliant, collaborative, and innovative.
Ask data engineers about data documentation, and a common reaction is an uncomfortable giggle and an equally uncomfortable muttering of "Yeah sure, some of our stuff is documented." This is quickly followed by "but a lot of it isn't." Some common sentiments that follow are: "We have data all over the place and don't have time to do it all", "Things move so fast that it's hard to keep up", or "We document all the technical stuff, but the business users don't really care."
Documenting data is challenging at many levels. Tooling in the data world is disjointed, the average data professional isn't motivated to document their knowledge consistently, and the technology and data landscape is by nature fluid. As anyone with the word "Data" in their title would say, "It sucks, surely there's a better way."
However, through all these pains, companies at every level of data maturity still move forward and execute their business. Even the most data-forward companies, whose core business is built around data, struggle with documenting data at some level but manage to keep the ball rolling. How do they do it? It's simple: no matter how well or poorly data is documented, people are constantly communicating with each other and filling in the gaps of their collective knowledge.
This article talks about the missing piece of the puzzle in the data world: the social context.
As technology in the data world evolved, making it trivially simple to set up data lakes, warehouses, dashboards, etc., the number of data assets skyrocketed. Finding the right datasets, dashboards, and other data assets among the many copies scattered across the enterprise became a never-ending problem.
The first generation of tools to help solve some of these problems were called Data Catalogs. The zeitgeist of the day was, "We need a catalog, and we'll get the business users to come in to help with documenting the data." A great many budget cycles were spent arguing about the amazing problems data catalogs would solve, and many checks were written to a handful of companies who built these tools. Status meetings at enterprises started with updates on the famed catalog project, and the data teams cheered when the project successfully went live.
Catalogs scour metadata from various data assets and allow people to find tables and columns. Folks managing the data systems liked these solutions because they were very familiar with the names of the tables or columns. This was their lingua franca. However, the average business user couldn't care less about 'CUST_DATA.CUST_ID_v2' and how it's a better version than 'CUST_DATA_OLD.CUST_ID_OLD'. All they said was, "Thanks for all the gobbledygook, but just show me customer data."
When users did find the tables or dashboards (usually many copies of them), at worst most of them were undocumented, and at best the documentation had been written by data engineers. This meant that, for the most part, the documentation was exactingly accurate, technically brilliant even, but utterly useless for answering the questions the business user cared about. Time went by, and by and large business users stopped showing up to the party.
Not only did business users taper off, but people on the data teams also progressed along their respective career trajectories, moving to other roles or companies. A year or two after the initial implementation, companies discovered a natural waning of use and interest in the catalogs. As a result, documentation was no longer kept up to date.
The first couple of use cases the catalog was implemented for worked, somewhat. But as the use cases scaled and evolved, new crops of people creating and consuming data cycled through. Most importantly, as older data technologies gave way to low-friction, infinitely scalable, cloud-based data solutions, none of the traditional, technical catalogs kept up.
Catalog adoption grew across the industry because the implementation of the first use cases made for excellent customer success stories, yet adoption beyond these first use cases didn't pan out.
The current generation of catalogs focuses on exposing aspects of the data stack the data engineer cares about.
The new requirements for the data catalog are to highlight the underlying data quality, security, privacy, freshness, and overall observability of the data. The narrative is that if we can see more technical indicators of the wellness of the data, users are more likely to get value from the catalog. After all, who can deny that users will want to know that they can indeed trust the data they are trying to use?
While meaningful insights into the health of the data are invaluable, the catalog is fundamentally a data engineering-centric view of the world. Most consumers of data want nothing more than a simple indicator of trust. If the data is good, they'll use it. If it's not, they might dig around a little to figure out the issue, but any degree of complexity upstream of that asset is usually far outside their purview.
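To make the idea of a "simple indicator of trust" concrete, here is a minimal Python sketch that collapses a few technical health signals into a single label a data consumer could read at a glance. The `HealthSignals` fields, the 24-hour staleness threshold, and the label names are all assumptions made up for illustration; they do not describe how any particular catalog, including Metaphor, computes trust.

```python
from dataclasses import dataclass


@dataclass
class HealthSignals:
    """Hypothetical technical indicators for one data asset."""
    quality_checks_passed: bool
    hours_since_refresh: float
    has_owner: bool


def trust_indicator(signals: HealthSignals, max_staleness_hours: float = 24.0) -> str:
    """Collapse several technical signals into one consumer-facing label.

    The rules and thresholds here are illustrative assumptions only.
    """
    if not signals.quality_checks_passed:
        return "do not use"
    if signals.hours_since_refresh > max_staleness_hours or not signals.has_owner:
        return "use with caution"
    return "trusted"


if __name__ == "__main__":
    fresh = HealthSignals(quality_checks_passed=True, hours_since_refresh=2.0, has_owner=True)
    print(trust_indicator(fresh))
```

The detailed signals stay available to the data team, while the consumer sees only the label.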
When in doubt, most users will turn to a support channel or a thread on Slack to look for whoever is responsible for maintaining the data. The situation gets far messier if the data seems right but doesn't actually present the result they are looking for. The average data consumer then reaches out to their peers or the data team for an often laborious fact-finding mission.
In a fast-moving environment, no matter how rigorously data is documented, it becomes obsolete as soon as it is published. Data and its interpretations can change on a nightly basis, and people depending on that data may not be fully aware of changes by reading an outdated document. In most organizations, particularly non-regulated ones, not every edge-case or ephemeral scenario warrants an immediate update to the canonical documentation. Unfortunately, the gap between documentation and what's actually in the data widens over time, negatively impacting business.
No matter what state the official documentation is in, be it non-existent, sparse, or relatively up to date, people continue the conversation on platforms like Slack and try to fill in the gaps of their collective knowledge. Data discovery via tribal knowledge becomes the new norm.
Conversations about data are not just an exercise of troubleshooting problems after they occur. Many decisions about changes and new developments also happen on these conversation threads. Both the producers and consumers of data comment, approve, voice concerns, and in general, participate collectively around data. In essence, a vibrant social network exists that constantly shapes the trajectory of data in an organization.
At Metaphor we strongly believe the social context is a vastly untapped resource. When done right, it can be a powerful tool for data teams and their users. Every relevant conversation about data on an external communication tool is an opportunity to associate it with actual data within the systems of record.
We've architected Metaphor to be a metadata platform that brings in a technical context and combines it with business contexts such as higher-order metrics or KPIs AND the social context of how people are talking about data.
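As a rough illustration of what associating conversations with actual data assets could look like, the Python sketch below scans chat messages for mentions of known asset names and groups the messages by asset. It is not Metaphor's implementation: the `link_messages_to_assets` function, the message dictionary shape, and the simple name-matching approach are all assumptions for the sake of the example. The asset names are the ones used earlier in this article.

```python
import re
from collections import defaultdict


def link_messages_to_assets(messages, asset_names):
    """Map each known asset name to the chat messages that mention it.

    Matching is a case-insensitive search for the full asset name, rejecting
    matches that sit inside a longer identifier. Hypothetical helper.
    """
    patterns = {
        name: re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)", re.IGNORECASE)
        for name in asset_names
    }
    links = defaultdict(list)
    for message in messages:
        for name, pattern in patterns.items():
            if pattern.search(message["text"]):
                links[name].append(message)
    return dict(links)


# Made-up messages standing in for an export from a chat tool.
messages = [
    {"user": "ana", "text": "Is CUST_DATA.CUST_ID_v2 still the right column for customer counts?"},
    {"user": "raj", "text": "Please stop using cust_data_old.cust_id_old, it is deprecated."},
    {"user": "li", "text": "Lunch?"},
]
assets = ["CUST_DATA.CUST_ID_v2", "CUST_DATA_OLD.CUST_ID_OLD"]
links = link_messages_to_assets(messages, assets)
```

A real system would need more than exact name matching, since business users tend to say "customer data" rather than a column name, but even this simple link puts the discussion next to the asset it concerns.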
Suddenly "data discovery" takes on new meaning. Not only can people search for data assets they care about, but they can also see the thinking that went into decisions around that data. Seeing what similar data or topics others are searching for or talking about can lead to collaboration opportunities between people in different parts of the organization who normally would not have a natural avenue for connecting with each other. Data then becomes truly democratized and is a common thread that ties these disparate groups together.
As with any social network, the one centered around enterprise data also gives insights into the overall sentiment of users that's invaluable for the data team supporting those users. It also gives data leaders a unique vantage on the emergent uses of data, potentially crowdsourcing ideas on use-cases or even data monetization opportunities.
If expensive data catalogs or the lack thereof have not served your organization's best interests, and you'd like to get a better handle on your fast-moving data environment, reach out to us; we'd love to show you what we've built.
The Metaphor Metadata Platform represents the next evolution of the Data Catalog: it combines best-in-class Technical Metadata (learnt from building DataHub at LinkedIn) with Behavioral and Social Metadata. It supercharges an organization’s ability to democratize data with state-of-the-art capabilities for Data Governance, Data Literacy and Data Enablement, and provides an extremely intuitive user interface that turns even the most non-technical user into a fan of the catalog. See Metaphor in action today!