Despite the belief that OSS is safer with many eyes on it, recent hacks prove that hidden vulnerabilities can still slip through, risking major disasters.
Many believe that open-source software (OSS) is inherently more secure due to the “many eyes” hypothesis—the idea that with more developers and users scrutinizing the code, vulnerabilities are more likely to be spotted and fixed. That’s why more than 70% of phones and 80% of servers are powered by Linux and a plethora of OSS tools and libraries, right?
Unfortunately, this assumption doesn’t always hold true. As the recent XZ Utils hack shows, even a widely used OSS project can harbor serious vulnerabilities that go unnoticed for years, and this one almost caused an $8.8T infrastructure disaster.
Here are a few other notable vulnerabilities that were discovered in popular OSS projects and are likely still being widely exploited:
Let us illustrate our points further by using two popular OSS projects we’re most familiar with—DataHub, which we created and rewrote, and OpenMetadata. The two projects boast nearly 15k stars and 4k forks combined and have been adopted by hundreds of companies, including Apple, LinkedIn, Netflix, Pinterest, Visa, BlackRock, Genworth, Klarna, and Optum.
Firstly, let’s take a look at the public reports. In March 2023, GitHub conducted a security audit on DataHub and found 10+ security vulnerabilities. In April 2024, Microsoft Threat Intelligence published a blog on OpenMetadata’s critical vulnerabilities, which were actively exploited by hackers for crypto mining on Azure. As you can imagine, these security issues were fixed pretty quickly by the community.
However, high-visibility research and publications happen only once in a while—and often after the vulnerabilities have been exploited. On the other hand, new vulnerabilities and threats emerge constantly, especially for projects with complex dependencies like DataHub and OpenMetadata. Fortunately, GitHub provides supply chain scanning, which automatically surfaces issues as Dependabot alerts.
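For a repository you administer, alerts like these can be tallied by severity with a few lines of Python. The sketch below is a minimal illustration that works on an already-downloaded list of alerts. It assumes each alert follows the shape returned by GitHub's REST endpoint `GET /repos/{owner}/{repo}/dependabot/alerts` (a `state` field plus a `security_advisory` object with `severity` and `cvss.score`). The function name `summarize_alerts` and the sample payload are hypothetical and are not real findings from any project.

```python
from collections import Counter
from typing import Any


def summarize_alerts(alerts: list[dict[str, Any]]) -> dict[str, Any]:
    """Count open alerts per severity and report the highest CVSS score seen."""
    open_alerts = [a for a in alerts if a.get("state") == "open"]

    severities: Counter = Counter()
    max_cvss = 0.0
    for alert in open_alerts:
        advisory = alert.get("security_advisory") or {}
        severities[advisory.get("severity", "unknown")] += 1
        # The score may be missing or null, so fall back to 0.0.
        score = (advisory.get("cvss") or {}).get("score") or 0.0
        max_cvss = max(max_cvss, float(score))

    return {
        "open": len(open_alerts),
        "by_severity": dict(severities),
        "max_cvss": max_cvss,
    }


# Made-up sample data for illustration only (assumed field layout).
SAMPLE_ALERTS = [
    {"state": "open", "security_advisory": {"severity": "high", "cvss": {"score": 8.8}}},
    {"state": "open", "security_advisory": {"severity": "medium", "cvss": {"score": 5.3}}},
    {"state": "fixed", "security_advisory": {"severity": "critical", "cvss": {"score": 9.8}}},
    {"state": "open", "security_advisory": {"severity": "high", "cvss": {"score": None}}},
]

summary = summarize_alerts(SAMPLE_ALERTS)
print(summary)
```

Only open alerts are counted, so an alert that has already been fixed does not inflate the totals, however severe it was.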
For good reason, these alerts are only visible to repository owners in the hope that they’ll address them quickly and publish public security advisories. Yet, there’s a very simple way to see these alerts—even before they’re fixed—by forking the repository, and that’s exactly what we did. Here are our findings at the time of writing.
As shown below, DataHub currently has 16 High-Severity and 16 Moderate-Severity alerts with CVSS scores as high as 8.8/10. The oldest CWE dates back to 2006—that’s a year before the original iPhone was released! 🤯 If you set up a DataHub instance at Black Hat, it’d be pwned within minutes.
OpenMetadata fared marginally better with 9 High-Severity and 4 Moderate-Severity alerts. It is also plagued by the same 2006 CWE and the 8.8/10 vulnerability.
[False positive. See 1. in footnote] Worse yet, the repository contains two cleartext secrets! While these may be credentials for testing purposes, this is precisely how 12 million secrets and keys were leaked and exploited on GitHub in 2023. It also indicates a general lack of security awareness and best practices on the part of the team behind the project.
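To make the idea of a cleartext-secret finding concrete, here is a rough sketch of a pattern-based scan. It is not GitHub's secret scanning implementation. The two patterns, the function name `find_secrets`, and the sample text are illustrative assumptions, and the key shown is the placeholder value AWS uses in its own documentation. Note that a scan like this cannot tell a test credential from a live one, which is one way false positives arise.

```python
import re

# Illustrative patterns only; real scanners use far larger, vendor-specific rule sets.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hardcoded_password": re.compile(
        r"""\b(?:password|passwd|secret)\s*[:=]\s*['"][^'"]{4,}['"]""",
        re.IGNORECASE,
    ),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern name) for every line that matches a pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


# Hypothetical config snippet; the key is AWS's documented example value.
SAMPLE = '''\
db_host = "localhost"
password = "test-only-password"
aws_key = "AKIAIOSFODNN7EXAMPLE"
'''

for lineno, name in find_secrets(SAMPLE):
    print(f"line {lineno}: possible {name}")
```

Every hit still needs a human to decide whether the value is a real credential, a test fixture, or a documentation placeholder.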
Some may dismiss these findings thinking, “So what? These are just data catalogs containing metadata. There’s no sensitive data at risk even if they’re compromised.” Sadly, this is not the case. Depending on the integrations/connectors, you may need to store superuser credentials in the database. Just imagine the consequences of leaking the admin credentials for your Databricks or the credentials with SELECT privilege to all your Oracle tables!
Another argument may be, “So what? My DataHub/OpenMetadata instance runs safely behind the firewall. No one can access it from the Internet.” Well, any Zero Trust advocate will happily preach to you why a firewall and VPN alone are never enough. In fact, that’s exactly how Google got hacked in 2009, an attack that prompted its eventual exit from China. I rest my case.
While OSS offers many advantages, its openness can also be its Achilles’ heel. The examples we’ve explored illustrate that security in OSS is not guaranteed by visibility alone. Vigilance, robust practices, and constant monitoring are essential to protect against the ever-evolving threat landscape.
At Metaphor, we’re serious about security and provide services to some of the world’s largest financial institutions with stringent security requirements. Please don’t hesitate to reach out to us to discuss this further.
________________________________
The Metaphor Metadata Platform represents the next evolution of the Data Catalog: it combines best-in-class Technical Metadata (learnt from building DataHub at LinkedIn) with Behavioral and Social Metadata. It supercharges an organization’s ability to democratize data with state-of-the-art capabilities for Data Governance, Data Literacy, and Data Enablement, and provides an extremely intuitive user interface that turns even the most non-technical user into a fan of the catalog. See Metaphor in action today!