ChatGPT experimentation is going on everywhere. As a long-time data person and someone who has been very focused on data cataloging (namely Alation’s catalog) for several years, I have started using it in my work in ways that you might find interesting. Before I get to some examples, it's essential that I set a bit of context.
The best implementation of a catalog is not as a modern-day data dictionary serving primarily technical users, but as a knowledge base and repository for a broad network of data and data-related assets. This breadth includes business process descriptions, metric/KPI definitions, common term definitions, shared BI reports, briefing books, and the social interaction that helps build a community of expertise. Approached this way, the catalog becomes the platform that supports the development of a rich data culture and data literacy, providing as much value to non-technical users as to technical ones.
Given that, it will come as no surprise that I am always looking for opportunities to automate the enrichment of knowledge (call it metadata if you prefer) in the catalog. Enter ChatGPT. Here are a few ways that I’ve started doing that.
Glossary Terms, Metrics, and Reports
Terms and their definitions are one of the most powerful ways to share a common understanding and language within a data culture. No one wants to start from scratch, so I’ve often been asked whether I have a standard set of glossary terms and definitions for specific industries.
What I’ve started doing is asking ChatGPT something like “Provide a comprehensive list of property and casualty insurance terms in a delimited format.” What it responds with is a respectable list of industry-specific terms and their definitions in a catalog load-ready format.
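The delimited output lends itself to a small post-processing step before the catalog load. Below is a minimal sketch of that step; the `sample_response` string is a stand-in for an actual ChatGPT reply, not real API output, and the pipe delimiter is an assumption.

```python
import csv
import io

# Stand-in for a pipe-delimited ChatGPT response (illustrative, not real output).
sample_response = """Term|Definition
Premium|The amount paid by a policyholder for insurance coverage.
Deductible|The amount the insured must pay before the insurer pays a claim.
Underwriting|The process of evaluating and pricing insurance risk."""

def parse_glossary(response_text, delimiter="|"):
    """Turn 'Term|Definition' lines into (term, definition) tuples,
    skipping the header row, ready for a catalog bulk load."""
    reader = csv.reader(io.StringIO(response_text), delimiter=delimiter)
    rows = list(reader)
    return [(term.strip(), definition.strip()) for term, definition in rows[1:]]

terms = parse_glossary(sample_response)
print(terms[0])  # → ('Premium', 'The amount paid by a policyholder for insurance coverage.')
```

From here, the tuples can be written out in whatever load format your catalog expects.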
My approach to industry/org-specific metrics and KPI definitions is the same. I simply ask ChatGPT, “List common human resource organization metrics in a delimited format.”
Sure, the response does not include the actual calculations, but that’s no problem: just loop through each metric and ask for it.
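That follow-up loop can be sketched in a few lines. The metric names and the prompt wording below are illustrative assumptions, and the actual API call is left out.

```python
# Illustrative metric names; in practice these come from the earlier ChatGPT response.
metrics = ["Time to Hire", "Employee Turnover Rate", "Cost per Hire"]

def build_calculation_prompt(metric):
    """Build one follow-up prompt per metric asking for its calculation."""
    return (
        f"Provide the standard calculation for the human resources "
        f"metric '{metric}' as a single formula."
    )

# One prompt per metric, each of which would be sent to the API in turn.
prompts = [build_calculation_prompt(m) for m in metrics]
for p in prompts:
    print(p)
```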
I like to do the same with reports, asking ChatGPT for the common reports used in a specific organization or industry. While the list will need to be tweaked for a specific organization, it is a great starting point for comparative and gap analysis.
Table and Column Descriptions
A huge amount of time is spent by both producers and consumers of data trying to simply figure out what things are. Reducing that level of effort can directly impact everyone's productivity.
The challenge is that the volume of data assets (think schemas, tables, columns, files, and fields) in the typical organization runs into the millions, and there is no cost-effective way to maintain descriptions for all of them. In recent years, we have stopped assuming it was all the responsibility of data stewards and have moved towards a crowd-sourcing approach. That has helped, but it still can’t keep up.
What I’ve started doing is asking ChatGPT to describe columns and tables based on their names. In the screenshot below you will notice both a technical and a logical name. That is because Alation always collects the technical name and may or may not have a logical name, depending on whether Alation’s lexical algorithm or a user gave it one.
When I run my script, I ask for a description of the column using the logical name if it exists. I don’t show the Python script here, but I simply ask ChatGPT to describe a database column named [X]. Sometimes I make the question more specific based on what I know, such as the application, industry, or master data domain (location, customer, product).
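The prompt-building part of that logic can be sketched as below. The function name, sample column, and context wording are my own assumptions, not the author's actual script.

```python
def describe_column_prompt(technical_name, logical_name=None, context=None):
    """Prefer the logical name when the catalog has one, fall back to the
    technical name, and fold in any extra context when known."""
    name = logical_name or technical_name
    prompt = f"Describe a database column named '{name}'."
    if context:
        prompt += f" The column is used in {context}."
    return prompt

# Hypothetical example column with a logical name and known industry context.
print(describe_column_prompt(
    "CUST_ACQ_DT",
    logical_name="Customer Acquisition Date",
    context="a property and casualty insurance application",
))
```

The same function works for tables by swapping "column" for "table" in the template.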
The descriptions are not perfect, but I can run these in batches and provide data consumers with a starting point. They are much more likely to change or suggest changes to a description rather than spend time coming up with them from scratch.
Query Descriptions
One way that the Alation catalog provides value to both data producers and consumers is by allowing them to write, share, and publish queries for each other. In fact, Alation will also pull in all the queries that have been run against a data source and tell you which is the most popular, who ran them, and more.
The problem is that understanding what a query is doing takes time. That is not a big deal for the best query writers, but it’s a significant barrier for those of us (myself included) whose skills are marginal.
What I’ve started doing is asking ChatGPT to describe the queries. Again, it’s not perfect, but it’s a much better starting point than a blank screen or time spent cajoling experts to write one.
If that is a bit too detailed, a request for a simple description provided this.
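Both the detailed and the simple request follow the same pattern: wrap the SQL in a prompt and vary the level of explanation asked for. A minimal sketch, with an illustrative query rather than a real saved one:

```python
def describe_query_prompt(sql, detailed=True):
    """Wrap a SQL query in a prompt asking for either a step-by-step
    explanation or a one-sentence summary."""
    style = (
        "Explain step by step what this SQL query does"
        if detailed
        else "Describe in one sentence what this SQL query does"
    )
    return f"{style}:\n\n{sql}"

# Hypothetical query; in practice this comes from the catalog's saved queries.
sql = "SELECT region, SUM(sales) FROM orders GROUP BY region"
print(describe_query_prompt(sql, detailed=False))
```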
Closed Loop Automation
One of the most exciting aspects of ChatGPT is that, through the OpenAI APIs, we now have a tool to automate and scale the creation of descriptive metadata. This means we no longer have to rely solely on asking busy people to slow down and become writers. Instead, they can simply act as editors when they run across improvements they would like to see.
What I am thinking about next is how to set thresholds for requiring humans to validate AI-provided content. The approach I am testing is a dynamic calculation of what counts as critical (critical data elements) based on actual usage and other factors. The vision is a closed-loop improvement process in which ChatGPT describes everything, but validation requests are routed to humans only for the top percentage of critical elements. Their feedback is then used to further train the model, so it keeps improving.
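The routing step can be sketched as below. Ranking purely by query count and the 10% cutoff are assumptions standing in for the "actual usage and other factors" calculation; the usage numbers are fabricated for illustration.

```python
def route_for_validation(elements, top_fraction=0.10):
    """elements: list of (name, query_count) pairs. Return the names whose
    usage puts them in the top_fraction most critical, for human review."""
    ranked = sorted(elements, key=lambda e: e[1], reverse=True)
    cutoff = max(1, round(len(ranked) * top_fraction))  # always review at least one
    return [name for name, _ in ranked[:cutoff]]

# Hypothetical usage counts pulled from query logs.
usage = [("customer_id", 980), ("order_total", 450), ("etl_batch_id", 3),
         ("region", 120), ("temp_flag", 1), ("product_sku", 310),
         ("load_ts", 2), ("email", 75), ("discount_pct", 40), ("legacy_col", 0)]
print(route_for_validation(usage))  # → ['customer_id']
```

Everything outside the returned list would get its AI-generated description published directly, with edits welcomed after the fact.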
ChatGPT and other emerging generative AI models offer an exciting opportunity to evolve how we describe, share, manage, and govern data, so the organizations we work for can better mitigate risk and maximize business agility and growth.