When it comes to managing data, we need to know where it is, but we also need to know what it is.

With the rise in regulatory controls, enterprises now pay more attention to data sovereignty, especially when it comes to data in the cloud, but knowing exactly what information they hold is equally important.

This concept – data classification – is not new. But with the growth of unstructured data in particular, having a clear picture of all data assets is essential. And increasingly, firms now look to artificial intelligence (AI) tools to help with this.

What is data classification and why do we need it?

Organizations have long organized data by function or “descriptive classifier”, such as whether it is an HR file or a sales record. They then categorize it by sensitivity, also known as a control requirement. Then there is context-based information, such as when and where data was created, and technical attributes such as file type or size.
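
As a rough illustration, the four kinds of attribute described above could be held together in one record per data asset. The sketch below is in Python. The class name, field names, sensitivity levels and sample values are assumptions made for illustration and are not taken from any particular product.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Sensitivity(Enum):
    # Hypothetical control levels; real schemes vary by organization.
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


@dataclass
class AssetClassification:
    path: str
    function: str             # descriptive classifier, e.g. "HR" or "sales"
    sensitivity: Sensitivity  # control requirement
    created_at: datetime      # context: when the data was created
    created_in: str           # context: where the data was created
    file_type: str            # technical attribute
    size_bytes: int           # technical attribute


# Illustrative values only.
record = AssetClassification(
    path="/shares/hr/contracts/2024-offer.pdf",
    function="HR",
    sensitivity=Sensitivity.CONFIDENTIAL,
    created_at=datetime(2024, 3, 1, 9, 30),
    created_in="eu-west",
    file_type="pdf",
    size_bytes=182_334,
)
```

Keeping all four dimensions on one record means a single label such as "confidential" is never the only thing known about a file.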

Lower-cost cloud storage lets organizations store more data for longer and use it for business intelligence, which nowadays increasingly means training AI models.

But that data must be organized well so that it is easy to find and use. Protecting that data is also vital. Data governance and data stewardship depend on effective data classification. Data storage is also less efficient unless the business has a solid data classification plan.

Manual data classification, while possible, is inefficient, unreliable and hard to scale. Although organizations can create policies that require users to classify data by adding labels, tags or keywords, this really only works for the broadest classifications – such as sensitivity – and for newly created files.

As organizations bring in more data from external sources such as web applications, customers and the internet of things, effective data classification really needs to be automated. Data classification is a key part of data lifecycle management and is essential for data security.

Data classification tools

As analyst firm Gartner points out, manual data classification can lead to misclassification due to human error. Also, labels and tags are “one dimensional” and “do not provide sufficient context for increasing regulatory data controls”. They fail to capture context and are usually static. Data can also be used for different purposes during its lifecycle.

Automation solves some of this by adding context, as well as looking at the content of the data, its location and adjacent documents. According to Gartner, standard classification tools work well with standard data types and in organizations that already have well-formatted data. The task becomes harder as organizations make more use of unstructured data.
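
A minimal sketch of the kind of automated check described above, combining a scan of the content with the file's location as context. The regular expressions, tag names, sensitivity labels and the "/hr/" path convention are all assumptions for illustration. Commercial tools use far richer detectors.

```python
import re

# Hypothetical detectors for well-formatted, standard data types.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_LIKE = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def classify(text: str, path: str) -> dict:
    """Suggest a function and sensitivity from content and location."""
    findings = []
    if EMAIL.search(text):
        findings.append("email_address")
    if CARD_LIKE.search(text):
        findings.append("card_like_number")

    # Location adds context that a content scan alone would miss.
    function = "hr" if "/hr/" in path.lower() else "unknown"

    # Assumed rule: any finding, or an HR location, raises sensitivity.
    sensitive = bool(findings) or function == "hr"
    return {
        "function": function,
        "findings": findings,
        "sensitivity": "confidential" if sensitive else "internal",
    }


result = classify(
    "Contact jane.doe@example.com, card 4111 1111 1111 1111",
    "/shares/finance/notes.txt",
)
```

Pattern rules like these suit standard data types. They say little about free-form documents, which is where the task becomes harder.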

Increasingly, vendors are using machine learning to look into datasets and documents, to discover elements they can identify, record and track. But, as Gartner notes, their performance can be limited when it comes to handling proprietary data.

Nonetheless, the market offers a range of data classification tools, from standalone applications to those integrated into databases or enterprise applications, especially business intelligence. These are sometimes described as enterprise data catalogs.

Another approach is to bundle classification and cataloging as part of wider enterprise data governance and compliance applications. Unsurprisingly, vendors are now looking to integrate AI into their tools, to improve accuracy and reduce the need for manual tagging.

AI input, data outputs

Data classification is a natural application for artificial intelligence. Vendors have used machine learning in data cataloging tools for a while. It is not a use case that relies on generative AI (GenAI) or large language models (LLMs), although some tools now use them.

Some tool vendors use machine learning techniques such as neural networks, decision trees and logistic regression. These train models to find patterns in data, especially unstructured data. The models can then be used to apply automated tags to the data.
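
To make the idea concrete, here is a toy logistic regression tagger written in plain Python. It is a sketch under stated assumptions, not how any vendor's product works: the six training snippets, the two tags ("hr" and "sales"), the bag-of-words features and the training settings are all invented for illustration.

```python
import math


def tokenize(text: str) -> set:
    # Bag-of-words features: the set of lower-cased words in the text.
    return set(text.lower().split())


def train(examples, epochs=200, lr=0.5):
    """Fit logistic regression by stochastic gradient descent.

    examples: list of (text, label) pairs, label 1 = "hr", 0 = "sales".
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            words = tokenize(text)
            z = bias + sum(weights.get(w, 0.0) for w in words)
            p = 1.0 / (1.0 + math.exp(-z))
            error = label - p
            bias += lr * error
            for w in words:
                weights[w] = weights.get(w, 0.0) + lr * error
    return weights, bias


def tag(model, text: str) -> str:
    weights, bias = model
    z = bias + sum(weights.get(w, 0.0) for w in tokenize(text))
    p_hr = 1.0 / (1.0 + math.exp(-z))
    return "hr" if p_hr >= 0.5 else "sales"


# Hypothetical labelled snippets standing in for document text.
training = [
    ("employee contract and salary review", 1),
    ("new hire onboarding checklist", 1),
    ("staff salary and leave policy", 1),
    ("quarterly sales forecast by region", 0),
    ("customer invoice and order summary", 0),
    ("sales pipeline and customer leads", 0),
]
model = train(training)
suggested = tag(model, "salary review for new employee")
```

The model learns which words lean toward each tag, so it can suggest a tag for text it has not seen before. A production system would need far more training data and a proper evaluation step.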

Customers can then test and refine models before deployment. This is important because customer datasets differ and an out-of-the-box tool might not understand the specifics of that customer's data or the relationship between different data within the organization. An effective AI model can be used to enrich the metadata associated with a file or document.

The metadata can then be used to create a catalog of enterprise data and, in turn, more effective controls. A further advantage of automated and AI-based systems is that they are dynamic. If the enterprise reclassifies data – due to regulatory changes, for example – the data classification tool should be able to update the catalog on the fly.
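
One way to picture this dynamic behavior: store descriptive tags against each asset and derive the sensitivity level from a separate policy, so that a policy change flows through to the catalog when it is rebuilt. The tag names, levels, paths and policy below are hypothetical.

```python
# Hypothetical sensitivity levels, ordered from least to most strict.
LEVELS = ["public", "internal", "confidential", "restricted"]


def sensitivity_for(tags, policy):
    """Return the strictest level required by any tag (default: internal)."""
    required = [policy.get(t, "internal") for t in tags] or ["internal"]
    return max(required, key=LEVELS.index)


def rebuild_catalog(assets, policy):
    # Sensitivity is derived, not stored, so it always reflects the policy.
    return {path: sensitivity_for(tags, policy) for path, tags in assets.items()}


assets = {
    "/crm/contacts.csv": ["email_address"],
    "/finance/q3.xlsx": ["financial_forecast"],
}
policy = {"email_address": "internal", "financial_forecast": "confidential"}
before = rebuild_catalog(assets, policy)

# An assumed regulatory change tightens the rule for email addresses.
policy["email_address"] = "confidential"
after = rebuild_catalog(assets, policy)
```

Because the tags describe what the data is rather than how it must be handled, the files themselves do not need to be relabelled when the rules change.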

The metadata and catalog can then be used for data retention and in security and data loss prevention tools, as well as to meet rules for data residency. This is hard to do with unstructured data, but solid data management is vital for business intelligence and AI development.
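
As a hedged sketch of how catalog metadata might drive retention and residency decisions, the example below checks a single catalog entry against two rule tables. The retention periods, region names and entry fields are assumptions for illustration, not regulatory guidance.

```python
from datetime import date

# Hypothetical rules keyed on catalog metadata; real values come from policy.
RETENTION_YEARS = {"hr": 7, "sales": 3}
ALLOWED_REGIONS = {
    "internal": {"eu-west", "us-east"},
    "confidential": {"eu-west"},
}


def age_in_years(created: date, today: date) -> int:
    """Whole years elapsed between two dates."""
    before_anniversary = (today.month, today.day) < (created.month, created.day)
    return today.year - created.year - int(before_anniversary)


def retention_expired(entry: dict, today: date) -> bool:
    years = RETENTION_YEARS.get(entry["function"])
    # Assets with an unknown function are kept rather than flagged.
    return years is not None and age_in_years(entry["created"], today) >= years


def may_store_in(entry: dict, region: str) -> bool:
    # Unknown sensitivity levels are allowed nowhere until reviewed.
    return region in ALLOWED_REGIONS.get(entry["sensitivity"], set())


entry = {
    "path": "/shares/sales/leads-2020.csv",
    "function": "sales",
    "sensitivity": "confidential",
    "created": date(2020, 5, 14),
}
```

Calling retention_expired(entry, date.today()) or may_store_in(entry, "us-east") then gives a yes-or-no answer that a retention job or data loss prevention tool could act on.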

Key data classification providers

Microsoft provides AI-based data classifiers through its Purview product. These, it says, are pre-trained on business data, Microsoft domain knowledge and synthetic data. Purview is a wider data governance, compliance and risk management service that runs on Azure.

IBM offers its Knowledge Catalog for data classification and management using AI and ML. It runs as a SaaS application, or in IBM's Cloud Pak for Data. IBM uses LLMs for metadata enrichment.

SAP's Document Classification tool was retired in 2023 and replaced by its generative AI-based Document Information Extraction service.

Oracle Cloud Infrastructure provides “metadata harvesting” from cloud-based sources, and OCI Data Catalog for on-premise and private networks.

Google Cloud's data classification options include Data Catalog, which builds data asset inventories from Google Cloud sources including BigQuery and its AI offerings, from cloud storage, and from custom data sources through an API.

AWS has the Glue Data Catalog, which includes automated data discovery.

There is also a wide range of specialist data and analytics platforms that provide data classification and management, either directly or as part of business and data intelligence platforms. These include Alation, Ataccama, Atlan, Collibra, Databricks (through its Unity Catalog), Qlik and Tableau, as well as data stalwart Informatica and data security vendor Varonis.
