Data Discovery with Dynamic Data Classification

Posted by Charlotte Foglia

If you’re tired of hearing about data, you need some tough love. You’re never going to stop hearing about data. Data is too important, too valuable and too dangerous for it to be any other way.

Data is at once a great source of strength and a great source of risk, and the volume of data that companies hold is not going to shrink.

Ideally, a company will be able to leverage its data for insights that will allow it to grow and get better at serving its customers. At the same time, a business has to protect its data from being stolen or misused. This is a challenging proposition, mostly because few companies have a clear idea of what’s in their data and where it’s all kept.

Data classification offers a solution to this dilemma. As part of data discovery, it assigns labels to data, such as “sensitive,” “contains personally identifiable information (PII),” “intellectual property (IP)” and so forth. So far, so good.

There’s a problem, however, which is that existing data classification processes are not up to the job. Manual data classification is hopelessly slow and labor intensive. Automated processes are faster, but are generally not accurate enough.

A new approach, known as dynamic data classification, works better. It classifies data based on what’s in the data itself and the context around it, using multi-factor rules or machine learning (ML), rather than relying on simple keyword matching alone.

Understanding data classification

Before delving into dynamic data classification, it’s worth reviewing what is meant by data classification in general. Data classification is a process that analyzes data—structured and unstructured—and organizes it into categories based on content, file type and other predefined criteria.

Structured data is found in databases. Unstructured data comprises documents like PDFs, email messages and the like. Because it is user-generated and highly varied, unstructured data can be difficult to classify accurately.

There are different solutions that perform the data classification task. Some are purpose-built. Others are part of Data Loss Prevention (DLP) solutions.

Enterprise search tools like Sinequa can perform data discovery and analyze the data they find in the process. More on this in a moment.

Manual data classification

Users can classify their own data. If an employee creates a PDF, he or she can assign it a category, assuming a system has been established to tag and classify new files. In some cases, the manual classification system is integrated with DLP to prevent storing sensitive data in the wrong place, e.g., blocking a file containing IP from being uploaded into Dropbox.
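
As a rough illustration of how user-assigned tags might feed a DLP check, here is a minimal Python sketch. The policy table, the destination names and the `upload_allowed` function are all hypothetical and do not reflect any particular DLP product.

```python
# Hypothetical policy: classification tag -> destinations where such files may not be stored.
BLOCKED_DESTINATIONS = {
    "IP": {"dropbox"},
    "PII": {"dropbox", "personal-email"},
}

def upload_allowed(file_tags, destination):
    """Return False if any tag on the file forbids the destination."""
    return all(
        destination not in BLOCKED_DESTINATIONS.get(tag, set())
        for tag in file_tags
    )

print(upload_allowed({"IP"}, "dropbox"))         # False
print(upload_allowed({"IP"}, "internal-share"))  # True
print(upload_allowed(set(), "dropbox"))          # True: an untagged file slips through
```

The last call shows the weak spot: a check like this is only as good as the tags people actually apply.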

Manual classification is good, in theory, because people understand the data they’re looking at, especially if they created it. The problem is that manual data classification is slow and tedious. It’s the kind of task that employees love to skip, and hardly anyone goes back to classify data after it’s been generated. Relying exclusively on manual data classification will inevitably lead to incomplete categorization of files.

Automated data classification

Automated data classification occurs through some sort of classification engine, which is software that matches string data (i.e., words) in each file to a set of defined search parameters. It’s a far speedier and more efficient process than manual classification. The problem with automated data classification is that its accuracy can be uneven, because matching words without regard to context produces both false positives and missed files.
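
To make the idea concrete, here is a minimal Python sketch of string-to-parameter matching. The categories, keywords and sample texts are invented for illustration, and real classification engines are considerably more elaborate.

```python
import re

# Hypothetical search parameters: category -> keywords that trigger it.
SEARCH_PARAMETERS = {
    "IP": {"patent", "trademark", "invention"},
    "PII": {"passport", "ssn", "birthdate"},
}

def classify_by_keywords(text):
    """Return the sorted categories whose keywords appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        category
        for category, keywords in SEARCH_PARAMETERS.items()
        if words & keywords
    )

legal_memo = "Draft patent application for the new sensor invention."
shoe_order = "Purchase order: 40 pairs of black patent leather shoes."

print(classify_by_keywords(legal_memo))  # ['IP']
print(classify_by_keywords(shoe_order))  # ['IP'] (a false positive)
```

Both documents get the same label because the matcher sees only the word, not the context it appears in.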

What is dynamic data classification and how does it work?

Dynamic data classification takes automated data classification to the next level. The technique uses either rules or machine learning to assess and categorize data more accurately than pure string-to-parameter matching.

  • Rules-based – The rules-based approach to dynamic data classification establishes categorization rules and then classifies data according to those rules. Multiple rules can affect the way a piece of data is classified. That’s what makes the process dynamic. For example, rules might compare the presence and density of certain words in a document with the identity of the document’s creator in order to generate an accurate classification. If the word “patent” appears in a file created in the legal department, that means it might contain IP. If the word “patent” shows up in a purchase order for shoes, that might refer to patent leather, so it’s not sensitive IP data.
  • Machine learning-based – With machine learning, dynamic data classification learns how to classify data by reading it and learning from it. The Sinequa intelligent search platform does ML-based dynamic data classification using natural language understanding (NLU) along with relevance scores and search relevance optimization based on artificial intelligence (AI). Working this way, ML-based dynamic data classification can continually improve its ability to categorize data. It learns from the documents it evaluates.
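
The “patent” example from the rules-based approach can be sketched in a few lines of Python. This is only an illustration under assumed rules: the scoring scheme, the threshold and the department names are hypothetical, not a description of how any product implements its rules.

```python
import re

def keyword_density(text, keyword):
    """Fraction of words in the text that equal the keyword."""
    words = re.findall(r"[a-z]+", text.lower())
    return words.count(keyword) / len(words) if words else 0.0

def classify_document(text, creator_department):
    """Combine several hypothetical rules into one IP score."""
    score = 0
    if keyword_density(text, "patent") > 0:
        score += 1  # rule 1: keyword present (a real rule might require a minimum density)
    if creator_department == "legal":
        score += 1  # rule 2: the creator's department makes IP more likely
    if "patent leather" in text.lower():
        score -= 2  # rule 3: the phrase refers to a material, not to IP
    return "IP" if score >= 2 else "not sensitive"

print(classify_document("Draft patent application for the new sensor.", "legal"))
# IP
print(classify_document("Purchase order: 40 pairs of patent leather shoes.", "purchasing"))
# not sensitive
```

No single rule decides the outcome here. The label comes from several signals weighed together, which is the sense in which the article calls the process dynamic.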

To what extent is dynamic data classification useful?

Dynamic data classification is a useful process. Having a more complete and accurate understanding of your data, and where it is stored, offers several benefits. It helps with security, compliance and risk mitigation. Business decision making improves, and a company’s profitable use of data can also become more widespread.

Done right, dynamic data classification enables stakeholders to understand the relative importance of different elements of the overall corporate data set. Everyone who needs to know will have certainty about what data is sensitive and what is valuable.

From there, they can make informed decisions about data security. For instance, accurate, dynamic data classification reveals which types of data will cause risk exposure if they are breached. Given that data security comes at a cost, a company that knows what data needs the highest level of protection will avoid overspending on data security.

Well-classified data furthers the mission of data analytics, as well. At one level, a data-driven business can only analyze data it knows about. If data is lost due to being unclassified—and is therefore invisible—it cannot offer the benefits of analytics.

Going further, accurate classification of unstructured data is an important step toward utilizing unstructured data in business analytics. When classified, the data can be part of the analytics process.

Data classification also benefits the IT department. Data storage and management represent significant investments. With dynamic data classification, IT and storage managers can gain a firm grasp on how much of each type of data they have to manage. Higher priority data usually goes onto higher performing, more expensive storage.

Knowing how much data needs higher level storage helps avoid over-spending on such premium storage, to name one benefit. Classification can also help with finding duplicate data, which translates into savings on storage and backup costs.
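
One common way to spot exact duplicates, shown here only as a sketch, is to group files by a hash of their contents. The file names and contents below are made up, and this approach is an assumption rather than a description of how any classification product detects duplicates. It will not catch near-duplicates such as the same document saved in two formats.

```python
import hashlib
from collections import defaultdict

def find_duplicates(files):
    """Group file names by the SHA-256 digest of their contents.

    `files` maps a name to its bytes; returns only groups with more than one name.
    """
    by_digest = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]

files = {
    "contracts/nda.pdf": b"Mutual NDA v1",
    "backup/nda_copy.pdf": b"Mutual NDA v1",
    "reports/q3.pdf": b"Q3 results",
}
print(find_duplicates(files))  # [['backup/nda_copy.pdf', 'contracts/nda.pdf']]
```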

Some use cases of dynamic data classification

Dynamic data classification aligns with a range of use cases in corporate and public sector settings. Risk mitigation is one of them. By accurately classifying all data in an organization, it is possible to limit access to PII, which reduces compliance risk and associated liabilities having to do with data breaches. Complete data classification enables organizations to control location and access to IP, which limits risk exposure for loss of trade secrets.

In general, the ability to identify sensitive data translates into a reduction in an organization’s attack surface area, assuming the right people deploy appropriate countermeasures.

Governance and compliance are additional use cases for dynamic data classification. Stakeholders who are responsible for governance related to GDPR and other regulatory frameworks need to know where relevant data is stored so they can be compliant. Dynamic data classification makes this a reality by identifying data that needs to be governed under these regulations.

The process can apply metadata tags to affected data sets, marking them for governance policies. Then, when legally mandated processes like legal holds, “right to be forgotten” and Data Subject Access Requests (DSARs) arise, the organization can easily comply with them.
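
To show why tags make such requests easier, here is a minimal Python sketch of a DSAR lookup over a tagged index. The `Document` structure, the tag names and the `subjects` field are hypothetical. A real index would be far larger and would live in a search platform rather than a Python list.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    path: str
    tags: set = field(default_factory=set)      # classification tags, e.g. {"PII", "GDPR"}
    subjects: set = field(default_factory=set)  # hypothetical: people the document refers to

def documents_for_dsar(index, subject):
    """Return paths of PII-tagged documents that refer to the given data subject."""
    return sorted(
        doc.path
        for doc in index
        if "PII" in doc.tags and subject in doc.subjects
    )

index = [
    Document("hr/contract_alice.pdf", {"PII", "GDPR"}, {"alice"}),
    Document("sales/deck.pptx", {"public"}, set()),
    Document("support/ticket_881.txt", {"PII"}, {"alice", "bob"}),
]
print(documents_for_dsar(index, "alice"))
# ['hr/contract_alice.pdf', 'support/ticket_881.txt']
```

Once the tags exist, answering the request is a filter over metadata instead of a manual search through every file.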

Sinequa’s intelligent search platform provides a powerful engine for dynamic data classification. Using AI, ML and NLU, Sinequa can efficiently index large amounts of structured and unstructured data. From this AI-driven indexing process, Sinequa can apply data classification tags dynamically, based on the content itself. The result is data classification that is accurate and holistic.