Using data to gain insights into a business is by now a well-understood tactic in the corporate world. However, while the idea is easy to understand, its execution can be challenging. There are many reasons for this, including a lack of qualified people to do data analytics, deficient toolsets and faulty assumptions. One of the biggest obstacles to success, though, involves not looking at the entire data set. It’s tempting to build data warehouses from existing databases and use the resulting data for analytics. The problem with this approach is that it relies too much on structured data. Unstructured data, such as that found in emails and documents, tends to get ignored, which undercuts the accuracy and impact of the data analytics process.
What is unstructured data?
To compare structured vs. unstructured data, one must first understand the nature of structured data. Structured data comprises numbers or text that can fit into the predefined fields of a relational database management system (RDBMS) such as Oracle or Microsoft SQL Server. Structured data takes the form of a database’s rows and columns. Examples include names and addresses, demographic statistics, smartphone locations and so forth. Structured data is easy to work with and search, but it represents just one fifth of all data in a corporate setting.
Unstructured data, in contrast, consists of information that does not fit neatly into an RDBMS. It lacks the uniformity of structured data. For example, a customer database has a defined structure that includes first name, last name, phone number and so on. Unstructured data might be found in a PDF document, an email thread or a social media post. It’s text and numbers—and in some cases video, sounds and images—that are not arranged according to a row and column schema.
Why analyze unstructured data?
Why should you care about unstructured data? It is more difficult to collect, process, search and analyze than its structured counterpart. Yet, 80% of data is unstructured, and it tends to contain a great deal of hidden value. For example, much of what marketers call “brand sentiment” is buried in unstructured data.
You might be able to detect a problem in customer loyalty using structured data sets like sales ledgers. If customers are making fewer repeat orders, that might indicate a brand sentiment problem. Negative brand sentiment could be a lot easier to spot, however, if an analysis of social media posts revealed that nine out of ten comments contained phrases like “this product is awful.” To see those sentiments, you have to have a means to do analytics on unstructured data.
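As a rough illustration of the social media example, the share of posts containing a negative phrase can be computed in a few lines of Python. This is a minimal sketch, not a real sentiment model: the phrase list, the sample posts and the function name `negative_share` are all hypothetical, and a production system would more likely rely on NLP than on literal phrase matching.

```python
# Hypothetical phrase list; real sentiment analysis would use an NLP model.
NEGATIVE_PHRASES = ("this product is awful", "waste of money", "never buying again")


def negative_share(posts, phrases=NEGATIVE_PHRASES):
    """Return the fraction of posts that contain at least one negative phrase."""
    if not posts:
        return 0.0
    hits = sum(
        1 for post in posts
        if any(phrase in post.lower() for phrase in phrases)
    )
    return hits / len(posts)


# Made-up sample posts, for illustration only.
sample_posts = [
    "This product is awful. I want a refund.",
    "Total waste of money.",
    "Works fine for me.",
    "Honestly, this product is awful!",
]

share = negative_share(sample_posts)
print(f"{share:.0%} of posts contain a negative phrase")
```

A ratio like this says nothing a sales ledger could reveal, which is the point: the signal exists only in the text of the posts.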
Another compelling reason to analyze unstructured data is for the purpose of data classification. Data classification is the practice of identifying and then labeling data according to classifications, such as “intellectual property” or “IP,” “confidential,” “personally identifiable information” or “PII” and the like. Data classification is foundational for data security and compliance. It is nearly impossible, after all, to be effective in protecting data if you don’t know what it is or what it means.
Most serious data security programs make a priority of defending a company’s “crown jewels,” its most valuable and sensitive information. To know what is, and what isn’t, in the crown jewels, it is first necessary to look at all possible data sets and identify which pieces belong in this highly protected classification. Doing this right means examining unstructured data.
For example, your company might put a premium on protecting its patents. That sounds simple enough, but what if information that supports your patent applications is spread out across your entire enterprise? Documents lurking in file drives and cloud volumes could hold rich intellectual property, such as engineering drawings and research reports competitors could steal. You are vulnerable through your unstructured data. To protect your IP, you need to analyze your unstructured data and find out where all your IP is hidden, and then classify it as IP so it can be properly guarded.
Compliance presents a similar use case. Regulations like the HIPAA law, which protects patient privacy, and the GDPR and CCPA laws that aim to keep consumers’ PII private, require analysis of unstructured data. For instance, PII data might easily exist inside email messages or PDF documents. If you don’t know it’s there, you can’t guard it against data breaches or unauthorized access. You are then exposed to the risk of significant financial penalties. With the California Consumer Privacy Act (CCPA), for example, a breach of just 400 PII records can lead to a $1 million fine!
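To make the classification idea concrete, here is a minimal Python sketch that scans free text, such as an email body, and attaches the labels mentioned above (“PII,” “IP,” “confidential”). The regular expressions and keyword lists are simplified assumptions chosen for illustration; they will miss many real cases and are not a substitute for a dedicated classification tool.

```python
import re

# Simplified, illustrative patterns (assumptions, not a complete PII definition).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")            # US Social Security number format
IP_KEYWORDS_RE = re.compile(r"\b(patent|trade secret)\b", re.IGNORECASE)


def classify(text):
    """Return a sorted list of classification labels found in the text."""
    labels = set()
    if EMAIL_RE.search(text) or SSN_RE.search(text):
        labels.add("PII")
    if IP_KEYWORDS_RE.search(text):
        labels.add("IP")
    if "confidential" in text.lower():
        labels.add("confidential")
    return sorted(labels)


# Made-up example message.
message = "Please send the patent draft to jane.doe@example.com by Friday."
print(classify(message))
```

Even a crude pass like this shows why discovery has to come first: the labels can only be applied to text that has actually been read.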
What tools can be used to analyze unstructured data?
Analyzing unstructured data requires the right tools. You have a variety of options, but in general, a modern enterprise search solution will do the job. An enterprise search solution can conduct data discovery for unstructured data. This means having software “crawlers” go through all of your organization’s unstructured data. The crawlers will parse the contents of Microsoft Office documents, PDFs, email servers and any other source of unstructured data.
As data about the documents comes back to the enterprise search engine from its crawlers, the solution can build a searchable index of unstructured data. Then, using either built-in features or third-party tools, the enterprise search solution can add data classifications to the unstructured data it has indexed. In this workflow, it can be helpful to employ a Natural Language Processing (NLP) function that reads the documents like a human being, not a machine. A good NLP solution picks up on nuances within unstructured data that might elude a more traditional, mechanical search application.
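The crawl-then-index workflow can be sketched in Python as a small inverted index, which maps each word to the documents that contain it. This is an assumption-laden toy, not how any particular enterprise search product works: the document names and contents are made up, the text is assumed to be already extracted from its source files, and the helper names (`tokenize`, `build_index`, `search`) are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def search(index, query):
    """Return the documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Made-up text standing in for what a crawler extracted from files and emails.
crawled = {
    "design_review.pdf": "Engineering drawing for the new valve assembly.",
    "support_email.txt": "Customer says the valve leaks after one week.",
    "newsletter.docx": "Join us for the annual company picnic.",
}

index = build_index(crawled)
print(sorted(search(index, "valve")))
```

A classification step or an NLP layer would sit on top of an index like this, adding labels and meaning to documents that plain keyword matching treats as bags of words.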
Unstructured data analytics may also require an analytics platform that is compatible with unstructured data. An analytics platform with this capability usually comprises a connected set of applications, rather than a single solution. It might combine data analytics and visualization with Business Intelligence (BI), NLP and enterprise search. The analytics platform might also include Artificial Intelligence (AI) functionality that enables it to improve its accuracy and recognize patterns in unstructured data.
Unstructured data is, or should be, a major part of an organization’s data analytics strategy. It should also figure prominently in data security and compliance efforts. The consequences of ignoring it are potentially quite severe. A modern enterprise search solution is a key element in any project that sets out to discover, classify and analyze unstructured data.