Structuring the Unstructured Data to Become Information Driven
Structured versus unstructured data. Why and how?
As the volume of enterprise data has grown significantly, managing it has become a real IT concern, and the distinction between structured and unstructured data has emerged.
Over time, it became apparent that company data could be split into two subsets depending on the type of data, defined mainly by example rather than by a formal, standard definition.
Structured data was typically found inside databases, ERPs, CRMs, PLMs, directory systems, and other data management tools containing people data, financial transactions, part numbers, clinical trial datasets, etc.
On the other hand, the sheer quantity of text found inside patents, scientific articles, websites, project deliverables, and contracts has led knowledge managers to label this content as unstructured data.
And then there are the two gray zones in between:
- Documents that are made up of large, unstructured content but are managed inside content management systems, which organize them using categories, metadata, and properties. These types of documents led to the term “semi-structured” data.
- Short content made up of multiple text fragments hosted inside social networks, instant messaging systems, and even several columns within database tables. Should it be considered structured data? Unstructured data? Semi-structured? Neither? Both?
So let’s consider why this classification arose and why a new approach is needed to manage all of this data.
Why are there two main categories of data and how should we deal with them?
The main reason two data categories exist is to better specify the software systems best suited to manage each of them.
Beginning with Excel and databases in general, many products have been developed to manage structured data properly.
Meanwhile, content management systems (starting with shared drives) were developed to better host Word, PDF, and other textual documents (a.k.a. unstructured documents).
The list of systems for managing structured and unstructured documents is extremely long, and it varies with the purpose and the business expectations. All have a wide variety of features, capabilities, strengths, and weaknesses.
Unstructured data is a more significant challenge
What makes unstructured data more complex to manage?
While database content is formatted simply inside table cells, following a more or less strict schema, unstructured documents can come in hundreds of binary formats and be written in many natural languages.
Dealing with database content is simple once you have identified the information contained in the database. Dates are properly stored in date formats, people’s names are clearly written into the appropriate fields, and amounts of money, category names, quantitative values, and so forth are all stored in the proper formats.
If we now consider a plain text document written in, let’s say, German, Russian or Japanese, how can we identify the same types of named entities (e.g., dates, person names, quantitative values, etc.)?
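To make the contrast concrete, here is a minimal Python sketch. The record, the sample German sentence, and the two regular expressions are all made-up assumptions for illustration. The patterns only cover one date layout (DD.MM.YYYY) and one amount layout (1.250,00 EUR), and they do not attempt to find the person’s name at all, which hints at why hand-written rules do not scale across languages and formats.

```python
import re
from datetime import date
from decimal import Decimal

# Structured side: a hypothetical database row. Types are already explicit.
record = {
    "customer": "Anna Schmidt",
    "order_date": date(2021, 3, 12),
    "amount_eur": Decimal("1250.00"),
}

# Unstructured side: the same facts buried in a made-up German sentence.
text = "Frau Anna Schmidt hat am 12.03.2021 Waren im Wert von 1.250,00 EUR bestellt."

# Assumed patterns: German DD.MM.YYYY dates and amounts such as 1.250,00 EUR.
DATE_RE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")
AMOUNT_RE = re.compile(r"\b(\d{1,3}(?:\.\d{3})*,\d{2})\s?EUR\b")


def extract_dates(s):
    """Return every DD.MM.YYYY date found in s as a datetime.date."""
    return [date(int(y), int(m), int(d)) for d, m, y in DATE_RE.findall(s)]


def extract_amounts(s):
    """Return every German-formatted EUR amount found in s as a Decimal."""
    return [
        Decimal(a.replace(".", "").replace(",", "."))
        for a in AMOUNT_RE.findall(s)
    ]


# The structured value is a direct lookup; the text needs pattern matching.
dates_in_text = extract_dates(text)
amounts_in_text = extract_amounts(text)
```

In the structured case, `record["order_date"]` is a direct lookup. In the text, the same date has to be located and parsed, and a Russian or Japanese sentence would need entirely different patterns.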
Most of the time, basic search engines allow you to perform a full-text search, but you need to know what you are looking for. More importantly, even when the most relevant document is found, you still have to read it carefully to retrieve the precise information located inside a sentence on a given page.
This challenge is the core reason why unstructured content is so often underused, and why many companies say that their “data-driven” strategy is still far from becoming “information-driven”.
An intelligent enterprise search engine as a solution
At this point, you may be thinking… “OK, I get the point. Let’s just set up the most advanced Enterprise Search solution.”
With broad connectivity, we’ll be able to index both structured and unstructured documents to provide access to truly unified information based on all of our data, whatever the document management system.
Because we can work with any document, the text becomes easily accessible, and any user can search broadly across any piece of information, whatever the binary format.
With embedded Natural Language Understanding technologies, we no longer have to be afraid of documents and data in multiple written languages. And the built-in text mining capabilities can identify named entities. Data like people’s names, amounts of money, part numbers, locations, and company names can be easily identified and surfaced for any qualitative and quantitative post-processing.
Using machine learning, documents can be automatically organized into categories and the user’s query intent can be detected and correlated at search time to maximize user satisfaction.
Indeed, this is exactly what an advanced search engine can do.
But is that all?
Let’s conclude with some icing on the cake!
The ability to process both structured and unstructured data allows us to go above and beyond a simple federated search across multiple data sources.
A simple example will illustrate this conclusion.
Imagine a company with an employee directory, a CRM system to manage their client data, an ERP, and several business applications to precisely describe products, suppliers, manufacturing plants, etc.
You will probably think – “this is the case most of the time.”
Now let’s also imagine a platform that can be fine-tuned and enhanced with business vocabularies to improve the text mining capabilities and search features to provide best-in-class enterprise search.
Then, and only then, will this company be able to close the loop.
By using proprietary structured data to help mine the unstructured data, the enterprise search platform approach will not only allow you and your users to search across all company data, but also improve your ability to structure the unstructured data, helping users surface all relevant facts, entities, and relations previously hidden inside your millions of unstructured documents.
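One simple way to picture this loop is a dictionary (or “gazetteer”) built from structured sources and then used to tag mentions in free text. The Python sketch below is only an illustration under stated assumptions: the employee and client records, the field names, and the helper functions `build_gazetteer` and `tag_entities` are hypothetical, and a real platform would rely on far richer linguistic processing than exact phrase matching.

```python
import re

# Hypothetical structured sources: an employee directory and a CRM export.
employees = [
    {"name": "Maria Lopez", "dept": "R&D"},
    {"name": "John Carter", "dept": "Sales"},
]
clients = [{"company": "Acme Corp", "account_id": "C-001"}]


def build_gazetteer(employees, clients):
    """Map each known name (lowercased) to an entity type and its source record."""
    gazetteer = {}
    for e in employees:
        gazetteer[e["name"].lower()] = ("PERSON", e)
    for c in clients:
        gazetteer[c["company"].lower()] = ("COMPANY", c)
    return gazetteer


def tag_entities(text, gazetteer):
    """Return (start, matched_text, type, record) for each known term in text."""
    hits = []
    for term, (etype, rec) in gazetteer.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.group(0), etype, rec))
    # Order hits by their position in the text.
    return sorted(hits, key=lambda h: h[0])


# A made-up unstructured document, e.g. a line from meeting notes.
doc = "Meeting notes: Maria Lopez presented the new prototype to Acme Corp last week."
hits = tag_entities(doc, build_gazetteer(employees, clients))

for start, surface, etype, rec in hits:
    print(start, surface, etype, rec)
```

Each hit links a span of free text back to a structured record, so the mention of the client in the notes can be joined with its CRM account. That link is the sense in which structured data helps structure the unstructured.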
And that’s what it takes to truly become Information-Driven.