Natural Language Text Processing
Sinequa's NLP semantically enriches content in any language and powers an intelligent employee experience for search and analytics.
At indexing time, NLP processing includes:
- Automatic detection of document languages (more than 135 languages detected), with a language splitter to handle documents that switch from one language to another
- Full multilingual analysis for many languages, including English, French, German, Arabic, Chinese (Simplified), Chinese (Traditional), Korean, Danish, Spanish, Finnish, Greek, Italian, Japanese, Dutch, Polish, Portuguese, Romanian, Russian, Swedish, Thai, and Norwegian
- Part-of-speech tagging
- Concept extraction for related terms
- Language-specific linguistic technologies where the language requires them, such as transliteration (Japanese), compound word splitting (German, for instance), and model-based disambiguation
- Semantic analysis
- Content enrichment with variants, standard or custom rewritings, etc.
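To make the idea of a language splitter concrete, here is a minimal sketch in Python. It is not Sinequa's implementation: it swaps real language detection for a much simpler technique, splitting text into runs by Unicode script, and the function names `script_of` and `split_by_script` are hypothetical. Each resulting segment could then be handed to a per-language analysis step.

```python
import unicodedata
from itertools import groupby


def script_of(ch: str) -> str:
    """Return a coarse script label taken from the character's Unicode name."""
    if not ch.isalpha():
        return "COMMON"  # digits, spaces, and punctuation belong to no script
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def split_by_script(text: str) -> list[tuple[str, str]]:
    """Split text into (script, segment) runs.

    Neutral characters (spaces, punctuation) are attached to the run that
    precedes them, so a sentence keeps its trailing punctuation.
    """
    labelled = []
    current = None
    for ch in text:
        script = script_of(ch)
        if script == "COMMON":
            script = current or "COMMON"
        current = script
        labelled.append((script, ch))
    return [
        (script, "".join(ch for _, ch in group))
        for script, group in groupby(labelled, key=lambda pair: pair[0])
    ]


if __name__ == "__main__":
    for script, segment in split_by_script("Hello world. Привет мир"):
        print(script, repr(segment))
```

A script-based split cannot separate languages that share a script (English and French, for example), and it would cut Japanese text at every change between kanji and kana, which is why a production splitter relies on statistical language detection instead.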
Sinequa uses advanced information retrieval techniques to provide relevant and contextualized results. Sinequa embeds a sophisticated variation of the well-known TF-IDF and BM25 algorithms, enhanced by multiple factors including:
- Proximity: a measure of the "closeness" of multiple search terms in a document
- Proximity to the head content
- Fidelity to the original form with regard to all linguistic variants
- Neural search footprint from document passages
- Relevancy corrections based on document freshness, data source weighting, document ratings, and feedback models
- Text part weighting
- Business rules
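As an illustration of how a base ranking function can be enhanced by a proximity factor, the sketch below implements the standard Okapi BM25 formula and multiplies it by a simple boost. Sinequa's actual variation is not described here, so the boost formula `1 + alpha / window`, the parameter `alpha`, and the function names are assumptions made for this example only.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score tokenized documents against a query with standard Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def min_window(query, doc):
    """Length of the shortest token span containing every query term, or None."""
    needed = set(query)
    if not needed:
        return None
    best, left, counts = None, 0, Counter()
    for right, token in enumerate(doc):
        if token in needed:
            counts[token] += 1
        # Shrink the window from the left while it still covers all terms.
        while len(counts) == len(needed):
            width = right - left + 1
            best = width if best is None else min(best, width)
            head = doc[left]
            if head in needed:
                counts[head] -= 1
                if counts[head] == 0:
                    del counts[head]
            left += 1
    return best


def rank(query, docs, alpha=1.0):
    """Rank documents by BM25 times a hypothetical proximity boost."""
    ranked = []
    for i, (doc, base) in enumerate(zip(docs, bm25_scores(query, docs))):
        window = min_window(query, doc)
        boost = 1.0 + alpha / window if window else 1.0
        ranked.append((i, base * boost))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

The other factors in the list (freshness, source weighting, ratings, business rules) could be folded in the same way, as additional multipliers or additive corrections on the base score.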
Sinequa intelligently identifies and extracts entities for document classification and tagging.
- Extensive entity identification and extraction capabilities, including geographic entities (such as countries, cities, and states), people's names, companies, numerals, dates, times, amounts, distances, quantities, units of measure, phone numbers, coordinates, URLs, e-mail addresses, hashtags, cashtags, at tags, date spans, time spans, and many types of Personally Identifiable Information (PII)
- Supplemental extractors integrated from third-party partner solutions such as SciBite, Refinitiv Intelligent Tagging, Linguamatics, or MS Azure Media Services
Text mining with NLP skills
Sinequa provides advanced capabilities to detect patterns in text, specifically for entity extraction, including:
- Lists of named entities, co-occurrences, and relationships
- Regular expressions
- Complex patterns
- Code-based extraction with inline C# custom developments
The Sinequa platform also includes a dedicated descriptive language that significantly simplifies the creation of complex extraction rules and extends the native capabilities.
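Two of the pattern types listed above, entity lists and regular expressions, can be sketched in a few lines of Python. This is an illustrative example only: it does not use Sinequa's descriptive language or its C# extension points, and the patterns, the `COMPANIES` list, and the `extract_entities` function are hypothetical and deliberately simplistic.

```python
import re

# Simplified patterns for illustration; real-world extractors need far
# more robust rules (internationalized domains, trailing punctuation, etc.).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"https?://[^\s<>\"]+"),
    "hashtag": re.compile(r"(?<!\w)#\w+"),
    "cashtag": re.compile(r"(?<!\w)\$[A-Z]{1,5}\b"),
}

# Hypothetical list of named entities (a gazetteer).
COMPANIES = {"Acme Corp", "Globex"}


def extract_entities(text):
    """Return (entity_type, value) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    for name in COMPANIES:
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), "company", name))
    return [(label, value) for _, label, value in sorted(found)]


if __name__ == "__main__":
    sample = (
        "Contact jane.doe@example.com about #launch; $ACME rose. "
        "See https://example.com/news from Acme Corp."
    )
    for entity in extract_entities(sample):
        print(entity)
```

Extracted values of this kind are what end up stored alongside the document for classification, tagging, and faceted filtering.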
Easily classify documents using our embedded machine learning models and semantic techniques without being a data scientist.
When content cannot be easily organized based on its location, existing metadata, or associated properties, dynamic classification can help surface structure from the apparent chaos. Two technologies are combined to make this happen:
- Deep-learning-based classification: The Sinequa platform enables administrators and subject matter experts to manage the complete lifecycle of their classification projects, from inception to production and the management of prediction accuracy over time (Active Learning), with a labeling application that lets subject matter experts provide ongoing feedback. Neural networks power this technology, implemented with help from a transparently embedded TensorFlow framework and BERT transfer learning language models.
- Rule-based classifiers: Classifiers are decision trees that categorize documents according to whether they would be retrieved by one of several queries. As an asynchronous post-processing task, documents are spread across categories depending on their ability to match simple or complex search criteria.
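The rule-based approach can be pictured with a minimal Python sketch. It assumes a very reduced query model, in which each category lists several queries and each query is a set of terms that must all be present; the `RULES` table and the `classify` function are hypothetical and stand in for the far richer search criteria the platform supports.

```python
import re

# Hypothetical rules: a document belongs to a category if it matches
# at least one of the category's queries (all terms of that query present).
RULES = {
    "legal": [{"contract"}, {"non", "disclosure"}],
    "finance": [{"invoice"}, {"quarterly", "report"}],
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def classify(text, rules=RULES):
    """Return the sorted list of categories whose queries match the text."""
    tokens = tokenize(text)
    return sorted(
        category
        for category, queries in rules.items()
        if any(query <= tokens for query in queries)
    )


if __name__ == "__main__":
    print(classify("Quarterly report and signed contract"))
    print(classify("Lunch menu"))
```

Because a document is tested against every category independently, it can land in several categories at once or in none.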
Multi-layer index data structure
Sinequa relies on its comprehensive and efficient index structure to deliver superior relevance from even the most extensive content and datasets without compromising performance.
If enterprise search were all about matching a keyword, a single index would suffice. While this may be sufficient for narrow applications of highly classified and structured data, it would fail if applied to unstructured data. Since the vast majority of information is captured in everyday language, no single index can serve as an optimal measure of the information contained in a corpus. Therefore, there is no one “ideal” index for every potential information query. The best results are achieved when multiple indexes are combined, each providing a different perspective or emphasis and a comprehensive view of the information available – thus deriving the best possible understanding of the meaning it carries.
When indexing unstructured data, Sinequa automatically generates a variety of indexes to provide the most comprehensive assessment of the text content. Sinequa also provides the ability to tailor how the different indexes are used in a search (by changing their weightings), allowing search results to be fine-tuned for highly specialized contexts. The Sinequa index is therefore a dynamic combination of full-text, structured, and semantic indexes.
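The effect of changing index weightings can be shown with a small Python sketch. It assumes that each index returns a per-document score normalized to the range 0 to 1 and that the combination is a plain weighted sum; both assumptions, along with the `combine_scores` function and the sample scores, are hypothetical and are not a description of Sinequa's internal scoring.

```python
def combine_scores(per_index_scores, weights):
    """Merge per-index document scores into one ranking by weighted sum.

    per_index_scores maps an index name to {doc_id: score};
    weights maps an index name to its weight (missing names count as 0).
    """
    combined = {}
    for index_name, scores in per_index_scores.items():
        weight = weights.get(index_name, 0.0)
        for doc_id, score in scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    scores = {
        "full_text": {"a": 0.9, "b": 0.2},
        "structured": {"b": 1.0},
        "semantic": {"a": 0.3, "b": 0.6, "c": 0.8},
    }
    keyword_heavy = {"full_text": 0.5, "structured": 0.2, "semantic": 0.3}
    semantic_heavy = {"full_text": 0.1, "structured": 0.1, "semantic": 0.8}
    print([doc for doc, _ in combine_scores(scores, keyword_heavy)])
    print([doc for doc, _ in combine_scores(scores, semantic_heavy)])
```

With the same underlying scores, shifting weight toward the semantic index reverses the ranking in this example, which is the kind of tuning that matters in specialized contexts.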
Sinequa can query any combination of indexes with different schemas, searching through all structured and unstructured data at once to offer the best data discovery modes.
Sinequa does not rely on any data structure derived from Apache Lucene. Multiple NLP layers process the raw text and enrich it at the lowest possible level, so that rich functionality does not degrade search performance through reliance on external or supplemental packages or libraries.
Sinequa indexes comprise full-text parts, typed columns to store and retrieve associated metadata or entities extracted at indexing time, columns dedicated to security aspects, etc.
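As a rough mental model of such an index entry, the Python sketch below groups the three kinds of data into one record and applies a security filter. The `IndexedDocument` class, its field names, and the `visible_to` helper are hypothetical and say nothing about Sinequa's actual storage layout.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class IndexedDocument:
    """Hypothetical index entry: text parts, typed columns, security column."""
    doc_id: str
    text_parts: dict[str, str]            # full-text parts, e.g. title, body
    modified: date                        # typed metadata column
    entities: dict[str, list[str]] = field(default_factory=dict)  # extracted at indexing time
    allowed_principals: frozenset[str] = frozenset()              # security column


def visible_to(docs, user_principals):
    """Keep only the documents that share a principal with the user."""
    return [d for d in docs if d.allowed_principals & user_principals]


if __name__ == "__main__":
    docs = [
        IndexedDocument(
            doc_id="hr-001",
            text_parts={"title": "Leave policy", "body": "Employees may..."},
            modified=date(2023, 1, 15),
            entities={"company": ["Acme Corp"]},
            allowed_principals=frozenset({"group:hr"}),
        ),
        IndexedDocument(
            doc_id="pub-002",
            text_parts={"title": "Press release", "body": "Today we..."},
            modified=date(2023, 2, 1),
            allowed_principals=frozenset({"group:everyone"}),
        ),
    ]
    user = frozenset({"user:jane", "group:everyone"})
    print([d.doc_id for d in visible_to(docs, user)])
```

Keeping the security information in the same record as the content is what allows access rights to be enforced as part of the query itself rather than as a later filtering step.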
The index data structure is highly optimized to ensure elasticity. It mitigates contention between simultaneous updates and searches. It is also secured with safe transactions, redundancy, and internal reorganization capabilities.