Natural Language Processing (NLP) and Machine Learning power the advanced features of Sinequa's intelligent search platform. After more than 25 years of NLP research, we are experts at making sense of each piece of text, whatever its native language. In addition, the platform embeds state-of-the-art Deep Learning frameworks to close the gap between the experience of classical enterprise search and that of today's web search engines. The resulting proprietary index is optimized to cope with huge volumes and intensive usage.

Natural Language Text Processing

Sinequa's NLP semantically enriches content in any language and powers an intelligent employee experience for search and analytics.

At indexing time, NLP processing includes:

  • Automatic detection of document languages (more than 135 languages detected), with a language splitter to handle documents that switch from one language to another
  • Real multilingual analysis at the highest level for many languages, such as English, French, German, Arabic, Chinese (Simplified), Chinese (Traditional), Korean, Danish, Spanish, Finnish, Greek, Italian, Japanese, Dutch, Polish, Portuguese, Romanian, Russian, Swedish, Thai, and Norwegian
  • Part-of-speech tagging
  • Concept extraction for related terms
  • Language-specific linguistic techniques, applied depending on the language, such as transliteration (Japanese), compound word splitting (German, for instance), and model-based disambiguation
  • Semantic analysis
  • Content enrichment with variants, standard or custom rewritings, etc.
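
To make one of the language-specific steps above concrete, here is a minimal sketch of dictionary-based compound word splitting. It is not Sinequa's implementation: the function name `split_compound`, the tiny vocabulary, and the longest-prefix-first strategy are assumptions chosen purely for illustration, and real German splitting must also handle linking elements and ambiguity.

```python
def split_compound(word, vocabulary, min_len=3):
    """Split `word` into known vocabulary words, or return [word] if no split exists.

    Hypothetical helper: tries the longest known prefix first and backtracks
    when the remainder cannot be split.
    """
    word = word.lower()

    def helper(rest):
        if not rest:
            return []  # everything consumed: successful split
        for end in range(len(rest), min_len - 1, -1):
            prefix = rest[:end]
            if prefix in vocabulary:
                tail = helper(rest[end:])
                if tail is not None:
                    return [prefix] + tail
        return None  # no vocabulary word fits here

    parts = helper(word)
    return parts if parts else [word]


# Illustrative vocabulary (an assumption, not a real lexicon).
VOCAB = {"daten", "bank", "system"}
parts = split_compound("Datenbanksystem", VOCAB)
unknown = split_compound("Haus", VOCAB)
```

Indexing the parts alongside the full compound would let a query for one component match the compound word as well.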

Statistical analysis

Sinequa uses advanced information retrieval techniques to provide relevant and contextualized results. Sinequa embeds a sophisticated variation of the well-known TF-IDF and BM25 algorithms, enhanced by multiple factors including:

  • Proximity: a measure of the "closeness" of multiple search terms in a document
  • Proximity to the head content
  • Fidelity to the original form with regard to all linguistic variants
  • Neural search footprint from document passages
  • Relevancy corrections based on document freshness, data source weighting, document ratings, and feedback models
  • Text part weighting
  • Business rules
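
As a rough illustration of the kind of scoring described above, the sketch below implements textbook Okapi BM25 and a simple proximity boost. Sinequa's actual ranking is proprietary and is not shown here: the function names, the `k1` and `b` defaults, and the boost formula are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with textbook Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes long documents.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def proximity_boost(query, doc):
    """Hypothetical boost: 1 + 1 / (smallest gap between two distinct query terms)."""
    positions = [[i for i, w in enumerate(doc) if w == t] for t in set(query)]
    positions = [p for p in positions if p]
    if len(positions) < 2:
        return 1.0  # fewer than two query terms present: no boost
    smallest_gap = min(
        abs(i - j) for p, q in combinations(positions, 2) for i in p for j in q
    )
    return 1.0 + 1.0 / smallest_gap


DOCS = [
    "enterprise search platform".split(),
    "search for the best enterprise platform today".split(),
    "cooking recipes".split(),
]
QUERY = ["enterprise", "search"]
scores = bm25_scores(QUERY, DOCS)
boosted = [s * proximity_boost(QUERY, d) for s, d in zip(scores, DOCS)]
```

In this toy corpus the first document ranks highest on both counts: it is shorter, and its two query terms are adjacent. The other factors listed above (freshness, source weighting, business rules) could be folded in as further multipliers in the same way.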

Semantic extractors

Sinequa intelligently identifies and extracts entities for document classification and tagging.

  • Extensive entity identification and extraction capabilities, including geographic entities (such as countries, cities, and states), people's names, companies, numerals, dates, times, amounts, distances, quantities, units of measure, phone numbers, coordinates, URLs, e-mail addresses, hashtags, cashtags, at-tags, date spans, time spans, and many other types of Personally Identifiable Information (PII)
  • Supplemental extractors integrated from third-party partner solutions such as SciBite, Refinitiv Intelligent Tagging, Linguamatics, or Microsoft Azure Media Services

Text mining with NLP skills

Sinequa provides advanced capabilities to detect patterns in text, specifically for entity extraction, including:

  • Lists of named entities, co-occurrences, relationships
  • Regular expressions
  • Complex patterns
  • Code-based extraction with inline C# custom developments

The Sinequa platform also includes a dedicated descriptive language that extends these native capabilities and significantly simplifies the creation of complex extraction rules.
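
The regular-expression approach mentioned above can be sketched in a few lines. This example uses Python's standard `re` module rather than Sinequa's descriptive language or its inline C# extension points, and the patterns are deliberately simplified assumptions, not production-grade validators.

```python
import re

# Simplified, illustrative patterns (assumptions; real extractors are stricter).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"https?://[^\s<>\"']+"),
    "hashtag": re.compile(r"(?<!\w)#\w+"),
    "cashtag": re.compile(r"(?<!\w)\$[A-Z]{1,5}\b"),
}


def extract_entities(text):
    """Return every match of each pattern, keyed by entity type."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


SAMPLE = (
    "Contact jane.doe@example.com about $ACME results, "
    "see https://example.com/report #earnings"
)
entities = extract_entities(SAMPLE)
```

Each entity type maps to a list, so the result could be stored in typed index columns and used for filtering or faceting.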

Data classification

Easily classify documents using our embedded machine learning models and semantic techniques without being a data scientist.

When content cannot be easily organized based on its location, existing metadata, or associated properties, dynamic classification can help surface structure out of the apparent chaos. Two technologies are combined to make this happen:

  • Deep-learning-based classification: The Sinequa platform enables administrators and subject matter experts to manage the complete lifecycle of their classification projects, from inception to production, and to monitor prediction accuracy over time (Active Learning), with a labeling application that lets subject matter experts provide ongoing feedback. Neural networks power this technology, implemented with a transparently embedded TensorFlow framework and BERT transfer-learning language models.
  • Rule-based classifiers: These classifiers are decision trees that categorize a document according to whether it would be retrieved by one of several queries. As an asynchronous post-processing task, documents are assigned to categories depending on whether they match simple or complex search criteria.
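
The rule-based idea can be sketched as follows: each category is defined by one or more queries, and a document falls into a category if it would match at least one of them. The rule format (sets of required terms), the category names, and the `classify` function are hypothetical and only stand in for Sinequa's real query language.

```python
# Each category maps to alternative queries; a query is a set of required terms.
# Rule contents are illustrative assumptions.
RULES = {
    "finance": [{"invoice"}, {"quarterly", "report"}],
    "hr": [{"vacation", "request"}, {"payroll"}],
}


def classify(text, rules=RULES):
    """Return the sorted categories whose queries the document would match."""
    tokens = set(text.lower().split())
    return sorted(
        category
        for category, queries in rules.items()
        # A category applies when ALL terms of ANY one of its queries are present.
        if any(query <= tokens for query in queries)
    )


single = classify("Quarterly report for the board")
multiple = classify("Payroll invoice attached")
none = classify("Lunch menu")
```

Because a document can satisfy queries from several categories, the function returns a list rather than a single label.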

Multi-layer index data structure

Sinequa relies on its comprehensive and efficient index structure to deliver superior relevance from even the most extensive content and datasets without compromising performance.

If enterprise search were all about matching a keyword, a single index would suffice. While this may be sufficient for narrow applications of highly classified and structured data, it would fail if applied to unstructured data. Since the vast majority of information is captured in everyday language, no single index can serve as an optimal measure of the information contained in a corpus. Therefore, there is no one “ideal” index for every potential information query. The best results are achieved when multiple indexes are combined, each providing a different perspective or emphasis and a comprehensive view of the information available – thus deriving the best possible understanding of the meaning it carries.

When indexing unstructured data, Sinequa automatically generates a variety of indexes to provide the most comprehensive assessment of the text content. Sinequa also provides the ability to tailor how the different indexes are used in a search (by changing their weightings), allowing search results to be fine-tuned for highly specialized contexts. The Sinequa index is therefore a dynamic combination of full-text, structured, and semantic indexes.
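
The effect of changing index weightings can be illustrated with a simple weighted sum. This is only a sketch under stated assumptions: the index names, the per-document scores, and the linear combination in `combine_scores` are invented for the example and do not describe how Sinequa merges its indexes internally.

```python
def combine_scores(per_index_scores, weights):
    """Blend per-index scores into one ranking using a weighted sum.

    `per_index_scores` maps index name -> {doc_id: score};
    `weights` maps index name -> weight (missing indexes count as 0).
    """
    combined = {}
    for index_name, scores in per_index_scores.items():
        weight = weights.get(index_name, 0.0)
        for doc_id, score in scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores from three indexes.
PER_INDEX = {
    "full_text": {"a": 1.0, "b": 0.5},
    "structured": {"b": 1.0},
    "semantic": {"a": 0.5, "c": 1.0},
}
balanced = combine_scores(
    PER_INDEX, {"full_text": 0.5, "structured": 0.25, "semantic": 0.25}
)
semantic_only = combine_scores(PER_INDEX, {"semantic": 1.0})
```

With the balanced weights, document `a` ranks first; when only the semantic index counts, document `c` moves to the top, which is the kind of tuning the paragraph above describes.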

Sinequa can query any combination of indexes with different schemas, searching through all structured and unstructured data at once to offer the best data discovery modes.

Sinequa does not rely on any data structure derived from Apache Lucene. Multiple NLP layers process the raw text and optimally enrich it at the lowest possible level, ensuring rich functionality does not impact search performance because of external and supplemental packages or libraries.

Sinequa indexes comprise full-text parts, typed columns to store and retrieve associated metadata or entities extracted at indexing time, columns dedicated to security aspects, etc.

The index data structure is strongly optimized to ensure elasticity. It mitigates competition between simultaneous updates and searches. It is also secured with safe transactions, redundancy, and internal reorganization capabilities.

© 2021 Sinequa. All rights reserved.