At Sinequa, we pride ourselves on our state-of-the-art technology and our continuous innovation - and our crack team of machine learning experts, developers, and linguists is constantly breaking new ground to improve the information retrieval techniques used in Sinequa’s platform. This team has published several articles explaining the challenges and solutions for advancing information retrieval in modern enterprises. Today we’re highlighting three of these articles, starting with the most recent publication.
The most recent article, published this month, is Meta-Learning for Keyphrase Extraction. It delves into the how and why of keyphrase extraction (KPE): extracting phrases or groups of words from a document that best capture and represent its content. The article details the approach taken for a model to learn the general principle of “what is a good keyphrase?”
The team sought high scores not only on the content used during training but, more importantly, on previously unseen content from other subjects. To do this, the team trained the KPE model on several varied datasets to expose it to different content, domains, and highly varied text structures. They kept adding to the training set as long as the additional material didn’t worsen the predictions the model had made without it. The result was a broader, more general model that was highly predictive not only on material contained in the training content but also on new, never-before-seen content, a setting known as zero-shot learning.
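The "keep adding data as long as nothing gets worse" loop can be sketched as follows. This is a minimal illustration, not the procedure from the article: the function names, the `tolerance` parameter, the dataset names, and the toy scoring function are all hypothetical, and the real acceptance criterion may differ.

```python
from typing import Callable, Dict, List, Sequence


def select_training_sets(
    candidates: Sequence[str],
    train_and_score: Callable[[List[str]], Dict[str, float]],
    tolerance: float = 0.0,
) -> List[str]:
    """Greedily grow the training mix.

    A candidate dataset is kept only if, after retraining with it, no
    previously accepted dataset scores lower than it did before
    (within `tolerance`). All names here are hypothetical.
    """
    accepted: List[str] = []
    baseline: Dict[str, float] = {}
    for name in candidates:
        trial = accepted + [name]
        scores = train_and_score(trial)  # one score per dataset in the trial mix
        regressed = any(scores[d] < baseline[d] - tolerance for d in accepted)
        if not regressed:
            accepted, baseline = trial, scores
    return accepted


def toy_train_and_score(mix: List[str]) -> Dict[str, float]:
    """Stand-in for real training: made-up scores in which the
    hypothetical 'forum_noise' dataset hurts every other dataset."""
    base = {"news": 0.6, "papers": 0.5, "forum_noise": 0.3, "web": 0.4}
    penalty = 0.1 if "forum_noise" in mix else 0.0
    bonus = 0.02 * (len(mix) - 1)  # more varied data helps a little
    return {
        d: base[d] + bonus - (0.0 if d == "forum_noise" else penalty)
        for d in mix
    }


kept = select_training_sets(
    ["news", "papers", "forum_noise", "web"], toy_train_and_score
)
print(kept)
```

In this toy run the dataset that lowers earlier scores is rejected and the others are kept. In a real setting `train_and_score` would retrain the model and evaluate it on held-out data for each accepted dataset.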
As the article concludes, this "meta learning" means that the model didn’t just learn to detect keywords (from specific content), but it learned how to learn to detect keywords (in any content). Tests using this approach showed great results on many content types, including noisy, out-of-domain, and more complicated datasets.
The second article, published in December 2020, focused on Classifying long textual documents (up to 25 000 tokens) using BERT. It covers how the application of deep learning crushes existing baseline metrics for classification by leveraging modern language models.
This achievement did not come without its challenges, however. Modern, transformer-based language models cannot efficiently deal with very long text sequences. The core limitation when pre-trained models deal with long documents is the memory footprint, which grows quadratically with the number of tokens. Commercial applications are also more constrained than most academic research: they require faster response times using less hardware. BERT is very large, and too computationally complex for fast, cost-efficient use at scale. To overcome this challenge, Sinequa trained a smaller, more efficient, multilingual language model that is much less computationally demanding than BERT but with comparable accuracy.
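To make "quadratically" concrete, the short calculation below counts the entries in a single self-attention score matrix, which has one entry per pair of tokens. It is a back-of-the-envelope sketch only: it ignores the number of layers and heads, numeric precision, and everything else that contributes to real memory use.

```python
def attention_entries(num_tokens: int) -> int:
    """Entries in one self-attention score matrix (tokens x tokens)."""
    return num_tokens * num_tokens


short_doc = attention_entries(512)      # 262,144 entries
long_doc = attention_entries(25_000)    # 625,000,000 entries
growth = long_doc / short_doc           # roughly 2,384 times more

print(short_doc, long_doc, round(growth))
```

The document is about 49 times longer, but the attention matrix is roughly 2,384 times larger, which is why feeding a 25 000-token document straight into a standard transformer is impractical.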
The article details a novel (at the time) approach for classifying long documents by leveraging a variety of feature types, including long text (sequences longer than the 512 tokens handled by BERT), additional textual metadata, and categories. The trick is to split the long content into shorter sequences, process those sequences, and merge the output with a smaller transformer block, all without introducing a considerable computational cost.
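The split, process, and merge steps can be sketched as below. This is an illustrative outline under stated assumptions, not Sinequa’s architecture: `toy_encoder` is a hypothetical stand-in for a real language model, and a simple mean-pool replaces the smaller transformer block that the article uses for merging.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def split_into_windows(tokens: Sequence[str], window: int = 512) -> List[List[str]]:
    """Cut a long token sequence into consecutive windows of at most `window` tokens."""
    return [list(tokens[i:i + window]) for i in range(0, len(tokens), window)]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average the per-chunk vectors. Assumption: this stands in for the
    small transformer block that merges chunk outputs in the article."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def encode_long_document(
    tokens: Sequence[str],
    encode_chunk: Callable[[List[str]], Vector],
    merge: Callable[[Sequence[Vector]], Vector] = mean_pool,
    window: int = 512,
) -> Vector:
    """Encode each window independently, then merge into one document vector."""
    chunks = split_into_windows(tokens, window)
    if not chunks:
        raise ValueError("document has no tokens")
    return merge([encode_chunk(chunk) for chunk in chunks])


def toy_encoder(chunk: List[str]) -> Vector:
    """Hypothetical encoder: [chunk length, average token length]."""
    return [float(len(chunk)), sum(len(t) for t in chunk) / len(chunk)]


tokens = ["tok"] * 1200
windows = split_into_windows(tokens)
doc_vector = encode_long_document(tokens, toy_encoder)
print([len(w) for w in windows], doc_vector)
```

Because each window is encoded on its own, the expensive attention computation never sees more than 512 tokens at a time; only the much shorter list of per-chunk vectors has to be combined at the end.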
Although it may sound straightforward, some of the details broke new ground in this space. ML-based classification has traditionally relied on text only, but Sinequa created an approach that includes categorical metadata to supplement the textual features. Perhaps most notable is that Sinequa created a classification model architecture for long textual documents before such an approach was known in the industry - Google (with slightly more resources at their disposal) hinted at a similar technique (but used for document similarity) in October of 2020.
The third and earliest article made its debut in April 2020 as Query Intent: Few-Shot Learning & Out-of-domain, addressing the need for a large training set when performing machine-learning-based query intent detection. The magnitude of this problem only increases when providing a training service instead of a trained model, and a service is necessary to provide high accuracy across a broad range of applications. The article advances the state of the art on two specific challenges: how to create greater model specificity with only a few examples (few-shot learning) and how to discern when not to use the model because it hasn’t been trained for the query at hand (out-of-domain detection).
Few-shot learning makes it easy to train query intent, because it greatly reduces the size of the training set required. Sinequa created a two-phase approach in which the model is first pre-trained on public datasets and later fine-tuned on specifics provided by the customer. Pre-training on a large public dataset provides decent performance out of the box, so the customer only has to define a small training set of 10 to 20 examples to get excellent results.
Of course, it’s impossible to train the model for every possible scenario and topic. Therefore, the system must know when to apply the intent, and when to skip the model and provide traditional results (when the query is out-of-domain). The approach to addressing this challenge is tricky. It relies on a combination of a) the accurate classification of datasets and b) dynamically computed confidence thresholds needed to perform non-intent detection. In practice, the customer can always define "non-intent" queries, but the goal is to provide a solution that works well out-of-the-box, without requiring the customer to try to think of all the topics where the model won’t work. Sinequa’s approach worked well in testing, and is notable for advancing real-world use of query intent, as out-of-domain detection is rarely covered in current literature.
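One way to picture confidence-based out-of-domain routing is sketched below. The article does not spell out how its thresholds are computed, so the rule used here (choose the threshold so that a given fraction of known in-domain examples still pass) is purely an assumption, as are all function, parameter, and intent names.

```python
from typing import Dict, Optional, Sequence


def compute_threshold(
    in_domain_confidences: Sequence[float], keep_fraction: float = 0.9
) -> float:
    """Derive a threshold from the data instead of hard-coding it.

    Assumption: the threshold is the confidence at which roughly
    `keep_fraction` of known in-domain examples are still accepted.
    """
    if not in_domain_confidences:
        raise ValueError("need at least one in-domain confidence")
    ranked = sorted(in_domain_confidences, reverse=True)
    index = int(keep_fraction * len(ranked)) - 1
    index = max(0, min(len(ranked) - 1, index))
    return ranked[index]


def route_query(intent_scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the top intent, or None to fall back to traditional results."""
    intent, confidence = max(intent_scores.items(), key=lambda item: item[1])
    return intent if confidence >= threshold else None


# Made-up confidences of the model on the customer's own examples.
confidences = [0.95, 0.9, 0.88, 0.8, 0.78, 0.75, 0.7, 0.65, 0.6, 0.3]
threshold = compute_threshold(confidences)

in_domain = route_query({"hr_policy": 0.82, "it_support": 0.10}, threshold)
out_of_domain = route_query({"hr_policy": 0.41, "it_support": 0.38}, threshold)
print(threshold, in_domain, out_of_domain)
```

When `route_query` returns `None`, the search falls back to traditional results rather than forcing a low-confidence intent onto the query.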
We hope you find these articles sufficiently enlightening to give them each a “clap” on Medium and encourage you to share them all with the search nerds in your life. Also, keep an eye out for more cutting-edge tech; we will soon be publishing about how we’ve advanced the state of the art in passage ranking for neural search!