
Search Relevance Guide

Posted by Charlotte Foglia

Let’s say I just typed “new york times square dance” into a search bar. Am I looking for information about the upcoming dance happening in Times Square? Or an article from the New York Times about square dancing? This is one of the many challenges that search relevancy must solve with every query.

Achieving relevance is all about understanding intent, and it’s complicated by syntax, like in the example above, and by many other factors. Different users may phrase the same query in different ways. On the other hand, different users might enter the same query even though they are looking for different results. With the same query, one user might be looking for a document, while a second user is searching for a very precise answer, and a third is seeking a broad overview of a given subject.

Search relevance is a complicated but extremely valuable part of a high-performing enterprise search tool. Let’s review what search relevance is, why it matters, and how it can be measured.

Search relevance definition

Search relevance measures how closely the search results relate to the user’s query. When a user enters a search, they’re looking for answers. Results that better match what the user is hoping to find are more relevant and should appear higher on the results page.

In recent years, advanced internet search engines such as Google have raised user expectations. When users enter a query, they expect a high-quality user experience (UX) with highly accurate and relevant results every time. But Google’s search prowess comes from more than 20 years and billions of dollars spent on research and development.

Most enterprise search platforms today haven’t had the same time and resources to dedicate to search as Google. Many cannot understand the user’s intent and deliver relevant search results to the same degree.

Enterprise search is different. In enterprise search, relevancy is about more than just the order of the answers or how well they align with the query. What matters is how well the search experience helps the user accomplish a goal or task. Relevant enterprise search systems succeed in boosting innovation, increasing productivity, improving customer service, and supporting decision-making. In short, enterprise search relevance is much more about how well the solution helps users save time and accomplish their tasks.

Brief history

Search relevance was born almost in tandem with the internet. In the early 1990s, while Information Retrieval was already a kind of science for librarians and researchers, some of the first search engines–Archie and Gopher–were introduced by researchers at McGill University and the University of Minnesota, respectively, to help researchers search the systems of other universities that they were connected to. These systems, however, were reserved mainly for academics with advanced knowledge of computers.

When the World Wide Web started to pick up speed, more user-friendly search engines like WebCrawler and Ask Jeeves proliferated to help users sift through the burgeoning number of websites and pages. The relevancy of these early search engines was based solely on the number of times the search query appeared on a web page, a far cry from today’s experience. Google arrived in the late nineties and significantly stepped up the search engine game, building intricate algorithms to assess and rank relevant content.

Today, internet and enterprise search engines are employing sophisticated tools like natural language processing and machine learning to take search relevance to the next level.

Don’t confuse Enterprise Search with Web Search

We live in an era where just about everyone has used Google (or Bing or Yahoo, etc.) to find things on the internet, and the quality of results is usually excellent. This sets a high bar, and people often expect enterprise search to behave similarly.

However, searching for web pages on the internet is extremely different from looking for content in an enterprise. Finding content within an enterprise is a much, much more complex problem and, therefore, much more challenging to do well. Consider some of the differences between internet search and enterprise search:

Internet Search	Enterprise Search
Mostly one content type: HTML pages on websites	Many different content types (e.g., documents, records) in various siloed, heterogeneous repositories (e.g., ECM, WCM, databases, file systems)
Content is curated and designed to be found (SEO)	Written to be used, not to be found; often little or no curation.
Embedded, reliable metadata (SEO)	Minimal and often incorrect metadata
The number of incoming links can easily determine authoritative content to enhance relevance (i.e., Google PageRank)	Authoritative content not really well defined.
20+ years of search history incorporated into results	Lack of search history to (re-)use
No security	Multiple, multilayered security models must be enforced
An extremely high volume of activity identifies the most valuable content and enables learning (ML)	Volume too low for reliable self-learning
Continuously tuned and monitored by thousands of employees	How many are there in your organization?

Because enterprise content is much more diverse, is not as “findable” as web pages on the internet, isn’t created with the intent to be found, and doesn’t have a wealth of user behavior to learn from, the results from enterprise search often feel disappointing compared to the experience we have come to expect.

That doesn’t mean that enterprise search is poor. A good deployment of search vastly improves the ability to locate and retrieve information critical to the enterprise’s success. Rather, it means we’re evaluating it against the wrong standard.

Enterprise search should not be deemed unsuccessful just because the experience isn’t as smooth as Google. Instead, it should be compared against the existing standard in the enterprise, which could be file share search, Intranet search, SharePoint search, or perhaps even no search at all.

The benefits of search relevance

Search results with a higher relevance score lead to more satisfied and engaged users. Users who can quickly find the information they’re looking for are more likely to take the next step and repeat the search experience regularly. Depending on the audience, that could mean prospects are converting into customers, team members are completing work assignments, or executives are making informed, mission-critical decisions.

Many factors can cause users to see results unrelated to their query and make it difficult to get the answers they need. Frustrated users are more likely to use a different channel to find the right information, even if it takes more time. This is why search relevance is an essential part of the enterprise search user experience (UX).

Relevancy and UX

One challenge in optimizing search relevance is the lack of a clear distinction between where relevancy ends and the user experience begins. Relevancy and UX are often intertwined notions. Relevant results will improve usability and UX, but a bad UX might affect the users’ judgment regarding the quality of search results.

Users often conflate suboptimal relevance with a suboptimal user experience. Many external factors can introduce bias when judging the relevance of the system, such as the readability of the user interface, its responsiveness to user actions, the quality of the snippets in the results, or the absence of actionable links.

If the user is confused by the display, has to scour a crowded screen to locate the nugget they need, has to click many times, or simply doesn’t “see” the result they’re looking for, they will eventually give up in frustration.

Similarly, if users are left on their own to guess and experiment until they find the exact terminology and syntax needed for “just the right” query, relevance will be perceived as poor compared to a system that can perform some interpretation and provide suggestions that guide the user to ask a better question.

How is search relevance assessed?

Relevance is not only complex but also a gray area. What is relevant to one person may not be relevant to another (think about the different interests of someone in Legal vs. someone in Research). Queries can be phrased in many different ways, and the nuances of query formulation can bring about differences in results. Because of the users’ subjectivity, the plurality of needs and expectations, and even the data being indexed, there is no such thing as an absolute value for relevance.

Relevancy is hard to measure accurately and cannot be easily compared between different systems, even if benchmarks can be conducted with help from numerical indicators. In general, only the trend of your indicator(s) will help you understand if and when your system improves and when it does not.

How to evaluate search relevance: scoring and metrics

We all want to make our users happier, our search experience better, and our relevance greater, but how do we determine that the system improves? We’ll outline two approaches here:

Evaluating search relevance with explicit feedback

Precision and recall are among the best-known indicators for evaluating search effectiveness. Although they are not easy to implement because of the tremendous effort they require, we will introduce them briefly since most of the other indicators are derived from them.

Precision is the proportion of relevant results among all returned search results. Recall is the ratio of the relevant results returned to the total number of relevant results that could have been returned. These two quantities usually compete against each other: improving precision tends to lower recall, and vice versa.
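As a minimal sketch of the two definitions above, the function below computes precision and recall for a single query. The document IDs and the relevance judgments are hypothetical, made up for illustration only.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: document ids the engine returned for the query
    relevant: all document ids in the corpus judged relevant to the query
    """
    returned_set = set(returned)
    relevant_set = set(relevant)
    hits = returned_set & relevant_set  # relevant documents that were returned
    precision = len(hits) / len(returned_set) if returned_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical judgments for a single query.
returned = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d7", "d8", "d9"}

p, r = precision_recall(returned, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

Two of the four returned documents are relevant (precision 0.5), but only two of the five relevant documents were returned (recall 0.4). Note that computing recall requires the full `relevant` set, which is exactly the part that is expensive to obtain in practice.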

This is a supervised process where the user is front and center. For each query, the user (a Subject Matter Expert) has to scan the whole result set and judge whether every single result is relevant to the query in order to compute the precision. To compute the recall, they also have to know the complete list of documents in the corpus that are relevant to this query.

As the number of documents in the corpus becomes larger and as we increase the number of queries for evaluation, this task can quickly become unmanageable.

When using this approach, a popular measure is the Normalized Discounted Cumulative Gain (NDCG). In a nutshell, users have to grade the first K documents from the result set (usually K = 20). For each document, the users assign a score (the Gain), for instance on a 4-point scale from 0 (irrelevant) to 3 (perfect). The score for each result is then weighted by its rank in the result list to reflect the importance of the position (Discounted).

Finally, all the scores are summed up (Cumulative) and Normalized (divided by the score of the ideal ranking, to reflect that some queries are more complex to answer than others).

Rank K=4	Judgment (Gain)	Discounted Gain	Discounted Cumulative Gain (DCG)	Ideal Discounted Gain	Ideal Discounted Cumulative Gain (iDCG)	Normalized Discounted Cumulative Gain (NDCG@4)
1	2	2/1	2	3/1	3.0	0.67
2	0	0/2	2	2/2	4.0	0.5
3	3	3/3	3	2/3	4.67	0.64
4	2	2/4	3.5	0/4	4.67	0.75
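The table can be reproduced with a few lines of code. This is a sketch that assumes the same simple 1/rank discount the table uses; many published formulations of NDCG use a logarithmic discount instead, so absolute values will differ between conventions. The function names are illustrative.

```python
def dcg(gains):
    """Discounted cumulative gain using the 1/rank discount from the table."""
    return sum(gain / rank for rank, gain in enumerate(gains, start=1))


def ndcg_at_k(gains, k):
    """NDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal = sorted(gains, reverse=True)  # best possible ordering of the judged results
    ideal_dcg = dcg(ideal[:k])
    return dcg(gains[:k]) / ideal_dcg if ideal_dcg else 0.0


judgments = [2, 0, 3, 2]  # gains from the table, in ranked order

for k in range(1, len(judgments) + 1):
    print(k, round(ndcg_at_k(judgments, k), 2))
# 1 0.67
# 2 0.5
# 3 0.64
# 4 0.75
```

Whichever discount is chosen, it should stay fixed across measurements so that the trend of the indicator remains comparable over time.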

Since it is a manual and repetitive effort, this process should be automated as much as possible to avoid entry errors and let the users focus on the added value part of the process: grading the results.

Note that measuring relevance in this way is great for a single point-in-time relevancy improvement effort but is usually not practical for regular ongoing maintenance because:

It is time-consuming (hence costly). Users/SMEs are usually not available on-demand, and planning is required.
The risk of sampling bias increases with fewer SMEs, but the larger the number of SMEs, the costlier the effort gets.

After a one-time effort to maximize relevancy, it’s common to use more cost-effective techniques on an ongoing basis to seek relevancy improvements rather than to attempt to re-measure absolute relevancy multiple times.

Evaluating search relevance with implicit feedback

Another way to evaluate search relevance is by measuring user satisfaction with user event logs. This is typically a medium- to long-term approach since it requires a high enough volume of interactions for the system to start to be trusted. However, it’s important to capture all the events needed to report on your chosen KPIs from the beginning.

Once the critical mass of events is reached, it might alleviate the burden of grading results manually to understand if the system is improving. Also, by embracing a larger base of users and queries, you’ll avoid the bias of your SME sample and make sure you are optimizing for most of your users.

The most popular metrics to track are:

Search Exits (i.e., percentage of users leaving the search application after viewing the first result page)
Results page views (in combination with filtering actions or not) per search (i.e., how much effort it requires to find the right information before clicking through)
Percentage of searches without any results
The average number of query reformulations before click-through (Document Navigator or Original URL)
Actual Bounce Rate (ABR). Unlike the plain bounce rate, the ABR takes dwell time into account: for instance, a user who spends 30 minutes on a page before returning to the search results page to perform a new action should not be counted as a bounce.

These are just a few examples–additional KPIs can be added based on users’ activity. Additional information can be captured to create new indicators or improve the existing ones such as the time spent on a specific activity, the dwell time, and actions such as page scrolling.
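A few of the metrics listed above can be sketched from an event log as follows. The `SearchEvent` record, its field names, and the 30-second bounce threshold are assumptions made for this illustration; a real deployment would define its own event schema and thresholds. For simplicity, the rates here are computed per search rather than per user.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchEvent:
    """One search interaction (hypothetical schema)."""
    query: str
    result_count: int
    clicked: bool                # did the user click through to a result?
    left_after_first_page: bool  # did the user leave search after page one?
    # Seconds spent on the clicked result before returning to the results
    # page; None if the user did not click or never came back.
    dwell_seconds: Optional[float] = None


def search_kpis(events, bounce_threshold_seconds=30.0):
    """Compute zero-result rate, search exit rate and actual bounce rate."""
    total = len(events)
    if total == 0:
        return {"zero_result_rate": 0.0, "search_exit_rate": 0.0, "actual_bounce_rate": 0.0}
    zero_results = sum(1 for e in events if e.result_count == 0)
    exits = sum(1 for e in events if e.left_after_first_page)
    clicks = [e for e in events if e.clicked]
    # A click only counts as a bounce if the user came back quickly.
    bounces = sum(
        1 for e in clicks
        if e.dwell_seconds is not None and e.dwell_seconds < bounce_threshold_seconds
    )
    return {
        "zero_result_rate": zero_results / total,
        "search_exit_rate": exits / total,
        "actual_bounce_rate": bounces / len(clicks) if clicks else 0.0,
    }


events = [
    SearchEvent("vacation policy", 12, True, False, dwell_seconds=5.0),    # quick return: bounce
    SearchEvent("expnse report", 0, False, True),                          # no results, user left
    SearchEvent("payroll portal", 3, True, False, dwell_seconds=1800.0),   # long dwell: not a bounce
    SearchEvent("org chart", 8, False, True),                              # left without clicking
    SearchEvent("benefits", 20, True, False),                              # clicked, never came back
]

kpis = search_kpis(events)
print(kpis)
```

With this sample, one search in five returns no results, two in five end in a search exit, and one of the three click-throughs counts as an actual bounce.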

On top of helping you monitor the relevancy of your system, these breadcrumbs of the users’ activity can help you spot exactly what is not working and can be leveraged to improve it.

Do A/B Testing: While it is cheaper to capture users’ clickstream than to mobilize users, using simple clicks to gauge the relevance of the search results introduces a bias. Regardless of the KPIs chosen, when rolling out a new version of your relevance configuration (i.e., change in parameters / new rules / etc.), you should use an A/B testing strategy to get a sense of how your new system performs before rolling it out to all users.
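One possible way to set up such a test is sketched below: users are split deterministically between the current configuration (A) and the new one (B), and a KPI expressed as a proportion (here, click-through rate) is compared with a two-proportion z-test. The experiment name, the counts, and the choice of test are assumptions for illustration, not a prescribed method.

```python
import hashlib
import math


def assign_variant(user_id, experiment="relevance-v2"):
    """Deterministically assign a user to 'A' or 'B'.

    Hashing the experiment name together with the user id keeps each user
    in the same variant for the whole test.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def two_proportion_z(successes_a, total_a, successes_b, total_b):
    """z statistic for the difference between two proportions (pooled standard error)."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_b - p_a) / standard_error if standard_error else 0.0


# Hypothetical counts: searches with a click-through out of all searches.
z = two_proportion_z(successes_a=400, total_a=1000, successes_b=450, total_b=1000)
print(f"variant for user-42: {assign_variant('user-42')}, z = {z:.2f}")
```

An absolute z value above roughly 1.96 is the conventional threshold for a difference at the 5% significance level. Since clicks are a biased signal, it is worth checking more than one KPI before declaring a winner.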

Understanding the query

At this stage, regardless of how much effort you put into the tuning of your relevancy, your users are still probably not fully satisfied.

Why is that? We’ve been devoting most of our effort to returning relevant results, but does our search engine understand our searchers? This critical part is often underestimated and sometimes completely ignored. It can be broken down into two topics:

How effective are the users at communicating their intent to the search engine?
How effective is the search engine at understanding their intent?

Regarding the first topic, there are a few quick wins:

Guide the users: Use a suggest-as-you-type feature to guide them. Be careful not to bait them with keywords that won’t return the expected content.
Suggest query refinements: Educate the users about the query syntax and the user interface, for example with a prompt like Google’s “Did you mean…?”
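Both quick wins can be prototyped with the standard library. The sketch below assumes a hypothetical mapping from past queries to the number of results they return. Queries with zero results are dropped so that suggestions never bait the user, and `difflib.get_close_matches` provides a simple “Did you mean” candidate. The class and method names are illustrative.

```python
import bisect
import difflib


class QuerySuggester:
    def __init__(self, query_result_counts):
        # Keep only queries known to return results, so a suggestion
        # never leads the user to an empty result page.
        self._queries = sorted(
            {q.lower() for q, count in query_result_counts.items() if count > 0}
        )

    def suggest(self, prefix, limit=5):
        """Suggest-as-you-type: known queries that start with the prefix."""
        prefix = prefix.lower()
        start = bisect.bisect_left(self._queries, prefix)
        matches = []
        for query in self._queries[start:]:
            if not query.startswith(prefix) or len(matches) == limit:
                break
            matches.append(query)
        return matches

    def did_you_mean(self, query, cutoff=0.8):
        """Return the closest known query, or None if nothing is close enough."""
        query = query.lower()
        candidates = difflib.get_close_matches(query, self._queries, n=1, cutoff=cutoff)
        if candidates and candidates[0] != query:
            return candidates[0]
        return None


# Hypothetical query log: query -> number of results it returns.
suggester = QuerySuggester({
    "expense report": 42,
    "expense policy": 17,
    "expat tax form": 0,   # returns nothing, so it is never suggested
    "vacation request": 8,
})

print(suggester.suggest("exp"))                 # ['expense policy', 'expense report']
print(suggester.did_you_mean("expnse report"))  # expense report
```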

The natural next step is to take a look at the actual queries being entered to understand what users are looking for. If you cannot understand the users’ intent by looking at their queries, it is doubtful that the engine will outperform you.

This is the recommended next step for any effort to improve relevancy because it allows you to focus on the most widespread actual uses of the system. This helps to ensure that you’re improving real-world results AND focusing on benefiting the most people and the most common uses.

Identify the query type

As you look into the actual queries, it’s helpful to bear in mind the approach that internet search engines use to classify users’ queries. The three query types identified in SEO literature are “Navigational,” “Informational,” and “Transactional” (Google’s “Go,” “Know,” and “Do,” respectively). These will help you figure out what the users are looking for, even though you will sometimes be unable to assign one of these categories to a query because the high-level user intent is unclear or ambiguous.

Transactional (“Do”): The user wants to do something. In the Enterprise context, Transactional is not meant to be taken literally (i.e., employees are usually not trying to buy something). This most often means submitting a request or a form. It could be an action on behalf of the company such as approving an invoice, signing a contract, or submitting a purchase request. The user could also wish to perform an employee action such as submitting a vacation request or an expense or changing one’s tax withholding.

Informational (“Know”): The user wants to obtain some information. Ultimately, they may act on this information, but if that action is separate from the query process or is a physical action rather than an online action, then it’s not a “Do” query because the extent of the system’s response is to provide information. Informational queries can be further subdivided ad nauseam, which is beyond the scope of this guide.

Navigational (“Go”): The user wants to know where to go to accomplish something or locate context-specific information. In the Enterprise context, this most often means going to a particular system or site (e.g., to log in to the payroll portal or to the contract management system). This query type is usually well represented among the top queries but lacks variety in Enterprise Search, because the set of applications is limited.
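As a starting point for a first pass over a query log, the three types above can be approximated with a toy keyword heuristic. The word lists below are hypothetical and deliberately small. They would need to be replaced with vocabulary drawn from your own queries, and ambiguous queries will still require human review.

```python
# Hypothetical keyword lists; replace with terms observed in your own query log.
TRANSACTIONAL_WORDS = {"submit", "request", "approve", "sign", "change", "file"}
NAVIGATIONAL_WORDS = {"portal", "login", "site", "homepage", "intranet"}


def classify_query(query):
    """Rough Do/Know/Go classification of a single query."""
    words = set(query.lower().split())
    if not words:
        return "unclassified"
    if words & TRANSACTIONAL_WORDS:
        return "transactional"   # "Do": the user wants to perform an action
    if words & NAVIGATIONAL_WORDS:
        return "navigational"    # "Go": the user wants to reach a system or site
    return "informational"       # "Know": default when no other signal is found


for query in ["submit vacation request", "payroll portal login", "parental leave policy"]:
    print(f"{query!r}: {classify_query(query)}")
# 'submit vacation request': transactional
# 'payroll portal login': navigational
# 'parental leave policy': informational
```

Defaulting to “informational” is a simplification: it hides the ambiguous cases, so a sample of each bucket should be checked by hand.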