Thursday, January 16, 2025

[Tutorial] Querying the Keyword Field Data Type

by Trent

This article explores the many ways in which we can query the Keyword field data type using the .NET OpenSearch.Client NuGet package. Please read the Keyword Field Data Type Indexing Deep Dive article prior to completing this tutorial, as it contains useful prerequisite information.

At the time of writing, the Elastic.co documentation is far richer than the OpenSearch documentation, so links to both vendors may be provided to reference the concepts discussed. For the features covered here, the two offerings are currently functionally equivalent.

The Code Sloth public repository contains complete code for the snippets from this article. Check out the OpenSearch section of the code samples page for a GitHub link.

Let’s get started!

Searching on Keyword Fields

The most obvious use case for writing a query is to execute a search request. There are many types of search requests that we can perform on keyword fields, some of which are a bit ambiguous.

Let’s dig into each of these cases after first examining the shape of our document under test:

namespace KeywordDemo.Documents
{
    /// <summary>
    /// A sample document that contains a single keyword field that is explored during multiple tests within the suite
    /// </summary>
    public record ProductDocument
    {
        public ProductDocument(int id, string name)
        {
            Id = id;
            Name = name ?? throw new ArgumentNullException(nameof(name));
        }

        /// <summary>
        /// The Id field of a document is automatically used for the document id at indexing time
        /// </summary>
        public int Id { get; init; }

        /// <summary>
        /// This string property will be mapped as a keyword
        /// Conceptually this property may represent the name of a product
        /// </summary>
        public string Name { get; init; }
    }
}

It is a simple record, with minimal data:

  • An integer Id property to represent the document’s id numerically
  • A string Name that conceptually represents a product’s name

Keyword search will only produce a match if the given string is identical to a value on a document.

Term Query

Let’s start by observing the happy case for a term query.

The test case below will:

  • Index two documents with the names mouse and mouse pad respectively
  • Run two test cases that each issue a single term search using one of the above names per case
  • Assert that we retrieve a single document whose name matches the given term
    • This is despite the word mouse appearing in both documents, which could see multiple documents returned in the mouse pad test case in an analysed search
[Theory]
[InlineData("mouse", "Only the document with name mouse will match")]
[InlineData("mouse pad", "Only the document with name mouse pad will match")]
public async Task KeywordMapping_ExactlyMatchesWholeTermQuery(string termText, string explanation)
{
	var indexName = "keyword-index";
	await _fixture.PerformActionInTestIndex(
		indexName,
		mappingDescriptor,
		async (uniqueIndexName, opensearchClient) =>
		{
			var productDocuments = new[] {
				new ProductDocument(1, "mouse"),
				new ProductDocument(2, "mouse pad"),
			};

			await _fixture.IndexDocuments(uniqueIndexName, productDocuments);

			var result = await opensearchClient.SearchAsync<ProductDocument>(selector => selector
				   .Index(uniqueIndexName)
				   .Query(queryContainer => queryContainer
					   .Term(term => term
						   .Field(field => field.Name)
						   .Value(termText)
						   )
					   )
				   .Explain()
			   );

			result.IsValid.Should().BeTrue();
			result.Documents.Should().ContainSingle(doc => string.Equals(doc.Name, termText), explanation);
		}
	);
}

This query produces the following DebugInformation in the response object:

Valid OpenSearch.Client response built from a successful (200) low level call on POST: /keyword-index3534a67e-048c-40b4-938b-0a31e2fd9f77/_search?pretty=true&error_trace=true&typed_keys=true
# Audit trail of this API call:
 - [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.1988437
# Request:
{"explain":true,"query":{"term":{"name":{"value":"mouse pad"}}}}
# Response:
{
  "took" : 20,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6931471,
    "hits" : [
      {
        "_shard" : "[keyword-index3534a67e-048c-40b4-938b-0a31e2fd9f77][0]",
        "_node" : "l7YV4K5YSFuy_CFfGwt8ig",
        "_index" : "keyword-index3534a67e-048c-40b4-938b-0a31e2fd9f77",
        "_id" : "2",
        "_score" : 0.6931471,
        "_source" : {
          "id" : 2,
          "name" : "mouse pad"
        },
        "_explanation" : {
          "value" : 0.6931471,
          "description" : "weight(name:mouse pad in 1) [PerFieldSimilarity], result of:",
          "details" : [
            {
              "value" : 0.6931471,
              "description" : "score(freq=1.0), computed as boost * idf * tf from:",
              "details" : [
                {
                  "value" : 2.2,
                  "description" : "boost",
                  "details" : [ ]
                },
                {
                  "value" : 0.6931472,
                  "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details" : [
                    {
                      "value" : 1,
                      "description" : "n, number of documents containing term",
                      "details" : [ ]
                    },
                    {
                      "value" : 2,
                      "description" : "N, total number of documents with field",
                      "details" : [ ]
                    }
                  ]
                },
                {
                  "value" : 0.45454544,
                  "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  "details" : [
                    {
                      "value" : 1.0,
                      "description" : "freq, occurrences of term within document",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.2,
                      "description" : "k1, term saturation parameter",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.75,
                      "description" : "b, length normalization parameter",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "dl, length of field",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "avgdl, average length of field",
                      "details" : [ ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

# TCP states:
  Established: 44
  TimeWait: 68
  CloseWait: 1

# ThreadPool statistics:
  Worker: 
    Busy: 1
    Free: 32766
    Min: 12
    Max: 32767
  IOCP: 
    Busy: 0
    Free: 1000
    Min: 12
    Max: 1000

This output confirms that for a term query of value mouse pad, we only find the document whose name is mouse pad. Interestingly, despite the fact that a term query on a keyword field is essentially a boolean (true or false) matching operation, we can also see that the result is scored based on term frequency and the occurrence of the term across indexed documents. We’ll explore how to suppress this shortly.
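To make the scoring less mysterious, here is a small sketch that redoes the arithmetic from the _explanation block above. It only plugs the logged values into the two formulas printed in the explanation; it is not how OpenSearch computes the score internally, and the variable names are our own.

```csharp
using System;

// Inputs copied from the _explanation block of the term query response.
const double boost = 2.2;
const double n = 1;       // n, number of documents containing term
const double N = 2;       // N, total number of documents with field
const double freq = 1.0;  // occurrences of term within document
const double k1 = 1.2;    // term saturation parameter
const double b = 0.75;    // length normalization parameter
const double dl = 1.0;    // length of field
const double avgdl = 1.0; // average length of field

// idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)), using the natural logarithm
double idf = Math.Log(1 + (N - n + 0.5) / (n + 0.5));

// tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
double tf = freq / (freq + k1 * (1 - b + b * dl / avgdl));

// score, computed as boost * idf * tf
double score = boost * idf * tf;

Console.WriteLine($"idf = {idf:F7}, tf = {tf:F7}, score = {score:F7}");
```

With these inputs the boost and tf cancel out (2.2 * 1 / 2.2), so the score collapses to the idf, which is ln(2) or roughly 0.693. The last digit may differ slightly from the logged 0.6931471 because this sketch uses double precision.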

In the interim, let’s move onto an unhappy case.

Our next test will:

  • Index the same two documents
  • Run three test cases using product names that each differ from an indexed name by a single character
    • The first case is missing the letter e
    • The second case is missing some whitespace
    • The third case starts with a capital letter
  • Assert that we retrieve no documents
[Theory]
[InlineData("mous", "Missing a letter")]
[InlineData("mousepad", "Missing a space")]
[InlineData("Mouse pad", "Starts with a capital letter")]
public async Task KeywordMapping_DoesNotMatchOnSlightlyMismatchedTerms(string termText, string explanation)
{
	var indexName = "keyword-index";
	await _fixture.PerformActionInTestIndex(
		indexName,
		mappingDescriptor,
		async (uniqueIndexName, opensearchClient) =>
		{
			var productDocuments = new[] {
				new ProductDocument(1, "mouse"),
				new ProductDocument(2, "mouse pad"),
			};

			await _fixture.IndexDocuments(uniqueIndexName, productDocuments);

			var result = await opensearchClient.SearchAsync<ProductDocument>(selector => selector
				   .Index(uniqueIndexName)
				   .Query(queryContainer => queryContainer
					   .Match(term => term
						   .Field(field => field.Name)
						   .Query(termText)
						   )
					   )
			   );

			result.IsValid.Should().BeTrue();
			result.Documents.Should().BeEmpty(explanation);
		}
	);
}

The DebugInformation makes this clear to us:

Valid OpenSearch.Client response built from a successful (200) low level call on POST: /keyword-indexcef9bf85-296c-4d67-ae39-1fbdcf4ce855/_search?pretty=true&error_trace=true&typed_keys=true
# Audit trail of this API call:
 - [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.0831517
# Request:
{"query":{"match":{"name":{"query":"mousepad"}}}}
# Response:
{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

# TCP states:
  Established: 51
  TimeWait: 7
  CloseWait: 1

# ThreadPool statistics:
  Worker: 
    Busy: 1
    Free: 32766
    Min: 12
    Max: 32767
  IOCP: 
    Busy: 0
    Free: 1000
    Min: 12
    Max: 1000

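In the example above our search term was only missing a whitespace character. This was enough to cause a mismatch and return no documents at all, because token matching on a keyword field must be exact. Given that a keyword is indexed as a single token, we must provide the entire name, with identical casing and whitespace, to retrieve a match. Note that this test issues a match query rather than a term query; as we will see in the match query section, the two behave the same way on keyword fields.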
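As a mental model, we can think of keyword matching as a whole-string, case-sensitive comparison. The sketch below runs entirely in memory and does not touch OpenSearch; AnyExactMatch is a hypothetical helper that exists only for this illustration.

```csharp
using System;
using System.Linq;

// The two keyword values indexed by the tests above.
string[] indexedNames = { "mouse", "mouse pad" };

// Hypothetical helper: models keyword matching as an ordinal, whole-string comparison.
bool AnyExactMatch(string searchText) =>
    indexedNames.Any(name => string.Equals(name, searchText, StringComparison.Ordinal));

// One happy case followed by the three unhappy cases from the test.
foreach (var searchText in new[] { "mouse pad", "mous", "mousepad", "Mouse pad" })
{
    Console.WriteLine($"\"{searchText}\" matches: {AnyExactMatch(searchText)}");
}
```

Only the first input matches. The other three each differ from an indexed name by one character, which is enough to fail an ordinal comparison.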

Boolean Filter Query: Removing the Score

We saw above that our raw term query was calculating a score, even though our search was either a match, or not a match. This makes calculating a score feel redundant, as there is no level of subjectivity to the filtering of our data.

Luckily we can remove this component of the search entirely by using a boolean filter query. A bool query by itself would score a resulting document by summing the scores of its matching clauses. However, clauses placed in its filter context are not scored, so wrapping our term query in a filter skips the calculation.

This type of search is useful when a user knows exactly what they want, such as selecting from known categories of products in a catalog.

[Theory]
[InlineData("mouse", "Only the document with name mouse will match")]
[InlineData("mouse pad", "Only the document with name mouse pad will match")]
public async Task KeywordMapping_CanBeFilteredOnWithBooleanQuery(string termText, string explanation)
{
	var indexName = "keyword-index";
	await _fixture.PerformActionInTestIndex(
		indexName,
		mappingDescriptor,
		async (uniqueIndexName, opensearchClient) =>
		{
			var productDocuments = new[] {
				new ProductDocument(1, "mouse"),
				new ProductDocument(2, "mouse pad"),
			};

			await _fixture.IndexDocuments(uniqueIndexName, productDocuments);

			var result = await opensearchClient.SearchAsync<ProductDocument>(selector => selector
				   .Index(uniqueIndexName)
				   .Query(queryContainer => queryContainer
						.Bool(boolQuery => boolQuery
							.Filter(filter => filter
								.Term(term => term
								.Field(field => field.Name)
								.Value(termText)
								))
						   )
					   )
				   .Explain()
			   );

			result.IsValid.Should().BeTrue();
			result.Documents.Should().ContainSingle(doc => string.Equals(doc.Name, termText), explanation);
		}
	);
}

This produces the following DebugInformation:

Valid OpenSearch.Client response built from a successful (200) low level call on POST: /keyword-index3a8694ac-0767-46ae-870f-ec02b341ecad/_search?pretty=true&error_trace=true&typed_keys=true
# Audit trail of this API call:
 - [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.1764957
# Request:
{"explain":true,"query":{"bool":{"filter":[{"term":{"name":{"value":"mouse pad"}}}]}}}
# Response:
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [
      {
        "_shard" : "[keyword-index3a8694ac-0767-46ae-870f-ec02b341ecad][0]",
        "_node" : "l7YV4K5YSFuy_CFfGwt8ig",
        "_index" : "keyword-index3a8694ac-0767-46ae-870f-ec02b341ecad",
        "_id" : "2",
        "_score" : 0.0,
        "_source" : {
          "id" : 2,
          "name" : "mouse pad"
        },
        "_explanation" : {
          "value" : 0.0,
          "description" : "ConstantScore(name:mouse pad)^0.0",
          "details" : [ ]
        }
      }
    ]
  }
}

# TCP states:
  Established: 60
  TimeWait: 13
  FinWait2: 1
  CloseWait: 2

# ThreadPool statistics:
  Worker: 
    Busy: 1
    Free: 32766
    Min: 12
    Max: 32767
  IOCP: 
    Busy: 0
    Free: 1000
    Min: 12
    Max: 1000

Here we can see that we have "_score" : 0.0! This produces a much simpler response for an engineer to debug. There’s no maths to understand score relevance, we just see results that were an exact match. A wise Code Sloth likes to keep complexity to a minimum after all!

It is also possible to apply a constant score to our filter query, which is useful if the filter feeds into a larger query. In this case, we can construct our query as follows:

var result = await opensearchClient.SearchAsync<ProductDocument>(selector => selector
    .Index(uniqueIndexName)
    .Query(queryContainer => queryContainer
        .ConstantScore(constantScore => constantScore
            .Filter(filter => filter
                .Term(term => term
                    .Field(field => field.Name)
                    .Value(termText)
                )
            )
            .Boost(3)
        )
    )
    .Explain()
);

Every matching document will then have a score of 3.
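For comparison with the request bodies logged earlier, the sketch below builds the JSON that a constant score query is expected to take in the query DSL. This is an assumption based on the documented shape of constant_score, not output captured from the client, and it uses System.Text.Json with an anonymous object rather than OpenSearch.Client.

```csharp
using System;
using System.Text.Json;

// Assumed request shape, mirroring the bool filter request logged earlier.
var requestBody = new
{
    explain = true,
    query = new
    {
        constant_score = new
        {
            // The same term clause as before, now in the constant_score filter.
            filter = new { term = new { name = new { value = "mouse pad" } } },
            // Every matching document receives this value as its score.
            boost = 3.0
        }
    }
};

string json = JsonSerializer.Serialize(requestBody);
Console.WriteLine(json);
```

Comparing this against the DebugInformation of a real request is a quick way to check that the fluent query serialises the way we expect.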

Match Query: Query Analysis?

Here’s where things get a little bit confusing. If a term query on a keyword field produces an exact match, what does a match query (whose query is analysed by default) produce for a keyword field?

The analyzer parameter’s description is as follows:

(Optional, string) Analyzer used to convert the text in the query value into tokens. Defaults to the index-time analyzer mapped for the <field>. If no analyzer is mapped, the index’s default analyzer is used.

Analyzer parameter description from elastic.co match query article

This is a little confusing, because it doesn’t seem to be the full story. Follow a maze of links down a rabbit hole and you’ll eventually come across a much more complex sequence of rules that determine the analyser that is used at search time:

At search time, Elasticsearch determines which analyzer to use by checking the following parameters in order:

  1. The analyzer parameter in the search query. See Specify the search analyzer for a query.
  2. The search_analyzer mapping parameter for the field. See Specify the search analyzer for a field.
  3. The analysis.analyzer.default_search index setting. See Specify the default search analyzer for an index.
  4. The analyzer mapping parameter for the field. See Specify the analyzer for a field.

If none of these parameters are specified, the standard analyzer is used.

Taken from: https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html#specify-search-analyzer
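The documented precedence reads like a chain of fallbacks, which can be sketched as follows. ResolveSearchAnalyzer and its parameter names are hypothetical and exist only to illustrate the quoted list; this is not an API of OpenSearch or OpenSearch.Client.

```csharp
using System;

#nullable enable

// Hypothetical helper: returns the first configured analyzer in the documented order.
string ResolveSearchAnalyzer(
    string? queryAnalyzer,              // 1. analyzer parameter in the search query
    string? fieldSearchAnalyzer,        // 2. search_analyzer mapping parameter for the field
    string? indexDefaultSearchAnalyzer, // 3. analysis.analyzer.default_search index setting
    string? fieldAnalyzer) =>           // 4. analyzer mapping parameter for the field
    queryAnalyzer
    ?? fieldSearchAnalyzer
    ?? indexDefaultSearchAnalyzer
    ?? fieldAnalyzer
    ?? "standard";                      // documented fallback when nothing is specified

Console.WriteLine(ResolveSearchAnalyzer(null, null, null, null));              // standard
Console.WriteLine(ResolveSearchAnalyzer(null, "whitespace", null, "english")); // whitespace
```

Keep in mind that this models only the quoted rules. It says nothing yet about how the field's mapping type affects the outcome.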

After I first read these rules, the standard analyzer seemed like the obvious outcome. After all, our next test does not provide an analyser in the given query, nor does it set any analysis on the field or index itself. However, this is not the case …

Our next test will:

  • Index two documents, as per our prior tests
  • Run two test cases that each issue a match query using one of the document names
  • Assert that we retrieve a document whose name exactly matches the given string
  • Confirm that the standard analyzer could not have been used for the query, by analyzing the input string and asserting the tokens produced
[Theory]
[InlineData("mouse", new[] { "mouse" }, "Only the document with name mouse will match")]
[InlineData("mouse pad", new[] { "mouse", "pad" },
	@"If the standard analyzer was run on this text it would produce two tokens: mouse, pad. 
	Neither individual token would exactly match the mouse pad document name, so that document would not be returned. 
	However, OpenSearch identifies that the mapping of the field is not Text and does not apply an analyzer at query time. 
	This default behaviour only applies for text field mappings.")]
public async Task KeywordMapping_ProducesNoQueryTimeAnalysis_ForMatchQuery(string matchText, string[] expectedTokens, string explanation)
{
	var indexName = "keyword-index";
	await _fixture.PerformActionInTestIndex(
		indexName,
		mappingDescriptor,
		async (uniqueIndexName, opensearchClient) =>
		{
			var productDocuments = new[] {
				new ProductDocument(1, "mouse"),
				new ProductDocument(2, "mouse pad"),
			};

			await _fixture.IndexDocuments(uniqueIndexName, productDocuments);

			var result = await opensearchClient.SearchAsync<ProductDocument>(selector => selector
				   .Index(uniqueIndexName)
				   .Query(queryContainer => queryContainer
					   .Match(term => term
						   .Field(field => field.Name)
						   .Query(matchText)
						   )
					   )
				   .Explain()
			   );

			result.IsValid.Should().BeTrue();
			result.Documents.Should().ContainSingle(doc => string.Equals(doc.Name, matchText), explanation);

			// Let's confirm the tokens that WOULD have been generated if we used a match query on a TEXT field mapping
			var analyzeResult = await opensearchClient.Indices.AnalyzeAsync(selector => selector
				.Analyzer("standard")
				.Index(uniqueIndexName)
				.Text(matchText));

			analyzeResult.Tokens.Select(token => token.Token).Should().BeEquivalentTo(expectedTokens);
		}
	);
}

The DebugInformation of the match query is below:

Valid OpenSearch.Client response built from a successful (200) low level call on POST: /keyword-indexb744adeb-fb8a-4a4d-9706-8ac1c63bf041/_search?pretty=true&error_trace=true&typed_keys=true
# Audit trail of this API call:
 - [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.1556200
# Request:
{"explain":true,"query":{"match":{"name":{"query":"mouse pad"}}}}
# Response:
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6931471,
    "hits" : [
      {
        "_shard" : "[keyword-indexb744adeb-fb8a-4a4d-9706-8ac1c63bf041][0]",
        "_node" : "l7YV4K5YSFuy_CFfGwt8ig",
        "_index" : "keyword-indexb744adeb-fb8a-4a4d-9706-8ac1c63bf041",
        "_id" : "2",
        "_score" : 0.6931471,
        "_source" : {
          "id" : 2,
          "name" : "mouse pad"
        },
        "_explanation" : {
          "value" : 0.6931471,
          "description" : "weight(name:mouse pad in 1) [PerFieldSimilarity], result of:",
          "details" : [
            {
              "value" : 0.6931471,
              "description" : "score(freq=1.0), computed as boost * idf * tf from:",
              "details" : [
                {
                  "value" : 2.2,
                  "description" : "boost",
                  "details" : [ ]
                },
                {
                  "value" : 0.6931472,
                  "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details" : [
                    {
                      "value" : 1,
                      "description" : "n, number of documents containing term",
                      "details" : [ ]
                    },
                    {
                      "value" : 2,
                      "description" : "N, total number of documents with field",
                      "details" : [ ]
                    }
                  ]
                },
                {
                  "value" : 0.45454544,
                  "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  "details" : [
                    {
                      "value" : 1.0,
                      "description" : "freq, occurrences of term within document",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.2,
                      "description" : "k1, term saturation parameter",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.75,
                      "description" : "b, length normalization parameter",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "dl, length of field",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "avgdl, average length of field",
                      "details" : [ ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

# TCP states:
  Established: 50
  TimeWait: 28
  CloseWait: 1

# ThreadPool statistics:
  Worker: 
    Busy: 1
    Free: 32766
    Min: 12
    Max: 32767
  IOCP: 
    Busy: 0
    Free: 1000
    Min: 12
    Max: 1000

A dead giveaway that the standard analyzer was not used is the scoring explanation.

          "description" : "weight(name:mouse pad in 1) [PerFieldSimilarity], result of:",

Here we can see the name field was matched with the term mouse pad. This term represents a single token, meaning that the query text was either not analysed at all or was passed through the keyword analyser (which is a no-op).

If we look at the DebugInformation of the AnalyzeAsync request, this will become even clearer.

Valid OpenSearch.Client response built from a successful (200) low level call on POST: /keyword-indexb744adeb-fb8a-4a4d-9706-8ac1c63bf041/_analyze?pretty=true&error_trace=true
# Audit trail of this API call:
 - [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.0232070
# Request:
{"analyzer":"standard","text":["mouse pad"]}
# Response:
{
  "tokens" : [
    {
      "token" : "mouse",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "pad",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

# TCP states:
  Established: 50
  TimeWait: 28
  CloseWait: 1

# ThreadPool statistics:
  Worker: 
    Busy: 1
    Free: 32766
    Min: 12
    Max: 32767
  IOCP: 
    Busy: 0
    Free: 1000
    Min: 12
    Max: 1000

And there we have it. If the standard analyzer had been used on our input string of mouse pad, it would have tokenised the string on the whitespace and produced mouse and pad. Our test asserts this by comparing against an expected list of these tokens. Had that been the case, this test case would have returned the document whose name is mouse instead of the mouse pad document!
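The difference between the two outcomes can be simulated in memory. The sketch below is a deliberate simplification: StandardLikeTokens only lowercases and splits on spaces, which is far less than the real standard analyzer does, and all three helpers are hypothetical names invented for this illustration.

```csharp
using System;
using System.Linq;

// Each keyword value is indexed as a single token.
string[] indexedNames = { "mouse", "mouse pad" };

// Simplified stand-in for the standard analyzer: lowercase, then split on spaces.
string[] StandardLikeTokens(string text) =>
    text.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);

// Keyword behaviour: the whole input is kept as one token.
string[] KeywordTokens(string text) => new[] { text };

// A document matches when any query token equals its single indexed token,
// which mirrors the default OR behaviour of a match query.
string[] Matches(string[] queryTokens) =>
    indexedNames.Where(name => queryTokens.Contains(name, StringComparer.Ordinal)).ToArray();

Console.WriteLine(string.Join(", ", Matches(StandardLikeTokens("mouse pad")))); // mouse
Console.WriteLine(string.Join(", ", Matches(KeywordTokens("mouse pad"))));      // mouse pad
```

The first line shows what an analysed query would have found, and the second shows what our test actually observed.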

Sloth Summary

Today’s article was an exploration of how we can search on an indexed keyword field. We learned that:

  • Keyword searching behaves the same for term and match queries: neither tokenises the given search string (or, equivalently, both use the keyword analyser, which is a no-op)
  • Given that tokens must match exactly, term and match queries on keyword fields must match the indexed string exactly for a document to return in search results
  • Queries on keyword fields will be scored by default
  • We can suppress scoring on our query by using a bool filter query

Don’t forget to head over to the code samples page to continue exploring the content of this article. Also keep an eye out for part two, in which we will cover sorting, scripting and aggregations with keyword fields!
