This article will explore the many ways in which we can query the Keyword field data type using the .NET OpenSearch.Client NuGet package. Please read the Keyword Field Data Type Indexing Deep Dive article prior to completing this tutorial, as it contains useful prerequisite information.
At time of writing, Elastic.co documentation is far richer than OpenSearch, so a combination of links between the two vendors may be provided to reference the concepts discussed. These offerings are currently functionally equivalent.
The Code Sloth public repository contains complete code for the snippets from this article. Check out the OpenSearch section of the code samples page for a GitHub link.
Let’s get started!
Searching on Keyword Fields
The most obvious use case for writing a query is to execute a search request. There are many types of search requests that we can perform on keyword fields, some of which are a bit ambiguous.
Let’s dig into each of these cases after first examining the shape of our document under test:
namespace KeywordDemo.Documents
{
    /// <summary>
    /// A sample document that contains a single keyword field that is explored during multiple tests within the suite
    /// </summary>
    public record ProductDocument
    {
        public ProductDocument(int id, string name)
        {
            Id = id;
            Name = name ?? throw new ArgumentNullException(nameof(name));
        }

        /// <summary>
        /// The Id field of a document is automatically used for the document id at indexing time
        /// </summary>
        public int Id { get; init; }

        /// <summary>
        /// This string property will be mapped as a keyword
        /// Conceptually this property may represent the name of a product
        /// </summary>
        public string Name { get; init; }
    }
}
It is a simple record, with minimal data:
- An integer Id property to represent the document’s id numerically
- A string Name that conceptually represents a product’s name
Keyword search will only produce a match if the given string is identical to a value on a document.
Term Query
Let’s start by observing the happy case for a term query.
The test case below will:
- Index two documents with the names mouse and mouse pad respectively
- Run two test cases that each issue a single term search using one of the above names per case
- Assert that we retrieve a single document whose name matches the given term
  - This is despite the word mouse appearing in both documents, which could see multiple documents returned in the mouse pad test case in an analysed search
[Theory]
[InlineData("mouse", "Only the document with name mouse will match")]
[InlineData("mouse pad", "Only the document with name mouse pad will match")]
public async Task KeywordMapping_ExactlyMatchesWholeTermQuery(string termText, string explanation)
{
    var indexName = "keyword-index";
    await _fixture.PerformActionInTestIndex(
        indexName,
        mappingDescriptor,
        async (uniqueIndexName, opensearchClient) =>
        {
            var productDocuments = new[]
            {
                new ProductDocument(1, "mouse"),
                new ProductDocument(2, "mouse pad"),
            };

            await _fixture.IndexDocuments(uniqueIndexName, productDocuments);

            var result = await opensearchClient.SearchAsync<ProductDocument>(selector => selector
                .Index(uniqueIndexName)
                .Query(queryContainer => queryContainer
                    .Term(term => term
                        .Field(field => field.Name)
                        .Value(termText)
                    )
                )
                .Explain()
            );

            result.IsValid.Should().BeTrue();
            result.Documents.Should().ContainSingle(doc => string.Equals(doc.Name, termText), explanation);
        }
    );
}
This query produces the following DebugInformation in the response object:
Valid OpenSearch.Client response built from a successful (200) low level call on POST: /keyword-index3534a67e-048c-40b4-938b-0a31e2fd9f77/_search?pretty=true&error_trace=true&typed_keys=true
# Audit trail of this API call:
- [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.1988437
# Request:
{"explain":true,"query":{"term":{"name":{"value":"mouse pad"}}}}
# Response:
{
  "took" : 20,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 0.6931471,
    "hits" : [
      {
        "_shard" : "[keyword-index3534a67e-048c-40b4-938b-0a31e2fd9f77][0]",
        "_node" : "l7YV4K5YSFuy_CFfGwt8ig",
        "_index" : "keyword-index3534a67e-048c-40b4-938b-0a31e2fd9f77",
        "_id" : "2",
        "_score" : 0.6931471,
        "_source" : { "id" : 2, "name" : "mouse pad" },
        "_explanation" : {
          "value" : 0.6931471,
          "description" : "weight(name:mouse pad in 1) [PerFieldSimilarity], result of:",
          "details" : [
            {
              "value" : 0.6931471,
              "description" : "score(freq=1.0), computed as boost * idf * tf from:",
              "details" : [
                { "value" : 2.2, "description" : "boost", "details" : [ ] },
                {
                  "value" : 0.6931472,
                  "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details" : [
                    { "value" : 1, "description" : "n, number of documents containing term", "details" : [ ] },
                    { "value" : 2, "description" : "N, total number of documents with field", "details" : [ ] }
                  ]
                },
                {
                  "value" : 0.45454544,
                  "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  "details" : [
                    { "value" : 1.0, "description" : "freq, occurrences of term within document", "details" : [ ] },
                    { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] },
                    { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] },
                    { "value" : 1.0, "description" : "dl, length of field", "details" : [ ] },
                    { "value" : 1.0, "description" : "avgdl, average length of field", "details" : [ ] }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
# TCP states: Established: 44 TimeWait: 68 CloseWait: 1
# ThreadPool statistics: Worker: Busy: 1 Free: 32766 Min: 12 Max: 32767 IOCP: Busy: 0 Free: 1000 Min: 12 Max: 1000
This output confirms that for a term query of value mouse pad, we only find the document whose name is mouse pad. Interestingly, despite the fact that a term query on a keyword field is essentially a boolean (true or false) matching operation, we can also see that the result is scored based on term frequency and the occurrence of the term across indexed documents. We’ll explore how to suppress this shortly.
In the interim, let’s move onto an unhappy case.
Our next test will:
- Index the same two documents
- Run three test cases using product names that each differ slightly from an indexed name
  - The first case is missing the letter e
  - The second case is missing some whitespace
  - The third case starts with a capital letter
- Assert that we retrieve no documents
[Theory]
[InlineData("mous", "Missing a letter")]
[InlineData("mousepad", "Missing a space")]
[InlineData("Mouse pad", "Starts with a capital letter")]
public async Task KeywordMapping_DoesNotMatchOnSlightlyMismatchedTerms(string termText, string explanation)
{
    var indexName = "keyword-index";
    await _fixture.PerformActionInTestIndex(
        indexName,
        mappingDescriptor,
        async (uniqueIndexName, opensearchClient) =>
        {
            var productDocuments = new[]
            {
                new ProductDocument(1, "mouse"),
                new ProductDocument(2, "mouse pad"),
            };

            await _fixture.IndexDocuments(uniqueIndexName, productDocuments);

            var result = await opensearchClient.SearchAsync<ProductDocument>(selector => selector
                .Index(uniqueIndexName)
                .Query(queryContainer => queryContainer
                    .Match(term => term
                        .Field(field => field.Name)
                        .Query(termText)
                    )
                )
            );

            result.IsValid.Should().BeTrue();
            result.Documents.Should().BeEmpty(explanation);
        }
    );
}
The DebugInformation makes this clear to us:
Valid OpenSearch.Client response built from a successful (200) low level call on POST: /keyword-indexcef9bf85-296c-4d67-ae39-1fbdcf4ce855/_search?pretty=true&error_trace=true&typed_keys=true
# Audit trail of this API call:
- [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.0831517
# Request:
{"query":{"match":{"name":{"query":"mousepad"}}}}
# Response:
{
  "took" : 12,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 0, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  }
}
# TCP states: Established: 51 TimeWait: 7 CloseWait: 1
# ThreadPool statistics: Worker: Busy: 1 Free: 32766 Min: 12 Max: 32767 IOCP: Busy: 0 Free: 1000 Min: 12 Max: 1000
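In the example above our search term was only missing a whitespace character. This was enough to cause a mismatch and return no documents at all, because a search token must be identical to an indexed token to match. Given that a keyword is indexed as a single token, we must provide the entire name to retrieve a match. Note that this test issues a match query rather than a term query; as the Match Query section of this article demonstrates, the two behave the same way on a keyword field.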
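To make the exact-match rule concrete, here is a minimal sketch that does not use OpenSearch at all. It models a keyword field as a plain dictionary in which each whole name is a single token, which is a deliberate simplification of a real inverted index. The names keywordIndex and Search are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a keyword field's index: each whole value is one token.
// The ordinal comparer makes lookups case-sensitive and character-exact.
var keywordIndex = new Dictionary<string, List<int>>(StringComparer.Ordinal)
{
    ["mouse"] = new List<int> { 1 },
    ["mouse pad"] = new List<int> { 2 },
};

// Hypothetical helper: returns the ids of documents whose single token equals the term exactly.
List<int> Search(string term) =>
    keywordIndex.TryGetValue(term, out var ids) ? ids : new List<int>();

foreach (var term in new[] { "mouse pad", "mous", "mousepad", "Mouse pad" })
{
    Console.WriteLine($"'{term}' matched {Search(term).Count} document(s)");
}
```

Only the first term finds a document. The other three differ from every stored token by a letter, a space or a capital, so the lookup comes back empty, which mirrors the three failing test cases above.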
Boolean Filter Query: Removing the Score
We saw above that our raw term query was calculating a score, even though our search was either a match, or not a match. This makes calculating a score feel redundant, as there is no level of subjectivity to the filtering of our data.
Luckily we can remove this component of the search entirely by using a boolean filter query. A bool query by itself would score a resulting document by summing the scores of its matching clauses. However, clauses placed in its filter context are not scored, so we can use a filter subquery to skip scoring altogether.
This type of search is useful when a user knows exactly what they want, such as selecting from known categories of products in a catalog.
[Theory]
[InlineData("mouse", "Only the document with name mouse will match")]
[InlineData("mouse pad", "Only the document with name mouse pad will match")]
public async Task KeywordMapping_CanBeFilteredOnWithBooleanQuery(string termText, string explanation)
{
    var indexName = "keyword-index";
    await _fixture.PerformActionInTestIndex(
        indexName,
        mappingDescriptor,
        async (uniqueIndexName, opensearchClient) =>
        {
            var productDocuments = new[]
            {
                new ProductDocument(1, "mouse"),
                new ProductDocument(2, "mouse pad"),
            };

            await _fixture.IndexDocuments(uniqueIndexName, productDocuments);

            var result = await opensearchClient.SearchAsync<ProductDocument>(selector => selector
                .Index(uniqueIndexName)
                .Query(queryContainer => queryContainer
                    .Bool(boolQuery => boolQuery
                        .Filter(filter => filter
                            .Term(term => term
                                .Field(field => field.Name)
                                .Value(termText)
                            ))
                    )
                )
                .Explain()
            );

            result.IsValid.Should().BeTrue();
            result.Documents.Should().ContainSingle(doc => string.Equals(doc.Name, termText), explanation);
        }
    );
}
This produces the following DebugInformation:
Valid OpenSearch.Client response built from a successful (200) low level call on POST: /keyword-index3a8694ac-0767-46ae-870f-ec02b341ecad/_search?pretty=true&error_trace=true&typed_keys=true
# Audit trail of this API call:
- [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.1764957
# Request:
{"explain":true,"query":{"bool":{"filter":[{"term":{"name":{"value":"mouse pad"}}}]}}}
# Response:
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 0.0,
    "hits" : [
      {
        "_shard" : "[keyword-index3a8694ac-0767-46ae-870f-ec02b341ecad][0]",
        "_node" : "l7YV4K5YSFuy_CFfGwt8ig",
        "_index" : "keyword-index3a8694ac-0767-46ae-870f-ec02b341ecad",
        "_id" : "2",
        "_score" : 0.0,
        "_source" : { "id" : 2, "name" : "mouse pad" },
        "_explanation" : {
          "value" : 0.0,
          "description" : "ConstantScore(name:mouse pad)^0.0",
          "details" : [ ]
        }
      }
    ]
  }
}
# TCP states: Established: 60 TimeWait: 13 FinWait2: 1 CloseWait: 2
# ThreadPool statistics: Worker: Busy: 1 Free: 32766 Min: 12 Max: 32767 IOCP: Busy: 0 Free: 1000 Min: 12 Max: 1000
Here we can see that we have "_score" : 0.0! This produces a much simpler response for an engineer to debug. There’s no maths to understand score relevance, we just see results that were an exact match. A wise Code Sloth likes to keep complexity to a minimum after all!
It is also possible to apply a constant score to our filter query, if the filter feeds into a larger query. In this case, we can construct our query as such:
var result = await opensearchClient.SearchAsync<ProductDocument>(selector => selector
    .Index(uniqueIndexName)
    .Query(queryContainer => queryContainer
        // The lambda parameter describes a constant_score query, so it is named accordingly
        .ConstantScore(constantScore => constantScore
            .Filter(filter => filter
                .Term(term => term
                    .Field(field => field.Name)
                    .Value(termText)
                ))
            .Boost(3)
        )
    )
    .Explain()
);
Every matching document will have a score of 3.
Match Query: Query Analysis?
Here’s where things get a little bit confusing. If a term query on a keyword field produces an exact match, what does a match query (whose query is analysed by default) produce for a keyword field?
The analyzer parameter’s description is as follows:
(Optional, string) Analyzer used to convert the text in the query value into tokens. Defaults to the index-time analyzer mapped for the <field>. If no analyzer is mapped, the index’s default analyzer is used.
Analyzer parameter description from elastic.co match query article
This is a little confusing, because it doesn’t seem to be the full story. Follow a maze of links down a rabbit hole and you’ll eventually come across a much more complex sequence of rules that determine the analyser that is used at search time:
At search time, Elasticsearch determines which analyzer to use by checking the following parameters in order:
- The analyzer parameter in the search query. See Specify the search analyzer for a query.
- The search_analyzer mapping parameter for the field. See Specify the search analyzer for a field.
- The analysis.analyzer.default_search index setting. See Specify the default search analyzer for an index.
- The analyzer mapping parameter for the field. See Specify the analyzer for a field.
If none of these parameters are specified, the standard analyzer is used.
Taken from: https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html#specify-search-analyzer
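The quoted rules amount to a "first one that is set wins" lookup. The sketch below expresses that order as a chain of null-coalescing operators. ResolveSearchAnalyzer and its parameter names are hypothetical and are not part of OpenSearch.Client; the function models only the documented order of checks, not what the server actually does for every field type.

```csharp
#nullable enable
using System;

// Hypothetical helper that mirrors the documented order of checks.
// Each argument is null when the corresponding setting has not been specified.
static string ResolveSearchAnalyzer(
    string? queryAnalyzer,              // "analyzer" parameter in the search query
    string? fieldSearchAnalyzer,        // "search_analyzer" mapping parameter for the field
    string? indexDefaultSearchAnalyzer, // "analysis.analyzer.default_search" index setting
    string? fieldAnalyzer) =>           // "analyzer" mapping parameter for the field
    queryAnalyzer
    ?? fieldSearchAnalyzer
    ?? indexDefaultSearchAnalyzer
    ?? fieldAnalyzer
    ?? "standard";                      // documented fallback when nothing is specified

// Nothing specified anywhere, as in the tests in this article.
Console.WriteLine(ResolveSearchAnalyzer(null, null, null, null));

// A field-level search_analyzer takes precedence over the field's analyzer.
Console.WriteLine(ResolveSearchAnalyzer(null, "whitespace", null, "english"));
```

Read literally, these rules resolve to the standard analyzer when nothing is configured, which is the outcome the quoted documentation suggests for a query that specifies no analysis at all.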
After I first read these rules, the standard analyzer seemed like the obvious outcome. After all, our next test does not provide an analyser in the given query, nor does it set any analysis on the field or index itself. However, this is not the case …
Our next test will:
- Index two documents, as per our prior tests
- Run two test cases that each issue a match query using one of the document names
- Assert that we retrieve a document whose name exactly matches the given string
- Confirm that the standard analyzer could not have been used for the query, by analyzing the input string and asserting the tokens produced
[Theory]
[InlineData("mouse", new[] { "mouse" }, "Only the document with name mouse will match")]
[InlineData("mouse pad", new[] { "mouse", "pad" },
    @"If the standard analyzer was run on this text it would produce two tokens: mouse, pad.
    Neither individual token would exactly match the mouse pad document name, so the mouse pad document would not be returned.
    However, OpenSearch identifies that the mapping of the field is not Text and does not apply an analyzer at query time.
    This default behaviour only applies for text field mappings.")]
public async Task KeywordMapping_ProducesNoQueryTimeAnlaysis_ForMatchQuery(string matchText, string[] expectedTokens, string explanation)
{
    var indexName = "keyword-index";
    await _fixture.PerformActionInTestIndex(
        indexName,
        mappingDescriptor,
        async (uniqueIndexName, opensearchClient) =>
        {
            var productDocuments = new[]
            {
                new ProductDocument(1, "mouse"),
                new ProductDocument(2, "mouse pad"),
            };

            await _fixture.IndexDocuments(uniqueIndexName, productDocuments);

            var result = await opensearchClient.SearchAsync<ProductDocument>(selector => selector
                .Index(uniqueIndexName)
                .Query(queryContainer => queryContainer
                    .Match(term => term
                        .Field(field => field.Name)
                        .Query(matchText)
                    )
                )
                .Explain()
            );

            result.IsValid.Should().BeTrue();
            result.Documents.Should().ContainSingle(doc => string.Equals(doc.Name, matchText), explanation);

            // Let's confirm the tokens that WOULD have been generated if we used a match query on a TEXT field mapping
            var analyzeResult = await opensearchClient.Indices.AnalyzeAsync(selector => selector
                .Analyzer("standard")
                .Index(uniqueIndexName)
                .Text(matchText));

            analyzeResult.Tokens.Select(token => token.Token).Should().BeEquivalentTo(expectedTokens);
        }
    );
}
The DebugInformation of the match query is below:
Valid OpenSearch.Client response built from a successful (200) low level call on POST: /keyword-indexb744adeb-fb8a-4a4d-9706-8ac1c63bf041/_search?pretty=true&error_trace=true&typed_keys=true
# Audit trail of this API call:
- [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.1556200
# Request:
{"explain":true,"query":{"match":{"name":{"query":"mouse pad"}}}}
# Response:
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 0.6931471,
    "hits" : [
      {
        "_shard" : "[keyword-indexb744adeb-fb8a-4a4d-9706-8ac1c63bf041][0]",
        "_node" : "l7YV4K5YSFuy_CFfGwt8ig",
        "_index" : "keyword-indexb744adeb-fb8a-4a4d-9706-8ac1c63bf041",
        "_id" : "2",
        "_score" : 0.6931471,
        "_source" : { "id" : 2, "name" : "mouse pad" },
        "_explanation" : {
          "value" : 0.6931471,
          "description" : "weight(name:mouse pad in 1) [PerFieldSimilarity], result of:",
          "details" : [
            {
              "value" : 0.6931471,
              "description" : "score(freq=1.0), computed as boost * idf * tf from:",
              "details" : [
                { "value" : 2.2, "description" : "boost", "details" : [ ] },
                {
                  "value" : 0.6931472,
                  "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details" : [
                    { "value" : 1, "description" : "n, number of documents containing term", "details" : [ ] },
                    { "value" : 2, "description" : "N, total number of documents with field", "details" : [ ] }
                  ]
                },
                {
                  "value" : 0.45454544,
                  "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  "details" : [
                    { "value" : 1.0, "description" : "freq, occurrences of term within document", "details" : [ ] },
                    { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] },
                    { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] },
                    { "value" : 1.0, "description" : "dl, length of field", "details" : [ ] },
                    { "value" : 1.0, "description" : "avgdl, average length of field", "details" : [ ] }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
# TCP states: Established: 50 TimeWait: 28 CloseWait: 1
# ThreadPool statistics: Worker: Busy: 1 Free: 32766 Min: 12 Max: 32767 IOCP: Busy: 0 Free: 1000 Min: 12 Max: 1000
A dead giveaway that the standard analyzer was not used is the scoring explanation:
"description" : "weight(name:mouse pad in 1) [PerFieldSimilarity], result of:",
Here we can see the name field was matched with the term mouse pad. This term represents a single token, meaning that the keyword analyser (which is a noop) must have been used.
If we look at the DebugInformation of the AnalyzeAsync request, this will become even clearer.
Valid OpenSearch.Client response built from a successful (200) low level call on POST: /keyword-indexb744adeb-fb8a-4a4d-9706-8ac1c63bf041/_analyze?pretty=true&error_trace=true
# Audit trail of this API call:
- [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.0232070
# Request:
{"analyzer":"standard","text":["mouse pad"]}
# Response:
{
  "tokens" : [
    { "token" : "mouse", "start_offset" : 0, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "pad", "start_offset" : 6, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 1 }
  ]
}
# TCP states: Established: 50 TimeWait: 28 CloseWait: 1
# ThreadPool statistics: Worker: Busy: 1 Free: 32766 Min: 12 Max: 32767 IOCP: Busy: 0 Free: 1000 Min: 12 Max: 1000
And there we have it. If the standard analyzer had been used on our input string of mouse pad, it would have tokenised the string on the whitespace and produced mouse and pad. Our test asserts this by comparing against an expected list of these tokens. Had those tokens been used for the search, this test case would have returned the document whose name is mouse instead of the mouse pad document!
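The difference between the two outcomes can be simulated without a cluster. In the sketch below, SimplifiedStandardTokens is a rough stand-in for the standard analyzer that only lowercases and splits on spaces (the real analyzer does considerably more), and KeywordTokens stands in for the noop keyword analyser. Both helpers, and the idea of matching when any query token equals an indexed token, are simplifying assumptions made for this illustration.

```csharp
using System;
using System.Linq;

// Rough stand-in for the standard analyzer: lowercase, then split on spaces.
static string[] SimplifiedStandardTokens(string text) =>
    text.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);

// Stand-in for the keyword (noop) analyser: the whole input is a single token.
static string[] KeywordTokens(string text) => new[] { text };

// Returns the indexed names that equal at least one of the query tokens.
static string[] Matches(string[] indexedNames, string[] queryTokens) =>
    indexedNames.Where(name => queryTokens.Any(token => token == name)).ToArray();

// Each keyword document contributes exactly one token: its whole name.
var indexedNames = new[] { "mouse", "mouse pad" };

var standardMatches = Matches(indexedNames, SimplifiedStandardTokens("mouse pad"));
var keywordMatches = Matches(indexedNames, KeywordTokens("mouse pad"));

Console.WriteLine($"standard-like query tokens match: {string.Join(", ", standardMatches)}");
Console.WriteLine($"keyword query token matches:      {string.Join(", ", keywordMatches)}");
```

Under these assumptions the standard-like tokens find only the mouse document, while the single keyword token finds only the mouse pad document, which is the result the test actually observed.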
Sloth Summary
Today’s article was an exploration of how we can search on an indexed keyword field. We learned that:
- Keyword searching behaves the same for term and match queries, neither of which tokenise the given search string (that, or they use the keyword analyser, which is a noop)
- Given that tokens must match exactly, term and match queries on keyword fields must match the indexed string exactly for a document to return in search results
- Queries on keyword fields will be scored by default
- We can suppress scoring on our query by using a bool filter query
Don’t forget to head over to the code samples page to continue exploring the content of this article. Also keep an eye out for part two, in which we will cover sorting, scripting and aggregations with keyword fields!