This article will explore the OpenSearch Text field data type using samples from the CodeSloth GitHub code sample repository. These examples consume the .NET OpenSearch client to interact with an AWS OpenSearch cluster running in Docker on the local machine. We’ll first dive into defining Text mappings and then look at the tokens they produce when data is indexed into them.
At the time of writing, Elastic.co documentation is far richer than that published by OpenSearch. Therefore, a combination of links to both vendors’ documentation may be provided to reference the concepts discussed. The two offerings are currently functionally equivalent for our purposes, so the names Elasticsearch and OpenSearch may be used interchangeably.
What is an OpenSearch Text Field?
The OpenSearch Text field data type is the foundation of full-text search. Unlike the Keyword field data type, which focuses on exact-match searching, the Text type can be configured with many possible analysers, tokenisers, token filters and character filters.
These configurations can be combined in diverse ways, producing different full-text search experiences. For this reason, they can become very complex very quickly and even more difficult to reason about. This article will focus specifically on the standard analyzer, which is the default analyzer used for Text mappings by OpenSearch.
How is a Text Field Mapped in .Net?
Mapping a Text field with the default Standard Analyzer is straightforward.
The Standard Analyzer provides grammar-based tokenisation, such as the removal of special characters, the creation of tokens from whole words and lowercase normalisation (i.e. all tokens that are produced will be lowercase).
Let’s start by taking a look at the document that will produce our mappings.
/// <summary>
/// A sample document that contains a single text field that is explored during multiple tests within the suite
/// </summary>
public record ProductDocument
{
    public ProductDocument(int id, string description)
    {
        Id = id;
        Description = description ?? throw new ArgumentNullException(nameof(description));
    }

    /// <summary>
    /// The Id field of a document is automatically used for the document id at indexing time
    /// </summary>
    public int Id { get; init; }

    /// <summary>
    /// The string property of this document will be mapped as Text
    /// Conceptually this property could represent a description of a product
    /// </summary>
    public string Description { get; init; }
}
An OpenSearch document can be defined with a user-defined data type, such as a record or class. This example contains an integer Id and a string property called Description.
Next, we use this type to define our document mapping in a strongly typed way.
Func<TypeMappingDescriptor<ProductDocument>, ITypeMapping> mappingDescriptor = mapping => mapping
    .Properties<ProductDocument>(propertyDescriptor => propertyDescriptor
        .Text(word => word.Name(name => name.Description))
    );
The mapping function above defines a Text mapping. This looks similar to keyword mapping, except we use the Text method when passing through the property name to map. The linked keyword mapping article provides some detail on the expression tree structure for defining a mapping.
In defining the Text mapping, we refer to our record by its type name, define properties for the type, and specify the Description property of our type when binding the Text mapping.
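The standard analyzer can also be named explicitly on the mapping if you prefer the configuration to be visible. The snippet below is a minimal sketch of this (the explicitMappingDescriptor name is just for illustration and is not part of the sample repository); it should behave the same as the default mapping above, because "standard" is what OpenSearch falls back to anyway.

```csharp
// A minimal sketch, assuming the same ProductDocument type as above.
// Explicitly naming the standard analyzer should be equivalent to the default mapping.
Func<TypeMappingDescriptor<ProductDocument>, ITypeMapping> explicitMappingDescriptor = mapping => mapping
    .Properties<ProductDocument>(propertyDescriptor => propertyDescriptor
        .Text(word => word
            .Name(name => name.Description)
            .Analyzer("standard")
        )
    );
```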
Inspecting the OpenSearch Text Mapping Using CodeSloth Samples
Using Visual Studio, launch the codesloth-opensearch-samples solution file.
Then, run an OpenSearch cluster. You can do this by setting the docker-compose project in the solution as the startup project and clicking the OpenSearch play button. Alternatively, you can try running a local OpenSearch cluster in Docker outside of Visual Studio.
Return to Visual Studio and set a breakpoint in the file OpenSearchTestIndex.cs, where the indexCreationResult is stored. It looks like the following.
var indexCreationResult = await OpenSearchClient.Indices.CreateAsync(Name, descriptor => createIndexDescriptor);
Find and debug the TextMapping_IndexesUsingStandardTokensiserForGivenString test (any of the theory cases will do). Inspect the DebugInformation property of the variable. This contains the raw HTTP request sent to the cluster to put the mappings.
The following debug output contains the HTTP request that was issued to set the mappings on the index:
Valid OpenSearch.Client response built from a successful (200) low level call on PUT: /ca221c0c-da0d-4e90-8351-8ecea08456c2?pretty=true&error_trace=true

# Audit trail of this API call:
 - [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.2611848

# Request:
{"mappings":{"properties":{"description":{"type":"text"}}}}

# Response:
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "ca221c0c-da0d-4e90-8351-8ecea08456c2"
}

# TCP states:
  Established: 104
  CloseWait: 2
  FinWait2: 1
  TimeWait: 10

# ThreadPool statistics:
  Worker:
    Busy: 1
    Free: 32766
    Min: 20
    Max: 32767
  IOCP:
    Busy: 0
    Free: 1000
    Min: 20
    Max: 1000
A PUT request is made to the local cluster running on localhost:9200, specifying the index name (in this case a random GUID, ca221c0c-da0d-4e90-8351-8ecea08456c2) in the path. The body of the request then specifies the mappings:
{ "mappings": { "properties": { "description": { "type": "text" } } } }
Here, we can see that our record’s Description property has been used to name the field for the Text mapping. In creating this mapping, the field name has been made lowercase. Pretty straightforward!
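The lowercasing comes from the client’s default field name inference, which camel-cases .NET property names during serialisation. If different field names are ever needed, the connection settings expose a hook for this. The following is a hedged sketch assuming a client constructed directly in code; the sample repository’s fixture may configure its client differently.

```csharp
// A sketch of overriding the client's field name inference.
// By default property names are camel cased (Description -> description);
// this example keeps the .NET property names verbatim instead.
var connectionSettings = new ConnectionSettings(new Uri("http://localhost:9200"))
    .DefaultFieldNameInferrer(propertyName => propertyName);

var openSearchClient = new OpenSearchClient(connectionSettings);
```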
To fetch the stored mappings from the index, set a breakpoint after the call to GetMappingAsync in the TextMapping_IndexesUsingStandardTokensiserForGivenString method.
public async Task TextMapping_IndexesUsingStandardTokensiserForGivenString(string description, string[] expectedTokensAndFrequencies, string explanation)
{
    await using var testIndex = await _fixture.CreateTestIndex(mappingDescriptor);

    var mappingRequest = new GetMappingRequest();
    var mappingResult = await _fixture.OpenSearchClient.Indices.GetMappingAsync(mappingRequest);
    ...
Debug the test and inspect the DebugInformation of the mappingResult object.
This will contain the mapping information, as below.
"mappings": { "properties": { "description": { "type": "text" } } }
Well… For something that can achieve so much, this is a little underwhelming. It doesn’t even specify the standard analyzer in there!
This mapping by itself does nothing. It simply describes the set of operations that OpenSearch will perform when data is indexed into that field on any given document. These operations produce terms (or tokens) that are then used for full-text searching.
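If you want to preview those operations without indexing a document, the Analyze API can be called directly. The snippet below is a small sketch assuming the same OpenSearchClient instance used by the samples; it is not part of the test suite itself.

```csharp
// A sketch: ask the cluster what tokens the standard analyzer would produce
// for a given string, without indexing anything.
var analyzeResult = await OpenSearchClient.Indices.AnalyzeAsync(analyze => analyze
    .Analyzer("standard")
    .Text("This is a sentence! It contains some, really bad. Grammar; sentence")
);

foreach (var token in analyzeResult.Tokens)
{
    // Each token exposes its term, position and character offsets, mirroring
    // the term vector output explored below.
    Console.WriteLine($"{token.Token} position:{token.Position} offsets:{token.StartOffset}-{token.EndOffset}");
}
```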
OpenSearch Full Text Search Terms
Continue debugging to the end of the test method, and you will see the terms generated for the given Theory case. We use the Term Vectors API to describe the terms produced for a given field.
The raw response to this query can be seen further below. However, you can see that the test code parses the result and creates a formatted string of the important information.
var result = await _fixture.OpenSearchClient.TermVectorsAsync<ProductDocument>(selector => selector
    .Index(testIndex.Name)
    .Document(productDocument)
);

result.IsValid.Should().BeTrue();

// Each token is parsed from the response, against the number of times it appeared in the given string
var tokensAndFrequency = result.TermVectors.Values.SelectMany(value => value.Terms.Select(term => $"{term.Key}:{term.Value.TermFrequency}"));
tokensAndFrequency.Should().BeEquivalentTo(expectedTokensAndFrequencies, options => options.WithStrictOrdering(), explanation);
This allows us to specify our expected terms in a readable way within the Theory data and assert them as the final step of the test.
Let’s debug the third xUnit test case. This demonstrates several of the standard analyser’s operations, as the given description contains plenty of punctuation and mixed casing.
[InlineData(
    "This is a sentence! It contains some, really bad. Grammar; sentence", // Input
    new[] { "a:1", "bad:1", "contains:1", "grammar:1", "is:1", "it:1", "really:1", "sentence:2", "some:1", "this:1" }, // Expected tokens
    "Grammar is removed and whole words are stored as tokens, lowercase normalised" // Explanation
)]
The raw response from our term vector query for this string returns the following.
Valid OpenSearch.Client response built from a successful (200) low level call on POST: /7a1f8562-3264-4466-aabd-6305e52ae08e/_termvectors?pretty=true&error_trace=true

# Audit trail of this API call:
 - [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.0223444

# Request:
{"doc":{"id":1,"description":"This is a sentence! It contains some, really bad. Grammar; sentence"}}

# Response:
{
  "_index" : "7a1f8562-3264-4466-aabd-6305e52ae08e",
  "_version" : 0,
  "found" : true,
  "took" : 6,
  "term_vectors" : {
    "description" : {
      "field_statistics" : { "sum_doc_freq" : 10, "doc_count" : 1, "sum_ttf" : 11 },
      "terms" : {
        "a" : { "term_freq" : 1, "tokens" : [ { "position" : 2, "start_offset" : 8, "end_offset" : 9 } ] },
        "bad" : { "term_freq" : 1, "tokens" : [ { "position" : 8, "start_offset" : 45, "end_offset" : 48 } ] },
        "contains" : { "term_freq" : 1, "tokens" : [ { "position" : 5, "start_offset" : 23, "end_offset" : 31 } ] },
        "grammar" : { "term_freq" : 1, "tokens" : [ { "position" : 9, "start_offset" : 50, "end_offset" : 57 } ] },
        "is" : { "term_freq" : 1, "tokens" : [ { "position" : 1, "start_offset" : 5, "end_offset" : 7 } ] },
        "it" : { "term_freq" : 1, "tokens" : [ { "position" : 4, "start_offset" : 20, "end_offset" : 22 } ] },
        "really" : { "term_freq" : 1, "tokens" : [ { "position" : 7, "start_offset" : 38, "end_offset" : 44 } ] },
        "sentence" : { "term_freq" : 2, "tokens" : [ { "position" : 3, "start_offset" : 10, "end_offset" : 18 }, { "position" : 10, "start_offset" : 59, "end_offset" : 67 } ] },
        "some" : { "term_freq" : 1, "tokens" : [ { "position" : 6, "start_offset" : 32, "end_offset" : 36 } ] },
        "this" : { "term_freq" : 1, "tokens" : [ { "position" : 0, "start_offset" : 0, "end_offset" : 4 } ] }
      }
    }
  }
}

# TCP states:
  Established: 85
  TimeWait: 2

# ThreadPool statistics:
  Worker:
    Busy: 1
    Free: 32766
    Min: 20
    Max: 32767
  IOCP:
    Busy: 0
    Free: 1000
    Min: 20
    Max: 1000
Each term contains:
- The frequency of the term
- The zero-based index of the token, called the position
- The zero-based character offsets of the start and end of the token within the original string, called start_offset and end_offset
For example, the first word in the string, This, has:
"this" : { "term_freq" : 1, "tokens" : [ { "position" : 0, "start_offset" : 0, "end_offset" : 4 } ] }
It appears once in the string. It is the first token, at position zero. It starts at character offset zero and ends at offset 4, immediately after its last character.
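In other words, the offsets index into the original string before any analysis, so the original surface form of a token can be recovered from them. A tiny sketch, using the description string from the Theory case:

```csharp
var description = "This is a sentence! It contains some, really bad. Grammar; sentence";

// start_offset is inclusive and end_offset is exclusive, so the token's length
// is end_offset - start_offset. The recovered text is the pre-lowercased form.
var surfaceForm = description.Substring(0, 4 - 0); // "This", stored as the term "this"
```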
We can also observe that:
- Grammar and punctuation were not tokenised
- Each token is lowercase
To summarise how the terms generated from a text mapping can be searched against:
| | Keyword mapping | Text Mapping |
| --- | --- | --- |
| Term query | Exact match search | Will only match documents that produce a single token. This means strings that contain spaces will not match anything, as the standard analyzer will create separate searchable terms for each word that is delimited by spaces. |
| Match query | Exact match search. The search query’s string is not tokenised. | Full-text search |
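To make the distinction concrete, here is a hedged sketch of both query types against the Description field. The indexName variable and the exact query shapes are illustrative assumptions rather than part of the sample repository.

```csharp
// A term query is not analysed, so it only matches when the supplied value
// equals a stored token exactly (lowercase, no punctuation).
var termResult = await OpenSearchClient.SearchAsync<ProductDocument>(search => search
    .Index(indexName)
    .Query(query => query
        .Term(term => term
            .Field(document => document.Description)
            .Value("sentence")
        )
    )
);

// A match query analyses its input with the field's analyzer (the standard
// analyzer here), so multi-word, punctuated input can still match stored tokens.
var matchResult = await OpenSearchClient.SearchAsync<ProductDocument>(search => search
    .Index(indexName)
    .Query(query => query
        .Match(match => match
            .Field(document => document.Description)
            .Query("Really bad grammar!")
        )
    )
);
```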
Sloth Summary
The OpenSearch Text mapping, in its basic form, is very simple to define. OpenSearch will, by default, apply the standard analyzer when indexing a document’s string field against this mapping.
The standard analyzer will:
- Create tokens from whole words
- Remove grammar and punctuation
- Lowercase normalise the given string
Now that you know how to create a Text mapping and the basics of the standard analyzer, you can start writing Match Queries to perform a full-text search. You can also begin exploring other analyzers or writing your own altogether!
More on these in the coming posts.
Happy searching!