This article will explore the OpenSearch Text field data type using samples from the CodeSloth GitHub code sample repository. These examples consume the .NET OpenSearch client to interact with an AWS OpenSearch cluster running in Docker on the local machine. We’ll first dive into defining Text mappings and then look at the tokens they produce when data is indexed into them.
At the time of writing, Elastic.co documentation is far richer than that published by OpenSearch. Therefore, a combination of links to both vendors’ documentation may be provided to reference the concepts discussed. The two offerings are currently functionally equivalent for our purposes, so the names Elasticsearch and OpenSearch may be used interchangeably.
What is an OpenSearch Text Field?
The OpenSearch Text field data type is the foundation of full-text search. Unlike the Keyword field data type, which focuses on exact-match searching, the Text type can be configured with many possible analysers, tokenisers, token filters and character filters.
These configurations can be combined in diverse ways, producing different full-text search experiences. For this reason, they can become very complex very quickly and even more difficult to reason about. This article will focus specifically on the standard analyzer, which is the default analyzer used for Text mappings by OpenSearch.
How is a Text Field Mapped in .Net?
Mapping a Text field with the default Standard Analyzer is straightforward.
The Standard Analyzer provides grammar-based tokenisation, such as the removal of special characters, the creation of tokens from whole words and lowercase normalisation (i.e. all tokens that are produced will be lowercase).
Let’s start by taking a look at the document that will produce our mappings.
/// <summary>
/// A sample document that contains a single text field that is explored during multiple tests within the suite
/// </summary>
public record ProductDocument
{
    public ProductDocument(int id, string description)
    {
        Id = id;
        Description = description ?? throw new ArgumentNullException(nameof(description));
    }

    /// <summary>
    /// The Id field of a document is automatically used for the document id at indexing time
    /// </summary>
    public int Id { get; init; }

    /// <summary>
    /// The string property of this document will be mapped as Text
    /// Conceptually this property could represent a description of a product
    /// </summary>
    public string Description { get; init; }
}
An OpenSearch document can be defined with a user-defined data type, such as a record or class. This example contains an integer Id and a string property called Description.
Next, we use this type to define our document mapping in a strongly typed way.
Func<TypeMappingDescriptor<ProductDocument>, ITypeMapping> mappingDescriptor = mapping => mapping
    .Properties<ProductDocument>(propertyDescriptor => propertyDescriptor
        .Text(word => word.Name(name => name.Description))
    );
The mapping function above defines a Text mapping. This looks similar to keyword mapping, except we use the Text method when passing through the property name to map. The linked keyword mapping article provides some detail on the expression tree structure for defining a mapping.
In defining the Text mapping, we refer to our record by its type name, define properties for the type, and specify the Description property of our type when binding the Text mapping.
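The standard analyzer can also be named explicitly on the mapping if you prefer the configuration to be visible. The snippet below is a minimal sketch of this (the explicitMappingDescriptor name is just for illustration and is not part of the sample repository); it should behave the same as the default mapping above, because "standard" is what OpenSearch falls back to anyway.

```csharp
// A minimal sketch, assuming the same ProductDocument type as above.
// Explicitly naming the standard analyzer should be equivalent to the default mapping.
Func<TypeMappingDescriptor<ProductDocument>, ITypeMapping> explicitMappingDescriptor = mapping => mapping
    .Properties<ProductDocument>(propertyDescriptor => propertyDescriptor
        .Text(word => word
            .Name(name => name.Description)
            .Analyzer("standard")
        )
    );
```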
Inspecting the OpenSearch Text Mapping Using CodeSloth Samples
Using Visual Studio, launch the codesloth-opensearch-samples solution file.
Then, run an OpenSearch cluster. You can do this by setting the docker-compose project in the solution as the startup project and clicking the OpenSearch play button. Alternatively, you can try running a local OpenSearch cluster in Docker outside of Visual Studio.
Return to Visual Studio and set a breakpoint in the file OpenSearchTestIndex.cs, where the indexCreationResult is stored. It looks like the following.
var indexCreationResult = await OpenSearchClient.Indices.CreateAsync(Name, descriptor => createIndexDescriptor);
Find and debug the TextMapping_IndexesUsingStandardTokensiserForGivenString test (any of the theory cases will do). Inspect the DebugInformation property of the variable. This contains the raw HTTP request sent to the cluster to put the mappings.
The following debug output contains the HTTP request that was issued to set the mappings on the index:
Valid OpenSearch.Client response built from a successful (200) low level call on PUT: /ca221c0c-da0d-4e90-8351-8ecea08456c2?pretty=true&error_trace=true

# Audit trail of this API call:
 - [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.2611848

# Request:
{"mappings":{"properties":{"description":{"type":"text"}}}}

# Response:
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "ca221c0c-da0d-4e90-8351-8ecea08456c2"
}

# TCP states:
  Established: 104
  CloseWait: 2
  FinWait2: 1
  TimeWait: 10

# ThreadPool statistics:
  Worker:
    Busy: 1
    Free: 32766
    Min: 20
    Max: 32767
  IOCP:
    Busy: 0
    Free: 1000
    Min: 20
    Max: 1000
A PUT request is made to the local cluster running on localhost:9200, specifying the index name (in this case a random GUID, ca221c0c-da0d-4e90-8351-8ecea08456c2) in the path. The body of the request then specifies the mappings:
{ "mappings": { "properties": { "description": { "type": "text" } } } }
Here, we can see that our record’s Description property has been used to name the field for the Text mapping. In creating this mapping, the field name has been made lowercase. Pretty straightforward!
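The lowercasing comes from the client’s default field name inference, which camel-cases .NET property names during serialisation. If different field names are ever needed, the connection settings expose a hook for this. The following is a hedged sketch assuming a client constructed directly in code; the sample repository’s fixture may configure its client differently.

```csharp
// A sketch of overriding the client's field name inference.
// By default property names are camel cased (Description -> description);
// this example keeps the .NET property names verbatim instead.
var connectionSettings = new ConnectionSettings(new Uri("http://localhost:9200"))
    .DefaultFieldNameInferrer(propertyName => propertyName);

var openSearchClient = new OpenSearchClient(connectionSettings);
```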
To fetch the stored mappings from the index, set a breakpoint after the call to GetMappingAsync in the TextMapping_IndexesUsingStandardTokensiserForGivenString method.
public async Task TextMapping_IndexesUsingStandardTokensiserForGivenString(string description, string[] expectedTokensAndFrequencies, string explanation)
{
    await using var testIndex = await _fixture.CreateTestIndex(mappingDescriptor);

    var mappingRequest = new GetMappingRequest();
    var mappingResult = await _fixture.OpenSearchClient.Indices.GetMappingAsync(mappingRequest);
    ...
Debug the test and inspect the DebugInformation of the mappingResult object.
This will contain the mapping information, as below.
"mappings": { "properties": { "description": { "type": "text" } } }
Well… For something that can achieve so much, this is a little underwhelming. It doesn’t even specify the standard analyzer in there!
This mapping by itself does nothing. It simply describes the set of operations that OpenSearch will perform when data is indexed into that field on any given document. These operations produce terms (or tokens) that are then used for full-text searching.
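If you want to preview those operations without indexing a document, the Analyze API can be called directly. The snippet below is a small sketch assuming the same OpenSearchClient instance used by the samples; it is not part of the test suite itself.

```csharp
// A sketch: ask the cluster what tokens the standard analyzer would produce
// for a given string, without indexing anything.
var analyzeResult = await OpenSearchClient.Indices.AnalyzeAsync(analyze => analyze
    .Analyzer("standard")
    .Text("This is a sentence! It contains some, really bad. Grammar; sentence")
);

foreach (var token in analyzeResult.Tokens)
{
    // Each token exposes its term, position and character offsets, mirroring
    // the term vector output explored below.
    Console.WriteLine($"{token.Token} position:{token.Position} offsets:{token.StartOffset}-{token.EndOffset}");
}
```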
OpenSearch Full Text Search Terms
Continue debugging to the end of the test method, and you will see the terms generated for the given Theory case. We use the Term Vectors API to describe the terms produced for a given field.
The raw response to this query can be seen further below. However, you can see that the test code parses the result and creates a formatted string of the important information.
var result = await _fixture.OpenSearchClient.TermVectorsAsync<ProductDocument>(selector => selector
    .Index(testIndex.Name)
    .Document(productDocument)
);

result.IsValid.Should().BeTrue();

// Each token is parsed from the response, against the number of times it appeared in the given string
var tokensAndFrequency = result.TermVectors.Values.SelectMany(value => value.Terms.Select(term => $"{term.Key}:{term.Value.TermFrequency}"));
tokensAndFrequency.Should().BeEquivalentTo(expectedTokensAndFrequencies, options => options.WithStrictOrdering(), explanation);
This allows us to specify our expected terms in a readable way within the Theory data and assert them as the final step of the test.
Let’s debug the third xUnit test case. This demonstrates several of the standard analyser’s operations, as the given description contains plenty of punctuation and mixed casing.
[InlineData(
    "This is a sentence! It contains some, really bad. Grammar; sentence", // Input
    new[] { "a:1", "bad:1", "contains:1", "grammar:1", "is:1", "it:1", "really:1", "sentence:2", "some:1", "this:1" }, // Expected tokens
    "Grammar is removed and whole words are stored as tokens, lowercase normalised" // Explanation
)]
The raw response from our term vector query for this string returns the following.
Valid OpenSearch.Client response built from a successful (200) low level call on POST: /7a1f8562-3264-4466-aabd-6305e52ae08e/_termvectors?pretty=true&error_trace=true

# Audit trail of this API call:
 - [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.0223444

# Request:
{"doc":{"id":1,"description":"This is a sentence! It contains some, really bad. Grammar; sentence"}}

# Response:
{
  "_index" : "7a1f8562-3264-4466-aabd-6305e52ae08e",
  "_version" : 0,
  "found" : true,
  "took" : 6,
  "term_vectors" : {
    "description" : {
      "field_statistics" : { "sum_doc_freq" : 10, "doc_count" : 1, "sum_ttf" : 11 },
      "terms" : {
        "a" : { "term_freq" : 1, "tokens" : [ { "position" : 2, "start_offset" : 8, "end_offset" : 9 } ] },
        "bad" : { "term_freq" : 1, "tokens" : [ { "position" : 8, "start_offset" : 45, "end_offset" : 48 } ] },
        "contains" : { "term_freq" : 1, "tokens" : [ { "position" : 5, "start_offset" : 23, "end_offset" : 31 } ] },
        "grammar" : { "term_freq" : 1, "tokens" : [ { "position" : 9, "start_offset" : 50, "end_offset" : 57 } ] },
        "is" : { "term_freq" : 1, "tokens" : [ { "position" : 1, "start_offset" : 5, "end_offset" : 7 } ] },
        "it" : { "term_freq" : 1, "tokens" : [ { "position" : 4, "start_offset" : 20, "end_offset" : 22 } ] },
        "really" : { "term_freq" : 1, "tokens" : [ { "position" : 7, "start_offset" : 38, "end_offset" : 44 } ] },
        "sentence" : { "term_freq" : 2, "tokens" : [ { "position" : 3, "start_offset" : 10, "end_offset" : 18 }, { "position" : 10, "start_offset" : 59, "end_offset" : 67 } ] },
        "some" : { "term_freq" : 1, "tokens" : [ { "position" : 6, "start_offset" : 32, "end_offset" : 36 } ] },
        "this" : { "term_freq" : 1, "tokens" : [ { "position" : 0, "start_offset" : 0, "end_offset" : 4 } ] }
      }
    }
  }
}

# TCP states:
  Established: 85
  TimeWait: 2

# ThreadPool statistics:
  Worker:
    Busy: 1
    Free: 32766
    Min: 20
    Max: 32767
  IOCP:
    Busy: 0
    Free: 1000
    Min: 20
    Max: 1000
Each term contains:
- The frequency of the term
- The zero-based index of the token, called the position
- The zero-based character offsets of the start and end of the token within the original string, called start_offset and end_offset
For example, the first word in the string, This, has:
"this" : { "term_freq" : 1, "tokens" : [ { "position" : 0, "start_offset" : 0, "end_offset" : 4 } ] }
It appears once in the string. It is the first token, at position zero. It starts at character offset zero and ends at offset 4, immediately after its last character.
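In other words, the offsets index into the original string before any analysis, so the original surface form of a token can be recovered from them. A tiny sketch, using the description string from the Theory case:

```csharp
var description = "This is a sentence! It contains some, really bad. Grammar; sentence";

// start_offset is inclusive and end_offset is exclusive, so the token's length
// is end_offset - start_offset. The recovered text is the pre-lowercased form.
var surfaceForm = description.Substring(0, 4 - 0); // "This", stored as the term "this"
```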
We can also observe that:
- Grammar and punctuation were not tokenised
- Each token is lowercase
To summarise how the terms generated from a text mapping can be searched against:
| | Keyword mapping | Text Mapping |
| --- | --- | --- |
| Term query | Exact match search | Will only match documents that produce a single token. This means strings that contain spaces will not match anything, as the standard analyzer will create separate searchable terms for each word that is delimited by spaces. |
| Match query | Exact match search. The search query’s string is not tokenised. | Full-text search |
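To make the distinction concrete, here is a hedged sketch of both query types against the Description field. The indexName variable and the exact query shapes are illustrative assumptions rather than part of the sample repository.

```csharp
// A term query is not analysed, so it only matches when the supplied value
// equals a stored token exactly (lowercase, no punctuation).
var termResult = await OpenSearchClient.SearchAsync<ProductDocument>(search => search
    .Index(indexName)
    .Query(query => query
        .Term(term => term
            .Field(document => document.Description)
            .Value("sentence")
        )
    )
);

// A match query analyses its input with the field's analyzer (the standard
// analyzer here), so multi-word, punctuated input can still match stored tokens.
var matchResult = await OpenSearchClient.SearchAsync<ProductDocument>(search => search
    .Index(indexName)
    .Query(query => query
        .Match(match => match
            .Field(document => document.Description)
            .Query("Really bad grammar!")
        )
    )
);
```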
Sloth Summary
The OpenSearch Text mapping, in its basic form, is very simple to define. OpenSearch will, by default, apply the standard analyzer when indexing a document’s string field against this mapping.
The standard analyzer will:
- Create tokens from whole words
- Remove grammar and punctuation
- Lowercase normalise the given string
Now that you know how to create a Text mapping and the basics of the standard analyzer, you can start writing Match Queries to perform a full-text search. You can also begin exploring other analyzers or writing your own altogether!
More on these in the coming posts.
Happy searching!