[Tutorial] Sorting the Keyword Field Data Type

by Trent

This article is a continuation of Querying the Keyword Field Data Type. While the first article focused on fetching documents that have an indexed keyword field, this one focuses on sorting keywords, both directly and with sort scripts. We’ll continue to use the .NET OpenSearch.Client NuGet package as we did in part 1. Please read the Keyword Field Data Type Indexing Deep Dive article before completing this tutorial, as it contains useful prerequisite information.

At the time of writing, the Elastic.co documentation is far richer than the OpenSearch documentation, so links to both vendors may be provided to reference the concepts discussed. For the features covered here, the two offerings are currently functionally equivalent.

Enriching Search with Sort

We covered searching for documents in the last article. However, a good search experience is not complete if we cannot order the results in a meaningful way for the consumer.

It is important to understand that performing an OpenSearch sort operation may have memory implications, discussed here. However, keywords are exempt from the additional considerations that we must give to text, numeric or geo based sorting, which makes them very simple to work with.

Let’s take a look at how easy it is to sort keywords using the test case below, which:

  • Indexes two documents
  • Issues a match all query to fetch them both
  • Sorts the documents by their keyword name in descending order
/// <summary>
/// Keyword fields do not require anything special to support sorting
/// </summary>
[Fact]
public async Task KeywordMapping_CanBeUsedAsASortedField_WithoutAnySpecialConsiderations()
{
	var indexName = "keyword-index";
	await _fixture.PerformActionInTestIndex(
		indexName,
		mappingDescriptor,
		async (uniqueIndexName, opensearchClient) =>
		{
			var productDocuments = new[]
			{
				new ProductDocument(1, "mouse"),
				new ProductDocument(2, "mouse pad"),
			};

			await _fixture.IndexDocuments(uniqueIndexName, productDocuments);

			var result = await opensearchClient.SearchAsync<ProductDocument>(selector => selector
				.Index(uniqueIndexName)
				.Query(query => query.MatchAll())
				.Explain()
				.Sort(sort => sort
					.Descending(fieldName => fieldName.Name)
				)
			);

			// Our documents can be sorted alphabetically
			result.IsValid.Should().BeTrue();
			var formattedResults = string.Join(", ", result.Documents.Select(doc => doc.Name));
			formattedResults.Should().BeEquivalentTo("mouse pad, mouse");
		}
	);
}

This query produces the following DebugInformation in the response object:

Valid OpenSearch.Client response built from a successful (200) low level call on POST: /keyword-index2ccd7782-75e8-449a-94ac-7c20d10696f3/_search?pretty=true&error_trace=true&typed_keys=true
# Audit trail of this API call:
 - [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.2225050
# Request:
{
    "explain": true,
    "query": {
        "match_all": {}
    },
    "sort": [{
            "name": {
                "order": "desc"
            }
        }
    ]
}
# Response:
{
    "took": 57,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": null,
        "hits": [{
                "_shard": "[keyword-index2ccd7782-75e8-449a-94ac-7c20d10696f3][0]",
                "_node": "l7YV4K5YSFuy_CFfGwt8ig",
                "_index": "keyword-index2ccd7782-75e8-449a-94ac-7c20d10696f3",
                "_id": "2",
                "_score": null,
                "_source": {
                    "id": 2,
                    "name": "mouse pad"
                },
                "sort": [
                    "mouse pad"
                ],
                "_explanation": {
                    "value": 1.0,
                    "description": "*:*",
                    "details": []
                }
            }, {
                "_shard": "[keyword-index2ccd7782-75e8-449a-94ac-7c20d10696f3][0]",
                "_node": "l7YV4K5YSFuy_CFfGwt8ig",
                "_index": "keyword-index2ccd7782-75e8-449a-94ac-7c20d10696f3",
                "_id": "1",
                "_score": null,
                "_source": {
                    "id": 1,
                    "name": "mouse"
                },
                "sort": [
                    "mouse"
                ],
                "_explanation": {
                    "value": 1.0,
                    "description": "*:*",
                    "details": []
                }
            }
        ]
    }
}

 # TCP states:
Established: 50
TimeWait: 18
CloseWait: 14

 # ThreadPool statistics:
Worker:
Busy: 1
Free: 32766
Min: 12
Max: 32767
IOCP:
Busy: 0
Free: 1000
Min: 12
Max: 1000

Here we can see that the resulting documents are returned to us in descending name order. Each keyword value is compared as a single term, so the space in “mouse pad” is just another character in the value rather than a word boundary. It’s that simple!
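
The ordering above can be reproduced in memory. This is only a minimal sketch, assuming keyword terms are compared by their raw character values (approximated here with `StringComparer.Ordinal`). It runs without a cluster and illustrates the comparison only, not how OpenSearch implements sorting internally.

```csharp
using System;
using System.Linq;

// The two keyword values indexed by the test above.
string[] names = { "mouse", "mouse pad" };

// Ordinal comparison treats each value as one opaque term, space included.
// "mouse pad" extends "mouse", so it ranks higher and comes first when descending.
string[] descending = names
    .OrderByDescending(name => name, StringComparer.Ordinal)
    .ToArray();

Console.WriteLine(string.Join(", ", descending)); // mouse pad, mouse
```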

Numeric Sorting with Keyword Fields

Careful consideration must be given when mapping numeric fields in OpenSearch. This is because they are optimised for range queries, as discussed here. If you are performing term queries, it is recommended to use keyword fields for numbers instead of their numeric mapping type.

But what does this mean for sorting? Are numbers treated differently in keyword fields?

Let’s take a look in this example below, which:

  • Indexes two documents whose names are integers
  • Issues a match all query to fetch them both
  • Sorts the documents by the name in descending order
[Fact]
public async Task KeywordMapping_ShouldNotBeUsedToSortNumericData()
{
	var indexName = "keyword-index";
	await _fixture.PerformActionInTestIndex(
		indexName,
		mappingDescriptor,
		async (uniqueIndexName, opensearchClient) =>
		{
			var productDocuments = new[]
			{
				new ProductDocument(1, "5"),
				new ProductDocument(2, "2000"),
			};

			await _fixture.IndexDocuments(uniqueIndexName, productDocuments);

			var result = await opensearchClient.SearchAsync<ProductDocument>(selector => selector
				   .Index(uniqueIndexName)
				   .Query(query => query.MatchAll())
				   .Explain()
				   .Sort(sort => sort
					.Descending(fieldName => fieldName.Name)
				)
			);

			// Keyword values are compared as strings, so "5" outranks "2000" in descending order
			result.IsValid.Should().BeTrue();
			var formattedResults = string.Join(", ", result.Documents.Select(doc => doc.Name));
			formattedResults.Should().BeEquivalentTo("5, 2000");
		}
	);
}

This produces the following DebugInformation:

Valid OpenSearch.Client response built from a successful (200) low level call on POST: /keyword-index87cbe8dd-14ac-49b4-bd7b-f6fa2e0f3417/_search?pretty=true&error_trace=true&typed_keys=true
# Audit trail of this API call:
 - [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.1348488
# Request:
{"explain":true,"query":{"match_all":{}},"sort":[{"name":{"order":"desc"}}]}
# Response:
{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_shard" : "[keyword-index87cbe8dd-14ac-49b4-bd7b-f6fa2e0f3417][0]",
        "_node" : "l7YV4K5YSFuy_CFfGwt8ig",
        "_index" : "keyword-index87cbe8dd-14ac-49b4-bd7b-f6fa2e0f3417",
        "_id" : "1",
        "_score" : null,
        "_source" : {
          "id" : 1,
          "name" : "5"
        },
        "sort" : [
          "5"
        ],
        "_explanation" : {
          "value" : 1.0,
          "description" : "*:*",
          "details" : [ ]
        }
      },
      {
        "_shard" : "[keyword-index87cbe8dd-14ac-49b4-bd7b-f6fa2e0f3417][0]",
        "_node" : "l7YV4K5YSFuy_CFfGwt8ig",
        "_index" : "keyword-index87cbe8dd-14ac-49b4-bd7b-f6fa2e0f3417",
        "_id" : "2",
        "_score" : null,
        "_source" : {
          "id" : 2,
          "name" : "2000"
        },
        "sort" : [
          "2000"
        ],
        "_explanation" : {
          "value" : 1.0,
          "description" : "*:*",
          "details" : [ ]
        }
      }
    ]
  }
}

# TCP states:
  Established: 173
  TimeWait: 25
  CloseWait: 12

# ThreadPool statistics:
  Worker: 
    Busy: 1
    Free: 32766
    Min: 12
    Max: 32767
  IOCP: 
    Busy: 0
    Free: 1000
    Min: 12
    Max: 1000

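The name of the test method likely gave it away, but still, gasp! Our documents have (unsurprisingly) been sorted lexicographically rather than numerically. Keyword values are compared character by character, so in descending order “5” comes before “2000” because the character 5 is greater than the character 2.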

Something to keep in mind when navigating the complexities of numeric fields in your documents!
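
To make the difference concrete, here is a small in-memory sketch that sorts the same two values both ways. It assumes `StringComparer.Ordinal` is a fair stand-in for keyword ordering and does not talk to OpenSearch at all.

```csharp
using System;
using System.Globalization;
using System.Linq;

string[] names = { "5", "2000" };

// Keyword-style ordering: compare character by character. '5' > '2', so "5" wins.
string[] asKeywords = names
    .OrderByDescending(name => name, StringComparer.Ordinal)
    .ToArray();

// Numeric ordering: parse first, then compare the numbers.
string[] asNumbers = names
    .OrderByDescending(name => int.Parse(name, CultureInfo.InvariantCulture))
    .ToArray();

Console.WriteLine(string.Join(", ", asKeywords)); // 5, 2000
Console.WriteLine(string.Join(", ", asNumbers));  // 2000, 5
```

If numeric ordering is what the consumer expects, the sort needs to run against a value that is compared as a number rather than as a keyword.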

Next Level Sorting with Painless Scripts

If sorting on the (static) indexed value of a keyword field is insufficient for your search use case, you can make your sort dynamic by writing a Painless script. The Elastic Painless guide can also be found here, if you’d like to read a broader discussion of the scripting language and go through some use cases.

We’ll explore an example in the test below, which:

  • Indexes our two documents
  • Issues a match all query to fetch them both
  • Sorts the documents with a Painless script that performs a ternary comparison on the keyword value to produce a numeric value on which the sort will be performed
[Fact]
public async Task KeywordMapping_CanBeUsedToScriptASortedField()
{
	var indexName = "keyword-index";
	await _fixture.PerformActionInTestIndex(
		indexName,
		mappingDescriptor,
		async (uniqueIndexName, opensearchClient) =>
		{
			var productDocuments = new[]
			{
				new ProductDocument(1, "mouse"),
				new ProductDocument(2, "mouse pad"),
			};

			await _fixture.IndexDocuments(uniqueIndexName, productDocuments);

			var result = await opensearchClient.SearchAsync<ProductDocument>(selector => selector
				   .Index(uniqueIndexName)
				   .Query(query => query.MatchAll())
				   .Explain()
				   .Sort(sort => sort
					.Script(sortScript => sortScript
						.Ascending()
						.Type("number")
						.Script(s => s.Source($"doc['{nameof(ProductDocument.Name).ToLowerInvariant()}'].value == 'mouse pad' ? 0 : 1")
						)
					)
				)
			);

			// Our scripted sort will return the mouse pad at the top of the results
			result.IsValid.Should().BeTrue();
			var formattedResults = string.Join(", ", result.Documents.Select(doc => doc.Name));
			formattedResults.Should().BeEquivalentTo("mouse pad, mouse");
		}
	);
}

This query produces the following DebugInformation in the response object:

Valid OpenSearch.Client response built from a successful (200) low level call on POST: /keyword-index93e2a7a2-ad01-4d72-ba43-cf2ccea61308/_search?pretty=true&error_trace=true&typed_keys=true
# Audit trail of this API call:
 - [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.4245839
# Request:
{
    "explain": true,
    "query": {
        "match_all": {}
    },
    "sort": [{
            "_script": {
                "script": {
                    "source": "doc['name'].value == 'mouse pad' ? 0 : 1"
                },
                "type": "number",
                "order": "asc"
            }
        }
    ]
}
# Response:
{
    "took": 269,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": null,
        "hits": [{
                "_shard": "[keyword-index93e2a7a2-ad01-4d72-ba43-cf2ccea61308][0]",
                "_node": "l7YV4K5YSFuy_CFfGwt8ig",
                "_index": "keyword-index93e2a7a2-ad01-4d72-ba43-cf2ccea61308",
                "_id": "2",
                "_score": null,
                "_source": {
                    "id": 2,
                    "name": "mouse pad"
                },
                "sort": [
                    0.0
                ],
                "_explanation": {
                    "value": 1.0,
                    "description": "*:*",
                    "details": []
                }
            }, {
                "_shard": "[keyword-index93e2a7a2-ad01-4d72-ba43-cf2ccea61308][0]",
                "_node": "l7YV4K5YSFuy_CFfGwt8ig",
                "_index": "keyword-index93e2a7a2-ad01-4d72-ba43-cf2ccea61308",
                "_id": "1",
                "_score": null,
                "_source": {
                    "id": 1,
                    "name": "mouse"
                },
                "sort": [
                    1.0
                ],
                "_explanation": {
                    "value": 1.0,
                    "description": "*:*",
                    "details": []
                }
            }
        ]
    }
}

 # TCP states:
Established: 64
TimeWait: 1
CloseWait: 14

 # ThreadPool statistics:
Worker:
Busy: 1
Free: 32766
Min: 12
Max: 32767
IOCP:
Busy: 0
Free: 1000
Min: 12
Max: 1000

The sort field in the response above shows the values that were calculated by performing the ternary comparison on our keyword field: mouse pad produces 0 and mouse produces 1 (returned as 0.0 and 1.0 because the script sort type is number). These values are then used to order the results in ascending order.
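
The same computation can be mirrored in memory to see what the script contributes. This is a sketch only: `SortValue` is a hypothetical local function that copies the ternary from the Painless source, and LINQ stands in for the server-side sort.

```csharp
using System;
using System.Linq;

string[] names = { "mouse", "mouse pad" };

// Same ternary as the Painless script: the preferred name gets the lowest sort value.
double SortValue(string name) => name == "mouse pad" ? 0 : 1;

var ranked = names
    .Select(name => (Name: name, Sort: SortValue(name)))
    .OrderBy(hit => hit.Sort) // ascending, matching "order": "asc" in the request
    .ToArray();

Console.WriteLine(string.Join(", ", ranked.Select(hit => hit.Name))); // mouse pad, mouse
```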

It is worth mentioning that this sorting example is contrived and doesn’t reflect a good practical example of scripted sorting. While Painless scripts may be relatively fast to execute at search time, we should always aim to keep runtime complexity to a minimum. This will not only make your queries easier to debug, but will also allow them to run as fast as possible.

The example above is not dynamic and should not be implemented with a script. This is because we can evaluate whether the name of the product is mouse pad and index the result into a separate field of the document at indexing time. This value can only change when we re-index the document, at which point we can re-evaluate the condition and store the new value accordingly.
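
As a rough illustration of that alternative, the sketch below computes the rank once, when the document is built, and then sorts on the stored value. The `MousePadRank` helper and the `Rank` field are hypothetical names introduced for this example; they are not part of the original `ProductDocument`, and the sort runs in memory rather than against an index.

```csharp
using System;
using System.Linq;

// Hypothetical helper: evaluated once at indexing time, not on every search.
static int MousePadRank(string name) => name == "mouse pad" ? 0 : 1;

// Build the documents with the precomputed rank stored alongside the name.
var documents = new[] { (Id: 1, Name: "mouse"), (Id: 2, Name: "mouse pad") }
    .Select(doc => (doc.Id, doc.Name, Rank: MousePadRank(doc.Name)))
    .ToArray();

// At search time this becomes a plain field sort on the stored rank; no script is needed.
var ordered = documents.OrderBy(doc => doc.Rank).ToArray();

Console.WriteLine(string.Join(", ", ordered.Select(doc => doc.Name))); // mouse pad, mouse
```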

An example of a dynamic query would be one where the value mouse pad is instead an interpolated string variable given to us in an HTTP request. This value could be different for every user, which would make it impossible for us to pre-calculate a value when indexing our document.

Let’s take a look at how a dynamic query could be constructed:

var scriptedVariableValue = "mouse";

var result = await opensearchClient.SearchAsync<ScriptedProductDocument>(selector => selector
	.Index(uniqueIndexName)
	.ScriptFields(scriptFields => scriptFields
		.ScriptField(
			categoryFieldName,
			script => script.Source($"doc['{nameof(ProductDocument.Name).ToLowerInvariant()}'].value == '{scriptedVariableValue}' ? 'computer accessory' : 'mouse accessory'")
		)
	)
	.Source(true)
);

In the example above, the script is populated with an interpolated local variable, hard-coded here for illustration. The snippet computes a script field rather than a sort, but the interpolation works the same way inside a sort script. In reality, depending on the nature of the application, this value could be anything at any point in time, which is what makes the use case dynamic and a script justified.

Note that during interpolation of the variable, we must remember to wrap it in single quotes '{scriptedVariableValue}'!
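
Interpolation becomes fragile once the incoming value can itself contain a single quote. One possible alternative, sketched below, is to keep the script source constant and pass the value through the script's `params` object. The sketch only builds the JSON for the sort clause with System.Text.Json; the parameter name `requested` and the sample value are assumptions made for illustration, and how this maps onto the fluent client is not covered here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Encodings.Web;
using System.Text.Json;

// Value that would arrive with the request; note the embedded single quote.
string requestedName = "mouse's pad";

// The script source stays constant; the user value travels separately under "params".
var sort = new Dictionary<string, object>
{
    ["_script"] = new Dictionary<string, object>
    {
        ["type"] = "number",
        ["order"] = "asc",
        ["script"] = new Dictionary<string, object>
        {
            ["source"] = "doc['name'].value == params.requested ? 0 : 1",
            ["params"] = new Dictionary<string, object> { ["requested"] = requestedName },
        },
    },
};

var options = new JsonSerializerOptions
{
    WriteIndented = true,
    // Keep single quotes readable instead of emitting \u0027 escapes.
    Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping,
};

string sortJson = JsonSerializer.Serialize(sort, options);
Console.WriteLine(sortJson);
```

Because the value never becomes part of the script text, no quoting is required, and the same script source is reused for every request.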

Sloth Summary

Today we’ve covered how to sort keyword fields:

  • They’re one of the simplest things to sort, as they don’t require any special considerations, mappings, or typecasting
  • Numeric data stored in a keyword field is sorted lexicographically (character by character), not numerically
  • Keyword fields can also be used in scripted sorting
  • Scripted sort should only be used when we have a dynamic use case that drives the sort order. Ask yourself: “can I index this value at index-time instead?” to keep your search requests performant!
