Thursday, January 16, 2025

[Tutorial] OpenSearch Keyword Terms Aggregation

by Trent

This article uses the .NET OpenSearch.Client NuGet package. Before starting the tutorial, please read the Keyword Field Data Type Indexing Deep Dive article, as it contains useful prerequisite information.

At the time of writing, the Elastic.co documentation is far richer than OpenSearch's, so links to both vendors' documentation may be used to reference the concepts discussed. For these concepts, the two offerings are currently functionally equivalent.

Don’t forget to check out the GitHub repositories containing the full code samples, via the Code Sloth Code Samples page.

Keywords and Bucket Aggregations

Bucket aggregations are a family of aggregations that use different strategies to group documents into buckets. While there is a long list of bucket aggregations available in OpenSearch, only a subset of these are applicable to keyword field data types.

Let’s explore the terms aggregation and how it can be used on the keyword field data type.

Why Use Terms?

The terms aggregation is likely the most common aggregation that you will use with OpenSearch. It calculates the distinct terms of your specified keyword field and the number of documents that contain each term. If you supply a query alongside this aggregation, it will only count the terms of documents that match the query.
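To make the semantics concrete, here is a minimal in-memory analogy written with LINQ. It is not how OpenSearch computes the aggregation (that is based on global ordinals, discussed below); it only mirrors the observable result. The `Product` record and the `CountTerms` helper are hypothetical names invented for this sketch, and the tie-break on key is an assumption about ordering when counts are equal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for an indexed document with a keyword "name" field.
public record Product(int Id, string Name);

public static class TermsAggregationAnalogy
{
    // Mimics the shape of a terms aggregation result:
    // 1. keep only the documents that match the query,
    // 2. group them by the exact (case-sensitive) keyword value,
    // 3. count the documents in each group,
    // 4. return the `size` largest groups (10 by default).
    public static IReadOnlyList<(string Key, long DocCount)> CountTerms(
        IEnumerable<Product> products,
        Func<Product, bool> query,
        int size = 10)
    {
        return products
            .Where(query)
            .GroupBy(product => product.Name)
            .Select(group => (Key: group.Key, DocCount: group.LongCount()))
            .OrderByDescending(bucket => bucket.DocCount)
            // Assumption for this sketch: equal counts are ordered by key.
            .ThenBy(bucket => bucket.Key, StringComparer.Ordinal)
            .Take(size)
            .ToList();
    }
}
```

Calling `CountTerms` with three "mouse" products, two "mouse pad" products and a query of `_ => true` yields the buckets `mouse:3` and `mouse pad:2`. Passing a narrower predicate plays the role of the query: only matching documents are counted.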

Shopping websites are a great use case for the terms aggregation. If you’ve ever seen a website list a number alongside each type of clothing (t-shirt, shorts, jeans, etc.) on a filter menu, it’s likely powered by a terms aggregation. As you add increasingly specific filters to your search, those numbers will decrease, because less of the available inventory matches your requirements.
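A filter menu like this could be sketched with OpenSearch.Client as follows. The `ClothingDocument` class, its `Type` and `Colour` fields, the `typeCounts` aggregation name and the `FilterMenu` helper are all hypothetical and exist only for illustration; both fields are assumed to be mapped as keyword.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using OpenSearch.Client;

// Hypothetical document shape; both properties are assumed to be keyword fields.
public class ClothingDocument
{
    public string Type { get; set; }   // e.g. "t-shirt", "shorts", "jeans"
    public string Colour { get; set; }
}

public static class FilterMenu
{
    // Returns the number of documents per clothing type,
    // counting only documents that match the selected colour.
    public static async Task<IReadOnlyDictionary<string, long>> CountTypesForColourAsync(
        IOpenSearchClient client, string indexName, string colour)
    {
        var response = await client.SearchAsync<ClothingDocument>(search => search
            .Index(indexName)
            // The query narrows the documents that the aggregation counts
            .Query(query => query.Term(term => term.Field(f => f.Colour).Value(colour)))
            // Only the aggregation is needed, not the matching documents
            .Size(0)
            .Aggregations(aggs => aggs
                .Terms("typeCounts", terms => terms.Field(f => f.Type))));

        return response.Aggregations.Terms("typeCounts").Buckets
            .ToDictionary(bucket => bucket.Key, bucket => bucket.DocCount ?? 0);
    }
}
```

Each additional filter would become another clause in the query, and the counts returned for each type would shrink accordingly.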

Global Ordinals and Cardinality

Terms aggregations are based on global ordinals, rather than being calculated directly from the fields of the document. These values are typically calculated the first time an aggregation is requested and cached at the shard level. This has a direct impact on the execution time of the initial query.

If the indexing rate of the cluster is high, these global ordinals are invalidated quickly. This means that more queries will have a longer execution time, as they wait for these values to be recalculated. Another tutorial will cover how to address this issue by enabling eager global ordinals, which forces recalculation as documents are indexed rather than when an aggregation is first requested.

Eager global ordinals aside, while getting started with the terms aggregation it is handy to know:

  • Terms aggregations default to the top 10 terms
  • The maximum number of terms buckets can be increased using the Size() method
  • There is a limit to the maximum number of buckets that a single request can produce (the search.max_buckets cluster setting). This applies across all aggregations and will result in runtime exceptions if your dynamic queries exceed it
  • You may require a composite aggregation if you wish to retrieve all terms in a data set with high cardinality
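The last three points can be sketched as follows. This is a minimal sketch rather than a definitive implementation: the `TermsPaging` class, its method names, the aggregation names and the `pageSize` default are all hypothetical, and the field is passed as a plain string so that the helpers are not tied to a particular document type.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using OpenSearch.Client;

public static class TermsPaging
{
    // Requests more than the default 10 terms by setting Size() on the aggregation.
    public static async Task<IReadOnlyCollection<KeyedBucket<string>>> TopTermsAsync<T>(
        IOpenSearchClient client, string indexName, string field, int size) where T : class
    {
        var response = await client.SearchAsync<T>(search => search
            .Index(indexName)
            .Size(0) // documents returned: none
            .Aggregations(aggs => aggs
                .Terms("topTerms", terms => terms.Field(field).Size(size)))); // buckets returned: up to `size`

        if (!response.IsValid) throw new InvalidOperationException(response.DebugInformation);
        return response.Aggregations.Terms("topTerms").Buckets;
    }

    // Pages through every term of a high-cardinality field with a composite aggregation,
    // feeding the key returned by each page into the next request.
    public static async Task<Dictionary<string, long>> AllTermsAsync<T>(
        IOpenSearchClient client, string indexName, string field, int pageSize = 100) where T : class
    {
        const string aggregationName = "allTerms";
        const string sourceName = "term";
        var counts = new Dictionary<string, long>();
        CompositeKey afterKey = null; // null on the first page

        while (true)
        {
            var response = await client.SearchAsync<T>(search => search
                .Index(indexName)
                .Size(0)
                .Aggregations(aggs => aggs
                    .Composite(aggregationName, composite => composite
                        .Size(pageSize)
                        .Sources(sources => sources
                            .Terms(sourceName, terms => terms.Field(field)))
                        .After(afterKey))));

            if (!response.IsValid) throw new InvalidOperationException(response.DebugInformation);

            var page = response.Aggregations.Composite(aggregationName);
            if (page == null || page.Buckets.Count == 0) break;

            foreach (var bucket in page.Buckets)
            {
                if (bucket.Key.TryGetValue(sourceName, out string term))
                    counts[term] = bucket.DocCount ?? 0;
            }

            afterKey = page.AfterKey;
            if (afterKey == null) break;
        }

        return counts;
    }
}
```

Raising `Size()` is the simpler option when the number of distinct terms is known to be small. The composite loop trades a single request for several smaller ones, which keeps each response well under the bucket limit.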

Terms Aggregation Example

Let’s take a look at an example, which:

  • Indexes 5 documents whose name keyword field contains one of two distinct product names (mouse and mouse pad)
  • Issues a terms aggregation on the name field
  • Asserts that each term is returned alongside the number of documents that reference it
[Fact]
public async Task KeywordMapping_CanBeUsedForTermsAggregation()
{
	var indexName = "keyword-index";
	await _fixture.PerformActionInTestIndex(
		indexName,
		mappingDescriptor,
		async (uniqueIndexName, opensearchClient) =>
		{
			var productDocuments = new[] {
				new ProductDocument(1, "mouse"),
				new ProductDocument(3, "mouse pad"),
				new ProductDocument(4, "mouse"),
				new ProductDocument(5, "mouse"),
				new ProductDocument(6, "mouse pad"),
			};

			await _fixture.IndexDocuments(uniqueIndexName, productDocuments);

			const string productCounts = "productCounts";

			var result = await opensearchClient.SearchAsync<ProductDocument>(selector => selector
				.Index(uniqueIndexName)
				.Query(query => query.MatchAll())
				// We do not want any documents returned; just the aggregations
				.Size(0)
				.Aggregations(aggregations => aggregations
					.Terms(productCounts, selector => selector.Field(field => field.Name))
				)
			);

			// Extract each term and its associated number of hits
			result.IsValid.Should().BeTrue();
			var formattedResults = string.Join(", ", result.Aggregations
				.Terms(productCounts).Buckets
				.Select(bucket => $"{bucket.Key}:{bucket.DocCount}")
			);

			formattedResults.Should().BeEquivalentTo("mouse:3, mouse pad:2");
		}
	);
}

This produces the following output:

Valid OpenSearch.Client response built from a successful (200) low level call on POST: /keyword-indexa7122530-f920-4f72-9cf9-c4eba4b5202e/_search?pretty=true&error_trace=true&typed_keys=true
     # Audit trail of this API call:
     - [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.2791735
     # Request: {
    "aggs": {
        "productCounts": {
            "terms": {
                "field": "name"
            }
        }
    },
    "query": {
        "match_all": {}
    },
    "size": 0
}
 # Response: {
    "took": 116,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 5,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "sterms#productCounts": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [{
                    "key": "mouse",
                    "doc_count": 3
                }, {
                    "key": "mouse pad",
                    "doc_count": 2
                }
            ]
        }
    }
}

 # TCP states:
Established: 83
CloseWait: 4
TimeWait: 6

 # ThreadPool statistics:
Worker:
Busy: 1
Free: 32766
Min: 12
Max: 32767
IOCP:
Busy: 0
Free: 1000
Min: 12
Max: 1000

The returned aggregation can be identified by its type and the name that we gave it: sterms#productCounts.

A terms aggregation is fetched from the response by name, using a dictionary lookup through the Terms(...) method. However, unlike our Cardinality aggregation, bucket aggregations return a complex result instead of a single value:

"buckets": [{
		"key": "mouse",
		"doc_count": 3
	}, {
		"key": "mouse pad",
		"doc_count": 2
	}
]

Here we have an array containing each distinct product name alongside the number of documents that reference it. Our test asserts this by using a LINQ statement to format each bucket as a key:count string, and then joins these strings with commas.

var formattedResults = string.Join(", ", result.Aggregations
	.Terms(productCounts).Buckets
	.Select(bucket => $"{bucket.Key}:{bucket.DocCount}")
);

formattedResults.Should().BeEquivalentTo("mouse:3, mouse pad:2");

Stringifying parts of complex objects in this way makes writing test assertions incredibly clear: we expect 3 mouse and 2 mouse pad products to be returned by the terms aggregation.
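When the counts are needed by application code rather than a test assertion, the buckets can be copied into a dictionary instead. The helper below is a sketch under that assumption; `TermsBuckets` and `ToCounts` are hypothetical names. It also surfaces the sum_other_doc_count and doc_count_error_upper_bound values seen in the response above, which indicate whether some terms were left out or the counts are approximate.

```csharp
using System.Collections.Generic;
using System.Linq;
using OpenSearch.Client;

public static class TermsBuckets
{
    // Copies a named terms aggregation into a dictionary of term -> document count.
    // `complete` is set to false when the aggregation is missing, when documents fell
    // into terms that were not returned, or when the counts carry a non-zero error bound.
    public static Dictionary<string, long> ToCounts(
        AggregateDictionary aggregations, string aggregationName, out bool complete)
    {
        var terms = aggregations.Terms(aggregationName);
        if (terms == null)
        {
            complete = false;
            return new Dictionary<string, long>();
        }

        complete = (terms.SumOtherDocCount ?? 0) == 0
            && (terms.DocCountErrorUpperBound ?? 0) == 0;

        return terms.Buckets.ToDictionary(bucket => bucket.Key, bucket => bucket.DocCount ?? 0);
    }
}
```

In the test above this could be called as `TermsBuckets.ToCounts(result.Aggregations, productCounts, out var complete)`. For the response shown, both values are 0, so `complete` would be true.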

Sloth Summary

  • The terms aggregation is incredibly useful for finding the distinct terms of a given field and counting the documents that reference each term
  • It is often used on shopping websites to show the number of results that match given product metadata in their inventory
  • The terms aggregation is based on global ordinals and may need to be optimised using eager global ordinals
  • While the maximum number of terms can be increased, there is a limit to the number of buckets that can be calculated for a given query, so a composite aggregation may be required if you have terms of high cardinality
