Thursday, January 16, 2025

[Tutorial] OpenSearch Keyword Composite Aggregation

by Trent

This article will use the .NET OpenSearch.Client NuGet package. Prior to completing the tutorial, please read the Keyword Field Data Type Indexing Deep Dive article, as it contains useful prerequisite information.

At the time of writing, Elastic.co documentation is far richer than OpenSearch's, so links to both vendors may be provided to reference the concepts discussed. For the concepts covered here, the two offerings are currently functionally equivalent.

Don’t forget to check out the full code sample GitHub repositories, via the Code Sloth Code Samples page.

Why Use Composite Aggregations?

While the composite aggregation has its own aggregation name, it doesn’t entirely provide a new aggregation. Instead, it takes one or more supported aggregation types as its sources, each of which exposes only a subset of the features of its non-composite counterpart, and outputs the associated aggregation’s data:

  • Terms
  • Histogram
  • Date Histogram
  • GeoTile Grid

As is the case for any multi-bucket aggregation, the composite aggregation also supports sub-aggregations (aggregations that are performed on the result of a parent aggregation).

Of the above aggregations, only the terms aggregation is applicable to the keyword field mapping.

Here’s how it might be useful:

Unlimited Aggregation Retrieval

As we previously covered in the keyword terms tutorial, there is a limit to the maximum number of buckets that you can fetch in a single query. This is a similar constraint to the maximum number of documents that you can fetch in a single search query (by default this is 10,000).

The composite aggregation is our solution for this limit, as it supports fetching all buckets for one or more aggregations using key-based pagination: each response returns an after key, which is passed to the next request to fetch the following page. This is similar to how scroll or search_after allow us to efficiently fetch all documents for a given search query.

Fetching Multiple Aggregations’ Buckets at a Time

The composite aggregation takes multiple sources as an input. In the context of keywords, where terms is the only supported source, this means that we can provide multiple terms sources in a single request. Each bucket that is returned then represents one combination of the sources’ values.

This may not be as useful as requesting multiple types of aggregations at the same time, across different field data types.

Composite Aggregation Example

Let’s take a look at an example, which:

  • Indexes 5 documents, containing two distinct product name keyword values (mouse and mouse pad)
  • Issues a composite terms aggregation on the name field, with a small size, such that we need to fetch multiple pages of aggregation results, using the after parameter
  • Asserts the term and document count on each page of results, to ensure that they match our expectations

private async Task<ISearchResponse<ProductDocument>> QueryCompositeTermsAggregation(IOpenSearchClient opensearchClient, string uniqueIndexName, CompositeKey? afterKey = null)
{
	return await opensearchClient.SearchAsync<ProductDocument>(selector => selector
		  .Index(uniqueIndexName)
		  .Query(query => query.MatchAll())
		  // We do not want any documents returned; just the aggregations
		  .Size(0)
		  .Aggregations(
			  aggregations => aggregations.Composite(
				  "composite", compositeAggs =>
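				  compositeAggs.Sources(
					  sources => sources
					  .Terms(
						  "productCounts",
						  terms => terms.Field(field => field.Name)
						  )
					  )
				  // A page size of 1 forces multiple pages for our two terms
				  .Size(1)
				  // A null after key requests the first page of buckets
				  .After(afterKey)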
				  )
			  )
		  );
}

[Fact]
public async Task KeywordMapping_CanBeUsedForCompositeTermsAggregation()
{
	var indexName = "keyword-index";
	await _fixture.PerformActionInTestIndex(
		indexName,
		mappingDescriptor,
		async (uniqueIndexName, opensearchClient) =>
		{
			var productDocuments = new[] {
				new ProductDocument(1, "mouse"),
				new ProductDocument(2, "mouse pad"),
				new ProductDocument(3, "mouse"),
				new ProductDocument(4, "mouse"),
				new ProductDocument(5, "mouse pad"),
			};

			await _fixture.IndexDocuments(uniqueIndexName, productDocuments);

			var result = await QueryCompositeTermsAggregation(opensearchClient, uniqueIndexName);

			result.IsValid.Should().BeTrue();

			// We fetch the first terms bucket for the mouse term
			var formattedResults = string.Join(", ", result.Aggregations.Composite("composite").Buckets.Select(bucket => $"{bucket.Key.Values.First()}:{bucket.DocCount}"));
			formattedResults.Should().BeEquivalentTo("mouse:3");

			// Provide the after key to fetch the next bucket. This is the value "mouse" representing the last term bucket that was returned
			// This will fetch us the second terms bucket for the mouse pad term
			result = await QueryCompositeTermsAggregation(opensearchClient, uniqueIndexName, result.Aggregations.Composite("composite").AfterKey);
			formattedResults = string.Join(", ", result.Aggregations.Composite("composite").Buckets.Select(bucket => $"{bucket.Key.Values.First()}:{bucket.DocCount}"));
			formattedResults.Should().BeEquivalentTo("mouse pad:2");
		}
	);
}

Past experience has taught me that creating shared functions (especially those relating to test data setup) when writing unit tests is a recipe for complexity over time. One additional parameter for a new use case becomes five or more, and before you know it the function caters for everything on the planet. At this point maintaining existing tests becomes very difficult, as the perceived complexity of the function has grown since they were originally authored. This makes interpreting what a single invocation does a side mission to the primary task at hand, and only serves to slow down your engineering process.

For this test however, it was pragmatic to functionally decompose the formulation of the OpenSearch query, given that we need to invoke it twice. This has saved copy-pasta code and kept the function minimal.

Composite Keys

As you can see, the QueryCompositeTermsAggregation function takes a CompositeKey. In our initial request we do not supply this key, which causes OpenSearch to return the first page of aggregations.

For each subsequent page we must supply a composite key. This value is returned to us from the prior query. For the test above the composite key represents the last term from the last bucket that we fetched.

Given that this test sets a bucket size of 1, we first fetch the mouse term. mouse then becomes our composite key, which we pass to our second query to fetch the mouse pad term along with its expected document count from the second page of results.
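
The paging flow described above can be sketched without a cluster. This is only an in-memory simulation under stated assumptions, not how OpenSearch implements the aggregation: FetchPage and SumAllBuckets are hypothetical helpers, and the names array stands in for the five indexed documents. It shows the same loop you would write against the real client: request a page, use the last key of that page as the next after key, and stop when a page comes back empty.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory stand-in for the five indexed product names (not an OpenSearch call).
var names = new[] { "mouse", "mouse pad", "mouse", "mouse", "mouse pad" };

// Hypothetical helper: returns up to `size` term buckets whose key sorts strictly
// after `afterKey`, mimicking one page of a composite terms source.
List<(string Key, long DocCount)> FetchPage(string? afterKey, int size) =>
    names
        .GroupBy(name => name)
        .Select(group => (Key: group.Key, DocCount: (long)group.Count()))
        .OrderBy(bucket => bucket.Key, StringComparer.Ordinal)
        .Where(bucket => afterKey is null || string.CompareOrdinal(bucket.Key, afterKey) > 0)
        .Take(size)
        .ToList();

// Hypothetical helper: keeps requesting pages, passing the last key of each page
// as the next after key, until a page comes back empty.
long SumAllBuckets(int size)
{
    long total = 0;
    string? afterKey = null;
    while (true)
    {
        var page = FetchPage(afterKey, size);
        if (page.Count == 0) break;
        total += page.Sum(bucket => bucket.DocCount);
        afterKey = page[^1].Key;
    }
    return total;
}

Console.WriteLine(SumAllBuckets(size: 1)); // 5: mouse (3) + mouse pad (2)
```

Because each page is selected by key rather than by position, the summed document counts match the five indexed documents regardless of the page size chosen.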

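By default, composite buckets are sorted in ascending order based on their values. For keyword terms this follows the ASCII values of their characters, which explains why we see mouse before mouse pad: mouse is a prefix of mouse pad, so it sorts first.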
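
As a rough local analogy for this ordering, and assuming plain ASCII terms with no normalizer applied to the keyword field, an ordinal sort in C# produces the same sequence. The Mouse term is a hypothetical addition to show that uppercase letters sort before lowercase ones.

```csharp
using System;

// Hypothetical terms; "Mouse" is included only to show case-sensitive ordering.
var terms = new[] { "mouse pad", "mouse", "Mouse" };

// An ordinal comparison orders strings by character code, with a prefix
// sorting before any longer string that starts with it.
Array.Sort(terms, StringComparer.Ordinal);

Console.WriteLine(string.Join(", ", terms)); // Mouse, mouse, mouse pad
```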

Optimising the Composite Aggregation

The first line of the composite aggregation documentation from elastic.co mentions that the composite aggregation is expensive and that your application should be load tested before deploying usage of it to production.

This article will not cover how to benchmark an OpenSearch cluster, nor deep dive into each recommendation. These will be covered in upcoming articles. However, we’ll cover a summary of each below.

Use index sort to match the source order in the composite aggregation. This means that:

  • If you have a single composite source that refers to a single document field, apply index sorting to that specific field in the direction that your composite aggregation will be sorted. By default this is ascending order
  • If you have multiple composite sources that refer to multiple document fields, ensure that some or all of the referenced fields have index sort applied in the direction(s) that the composite aggregation(s) require
  • Doing so ensures that data is ordered on disk, so that sorting does not need to be performed in memory at query time. This is a similar concept to how relational databases use indexes to increase the performance of queries.

Additionally, optimise early termination by setting track_total_hits to false in the request. If total hits are absolutely required, consider only setting it to true on the first page, and then false for subsequent pages. Before doing this, consider whether your data volume is highly volatile: the total is not persisted on the server, so it can change between requests.

Finally, if the order of your sources does not matter, put the fields with the highest cardinality first. To learn about the cardinality of your aggregated fields, check out the CodeSloth tutorial on the cardinality aggregation for keywords.

Sloth Summary

The composite aggregation:

  • Allows us to paginate through all of the results for a collection of supported aggregation types. This is achieved by passing the composite (after) key from each page into the request for the next page.
  • Is similar to search_after or scroll in the search query space
  • May be detrimental to cluster performance, and should be benchmarked before being released in a production environment. Performance optimisation may assist with this.
  • Can only be invoked with a terms source for the keyword field data type
