Thursday, January 16, 2025

[Tutorial] OpenSearch Keyword Cardinality Aggregation

by Trent

This short tutorial will cover what an ElasticSearch Cardinality Aggregation is and how to perform it. Cardinality in ElasticSearch is one of the simplest aggregations to write and can be very helpful while building your search solution. This article will focus specifically on performing a cardinality aggregation on data that is indexed into the keyword field data type.

While this article uses the .NET OpenSearch.Client NuGet package and indexes/searches using an OpenSearch cluster, the query is currently identical to that of ElasticSearch. At time of writing, Elastic.co’s documentation is far richer than OpenSearch’s, so a combination of links and terms from the two vendors may be used to reference the concepts discussed. For the purposes of this article, the two offerings are currently functionally equivalent.

Check out this post to run an OpenSearch cluster on your computer, or this one to run an ElasticSearch cluster. Don’t forget to check out the GitHub repository of sample code, via the Code Sloth Code Samples page – they’re ready to run!

What is an ElasticSearch Cardinality Aggregation?

ElasticSearch Cardinality Aggregation is a type of Metric Aggregation which calculates the number of distinct values for a given field. In this article we will use it to count the distinct terms in a keyword field.

Metric aggregations are uncommon for keywords though, as they typically focus on numeric field data types.
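To make "number of distinct values" concrete, here is a minimal in-memory sketch of the same idea, using plain LINQ rather than OpenSearch. The class and method names (CardinalitySketch, CountDistinct) are made up for illustration. Note one difference: this sketch counts exactly, whereas the cardinality aggregation is based on the HyperLogLog++ algorithm and can become approximate on fields with many distinct values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative analogy only: this is not how OpenSearch computes cardinality.
public static class CardinalitySketch
{
    // Each string is treated as one whole term, the way a keyword field
    // stores its value un-analysed. The comparison is ordinal, so "Mouse"
    // and "mouse" would count as two different terms.
    public static int CountDistinct(IEnumerable<string> keywordValues) =>
        keywordValues.Distinct(StringComparer.Ordinal).Count();
}
```

Calling CardinalitySketch.CountDistinct(new[] { "mouse", "mouse pad", "mouse", "mouse", "mouse pad" }) returns 2, because "mouse pad" is a single keyword term and not two words.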

Why use Cardinality in ElasticSearch

Cardinality in ElasticSearch may be useful in displaying the number of unique product types that exist in a catalog.

It can also be helpful to understand the breadth of terms for a given field before writing other OpenSearch aggregations. Bucket aggregations, for example, become less performant on data sets with high cardinality. By understanding in advance how many unique terms exist, you can mitigate the risk of poor performance by designing your system to account for slowness, or by tuning the way buckets are retrieved.
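One way to act on a known cardinality is to pick a bucket retrieval strategy before issuing the bucket aggregation. The sketch below is a hypothetical rule of thumb, not an OpenSearch API: the BucketRetrieval enum, the BucketPlanner class and the maxBucketsPerRequest threshold are all assumptions that you would tune for your own cluster. It simply assumes that a small number of distinct terms can be fetched by one terms aggregation, while a larger number is better paged through (for example with a composite aggregation).

```csharp
using System;

// Hypothetical strategies; the names are illustrative only.
public enum BucketRetrieval
{
    SingleTermsRequest, // fetch every bucket in one terms aggregation
    PagedComposite      // page through buckets in several requests
}

public static class BucketPlanner
{
    // distinctTerms: the value returned by a cardinality aggregation.
    // maxBucketsPerRequest: an assumed, application-specific limit.
    public static BucketRetrieval Choose(long distinctTerms, int maxBucketsPerRequest)
    {
        if (distinctTerms < 0)
            throw new ArgumentOutOfRangeException(nameof(distinctTerms));
        if (maxBucketsPerRequest <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxBucketsPerRequest));

        return distinctTerms <= maxBucketsPerRequest
            ? BucketRetrieval.SingleTermsRequest
            : BucketRetrieval.PagedComposite;
    }
}
```

With a limit of 1000 buckets per request, a field with 2 distinct terms would be fetched in a single request, while a field with 50,000 distinct terms would be paged.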

ElasticSearch Cardinality Aggregation Example

Let’s take a look at an example, which:

  • Indexes 5 documents, containing two distinct values in the product name keyword field (mouse and mouse pad)
  • Issues a cardinality aggregation on the name field
  • Asserts that we have calculated a cardinality of 2 names
[Fact]
public async Task KeywordMapping_CanBeUsedForMetricAggregation_Cardinality()
{
	var indexName = "keyword-index";
	await _fixture.PerformActionInTestIndex(
		indexName,
		mappingDescriptor,
		async (uniqueIndexName, opensearchClient) =>
		{
			var productDocuments = new[] {
				new ProductDocument(1, "mouse"),
				new ProductDocument(3, "mouse pad"),
				new ProductDocument(4, "mouse"),
				new ProductDocument(5, "mouse"),
				new ProductDocument(6, "mouse pad"),
			};

			await _fixture.IndexDocuments(uniqueIndexName, productDocuments);

			const string distinctProductTypes = "distinctProductTypes";

			var result = await opensearchClient.SearchAsync<ProductDocument>(selector => selector
				.Index(uniqueIndexName)
				.Query(query => query.MatchAll())
				// We do not want any documents returned; just the aggregations
				.Size(0)
				.Aggregations(aggregations => aggregations
					.Cardinality(distinctProductTypes, cardinality => cardinality.Field(field => field.Name))
				)
			);

			// Extract the total number of distinct product names
			result.IsValid.Should().BeTrue();
			var distinctProductCount = result.Aggregations.Cardinality(distinctProductTypes).Value;
			distinctProductCount.Should().Be(2);
		}
	);
}

Let’s take a look at the DebugInformation of the query:

Valid OpenSearch.Client response built from a successful (200) low level call on POST: /keyword-index3fc7ab40-cd30-4cb7-9312-0e20403b8697/_search?pretty=true&error_trace=true&typed_keys=true
 # Audit trail of this API call:
 - [1] HealthyResponse: Node: http://localhost:9200/ Took: 00:00:00.1506585
 # Request: {
    "aggs": {
        "distinctProductTypes": {
            "cardinality": {
                "field": "name"
            }
        }
    },
    "query": {
        "match_all": {}
    },
    "size": 0
}
 # Response: {
    "took": 21,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 5,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "cardinality#distinctProductTypes": {
            "value": 2
        }
    }
}

 # TCP states:
Established: 70
TimeWait: 5
CloseWait: 11

 # ThreadPool statistics:
Worker:
Busy: 1
Free: 32766
Min: 12
Max: 32767
IOCP:
Busy: 0
Free: 1000
Min: 12
Max: 1000

Analysing Cardinality in ElasticSearch

Unlike a regular search query, which returns our matching documents under the hits property, the results of our ElasticSearch cardinality aggregations can be found under the aggregations property. A single query may issue multiple aggregation requests, so each is broken down by the aggregation type and the name that was given to it.

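In our example, we can see cardinality#distinctProductTypes, with cardinality being our aggregation type, and distinctProductTypes being the string constant that we gave as the first parameter to our call to .Cardinality(...). The type prefix is present because the client sends typed_keys=true on the request, as seen in the request URL of the DebugInformation.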

The .NET API clients don’t have very clear documentation at time of writing, so it can be difficult to track down how to extract results. Luckily we have Code Sloth code samples! The response’s Aggregations property can be accessed like a dictionary, by passing the string key of our cardinality aggregation to its Cardinality(...) method. Given that cardinality returns a single numeric result, we can then inspect its .Value property.
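If you ever need to read the raw JSON response without the typed client, the same value can be pulled out by hand. The following is a minimal sketch using System.Text.Json. The AggregationResponseReader class and its ReadCardinality method are hypothetical helpers written for this article, not part of OpenSearch.Client. It assumes the response has the shape shown in the DebugInformation, and it checks both the typed key (cardinality#name) and the plain name in case typed_keys was not sent.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper for illustration; not part of any OpenSearch package.
public static class AggregationResponseReader
{
    // Returns the cardinality value, or null when the aggregation is absent.
    public static long? ReadCardinality(string responseJson, string aggregationName)
    {
        using var document = JsonDocument.Parse(responseJson);

        if (!document.RootElement.TryGetProperty("aggregations", out var aggregations))
            return null;

        // With typed_keys=true the key is "<type>#<name>"; otherwise just "<name>".
        if (aggregations.TryGetProperty($"cardinality#{aggregationName}", out var aggregation)
            || aggregations.TryGetProperty(aggregationName, out aggregation))
        {
            return aggregation.GetProperty("value").GetInt64();
        }

        return null;
    }
}
```

Passing the response body from this article together with the name distinctProductTypes returns 2. An unknown aggregation name returns null instead of throwing.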

Sloth Summary

The cardinality aggregation is simple to use once you understand how to parse its output. The use cases for this type of aggregation are also quite simple. Understanding how diverse the terms of a field are may be useful during new feature development or periodic analysis of a data set. However, you’ll likely find yourself reaching for other aggregations when building a robust search solution.
