Sunday, April 6, 2025

[Tutorial] OpenSearch Terms Aggregation Include Exclude Parameters on Keyword Fields

by Trent

The terms aggregation is an excellent tool for counting the terms that appear across our documents. It can be run on both single-value keyword fields and collections of strings. At times, though, we may not want to count every term that matches a query, and would rather focus on particular values to include or exclude. This article dives into how the OpenSearch terms aggregation include and exclude parameters limit what is counted.

The latest version of code snippets used in this article can be found via the Code Sloth Code Samples page, under Java Search samples.

Including Filtered Matches in Term Aggregations on Single Keyword Field Mappings

The code below contains an integration test that explores running a terms aggregation with an include term filter. This is our simplest example to start with.

Including Filtered Terms

/**
 * This test verifies that keyword fields can be used for filtered terms aggregation.
 * It demonstrates how to use the 'include' parameter with explicit term values to filter terms.
 * <p>
 * Unlike the regex version, this approach allows exact matching of specific terms without
 * the complexity of regular expressions.
 *
 * @param includeTerms    Array of terms to include in the aggregation
 * @param expectedResults The expected aggregation results in "term:count" format
 * @param description     A description of what the test case is evaluating
 * @throws Exception If an I/O error occurs
 */
@ParameterizedTest
@CsvSource({
        "mouse, mouse:3, 'Include only mouse - filters out mouse pad'",
        "mouse pad, mouse pad:2, 'Include only mouse pad - filters out mouse'",
        "'mouse, mouse pad', 'mouse:3, mouse pad:2', 'Include both terms - shows all terms'",
        "keyboard, '', 'Include non-existent term - no results'"
})
public void keywordMapping_CanBeUsedForFilteredTermsAggregation_OnSingleKeywordWithIncludeTerms(
        @ConvertWith(StringArrayConverter.class) String[] includeTerms, 
        String expectedResults, 
        String description) throws Exception {
    // Create a test index with keyword mapping for the Name field
    try (OpenSearchTestIndex testIndex = fixture.createTestIndex(mapping ->
            mapping.properties("name", Property.of(p -> p.keyword(k -> k))))) {

        // Create and index product documents
        ProductDocument[] productDocuments = new ProductDocument[]{
                new ProductDocument(1, "mouse", 1),
                new ProductDocument(2, "mouse pad", 2),
                new ProductDocument(3, "mouse", 3),
                new ProductDocument(4, "mouse", 4),
                new ProductDocument(5, "mouse pad", 5)
        };
        testIndex.indexDocuments(productDocuments);

        // Create a search request with terms aggregation and includes filter using explicit terms
        SearchRequest searchRequest = new SearchRequest.Builder()
                .index(testIndex.getName())
                .size(0) // We do not want any documents returned; just the aggregations
                .aggregations("product_counts", a -> a
                        .terms(t -> t
                                .field("name")
                                .size(10)
                                .include(i -> i.terms(Arrays.asList(includeTerms)))
                        )
                )
                .build();

        // Execute the search request
        SearchResponse<ProductDocument> response = openSearchClient.search(searchRequest, ProductDocument.class);

        // Verify the results
        assertThat(response.aggregations()).isNotNull();

        StringTermsAggregate termsAgg = response.aggregations().get("product_counts").sterms();

        // Extract each term and its associated number of hits
        Map<String, Long> bucketCounts = termsAgg.buckets().array().stream()
                .collect(Collectors.toMap(
                        StringTermsBucket::key,
                        StringTermsBucket::docCount
                ));

        // Format the results for verification
        String formattedResults = bucketCounts.entrySet().stream()
                .map(entry -> entry.getKey() + ":" + entry.getValue())
                .collect(Collectors.joining(", "));

        // Verify the expected results
        assertThat(formattedResults)
                .as(description)
                .isEqualTo(expectedResults);
    }
}

Let’s break down what the test is doing:

  • Creates a test index with a keyword mapping for the name field
  • Indexes five product documents, each with a single name: mouse or mouse pad
  • Builds a terms aggregation search request that filters the terms using an include parameter
    • This is supplied via the parameterised test
    • The terms method is used to supply the parameter(s) for exact match filtering

The test cases cover a single included term, both terms together (all results), and a term that does not exist (no results).
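
To make the behaviour of the exact-match include filter concrete without a running cluster, here is a minimal plain-Java sketch of the counting it performs. The class and method names are hypothetical and are not part of the OpenSearch client; the data mirrors the five names indexed by the test.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java model of an exact-match include filter; not an OpenSearch API.
final class IncludeTermsSketch {

    // Counts each name, but only if it appears in the include set.
    static Map<String, Long> countIncluded(List<String> names, Set<String> include) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String name : names) {
            if (include.contains(name)) {
                counts.merge(name, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The same five names that the test indexes
        List<String> names = List.of("mouse", "mouse pad", "mouse", "mouse", "mouse pad");

        System.out.println(countIncluded(names, Set.of("mouse")));              // {mouse=3}
        System.out.println(countIncluded(names, Set.of("mouse", "mouse pad"))); // {mouse=3, mouse pad=2}
        System.out.println(countIncluded(names, Set.of("keyboard")));           // {}
    }
}
```

The three calls in main line up with the single term, combined terms and non-existent term cases of the parameterised test.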

Including Filtered Regular Expressions

Similar to this approach, we can apply regular expressions to reduce the counted terms in the aggregation.

/**
 * This test verifies that keyword fields can be used for filtered terms aggregation.
 * It demonstrates how to use the 'include' parameter with regex patterns to filter terms.
 *
 * @param includesPattern The regex pattern to include terms
 * @param expectedResults The expected aggregation results in "term:count" format
 * @param description     A description of what the test case is evaluating
 * @throws Exception If an I/O error occurs
 */
@ParameterizedTest
@CsvSource({
        "mouse, mouse:3, 'Exact match - matches only the exact term'",
        "mouse.*, 'mouse:3, mouse pad:2', 'Prefix match - matches terms starting with mouse'",
        ".*pad, mouse pad:2, 'Suffix match - matches terms ending with pad'",
        "keyboard, '', 'No matches - pattern matches no terms'"
})
public void keywordMapping_CanBeUsedForFilteredTermsAggregation_OnSingleKeywordWithIncludeRegularExpression(String includesPattern, String expectedResults, String description) throws Exception {
    // Create a test index with keyword mapping for the Name field
    try (OpenSearchTestIndex testIndex = fixture.createTestIndex(mapping ->
            mapping.properties("name", Property.of(p -> p.keyword(k -> k))))) {

        // Create and index product documents
        ProductDocument[] productDocuments = new ProductDocument[]{
                new ProductDocument(1, "mouse", 1),
                new ProductDocument(2, "mouse pad", 2),
                new ProductDocument(3, "mouse", 3),
                new ProductDocument(4, "mouse", 4),
                new ProductDocument(5, "mouse pad", 5)
        };
        testIndex.indexDocuments(productDocuments);

        // Create a search request with terms aggregation and includes filter using regexp
        SearchRequest searchRequest = new SearchRequest.Builder()
                .index(testIndex.getName())
                .size(0) // We do not want any documents returned; just the aggregations
                .aggregations("product_counts", a -> a
                        .terms(t -> t
                                .field("name")
                                .size(10)
                                .include(i -> i.regexp(includesPattern))
                        )
                )
                .build();

        // Execute the search request
        SearchResponse<ProductDocument> response = openSearchClient.search(searchRequest, ProductDocument.class);

        // Verify the results
        assertThat(response.aggregations()).isNotNull();

        StringTermsAggregate termsAgg = response.aggregations().get("product_counts").sterms();

        // Extract each term and its associated number of hits
        Map<String, Long> bucketCounts = termsAgg.buckets().array().stream()
                .collect(Collectors.toMap(
                        StringTermsBucket::key,
                        StringTermsBucket::docCount
                ));

        // Format the results for verification
        String formattedResults = bucketCounts.entrySet().stream()
                .map(entry -> entry.getKey() + ":" + entry.getValue())
                .collect(Collectors.joining(", "));

        // Verify the expected results
        assertThat(formattedResults)
                .as(description)
                .isEqualTo(expectedResults);
    }
}

The structure of this test is the same as the terms test. The only difference is that the include call is supplied with the result of regexp instead of terms.
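
Both tests build their term:count string from a map produced by Collectors.toMap, which makes no guarantee about the type of map it returns, so the order of the joined string is not something the code controls. A hedged sketch of an order-preserving variant follows. It uses Map.Entry as a hypothetical stand-in for StringTermsBucket so that it runs without the OpenSearch client.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch only: Map.Entry<String, Long> stands in for StringTermsBucket (key + docCount).
final class BucketFormattingSketch {

    // Formats buckets as "term:count, term:count" in the order they were supplied.
    static String format(List<Map.Entry<String, Long>> buckets) {
        Map<String, Long> counts = buckets.stream()
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (first, second) -> first, // bucket keys are unique; keep the first if not
                        LinkedHashMap::new));     // preserves insertion order

        return counts.entrySet().stream()
                .map(entry -> entry.getKey() + ":" + entry.getValue())
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        System.out.println(format(List.of(
                Map.entry("mouse", 3L),
                Map.entry("mouse pad", 2L)))); // mouse:3, mouse pad:2
    }
}
```

If a test adopted this approach, its expected strings would need to list terms in the order the buckets arrive in the response.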

A Closer Look With The Debugger; Regular Expressions

To debug the test, locate the small green play button in the gutter to the left of the test method name, then select Debug. This will automatically use docker-compose to create a running OpenSearch cluster.

Set a breakpoint on the assertion that response.aggregations() is not null.

To prepare for observing our variables, open the Threads & Variables debug window.

Type the expressions below in the text box at the top of the panel. Add the expression as a watch to keep it for future debugging, or hit Enter to evaluate it once and see the result in the top row.

The request object can be seen by clicking the blue view text against a saved watch of searchRequest.toJsonString().

The two requests below compare the regular expression filter (first) with the terms filter from above (second). The only syntactic difference in the JSON is that the regular expression is a single string, whereas the terms filter is an array.

{
  "aggregations" : {
    "product_counts" : {
      "terms" : {
        "field" : "name",
        "include" : "mouse.*",
        "size" : 10
      }
    }
  },
  "size" : 0
}

{
  "aggregations" : {
    "product_counts" : {
      "terms" : {
        "field" : "name",
        "include" : [ "mouse" ],
        "size" : 10
      }
    }
  },
  "size" : 0
}

Enter response.toJsonString() to view the raw response from OpenSearch:

{
  "took" : 43,
  "timed_out" : false,
  "_shards" : {
    "failed" : 0.0,
    "skipped" : 0.0,
    "successful" : 1.0,
    "total" : 1.0
  },
  "hits" : {
    "total" : {
      "relation" : "eq",
      "value" : 5
    },
    "hits" : [ ]
  },
  "aggregations" : {
    "sterms#product_counts" : {
      "buckets" : [ {
        "doc_count" : 3,
        "key" : "mouse"
      }, {
        "doc_count" : 2,
        "key" : "mouse pad"
      } ],
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0
    }
  }
}

Setting size to 0 returns no hit documents, as expected: the hits array is empty even though the total still reports all 5 matching documents. This is a useful strategy to avoid wasted compute and bandwidth if we are only observing aggregations.

The aggregations section of the JSON response has a terms section that we named product_counts, which contains 2 buckets of terms that matched the regular expression:

  • mouse which was matched 3 times
  • mouse pad which was matched 2 times

The other tests demonstrate that mouse can be filtered out with .*pad and no results will be counted with a pattern such as keyboard which does not match.
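
The test cases show that the pattern has to match the whole term: mouse on its own does not pick up mouse pad. A minimal sketch of that whole-term matching is below. It is an approximation that uses java.util.regex, whereas OpenSearch evaluates the pattern with Lucene's regular expression syntax, which differs in some details. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Plain-Java model of a regex include filter; java.util.regex is an assumption, not what OpenSearch uses.
final class IncludeRegexSketch {

    // Counts each name whose ENTIRE value matches the pattern.
    static Map<String, Long> countMatching(List<String> names, String includePattern) {
        Pattern pattern = Pattern.compile(includePattern);
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String name : names) {
            if (pattern.matcher(name).matches()) { // matches() requires a full match, not a partial one
                counts.merge(name, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> names = List.of("mouse", "mouse pad", "mouse", "mouse", "mouse pad");

        System.out.println(countMatching(names, "mouse"));    // {mouse=3}
        System.out.println(countMatching(names, "mouse.*"));  // {mouse=3, mouse pad=2}
        System.out.println(countMatching(names, ".*pad"));    // {mouse pad=2}
        System.out.println(countMatching(names, "keyboard")); // {}
    }
}
```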

Including Filtered Matches in Term Aggregations on Collection Keyword Field Mappings

Another practical use case for a filtered terms aggregation is counting specific terms in an array field. While we could use a query to reduce our overall corpus (be it by a particular term or analyzed text), a plain terms aggregation will still count every value within the array of each matching document.

Query Filters Do Not Reduce Calculated Terms in Collections

This is demonstrated in the below test, which filters the corpus down to a single document. Despite this, all values in the names array are counted in the terms aggregation.

/**
 * This test verifies that terms aggregations on keyword arrays count all terms in matching documents,
 * even when the documents are filtered by a query.
 * <p>
 * When a document matches a query, all terms in its arrays are counted in the aggregation,
 * not just the terms that matched the query.
 *
 * @throws Exception If an I/O error occurs
 */
@Test
public void keywordMapping_TermsAggregationOnKeywordArrayCountsAllTermsWhenFiltered() throws Exception {
    // Create a test index with keyword mapping for the names array field
    try (OpenSearchTestIndex testIndex = fixture.createTestIndex(mapping ->
            mapping.properties("names", Property.of(p -> p.keyword(k -> k))))) {

        // Create and index product documents with array of names
        ProductDocumentWithMultipleNames[] productDocuments = new ProductDocumentWithMultipleNames[]{
                new ProductDocumentWithMultipleNames(1, new String[]{"mouse", "computer"}, 1),
                new ProductDocumentWithMultipleNames(2, new String[]{"mouse pad", "power cable"}, 2),
                new ProductDocumentWithMultipleNames(3, new String[]{"mouse", "mouse pad"}, 3),
                new ProductDocumentWithMultipleNames(4, new String[]{"mouse", "arm rest pad"}, 4),
                new ProductDocumentWithMultipleNames(5, new String[]{"mouse pad"}, 5)
        };
        testIndex.indexDocuments(productDocuments);

        // Create a search request with a term query on "computer" and a terms aggregation
        SearchRequest searchRequest = new SearchRequest.Builder()
                .index(testIndex.getName())
                // This test adds a query to reduce the overall applicable documents. Only 1 document will match and have its terms aggregated
                .query(q -> q
                        .term(t -> t
                                .field("names")
                                .value(FieldValue.of("computer"))
                        )
                )
                .size(0) // We do not want any documents returned; just the aggregations
                .aggregations("product_counts", a -> a
                        .terms(t -> t
                                .field("names")
                                .size(10)
                        )
                )
                .build();

        // Execute the search request
        SearchResponse<ProductDocumentWithMultipleNames> response = openSearchClient.search(searchRequest, ProductDocumentWithMultipleNames.class);

        // Verify the results
        assertThat(response.aggregations()).isNotNull();

        // Verify that the query matched only one document
        assertThat(response.hits().total().value()).isEqualTo(1);

        StringTermsAggregate termsAgg = response.aggregations().get("product_counts").sterms();

        // Extract each term and its associated number of hits
        Map<String, Long> bucketCounts = termsAgg.buckets().array().stream()
                .collect(Collectors.toMap(
                        StringTermsBucket::key,
                        StringTermsBucket::docCount
                ));

        // Format the results for verification
        String formattedResults = bucketCounts.entrySet().stream()
                .map(entry -> entry.getKey() + ":" + entry.getValue())
                .collect(Collectors.joining(", "));

        // Verify the expected results - both "mouse" and "computer" are counted in the one matching document.
        // Despite a single document being produced from the match term query, all of its values in the field are aggregated over
        assertThat(formattedResults).isEqualTo("mouse:1, computer:1");
    }
}

This test builds the following query:

{
  "aggregations" : {
    "product_counts" : {
      "terms" : {
        "field" : "names",
        "size" : 10
      }
    }
  },
  "query" : {
    "term" : {
      "names" : {
        "value" : "computer"
      }
    }
  },
  "size" : 0
}

This produces the following response:

{
  "took" : 64,
  "timed_out" : false,
  "_shards" : {
    "failed" : 0.0,
    "skipped" : 0.0,
    "successful" : 1.0,
    "total" : 1.0
  },
  "hits" : {
    "total" : {
      "relation" : "eq",
      "value" : 1
    },
    "hits" : [ ]
  },
  "aggregations" : {
    "sterms#product_counts" : {
      "buckets" : [ {
        "doc_count" : 1,
        "key" : "computer"
      }, {
        "doc_count" : 1,
        "key" : "mouse"
      } ],
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0
    }
  }
}

In the response, we can see that there was 1 hit, and both of its terms (computer and mouse) were counted once each. This confirms that the aggregation evaluates all terms on the matching documents, no matter what filters we apply in the query.
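
The two stages can be modelled in plain Java: the query selects documents, then the aggregation counts every value held by the selected documents. This is a simplified sketch with hypothetical names, where each document is represented only by its names array.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Plain-Java model of "term query, then terms aggregation" on an array field; not an OpenSearch API.
final class QueryThenAggregateSketch {

    static Map<String, Long> countTermsOfMatchingDocs(List<List<String>> docs, String queryTerm) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (List<String> names : docs) {
            // Stage 1: the query keeps or drops the whole document
            if (!names.contains(queryTerm)) {
                continue;
            }
            // Stage 2: every distinct value of a kept document is counted once for that document
            for (String name : new LinkedHashSet<>(names)) {
                counts.merge(name, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("mouse", "computer"),
                List.of("mouse pad", "power cable"),
                List.of("mouse", "mouse pad"),
                List.of("mouse", "arm rest pad"),
                List.of("mouse pad"));

        // Only the first document contains "computer", yet "mouse" is counted too
        System.out.println(countTermsOfMatchingDocs(docs, "computer")); // {mouse=1, computer=1}
    }
}
```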

Filtering Counted Collection Terms With Terms Filters

In the single keyword tests above, the include filter effectively removed whole documents from the count, because each document held only one value. On a collection field, the same filter can still eliminate documents from the aggregation, but it can also narrow the counted terms down to the relevant values within a document.

/**
 * This test verifies that arrays of keyword fields can be used for filtered terms aggregation
 * using explicit term lists rather than regular expressions.
 * <p>
 * Unlike the regex version, this approach allows exact matching of specific terms without
 * the complexity of regular expressions.
 * <p>
 * See {@link KeywordAggregationTests#keywordMapping_CanBeUsedForTermsAggregationOnKeywordArray} for the base case
 * that counts all elements across all document arrays without filtering.
 *
 * @param includeTerms    Array of terms to include in the aggregation
 * @param expectedResults The expected aggregation results in "term:count" format
 * @param description     A description of what the test case is evaluating
 * @throws Exception If an I/O error occurs
 */
@ParameterizedTest
@CsvSource({
        "mouse, mouse:3, 'Include only mouse - matches exact term'",
        "mouse pad, mouse pad:3, 'Include only mouse pad - matches exact term'",
        "'mouse, mouse pad', 'mouse:3, mouse pad:3', 'Include mouse terms - matches both terms'",
        "'mouse, computer', 'mouse:3, computer:1', 'Include mixed terms - matches one common and one rare term'",
        "keyboard, '', 'Include non-existent term - no results'"
})
public void keywordMapping_CanBeUsedForFilteredTermsAggregation_OnKeywordArrayWithIncludeTerms(
        @ConvertWith(StringArrayConverter.class) String[] includeTerms,
        String expectedResults,
        String description) throws Exception {
    // Create a test index with keyword mapping for the names array field
    try (OpenSearchTestIndex testIndex = fixture.createTestIndex(mapping ->
            mapping.properties("names", Property.of(p -> p.keyword(k -> k))))) {

        // Create and index product documents with array of names
        ProductDocumentWithMultipleNames[] productDocuments = new ProductDocumentWithMultipleNames[]{
                new ProductDocumentWithMultipleNames(1, new String[]{"mouse", "computer"}, 1),
                new ProductDocumentWithMultipleNames(2, new String[]{"mouse pad", "power cable"}, 2),
                new ProductDocumentWithMultipleNames(3, new String[]{"mouse", "mouse pad"}, 3),
                new ProductDocumentWithMultipleNames(4, new String[]{"mouse", "arm rest pad"}, 4),
                new ProductDocumentWithMultipleNames(5, new String[]{"mouse pad"}, 5)
        };
        testIndex.indexDocuments(productDocuments);

        // The includeTerms is now directly a String array, no need for parsing

        // Create a search request with terms aggregation and includes filter using explicit terms
        SearchRequest searchRequest = new SearchRequest.Builder()
                .index(testIndex.getName())
                .size(0) // We do not want any documents returned; just the aggregations
                .aggregations("product_counts", a -> a
                        .terms(t -> t
                                .field("names")
                                .size(10)
                                .include(i -> i.terms(Arrays.asList(includeTerms)))
                        )
                )
                .build();

        // Execute the search request
        SearchResponse<ProductDocumentWithMultipleNames> response = openSearchClient.search(searchRequest, ProductDocumentWithMultipleNames.class);

        // Verify the results
        assertThat(response.aggregations()).isNotNull();

        StringTermsAggregate termsAgg = response.aggregations().get("product_counts").sterms();

        // Extract each term and its associated number of hits
        Map<String, Long> bucketCounts = termsAgg.buckets().array().stream()
                .collect(Collectors.toMap(
                        StringTermsBucket::key,
                        StringTermsBucket::docCount
                ));

        // Format the results for verification
        String formattedResults = bucketCounts.entrySet().stream()
                .map(entry -> entry.getKey() + ":" + entry.getValue())
                .collect(Collectors.joining(", "));

        // Verify the expected results
        assertThat(formattedResults)
                .as(description)
                .isEqualTo(expectedResults);
    }
}

This test indexes documents with multiple names. Test cases provide inputs to assert single terms and combinations of terms.

When we supply multiple terms to the filter, they are applied to the target array field with a logical OR. This can be seen in the mouse, computer case, which would only produce mouse:1, computer:1 if a logical AND were performed. Instead, all 3 mouse terms are counted.

This produces the following request:

{
  "aggregations" : {
    "product_counts" : {
      "terms" : {
        "field" : "names",
        "include" : [ "mouse", "computer" ],
        "size" : 10
      }
    }
  },
  "size" : 0
}

The response can be seen as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "failed" : 0.0,
    "skipped" : 0.0,
    "successful" : 1.0,
    "total" : 1.0
  },
  "hits" : {
    "total" : {
      "relation" : "eq",
      "value" : 5
    },
    "hits" : [ ]
  },
  "aggregations" : {
    "sterms#product_counts" : {
      "buckets" : [ {
        "doc_count" : 3,
        "key" : "mouse"
      }, {
        "doc_count" : 1,
        "key" : "computer"
      } ],
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0
    }
  }
}
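
The OR behaviour seen in the mouse, computer case can be contrasted with a hypothetical AND in a small plain-Java sketch. Neither method is an OpenSearch API; countWithAnd exists only to show the result we would have seen had the terms been combined with AND.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java model comparing OR and (hypothetical) AND semantics of a terms include list.
final class IncludeOrSemanticsSketch {

    // Observed behaviour: a value is counted if it equals ANY of the included terms.
    static Map<String, Long> countWithOr(List<List<String>> docs, Set<String> include) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (List<String> names : docs) {
            for (String name : new LinkedHashSet<>(names)) {
                if (include.contains(name)) {
                    counts.merge(name, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    // Hypothetical alternative: only documents holding ALL included terms contribute.
    static Map<String, Long> countWithAnd(List<List<String>> docs, Set<String> include) {
        List<List<String>> docsWithAllTerms = new ArrayList<>();
        for (List<String> names : docs) {
            if (names.containsAll(include)) {
                docsWithAllTerms.add(names);
            }
        }
        return countWithOr(docsWithAllTerms, include);
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("mouse", "computer"),
                List.of("mouse pad", "power cable"),
                List.of("mouse", "mouse pad"),
                List.of("mouse", "arm rest pad"),
                List.of("mouse pad"));
        Set<String> include = Set.of("mouse", "computer");

        System.out.println(countWithOr(docs, include));  // {mouse=3, computer=1}
        System.out.println(countWithAnd(docs, include)); // {mouse=1, computer=1}
    }
}
```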

Filtering Counted Collection Terms With Regular Expressions

Similar to the single keyword scenario, we can run regular expressions over collections of keywords.

/**
 * This test verifies that arrays of keyword fields can be used for filtered terms aggregation.
 * It demonstrates how to use the 'include' parameter with regex patterns to filter terms.
 * <p>
 * See {@link KeywordAggregationTests#keywordMapping_CanBeUsedForTermsAggregationOnKeywordArray} for the base case
 * that counts all elements across all document arrays without filtering.
 * <p>
 * See {@link KeywordAggregationTests#keywordMapping_TermsAggregationOnKeywordArrayCountsAllTermsWhenFiltered} for an example
 * of how all terms in an array are counted in the aggregation even when documents are filtered by a query on one of those terms.
 *
 * @param includesPattern The regex pattern to include terms
 * @param expectedResults The expected aggregation results in "term:count" format
 * @param description     A description of what the test case is evaluating
 * @throws Exception If an I/O error occurs
 */
@ParameterizedTest
@CsvSource({
        "mouse, mouse:3, 'Exact match - matches only the exact term'",
        "mouse.*, 'mouse:3, mouse pad:3', 'Prefix match - matches terms starting with mouse'",
        ".*pad, 'mouse pad:3, arm rest pad:1', 'Suffix match - matches terms ending with pad'",
        "keyboard, '', 'No matches - pattern matches no terms'",
        "'.*', 'mouse:3, computer:1, mouse pad:3, power cable:1, arm rest pad:1', 'Match all - pattern matches all terms'"
})
public void keywordMapping_CanBeUsedForFilteredTermsAggregation_OnKeywordArrayWithIncludeRegularExpression(String includesPattern, String expectedResults, String description) throws Exception {
    // Create a test index with keyword mapping for the names array field
    try (OpenSearchTestIndex testIndex = fixture.createTestIndex(mapping ->
            mapping.properties("names", Property.of(p -> p.keyword(k -> k))))) {

        // Create and index product documents with array of names
        ProductDocumentWithMultipleNames[] productDocuments = new ProductDocumentWithMultipleNames[]{
                new ProductDocumentWithMultipleNames(1, new String[]{"mouse", "computer"}, 1),
                new ProductDocumentWithMultipleNames(2, new String[]{"mouse pad", "power cable"}, 2),
                new ProductDocumentWithMultipleNames(3, new String[]{"mouse", "mouse pad"}, 3),
                new ProductDocumentWithMultipleNames(4, new String[]{"mouse", "arm rest pad"}, 4),
                new ProductDocumentWithMultipleNames(5, new String[]{"mouse pad"}, 5)
        };
        testIndex.indexDocuments(productDocuments);

        // Create a search request with terms aggregation and includes filter using regexp
        SearchRequest searchRequest = new SearchRequest.Builder()
                .index(testIndex.getName())
                .size(0) // We do not want any documents returned; just the aggregations
                .aggregations("product_counts", a -> a
                        .terms(t -> t
                                .field("names")
                                .size(10)
                                .include(i -> i.regexp(includesPattern))
                        )
                )
                .build();

        // Execute the search request
        SearchResponse<ProductDocumentWithMultipleNames> response = openSearchClient.search(searchRequest, ProductDocumentWithMultipleNames.class);

        // Verify the results
        assertThat(response.aggregations()).isNotNull();

        StringTermsAggregate termsAgg = response.aggregations().get("product_counts").sterms();

        // Extract each term and its associated number of hits
        Map<String, Long> bucketCounts = termsAgg.buckets().array().stream()
                .collect(Collectors.toMap(
                        StringTermsBucket::key,
                        StringTermsBucket::docCount
                ));

        // Format the results for verification
        String formattedResults = bucketCounts.entrySet().stream()
                .map(entry -> entry.getKey() + ":" + entry.getValue())
                .collect(Collectors.joining(", "));

        // Verify the expected results
        assertThat(formattedResults)
                .as(description)
                .isEqualTo(expectedResults);
    }
}

This produces the following request:

{
  "aggregations" : {
    "product_counts" : {
      "terms" : {
        "field" : "names",
        "include" : "mouse.*",
        "size" : 10
      }
    }
  },
  "size" : 0
}

The response is as follows:

{
  "took" : 48,
  "timed_out" : false,
  "_shards" : {
    "failed" : 0.0,
    "skipped" : 0.0,
    "successful" : 1.0,
    "total" : 1.0
  },
  "hits" : {
    "total" : {
      "relation" : "eq",
      "value" : 5
    },
    "hits" : [ ]
  },
  "aggregations" : {
    "sterms#product_counts" : {
      "buckets" : [ {
        "doc_count" : 3,
        "key" : "mouse"
      }, {
        "doc_count" : 3,
        "key" : "mouse pad"
      } ],
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0
    }
  }
}

In this example, we can see that the mouse.* prefix pattern counts mouse and mouse pad, but not computer or any other term.

Excluding Terms From Aggregation

The same approach can be applied using the exclude parameter. As the name indicates, this will exclude any term matching the provided pattern or term list.

Example requests from a Code Sloth Code Sample test are shown below. The first uses a regular expression exclude filter, and the second a terms exclude filter.

{
  "aggregations" : {
    "product_counts" : {
      "terms" : {
        "exclude" : "mouse.*",
        "field" : "names",
        "size" : 10
      }
    }
  },
  "size" : 0
}

{
  "aggregations" : {
    "product_counts" : {
      "terms" : {
        "exclude" : [ "mouse", "mouse pad" ],
        "field" : "names",
        "size" : 10
      }
    }
  },
  "size" : 0
}
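
As a rough illustration of what these two exclude requests count, the sketch below models both filter styles as a predicate over terms. It assumes the same five documents as the collection tests above, uses java.util.regex as an approximation of the Lucene regular expression syntax that OpenSearch applies, and all names in it are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Plain-Java model of an exclude filter on an array field; not an OpenSearch API.
final class ExcludeSketch {

    // Counts every value except those the exclude predicate accepts.
    static Map<String, Long> countExcluding(List<List<String>> docs, Predicate<String> exclude) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (List<String> names : docs) {
            for (String name : new LinkedHashSet<>(names)) {
                if (!exclude.test(name)) {
                    counts.merge(name, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    // Regex style: excludes terms that the pattern matches in full.
    static Predicate<String> matchingRegex(String pattern) {
        Pattern compiled = Pattern.compile(pattern);
        return term -> compiled.matcher(term).matches();
    }

    // Terms style: excludes terms that equal any listed value.
    static Predicate<String> anyOf(String... terms) {
        Set<String> excluded = Set.of(terms);
        return excluded::contains;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("mouse", "computer"),
                List.of("mouse pad", "power cable"),
                List.of("mouse", "mouse pad"),
                List.of("mouse", "arm rest pad"),
                List.of("mouse pad"));

        // Both print {computer=1, power cable=1, arm rest pad=1}
        System.out.println(countExcluding(docs, matchingRegex("mouse.*")));
        System.out.println(countExcluding(docs, anyOf("mouse", "mouse pad")));
    }
}
```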

Query Time vs. Indexing Time Tradeoffs in Terms Aggregations

Suppose you are working with a single-value field (not a collection). In that case, it may be better to index a second representation of the data and filter with a query, rather than relying on the terms aggregation to filter on your behalf.

  • Field 1: used for filtering applicable documents. For a prefix match that would otherwise need a regular expression, an edge n-gram field (produced by either an edge n-gram tokenizer or token filter) could be used in the query part of the request to reduce the overall number of applicable documents
  • Field 2: a keyword field, used to aggregate the terms over the applicable corpus. If you only need exact match filtering, a second field is not required, and the keyword field can serve both purposes.

Query time and indexing time strategies have different pros and cons:

  • Using a single field simplifies the solution and reduces the amount of data we need to store. However, this may come with a performance cost at query time, because the aggregation evaluates more documents than necessary.
  • Using multiple fields takes more space, but can potentially improve query performance. If the query sufficiently reduces the number of matching documents, an execution hint can avoid global ordinals so that the terms are counted in memory.
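
To illustrate the two-field idea, the sketch below models what an edge n-gram representation of a name might hold and how a prefix lookup against it selects the documents whose exact names are then counted. The helper names and the gram lengths are assumptions for illustration only; a real index would produce the n-grams once at indexing time through an analyzer rather than on every lookup.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of filtering on an edge n-gram field and aggregating on a keyword field.
final class EdgeNGramSketch {

    // Leading substrings of the term, from minGram (assumed >= 1) up to maxGram characters long.
    static List<String> edgeNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, term.length());
        for (int length = minGram; length <= longest; length++) {
            grams.add(term.substring(0, length));
        }
        return grams;
    }

    // Field 1 (the n-grams) decides which documents apply; Field 2 (the exact name) is counted.
    static Map<String, Long> countByPrefix(List<String> names, String prefix) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String name : names) {
            // Computed on the fly here purely to keep the sketch short
            if (edgeNGrams(name, 1, 20).contains(prefix)) {
                counts.merge(name, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> names = List.of("mouse", "mouse pad", "mouse", "mouse", "mouse pad");

        System.out.println(edgeNGrams("mouse", 1, 20));    // [m, mo, mou, mous, mouse]
        System.out.println(countByPrefix(names, "mou"));     // {mouse=3, mouse pad=2}
        System.out.println(countByPrefix(names, "mouse p")); // {mouse pad=2}
    }
}
```

The prefix check becomes an exact token lookup against the n-grams, which is the kind of work the query part of a request is suited to, while the aggregation only sees documents that passed it.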

Sloth Summary

In this post, we explored how to perform filtered terms aggregations using the include and exclude parameters on keyword fields.

We learned that filtering documents with a query does not limit which terms are counted on the matching documents when the aggregated field is a collection: every value in each matching array is counted.

We also observed that a logical OR is performed when multiple terms are provided in the terms filter against a collection of keywords.

Happy terms aggregation filtering 🦥
