Stemmer token filter

Provides algorithmic stemming for several languages, some with additional variants. For a list of supported languages, see the language parameter.

When not customized, the filter uses the porter stemming algorithm for English.

Example

The following analyze API request uses the stemmer filter’s default porter stemming algorithm to stem "the foxes jumping quickly" to "the fox jump quickli":

resp = client.indices.analyze(
    tokenizer="standard",
    filter=[
        "stemmer"
    ],
    text="the foxes jumping quickly",
)
print(resp)

response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      'stemmer'
    ],
    text: 'the foxes jumping quickly'
  }
)
puts response

const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: ["stemmer"],
  text: "the foxes jumping quickly",
});
console.log(response);

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "stemmer" ],
  "text": "the foxes jumping quickly"
}

The filter produces the following tokens:

[ the, fox, jump, quickli ]
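To check the filter's output in a script, the token strings can be pulled out of the analyze response. The following is a minimal sketch, assuming the response body carries a `tokens` array whose entries each have a `token` field. The helper name `token_texts` and the hand-written sample response are illustrative only and are not part of any client library.

```python
from typing import Any, List, Mapping


def token_texts(analyze_response: Mapping[str, Any]) -> List[str]:
    """Return the token strings from an analyze API response body."""
    return [entry["token"] for entry in analyze_response["tokens"]]


# Hand-written sample shaped like an analyze response (assumed and abbreviated;
# a real response also carries offsets and token types).
sample_response = {
    "tokens": [
        {"token": "the", "position": 0},
        {"token": "fox", "position": 1},
        {"token": "jump", "position": 2},
        {"token": "quickli", "position": 3},
    ]
}

print(token_texts(sample_response))
```

In a live session, the `resp` object returned by `client.indices.analyze(...)` would take the place of `sample_response`.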

Add to an analyzer

The following create index API request uses the stemmer filter to configure a new custom analyzer.

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "stemmer"
                    ]
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'whitespace',
            filter: [
              'stemmer'
            ]
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "whitespace",
          filter: ["stemmer"],
        },
      },
    },
  },
});
console.log(response);

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "stemmer" ]
        }
      }
    }
  }
}

Configurable parameters

language

(Optional, string) Language-dependent stemming algorithm used to stem tokens. If both this and the name parameter are specified, the language parameter argument is used.

Valid values for language

Valid values are sorted by language. Defaults to english. Where a language has more than one algorithm, the recommended one is listed first.

Arabic: arabic
Armenian: armenian
Basque: basque
Bengali: bengali
Brazilian Portuguese: brazilian
Bulgarian: bulgarian
Catalan: catalan
Czech: czech
Danish: danish
Dutch: dutch, dutch_kp (deprecated in 8.16.0; will be removed in a future version)
English: english, light_english, lovins (deprecated in 8.16.0; will be removed in a future version), minimal_english, porter2, possessive_english
Estonian: estonian
Finnish: finnish, light_finnish
French: light_french, french, minimal_french
Galician: galician, minimal_galician (Plural step only)
German: light_german, german, german2, minimal_german
Greek: greek
Hindi: hindi
Hungarian: hungarian, light_hungarian
Indonesian: indonesian
Irish: irish
Italian: light_italian, italian
Kurdish (Sorani): sorani
Latvian: latvian
Lithuanian: lithuanian
Norwegian (Bokmål): norwegian, light_norwegian, minimal_norwegian
Norwegian (Nynorsk): light_nynorsk, minimal_nynorsk
Persian: persian
Portuguese: light_portuguese, minimal_portuguese, portuguese, portuguese_rslp
Romanian: romanian
Russian: russian, light_russian
Serbian: serbian
Spanish: light_spanish, spanish, spanish_plural
Swedish: swedish, light_swedish
Turkish: turkish
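Because dutch_kp and lovins are marked as deprecated in the list above, it can be useful to scan filter definitions for them before an upgrade. The following is a minimal sketch, assuming the definitions are available as a plain dict shaped like the `filter` section of the index analysis settings. The helper name `deprecated_stemmers` and the sample definitions are hypothetical.

```python
from typing import Any, List, Mapping

# Values flagged as deprecated in 8.16.0 in the list above.
DEPRECATED_LANGUAGES = {"dutch_kp", "lovins"}


def deprecated_stemmers(filter_definitions: Mapping[str, Mapping[str, Any]]) -> List[str]:
    """Return the names of stemmer filters that use a deprecated algorithm."""
    flagged = []
    for filter_name, params in filter_definitions.items():
        if params.get("type") != "stemmer":
            continue  # only stemmer filters are of interest here
        # The language value takes precedence over its alias, name.
        language = params.get("language", params.get("name"))
        if language in DEPRECATED_LANGUAGES:
            flagged.append(filter_name)
    return sorted(flagged)


# Hypothetical filter definitions used only for illustration.
sample_filters = {
    "old_english_stemmer": {"type": "stemmer", "language": "lovins"},
    "german_stemmer": {"type": "stemmer", "language": "light_german"},
    "lowercase_copy": {"type": "lowercase"},
}

print(deprecated_stemmers(sample_filters))
```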

name

An alias for the language parameter. If both this and the language parameter are specified, the language parameter argument is used.
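The precedence between language and name can be modeled in a few lines. This sketch only mirrors the behavior documented here (language wins over name, and english is the default) and is not the Elasticsearch implementation. The function name `effective_language` is an assumption made for illustration.

```python
from typing import Any, Mapping

DEFAULT_LANGUAGE = "english"  # default stated for the language parameter


def effective_language(filter_params: Mapping[str, Any]) -> str:
    """Model the documented precedence: language, then name, then the default."""
    if "language" in filter_params:
        return filter_params["language"]
    if "name" in filter_params:
        return filter_params["name"]
    return DEFAULT_LANGUAGE


# Both parameters set: the language value is used.
print(effective_language({"type": "stemmer", "language": "light_german", "name": "german"}))
# Only the alias set: the name value is used.
print(effective_language({"type": "stemmer", "name": "german"}))
# Neither set: the default applies.
print(effective_language({"type": "stemmer"}))
```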

Customize

To customize the stemmer filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a custom stemmer filter that stems words using the light_german algorithm:

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "my_stemmer"
                    ]
                }
            },
            "filter": {
                "my_stemmer": {
                    "type": "stemmer",
                    "language": "light_german"
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'standard',
            filter: [
              'lowercase',
              'my_stemmer'
            ]
          }
        },
        filter: {
          my_stemmer: {
            type: 'stemmer',
            language: 'light_german'
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "standard",
          filter: ["lowercase", "my_stemmer"],
        },
      },
      filter: {
        my_stemmer: {
          type: "stemmer",
          language: "light_german",
        },
      },
    },
  },
});
console.log(response);

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stemmer"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        }
      }
    }
  }
}
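When the same analyzer layout is needed for several languages, the settings shown above can be generated rather than written by hand. The following is a minimal sketch, assuming the same standard tokenizer and lowercase filter as in the example. The function `stemmer_index_settings` and its default names are hypothetical, and the client call is left as a comment because it needs a running cluster.

```python
from typing import Any, Dict


def stemmer_index_settings(
    language: str,
    filter_name: str = "my_stemmer",
    analyzer_name: str = "my_analyzer",
) -> Dict[str, Any]:
    """Build index settings with a custom stemmer filter for the given language."""
    return {
        "analysis": {
            "analyzer": {
                analyzer_name: {
                    "tokenizer": "standard",
                    "filter": ["lowercase", filter_name],
                }
            },
            "filter": {
                filter_name: {"type": "stemmer", "language": language},
            },
        }
    }


settings = stemmer_index_settings("light_german")
print(settings)

# With a connected client, the settings could then be passed as in the example above:
# client.indices.create(index="my-index-000001", settings=settings)
```

The language value is passed through unchecked, so a value outside the list of valid values would only be rejected by Elasticsearch when the index is created.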
