Commonly Used Tokenizers
Elasticsearch provides various built-in tokenizers, each suited for different purposes.
Standard Tokenizer
The standard tokenizer is the default tokenizer used by Elasticsearch. It splits text into terms on word boundaries and removes most punctuation.
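Because the standard tokenizer is built in, it can be tried directly with the _analyze API without creating an index; the request below is a representative example rather than one taken from an existing index:
GET /_analyze
{
  "tokenizer": "standard",
  "text": "Elasticsearch is a powerful search engine!"
}
This yields the tokens Elasticsearch, is, a, powerful, search, and engine: the trailing exclamation mark is stripped, while case is left intact (lowercasing is handled by a separate token filter, not by the tokenizer itself).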
Whitespace Tokenizer
The whitespace tokenizer splits text into terms whenever it encounters whitespace.
PUT /whitespace_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "whitespace_tokenizer": {
          "type": "whitespace"
        }
      }
    }
  }
}
Analyzing text:
GET /whitespace_example/_analyze
{
  "tokenizer": "whitespace_tokenizer",
  "text": "Elasticsearch is a powerful search engine"
}
Output:
{
  "tokens": [
    { "token": "Elasticsearch", "start_offset": 0, "end_offset": 13, "type": "word", "position": 0 },
    { "token": "is", "start_offset": 14, "end_offset": 16, "type": "word", "position": 1 },
    { "token": "a", "start_offset": 17, "end_offset": 18, "type": "word", "position": 2 },
    { "token": "powerful", "start_offset": 19, "end_offset": 27, "type": "word", "position": 3 },
    { "token": "search", "start_offset": 28, "end_offset": 34, "type": "word", "position": 4 },
    { "token": "engine", "start_offset": 35, "end_offset": 41, "type": "word", "position": 5 }
  ]
}
In this example:
- The text is split into tokens wherever whitespace occurs; the whitespace tokenizer preserves case and punctuation, unlike the standard tokenizer.
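Defining the tokenizer in the index settings only makes it available; to index documents with it, wrap it in a custom analyzer and assign that analyzer to a field. A minimal sketch, using an illustrative index name and field name that are not part of the example above:
PUT /whitespace_mapping_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace_analyzer"
      }
    }
  }
}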
NGram Tokenizer
The ngram tokenizer breaks text into smaller chunks (n-grams) of specified lengths. It’s useful for partial matching and autocomplete features.
Note that recent Elasticsearch versions only allow max_gram to exceed min_gram by 1 by default, so the index-level max_ngram_diff setting is raised here to accommodate the 3-to-5 range.
PUT /ngram_example
{
  "settings": {
    "index": {
      "max_ngram_diff": 2
    },
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      }
    }
  }
}
Analyzing text:
GET /ngram_example/_analyze
{
  "tokenizer": "ngram_tokenizer",
  "text": "search"
}
Output:
{
  "tokens": [
    { "token": "sea", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "sear", "start_offset": 0, "end_offset": 4, "type": "word", "position": 1 },
    { "token": "searc", "start_offset": 0, "end_offset": 5, "type": "word", "position": 2 },
    { "token": "ear", "start_offset": 1, "end_offset": 4, "type": "word", "position": 3 },
    { "token": "earc", "start_offset": 1, "end_offset": 5, "type": "word", "position": 4 },
    { "token": "earch", "start_offset": 1, "end_offset": 6, "type": "word", "position": 5 },
    { "token": "arc", "start_offset": 2, "end_offset": 5, "type": "word", "position": 6 },
    { "token": "arch", "start_offset": 2, "end_offset": 6, "type": "word", "position": 7 },
    { "token": "rch", "start_offset": 3, "end_offset": 6, "type": "word", "position": 8 }
  ]
}
In this example:
- The text “search” is broken into every contiguous substring of 3 to 5 characters, which is what makes the ngram tokenizer useful for partial matching and autocomplete; a sketch of an autocomplete-style setup follows below.
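Because n-grams index every substring, the n-gram analyzer is usually applied only at index time, with a plain analyzer at query time so the user's input is not broken up as well. A minimal sketch of such a setup, assuming an illustrative index name and a hypothetical product_name field, and again raising max_ngram_diff for the 3-to-5 spread:
PUT /autocomplete_example
{
  "settings": {
    "index": {
      "max_ngram_diff": 2
    },
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product_name": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
With this mapping, documents are indexed as lowercased n-grams while queries go through the standard analyzer, so a partial input such as "sear" can still match "search".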
Full Text Search with Analyzer and Tokenizer
Elasticsearch is renowned for its powerful full-text search capabilities. At the heart of this functionality are analyzers and tokenizers, which play a crucial role in how text is processed and indexed. This guide will help you understand how analyzers and tokenizers work in Elasticsearch, with detailed examples and outputs to make these concepts easy to grasp.