Commonly Used Tokenizers
Elasticsearch provides various built-in tokenizers, each suited for different purposes.
Standard Tokenizer
The standard tokenizer is the default tokenizer used by Elasticsearch. It splits text into terms on word boundaries and removes most punctuation.
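Because the standard tokenizer is built in, it can be tried directly with the _analyze API without creating an index; the request below is a representative example rather than one taken from an existing index:
GET /_analyze
{
  "tokenizer": "standard",
  "text": "Elasticsearch is a powerful search engine!"
}
This yields the tokens Elasticsearch, is, a, powerful, search, and engine: the trailing exclamation mark is stripped, while case is left intact (lowercasing is handled by a separate token filter, not by the tokenizer itself).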
Whitespace Tokenizer
The whitespace tokenizer splits text into terms whenever it encounters whitespace.
PUT /whitespace_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "whitespace_tokenizer": {
          "type": "whitespace"
        }
      }
    }
  }
}
Analyzing text:
GET /whitespace_example/_analyze
{
  "tokenizer": "whitespace_tokenizer",
  "text": "Elasticsearch is a powerful search engine"
}
Output:
{
  "tokens": [
    { "token": "Elasticsearch", "start_offset": 0, "end_offset": 13, "type": "word", "position": 0 },
    { "token": "is", "start_offset": 14, "end_offset": 16, "type": "word", "position": 1 },
    { "token": "a", "start_offset": 17, "end_offset": 18, "type": "word", "position": 2 },
    { "token": "powerful", "start_offset": 19, "end_offset": 27, "type": "word", "position": 3 },
    { "token": "search", "start_offset": 28, "end_offset": 34, "type": "word", "position": 4 },
    { "token": "engine", "start_offset": 35, "end_offset": 41, "type": "word", "position": 5 }
  ]
}
In this example:
- The text is split into tokens wherever whitespace occurs; the whitespace tokenizer preserves case and punctuation, unlike the standard tokenizer.
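Defining the tokenizer in the index settings only makes it available; to index documents with it, wrap it in a custom analyzer and assign that analyzer to a field. A minimal sketch, using an illustrative index name and field name that are not part of the example above:
PUT /whitespace_mapping_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace_analyzer"
      }
    }
  }
}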
NGram Tokenizer
The ngram tokenizer breaks text into smaller chunks (n-grams) of specified lengths. It’s useful for partial matching and autocomplete features.
Note that recent Elasticsearch versions only allow max_gram to exceed min_gram by 1 by default, so the index-level max_ngram_diff setting is raised here to accommodate the 3-to-5 range.
PUT /ngram_example
{
  "settings": {
    "index": {
      "max_ngram_diff": 2
    },
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      }
    }
  }
}
Analyzing text:
GET /ngram_example/_analyze
{
  "tokenizer": "ngram_tokenizer",
  "text": "search"
}
Output:
{
  "tokens": [
    { "token": "sea", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "sear", "start_offset": 0, "end_offset": 4, "type": "word", "position": 1 },
    { "token": "searc", "start_offset": 0, "end_offset": 5, "type": "word", "position": 2 },
    { "token": "ear", "start_offset": 1, "end_offset": 4, "type": "word", "position": 3 },
    { "token": "earc", "start_offset": 1, "end_offset": 5, "type": "word", "position": 4 },
    { "token": "earch", "start_offset": 1, "end_offset": 6, "type": "word", "position": 5 },
    { "token": "arc", "start_offset": 2, "end_offset": 5, "type": "word", "position": 6 },
    { "token": "arch", "start_offset": 2, "end_offset": 6, "type": "word", "position": 7 },
    { "token": "rch", "start_offset": 3, "end_offset": 6, "type": "word", "position": 8 }
  ]
}
In this example:
- The text “search” is broken into every contiguous substring of 3 to 5 characters, which is what makes the ngram tokenizer useful for partial matching and autocomplete; a sketch of an autocomplete-style setup follows below.
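Because n-grams index every substring, the n-gram analyzer is usually applied only at index time, with a plain analyzer at query time so the user's input is not broken up as well. A minimal sketch of such a setup, assuming an illustrative index name and a hypothetical product_name field, and again raising max_ngram_diff for the 3-to-5 spread:
PUT /autocomplete_example
{
  "settings": {
    "index": {
      "max_ngram_diff": 2
    },
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product_name": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
With this mapping, documents are indexed as lowercased n-grams while queries go through the standard analyzer, so a partial input such as "sear" can still match "search".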
Full Text Search with Analyzer and Tokenizer
Elasticsearch is renowned for its powerful full-text search capabilities. At the heart of this functionality are analyzers and tokenizers, which play a crucial role in how text is processed and indexed. This guide will help you understand how analyzers and tokenizers work in Elasticsearch, with detailed examples and outputs to make these concepts easy to grasp.