Inverted Index for Queries
1. What is an Inverted Index?
An Inverted Index is a data structure used to quickly find documents containing specific terms. It is one of the core technologies of search engines.
1.1 Basic Concepts
- Forward Index: Document ID → Document Content (list of terms)
- Inverted Index: Term → List of Document IDs containing the term
1.2 Why is it called "Inverted"?
An Inverted Index reverses the traditional relationship of "which terms a document contains" to "in which documents a term appears", hence the name "inverted".
2. Structure of an Inverted Index
2.1 Basic Structure
Term → Document Frequency → Posting List
2.2 Detailed Structure
Term → {
  DocFreq: N,
  Postings: [
    {DocID: 1, Frequency: 2, Positions: [0, 5]},
    {DocID: 3, Frequency: 1, Positions: [2]}
  ]
}
3. How an Inverted Index Works
3.1 Building Process
- Document Preprocessing: Tokenization, stop word removal, stemming
- Term Statistics: Count the frequency and position of each term in documents
- Index Construction: Establish the mapping relationship from terms to documents
3.2 Query Process
- Query Parsing: Tokenize the query string
- Index Lookup: Search for each term in the inverted index
- Result Merging: Merge document lists for multiple terms
- Sorted Return: Return results sorted by relevance
4. Implementing an Inverted Index in Go
4.1 Data Structure Definition
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Document holds a document's ID and raw text.
type Document struct {
	ID   int
	Text string
}

// Posting records where and how often a term occurs in one document.
type Posting struct {
	DocID     int
	Frequency int
	Positions []int
}

// InvertedIndexItem is one entry in the index: a term plus its postings.
type InvertedIndexItem struct {
	Term     string
	DocFreq  int
	Postings []Posting
}

// InvertedIndex maps each term to its index entry.
type InvertedIndex struct {
	Index map[string]*InvertedIndexItem
}

// NewInvertedIndex creates an empty inverted index.
func NewInvertedIndex() *InvertedIndex {
	return &InvertedIndex{
		Index: make(map[string]*InvertedIndexItem),
	}
}
4.2 Index Construction
// AddDocument tokenizes a document's text and adds it to the index.
func (idx *InvertedIndex) AddDocument(docID int, text string) {
	// Naive whitespace tokenization; real systems need a proper analyzer.
	words := strings.Fields(strings.ToLower(text))
	for pos, word := range words {
		if idx.Index[word] == nil {
			idx.Index[word] = &InvertedIndexItem{
				Term:     word,
				DocFreq:  0,
				Postings: make([]Posting, 0),
			}
		}
		// Check whether a posting for this document already exists.
		var posting *Posting
		for i := range idx.Index[word].Postings {
			if idx.Index[word].Postings[i].DocID == docID {
				posting = &idx.Index[word].Postings[i]
				break
			}
		}
		if posting == nil {
			// Create a new posting.
			newPosting := Posting{
				DocID:     docID,
				Frequency: 1,
				Positions: []int{pos},
			}
			idx.Index[word].Postings = append(idx.Index[word].Postings, newPosting)
			idx.Index[word].DocFreq++
		} else {
			// Update the existing posting.
			posting.Frequency++
			posting.Positions = append(posting.Positions, pos)
		}
	}
}
4.3 Query Implementation
// Search returns the IDs of documents containing a single term.
func (idx *InvertedIndex) Search(term string) []int {
	term = strings.ToLower(term)
	if item, exists := idx.Index[term]; exists {
		docIDs := make([]int, len(item.Postings))
		for i, posting := range item.Postings {
			docIDs[i] = posting.DocID
		}
		return docIDs
	}
	return []int{}
}

// SearchAnd returns documents containing all of the given terms (AND).
func (idx *InvertedIndex) SearchAnd(terms []string) []int {
	if len(terms) == 0 {
		return []int{}
	}
	// Start from the first term's result set.
	result := idx.Search(terms[0])
	// Intersect with each remaining term's results.
	for i := 1; i < len(terms); i++ {
		otherResult := idx.Search(terms[i])
		result = intersect(result, otherResult)
	}
	return result
}

// SearchOr returns documents containing any of the given terms (OR).
func (idx *InvertedIndex) SearchOr(terms []string) []int {
	if len(terms) == 0 {
		return []int{}
	}
	resultSet := make(map[int]bool)
	for _, term := range terms {
		docIDs := idx.Search(term)
		for _, docID := range docIDs {
			resultSet[docID] = true
		}
	}
	result := make([]int, 0, len(resultSet))
	for docID := range resultSet {
		result = append(result, docID)
	}
	sort.Ints(result)
	return result
}

// intersect returns the intersection of two ID slices.
func intersect(a, b []int) []int {
	set := make(map[int]bool)
	for _, x := range a {
		set[x] = true
	}
	result := make([]int, 0)
	for _, x := range b {
		if set[x] {
			result = append(result, x)
		}
	}
	return result
}
4.4 Complete Example
func main() {
	// Create the inverted index.
	index := NewInvertedIndex()
	// Documents to index.
	documents := []Document{
		{ID: 1, Text: "Go is a programming language"},
		{ID: 2, Text: "Go is fast and efficient"},
		{ID: 3, Text: "Programming in Go is fun"},
		{ID: 4, Text: "Go language is simple"},
	}
	// Build the index.
	for _, doc := range documents {
		index.AddDocument(doc.ID, doc.Text)
	}
	// Example queries.
	fmt.Println("Search 'go':", index.Search("go"))
	fmt.Println("Search 'programming':", index.Search("programming"))
	fmt.Println("Search 'go' AND 'language':", index.SearchAnd([]string{"go", "language"}))
	fmt.Println("Search 'go' OR 'fast':", index.SearchOr([]string{"go", "fast"}))
	// Dump the index structure.
	fmt.Println("\nInverted index structure:")
	for term, item := range index.Index {
		fmt.Printf("Term: %s, DocFreq: %d\n", term, item.DocFreq)
		for _, posting := range item.Postings {
			fmt.Printf("  DocID: %d, Frequency: %d, Positions: %v\n",
				posting.DocID, posting.Frequency, posting.Positions)
		}
	}
}
5. Optimizing Inverted Indexes
5.1 Compression Techniques
- Variable-length encoding: Compress document IDs using variable-length encoding
- Differential encoding: Store the difference in document IDs instead of absolute values
- Bitmap compression: Use bitmaps to represent document sets
5.2 Query Optimization
- Skip lists: Quickly locate positions in long lists
- Caching mechanism: Cache popular query results
- Parallel querying: Process queries using multiple threads
6. Practical Application Scenarios
6.1 Search Engines
- Core technology for search engines like Google and Baidu
- Web content indexing and retrieval
6.2 Database Systems
- Full-text search functionality
- Fast querying of text fields
6.3 Code Search
- GitHub code search
- Code navigation in IDEs
6.4 Log Analysis
- Fast retrieval of log files
- Locating error logs
7. Performance Analysis
7.1 Time Complexity
- Index Construction: O(N×M), where N is the number of documents and M is the average number of terms per document
- Single-term Query: O(1) on average for the hash lookup, plus the length of the posting list
- Multi-term Query: roughly proportional to the combined length of the posting lists being intersected or merged
7.2 Space Complexity
- Storage Space: O(V×D), where V is the vocabulary size and D is the average document frequency
7.3 Pros and Cons
Pros:
- Fast query speed
- Supports complex queries
- Easy to implement
Cons:
- Time-consuming index construction
- Large storage space
- Complex index updates
8. Summary
The inverted index is a core technology in information retrieval. By reversing the "document-term" relationship to a "term-document" relationship, it achieves efficient text search. In Go language, we can use basic data structures like maps and slices to implement an inverted index, providing powerful search capabilities for applications.
multi_match Usage Guide
multi_match is a query type in ES that searches across multiple fields simultaneously, essentially an extension of the match query to multiple fields. It is suitable for combined retrieval across multiple text fields like title, description, and tags, often used with field boosting, different query types, and analyzers.
1. Basic Usage
POST /index/_search
{
"query": {
"multi_match": {
"query": "iPhone 15",
"fields": ["title", "description", "tags"]
}
}
}
2. Field Boosting (boost)
POST /index/_search
{
"query": {
"multi_match": {
"query": "iPhone 15",
"fields": ["title^3", "description^1.5", "tags"]
}
}
}
Explanation: title^3 means the match score for the title field is multiplied by a weight of 3, thereby boosting the score of results hitting this field during sorting.
3. type Option and Applicable Scenarios
- best_fields (default): Uses the single best-matching field's score as the document's score; tie_breaker can add a fraction of the other matching fields' scores
POST /index/_search
{
"query": {
"multi_match": {
"query": "apple phone",
"fields": ["title", "description", "tags"],
"type": "best_fields",
"tie_breaker": 0.2
}
}
}
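The scoring rule behind best_fields with tie_breaker can be written out directly: take the best field's score, then add tie_breaker times each other matching field's score. A small Go sketch with made-up per-field scores (illustrative inputs, not real BM25 values):

```go
package main

import "fmt"

// bestFieldsScore combines per-field match scores the way
// best_fields does: the highest score, plus tieBreaker times
// every other matching field's score.
func bestFieldsScore(fieldScores []float64, tieBreaker float64) float64 {
	best, sum := 0.0, 0.0
	for _, s := range fieldScores {
		if s > best {
			best = s
		}
		sum += s
	}
	return best + tieBreaker*(sum-best)
}

func main() {
	scores := []float64{3, 2, 1} // e.g. title, description, tags (hypothetical)
	fmt.Println(bestFieldsScore(scores, 0))   // 3 (pure best_fields)
	fmt.Println(bestFieldsScore(scores, 0.5)) // 4.5 = 3 + 0.5*(2+1)
}
```

With tie_breaker at 0 only the best field counts; at 1 the behavior approaches most_fields-style summing.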
- most_fields: Scores from multiple fields are summed, suitable when the same semantic meaning is distributed across multiple fields (e.g., a single text split and stored in different fields)
POST /index/_search
{
"query": {
"multi_match": {
"query": "iphone",
"fields": ["title", "title.ngram", "description"],
"type": "most_fields"
}
}
}
- cross_fields: Treats multiple fields as one large field for matching, suitable for scenarios where terms are distributed across different fields (e.g., first_name + last_name)
POST /index/_search
{
"query": {
"multi_match": {
"query": "tim cook",
"fields": ["first_name", "last_name"],
"type": "cross_fields",
"operator": "and"
}
}
}
- phrase: Phrase matching, requires strict word order and proximity, suitable for exact phrase searches
POST /index/_search
{
"query": {
"multi_match": {
"query": "iphone 15 pro",
"fields": ["title", "description"],
"type": "phrase"
}
}
}
- phrase_prefix: Phrase prefix matching, suitable for input method suggestions/search suggestions
POST /index/_search
{
"query": {
"multi_match": {
"query": "iph 15",
"fields": ["title", "description"],
"type": "phrase_prefix",
"max_expansions": 50
}
}
}
4. Operator and Minimum Match
POST /index/_search
{
"query": {
"multi_match": {
"query": "apple flagship phone",
"fields": ["title", "description"],
"operator": "and",
"minimum_should_match": "75%"
}
}
}
Explanation:
- operator: and requires every query term to match; or (the default) matches if any single term matches
- minimum_should_match controls the minimum number or proportion of terms that must match, e.g. 2, 3<75%, or 75%
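The percentage form of minimum_should_match rounds down. A simplified Go sketch covering only the two simplest forms (the full ES syntax, with negative values and combinations like 3<75%, has more cases than this handles):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// requiredMatches computes how many of the query's optional clauses
// must match, for a plain integer ("3") or a percentage ("75%",
// rounded down). Error handling is omitted for brevity.
func requiredMatches(spec string, totalClauses int) int {
	if strings.HasSuffix(spec, "%") {
		pct, _ := strconv.Atoi(strings.TrimSuffix(spec, "%"))
		return totalClauses * pct / 100 // integer division = round down
	}
	n, _ := strconv.Atoi(spec)
	return n
}

func main() {
	// "apple flagship phone" tokenizes into 3 clauses.
	fmt.Println(requiredMatches("75%", 3)) // 2 (floor of 2.25)
	fmt.Println(requiredMatches("2", 3))   // 2
}
```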
5. Fuzziness and Correction
POST /index/_search
{
"query": {
"multi_match": {
"query": "iphine",
"fields": ["title", "description"],
"fuzziness": "AUTO",
"prefix_length": 1
}
}
}
Explanation: fuzziness: AUTO provides fault tolerance for common spelling errors; prefix_length specifies the length of the prefix that must match exactly.
6. Analyzer and Field Selection
POST /index/_search
{
"query": {
"multi_match": {
"query": "苹果 手机",
"fields": ["title", "title.keyword^5", "description"],
"analyzer": "ik_smart"
}
}
}
Suggestions:
- Mostly used on text fields for full-text retrieval; for exact matching and aggregation/sorting, use keyword fields (optionally with boost)
- For Chinese retrieval, analyzers such as ik_smart and ik_max_word can be used (requires plugin installation)
7. Combined Example (Comprehensive fields, weights, filtering, and sorting)
POST /products/_search
{
"_source": ["id", "title", "price", "brand"],
"from": 0,
"size": 20,
"sort": [
{"_score": "desc"},
{"price": "asc"}
],
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "iphone 15 pro",
"fields": ["title^4", "subtitle^2", "description", "tags"],
"type": "best_fields",
"tie_breaker": 0.3,
"minimum_should_match": "66%"
}
}
],
"filter": [
{"term": {"brand": "apple"}},
{"range": {"price": {"gte": 3000, "lte": 10000}}}
]
}
},
"highlight": {
"fields": {
"title": {},
"description": {}
}
}
}
8. Common Issues and Suggestions
- Relevance is not ideal:
  - Give core fields higher weights (e.g. title^N)
  - Choose the appropriate type: cross_fields when terms are spread across fields, most_fields when scores should be summed
  - Use synonyms, spell correction (fuzziness), and domain-specific dictionaries
- Performance issues:
  - Limit returned fields (_source filtering) and size
  - Put filter conditions in filter, which hits the cache and does not participate in scoring
  - Avoid wildcard/phrase_prefix prefix expansion across a huge number of fields
- Exact vs. full-text:
  - Use keyword for exact matching and aggregation; use text + analyzer for full-text retrieval
  - Create multi-fields (text + keyword) on the same business field
term Query Explained
The term query is a query type in ES used for exact matching. It does not perform tokenization on the query term but directly performs an exact match with the terms in the index. It is suitable for keyword type fields, numeric fields, date fields, etc. (No tokenization, no lowercasing).
1. Basic Usage
POST /products/_search
{
"query": {
"term": {
"status": "active"
}
}
}
2. Multi-field term Query
POST /products/_search
{
"query": {
"bool": {
"must": [
{"term": {"status": "active"}},
{"term": {"category": "electronics"}},
{"term": {"brand": "apple"}}
]
}
}
}
3. Exact Matching for Numeric Fields
POST /products/_search
{
"query": {
"term": {
"price": 5999
}
}
}
4. Exact Matching for Date Fields
POST /products/_search
{
"query": {
"term": {
"created_date": "2025-01-18"
}
}
}
5. Array Field Matching
POST /products/_search
{
"query": {
"term": {
"tags": "phone"
}
}
}
6. Using boost to Increase Weight
POST /products/_search
{
"query": {
"term": {
"status": {
"value": "active",
"boost": 2.0
}
}
}
}
7. terms Query (Multi-value Matching)
POST /products/_search
{
"query": {
"terms": {
"status": ["active", "pending", "review"]
}
}
}
8. Combined with filter
POST /products/_search
{
"query": {
"bool": {
"must": [
{"match": {"title": "iPhone"}}
],
"filter": [
{"term": {"status": "active"}},
{"term": {"category": "electronics"}}
]
}
}
}
term vs match Query Comparison
1. Core Differences
| Feature | term Query | match Query |
|---|---|---|
| Tokenization | No tokenization, exact match | Tokenizes the query term |
| Matching Method | Exact match of terms in the index | Matches analyzed tokens, supports relevance scoring |
| Applicable Fields | keyword, numeric, date, etc. | text type fields |
| Performance | Faster (no relevance calculation) | Slower (requires score calculation) |
| Caching | Results can be cached | Results are usually not cached |
2. Practical Example Comparison
2.1 Different Results for the Same Query Term
# Prepare data (assumes title is a text field with a keyword sub-field
# in the mapping; multi-fields are built automatically, not sent in the document)
POST /test/_doc/1
{
"title": "iPhone 15 Pro Max",
"status": "active"
}
# term query - exact match on the keyword sub-field
POST /test/_search
{
"query": {
"term": {
"title.keyword": "iPhone 15 Pro Max"
}
}
}
# Result: match
# term query against a text field (usually does not match)
POST /test/_search
{
"query": {
"term": {
"title": "iPhone 15 Pro Max"
}
}
}
# Result: no match (title is tokenized into ["iphone", "15", "pro", "max"])
# match query against a text field
POST /test/_search
{
"query": {
"match": {
"title": "iPhone 15 Pro Max"
}
}
}
# Result: match, with a relevance score
2.2 Partial Match Comparison
# term query - a partial phrase does not match
POST /test/_search
{
"query": {
"term": {
"title.keyword": "iPhone 15"
}
}
}
# Result: no match (the whole value must be identical)
# match query - partial terms match
POST /test/_search
{
"query": {
"match": {
"title": "iPhone 15"
}
}
}
# Result: match, with a lower relevance score
3. Usage Scenario Comparison
3.1 term Query Applicable Scenarios
# 1. Status filtering
POST /products/_search
{
"query": {
"bool": {
"filter": [
{"term": {"status": "active"}}
]
}
}
}
# 2. Category filtering
POST /products/_search
{
"query": {
"bool": {
"filter": [
{"term": {"category": "electronics"}}
]
}
}
}
# 3. Tag matching
POST /products/_search
{
"query": {
"bool": {
"filter": [
{"term": {"tags": "premium"}}
]
}
}
}
# 4. Aggregation statistics
POST /products/_search
{
"size": 0,
"aggs": {
"status_count": {
"terms": {
"field": "status"
}
}
}
}
3.2 match Query Applicable Scenarios
# 1. Full-text search
POST /products/_search
{
"query": {
"match": {
"title": "iPhone 15 Pro"
}
}
}
# 2. Description search
POST /products/_search
{
"query": {
"match": {
"description": "Latest款手机"
}
}
}
# 3. Multi-field search
POST /products/_search
{
"query": {
"multi_match": {
"query": "苹果手机",
"fields": ["title", "description", "tags"]
}
}
}
4. Performance Comparison
4.1 Query Performance
# term query - high performance
POST /products/_search
{
"query": {
"bool": {
"filter": [
{"term": {"status": "active"}},
{"term": {"category": "electronics"}}
]
}
}
}
# Characteristics: no relevance calculation; results can be cached
# match query - relatively slower
POST /products/_search
{
"query": {
"bool": {
"must": [
{"match": {"title": "iPhone"}},
{"match": {"description": "手机"}}
]
}
}
}
# Characteristics: relevance scores must be computed; results are usually not cached
4.2 Mixed Usage Optimization
# Best practice: term for filtering, match for searching
POST /products/_search
{
"query": {
"bool": {
"must": [
{"match": {"title": "iPhone 15"}}
],
"filter": [
{"term": {"status": "active"}},
{"term": {"category": "electronics"}},
{"range": {"price": {"gte": 1000, "lte": 10000}}}
]
}
}
}
5. Common Errors and Solutions
5.1 Using term Query on text Fields
# Incorrect usage
POST /products/_search
{
"query": {
"term": {
"title": "iPhone" # title 是 text 字段,会被分词
}
}
}
# Correct usage
POST /products/_search
{
"query": {
"term": {
"title.keyword": "iPhone" # 使用 keyword 字段
}
}
}
# Or use match
POST /products/_search
{
"query": {
"match": {
"title": "iPhone"
}
}
}
5.2 Case Sensitivity Issues
# term queries are case-sensitive
POST /products/_search
{
"query": {
"term": {
"status": "Active" # 如果索引中是 "active",则不匹配
}
}
}
# Solution: keep the case consistent at index and query time, or add a
# lowercase normalizer to the keyword field so "Active" and "active" both match
POST /products/_search
{
"query": {
"term": {
"status": "active"
}
}
}
6. Best Practice Recommendations
6.1 Field Mapping Design
# Create a mapping that supports both query types
PUT /products
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_smart",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"status": {
"type": "keyword"
},
"price": {
"type": "double"
}
}
}
}
6.2 Query Combination Strategy
# Recommended: exact filtering + full-text search
POST /products/_search
{
"query": {
"bool": {
"must": [
{"match": {"title": "用户搜索词"}}
],
"filter": [
{"term": {"status": "active"}},
{"term": {"category": "electronics"}},
{"range": {"price": {"gte": 1000}}}
]
}
},
"sort": [
{"_score": "desc"},
{"price": "asc"}
]
}
7. Summary
- term query: Suitable for exact matching, filtering, aggregation; better performance, results can be cached
- match query: Suitable for full-text search, fuzzy matching; supports relevance scoring
- Best practice: term for filter conditions, match for search content; use both in combination
- Field design: Create keyword sub-fields for fields requiring exact matching
- Performance optimization: Place exact matching conditions in the filter to avoid unnecessary scoring calculations
ES Mapping Concepts and Usage
1. What is Mapping
Mapping is the "structure definition" of an index, similar to the schema of a relational database table. It is used to declare the type and indexing method of each field, determining:
- The data type and storage format of the field (text, keyword, numeric, date, boolean, geo, nested, etc.)
- Whether it participates in the inverted index and how it is tokenized (index, analyzer)
- Whether it can be used for aggregation/sorting (doc_values)
- Multi-field definition: indexing the same business field in multiple ways
- Dynamic field handling strategy (dynamic)
Since ES 7, an index has only one type (internal _doc), and modeling directly faces "index + mapping".
2. Common Field Types and Scenarios
- text: Tokenized, used for full-text retrieval; not suitable for aggregation/sorting
- keyword: Not tokenized, suitable for exact matching, aggregation, and sorting; has doc_values by default
- Numeric and date: integer/long/double/date, etc., suitable for range filtering, aggregation, and sorting
- Structured: object (flattened into the same document), nested (each object in an array is indexed independently and supports independent sub-queries)
- Geographic: geo_point/geo_shape
Typical multi-fields (for both full-text and exact matching):
"title": {
"type": "text",
"analyzer": "ik_smart",
"fields": {
"keyword": { "type": "keyword", "ignore_above": 256 }
}
}
3. Creating an Index and Explicitly Setting Mapping
PUT /products
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1
},
"mappings": {
"dynamic": "true",
"properties": {
"title": {
"type": "text",
"analyzer": "standard",
"fields": {
"keyword": { "type": "keyword", "ignore_above": 256 }
}
},
"price": { "type": "double" },
"status": { "type": "keyword" },
"createdAt": { "type": "date" },
"tags": { "type": "keyword" },
"attrs": { "type": "object" },
"specs": { "type": "nested" }
}
}
}
4. Viewing/Updating Mapping
- View Mapping
GET /products/_mapping
- Add Fields (only new fields can be added, existing field types cannot be changed)
PUT /products/_mapping
{
"properties": {
"brand": { "type": "keyword" }
}
}
5. Correct Way to Change Field Type (Reindexing)
- Create a new index products_v2 with the correct mapping
- Migrate the data
POST /_reindex
{
"source": { "index": "products" },
"dest": { "index": "products_v2" }
}
- Switch traffic using aliases
POST /_aliases
{
"actions": [
{ "remove": { "index": "products", "alias": "products_read" }},
{ "add": { "index": "products_v2", "alias": "products_read" }}
]
}
6. Dynamic Mapping Strategy
"mappings": {
"dynamic": "strict",
"properties": { /* 显式列出字段,未知字段将被拒绝 */ }
}
It is recommended to use strict for core indexes to prevent dirty data from being automatically inferred into incorrect types (e.g., treating numeric values as text).
7. Performance and Practice Essentials
- Only enable index for fields that need to be searched/filtered; purely display fields can use index: false
- Fields that need aggregation/sorting should keep doc_values: true (text has no doc_values)
- For Chinese text, install the IK analyzer and set the analyzer on text fields
- Use nested for arrays of objects to avoid the cross-matching that object causes
- Use multi-fields to support both full-text and exact matching on the same field
In a nutshell: Mapping determines "how fields are stored, indexed, and queried". Before building an index, clarify your query and aggregation requirements, then design the mapping to achieve correct and high-performance retrieval.
ES Analyzer Usage and Explanation
1. What is an Analyzer
An Analyzer is a component that "normalizes → tokenizes → filters" text fields during writing/searching, typically consisting of three parts:
- char_filter: Character-level preprocessing (e.g. stripping HTML tags)
- tokenizer: Splits text into tokens, such as standard, whitespace, ik_smart
- filter: Further processes tokens (lowercasing, stop-word removal, synonyms, stemming, etc.)
The field's analyzer is used during the writing phase, and the same analyzer is used by default during the search phase, but can be specified separately via search_analyzer.
2. Built-in Common Analyzers
- standard (default): General-purpose tokenization with lowercasing
- simple: Splits on non-letters, lowercases
- whitespace: Splits only on whitespace, preserves case
- stop: Like simple, but also removes stop words
- keyword: No tokenization; the entire input becomes a single token (keyword fields mostly use a normalizer instead)
- pattern: Splits on a regular expression
Commonly used for Chinese: ik_smart, ik_max_word (requires plugin installation).
3. Using _analyze to Test Tokenization Effect
POST /_analyze
{
"analyzer": "standard",
"text": "iPhone 15 Pro Max"
}
POST /_analyze
{
"analyzer": "ik_smart",
"text": "苹果手机保护壳"
}
4. Setting analyzer and search_analyzer on Fields
PUT /docs
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
}
}
}
}
Explanation:
- Index with the finer-grained ik_max_word so more token variants enter the inverted index (higher recall), and query with the coarser ik_smart so searches stay precise.
5. Temporarily Specifying an Analyzer During Query (Without Changing Mapping)
POST /docs/_search
{
"query": {
"match": {
"title": {
"query": "苹果手机",
"analyzer": "ik_max_word"
}
}
}
}
6. Custom Analyzer (Including Synonyms/Stop Words)
PUT /articles
{
"settings": {
"analysis": {
"filter": {
"my_synonyms": {
"type": "synonym",
"synonyms": ["iphone,苹果手机", "notebook,笔记本"]
}
},
"analyzer": {
"my_zh_analyzer": {
"type": "custom",
"char_filter": ["html_strip"],
"tokenizer": "ik_smart",
"filter": ["lowercase", "my_synonyms"]
}
}
}
},
"mappings": {
"properties": {
"content": { "type": "text", "analyzer": "my_zh_analyzer" }
}
}
}
7. Normalizer (Standardization for keyword)
keyword fields are not tokenized and cannot use an analyzer; if lowercasing or punctuation removal is needed, a normalizer can be used:
PUT /users
{
"settings": {
"analysis": {
"normalizer": {
"lowercase_normalizer": {
"type": "custom",
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"properties": {
"email": { "type": "keyword", "normalizer": "lowercase_normalizer" }
}
}
}
8. IK Analyzer Installation and Field Example (Brief)
- Installation (pick the build matching your ES version): run bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/... and restart ES
- Usage:
PUT /goods
{
"mappings": {
"properties": {
"title": { "type": "text", "analyzer": "ik_smart", "search_analyzer": "ik_max_word" }
}
}
}
9. Considerations for Changing Analyzers
- The analyzer of an existing field generally cannot be directly modified; a "reindex" process is required
- Different analyzers will affect the inverted index structure, so re-verify query semantics and relevance after changes
10. Performance and Practice
- Index with a finer-grained analyzer (e.g. ik_max_word) and query with a coarser one (e.g. ik_smart); this keeps recall high while keeping queries precise
- Use _analyze to verify that tokenization meets expectations; fields that are frequently filtered on should use keyword + normalizer
- Control the number of fields and the tokenization granularity to avoid index bloat; manage synonym lists externally so they are easy to update
ES Terms and Glossary Quick Reference
The following concepts are categorized by topic for quick understanding and reference.
Index and Document Modeling
- Index: A logical container for a collection of documents, similar to a database. Internally composed of multiple shards
- Document: A single record, stored as JSON, uniquely identified by _id
- Field: A document attribute, determining available query and aggregation methods
- Mapping: Definition of field types and indexing strategies, equivalent to a table schema
- Type: A logical "table" concept in 6.x and below; fixed to _doc from 7.x and hidden externally in 8.x
- Text: A tokenized field type, used for full-text retrieval, not suitable for aggregation/sorting
- Keyword: Not tokenized, suitable for exact matching and aggregation/sorting, usually has doc_values
- Multi-fields: Indexing the same field in multiple ways, such as title and title.keyword
- Object: Object field; properties are flattened and merged into the same document
- Nested: Nested object; each array element is indexed independently, avoiding cross-matching, and can be queried independently
- Dynamic mapping: Strategy for when unknown fields appear (true/false/strict)
Tokenization and Normalization
- Analyzer: The text-analysis component, with three stages: char_filter → tokenizer → filter
- Tokenizer: Component that splits text into tokens, such as standard, whitespace, ik_smart
- Token: The basic unit in the inverted index (a normalized/tokenized lexeme)
- Char filter: Character-level preprocessing, such as html_strip
- Token filter: Further processing of tokens, such as lowercase, synonym, stop
- Normalizer: Normalization for keyword fields (lowercasing, accent removal, etc.), no tokenization
Inverted Index and Scoring
- Inverted index: Index structure mapping term → document list (postings)
- Term: A term in the index (a token after normalization/tokenization)
- Posting: Document occurrence information, including doc ID, frequency, positions, etc.
- Relevance score: The score used to rank results
- BM25: Default relevance model (replaces TF-IDF)
- Query vs Filter: Query participates in scoring; Filter only performs boolean filtering and can be cached
- Bool query: Combination query with must/should/must_not/filter
Storage and Segments
- Segment: Immutable data segment created by appending writes; merging reduces the number of segments
- Refresh: Writes in-memory data to new segments (default interval 1s); documents become searchable after refresh
- Flush: Persists the translog and creates a new commit point
- Translog: Write-ahead log, used for crash recovery
- Doc values: Columnar storage supporting aggregation/sorting/scripting; text has no doc values
- _source: The original JSON document, stored by default, used for retrieval and reindex
- Stored fields: Separately stored fields (not commonly used), distinct from _source
- Norms: Field-length normalization and other scoring factors; can be disabled to save space
Cluster and Shards
- Cluster: An ES cluster composed of multiple nodes
- Node: An instance in the cluster; common roles: master, data, ingest, coordinating
- Shard (Primary shard): The physical sharding unit of an index; count is fixed at creation
- Replica: A copy of a primary shard, improving availability and query throughput
- Routing: Determines which primary shard a document goes to, defaults to a hash of _id
- Alias: An alias that can point to one or more indexes, enabling seamless switching
Writing and Batch Processing
- Bulk API: Batch write/update/delete
- Update by query: Batch update by condition
- Delete by query: Batch delete by condition
- Reindex: Copy from a source index to a target index (often used for mapping changes)
- Ingest pipeline: Pre-write processing pipeline (grok, rename, set, script, etc.)
- Painless: ES's built-in scripting language, used for scripted updates, script-based sorting, etc.
Search and Pagination
- Match: Full-text query, analyzed
- Term/Terms: Exact match, not analyzed
- Range: Range query (numeric/date)
- Multi-match: Multi-field full-text query
- Nested query: Sub-query for nested fields
- Aggregation: Aggregation analysis (terms, stats, date_histogram, range, etc.)
- Highlight: Highlights matching snippets
- Suggesters: Search suggestions (term/phrase/completion)
- From/size: Basic pagination; deep pagination is costly
- Search after: Cursor-based pagination, replaces deep pagination
- Scroll: Snapshot-style cursor for bulk export, not for real-time queries
- PIT (Point in time): Point-in-time snapshot for consistent, stable pagination
Lifecycle and Index Management
- ILM (Index Lifecycle Management): Hot/Warm/Cold/Delete lifecycle policies
- Rollover: Switch to a new index based on size/document count/age
- Snapshot/Restore: Snapshot and recovery (repositories can use S3, HDFS, etc.)
Operations and Performance
- Cluster health: Cluster health status (green/yellow/red)
- Refresh interval: Refresh period; can be increased in write-heavy scenarios to speed up writes
- Replicas: Number of replicas affects query throughput and write cost
- Force merge: Merge segments of read-only indexes, reducing file count and improving query performance
- Slow logs: Slow query and slow indexing logs, used for troubleshooting and optimization
The above terms cover high-frequency concepts in ES daily modeling, writing, retrieval, aggregation, and operational optimization. Combined with the preceding chapters on Mapping, Analyzer, term/match, and multi_match, they form a complete knowledge graph for ES usage.
Theme test article, for testing purposes only. Published by Walker; please credit the source when reposting: https://walker-learn.xyz/archives/4784