Inverted Index for Queries
1. What is an Inverted Index?
An Inverted Index is a data structure used for quickly finding documents that contain specific terms. It is one of the core technologies of search engines.
1.1 Basic Concepts
- Forward Index: Document ID → Document Content (list of terms)
- Inverted Index: Term → List of Document IDs containing the term
1.2 Why is it called "Inverted"?
The Inverted Index reverses the traditional relationship of "which terms a document contains" to "in which documents a term appears," hence the name "inverted."
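The contrast between the two layouts can be sketched with Go maps; a minimal toy example (the invert helper is ours for illustration, not part of the implementation in section 4):

```go
package main

import (
	"fmt"
	"sort"
)

// invert builds an inverted index (term → sorted doc IDs)
// from a forward index (doc ID → terms in the document).
func invert(forward map[int][]string) map[string][]int {
	inverted := map[string][]int{}
	for docID, terms := range forward {
		for _, t := range terms {
			inverted[t] = append(inverted[t], docID)
		}
	}
	// Sort each posting list so lookups return deterministic output.
	for t := range inverted {
		sort.Ints(inverted[t])
	}
	return inverted
}

func main() {
	forward := map[int][]string{
		1: {"go", "is", "fast"},
		2: {"go", "is", "fun"},
	}
	inv := invert(forward)
	// "Which documents contain 'go'?" is now a single map lookup.
	fmt.Println(inv["go"])   // [1 2]
	fmt.Println(inv["fast"]) // [1]
}
```

With the forward index, answering "which documents contain 'go'?" requires scanning every document; with the inverted index it is one hash lookup.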
2. Structure of an Inverted Index
2.1 Basic Structure
Term → Document Frequency → Posting List
2.2 Detailed Structure
Term → {
    DocFreq: N,
    Postings: [
        {DocID: 1, Frequency: 2, Positions: [0, 5]},
        {DocID: 3, Frequency: 1, Positions: [2]}
    ]
}
3. How an Inverted Index Works
3.1 Building Process
- Document Preprocessing: Tokenization, stop word removal, stemming
- Term Statistics: Count the frequency and position of each term in documents
- Index Construction: Establish mapping relationships from terms to documents
3.2 Query Process
- Query Parsing: Tokenize the query string
- Index Lookup: Look up each term in the inverted index
- Result Merging: Merge document lists for multiple terms
- Sorting and Returning: Sort results by relevance and return
4. Implementing an Inverted Index in Go
4.1 Data Structure Definition
package main

import (
    "fmt"
    "sort"
    "strings"
)

// Document represents a document with its ID and raw text.
type Document struct {
    ID   int
    Text string
}

// Posting records one term's occurrences within a single document.
type Posting struct {
    DocID     int
    Frequency int
    Positions []int
}

// InvertedIndexItem is one entry of the inverted index.
type InvertedIndexItem struct {
    Term     string
    DocFreq  int
    Postings []Posting
}

// InvertedIndex maps terms to their index entries.
type InvertedIndex struct {
    Index map[string]*InvertedIndexItem
}

// NewInvertedIndex creates an empty inverted index.
func NewInvertedIndex() *InvertedIndex {
    return &InvertedIndex{
        Index: make(map[string]*InvertedIndexItem),
    }
}
4.2 Index Construction
// AddDocument tokenizes a document and adds it to the index.
func (idx *InvertedIndex) AddDocument(docID int, text string) {
    // Simple whitespace tokenization (real applications need a more
    // sophisticated tokenizer).
    words := strings.Fields(strings.ToLower(text))
    for pos, word := range words {
        if idx.Index[word] == nil {
            idx.Index[word] = &InvertedIndexItem{
                Term:     word,
                DocFreq:  0,
                Postings: make([]Posting, 0),
            }
        }
        // Check whether a posting for this document already exists.
        var posting *Posting
        for i := range idx.Index[word].Postings {
            if idx.Index[word].Postings[i].DocID == docID {
                posting = &idx.Index[word].Postings[i]
                break
            }
        }
        if posting == nil {
            // Create a new posting.
            newPosting := Posting{
                DocID:     docID,
                Frequency: 1,
                Positions: []int{pos},
            }
            idx.Index[word].Postings = append(idx.Index[word].Postings, newPosting)
            idx.Index[word].DocFreq++
        } else {
            // Update the existing posting.
            posting.Frequency++
            posting.Positions = append(posting.Positions, pos)
        }
    }
}
4.3 Query Implementation
// Search returns the IDs of documents containing a single term.
func (idx *InvertedIndex) Search(term string) []int {
    term = strings.ToLower(term)
    if item, exists := idx.Index[term]; exists {
        docIDs := make([]int, len(item.Postings))
        for i, posting := range item.Postings {
            docIDs[i] = posting.DocID
        }
        return docIDs
    }
    return []int{}
}

// SearchAnd returns documents containing all of the given terms (AND).
func (idx *InvertedIndex) SearchAnd(terms []string) []int {
    if len(terms) == 0 {
        return []int{}
    }
    // Start with the results for the first term.
    result := idx.Search(terms[0])
    // Intersect with the results for each remaining term.
    for i := 1; i < len(terms); i++ {
        otherResult := idx.Search(terms[i])
        result = intersect(result, otherResult)
    }
    return result
}

// SearchOr returns documents containing any of the given terms (OR).
func (idx *InvertedIndex) SearchOr(terms []string) []int {
    if len(terms) == 0 {
        return []int{}
    }
    resultSet := make(map[int]bool)
    for _, term := range terms {
        docIDs := idx.Search(term)
        for _, docID := range docIDs {
            resultSet[docID] = true
        }
    }
    result := make([]int, 0, len(resultSet))
    for docID := range resultSet {
        result = append(result, docID)
    }
    sort.Ints(result)
    return result
}

// intersect returns the intersection of two int slices.
func intersect(a, b []int) []int {
    set := make(map[int]bool)
    for _, x := range a {
        set[x] = true
    }
    result := make([]int, 0)
    for _, x := range b {
        if set[x] {
            result = append(result, x)
        }
    }
    return result
}
4.4 Complete Example
func main() {
    // Create the inverted index.
    index := NewInvertedIndex()

    // Documents to index.
    documents := []Document{
        {ID: 1, Text: "Go is a programming language"},
        {ID: 2, Text: "Go is fast and efficient"},
        {ID: 3, Text: "Programming in Go is fun"},
        {ID: 4, Text: "Go language is simple"},
    }

    // Build the index.
    for _, doc := range documents {
        index.AddDocument(doc.ID, doc.Text)
    }

    // Query examples.
    fmt.Println("Search 'go':", index.Search("go"))
    fmt.Println("Search 'programming':", index.Search("programming"))
    fmt.Println("Search 'go' AND 'language':", index.SearchAnd([]string{"go", "language"}))
    fmt.Println("Search 'go' OR 'fast':", index.SearchOr([]string{"go", "fast"}))

    // Print the index structure.
    fmt.Println("\nInverted index structure:")
    for term, item := range index.Index {
        fmt.Printf("Term: %s, DocFreq: %d\n", term, item.DocFreq)
        for _, posting := range item.Postings {
            fmt.Printf("  DocID: %d, Frequency: %d, Positions: %v\n",
                posting.DocID, posting.Frequency, posting.Positions)
        }
    }
}
5. Optimizing Inverted Indexes
5.1 Compression Techniques
- Variable-Byte Encoding: Use variable-byte encoding to compress document IDs
- Differential Encoding: Store the difference in document IDs instead of absolute values
- Bitmap Compression: Use bitmaps to represent document sets
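The first two techniques combine naturally: store each doc ID as the gap from its predecessor, then pack the gaps as variable-length bytes. A minimal sketch in Go, using the standard library's encoding/binary varint helpers (the compressPostings/decompressPostings names are ours):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// compressPostings delta-encodes a sorted doc-ID list and packs
// the gaps as unsigned varints: gaps under 128 take one byte.
func compressPostings(docIDs []int) []byte {
	buf := make([]byte, 0, len(docIDs))
	prev := 0
	for _, id := range docIDs {
		buf = binary.AppendUvarint(buf, uint64(id-prev))
		prev = id
	}
	return buf
}

// decompressPostings reverses the encoding by summing the gaps.
func decompressPostings(data []byte) []int {
	var ids []int
	prev := 0
	for len(data) > 0 {
		gap, n := binary.Uvarint(data)
		data = data[n:]
		prev += int(gap)
		ids = append(ids, prev)
	}
	return ids
}

func main() {
	ids := []int{1000, 1003, 1010, 1100, 5000}
	packed := compressPostings(ids)
	// Five doc IDs fit in far fewer bytes than five raw int64s.
	fmt.Println(len(packed), decompressPostings(packed))
	// prints: 7 [1000 1003 1010 1100 5000]
}
```

Because consecutive doc IDs in a posting list are close together, most gaps are small and fit in a single byte, which is where the compression comes from.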
5.2 Query Optimization
- Skip Lists: Quickly locate positions in long lists
- Caching Mechanism: Cache results of popular queries
- Parallel Querying: Process queries with multiple threads
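As a baseline for these optimizations, here is a linear two-pointer intersection of two posting lists, assuming both are already sorted by doc ID; this is the merge that skip lists accelerate by letting a pointer jump forward in long lists (the function name is ours):

```go
package main

import "fmt"

// intersectSorted intersects two ascending doc-ID lists in
// O(len(a)+len(b)) time. A skip list would let the lagging
// pointer jump several entries at once instead of stepping by one.
func intersectSorted(a, b []int) []int {
	result := make([]int, 0)
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			result = append(result, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++ // a is behind: advance it
		default:
			j++ // b is behind: advance it
		}
	}
	return result
}

func main() {
	fmt.Println(intersectSorted([]int{1, 3, 4, 7, 9}, []int{2, 3, 7, 10}))
	// prints: [3 7]
}
```

Compared with the map-based intersect in section 4.3, this version allocates no hash set and preserves sorted order, which is why real engines keep posting lists sorted by doc ID.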
6. Practical Application Scenarios
6.1 Search Engines
- Core technology for search engines like Google and Baidu
- Indexing and retrieval of web page content
6.2 Database Systems
- Full-text search functionality
- Fast querying of text fields
6.3 Code Search
- GitHub code search
- Code navigation in IDEs
6.4 Log Analysis
- Fast retrieval of log files
- Locating error logs
7. Performance Analysis
7.1 Time Complexity
- Index Construction: O(N×M), where N is the number of documents and M is the average number of terms
- Single-term Query: O(1) on average
- Multi-term Query: roughly linear in the combined length of the posting lists being merged; skip pointers can reduce the work further
7.2 Space Complexity
- Storage Space: O(V×D), where V is the vocabulary size and D is the average document frequency
7.3 Advantages and Disadvantages
Advantages:
- Fast query speed
- Supports complex queries
- Easy to implement
Disadvantages:
- Time-consuming index construction
- Large storage space
- Complex index updates
8. Summary
The inverted index is a core technology in information retrieval. By reversing the "document-term" relationship to a "term-document" relationship, it enables efficient text search. In Go, we can implement an inverted index using basic data structures like maps and slices, providing powerful search capabilities for applications.
multi_match Usage Guide
multi_match is a query type in ES that performs searches across multiple fields simultaneously. It is essentially an extension of the match query across multiple fields. It is suitable for combined retrieval across multiple text fields like title, description, and tags, often used with field boosting, different query types, and analyzers.
1. Basic Usage
POST /index/_search
{
"query": {
"multi_match": {
"query": "iPhone 15",
"fields": ["title", "description", "tags"]
}
}
}
2. Field Boosting (boost)
POST /index/_search
{
"query": {
"multi_match": {
"query": "iPhone 15",
"fields": ["title^3", "description^1.5", "tags"]
}
}
}
Explanation: title^3 means that the match score for the title field is multiplied by a weight of 3, thereby boosting the score of hits on this field during sorting.
3. Type Option and Applicable Scenarios
- best_fields (default): Uses the score of the best-matching field as the primary score; can be combined with tie_breaker
POST /index/_search
{
"query": {
"multi_match": {
"query": "apple phone",
"fields": ["title", "description", "tags"],
"type": "best_fields",
"tie_breaker": 0.2
}
}
}
- most_fields: Scores from multiple fields are summed up, suitable for cases where the same semantic meaning is distributed across multiple fields (e.g., the same text split and stored in different fields)
POST /index/_search
{
"query": {
"multi_match": {
"query": "iphone",
"fields": ["title", "title.ngram", "description"],
"type": "most_fields"
}
}
}
- cross_fields: Treats multiple fields as one large field for matching, suitable for scenarios where terms are distributed across different fields (e.g., first_name + last_name)
POST /index/_search
{
"query": {
"multi_match": {
"query": "tim cook",
"fields": ["first_name", "last_name"],
"type": "cross_fields",
"operator": "and"
}
}
}
- phrase: Phrase matching, requires strict word order and proximity, suitable for exact phrase search
POST /index/_search
{
"query": {
"multi_match": {
"query": "iphone 15 pro",
"fields": ["title", "description"],
"type": "phrase"
}
}
}
- phrase_prefix: Phrase prefix matching, suitable for input method suggestions/search suggestions
POST /index/_search
{
"query": {
"multi_match": {
"query": "iph 15",
"fields": ["title", "description"],
"type": "phrase_prefix",
"max_expansions": 50
}
}
}
4. Operator and Minimum Match
POST /index/_search
{
"query": {
"multi_match": {
"query": "apple flagship phone",
"fields": ["title", "description"],
"operator": "and",
"minimum_should_match": "75%"
}
}
}
Explanation:
- operator: "and" requires all query terms to match; "or" (the default) matches on any single term
- minimum_should_match controls the minimum number or proportion of terms that must match, e.g. 2, 3<75%, 75%
5. Fuzzy Matching (fuzziness) and Correction
POST /index/_search
{
"query": {
"multi_match": {
"query": "iphine",
"fields": ["title", "description"],
"fuzziness": "AUTO",
"prefix_length": 1
}
}
}
Explanation: fuzziness: AUTO provides fault tolerance for common spelling errors; prefix_length specifies the length of the prefix that must match exactly.
6. Analyzer and Field Selection
POST /index/_search
{
"query": {
"multi_match": {
"query": "苹果 手机",
"fields": ["title", "title.keyword^5", "description"],
"analyzer": "ik_smart"
}
}
}
Suggestions:
- Mostly used for full-text search on text fields; for exact matching and aggregation/sorting, use keyword fields (optionally with boost)
- For Chinese search, analyzers like ik_smart and ik_max_word can be used (plugins must be installed)
7. Combined Example (Comprehensive Fields, Weights, Filtering, and Sorting)
POST /products/_search
{
"_source": ["id", "title", "price", "brand"],
"from": 0,
"size": 20,
"sort": [
{"_score": "desc"},
{"price": "asc"}
],
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "iphone 15 pro",
"fields": ["title^4", "subtitle^2", "description", "tags"],
"type": "best_fields",
"tie_breaker": 0.3,
"minimum_should_match": "66%"
}
}
],
"filter": [
{"term": {"brand": "apple"}},
{"range": {"price": {"gte": 3000, "lte": 10000}}}
]
}
},
"highlight": {
"fields": {
"title": {},
"description": {}
}
}
}
8. Common Issues and Suggestions
- Suboptimal relevance:
  - Set higher weights for core fields (e.g., title^N)
  - Choose the appropriate type: use cross_fields when query terms are spread across fields, most_fields to sum scores
  - Use synonyms, spell correction (fuzziness), and domain-specific dictionaries
- Performance issues:
  - Control the returned fields (_source filtering) and size
  - Put filter conditions in filter, which hits the cache and does not participate in scoring
  - Avoid wildcard/phrase_prefix prefix expansion across a huge number of fields
- Exact vs. full-text:
  - Use keyword for exact matching and aggregation; use text + analyzer for full-text search
  - Create multi-fields (text + keyword) for the same business field
term Query Explained
The term query is a query type in ES used for exact matching. It does not perform tokenization on the query term but directly matches it precisely with terms in the index. It is suitable for keyword type fields, numeric fields, date fields, etc. (No tokenization, no lowercasing).
1. Basic Usage
POST /products/_search
{
"query": {
"term": {
"status": "active"
}
}
}
2. Multi-field term Query
POST /products/_search
{
"query": {
"bool": {
"must": [
{"term": {"status": "active"}},
{"term": {"category": "electronics"}},
{"term": {"brand": "apple"}}
]
}
}
}
3. Exact Matching for Numeric Fields
POST /products/_search
{
"query": {
"term": {
"price": 5999
}
}
}
4. Exact Matching for Date Fields
POST /products/_search
{
"query": {
"term": {
"created_date": "2025-01-18"
}
}
}
5. Array Field Matching
POST /products/_search
{
"query": {
"term": {
"tags": "phone"
}
}
}
6. Using boost to Increase Weight
POST /products/_search
{
"query": {
"term": {
"status": {
"value": "active",
"boost": 2.0
}
}
}
}
7. terms Query (Multi-value Matching)
POST /products/_search
{
"query": {
"terms": {
"status": ["active", "pending", "review"]
}
}
}
8. Combined with filter
POST /products/_search
{
"query": {
"bool": {
"must": [
{"match": {"title": "iPhone"}}
],
"filter": [
{"term": {"status": "active"}},
{"term": {"category": "electronics"}}
]
}
}
}
term vs match Query Comparison
1. Core Differences
| Feature | term Query | match Query |
|---|---|---|
| Tokenization | No tokenization, exact match | Performs tokenization on query terms |
| Matching Method | Exact match against terms in the index | Analyzes the query and matches the resulting terms, with relevance scoring |
| Applicable Fields | keyword, numeric, date, etc. | text type fields |
| Performance | Faster (no relevance calculation) | Slower (requires score calculation) |
| Caching | Results can be cached | Results are usually not cached |
2. Practical Example Comparison
2.1 Different Results for the Same Query Term
# Data preparation (assumes the mapping gives title a keyword sub-field;
# sub-fields are defined in the mapping, not supplied in the document)
POST /test/_doc/1
{
"title": "iPhone 15 Pro Max",
"status": "active"
}
# term query - exact match
POST /test/_search
{
"query": {
"term": {
"title.keyword": "iPhone 15 Pro Max"
}
}
}
# Result: match
# term query - term on a text field (usually no match)
POST /test/_search
{
"query": {
"term": {
"title": "iPhone 15 Pro Max"
}
}
}
# Result: no match (title is tokenized into ["iphone", "15", "pro", "max"])
# match query - match on a text field
POST /test/_search
{
"query": {
"match": {
"title": "iPhone 15 Pro Max"
}
}
}
# Result: match, with a relevance score
2.2 Partial Match Comparison
# term query - a partial value does not match
POST /test/_search
{
"query": {
"term": {
"title.keyword": "iPhone 15"
}
}
}
# Result: no match (the whole value must be identical)
# match query - partial terms match
POST /test/_search
{
"query": {
"match": {
"title": "iPhone 15"
}
}
}
# Result: match, with a lower relevance score
3. Use Case Comparison
3.1 Applicable Scenarios for term Query
# 1. Status filtering
POST /products/_search
{
"query": {
"bool": {
"filter": [
{"term": {"status": "active"}}
]
}
}
}
# 2. Category filtering
POST /products/_search
{
"query": {
"bool": {
"filter": [
{"term": {"category": "electronics"}}
]
}
}
}
# 3. Tag matching
POST /products/_search
{
"query": {
"bool": {
"filter": [
{"term": {"tags": "premium"}}
]
}
}
}
# 4. Aggregation statistics
POST /products/_search
{
"size": 0,
"aggs": {
"status_count": {
"terms": {
"field": "status"
}
}
}
}
3.2 Applicable Scenarios for match Query
# 1. Full-text search
POST /products/_search
{
"query": {
"match": {
"title": "iPhone 15 Pro"
}
}
}
# 2. Description search
POST /products/_search
{
"query": {
"match": {
"description": "latest phone"
}
}
}
# 3. Multi-field search
POST /products/_search
{
"query": {
"multi_match": {
"query": "苹果手机",
"fields": ["title", "description", "tags"]
}
}
}
4. Performance Comparison
4.1 Query Performance
# term query - high performance
POST /products/_search
{
"query": {
"bool": {
"filter": [
{"term": {"status": "active"}},
{"term": {"category": "electronics"}}
]
}
}
}
# Characteristics: no relevance scoring; results can be cached
# match query - relatively slower
POST /products/_search
{
"query": {
"bool": {
"must": [
{"match": {"title": "iPhone"}},
{"match": {"description": "phone"}}
]
}
}
}
# Characteristics: requires relevance scoring; results are usually not cached
4.2 Hybrid Usage Optimization
# Best practice: term for filtering, match for searching
POST /products/_search
{
"query": {
"bool": {
"must": [
{"match": {"title": "iPhone 15"}}
],
"filter": [
{"term": {"status": "active"}},
{"term": {"category": "electronics"}},
{"range": {"price": {"gte": 1000, "lte": 10000}}}
]
}
}
}
5. Common Errors and Solutions
5.1 Using term Query on text Fields
# Incorrect usage
POST /products/_search
{
"query": {
"term": {
"title": "iPhone" # title is a text field and gets tokenized
}
}
}
# Correct usage
POST /products/_search
{
"query": {
"term": {
"title.keyword": "iPhone" # use the keyword sub-field
}
}
}
# Or use match
POST /products/_search
{
"query": {
"match": {
"title": "iPhone"
}
}
}
5.2 Case Sensitivity Issues
# term queries are case-sensitive
POST /products/_search
{
"query": {
"term": {
"status": "Active" # if the indexed value is "active", this does not match
}
}
}
# Solutions: keep case consistent at index time, add a lowercase
# normalizer to the keyword field, or (ES 7.10+) use the term
# query's case_insensitive parameter
POST /products/_search
{
"query": {
"term": {
"status": {
"value": "Active",
"case_insensitive": true
}
}
}
}
6. Best Practice Suggestions
6.1 Field Mapping Design
# Create a mapping that supports both query styles
PUT /products
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_smart",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"status": {
"type": "keyword"
},
"price": {
"type": "double"
}
}
}
}
6.2 Query Combination Strategy
# Recommended: exact filtering + full-text search
POST /products/_search
{
"query": {
"bool": {
"must": [
{"match": {"title": "user search terms"}}
],
"filter": [
{"term": {"status": "active"}},
{"term": {"category": "electronics"}},
{"range": {"price": {"gte": 1000, "lte": 10000}}}
]
}
},
"sort": [
{"_score": "desc"},
{"price": "asc"}
]
}
7. Summary
- term Query: Suitable for exact matching, filtering, aggregation; better performance, results can be cached
- match Query: Suitable for full-text search, fuzzy matching; supports relevance scoring
- Best Practice: Use term for filter conditions, match for search content; combine both
- Field Design: Create keyword sub-fields for fields requiring exact matching
- Performance Optimization: Place exact match conditions in the filter to avoid unnecessary scoring calculations
ES Mapping Concepts and Usage
1. What is Mapping
Mapping is the "structure definition" of an index, similar to a relational database table schema. It is used to declare the type and indexing method for each field, determining:
- Data type and storage format of fields (text, keyword, numeric, date, boolean, geo, nested, etc.)
- Whether a field participates in the inverted index and how it is tokenized (index, analyzer)
- Whether it can be used for aggregation/sorting (doc_values)
- Multi-field definition: indexing the same business field in multiple ways
- Dynamic field handling strategy (dynamic)
Since ES 7, an index has only one type (internally _doc), and modeling directly focuses on "index + mapping."
2. Common Field Types and Scenarios
- text: Tokenized, used for full-text search; not suitable for aggregation/sorting
- keyword: Not tokenized, suitable for exact matching, aggregation, sorting; has doc_values by default
- Numeric and date: integer/long/double/date, etc., suitable for range filtering, aggregation, and sorting
- Structured: object (flattened into the same document), nested (each object in an array is indexed independently, supporting independent sub-queries)
- Geographic: geo_point/geo_shape
Typical multi-fields (for both full-text and exact matching):
"title": {
"type": "text",
"analyzer": "ik_smart",
"fields": {
"keyword": { "type": "keyword", "ignore_above": 256 }
}
}
3. Create Index and Explicitly Set Mapping
PUT /products
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1
},
"mappings": {
"dynamic": "true",
"properties": {
"title": {
"type": "text",
"analyzer": "standard",
"fields": {
"keyword": { "type": "keyword", "ignore_above": 256 }
}
},
"price": { "type": "double" },
"status": { "type": "keyword" },
"createdAt": { "type": "date" },
"tags": { "type": "keyword" },
"attrs": { "type": "object" },
"specs": { "type": "nested" }
}
}
}
4. View/Update Mapping
- View Mapping
GET /products/_mapping
- Add Field (only new fields can be added, existing field types cannot be changed)
PUT /products/_mapping
{
"properties": {
"brand": { "type": "keyword" }
}
}
5. Correct Way to Modify Field Types (Reindex)
- Create a new index products_v2 and define the correct mapping
- Migrate the data:
POST /_reindex
{
"source": { "index": "products" },
"dest": { "index": "products_v2" }
}
- Switch traffic using an alias
POST /_aliases
{
"actions": [
{ "remove": { "index": "products", "alias": "products_read" }},
{ "add": { "index": "products_v2", "alias": "products_read" }}
]
}
6. Dynamic Mapping Strategy
"mappings": {
"dynamic": "strict",
"properties": { /* explicitly list fields; unknown fields will be rejected */ }
}
It is recommended to use strict for core indexes to prevent dirty data from being automatically inferred as incorrect types (e.g., treating numbers as text).
7. Performance and Practical Considerations
- Only enable index for fields that need to be searched/filtered; for display-only fields, set index: false
- Fields requiring aggregation/sorting should keep doc_values: true (text has no doc values)
- For Chinese scenarios, install the IK analyzer and specify an analyzer for text fields
- Use nested for arrays of objects to avoid the cross-matching that object causes
- Use multi-fields to support both full-text and exact matching simultaneously
In short: Mapping determines "how fields are stored, indexed, and searched." Before building an index, clarify your query and aggregation requirements, then design the mapping to achieve correct and high-performance retrieval results.
ES Analyzer Usage and Explanation
1. What is an Analyzer
An Analyzer is a component that performs "normalization → tokenization → filtering" on text fields during writing/searching, typically consisting of three parts:
- char_filter: Character-level preprocessing (e.g., removing HTML tags)
- tokenizer: Splits text into tokens, such as standard, whitespace, ik_smart
- filter: Further processes tokens (lowercasing, stop word removal, synonyms, stemming, etc.)
The field's analyzer is applied at index time; by default the same analyzer is applied at search time, but a different one can be specified via search_analyzer.
2. Commonly Used Built-in Analyzers
- standard (default): General-purpose tokenization with lowercasing
- simple: Splits on non-letters, lowercases
- whitespace: Splits only on whitespace, no case changes
- stop: simple plus stop-word removal
- keyword: No tokenization; the entire input is one token (keyword fields are often paired with a normalizer)
- pattern: Splits based on a regular expression
Common for Chinese: ik_smart, ik_max_word (require plugin installation).
3. Using _analyze to Test Tokenization Effects
POST /_analyze
{
"analyzer": "standard",
"text": "iPhone 15 Pro Max"
}
POST /_analyze
{
"analyzer": "ik_smart",
"text": "苹果手机保护壳"
}
4. Setting analyzer and search_analyzer on Fields
PUT /docs
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_smart",
"search_analyzer": "ik_max_word"
}
}
}
}
Explanation:
- Index with ik_smart and query with the finer-grained ik_max_word to improve recall.
5. Temporarily Specifying Analyzer During Query (Without Changing Mapping)
POST /docs/_search
{
"query": {
"match": {
"title": {
"query": "苹果手机",
"analyzer": "ik_max_word"
}
}
}
}
6. Custom Analyzer (Including Synonyms/Stop Words)
PUT /articles
{
"settings": {
"analysis": {
"filter": {
"my_synonyms": {
"type": "synonym",
"synonyms": ["iphone,苹果手机", "notebook,笔记本"]
}
},
"analyzer": {
"my_zh_analyzer": {
"type": "custom",
"char_filter": ["html_strip"],
"tokenizer": "ik_smart",
"filter": ["lowercase", "my_synonyms"]
}
}
}
},
"mappings": {
"properties": {
"content": { "type": "text", "analyzer": "my_zh_analyzer" }
}
}
}
7. Normalizer (Standardization for keyword fields)
keyword fields are not tokenized and cannot use an analyzer; if case normalization or punctuation removal is needed, a normalizer can be used:
PUT /users
{
"settings": {
"analysis": {
"normalizer": {
"lowercase_normalizer": {
"type": "custom",
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"properties": {
"email": { "type": "keyword", "normalizer": "lowercase_normalizer" }
}
}
}
8. IK Analyzer Installation and Field Example (Brief)
- Installation (match your ES version): bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/..., then restart ES
- Usage:
PUT /goods
{
"mappings": {
"properties": {
"title": { "type": "text", "analyzer": "ik_smart", "search_analyzer": "ik_max_word" }
}
}
}
9. Considerations for Changing Analyzers
- The analyzer for an existing field generally cannot be directly modified; a "reindex" process is required
- Different analyzers affect the inverted index structure; after changes, re-verify query semantics and relevance
10. Performance and Practice
- Choose the simplest analyzer that works for indexing (e.g., ik_smart) and a finer-grained one on the query side (ik_max_word) to improve recall
- Use _analyze to verify that tokenization meets expectations; frequent filter conditions should use keyword + normalizer
- Control the number of fields and the tokenization granularity to avoid index explosion; managing synonym lists externally makes updates easier
ES Terminology and Glossary
The following concepts are categorized by topic for quick understanding and reference.
Indexing and Document Modeling
- Index: A logical container for a collection of documents, similar to a database. Internally composed of multiple shards
- Document: A record, stored as JSON, uniquely identified by
_id - Field: A document attribute, determining the available query and aggregation methods
- Mapping: Definition of field types and indexing strategies, equivalent to a table schema
- Type: A logical "table" concept in 6.x and below; fixed to _doc from 7.x, hidden externally in 8.x
- Keyword: Not tokenized, suitable for exact matching, aggregation/sorting, usually has doc_values
- Multi-fields: Indexing the same field in multiple ways, such as title and title.keyword
- Nested: Nested object, each array element is indexed independently, avoiding cross-matching, and supporting independent sub-queries
- Dynamic mapping: Strategy for when unknown fields appear (true/false/strict)
Tokenization and Normalization
- Analyzer: Runs three stages: char_filter → tokenizer → filter
- Tokenizer: Splits text into tokens, such as standard, whitespace, ik_smart
- Token (lexeme/term): The basic unit in an inverted index
- Char filter: Character-level preprocessing, such as html_strip
- Token filter: Further processes tokens, such as lowercase, synonym, stop
- Normalizer: Standardization for keyword fields (lowercasing, accent removal, etc.), no tokenization
Inverted Index and Scoring
- Inverted index: Index structure of term → document list (postings)
- Term: A term in the index (a token after standardization/tokenization)
- Posting: Document occurrence information, including docID, frequency, position, etc.
- Relevance score: Used for sorting
- BM25: Default relevance model (replaces TF-IDF)
- Query vs Filter: Query participates in scoring, Filter only performs boolean filtering and can be cached
- Bool query: Combination query with must/should/must_not/filter
Storage and Segments
- Segment: Immutable data segment, generated by appending writes; merging reduces the number of segments
- Refresh: Flushes in-memory increments to new segments, default cycle 1s, visible after refresh
- Flush: Persists the translog and creates a new commit point
- Translog: Write-ahead log, used for crash recovery
- Doc values: Columnar storage, supports aggregation/sorting/scripting; text has no doc values
- _source: The original JSON document, stored by default, used for re-fetching and reindexing
- Stored fields: Separately stored fields (less common), distinct from _source
- Norms: Field-length normalization and other scoring factors; can be disabled to save space
Cluster and Shards
- Cluster: An ES cluster composed of multiple nodes
- Node: An instance in the cluster; common roles: master, data, ingest, coordinating
- Replica: A copy of a primary shard, enhancing high availability and query throughput
- Routing: Determines which primary shard a document lands on based on the routing value; defaults to a hash of _id
Writes and Batch Processing
- Bulk API: Batch write/update/delete
- Update by query: Batch update by condition
- Delete by query: Batch delete by condition
- Reindex: Copy from source index to target index (often used for mapping changes)
- Ingest pipeline: Pre-write processing pipeline (grok, rename, set, script, etc.)
- Painless: ES built-in scripting language, used for script updates, script sorting, etc.
Search and Pagination
- Match: Full-text query, tokenized
- Term/Terms: Exact match, not tokenized
- Range: Range query (numeric/date)
- Multi-match: Multi-field full-text query
- Nested query: Sub-query against nested fields
- Highlight: Highlight matching snippets
- Suggesters: Search suggestions (term/phrase/completion)
- From/size: Basic pagination, deep pagination is costly
- Search after: Cursor-based pagination, replaces deep pagination
- Scroll: Snapshot-style cursor for large-volume exports, not for real-time queries
- PIT (Point in time): Point-in-time consistent snapshot, used for stable pagination
Lifecycle and Index Management
- ILM (Index Lifecycle Management): Hot/Warm/Cold/Delete lifecycle policies
- Rollover: Switches to a new index based on size/document count/time
- Snapshot/Restore: Snapshot and recovery (repositories can integrate with S3, HDFS, etc.)
Operations and Performance
- Cluster health: (green/yellow/red)
- Refresh interval: Refresh cycle, can be increased in write-heavy scenarios to speed up writes
- Replicas: Number of replicas affects query throughput and write cost
- Force merge: Merges segments for read-only indexes, reducing file count and improving query performance
- Slow logs: Slow query and slow indexing logs, used for troubleshooting and optimization
The above terms cover high-frequency concepts in ES daily modeling, writing, retrieval, aggregation, and operational optimization. Combined with the preceding chapters on Mapping, Analyzer, term/match, and multi_match, they form a complete knowledge graph for ES usage.