Go Engineer Structured Course 011 [Learning Notes]

Inverted Index for Queries

1. What is an Inverted Index?

An Inverted Index is a data structure used to quickly find documents containing specific terms. It is one of the core technologies of search engines.

1.1 Basic Concepts

  • Forward Index: Document ID → Document Content (list of terms)
  • Inverted Index: Term → List of Document IDs containing the term

1.2 Why is it called "Inverted"?

An Inverted Index reverses the traditional relationship of "which terms a document contains" to "in which documents a term appears", hence the name "inverted".

2. Structure of an Inverted Index

2.1 Basic Structure

Term → Document Frequency → Posting List

2.2 Detailed Structure

Term → {
    DocFreq: N,
    Postings: [
        {DocID: 1, Frequency: 2, Positions: [0, 5]},
        {DocID: 3, Frequency: 1, Positions: [2]}
    ]
}
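
For example, after indexing document 1 ("go is fast go") and document 3 ("learn go"), the entry for the term "go" looks like:

go → {
    DocFreq: 2,
    Postings: [
        {DocID: 1, Frequency: 2, Positions: [0, 3]},
        {DocID: 3, Frequency: 1, Positions: [1]}
    ]
}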

3. How an Inverted Index Works

3.1 Building Process

  1. Document Preprocessing: Tokenization, stop word removal, stemming
  2. Term Statistics: Count the frequency and position of each term in documents
  3. Index Construction: Establish the mapping relationship from terms to documents

3.2 Query Process

  1. Query Parsing: Tokenize the query string
  2. Index Lookup: Search for each term in the inverted index
  3. Result Merging: Merge document lists for multiple terms
  4. Sorted Return: Return results sorted by relevance

4. Implementing an Inverted Index in Go

4.1 Data Structure Definition

package main

import (
    "fmt"
    "sort"
    "strings"
)

// Document holds a document's ID and raw text.
type Document struct {
    ID   int
    Text string
}

// Posting records where and how often a term occurs in one document.
type Posting struct {
    DocID     int
    Frequency int
    Positions []int
}

// InvertedIndexItem is one entry of the inverted index.
type InvertedIndexItem struct {
    Term      string
    DocFreq   int
    Postings  []Posting
}

// InvertedIndex maps terms to their postings.
type InvertedIndex struct {
    Index map[string]*InvertedIndexItem
}

// NewInvertedIndex creates an empty inverted index.
func NewInvertedIndex() *InvertedIndex {
    return &InvertedIndex{
        Index: make(map[string]*InvertedIndexItem),
    }
}

4.2 Index Construction

// AddDocument tokenizes text and adds its terms to the index.
func (idx *InvertedIndex) AddDocument(docID int, text string) {
    // Simple whitespace tokenization (real applications need a proper tokenizer).
    words := strings.Fields(strings.ToLower(text))

    for pos, word := range words {
        if idx.Index[word] == nil {
            idx.Index[word] = &InvertedIndexItem{
                Term:     word,
                DocFreq:  0,
                Postings: make([]Posting, 0),
            }
        }

        // Check whether a posting for this document already exists.
        var posting *Posting
        for i := range idx.Index[word].Postings {
            if idx.Index[word].Postings[i].DocID == docID {
                posting = &idx.Index[word].Postings[i]
                break
            }
        }

        if posting == nil {
            // Create a new posting.
            newPosting := Posting{
                DocID:     docID,
                Frequency: 1,
                Positions: []int{pos},
            }
            idx.Index[word].Postings = append(idx.Index[word].Postings, newPosting)
            idx.Index[word].DocFreq++
        } else {
            // Update the existing posting.
            posting.Frequency++
            posting.Positions = append(posting.Positions, pos)
        }
    }
}

4.3 Query Implementation

// Search returns the IDs of documents containing term.
func (idx *InvertedIndex) Search(term string) []int {
    term = strings.ToLower(term)
    if item, exists := idx.Index[term]; exists {
        docIDs := make([]int, len(item.Postings))
        for i, posting := range item.Postings {
            docIDs[i] = posting.DocID
        }
        return docIDs
    }
    return []int{}
}

// SearchAnd returns documents containing all of the given terms (AND).
func (idx *InvertedIndex) SearchAnd(terms []string) []int {
    if len(terms) == 0 {
        return []int{}
    }

    // Start with the results for the first term.
    result := idx.Search(terms[0])

    // Intersect with the results for each remaining term.
    for i := 1; i < len(terms); i++ {
        otherResult := idx.Search(terms[i])
        result = intersect(result, otherResult)
    }

    return result
}

// SearchOr returns documents containing any of the given terms (OR).
func (idx *InvertedIndex) SearchOr(terms []string) []int {
    if len(terms) == 0 {
        return []int{}
    }

    resultSet := make(map[int]bool)

    for _, term := range terms {
        docIDs := idx.Search(term)
        for _, docID := range docIDs {
            resultSet[docID] = true
        }
    }

    result := make([]int, 0, len(resultSet))
    for docID := range resultSet {
        result = append(result, docID)
    }

    sort.Ints(result)
    return result
}

// intersect returns the intersection of two int slices.
func intersect(a, b []int) []int {
    set := make(map[int]bool)
    for _, x := range a {
        set[x] = true
    }

    result := make([]int, 0)
    for _, x := range b {
        if set[x] {
            result = append(result, x)
        }
    }

    return result
}

4.4 Complete Example

func main() {
    // Create the inverted index.
    index := NewInvertedIndex()

    // Sample documents.
    documents := []Document{
        {ID: 1, Text: "Go is a programming language"},
        {ID: 2, Text: "Go is fast and efficient"},
        {ID: 3, Text: "Programming in Go is fun"},
        {ID: 4, Text: "Go language is simple"},
    }

    // Build the index.
    for _, doc := range documents {
        index.AddDocument(doc.ID, doc.Text)
    }

    // Query examples.
    fmt.Println("Search 'go':", index.Search("go"))
    fmt.Println("Search 'programming':", index.Search("programming"))
    fmt.Println("Search 'go' AND 'language':", index.SearchAnd([]string{"go", "language"}))
    fmt.Println("Search 'go' OR 'fast':", index.SearchOr([]string{"go", "fast"}))

    // Print the index structure.
    fmt.Println("\nInverted index structure:")
    for term, item := range index.Index {
        fmt.Printf("Term: %s, DocFreq: %d\n", term, item.DocFreq)
        for _, posting := range item.Postings {
            fmt.Printf("  DocID: %d, Frequency: %d, Positions: %v\n",
                posting.DocID, posting.Frequency, posting.Positions)
        }
    }
}

5. Optimizing Inverted Indexes

5.1 Compression Techniques

  • Variable-length encoding: Compress document IDs using variable-length encoding
  • Differential encoding: Store the difference in document IDs instead of absolute values
  • Bitmap compression: Use bitmaps to represent document sets
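
The first two techniques combine naturally: store the gaps between sorted docIDs instead of absolute values, then varint-encode the gaps so small gaps fit in a single byte. A minimal Go sketch using the standard library's varint routines (the function names encodePostings/decodePostings are illustrative, not from any indexing library):

```go
package main

import (
    "encoding/binary"
    "fmt"
)

// encodePostings delta-encodes a sorted docID list and varint-encodes the gaps:
// small gaps compress to a single byte each.
func encodePostings(docIDs []int) []byte {
    buf := make([]byte, 0, len(docIDs))
    tmp := make([]byte, binary.MaxVarintLen64)
    prev := 0
    for _, id := range docIDs {
        n := binary.PutUvarint(tmp, uint64(id-prev)) // encode the gap, not the absolute ID
        buf = append(buf, tmp[:n]...)
        prev = id
    }
    return buf
}

// decodePostings reverses the encoding back to absolute docIDs.
func decodePostings(data []byte) []int {
    var out []int
    prev := 0
    for len(data) > 0 {
        gap, n := binary.Uvarint(data)
        data = data[n:]
        prev += int(gap)
        out = append(out, prev)
    }
    return out
}

func main() {
    ids := []int{1000, 1003, 1007, 1100}
    enc := encodePostings(ids)
    fmt.Println(len(enc), decodePostings(enc)) // prints: 5 [1000 1003 1007 1100]
}
```

Four 8-byte integers shrink to 5 bytes here because three of the gaps are below 128.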

5.2 Query Optimization

  • Skip lists: Quickly locate positions in long lists
  • Caching mechanism: Cache popular query results
  • Parallel querying: Process queries using multiple threads
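
The skip-list idea can be approximated on plain sorted slices: when one list lags far behind the other, jump ahead in fixed strides instead of one element at a time. A simplified sketch (a real skip list keeps explicit skip pointers; the stride of 4 here is arbitrary):

```go
package main

import "fmt"

// intersectSorted intersects two sorted posting lists, skipping ahead in
// strides of 4 when it is safe to do so -- a stand-in for skip pointers.
func intersectSorted(a, b []int) []int {
    var out []int
    i, j := 0, 0
    for i < len(a) && j < len(b) {
        switch {
        case a[i] == b[j]:
            out = append(out, a[i])
            i++
            j++
        case a[i] < b[j]:
            // everything up to a[i+4] is still below b[j], so skip it wholesale
            for i+4 < len(a) && a[i+4] < b[j] {
                i += 4
            }
            i++
        default:
            for j+4 < len(b) && b[j+4] < a[i] {
                j += 4
            }
            j++
        }
    }
    return out
}

func main() {
    a := []int{1, 3, 5, 7, 9, 11, 13, 15, 17}
    b := []int{9, 17, 25}
    fmt.Println(intersectSorted(a, b)) // prints: [9 17]
}
```

On long, skewed lists this avoids touching most elements of the longer list, which is exactly what skip pointers buy in a production index.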

6. Practical Application Scenarios

6.1 Search Engines

  • Core technology for search engines like Google and Baidu
  • Web content indexing and retrieval

6.2 Database Systems

  • Full-text search functionality
  • Fast querying of text fields

6.3 Code Search

  • GitHub code search
  • Code navigation in IDEs

6.4 Log Analysis

  • Fast retrieval of log files
  • Locating error logs

7. Performance Analysis

7.1 Time Complexity

  • Index Construction: O(N×M), where N is the number of documents and M is the average number of terms
  • Single-term Query: O(1) on average
  • Multi-term Query: proportional to the combined length of the posting lists being intersected or merged

7.2 Space Complexity

  • Storage Space: O(V×D), where V is the vocabulary size and D is the average document frequency

7.3 Pros and Cons

Pros:

  • Fast query speed
  • Supports complex queries
  • Easy to implement

Cons:

  • Time-consuming index construction
  • Large storage space
  • Complex index updates

8. Summary

The inverted index is a core technology in information retrieval. By reversing the "document-term" relationship to a "term-document" relationship, it achieves efficient text search. In Go language, we can use basic data structures like maps and slices to implement an inverted index, providing powerful search capabilities for applications.

multi_match Usage Guide

multi_match is a query type in ES that searches across multiple fields simultaneously, essentially an extension of the match query to multiple fields. It is suitable for combined retrieval across multiple text fields like title, description, and tags, often used with field boosting, different query types, and analyzers.

1. Basic Usage

POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "iPhone 15",
      "fields": ["title", "description", "tags"]
    }
  }
}

2. Field Boosting (boost)

POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "iPhone 15",
      "fields": ["title^3", "description^1.5", "tags"]
    }
  }
}

Explanation: title^3 means the match score for the title field is multiplied by a weight of 3, thereby boosting the score of results hitting this field during sorting.

3. type Option and Applicable Scenarios

  • best_fields (default): Takes the single best-matching field's score as the primary score; tie_breaker adds a fraction of the other fields' scores
POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "apple phone",
      "fields": ["title", "description", "tags"],
      "type": "best_fields",
      "tie_breaker": 0.2
    }
  }
}
  • most_fields: Scores from multiple fields are summed, suitable when the same semantic meaning is distributed across multiple fields (e.g., a single text split and stored in different fields)
POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "iphone",
      "fields": ["title", "title.ngram", "description"],
      "type": "most_fields"
    }
  }
}
  • cross_fields: Treats multiple fields as one large field for matching, suitable for scenarios where terms are distributed across different fields (e.g., first_name + last_name)
POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "tim cook",
      "fields": ["first_name", "last_name"],
      "type": "cross_fields",
      "operator": "and"
    }
  }
}
  • phrase: Phrase matching, requires strict word order and proximity, suitable for exact phrase searches
POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "iphone 15 pro",
      "fields": ["title", "description"],
      "type": "phrase"
    }
  }
}
  • phrase_prefix: Phrase prefix matching, suitable for input method suggestions/search suggestions
POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "iph 15",
      "fields": ["title", "description"],
      "type": "phrase_prefix",
      "max_expansions": 50
    }
  }
}

4. Operator and Minimum Match

POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "apple flagship phone",
      "fields": ["title", "description"],
      "operator": "and",
      "minimum_should_match": "75%"
    }
  }
}

Explanation:

  • operator: and requires all query terms to match; or (default) matches any one term
  • minimum_should_match controls the minimum number or proportion of query terms that must match, e.g. 2, 75%, or conditional forms like 3<75% (apply 75% only when there are more than 3 terms)

5. Fuzziness and Correction

POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "iphine",
      "fields": ["title", "description"],
      "fuzziness": "AUTO",
      "prefix_length": 1
    }
  }
}

Explanation: fuzziness: AUTO provides fault tolerance for common spelling errors; prefix_length specifies the length of the prefix that must match exactly.

6. Analyzer and Field Selection

POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "苹果 手机",
      "fields": ["title", "title.keyword^5", "description"],
      "analyzer": "ik_smart"
    }
  }
}

Suggestions:

  • Mostly used for text fields for full-text retrieval; exact matching and aggregation/sorting use keyword fields (can be combined with boost)
  • For Chinese retrieval, ik_smart, ik_max_word and other analyzers can be used (requires plugin installation)

7. Combined Example (Comprehensive fields, weights, filtering, and sorting)

POST /products/_search
{
  "_source": ["id", "title", "price", "brand"],
  "from": 0,
  "size": 20,
  "sort": [
    {"_score": "desc"},
    {"price": "asc"}
  ],
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "iphone 15 pro",
            "fields": ["title^4", "subtitle^2", "description", "tags"],
            "type": "best_fields",
            "tie_breaker": 0.3,
            "minimum_should_match": "66%"
          }
        }
      ],
      "filter": [
        {"term": {"brand": "apple"}},
        {"range": {"price": {"gte": 3000, "lte": 10000}}}
      ]
    }
  },
  "highlight": {
    "fields": {
      "title": {},
      "description": {}
    }
  }
}

8. Common Issues and Suggestions

  • Relevance is not ideal:
      • Give core fields higher weights (e.g., title^N)
      • Choose the appropriate type: cross_fields when terms are spread across fields, most_fields to sum field scores
      • Use synonyms, spell correction (fuzziness), and domain-specific dictionaries
  • Performance issues:
      • Limit returned fields (_source filtering) and size
      • Put filter conditions in filter: they hit the cache and skip scoring
      • Avoid wildcard/phrase_prefix prefix expansion across a huge number of fields
  • Exact vs. full-text:
      • Use keyword for exact matching and aggregation; text + analyzer for full-text retrieval
      • Create multi-fields (text + keyword) on the same business field

term Query Explained

The term query is a query type in ES used for exact matching. It does not perform tokenization on the query term but directly performs an exact match with the terms in the index. It is suitable for keyword type fields, numeric fields, date fields, etc. (No tokenization, no lowercasing).

1. Basic Usage

POST /products/_search
{
  "query": {
    "term": {
      "status": "active"
    }
  }
}

2. Multi-field term Query

POST /products/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {"status": "active"}},
        {"term": {"category": "electronics"}},
        {"term": {"brand": "apple"}}
      ]
    }
  }
}

3. Exact Matching for Numeric Fields

POST /products/_search
{
  "query": {
    "term": {
      "price": 5999
    }
  }
}

4. Exact Matching for Date Fields

POST /products/_search
{
  "query": {
    "term": {
      "created_date": "2025-01-18"
    }
  }
}

5. Array Field Matching

POST /products/_search
{
  "query": {
    "term": {
      "tags": "phone"
    }
  }
}

6. Using boost to Increase Weight

POST /products/_search
{
  "query": {
    "term": {
      "status": {
        "value": "active",
        "boost": 2.0
      }
    }
  }
}

7. terms Query (Multi-value Matching)

POST /products/_search
{
  "query": {
    "terms": {
      "status": ["active", "pending", "review"]
    }
  }
}

8. Combined with filter

POST /products/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "iPhone"}}
      ],
      "filter": [
        {"term": {"status": "active"}},
        {"term": {"category": "electronics"}}
      ]
    }
  }
}

term vs match Query Comparison

1. Core Differences

  • Tokenization: term does not tokenize the query (exact match); match tokenizes the query term
  • Matching method: term matches terms in the index exactly; match performs analyzed matching with relevance scoring
  • Applicable fields: term suits keyword, numeric, date, etc.; match suits text fields
  • Performance: term is faster (no relevance calculation); match is slower (scores must be computed)
  • Caching: term results can be cached (as filters); match results usually are not

2. Practical Example Comparison

2.1 Different Results for the Same Query Term

# Prepare data (title is assumed to be mapped as text with a keyword
# sub-field; the sub-field is generated automatically from the mapping)
POST /test/_doc/1
{
  "title": "iPhone 15 Pro Max",
  "status": "active"
}

# term query - exact match
POST /test/_search
{
  "query": {
    "term": {
      "title.keyword": "iPhone 15 Pro Max"
    }
  }
}
# Result: match

# term query on a text field (usually does not match)
POST /test/_search
{
  "query": {
    "term": {
      "title": "iPhone 15 Pro Max"
    }
  }
}
# Result: no match (title is tokenized into ["iphone", "15", "pro", "max"])

# match query on a text field
POST /test/_search
{
  "query": {
    "match": {
      "title": "iPhone 15 Pro Max"
    }
  }
}
# Result: match, with a relevance score

2.2 Partial Match Comparison

# term query - a partial value does not match
POST /test/_search
{
  "query": {
    "term": {
      "title.keyword": "iPhone 15"
    }
  }
}
# Result: no match (requires the exact full value)

# match query - partial terms match
POST /test/_search
{
  "query": {
    "match": {
      "title": "iPhone 15"
    }
  }
}
# Result: match, with a lower relevance score

3. Usage Scenario Comparison

3.1 term Query Applicable Scenarios

# 1. Status filtering
POST /products/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"status": "active"}}
      ]
    }
  }
}

# 2. Category filtering
POST /products/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"category": "electronics"}}
      ]
    }
  }
}

# 3. Tag matching
POST /products/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"tags": "premium"}}
      ]
    }
  }
}

# 4. Aggregation statistics
POST /products/_search
{
  "size": 0,
  "aggs": {
    "status_count": {
      "terms": {
        "field": "status"
      }
    }
  }
}

3.2 match Query Applicable Scenarios

# 1. Full-text search
POST /products/_search
{
  "query": {
    "match": {
      "title": "iPhone 15 Pro"
    }
  }
}

# 2. Description search
POST /products/_search
{
  "query": {
    "match": {
      "description": "latest phone"
    }
  }
}

# 3. Multi-field search
POST /products/_search
{
  "query": {
    "multi_match": {
      "query": "苹果手机",
      "fields": ["title", "description", "tags"]
    }
  }
}

4. Performance Comparison

4.1 Query Performance

# term query - high performance
POST /products/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"status": "active"}},
        {"term": {"category": "electronics"}}
      ]
    }
  }
}
# Characteristics: no relevance calculation; results can be cached

# match query - relatively slower
POST /products/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "iPhone"}},
        {"match": {"description": "手机"}}
      ]
    }
  }
}
# Characteristics: relevance scores must be computed; results usually are not cached

4.2 Mixed Usage Optimization

# Best practice: term for filtering, match for searching
POST /products/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "iPhone 15"}}
      ],
      "filter": [
        {"term": {"status": "active"}},
        {"term": {"category": "electronics"}},
        {"range": {"price": {"gte": 1000, "lte": 10000}}}
      ]
    }
  }
}

5. Common Errors and Solutions

5.1 Using term Query on text Fields

# Incorrect usage
POST /products/_search
{
  "query": {
    "term": {
      "title": "iPhone"  # title is a text field and gets tokenized
    }
  }
}

# Correct usage
POST /products/_search
{
  "query": {
    "term": {
      "title.keyword": "iPhone"  # use the keyword sub-field
    }
  }
}

# Or use match instead
POST /products/_search
{
  "query": {
    "match": {
      "title": "iPhone"
    }
  }
}

5.2 Case Sensitivity Issues

# term queries are case-sensitive
POST /products/_search
{
  "query": {
    "term": {
      "status": "Active"  # does not match if the index stores "active"
    }
  }
}

# Solution: keep casing consistent, or define a lowercase normalizer on the
# keyword field; note that match normalizes the query only for analyzed text fields
POST /products/_search
{
  "query": {
    "match": {
      "status": "Active"  # helps only if status is analyzed text or has a lowercase normalizer
    }
  }
}

6. Best Practice Recommendations

6.1 Field Mapping Design

# Mapping that supports both query styles
PUT /products
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_smart",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "status": {
        "type": "keyword"
      },
      "price": {
        "type": "double"
      }
    }
  }
}

6.2 Query Combination Strategy

# Recommended: exact filtering + full-text search
POST /products/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "user search terms"}}
      ],
      "filter": [
        {"term": {"status": "active"}},
        {"term": {"category": "electronics"}},
        {"range": {"price": {"gte": 1000}}}
      ]
    }
  },
  "sort": [
    {"_score": "desc"},
    {"price": "asc"}
  ]
}

7. Summary

  • term query: Suitable for exact matching, filtering, aggregation; better performance, results can be cached
  • match query: Suitable for full-text search, fuzzy matching; supports relevance scoring
  • Best practice: term for filter conditions, match for search content; use both in combination
  • Field design: Create keyword sub-fields for fields requiring exact matching
  • Performance optimization: Place exact matching conditions in the filter to avoid unnecessary scoring calculations

ES Mapping Concepts and Usage

1. What is Mapping

Mapping is the "structure definition" of an index, similar to the schema of a relational database table. It is used to declare the type and indexing method of each field, determining:

  • The data type and storage format of the field (text, keyword, numeric, date, boolean, geo, nested, etc.)
  • Whether it participates in the inverted index and how it is tokenized (index, analyzer)
  • Whether it can be used for aggregation/sorting (doc_values)
  • Multi-field definition: indexing the same business field in multiple ways
  • Dynamic field processing strategy (dynamic)

Since ES 7, an index has a single type (the internal _doc), so data modeling deals directly with "index + mapping".

2. Common Field Types and Scenarios

  • text: Tokenized, used for full-text retrieval; not suitable for aggregation/sorting
  • keyword: Not tokenized, suitable for exact matching, aggregation, sorting; has doc_values by default
  • Numeric and date: integer/long/double/date, etc., suitable for range filtering, aggregation, and sorting
  • Structured: object (flat object within the same document), nested (each object in an array is modeled independently, supporting independent sub-queries)
  • Geographic: geo_point/geo_shape

Typical multi-fields (for both full-text and exact matching):

"title": {
  "type": "text",
  "analyzer": "ik_smart",
  "fields": {
    "keyword": { "type": "keyword", "ignore_above": 256 }
  }
}

3. Creating an Index and Explicitly Setting Mapping

PUT /products
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "dynamic": "true",
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "price": { "type": "double" },
      "status": { "type": "keyword" },
      "createdAt": { "type": "date" },
      "tags": { "type": "keyword" },
      "attrs": { "type": "object" },
      "specs": { "type": "nested" }
    }
  }
}

4. Viewing/Updating Mapping

  • View Mapping
GET /products/_mapping
  • Add Fields (only new fields can be added, existing field types cannot be changed)
PUT /products/_mapping
{
  "properties": {
    "brand": { "type": "keyword" }
  }
}

5. Correct Way to Change Field Type (Reindexing)

  1. Create a new index products_v2 with the correct mapping
  2. Migrate data
POST /_reindex
{
  "source": { "index": "products" },
  "dest":   { "index": "products_v2" }
}
  3. Switch traffic using aliases
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products", "alias": "products_read" }},
    { "add":    { "index": "products_v2", "alias": "products_read" }}
  ]
}

6. Dynamic Mapping Strategy

"mappings": {
  "dynamic": "strict",
  "properties": { /* explicitly list fields; unknown fields are rejected */ }
}

It is recommended to use strict for core indexes to prevent dirty data from being automatically inferred into incorrect types (e.g., treating numeric values as text).

7. Performance and Practice Essentials

  • Only enable index for fields that need to be searched/filtered; for purely display fields, index: false can be used
  • Fields that need aggregation/sorting should keep doc_values: true (text has no doc_values)
  • For Chinese scenarios, install the IK analyzer and specify the analyzer for text fields
  • Use nested for nested arrays to avoid cross-matching caused by object
  • Use multi-fields to support both full-text and exact matching simultaneously

In a nutshell: Mapping determines "how fields are stored, indexed, and queried". Before building an index, clarify your query and aggregation requirements, then design the mapping to achieve correct and high-performance retrieval.

ES Analyzer Usage and Explanation

1. What is an Analyzer

An Analyzer is a component that "normalizes → tokenizes → filters" text fields during writing/searching, typically consisting of three parts:

  • char_filter: Character-level preprocessing (e.g., removing HTML tags)
  • tokenizer: Splits text into tokens (lexemes), such as standard, whitespace, ik_smart
  • filter: Further processes tokens (lowercasing, stop word removal, synonyms, stemming, etc.)

The field's analyzer is used during the writing phase, and the same analyzer is used by default during the search phase, but can be specified separately via search_analyzer.

2. Built-in Common Analyzers

  • standard (default): General English tokenization, lowercasing
  • simple: Splits by non-letters, lowercasing
  • whitespace: Splits only by whitespace, preserves case
  • stop: Based on simple, removes stop words
  • keyword: No tokenization, treats the entire input as a single token (mostly used by normalizer for keyword fields)
  • pattern: Splits based on regular expressions

Commonly used for Chinese: ik_smart, ik_max_word (requires plugin installation).

3. Using _analyze to Test Tokenization Effect

POST /_analyze
{
  "analyzer": "standard",
  "text": "iPhone 15 Pro Max"
}

POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "苹果手机保护壳"
}

4. Setting analyzer and search_analyzer on Fields

PUT /docs
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_smart",
        "search_analyzer": "ik_max_word"
      }
    }
  }
}

Explanation:

  • Use ik_smart during writing, and a finer-grained ik_max_word during querying to improve recall.

5. Temporarily Specifying an Analyzer During Query (Without Changing Mapping)

POST /docs/_search
{
  "query": {
    "match": {
      "title": {
        "query": "苹果手机",
        "analyzer": "ik_max_word"
      }
    }
  }
}

6. Custom Analyzer (Including Synonyms/Stop Words)

PUT /articles
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["iphone,苹果手机", "notebook,笔记本"]
        }
      },
      "analyzer": {
        "my_zh_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "ik_smart",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "my_zh_analyzer" }
    }
  }
}

7. Normalizer (Standardization for keyword)

keyword fields are not tokenized and cannot use an analyzer; if lowercasing or punctuation removal is needed, a normalizer can be used:

PUT /users
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": { "type": "keyword", "normalizer": "lowercase_normalizer" }
    }
  }
}

8. IK Analyzer Installation and Field Example (Brief)

  1. Installation (pick the release matching your ES version): bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/..., then restart ES
  2. Usage:
PUT /goods
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "ik_smart", "search_analyzer": "ik_max_word" }
    }
  }
}

9. Considerations for Changing Analyzers

  • The analyzer of an existing field generally cannot be directly modified; a "reindex" process is required
  • Different analyzers will affect the inverted index structure, so re-verify query semantics and relevance after changes

10. Performance and Practice

  • Choose the simplest possible analyzer for writing (e.g., ik_smart), and a finer-grained one for querying (ik_max_word) to improve recall
  • Use _analyze to verify if tokenization meets expectations; frequent filter conditions should use keyword + normalizer
  • Control the number of fields and tokenization granularity to avoid index explosion; manage synonym lists externally for easy updates

ES Terms and Glossary Quick Reference

The following concepts are categorized by topic for quick understanding and reference.

Index and Document Modeling

  • Index: A logical container for a collection of documents, similar to a database. Internally composed of multiple shards
  • Document: A single record, stored as JSON, uniquely identified by _id
  • Field: A document attribute, determining available query and aggregation methods
  • Mapping: Definition of field types and indexing strategies, equivalent to a table schema
  • Type: A logical "table" concept in 6.x and below; fixed to _doc in 7.x and removed in 8.x
  • Text: A tokenized field type, used for full-text retrieval, not suitable for aggregation/sorting
  • Keyword: Not tokenized, suitable for exact matching, aggregation/sorting, usually has doc_values
  • Multi-fields: Indexing the same field in multiple ways, such as title and title.keyword
  • Object: Object field, properties are flattened and merged into the same document
  • Nested: Nested object, each array element is indexed independently, avoiding cross-matching, and can be queried independently
  • Dynamic mapping: Strategy for when unknown fields appear (true/false/strict)

Tokenization and Normalization

  • Analyzer: The analysis chain, consisting of three stages: char_filter → tokenizer → filter
  • Tokenizer: Component that splits text into tokens, such as standard, whitespace, ik_smart
  • Token: The basic unit in the inverted index (standardized/tokenized lexeme)
  • Char filter: Character-level preprocessing, such as html_strip
  • Token filter: Further processing of tokens, such as lowercase, synonym, stop
  • Normalizer: Normalization for keyword fields (lowercasing, accent removal, etc.), no tokenization

Inverted Index and Scoring

  • Inverted index: Index structure of term → document list (postings)
  • Term: A term in the index (a token after standardization/tokenization)
  • Posting: Document occurrence information, including docID, frequency, position, etc.
  • Relevance score: Relevance score, used for sorting
  • BM25: Default relevance model (replaces TF-IDF)
  • Query vs Filter: Query participates in scoring, Filter only performs boolean filtering and can be cached
  • Bool query: must/should/must_not/filter combination query
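
For reference, the BM25 score (the default since ES 5) for a query Q and document D has the form, with defaults k1 = 1.2 and b = 0.75:

score(D, Q) = Σ_{qi ∈ Q} IDF(qi) × ( f(qi, D) × (k1 + 1) ) / ( f(qi, D) + k1 × (1 − b + b × |D| / avgdl) )

where f(qi, D) is qi's frequency in D, |D| is the document length in tokens, and avgdl is the average document length across the index.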

Storage and Segments

  • Segment: Immutable data segment, created by appending writes; merge reduces the number of segments
  • Refresh: Flushes in-memory increments to new segments, default cycle 1s, visible after refresh
  • Flush: Persists the translog and creates a new commit point
  • Translog: Write-ahead log, used for crash recovery
  • Doc values: Columnar storage, supporting aggregation/sorting/scripting, text has no doc values
  • _source: Original JSON document, stored by default, used for re-retrieval and reindex
  • Stored fields: Separately stored fields (not commonly used), distinct from _source
  • Norms: Field-level length normalization and other scoring factors, can be disabled to save space

Cluster and Shards

  • Cluster: An ES cluster composed of multiple nodes
  • Node: An instance in the cluster, common roles: master, data, ingest, coordinating
  • Shard (Primary shard): The physical sharding unit of an index, quantity determined at creation
  • Replica: A copy of a primary shard, improving high availability and query throughput
  • Routing: Determines which primary shard a document falls into based on the routing value, defaults to _id hash
  • Alias: An alias, can point to one or more indexes, facilitating seamless switching

Writing and Batch Processing

  • Bulk API: Batch write/update/delete
  • Update by query: Batch update by condition
  • Delete by query: Batch delete by condition
  • Reindex: Copy from source index to target index (often used for mapping changes)
  • Ingest pipeline: Pre-write processing pipeline (grok, rename, set, script, etc.)
  • Painless: ES built-in scripting language, used for script updates, script sorting, etc.
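The Bulk API body is newline-delimited JSON: each action line is followed by an optional document line, and the body must end with a newline. A sketch against a hypothetical `products` index, sent to `POST /_bulk`:

```json
{ "index": { "_index": "products", "_id": "1" } }
{ "title": "wireless headphones", "price": 129 }
{ "update": { "_index": "products", "_id": "2" } }
{ "doc": { "price": 99 } }
{ "delete": { "_index": "products", "_id": "3" } }
```

Note that `delete` actions take no document line.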

Search and Pagination

  • Match: Full-text query, tokenized
  • Term/Terms: Exact match, not tokenized
  • Range: Range query (numeric/date)
  • Multi-match: Multi-field full-text query
  • Nested query: Sub-query for nested fields
  • Aggregation: Aggregation analysis (terms, stats, date_histogram, range, etc.)
  • Highlight: Highlight matching snippets
  • Suggesters: Search suggestions (term/phrase/completion)
  • From/size: Basic pagination, deep pagination is costly
  • Search after: Cursor-based pagination, replaces deep pagination
  • Scroll: Snapshot-style cursor for large-volume exports, not for real-time queries
  • PIT (Point in time): Point-in-time consistency snapshot, used for stable pagination
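A sketch of cursor-based pagination with `search_after` (the field names and sort values are hypothetical): the sort must include a unique tiebreaker field, and each page passes the last sort values from the previous page:

```json
{
  "size": 100,
  "sort": [
    { "created_at": "desc" },
    { "order_no": "asc" }
  ],
  "search_after": [1741400000000, "A-10023"]
}
```

Combined with a PIT, this yields stable pagination over a consistent snapshot even while the index is being written to.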

Lifecycle and Index Management

  • ILM (Index Lifecycle Management): Hot/Warm/Cold/Delete lifecycle policies
  • Rollover: Switch to a new index based on size/document count/time
  • Snapshot/Restore: Snapshot and recovery (repositories can connect to S3, HDFS, etc.)
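A sketch of an ILM policy (the policy name `logs-policy` is hypothetical), created via `PUT _ilm/policy/logs-policy`: roll over in the hot phase once an index exceeds 50 GB or 7 days, and delete indexes 30 days after rollover:

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "7d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```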

Operations and Performance

  • Cluster health: Overall cluster status (green/yellow/red), reflecting whether all primary shards and replicas are allocated
  • Refresh interval: Refresh cycle; lengthening it in write-heavy scenarios reduces refresh overhead and speeds up indexing
  • Replicas: Number of replicas affects query throughput and write cost
  • Force merge: Merge segments for read-only indexes, reducing file count and improving query performance
  • Slow logs: Slow query and slow indexing logs, used for troubleshooting and optimization
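A sketch of write-tuning settings applied via `PUT /logs/_settings` on a hypothetical `logs` index (slowing refresh and dropping replicas during a bulk load, to be restored afterwards):

```json
{
  "index": {
    "refresh_interval": "30s",
    "number_of_replicas": 0
  }
}
```

Once the index becomes read-only, its segments can be merged with `POST /logs/_forcemerge?max_num_segments=1` to reduce file count and improve query performance.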

The above terms cover high-frequency concepts in ES daily modeling, writing, retrieval, aggregation, and operational optimization. Combined with the preceding chapters on Mapping, Analyzer, term/match, and multi_match, they form a complete knowledge graph for ES usage.

Published by Walker. Please credit the source when republishing: https://walker-learn.xyz/archives/4784
