Inverted Index for Queries
1. What is an Inverted Index?
An Inverted Index is a data structure used for quickly finding documents that contain specific terms. It is one of the core technologies of search engines.
1.1 Basic Concepts
- Forward Index: Document ID → Document Content (list of terms)
- Inverted Index: Term → List of Document IDs containing the term
1.2 Why is it called "Inverted"?
The Inverted Index reverses the traditional relationship of "which terms a document contains" to "in which documents a term appears," hence the name "inverted."
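The contrast between the two layouts can be sketched with Go maps; a minimal toy example (the invert helper is ours for illustration, not part of the implementation in section 4):

```go
package main

import (
	"fmt"
	"sort"
)

// invert builds an inverted index (term → sorted doc IDs)
// from a forward index (doc ID → terms in the document).
func invert(forward map[int][]string) map[string][]int {
	inverted := map[string][]int{}
	for docID, terms := range forward {
		for _, t := range terms {
			inverted[t] = append(inverted[t], docID)
		}
	}
	// Sort each posting list so lookups return deterministic output.
	for t := range inverted {
		sort.Ints(inverted[t])
	}
	return inverted
}

func main() {
	forward := map[int][]string{
		1: {"go", "is", "fast"},
		2: {"go", "is", "fun"},
	}
	inv := invert(forward)
	// "Which documents contain 'go'?" is now a single map lookup.
	fmt.Println(inv["go"])   // [1 2]
	fmt.Println(inv["fast"]) // [1]
}
```

With the forward index, answering "which documents contain 'go'?" requires scanning every document; with the inverted index it is one hash lookup.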
2. Structure of an Inverted Index
2.1 Basic Structure
Term → Document Frequency → Posting List
2.2 Detailed Structure
Term → {
    DocFreq: N,
    Postings: [
        {DocID: 1, Frequency: 2, Positions: [0, 5]},
        {DocID: 3, Frequency: 1, Positions: [2]}
    ]
}
3. How an Inverted Index Works
3.1 Building Process
- Document Preprocessing: Tokenization, stop word removal, stemming
- Term Statistics: Count the frequency and position of each term in documents
- Index Construction: Establish mapping relationships from terms to documents
3.2 Query Process
- Query Parsing: Tokenize the query string
- Index Lookup: Look up each term in the inverted index
- Result Merging: Merge document lists for multiple terms
- Sorting and Returning: Sort results by relevance and return
4. Implementing an Inverted Index in Go
4.1 Data Structure Definition
package main

import (
    "fmt"
    "sort"
    "strings"
)

// Document represents a document with its ID and raw text.
type Document struct {
    ID   int
    Text string
}

// Posting records one term's occurrences within a single document.
type Posting struct {
    DocID     int
    Frequency int
    Positions []int
}

// InvertedIndexItem is one entry of the inverted index.
type InvertedIndexItem struct {
    Term     string
    DocFreq  int
    Postings []Posting
}

// InvertedIndex maps terms to their index entries.
type InvertedIndex struct {
    Index map[string]*InvertedIndexItem
}

// NewInvertedIndex creates an empty inverted index.
func NewInvertedIndex() *InvertedIndex {
    return &InvertedIndex{
        Index: make(map[string]*InvertedIndexItem),
    }
}
4.2 Index Construction
// AddDocument tokenizes a document and adds it to the index.
func (idx *InvertedIndex) AddDocument(docID int, text string) {
    // Simple whitespace tokenization (real applications need a more
    // sophisticated tokenizer).
    words := strings.Fields(strings.ToLower(text))
    for pos, word := range words {
        if idx.Index[word] == nil {
            idx.Index[word] = &InvertedIndexItem{
                Term:     word,
                DocFreq:  0,
                Postings: make([]Posting, 0),
            }
        }
        // Check whether a posting for this document already exists.
        var posting *Posting
        for i := range idx.Index[word].Postings {
            if idx.Index[word].Postings[i].DocID == docID {
                posting = &idx.Index[word].Postings[i]
                break
            }
        }
        if posting == nil {
            // Create a new posting.
            newPosting := Posting{
                DocID:     docID,
                Frequency: 1,
                Positions: []int{pos},
            }
            idx.Index[word].Postings = append(idx.Index[word].Postings, newPosting)
            idx.Index[word].DocFreq++
        } else {
            // Update the existing posting.
            posting.Frequency++
            posting.Positions = append(posting.Positions, pos)
        }
    }
}
4.3 Query Implementation
// Search returns the IDs of documents containing a single term.
func (idx *InvertedIndex) Search(term string) []int {
    term = strings.ToLower(term)
    if item, exists := idx.Index[term]; exists {
        docIDs := make([]int, len(item.Postings))
        for i, posting := range item.Postings {
            docIDs[i] = posting.DocID
        }
        return docIDs
    }
    return []int{}
}

// SearchAnd returns documents containing all of the given terms (AND).
func (idx *InvertedIndex) SearchAnd(terms []string) []int {
    if len(terms) == 0 {
        return []int{}
    }
    // Start with the results for the first term.
    result := idx.Search(terms[0])
    // Intersect with the results for each remaining term.
    for i := 1; i < len(terms); i++ {
        otherResult := idx.Search(terms[i])
        result = intersect(result, otherResult)
    }
    return result
}

// SearchOr returns documents containing any of the given terms (OR).
func (idx *InvertedIndex) SearchOr(terms []string) []int {
    if len(terms) == 0 {
        return []int{}
    }
    resultSet := make(map[int]bool)
    for _, term := range terms {
        docIDs := idx.Search(term)
        for _, docID := range docIDs {
            resultSet[docID] = true
        }
    }
    result := make([]int, 0, len(resultSet))
    for docID := range resultSet {
        result = append(result, docID)
    }
    sort.Ints(result)
    return result
}

// intersect returns the intersection of two int slices.
func intersect(a, b []int) []int {
    set := make(map[int]bool)
    for _, x := range a {
        set[x] = true
    }
    result := make([]int, 0)
    for _, x := range b {
        if set[x] {
            result = append(result, x)
        }
    }
    return result
}
4.4 Complete Example
func main() {
    // Create the inverted index.
    index := NewInvertedIndex()

    // Documents to index.
    documents := []Document{
        {ID: 1, Text: "Go is a programming language"},
        {ID: 2, Text: "Go is fast and efficient"},
        {ID: 3, Text: "Programming in Go is fun"},
        {ID: 4, Text: "Go language is simple"},
    }

    // Build the index.
    for _, doc := range documents {
        index.AddDocument(doc.ID, doc.Text)
    }

    // Query examples.
    fmt.Println("Search 'go':", index.Search("go"))
    fmt.Println("Search 'programming':", index.Search("programming"))
    fmt.Println("Search 'go' AND 'language':", index.SearchAnd([]string{"go", "language"}))
    fmt.Println("Search 'go' OR 'fast':", index.SearchOr([]string{"go", "fast"}))

    // Print the index structure.
    fmt.Println("\nInverted index structure:")
    for term, item := range index.Index {
        fmt.Printf("Term: %s, DocFreq: %d\n", term, item.DocFreq)
        for _, posting := range item.Postings {
            fmt.Printf("  DocID: %d, Frequency: %d, Positions: %v\n",
                posting.DocID, posting.Frequency, posting.Positions)
        }
    }
}
5. Optimizing Inverted Indexes
5.1 Compression Techniques
- Variable-Byte Encoding: Use variable-byte encoding to compress document IDs
- Differential Encoding: Store the difference in document IDs instead of absolute values
- Bitmap Compression: Use bitmaps to represent document sets
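The first two techniques combine naturally: store each doc ID as the gap from its predecessor, then pack the gaps as variable-length bytes. A minimal sketch in Go, using the standard library's encoding/binary varint helpers (the compressPostings/decompressPostings names are ours):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// compressPostings delta-encodes a sorted doc-ID list and packs
// the gaps as unsigned varints: gaps under 128 take one byte.
func compressPostings(docIDs []int) []byte {
	buf := make([]byte, 0, len(docIDs))
	prev := 0
	for _, id := range docIDs {
		buf = binary.AppendUvarint(buf, uint64(id-prev))
		prev = id
	}
	return buf
}

// decompressPostings reverses the encoding by summing the gaps.
func decompressPostings(data []byte) []int {
	var ids []int
	prev := 0
	for len(data) > 0 {
		gap, n := binary.Uvarint(data)
		data = data[n:]
		prev += int(gap)
		ids = append(ids, prev)
	}
	return ids
}

func main() {
	ids := []int{1000, 1003, 1010, 1100, 5000}
	packed := compressPostings(ids)
	// Five doc IDs fit in far fewer bytes than five raw int64s.
	fmt.Println(len(packed), decompressPostings(packed))
	// prints: 7 [1000 1003 1010 1100 5000]
}
```

Because consecutive doc IDs in a posting list are close together, most gaps are small and fit in a single byte, which is where the compression comes from.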
5.2 Query Optimization
- Skip Lists: Quickly locate positions in long lists
- Caching Mechanism: Cache results of popular queries
- Parallel Querying: Process queries with multiple threads
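As a baseline for these optimizations, here is a linear two-pointer intersection of two posting lists, assuming both are already sorted by doc ID; this is the merge that skip lists accelerate by letting a pointer jump forward in long lists (the function name is ours):

```go
package main

import "fmt"

// intersectSorted intersects two ascending doc-ID lists in
// O(len(a)+len(b)) time. A skip list would let the lagging
// pointer jump several entries at once instead of stepping by one.
func intersectSorted(a, b []int) []int {
	result := make([]int, 0)
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			result = append(result, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++ // a is behind: advance it
		default:
			j++ // b is behind: advance it
		}
	}
	return result
}

func main() {
	fmt.Println(intersectSorted([]int{1, 3, 4, 7, 9}, []int{2, 3, 7, 10}))
	// prints: [3 7]
}
```

Compared with the map-based intersect in section 4.3, this version allocates no hash set and preserves sorted order, which is why real engines keep posting lists sorted by doc ID.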
6. Practical Application Scenarios
6.1 Search Engines
- Core technology for search engines like Google and Baidu
- Indexing and retrieval of web page content
6.2 Database Systems
- Full-text search functionality
- Fast querying of text fields
6.3 Code Search
- GitHub code search
- Code navigation in IDEs
6.4 Log Analysis
- Fast retrieval of log files
- Locating error logs
7. Performance Analysis
7.1 Time Complexity
- Index Construction: O(N×M), where N is the number of documents and M is the average number of terms
- Single-term Query: O(1) on average
- Multi-term Query: roughly linear in the combined length of the posting lists being merged; skip pointers can reduce the work further
7.2 Space Complexity
- Storage Space: O(V×D), where V is the vocabulary size and D is the average document frequency
7.3 Advantages and Disadvantages
Advantages:
- Fast query speed
- Supports complex queries
- Easy to implement
Disadvantages:
- Time-consuming index construction
- Large storage space
- Complex index updates
8. Summary
The inverted index is a core technology in information retrieval. By reversing the "document-term" relationship to a "term-document" relationship, it enables efficient text search. In Go, we can implement an inverted index using basic data structures like maps and slices, providing powerful search capabilities for applications.
multi_match Usage Guide
multi_match is a query type in ES that performs searches across multiple fields simultaneously. It is essentially an extension of the match query across multiple fields. It is suitable for combined retrieval across multiple text fields like title, description, and tags, often used with field boosting, different query types, and analyzers.
1. Basic Usage
POST /index/_search
{
"query": {
"multi_match": {
"query": "iPhone 15",
"fields": ["title", "description", "tags"]
}
}
}
2. Field Boosting (boost)
POST /index/_search
{
"query": {
"multi_match": {
"query": "iPhone 15",
"fields": ["title^3", "description^1.5", "tags"]
}
}
}
Explanation: title^3 means that the match score for the title field is multiplied by a weight of 3, thereby boosting the score of hits on this field during sorting.
3. Type Option and Applicable Scenarios
- best_fields (default): Uses the score of the best-matching field as the primary score; can be combined with tie_breaker
POST /index/_search
{
"query": {
"multi_match": {
"query": "apple phone",
"fields": ["title", "description", "tags"],
"type": "best_fields",
"tie_breaker": 0.2
}
}
}
- most_fields: Scores from multiple fields are summed up, suitable for cases where the same semantic meaning is distributed across multiple fields (e.g., the same text split and stored in different fields)
POST /index/_search
{
"query": {
"multi_match": {
"query": "iphone",
"fields": ["title", "title.ngram", "description"],
"type": "most_fields"
}
}
}
- cross_fields: Treats multiple fields as one large field for matching, suitable for scenarios where terms are distributed across different fields (e.g., first_name + last_name)
POST /index/_search
{
"query": {
"multi_match": {
"query": "tim cook",
"fields": ["first_name", "last_name"],
"type": "cross_fields",
"operator": "and"
}
}
}
- phrase: Phrase matching, requires strict word order and proximity, suitable for exact phrase search
POST /index/_search
{
"query": {
"multi_match": {
"query": "iphone 15 pro",
"fields": ["title", "description"],
"type": "phrase"
}
}
}
- phrase_prefix: Phrase prefix matching, suitable for input method suggestions/search suggestions
POST /index/_search
{
"query": {
"multi_match": {
"query": "iph 15",
"fields": ["title", "description"],
"type": "phrase_prefix",
"max_expansions": 50
}
}
}
4. Operator and Minimum Match
POST /index/_search
{
"query": {
"multi_match": {
"query": "apple flagship phone",
"fields": ["title", "description"],
"operator": "and",
"minimum_should_match": "75%"
}
}
}
Explanation:
- operator: "and" requires all query terms to match; "or" (the default) matches on any single term
- minimum_should_match controls the minimum number or proportion of terms that must match, e.g. 2, 3<75%, 75%
5. Fuzzy Matching (fuzziness) and Correction
POST /index/_search
{
"query": {
"multi_match": {
"query": "iphine",
"fields": ["title", "description"],
"fuzziness": "AUTO",
"prefix_length": 1
}
}
}
Explanation: fuzziness: AUTO provides fault tolerance for common spelling errors; prefix_length specifies the length of the prefix that must match exactly.
6. Analyzer and Field Selection
POST /index/_search
{
"query": {
"multi_match": {
"query": "苹果 手机",
"fields": ["title", "title.keyword^5", "description"],
"analyzer": "ik_smart"
}
}
}
Suggestions:
- Mostly used for full-text search on text fields; for exact matching and aggregation/sorting, use keyword fields (optionally with boost)
- For Chinese search, analyzers like ik_smart and ik_max_word can be used (plugins must be installed)
7. Combined Example (Comprehensive Fields, Weights, Filtering, and Sorting)
POST /products/_search
{
"_source": ["id", "title", "price", "brand"],
"from": 0,
"size": 20,
"sort": [
{"_score": "desc"},
{"price": "asc"}
],
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "iphone 15 pro",
"fields": ["title^4", "subtitle^2", "description", "tags"],
"type": "best_fields",
"tie_breaker": 0.3,
"minimum_should_match": "66%"
}
}
],
"filter": [
{"term": {"brand": "apple"}},
{"range": {"price": {"gte": 3000, "lte": 10000}}}
]
}
},
"highlight": {
"fields": {
"title": {},
"description": {}
}
}
}
8. Common Issues and Suggestions
- Suboptimal relevance:
  - Set higher weights for core fields (e.g., title^N)
  - Choose the appropriate type: use cross_fields when query terms are spread across fields, most_fields to sum scores
  - Use synonyms, spell correction (fuzziness), and domain-specific dictionaries
- Performance issues:
  - Control the returned fields (_source filtering) and size
  - Put filter conditions in filter, which hits the cache and does not participate in scoring
  - Avoid wildcard/phrase_prefix prefix expansion across a huge number of fields
- Exact vs. full-text:
  - Use keyword for exact matching and aggregation; use text + analyzer for full-text search
  - Create multi-fields (text + keyword) for the same business field
term Query Explained
The term query is a query type in ES used for exact matching. It does not perform tokenization on the query term but directly matches it precisely with terms in the index. It is suitable for keyword type fields, numeric fields, date fields, etc. (No tokenization, no lowercasing).
1. Basic Usage
POST /products/_search
{
"query": {
"term": {
"status": "active"
}
}
}
2. Multi-field term Query
POST /products/_search
{
"query": {
"bool": {
"must": [
{"term": {"status": "active"}},
{"term": {"category": "electronics"}},
{"term": {"brand": "apple"}}
]
}
}
}
3. Exact Matching for Numeric Fields
POST /products/_search
{
"query": {
"term": {
"price": 5999
}
}
}
4. Exact Matching for Date Fields
POST /products/_search
{
"query": {
"term": {
"created_date": "2025-01-18"
}
}
}
5. Array Field Matching
POST /products/_search
{
"query": {
"term": {
"tags": "phone"
}
}
}
6. Using boost to Increase Weight
POST /products/_search
{
"query": {
"term": {
"status": {
"value": "active",
"boost": 2.0
}
}
}
}
7. terms Query (Multi-value Matching)
POST /products/_search
{
"query": {
"terms": {
"status": ["active", "pending", "review"]
}
}
}
8. Combined with filter
POST /products/_search
{
"query": {
"bool": {
"must": [
{"match": {"title": "iPhone"}}
],
"filter": [
{"term": {"status": "active"}},
{"term": {"category": "electronics"}}
]
}
}
}
term vs match Query Comparison
1. Core Differences
| Feature | term Query | match Query |
|---|---|---|
| Tokenization | No tokenization, exact match | Performs tokenization on query terms |
| Matching Method | Exact match against terms in the index | Analyzes the query and matches the resulting terms, with relevance scoring |
| Applicable Fields | keyword, numeric, date, etc. | text type fields |
| Performance | Faster (no relevance calculation) | Slower (requires score calculation) |
| Caching | Results can be cached | Results are usually not cached |
2. Practical Example Comparison
2.1 Different Results for the Same Query Term
# Data preparation (assumes the mapping gives title a keyword sub-field;
# sub-fields are defined in the mapping, not supplied in the document)
POST /test/_doc/1
{
"title": "iPhone 15 Pro Max",
"status": "active"
}
# term query - exact match
POST /test/_search
{
"query": {
"term": {
"title.keyword": "iPhone 15 Pro Max"
}
}
}
# Result: match
# term query - term on a text field (usually no match)
POST /test/_search
{
"query": {
"term": {
"title": "iPhone 15 Pro Max"
}
}
}
# Result: no match (title is tokenized into ["iphone", "15", "pro", "max"])
# match query - match on a text field
POST /test/_search
{
"query": {
"match": {
"title": "iPhone 15 Pro Max"
}
}
}
# Result: match, with a relevance score
2.2 Partial Match Comparison
# term query - a partial value does not match
POST /test/_search
{
"query": {
"term": {
"title.keyword": "iPhone 15"
}
}
}
# Result: no match (the whole value must be identical)
# match query - partial terms match
POST /test/_search
{
"query": {
"match": {
"title": "iPhone 15"
}
}
}
# Result: match, with a lower relevance score
3. Use Case Comparison
3.1 Applicable Scenarios for term Query
# 1. Status filtering
POST /products/_search
{
"query": {
"bool": {
"filter": [
{"term": {"status": "active"}}
]
}
}
}
# 2. Category filtering
POST /products/_search
{
"query": {
"bool": {
"filter": [
{"term": {"category": "electronics"}}
]
}
}
}
# 3. Tag matching
POST /products/_search
{
"query": {
"bool": {
"filter": [
{"term": {"tags": "premium"}}
]
}
}
}
# 4. Aggregation statistics
POST /products/_search
{
"size": 0,
"aggs": {
"status_count": {
"terms": {
"field": "status"
}
}
}
}
3.2 Applicable Scenarios for match Query
# 1. Full-text search
POST /products/_search
{
"query": {
"match": {
"title": "iPhone 15 Pro"
}
}
}
# 2. Description search
POST /products/_search
{
"query": {
"match": {
"description": "latest phone"
}
}
}
# 3. Multi-field search
POST /products/_search
{
"query": {
"multi_match": {
"query": "苹果手机",
"fields": ["title", "description", "tags"]
}
}
}
4. Performance Comparison
4.1 Query Performance
# term query - high performance
POST /products/_search
{
"query": {
"bool": {
"filter": [
{"term": {"status": "active"}},
{"term": {"category": "electronics"}}
]
}
}
}
# Characteristics: no relevance scoring; results can be cached
# match query - relatively slower
POST /products/_search
{
"query": {
"bool": {
"must": [
{"match": {"title": "iPhone"}},
{"match": {"description": "phone"}}
]
}
}
}
# Characteristics: requires relevance scoring; results are usually not cached
4.2 Hybrid Usage Optimization
# Best practice: term for filtering, match for searching
POST /products/_search
{
"query": {
"bool": {
"must": [
{"match": {"title": "iPhone 15"}}
],
"filter": [
{"term": {"status": "active"}},
{"term": {"category": "electronics"}},
{"range": {"price": {"gte": 1000, "lte": 10000}}}
]
}
}
}
5. Common Errors and Solutions
5.1 Using term Query on text Fields
# Incorrect usage
POST /products/_search
{
"query": {
"term": {
"title": "iPhone" # title is a text field and gets tokenized
}
}
}
# Correct usage
POST /products/_search
{
"query": {
"term": {
"title.keyword": "iPhone" # use the keyword sub-field
}
}
}
# Or use match
POST /products/_search
{
"query": {
"match": {
"title": "iPhone"
}
}
}
5.2 Case Sensitivity Issues
# term queries are case-sensitive
POST /products/_search
{
"query": {
"term": {
"status": "Active" # if the indexed value is "active", this does not match
}
}
}
# Solutions: keep case consistent at index time, add a lowercase
# normalizer to the keyword field, or (ES 7.10+) use the term
# query's case_insensitive parameter
POST /products/_search
{
"query": {
"term": {
"status": {
"value": "Active",
"case_insensitive": true
}
}
}
}
6. Best Practice Suggestions
6.1 Field Mapping Design
# Create a mapping that supports both query styles
PUT /products
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_smart",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"status": {
"type": "keyword"
},
"price": {
"type": "double"
}
}
}
}
6.2 Query Combination Strategy
# Recommended: exact filtering + full-text search
POST /products/_search
{
"query": {
"bool": {
"must": [
{"match": {"title": "user search terms"}}
],
"filter": [
{"term": {"status": "active"}},
{"term": {"category": "electronics"}},
{"range": {"price": {"gte": 1000, "lte": 10000}}}
]
}
},
"sort": [
{"_score": "desc"},
{"price": "asc"}
]
}
7. Summary
- term Query: Suitable for exact matching, filtering, aggregation; better performance, results can be cached
- match Query: Suitable for full-text search, fuzzy matching; supports relevance scoring
- Best Practice: Use term for filter conditions, match for search content; combine both
- Field Design: Create keyword sub-fields for fields requiring exact matching
- Performance Optimization: Place exact match conditions in the filter to avoid unnecessary scoring calculations
ES Mapping Concepts and Usage
1. What is Mapping
Mapping is the "structure definition" of an index, similar to a relational database table schema. It is used to declare the type and indexing method for each field, determining:
- Data type and storage format of fields (text, keyword, numeric, date, boolean, geo, nested, etc.)
- Whether a field participates in the inverted index and how it is tokenized (index, analyzer)
- Whether it can be used for aggregation/sorting (doc_values)
- Multi-field definition: indexing the same business field in multiple ways
- Dynamic field handling strategy (dynamic)
Since ES 7, an index has only one type (internally _doc), and modeling directly focuses on "index + mapping."
2. Common Field Types and Scenarios
- text: Tokenized, used for full-text search; not suitable for aggregation/sorting
- keyword: Not tokenized, suitable for exact matching, aggregation, sorting; has doc_values by default
- Numeric and date: integer/long/double/date, etc., suitable for range filtering, aggregation, and sorting
- Structured: object (flattened into the same document), nested (each object in an array is indexed independently, supporting independent sub-queries)
- Geographic: geo_point/geo_shape
Typical multi-fields (for both full-text and exact matching):
"title": {
"type": "text",
"analyzer": "ik_smart",
"fields": {
"keyword": { "type": "keyword", "ignore_above": 256 }
}
}
3. Create Index and Explicitly Set Mapping
PUT /products
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1
},
"mappings": {
"dynamic": "true",
"properties": {
"title": {
"type": "text",
"analyzer": "standard",
"fields": {
"keyword": { "type": "keyword", "ignore_above": 256 }
}
},
"price": { "type": "double" },
"status": { "type": "keyword" },
"createdAt": { "type": "date" },
"tags": { "type": "keyword" },
"attrs": { "type": "object" },
"specs": { "type": "nested" }
}
}
}
4. View/Update Mapping
- View Mapping
GET /products/_mapping
- Add Field (only new fields can be added, existing field types cannot be changed)
PUT /products/_mapping
{
"properties": {
"brand": { "type": "keyword" }
}
}
5. Correct Way to Modify Field Types (Reindex)
- Create a new index products_v2 and define the correct mapping
- Migrate the data:
POST /_reindex
{
"source": { "index": "products" },
"dest": { "index": "products_v2" }
}
- Switch traffic using an alias
POST /_aliases
{
"actions": [
{ "remove": { "index": "products", "alias": "products_read" }},
{ "add": { "index": "products_v2", "alias": "products_read" }}
]
}
6. Dynamic Mapping Strategy
"mappings": {
"dynamic": "strict",
"properties": { /* explicitly list fields; unknown fields will be rejected */ }
}
It is recommended to use strict for core indexes to prevent dirty data from being automatically inferred as incorrect types (e.g., treating numbers as text).
7. Performance and Practical Considerations
- Only enable index for fields that need to be searched/filtered; for display-only fields, set index: false
- Fields requiring aggregation/sorting should keep doc_values: true (text has no doc values)
- For Chinese scenarios, install the IK analyzer and specify an analyzer for text fields
- Use nested for arrays of objects to avoid the cross-matching that object causes
- Use multi-fields to support both full-text and exact matching simultaneously
In short: Mapping determines "how fields are stored, indexed, and searched." Before building an index, clarify your query and aggregation requirements, then design the mapping to achieve correct and high-performance retrieval results.
ES Analyzer Usage and Explanation
1. What is an Analyzer
An Analyzer is a component that performs "normalization → tokenization → filtering" on text fields during writing/searching, typically consisting of three parts:
- char_filter: Character-level preprocessing (e.g., removing HTML tags)
- tokenizer: Splits text into tokens, such as standard, whitespace, ik_smart
- filter: Further processes tokens (lowercasing, stop word removal, synonyms, stemming, etc.)
The field's analyzer is applied at index time; by default the same analyzer is applied at search time, but a different one can be specified via search_analyzer.
2. Commonly Used Built-in Analyzers
- standard (default): General-purpose tokenization with lowercasing
- simple: Splits on non-letters, lowercases
- whitespace: Splits only on whitespace, no case changes
- stop: simple plus stop-word removal
- keyword: No tokenization; the entire input is one token (keyword fields are often paired with a normalizer)
- pattern: Splits based on a regular expression
Common for Chinese: ik_smart, ik_max_word (require plugin installation).
3. Using _analyze to Test Tokenization Effects
POST /_analyze
{
"analyzer": "standard",
"text": "iPhone 15 Pro Max"
}
POST /_analyze
{
"analyzer": "ik_smart",
"text": "苹果手机保护壳"
}
4. Setting analyzer and search_analyzer on Fields
PUT /docs
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_smart",
"search_analyzer": "ik_max_word"
}
}
}
}
Explanation:
- Index with ik_smart and query with the finer-grained ik_max_word to improve recall.
5. Temporarily Specifying Analyzer During Query (Without Changing Mapping)
POST /docs/_search
{
"query": {
"match": {
"title": {
"query": "苹果手机",
"analyzer": "ik_max_word"
}
}
}
}
6. Custom Analyzer (Including Synonyms/Stop Words)
PUT /articles
{
"settings": {
"analysis": {
"filter": {
"my_synonyms": {
"type": "synonym",
"synonyms": ["iphone,苹果手机", "notebook,笔记本"]
}
},
"analyzer": {
"my_zh_analyzer": {
"type": "custom",
"char_filter": ["html_strip"],
"tokenizer": "ik_smart",
"filter": ["lowercase", "my_synonyms"]
}
}
}
},
"mappings": {
"properties": {
"content": { "type": "text", "analyzer": "my_zh_analyzer" }
}
}
}
7. Normalizer (Standardization for keyword fields)
keyword fields are not tokenized and cannot use an analyzer; if case normalization or punctuation removal is needed, a normalizer can be used:
PUT /users
{
"settings": {
"analysis": {
"normalizer": {
"lowercase_normalizer": {
"type": "custom",
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"properties": {
"email": { "type": "keyword", "normalizer": "lowercase_normalizer" }
}
}
}
8. IK Analyzer Installation and Field Example (Brief)
- Installation (match your ES version): bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/..., then restart ES
- Usage:
PUT /goods
{
"mappings": {
"properties": {
"title": { "type": "text", "analyzer": "ik_smart", "search_analyzer": "ik_max_word" }
}
}
}
9. Considerations for Changing Analyzers
- The analyzer for an existing field generally cannot be directly modified; a "reindex" process is required
- Different analyzers affect the inverted index structure; after changes, re-verify query semantics and relevance
10. Performance and Practice
- Choose the simplest analyzer that works for indexing (e.g., ik_smart) and a finer-grained one on the query side (ik_max_word) to improve recall
- Use _analyze to verify that tokenization meets expectations; frequent filter conditions should use keyword + normalizer
- Control the number of fields and the tokenization granularity to avoid index explosion; managing synonym lists externally makes updates easier
ES Terminology and Glossary
The following concepts are categorized by topic for quick understanding and reference.
Indexing and Document Modeling
- Index: A logical container for a collection of documents, similar to a database. Internally composed of multiple shards
- Document: A record, stored as JSON, uniquely identified by
_id - Field: A document attribute, determining the available query and aggregation methods
- Mapping: Definition of field types and indexing strategies, equivalent to a table schema
- Type: A logical "table" concept in 6.x and below; fixed to _doc from 7.x, hidden externally in 8.x
- Keyword: Not tokenized, suitable for exact matching, aggregation/sorting, usually has doc_values
- Multi-fields: Indexing the same field in multiple ways, such as title and title.keyword
- Nested: Nested object, each array element is indexed independently, avoiding cross-matching, and supporting independent sub-queries
- Dynamic mapping: Strategy for when unknown fields appear (true/false/strict)
Tokenization and Normalization
- Analyzer: Runs three stages: char_filter → tokenizer → filter
- Tokenizer: Splits text into tokens, such as standard, whitespace, ik_smart
- Token (lexeme/term): The basic unit in an inverted index
- Char filter: Character-level preprocessing, such as html_strip
- Token filter: Further processes tokens, such as lowercase, synonym, stop
- Normalizer: Standardization for keyword fields (lowercasing, accent removal, etc.), no tokenization
Inverted Index and Scoring
- Inverted index: Index structure of term → document list (postings)
- Term: A term in the index (a token after standardization/tokenization)
- Posting: Document occurrence information, including docID, frequency, position, etc.
- Relevance score: Used for sorting
- BM25: Default relevance model (replaces TF-IDF)
- Query vs Filter: Query participates in scoring, Filter only performs boolean filtering and can be cached
- Bool query: Combination query with must/should/must_not/filter
Storage and Segments
- Segment: Immutable data segment, generated by appending writes; merging reduces the number of segments
- Refresh: Flushes in-memory increments to new segments, default cycle 1s, visible after refresh
- Flush: Persists the translog and creates a new commit point
- Translog: Write-ahead log, used for crash recovery
- Doc values: Columnar storage, supports aggregation/sorting/scripting; text has no doc values
- _source: The original JSON document, stored by default, used for re-fetching and reindexing
- Stored fields: Separately stored fields (less common), distinct from _source
- Norms: Field-length normalization and other scoring factors; can be disabled to save space
Cluster and Shards
- Cluster: An ES cluster composed of multiple nodes
- Node: An instance in the cluster; common roles: master, data, ingest, coordinating
- Replica: A copy of a primary shard, enhancing high availability and query throughput
- Routing: Determines which primary shard a document lands on based on the routing value; defaults to a hash of _id
Writes and Batch Processing
- Bulk API: Batch write/update/delete
- Update by query: Batch update by condition
- Delete by query: Batch delete by condition
- Reindex: Copy from source index to target index (often used for mapping changes)
- Ingest pipeline: Pre-write processing pipeline (grok, rename, set, script, etc.)
- Painless: ES built-in scripting language, used for script updates, script sorting, etc.
Search and Pagination
- Match: Full-text query, tokenized
- Term/Terms: Exact match, not tokenized
- Range: Range query (numeric/date)
- Multi-match: Multi-field full-text query
- Nested query: Sub-query against nested fields
- Highlight: Highlight matching snippets
- Suggesters: Search suggestions (term/phrase/completion)
- From/size: Basic pagination, deep pagination is costly
- Search after: Cursor-based pagination, replaces deep pagination
- Scroll: Snapshot-style cursor for large-volume exports, not for real-time queries
- PIT (Point in time): Point-in-time consistent snapshot, used for stable pagination
Lifecycle and Index Management
- ILM (Index Lifecycle Management): Hot/Warm/Cold/Delete lifecycle policies
- Rollover: Switches to a new index based on size/document count/time
- Snapshot/Restore: Snapshot and recovery (repositories can integrate with S3, HDFS, etc.)
Operations and Performance
- Cluster health: (green/yellow/red)
- Refresh interval: Refresh cycle, can be increased in write-heavy scenarios to speed up writes
- Replicas: Number of replicas affects query throughput and write cost
- Force merge: Merges segments for read-only indexes, reducing file count and improving query performance
- Slow logs: Slow query and slow indexing logs, used for troubleshooting and optimization
The above terms cover high-frequency concepts in ES daily modeling, writing, retrieval, aggregation, and operational optimization. Combined with the preceding chapters on Mapping, Analyzer, term/match, and multi_match, they form a complete knowledge graph for ES usage.