Go Engineer Structured Course 011 [Learning Notes]

Inverted Index for Queries

1. What is an Inverted Index?

An Inverted Index is a data structure used to quickly find documents containing specific terms. It is one of the core technologies of search engines.

1.1 Basic Concepts

  • Forward Index: Document ID → Document Content (list of terms)
  • Inverted Index: Term → List of Document IDs containing the term

1.2 Why is it called "Inverted"?

An Inverted Index reverses the traditional relationship of "which terms a document contains" to "in which documents a term appears", hence the name "inverted".

2. Structure of an Inverted Index

2.1 Basic Structure

Term → Document Frequency → Posting List

2.2 Detailed Structure

Term → {
    DocFreq: N,
    Postings: [
        {DocID: 1, Frequency: 2, Positions: [0, 5]},
        {DocID: 3, Frequency: 1, Positions: [2]}
    ]
}
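
For example, after indexing document 1 ("go is fast go") and document 3 ("learn go"), the entry for the term "go" looks like:

go → {
    DocFreq: 2,
    Postings: [
        {DocID: 1, Frequency: 2, Positions: [0, 3]},
        {DocID: 3, Frequency: 1, Positions: [1]}
    ]
}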

3. How an Inverted Index Works

3.1 Building Process

  1. Document Preprocessing: Tokenization, stop word removal, stemming
  2. Term Statistics: Count the frequency and position of each term in documents
  3. Index Construction: Establish the mapping relationship from terms to documents

3.2 Query Process

  1. Query Parsing: Tokenize the query string
  2. Index Lookup: Search for each term in the inverted index
  3. Result Merging: Merge document lists for multiple terms
  4. Sorted Return: Return results sorted by relevance

4. Implementing an Inverted Index in Go

4.1 Data Structure Definition

package main

import (
    "fmt"
    "sort"
    "strings"
)

// Document holds a document's ID and raw text.
type Document struct {
    ID   int
    Text string
}

// Posting records where and how often a term occurs in one document.
type Posting struct {
    DocID     int
    Frequency int
    Positions []int
}

// InvertedIndexItem is one entry of the inverted index.
type InvertedIndexItem struct {
    Term      string
    DocFreq   int
    Postings  []Posting
}

// InvertedIndex maps terms to their postings.
type InvertedIndex struct {
    Index map[string]*InvertedIndexItem
}

// NewInvertedIndex creates an empty inverted index.
func NewInvertedIndex() *InvertedIndex {
    return &InvertedIndex{
        Index: make(map[string]*InvertedIndexItem),
    }
}

4.2 Index Construction

// AddDocument tokenizes text and adds its terms to the index.
func (idx *InvertedIndex) AddDocument(docID int, text string) {
    // Simple whitespace tokenization (real applications need a proper tokenizer).
    words := strings.Fields(strings.ToLower(text))

    for pos, word := range words {
        if idx.Index[word] == nil {
            idx.Index[word] = &InvertedIndexItem{
                Term:     word,
                DocFreq:  0,
                Postings: make([]Posting, 0),
            }
        }

        // Check whether a posting for this document already exists.
        var posting *Posting
        for i := range idx.Index[word].Postings {
            if idx.Index[word].Postings[i].DocID == docID {
                posting = &idx.Index[word].Postings[i]
                break
            }
        }

        if posting == nil {
            // Create a new posting.
            newPosting := Posting{
                DocID:     docID,
                Frequency: 1,
                Positions: []int{pos},
            }
            idx.Index[word].Postings = append(idx.Index[word].Postings, newPosting)
            idx.Index[word].DocFreq++
        } else {
            // Update the existing posting.
            posting.Frequency++
            posting.Positions = append(posting.Positions, pos)
        }
    }
}

4.3 Query Implementation

// Search returns the IDs of documents containing term.
func (idx *InvertedIndex) Search(term string) []int {
    term = strings.ToLower(term)
    if item, exists := idx.Index[term]; exists {
        docIDs := make([]int, len(item.Postings))
        for i, posting := range item.Postings {
            docIDs[i] = posting.DocID
        }
        return docIDs
    }
    return []int{}
}

// SearchAnd returns documents containing all of the given terms (AND).
func (idx *InvertedIndex) SearchAnd(terms []string) []int {
    if len(terms) == 0 {
        return []int{}
    }

    // Start with the results for the first term.
    result := idx.Search(terms[0])

    // Intersect with the results for each remaining term.
    for i := 1; i < len(terms); i++ {
        otherResult := idx.Search(terms[i])
        result = intersect(result, otherResult)
    }

    return result
}

// SearchOr returns documents containing any of the given terms (OR).
func (idx *InvertedIndex) SearchOr(terms []string) []int {
    if len(terms) == 0 {
        return []int{}
    }

    resultSet := make(map[int]bool)

    for _, term := range terms {
        docIDs := idx.Search(term)
        for _, docID := range docIDs {
            resultSet[docID] = true
        }
    }

    result := make([]int, 0, len(resultSet))
    for docID := range resultSet {
        result = append(result, docID)
    }

    sort.Ints(result)
    return result
}

// intersect returns the intersection of two int slices.
func intersect(a, b []int) []int {
    set := make(map[int]bool)
    for _, x := range a {
        set[x] = true
    }

    result := make([]int, 0)
    for _, x := range b {
        if set[x] {
            result = append(result, x)
        }
    }

    return result
}

4.4 Complete Example

func main() {
    // Create the inverted index.
    index := NewInvertedIndex()

    // Sample documents.
    documents := []Document{
        {ID: 1, Text: "Go is a programming language"},
        {ID: 2, Text: "Go is fast and efficient"},
        {ID: 3, Text: "Programming in Go is fun"},
        {ID: 4, Text: "Go language is simple"},
    }

    // Build the index.
    for _, doc := range documents {
        index.AddDocument(doc.ID, doc.Text)
    }

    // Query examples.
    fmt.Println("Search 'go':", index.Search("go"))
    fmt.Println("Search 'programming':", index.Search("programming"))
    fmt.Println("Search 'go' AND 'language':", index.SearchAnd([]string{"go", "language"}))
    fmt.Println("Search 'go' OR 'fast':", index.SearchOr([]string{"go", "fast"}))

    // Print the index structure.
    fmt.Println("\nInverted index structure:")
    for term, item := range index.Index {
        fmt.Printf("Term: %s, DocFreq: %d\n", term, item.DocFreq)
        for _, posting := range item.Postings {
            fmt.Printf("  DocID: %d, Frequency: %d, Positions: %v\n",
                posting.DocID, posting.Frequency, posting.Positions)
        }
    }
}

5. Optimizing Inverted Indexes

5.1 Compression Techniques

  • Variable-length encoding: Compress document IDs using variable-length encoding
  • Differential encoding: Store the difference in document IDs instead of absolute values
  • Bitmap compression: Use bitmaps to represent document sets
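
The first two techniques combine naturally: store the gaps between sorted docIDs instead of absolute values, then varint-encode the gaps so small gaps fit in a single byte. A minimal Go sketch using the standard library's varint routines (the function names encodePostings/decodePostings are illustrative, not from any indexing library):

```go
package main

import (
    "encoding/binary"
    "fmt"
)

// encodePostings delta-encodes a sorted docID list and varint-encodes the gaps:
// small gaps compress to a single byte each.
func encodePostings(docIDs []int) []byte {
    buf := make([]byte, 0, len(docIDs))
    tmp := make([]byte, binary.MaxVarintLen64)
    prev := 0
    for _, id := range docIDs {
        n := binary.PutUvarint(tmp, uint64(id-prev)) // encode the gap, not the absolute ID
        buf = append(buf, tmp[:n]...)
        prev = id
    }
    return buf
}

// decodePostings reverses the encoding back to absolute docIDs.
func decodePostings(data []byte) []int {
    var out []int
    prev := 0
    for len(data) > 0 {
        gap, n := binary.Uvarint(data)
        data = data[n:]
        prev += int(gap)
        out = append(out, prev)
    }
    return out
}

func main() {
    ids := []int{1000, 1003, 1007, 1100}
    enc := encodePostings(ids)
    fmt.Println(len(enc), decodePostings(enc)) // prints: 5 [1000 1003 1007 1100]
}
```

Four 8-byte integers shrink to 5 bytes here because three of the gaps are below 128.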

5.2 Query Optimization

  • Skip lists: Quickly locate positions in long lists
  • Caching mechanism: Cache popular query results
  • Parallel querying: Process queries using multiple threads
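
The skip-list idea can be approximated on plain sorted slices: when one list lags far behind the other, jump ahead in fixed strides instead of one element at a time. A simplified sketch (a real skip list keeps explicit skip pointers; the stride of 4 here is arbitrary):

```go
package main

import "fmt"

// intersectSorted intersects two sorted posting lists, skipping ahead in
// strides of 4 when it is safe to do so -- a stand-in for skip pointers.
func intersectSorted(a, b []int) []int {
    var out []int
    i, j := 0, 0
    for i < len(a) && j < len(b) {
        switch {
        case a[i] == b[j]:
            out = append(out, a[i])
            i++
            j++
        case a[i] < b[j]:
            // everything up to a[i+4] is still below b[j], so skip it wholesale
            for i+4 < len(a) && a[i+4] < b[j] {
                i += 4
            }
            i++
        default:
            for j+4 < len(b) && b[j+4] < a[i] {
                j += 4
            }
            j++
        }
    }
    return out
}

func main() {
    a := []int{1, 3, 5, 7, 9, 11, 13, 15, 17}
    b := []int{9, 17, 25}
    fmt.Println(intersectSorted(a, b)) // prints: [9 17]
}
```

On long, skewed lists this avoids touching most elements of the longer list, which is exactly what skip pointers buy in a production index.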

6. Practical Application Scenarios

6.1 Search Engines

  • Core technology for search engines like Google and Baidu
  • Web content indexing and retrieval

6.2 Database Systems

  • Full-text search functionality
  • Fast querying of text fields

6.3 Code Search

  • GitHub code search
  • Code navigation in IDEs

6.4 Log Analysis

  • Fast retrieval of log files
  • Locating error logs

7. Performance Analysis

7.1 Time Complexity

  • Index Construction: O(N×M), where N is the number of documents and M is the average number of terms
  • Single-term Query: O(1) on average
  • Multi-term Query: proportional to the combined length of the posting lists being intersected or merged

7.2 Space Complexity

  • Storage Space: O(V×D), where V is the vocabulary size and D is the average document frequency

7.3 Pros and Cons

Pros:

  • Fast query speed
  • Supports complex queries
  • Easy to implement

Cons:

  • Time-consuming index construction
  • Large storage space
  • Complex index updates

8. Summary

The inverted index is a core technology in information retrieval. By reversing the "document-term" relationship to a "term-document" relationship, it achieves efficient text search. In Go language, we can use basic data structures like maps and slices to implement an inverted index, providing powerful search capabilities for applications.

multi_match Usage Guide

multi_match is a query type in ES that searches across multiple fields simultaneously, essentially an extension of the match query to multiple fields. It is suitable for combined retrieval across multiple text fields like title, description, and tags, often used with field boosting, different query types, and analyzers.

1. Basic Usage

POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "iPhone 15",
      "fields": ["title", "description", "tags"]
    }
  }
}

2. Field Boosting (boost)

POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "iPhone 15",
      "fields": ["title^3", "description^1.5", "tags"]
    }
  }
}

Explanation: title^3 means the match score for the title field is multiplied by a weight of 3, thereby boosting the score of results hitting this field during sorting.

3. type Option and Applicable Scenarios

  • best_fields (default): Takes the single best-matching field's score as the primary score; tie_breaker adds a fraction of the other fields' scores
POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "apple phone",
      "fields": ["title", "description", "tags"],
      "type": "best_fields",
      "tie_breaker": 0.2
    }
  }
}
  • most_fields: Scores from multiple fields are summed, suitable when the same semantic meaning is distributed across multiple fields (e.g., a single text split and stored in different fields)
POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "iphone",
      "fields": ["title", "title.ngram", "description"],
      "type": "most_fields"
    }
  }
}
  • cross_fields: Treats multiple fields as one large field for matching, suitable for scenarios where terms are distributed across different fields (e.g., first_name + last_name)
POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "tim cook",
      "fields": ["first_name", "last_name"],
      "type": "cross_fields",
      "operator": "and"
    }
  }
}
  • phrase: Phrase matching, requires strict word order and proximity, suitable for exact phrase searches
POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "iphone 15 pro",
      "fields": ["title", "description"],
      "type": "phrase"
    }
  }
}
  • phrase_prefix: Phrase prefix matching, suitable for input method suggestions/search suggestions
POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "iph 15",
      "fields": ["title", "description"],
      "type": "phrase_prefix",
      "max_expansions": 50
    }
  }
}

4. Operator and Minimum Match

POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "apple flagship phone",
      "fields": ["title", "description"],
      "operator": "and",
      "minimum_should_match": "75%"
    }
  }
}

Explanation:

  • operator: and requires all query terms to match; or (default) matches any one term
  • minimum_should_match controls the minimum number or proportion of query terms that must match, e.g. 2, 75%, or conditional forms like 3<75% (apply 75% only when there are more than 3 terms)

5. Fuzziness and Correction

POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "iphine",
      "fields": ["title", "description"],
      "fuzziness": "AUTO",
      "prefix_length": 1
    }
  }
}

Explanation: fuzziness: AUTO provides fault tolerance for common spelling errors; prefix_length specifies the length of the prefix that must match exactly.

6. Analyzer and Field Selection

POST /index/_search
{
  "query": {
    "multi_match": {
      "query": "苹果 手机",
      "fields": ["title", "title.keyword^5", "description"],
      "analyzer": "ik_smart"
    }
  }
}

Suggestions:

  • Mostly used for text fields for full-text retrieval; exact matching and aggregation/sorting use keyword fields (can be combined with boost)
  • For Chinese retrieval, ik_smart, ik_max_word and other analyzers can be used (requires plugin installation)

7. Combined Example (Comprehensive fields, weights, filtering, and sorting)

POST /products/_search
{
  "_source": ["id", "title", "price", "brand"],
  "from": 0,
  "size": 20,
  "sort": [
    {"_score": "desc"},
    {"price": "asc"}
  ],
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "iphone 15 pro",
            "fields": ["title^4", "subtitle^2", "description", "tags"],
            "type": "best_fields",
            "tie_breaker": 0.3,
            "minimum_should_match": "66%"
          }
        }
      ],
      "filter": [
        {"term": {"brand": "apple"}},
        {"range": {"price": {"gte": 3000, "lte": 10000}}}
      ]
    }
  },
  "highlight": {
    "fields": {
      "title": {},
      "description": {}
    }
  }
}

8. Common Issues and Suggestions

  • Relevance is not ideal:
      • Give core fields higher weights (e.g., title^N)
      • Choose the appropriate type: cross_fields when terms are spread across fields, most_fields to sum field scores
      • Use synonyms, spell correction (fuzziness), and domain-specific dictionaries
  • Performance issues:
      • Limit returned fields (_source filtering) and size
      • Put filter conditions in filter: they hit the cache and skip scoring
      • Avoid wildcard/phrase_prefix prefix expansion across a huge number of fields
  • Exact vs. full-text:
      • Use keyword for exact matching and aggregation; text + analyzer for full-text retrieval
      • Create multi-fields (text + keyword) on the same business field

term Query Explained

The term query is a query type in ES used for exact matching. It does not perform tokenization on the query term but directly performs an exact match with the terms in the index. It is suitable for keyword type fields, numeric fields, date fields, etc. (No tokenization, no lowercasing).

1. Basic Usage

POST /products/_search
{
  "query": {
    "term": {
      "status": "active"
    }
  }
}

2. Multi-field term Query

POST /products/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {"status": "active"}},
        {"term": {"category": "electronics"}},
        {"term": {"brand": "apple"}}
      ]
    }
  }
}

3. Exact Matching for Numeric Fields

POST /products/_search
{
  "query": {
    "term": {
      "price": 5999
    }
  }
}

4. Exact Matching for Date Fields

POST /products/_search
{
  "query": {
    "term": {
      "created_date": "2025-01-18"
    }
  }
}

5. Array Field Matching

POST /products/_search
{
  "query": {
    "term": {
      "tags": "phone"
    }
  }
}

6. Using boost to Increase Weight

POST /products/_search
{
  "query": {
    "term": {
      "status": {
        "value": "active",
        "boost": 2.0
      }
    }
  }
}

7. terms Query (Multi-value Matching)

POST /products/_search
{
  "query": {
    "terms": {
      "status": ["active", "pending", "review"]
    }
  }
}

8. Combined with filter

POST /products/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "iPhone"}}
      ],
      "filter": [
        {"term": {"status": "active"}},
        {"term": {"category": "electronics"}}
      ]
    }
  }
}

term vs match Query Comparison

1. Core Differences

  • Tokenization: term does not tokenize the query (exact match); match tokenizes the query term
  • Matching method: term matches terms in the index exactly; match performs analyzed matching with relevance scoring
  • Applicable fields: term suits keyword, numeric, date, etc.; match suits text fields
  • Performance: term is faster (no relevance calculation); match is slower (scores must be computed)
  • Caching: term results can be cached (as filters); match results usually are not

2. Practical Example Comparison

2.1 Different Results for the Same Query Term

# Prepare data (title is assumed to be mapped as text with a keyword
# sub-field; the sub-field is generated automatically from the mapping)
POST /test/_doc/1
{
  "title": "iPhone 15 Pro Max",
  "status": "active"
}

# term query - exact match
POST /test/_search
{
  "query": {
    "term": {
      "title.keyword": "iPhone 15 Pro Max"
    }
  }
}
# Result: match

# term query on a text field (usually does not match)
POST /test/_search
{
  "query": {
    "term": {
      "title": "iPhone 15 Pro Max"
    }
  }
}
# Result: no match (title is tokenized into ["iphone", "15", "pro", "max"])

# match query on a text field
POST /test/_search
{
  "query": {
    "match": {
      "title": "iPhone 15 Pro Max"
    }
  }
}
# Result: match, with a relevance score

2.2 Partial Match Comparison

# term query - a partial value does not match
POST /test/_search
{
  "query": {
    "term": {
      "title.keyword": "iPhone 15"
    }
  }
}
# Result: no match (requires the exact full value)

# match query - partial terms match
POST /test/_search
{
  "query": {
    "match": {
      "title": "iPhone 15"
    }
  }
}
# Result: match, with a lower relevance score

3. Usage Scenario Comparison

3.1 term Query Applicable Scenarios

# 1. Status filtering
POST /products/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"status": "active"}}
      ]
    }
  }
}

# 2. Category filtering
POST /products/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"category": "electronics"}}
      ]
    }
  }
}

# 3. Tag matching
POST /products/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"tags": "premium"}}
      ]
    }
  }
}

# 4. Aggregation statistics
POST /products/_search
{
  "size": 0,
  "aggs": {
    "status_count": {
      "terms": {
        "field": "status"
      }
    }
  }
}

3.2 match Query Applicable Scenarios

# 1. Full-text search
POST /products/_search
{
  "query": {
    "match": {
      "title": "iPhone 15 Pro"
    }
  }
}

# 2. Description search
POST /products/_search
{
  "query": {
    "match": {
      "description": "latest phone"
    }
  }
}

# 3. Multi-field search
POST /products/_search
{
  "query": {
    "multi_match": {
      "query": "苹果手机",
      "fields": ["title", "description", "tags"]
    }
  }
}

4. Performance Comparison

4.1 Query Performance

# term query - high performance
POST /products/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"status": "active"}},
        {"term": {"category": "electronics"}}
      ]
    }
  }
}
# Characteristics: no relevance calculation; results can be cached

# match query - relatively slower
POST /products/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "iPhone"}},
        {"match": {"description": "手机"}}
      ]
    }
  }
}
# Characteristics: relevance scores must be computed; results usually are not cached

4.2 Mixed Usage Optimization

# Best practice: term for filtering, match for searching
POST /products/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "iPhone 15"}}
      ],
      "filter": [
        {"term": {"status": "active"}},
        {"term": {"category": "electronics"}},
        {"range": {"price": {"gte": 1000, "lte": 10000}}}
      ]
    }
  }
}

5. Common Errors and Solutions

5.1 Using term Query on text Fields

# Incorrect usage
POST /products/_search
{
  "query": {
    "term": {
      "title": "iPhone"  # title is a text field and gets tokenized
    }
  }
}

# Correct usage
POST /products/_search
{
  "query": {
    "term": {
      "title.keyword": "iPhone"  # use the keyword sub-field
    }
  }
}

# Or use match instead
POST /products/_search
{
  "query": {
    "match": {
      "title": "iPhone"
    }
  }
}

5.2 Case Sensitivity Issues

# term queries are case-sensitive
POST /products/_search
{
  "query": {
    "term": {
      "status": "Active"  # does not match if the index stores "active"
    }
  }
}

# Solution: keep casing consistent, or define a lowercase normalizer on the
# keyword field; note that match normalizes the query only for analyzed text fields
POST /products/_search
{
  "query": {
    "match": {
      "status": "Active"  # helps only if status is analyzed text or has a lowercase normalizer
    }
  }
}

6. Best Practice Recommendations

6.1 Field Mapping Design

# Mapping that supports both query styles
PUT /products
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_smart",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "status": {
        "type": "keyword"
      },
      "price": {
        "type": "double"
      }
    }
  }
}

6.2 Query Combination Strategy

# Recommended: exact filtering + full-text search
POST /products/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "user search terms"}}
      ],
      "filter": [
        {"term": {"status": "active"}},
        {"term": {"category": "electronics"}},
        {"range": {"price": {"gte": 1000}}}
      ]
    }
  },
  "sort": [
    {"_score": "desc"},
    {"price": "asc"}
  ]
}

7. Summary

  • term query: Suitable for exact matching, filtering, aggregation; better performance, results can be cached
  • match query: Suitable for full-text search, fuzzy matching; supports relevance scoring
  • Best practice: term for filter conditions, match for search content; use both in combination
  • Field design: Create keyword sub-fields for fields requiring exact matching
  • Performance optimization: Place exact matching conditions in the filter to avoid unnecessary scoring calculations

ES Mapping Concepts and Usage

1. What is Mapping

Mapping is the "structure definition" of an index, similar to the schema of a relational database table. It is used to declare the type and indexing method of each field, determining:

  • The data type and storage format of the field (text, keyword, numeric, date, boolean, geo, nested, etc.)
  • Whether it participates in the inverted index and how it is tokenized (index, analyzer)
  • Whether it can be used for aggregation/sorting (doc_values)
  • Multi-field definition: indexing the same business field in multiple ways
  • Dynamic field processing strategy (dynamic)

Since ES 7, an index has a single type (the internal _doc), so data modeling deals directly with "index + mapping".

2. Common Field Types and Scenarios

  • text: Tokenized, used for full-text retrieval; not suitable for aggregation/sorting
  • keyword: Not tokenized, suitable for exact matching, aggregation, sorting; has doc_values by default
  • Numeric and date: integer/long/double/date, etc., suitable for range filtering, aggregation, and sorting
  • Structured: object (flat object within the same document), nested (each object in an array is modeled independently, supporting independent sub-queries)
  • Geographic: geo_point/geo_shape

Typical multi-fields (for both full-text and exact matching):

"title": {
  "type": "text",
  "analyzer": "ik_smart",
  "fields": {
    "keyword": { "type": "keyword", "ignore_above": 256 }
  }
}

3. Creating an Index and Explicitly Setting Mapping

PUT /products
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "dynamic": "true",
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "price": { "type": "double" },
      "status": { "type": "keyword" },
      "createdAt": { "type": "date" },
      "tags": { "type": "keyword" },
      "attrs": { "type": "object" },
      "specs": { "type": "nested" }
    }
  }
}

4. Viewing/Updating Mapping

  • View Mapping
GET /products/_mapping
  • Add Fields (only new fields can be added, existing field types cannot be changed)
PUT /products/_mapping
{
  "properties": {
    "brand": { "type": "keyword" }
  }
}

5. Correct Way to Change Field Type (Reindexing)

  1. Create a new index products_v2 with the correct mapping
  2. Migrate data
POST /_reindex
{
  "source": { "index": "products" },
  "dest":   { "index": "products_v2" }
}
  3. Switch traffic using aliases
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products", "alias": "products_read" }},
    { "add":    { "index": "products_v2", "alias": "products_read" }}
  ]
}

6. Dynamic Mapping Strategy

"mappings": {
  "dynamic": "strict",
  "properties": { /* explicitly list fields; unknown fields are rejected */ }
}

It is recommended to use strict for core indexes to prevent dirty data from being automatically inferred into incorrect types (e.g., treating numeric values as text).

7. Performance and Practice Essentials

  • Only enable index for fields that need to be searched/filtered; for purely display fields, index: false can be used
  • Fields that need aggregation/sorting should keep doc_values: true (text has no doc_values)
  • For Chinese scenarios, install the IK analyzer and specify the analyzer for text fields
  • Use nested for nested arrays to avoid cross-matching caused by object
  • Use multi-fields to support both full-text and exact matching simultaneously

In a nutshell: Mapping determines "how fields are stored, indexed, and queried". Before building an index, clarify your query and aggregation requirements, then design the mapping to achieve correct and high-performance retrieval.

ES Analyzer Usage and Explanation

1. What is an Analyzer

An Analyzer is a component that "normalizes → tokenizes → filters" text fields during writing/searching, typically consisting of three parts:

  • char_filter: Character-level preprocessing (e.g., removing HTML tags)
  • tokenizer: Splits text into tokens (lexemes), such as standard, whitespace, ik_smart
  • filter: Further processes tokens (lowercasing, stop word removal, synonyms, stemming, etc.)

The field's analyzer is used during the writing phase, and the same analyzer is used by default during the search phase, but can be specified separately via search_analyzer.

2. Built-in Common Analyzers

  • standard (default): General English tokenization, lowercasing
  • simple: Splits by non-letters, lowercasing
  • whitespace: Splits only by whitespace, preserves case
  • stop: Based on simple, removes stop words
  • keyword: No tokenization, treats the entire input as a single token (mostly used by normalizer for keyword fields)
  • pattern: Splits based on regular expressions

Commonly used for Chinese: ik_smart, ik_max_word (requires plugin installation).

3. Using _analyze to Test Tokenization Effect

POST /_analyze
{
  "analyzer": "standard",
  "text": "iPhone 15 Pro Max"
}

POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "苹果手机保护壳"
}

4. Setting analyzer and search_analyzer on Fields

PUT /docs
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_smart",
        "search_analyzer": "ik_max_word"
      }
    }
  }
}

Explanation:

  • Use ik_smart during writing, and a finer-grained ik_max_word during querying to improve recall.

5. Temporarily Specifying an Analyzer During Query (Without Changing Mapping)

POST /docs/_search
{
  "query": {
    "match": {
      "title": {
        "query": "苹果手机",
        "analyzer": "ik_max_word"
      }
    }
  }
}

6. Custom Analyzer (Including Synonyms/Stop Words)

PUT /articles
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["iphone,苹果手机", "notebook,笔记本"]
        }
      },
      "analyzer": {
        "my_zh_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "ik_smart",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "my_zh_analyzer" }
    }
  }
}

7. Normalizer (Standardization for keyword)

keyword fields are not tokenized and cannot use an analyzer; if lowercasing or punctuation removal is needed, a normalizer can be used:

PUT /users
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": { "type": "keyword", "normalizer": "lowercase_normalizer" }
    }
  }
}

8. IK Analyzer Installation and Field Example (Brief)

  1. Installation (pick the release matching your ES version): bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/..., then restart ES
  2. Usage:
PUT /goods
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "ik_smart", "search_analyzer": "ik_max_word" }
    }
  }
}

9. Considerations for Changing Analyzers

  • The analyzer of an existing field generally cannot be directly modified; a "reindex" process is required
  • Different analyzers will affect the inverted index structure, so re-verify query semantics and relevance after changes

10. Performance and Practice

  • Choose the simplest possible analyzer for writing (e.g., ik_smart), and a finer-grained one for querying (ik_max_word) to improve recall
  • Use _analyze to verify if tokenization meets expectations; frequent filter conditions should use keyword + normalizer
  • Control the number of fields and tokenization granularity to avoid index explosion; manage synonym lists externally for easy updates

ES Terms and Glossary Quick Reference

The following concepts are categorized by topic for quick understanding and reference.

Index and Document Modeling

  • Index: A logical container for a collection of documents, similar to a database. Internally composed of multiple shards
  • Document: A single record, stored as JSON, uniquely identified by _id
  • Field: A document attribute, determining available query and aggregation methods
  • Mapping: Definition of field types and indexing strategies, equivalent to a table schema
  • Type: A logical "table" concept in 6.x and below; fixed to _doc in 7.x and removed in 8.x
  • Text: A tokenized field type, used for full-text retrieval, not suitable for aggregation/sorting
  • Keyword: Not tokenized, suitable for exact matching, aggregation/sorting, usually has doc_values
  • Multi-fields: Indexing the same field in multiple ways, such as title and title.keyword
  • Object: Object field, properties are flattened and merged into the same document
  • Nested: Nested object, each array element is indexed independently, avoiding cross-matching, and can be queried independently
  • Dynamic mapping: Strategy for when unknown fields appear (true/false/strict)

Tokenization and Normalization

  • Analyzer: The analysis chain, consisting of three stages: char_filter → tokenizer → filter
  • Tokenizer: Component that splits text into tokens, such as standard, whitespace, ik_smart
  • Token: The basic unit in the inverted index (standardized/tokenized lexeme)
  • Char filter: Character-level preprocessing, such as html_strip
  • Token filter: Further processing of tokens, such as lowercase, synonym, stop
  • Normalizer: Normalization for keyword fields (lowercasing, accent removal, etc.), no tokenization

Inverted Index and Scoring

  • Inverted index: Index structure of term → document list (postings)
  • Term: A term in the index (a token after standardization/tokenization)
  • Posting: Document occurrence information, including docID, frequency, position, etc.
  • Relevance score: Relevance score, used for sorting
  • BM25: Default relevance model (replaces TF-IDF)
  • Query vs Filter: Query participates in scoring, Filter only performs boolean filtering and can be cached
  • Bool query: must/should/must_not/filter combination query
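
For reference, the BM25 score (the default since ES 5) for a query Q and document D has the form, with defaults k1 = 1.2 and b = 0.75:

score(D, Q) = Σ_{qi ∈ Q} IDF(qi) × ( f(qi, D) × (k1 + 1) ) / ( f(qi, D) + k1 × (1 − b + b × |D| / avgdl) )

where f(qi, D) is qi's frequency in D, |D| is the document length in tokens, and avgdl is the average document length across the index.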

Storage and Segments

  • Segment: Immutable data segment, created by appending writes; merge reduces the number of segments
  • Refresh: Flushes in-memory increments to new segments, default cycle 1s, visible after refresh
  • Flush: Persists the translog and creates a new commit point
  • Translog: Write-ahead log, used for crash recovery
  • Doc values: Columnar storage, supporting aggregation/sorting/scripting, text has no doc values
  • _source: Original JSON document, stored by default, used for re-retrieval and reindex
  • Stored fields: Separately stored fields (not commonly used), distinct from _source
  • Norms: Field-level length normalization and other scoring factors, can be disabled to save space

Cluster and Shards

  • Cluster: An ES cluster composed of multiple nodes
  • Node: An instance in the cluster, common roles: master, data, ingest, coordinating
  • Shard (Primary shard): The physical sharding unit of an index, quantity determined at creation
  • Replica: A copy of a primary shard, improving high availability and query throughput
  • Routing: Determines which primary shard a document falls into based on the routing value, defaults to _id hash
  • Alias: An alias, can point to one or more indexes, facilitating seamless switching

Writing and Batch Processing

  • Bulk API: Batch write/update/delete
  • Update by query: Batch update by condition
  • Delete by query: Batch delete by condition
  • Reindex: Copy from source index to target index (often used for mapping changes)
  • Ingest pipeline: Pre-write processing pipeline (grok, rename, set, script, etc.)
  • Painless: ES built-in scripting language, used for script updates, script sorting, etc.
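The Bulk API body is newline-delimited JSON: each action line is followed by an optional document line, and the body must end with a newline. A sketch against a hypothetical `products` index, sent to `POST /_bulk`:

```json
{ "index": { "_index": "products", "_id": "1" } }
{ "title": "wireless headphones", "price": 129 }
{ "update": { "_index": "products", "_id": "2" } }
{ "doc": { "price": 99 } }
{ "delete": { "_index": "products", "_id": "3" } }
```

Note that `delete` actions take no document line.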

Search and Pagination

  • Match: Full-text query, tokenized
  • Term/Terms: Exact match, not tokenized
  • Range: Range query (numeric/date)
  • Multi-match: Multi-field full-text query
  • Nested query: Sub-query for nested fields
  • Aggregation: Aggregation analysis (terms, stats, date_histogram, range, etc.)
  • Highlight: Highlight matching snippets
  • Suggesters: Search suggestions (term/phrase/completion)
  • From/size: Basic pagination, deep pagination is costly
  • Search after: Cursor-based pagination, replaces deep pagination
  • Scroll: Snapshot-style cursor for large-volume exports, not for real-time queries
  • PIT (Point in time): Point-in-time consistency snapshot, used for stable pagination
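A sketch of cursor-based pagination with `search_after` (the field names and sort values are hypothetical): the sort must include a unique tiebreaker field, and each page passes the last sort values from the previous page:

```json
{
  "size": 100,
  "sort": [
    { "created_at": "desc" },
    { "order_no": "asc" }
  ],
  "search_after": [1741400000000, "A-10023"]
}
```

Combined with a PIT, this yields stable pagination over a consistent snapshot even while the index is being written to.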

Lifecycle and Index Management

  • ILM (Index Lifecycle Management): Hot/Warm/Cold/Delete lifecycle policies
  • Rollover: Switch to a new index based on size/document count/time
  • Snapshot/Restore: Snapshot and recovery (repositories can connect to S3, HDFS, etc.)
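A sketch of an ILM policy (the policy name `logs-policy` is hypothetical), created via `PUT _ilm/policy/logs-policy`: roll over in the hot phase once an index exceeds 50 GB or 7 days, and delete indexes 30 days after rollover:

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "7d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```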

Operations and Performance

  • Cluster health: Overall cluster status (green/yellow/red), reflecting whether all primary shards and replicas are allocated
  • Refresh interval: Refresh cycle; lengthening it in write-heavy scenarios reduces refresh overhead and speeds up indexing
  • Replicas: Number of replicas affects query throughput and write cost
  • Force merge: Merge segments for read-only indexes, reducing file count and improving query performance
  • Slow logs: Slow query and slow indexing logs, used for troubleshooting and optimization
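A sketch of write-tuning settings applied via `PUT /logs/_settings` on a hypothetical `logs` index (slowing refresh and dropping replicas during a bulk load, to be restored afterwards):

```json
{
  "index": {
    "refresh_interval": "30s",
    "number_of_replicas": 0
  }
}
```

Once the index becomes read-only, its segments can be merged with `POST /logs/_forcemerge?max_num_segments=1` to reduce file count and improve query performance.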

The above terms cover high-frequency concepts in ES daily modeling, writing, retrieval, aggregation, and operational optimization. Combined with the preceding chapters on Mapping, Analyzer, term/match, and multi_match, they form a complete knowledge graph for ES usage.

Published by Walker. Please credit the source when republishing: https://walker-learn.xyz/archives/4784
