Backend Development 2026.03.07

Go Engineer Training Program 009

Backend Development
  • User Center
  • Favorites
  • Manage Shipping Addresses (CRUD)
  • Messages

Copy inventory_srv → userop_srv, then find and replace all occurrences of inventory

Elasticsearch In-depth Analysis Document

1. What is Elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene, capable of rapidly storing, searching, and analyzing massive amounts of data. It is a core component of the Elastic Stack (formerly ELK Stack).

2. Problems Faced by MySQL Search - In-depth Analysis

2.1 Detailed Explanation of Performance Issues

Problem Description:

-- With 1 million rows, the following query can take several seconds
SELECT * FROM products WHERE name LIKE '%手机%' OR description LIKE '%手机%';

Performance Comparison Data:

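┌─────────────────┬──────────────────┬────────────────────────────────┬─────────────────────────┐
│ Data Volume     │ MySQL LIKE Query │ Elasticsearch Full-Text Search │ Performance Improvement │
├─────────────────┼──────────────────┼────────────────────────────────┼─────────────────────────┤
│ 10,000 rows     │ 50 ms            │ 10 ms                          │ 5x                      │
│ 100,000 rows    │ 500 ms           │ 15 ms                          │ 33x                     │
│ 1,000,000 rows  │ 5,000 ms         │ 20 ms                          │ 250x                    │
│ 10,000,000 rows │ 50,000 ms+       │ 30 ms                          │ 1,600x+                 │
└─────────────────┴──────────────────┴────────────────────────────────┴─────────────────────────┘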

Root Causes:

  1. Full Table Scan: The leading wildcard in LIKE '%keyword%' prevents the use of B+ tree indexes, so every row must be scanned
  2. I/O Intensive: Each query requires reading a large amount of data from disk
  3. CPU Intensive: Performs string matching operations on every row of data
  4. Memory Pressure: Large amounts of data loaded into memory for processing

Real-world Case:

On one e-commerce platform with 5 million records in its product table, a MySQL fuzzy search for “Apple phone” showed:

  • Query time: 8.3 seconds
  • CPU utilization: spiked to 85%
  • With 10 concurrent queries, response time rose to over 30 seconds

2.2 Detailed Explanation of Lack of Relevance Ranking

Pain Points of MySQL Query Results:

-- MySQL can only sort by fixed rules
SELECT * FROM products
WHERE name LIKE '%手机%'
ORDER BY price DESC;  -- Sorting is limited to columns such as price or time

Elasticsearch’s Relevance Scoring Mechanism:

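Search term: "小米手机"

Relevance score calculation:
┌──────────────────────────────────────────────────────────────┐
│ Document 1: "小米手机12 Pro"                                 │
│ • Term frequency (TF): both keywords appear                  │
│ • Inverse document frequency (IDF): measures term rarity     │
│ • Field length: short title, higher weight                   │
│ • Score: 9.8                                                 │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ Document 2: "这是一款性价比很高的手机"                       │
│ • Term frequency (TF): only "手机" appears                   │
│ • Inverse document frequency (IDF): "手机" is common         │
│ • Field length: long description, lower weight               │
│ • Score: 3.2                                                 │
└──────────────────────────────────────────────────────────────┘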

Detailed Explanation of Relevance Factors:

  1. TF (Term Frequency): Frequency of keywords appearing in the document
  2. IDF (Inverse Document Frequency): Rarity of keywords across all documents
  3. Field Length Normalization: Matches in shorter fields have higher weight than in longer fields
  4. Field Weight Boost: Can set title to be more important than content
  5. Query-time Weight: Can specify certain query terms as more important
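The first three factors can be made concrete with a toy scorer. The sketch below is a simplified TF-IDF with length normalization, not the BM25 formula Elasticsearch uses by default; the function names `idf` and `score`, the pre-tokenized documents, and the corpus statistics are all made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// idf returns a smoothed inverse document frequency: rarer terms score higher.
func idf(totalDocs, docsWithTerm int) float64 {
	return math.Log(1 + float64(totalDocs)/float64(docsWithTerm+1))
}

// score sums tf*idf over the query terms and divides by sqrt(field length),
// so a match in a short field outweighs the same match in a long one.
func score(query, doc []string, docFreq map[string]int, totalDocs int) float64 {
	if len(doc) == 0 {
		return 0
	}
	tf := make(map[string]int)
	for _, t := range doc {
		tf[t]++
	}
	var s float64
	for _, q := range query {
		if tf[q] > 0 {
			s += float64(tf[q]) * idf(totalDocs, docFreq[q])
		}
	}
	return s / math.Sqrt(float64(len(doc)))
}

func main() {
	// Hypothetical corpus of 10 documents: "小米" is rare, "手机" is common.
	docFreq := map[string]int{"小米": 2, "手机": 8}
	query := []string{"小米", "手机"}
	doc1 := []string{"小米", "手机", "12", "Pro"}
	doc2 := []string{"这是", "一款", "性价比", "很高", "的", "手机"}
	fmt.Printf("doc1=%.2f doc2=%.2f\n",
		score(query, doc1, docFreq, 10), score(query, doc2, docFreq, 10))
}
```

The short title that matches both terms ends up with the higher score, which is the ordering the two example documents above illustrate.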

Limitations of MySQL Full-Text Index:

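-- Create a MySQL full-text index
ALTER TABLE products ADD FULLTEXT(name, description);

-- Issue 1: minimum token length (default 3 for InnoDB, 4 for MyISAM)
-- Tokens shorter than the minimum, such as "机", are not indexed

-- Issue 2: poor Chinese word segmentation with the default parser
-- "苹果手机" is treated as a single token, so searching for "苹果" finds nothing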

Elasticsearch Full-Text Search Capabilities:

// ES analysis process example
Input text: "我想买一台苹果手机"

Tokens:
[我] [想] [买] [一台] [苹果] [手机] [苹果手机]

Synonym expansion:
[苹果] → [Apple, iPhone]
[手机] → [手机, 电话, mobile]

Spelling correction:
"苹果手击" → suggests "苹果手机"

2.4 Detailed Explanation of Inaccurate Search and Lack of Word Segmentation

Problems with MySQL String Matching:

-- Search for "笔记本"
SELECT * FROM products WHERE name LIKE '%笔记本%';
-- Result: finds "笔记本电脑"
-- Problem: misses "笔记 本子", "notebook", and "手提电脑"

Elasticsearch Smart Word Segmentation Process:

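Original text: "ThinkPad X1 Carbon超轻薄笔记本电脑"

Standard analyzer (lowercases; splits Chinese into single characters):
[thinkpad] [x1] [carbon] [超] [轻] [薄] [笔] [记] [本] [电] [脑]

IK analyzer (Chinese):
[ThinkPad] [X1] [Carbon] [超] [轻薄] [超轻薄]
[笔记] [本] [笔记本] [电脑] [笔记本电脑]

Pinyin analyzer:
[si] [kao] [pad] → searchable by Pinyin

N-gram tokenizer:
[Thi] [hin] [ink] [nkP] → supports partial matching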
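The N-gram idea is simple enough to sketch directly. The helper below is an illustration only (the name `ngrams` is made up and this is not how Elasticsearch implements its tokenizer); it slides a window of n characters over the input.

```go
package main

import "fmt"

// ngrams returns every run of n consecutive characters in s.
// It works on runes so that multi-byte characters such as CJK stay intact.
func ngrams(s string, n int) []string {
	r := []rune(s)
	if n <= 0 || len(r) < n {
		return nil
	}
	out := make([]string, 0, len(r)-n+1)
	for i := 0; i+n <= len(r); i++ {
		out = append(out, string(r[i:i+n]))
	}
	return out
}

func main() {
	fmt.Println(ngrams("ThinkPad", 3))   // [Thi hin ink nkP kPa Pad]
	fmt.Println(ngrams("笔记本电脑", 2)) // [笔记 记本 本电 电脑]
}
```

Because every substring of length n becomes a term, a query for a fragment such as "ink" can hit the index directly instead of scanning, at the cost of a larger index.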

3. What is Full-Text Search - Core Principle Analysis

3.1 Structured Data vs Unstructured Data

Structured Data (MySQL storage method):
┌──────┬────────┬────────┬────────┐
│  ID  │  Name  │ Price  │ Stock  │
├──────┼────────┼────────┼────────┤
│  1   │iPhone  │ 5999   │  100   │
│  2   │ 小米   │ 2999   │  200   │
└──────┴────────┴────────┴────────┘

Unstructured Data (Text content):
"This iPhone uses an A15 processor, with powerful performance,
excellent camera effects, and 20% improved battery life,
User review: 'Awesome, great value for money!'"

3.2 Detailed Explanation of Inverted Index Principle

Forward Index (MySQL):

Document ID → Content
Doc1 → "小米手机"
Doc2 → "苹果手机"
Doc3 → "小米电视"

Inverted Index (Elasticsearch):

Term → Document List
"小米" → [Doc1, Doc3]
"手机" → [Doc1, Doc2]
"苹果" → [Doc2]
"电视" → [Doc3]

Search "小米手机":
1. Look up "小米" → Get [Doc1, Doc3]
2. Look up "手机" → Get [Doc1, Doc2]
3. Calculate intersection → Doc1 (most relevant)
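The lookup-and-intersect steps can be sketched in a few lines of Go. This is a minimal in-memory illustration, assuming the documents are already tokenized; `buildIndex` and `intersect` are hypothetical names, and real Lucene posting lists are compressed on-disk structures rather than plain slices.

```go
package main

import (
	"fmt"
	"sort"
)

// buildIndex maps each term to the sorted list of document IDs containing it.
func buildIndex(docs map[int][]string) map[string][]int {
	index := make(map[string][]int)
	for id, terms := range docs {
		seen := make(map[string]bool)
		for _, t := range terms {
			if !seen[t] {
				seen[t] = true
				index[t] = append(index[t], id)
			}
		}
	}
	for t := range index {
		sort.Ints(index[t])
	}
	return index
}

// intersect walks two sorted posting lists in step and keeps the common IDs.
func intersect(a, b []int) []int {
	var out []int
	for i, j := 0, 0; i < len(a) && j < len(b); {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

func main() {
	docs := map[int][]string{
		1: {"小米", "手机"},
		2: {"苹果", "手机"},
		3: {"小米", "电视"},
	}
	index := buildIndex(docs)
	fmt.Println(intersect(index["小米"], index["手机"])) // [1]
}
```

Keeping the posting lists sorted is what makes the intersection a single linear pass over both lists.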

3.3 Detailed Structure of Inverted Index

Complete Inverted Index Structure:

Term: "手机"
├── Document Frequency (DF): 1000 documents contain this term
├── Inverted List:
│   ├── Doc1:
│   │   ├── Term Frequency (TF): 3 times
│   │   ├── Positions: [5, 28, 102]
│   │   └── Fields: [title, description]
│   ├── Doc2:
│   │   ├── Term Frequency (TF): 1 time
│   │   ├── Positions: [15]
│   │   └── Fields: [title]
│   └── ...
└── Statistics: Highest term frequency, average term frequency, etc.

4. Elasticsearch Architecture Explained

4.1 Cluster Architecture

Elasticsearch Cluster Architecture Diagram:

┌─────────────── ES Cluster ──────────────┐
│                                         │
│  ┌─────────────────────────────────┐    │
│  │     Master Node                 │    │
│  │  • Cluster management           │    │
│  │  • Index creation/deletion      │    │
│  │  • Shard allocation             │    │
│  └─────────────────────────────────┘    │
│                                         │
│  ┌──────────┐  ┌──────────┐             │
│  │ Data     │  │ Data     │             │
│  │ Node 1   │  │ Node 2   │             │
│  │ ┌──────┐ │  │ ┌──────┐ │             │
│  │ │ P0   │ │  │ │ R0   │ │             │
│  │ ├──────┤ │  │ ├──────┤ │             │
│  │ │ R1   │ │  │ │ P1   │ │             │
│  │ └──────┘ │  │ └──────┘ │             │
│  └──────────┘  └──────────┘             │
│                                         │
│  P = Primary Shard                      │
│  R = Replica Shard                      │
└─────────────────────────────────────────┘

4.2 Data Write Process

Detailed Write Process:

Client → Coordinating Node → Primary Shard → Replica Shard

1. Client sends write request

2. Coordinating node determines shard via hash routing

3. Request forwarded to primary shard node

4. Primary shard writes successfully

5. Replicated to replica shards in parallel

6. All replicas confirm

7. Returns success response to client

Timeline:
T0 ─────→ T1 ────→ T2 ───────────→ T3 ──────→ T4
Receive   Route    Primary Shard   Replica    Respond
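The hash routing in step 2 boils down to hashing a routing value (the document ID by default) and taking it modulo the number of primary shards. The sketch below illustrates that idea with FNV-1a, chosen only because it ships with the Go standard library; Elasticsearch itself uses Murmur3, and `shardFor` is a made-up name.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks a primary shard for a routing value.
// numPrimaryShards must be positive.
// FNV-1a is a stand-in here; Elasticsearch hashes the routing value with Murmur3.
func shardFor(routing string, numPrimaryShards int) int {
	h := fnv.New32a()
	h.Write([]byte(routing))
	return int(h.Sum32() % uint32(numPrimaryShards))
}

func main() {
	for _, id := range []string{"1001", "1002", "1003"} {
		fmt.Printf("doc %s -> shard P%d\n", id, shardFor(id, 2))
	}
}
```

Because the shard number depends on the modulus, changing the number of primary shards would send existing document IDs to different shards, which is why the primary shard count is fixed when an index is created.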

4.3 Query Process

Query Execution Process:

Phase 1: Query
┌────────────────────────────────────────────────────────────────┐
│ Coordinating node sends query requests to all shards           │
│ Each shard returns Top N document IDs and scores               │
└────────────────────────────────────────────────────────────────┘

Phase 2: Fetch
┌────────────────────────────────────────────────────────────────┐
│ Coordinating node aggregates and sorts all results             │
│ Retrieves the complete content of the final required documents │
└────────────────────────────────────────────────────────────────┘
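The two phases can be mimicked with plain slices and a map. This is a simplified sketch, assuming each shard has already produced its own top hits; the `hit` type and the functions `mergeTopN` and `fetch` are illustrative names, not Elasticsearch APIs.

```go
package main

import (
	"fmt"
	"sort"
)

// hit is what a shard returns in the query phase: an ID and a score, no document body.
type hit struct {
	ID    string
	Score float64
}

// mergeTopN combines the per-shard hit lists and keeps the n best overall (n >= 0).
func mergeTopN(perShard [][]hit, n int) []hit {
	var all []hit
	for _, hits := range perShard {
		all = append(all, hits...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Score > all[j].Score })
	if len(all) > n {
		all = all[:n]
	}
	return all
}

// fetch stands in for the fetch phase: load full documents only for the winners.
func fetch(store map[string]string, top []hit) []string {
	docs := make([]string, 0, len(top))
	for _, h := range top {
		docs = append(docs, store[h.ID])
	}
	return docs
}

func main() {
	shard0 := []hit{{"a", 9.8}, {"b", 3.2}}
	shard1 := []hit{{"c", 7.5}, {"d", 1.1}}
	store := map[string]string{"a": "小米手机12 Pro", "b": "doc b", "c": "doc c", "d": "doc d"}
	top := mergeTopN([][]hit{shard0, shard1}, 2)
	fmt.Println(fetch(store, top)) // [小米手机12 Pro doc c]
}
```

Splitting the work this way means only IDs and scores cross the network in the first phase; full document bodies are loaded just for the final winners.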

5. Elasticsearch Core Features Explained

5.1 Detailed Explanation of Query Types

// 1. Match Query - Full-Text Search
{
  "query": {
    "match": {
      "title": {
        "query": "苹果手机",
        "operator": "and"  // Must contain all terms
      }
    }
  }
}

// 2. Term Query - Exact Match
{
  "query": {
    "term": {
      "category.keyword": "手机"  // No tokenization, exact match
    }
  }
}

// 3. Range Query - Range Search
{
  "query": {
    "range": {
      "price": {
        "gte": 1000,
        "lte": 5000
      }
    }
  }
}

// 4. Bool Compound Query
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "手机"}}
      ],
      "filter": [
        {"range": {"price": {"lte": 5000}}}
      ],
      "should": [
        {"match": {"brand": "苹果"}}  // Optional: raises the score when matched
      ],
      "must_not": [
        {"term": {"status": "discontinued"}}
      ]
    }
  }
}
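From Go, a query body like the bool query above can be assembled with nested maps and serialized with `encoding/json`. The sketch below only builds the JSON; the helper name `boolQuery` and the shorthand type `m` are assumptions for illustration, and sending the body to a cluster is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// m is shorthand for a JSON object.
type m map[string]interface{}

// boolQuery requires the keyword in the title, filters by a maximum price
// without affecting the score, and excludes discontinued products.
func boolQuery(keyword string, maxPrice int) m {
	return m{
		"query": m{
			"bool": m{
				"must":     []m{{"match": m{"title": keyword}}},
				"filter":   []m{{"range": m{"price": m{"lte": maxPrice}}}},
				"must_not": []m{{"term": m{"status": "discontinued"}}},
			},
		},
	}
}

func main() {
	body, err := json.MarshalIndent(boolQuery("手机", 5000), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Building the query as data rather than by string concatenation avoids quoting mistakes when the keyword comes from user input.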

5.2 Aggregation Analysis Function

// Sales Data Analysis Example
{
  "aggs": {
    "sales_per_category": {
      "terms": {
        "field": "category"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "total_sales": {
          "sum": {
            "field": "sales_count"
          }
        },
        "price_ranges": {
          "range": {
            "field": "price",
            "ranges": [
              {"to": 1000},
              {"from": 1000, "to": 5000},
              {"from": 5000}
            ]
          }
        }
      }
    }
  }
}
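To see what the request computes, here is an in-memory equivalent of the `terms` bucket with its `avg` and `sum` sub-aggregations (the `price_ranges` part is omitted for brevity). The types and sample data are invented for illustration; Elasticsearch performs the same grouping per shard and merges the partial results.

```go
package main

import "fmt"

type product struct {
	Category   string
	Price      float64
	SalesCount int
}

// bucket holds the per-category metrics that the aggregation request asks for.
type bucket struct {
	DocCount   int
	AvgPrice   float64
	TotalSales int
}

// aggregate groups products by category (the "terms" bucket) and computes
// the average price and total sales inside each bucket (the sub-aggregations).
func aggregate(products []product) map[string]bucket {
	priceSums := make(map[string]float64)
	out := make(map[string]bucket)
	for _, p := range products {
		b := out[p.Category]
		b.DocCount++
		b.TotalSales += p.SalesCount
		priceSums[p.Category] += p.Price
		out[p.Category] = b
	}
	for cat, b := range out {
		b.AvgPrice = priceSums[cat] / float64(b.DocCount)
		out[cat] = b
	}
	return out
}

func main() {
	products := []product{
		{"手机", 5999, 120},
		{"手机", 2999, 300},
		{"电视", 3999, 80},
	}
	for cat, b := range aggregate(products) {
		fmt.Printf("%s: docs=%d avg_price=%.0f total_sales=%d\n",
			cat, b.DocCount, b.AvgPrice, b.TotalSales)
	}
}
```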

6. Real-world Application Case Studies

6.1 E-commerce Search Optimization Case Study

Comparison of a certain e-commerce platform’s search before and after optimization:

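┌────────────────────────┬────────────────┬────────────────────────┬───────────────────────┐
│ Metric                 │ MySQL Solution │ Elasticsearch Solution │ Improvement           │
├────────────────────────┼────────────────┼────────────────────────┼───────────────────────┤
│ Average Search Time    │ 2.3 seconds    │ 0.05 seconds           │ 46x faster            │
│ Search Accuracy        │ 65%            │ 92%                    │ +27 percentage points │
│ Zero Result Rate       │ 18%            │ 3%                     │ -15 percentage points │
│ Number of Servers      │ 8 servers      │ 3 servers              │ 62.5% fewer servers   │
│ Concurrency Capability │ 100 QPS        │ 5000 QPS               │ 50x higher            │
└────────────────────────┴────────────────┴────────────────────────┴───────────────────────┘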

Implementation Details:

  1. Data Synchronization Architecture:

     MySQL (primary data) → Binlog → Logstash → Elasticsearch
     ↓
     Scheduled full synchronization (nightly)

  2. Search Optimization Strategies:

     • Pinyin Search: Supports searching for “pinguo” to find “苹果” (Apple)
     • Synonyms: Configured “手机” (phone), “电话” (telephone), “mobile” as synonyms
     • Search Suggestions: Real-time prompts for possible search terms to users
     • Correction Function: Automatically corrects common spelling errors

6.2 Log Analysis System Case Study

A certain internet company’s log analysis system:

Log Processing Flow:

Application Server  (generates logs)
  → Filebeat        (collects)
  → Logstash        (processes and transforms)
  → Elasticsearch   (stores and indexes)
  → Kibana          (visualizes and displays)

Processing Scale:
• Log volume: 100GB per day
• Number of log entries: 1 billion entries/day
• Query response: Millisecond level
• Retention period: 30 days hot data, 1 year cold data

7. Performance Optimization Best Practices

7.1 Index Design Optimization

// Optimized Mapping Design
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "fields": {
          "keyword": {
            "type": "keyword"  // Supports exact matching
          },
          "pinyin": {
            "type": "text",
            "analyzer": "pinyin"  // Supports Pinyin search
          }
        }
      },
      "price": {
        "type": "scaled_float",
        "scaling_factor": 100  // Price precision optimization
      },
      "category": {
        "type": "keyword"  // Category does not require tokenization
      },
      "description": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "created_time": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
      }
    }
  }
}

7.2 Query Performance Optimization Tips

  1. Use filter instead of query (when scoring is not needed)

// Before optimization: using query (calculates score)
{"query": {"term": {"status": "active"}}}

// After optimization: using filter (does not calculate score, can be cached)
{"query": {"bool": {"filter": {"term": {"status": "active"}}}}}

  2. Set a reasonable shard count

Shard count reference formula:
Shard count = Data volume (GB) / 30GB

Example:
  • 100GB data: 3-4 shards
  • 1TB data: 35-40 shards

  3. Optimize batch operations

// Use bulk API for batch indexing
POST _bulk
{"index": {"_index": "products", "_id": 1}}
{"name": "iPhone", "price": 5999}
{"index": {"_index": "products", "_id": 2}}
{"name": "小米", "price": 2999}
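In Go, the body of a bulk request can be produced with `json.Encoder`, which conveniently writes a newline after every value, matching the newline-delimited format the bulk API expects. The sketch below only builds the body; the `product` type and `bulkBody` helper are assumptions for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

type product struct {
	ID    int    `json:"-"` // goes into the action line, not the document source
	Name  string `json:"name"`
	Price int    `json:"price"`
}

// bulkBody renders products as newline-delimited JSON:
// one action line followed by one source line per document.
func bulkBody(index string, products []product) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode appends "\n" after every value
	for _, p := range products {
		action := map[string]map[string]interface{}{
			"index": {"_index": index, "_id": p.ID},
		}
		if err := enc.Encode(action); err != nil {
			return nil, err
		}
		if err := enc.Encode(p); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

func main() {
	body, err := bulkBody("products", []product{
		{1, "iPhone", 5999},
		{2, "小米", 2999},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(string(body))
}
```

The resulting bytes would be sent as the body of a POST to the `_bulk` endpoint with the `Content-Type: application/x-ndjson` header; the body must end with a newline, which the encoder already guarantees.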

8. Elasticsearch vs Traditional Databases

8.1 Applicable Scenarios Comparison

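┌────────────────────────────────┬─────────────────┬────────────────────┬────────────────────┐
│ Scenario                       │ MySQL           │ Elasticsearch      │ Recommended Choice │
├────────────────────────────────┼─────────────────┼────────────────────┼────────────────────┤
│ Full-Text Search               │ ❌ Poor         │ ✅ Excellent       │ ES                 │
│ Transaction Support            │ ✅ Full ACID    │ ❌ No transactions │ MySQL              │
│ Real-time Statistical Analysis │ ⚠️ Average      │ ✅ Excellent       │ ES                 │
│ Relational Queries             │ ✅ Excellent    │ ❌ Limited         │ MySQL              │
│ Geolocation Search             │ ❌ Poor         │ ✅ Excellent       │ ES                 │
│ Log Analysis                   │ ❌ Not suitable │ ✅ Specialty       │ ES                 │
│ Precise Numerical Calculation  │ ✅ Precise      │ ⚠️ Approximate     │ MySQL              │
└────────────────────────────────┴─────────────────┴────────────────────┴────────────────────┘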

8.2 Hybrid Architecture Solution

Recommended Hybrid Architecture:

        User Request
             ↓
    ┌─────────────────────┐
    │  Application Layer  │
    └─────────────────────┘
             ↓
    ┌───────────────────────────────────┐
    │  Search Requests         → ES     │
    │  Transaction Operations  → MySQL  │
    │  Cache                   → Redis  │
    └───────────────────────────────────┘

Data Synchronization:
MySQL(Write) → Binlog → Canal/Debezium → Kafka → ES(Read)

9. Common Problems and Solutions

9.1 Data Consistency Issues

Problem: MySQL and ES data inconsistency

Solutions:

  1. Dual-write strategy: Write to MySQL and ES simultaneously, use message queues to ensure eventual consistency
  2. CDC (Change Data Capture): Real-time synchronization via Binlog
  3. Regular verification: Scheduled tasks compare data differences and fix them
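The regular-verification approach can be sketched as a diff between two snapshots. The example assumes each store can export an ID-to-version map (for instance from an `updated_at` or version column, which is an assumption and not something the source specifies); `diff` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// diff compares id -> version snapshots from the two stores and reports
// which IDs must be (re)indexed into ES and which must be deleted from it.
func diff(mysql, es map[int]int64) (reindex, remove []int) {
	for id, v := range mysql {
		if esV, ok := es[id]; !ok || esV != v {
			reindex = append(reindex, id) // missing or stale in ES
		}
	}
	for id := range es {
		if _, ok := mysql[id]; !ok {
			remove = append(remove, id) // deleted in MySQL but still in ES
		}
	}
	sort.Ints(reindex)
	sort.Ints(remove)
	return reindex, remove
}

func main() {
	mysql := map[int]int64{1: 3, 2: 1, 3: 5}
	es := map[int]int64{1: 3, 2: 0, 4: 2}
	reindex, remove := diff(mysql, es)
	fmt.Println(reindex, remove) // [2 3] [4]
}
```

MySQL is treated as the source of truth here, so every discrepancy is resolved by changing ES, never the other way around.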

9.2 Deep Paging Problem

Problem: Extremely poor performance when querying the 10,000th page of data

Solutions:

// 1. Use search_after (recommended)
{
  "size": 10,
  "sort": [{"_id": "asc"}],
  "search_after": ["10000"]  // Sort value of the last document on the previous page (_id sort values are strings)
}

// 2. Use scroll API (suitable for export)
POST /products/_search?scroll=1m
{
  "size": 100,
  "query": {"match_all": {}}
}
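The reason `search_after` avoids the deep paging cost is that each request carries a cursor instead of an offset, so the engine never has to collect and discard all the earlier results. The sketch below imitates that with a sorted slice of IDs; `nextPage` is a made-up helper, not a client API.

```go
package main

import (
	"fmt"
	"sort"
)

// nextPage returns up to size IDs that sort strictly after the cursor.
// ids must be sorted ascending; the cursor is the last ID of the previous page.
func nextPage(ids []int, after, size int) []int {
	// Binary search for the first ID greater than the cursor.
	start := sort.Search(len(ids), func(k int) bool { return ids[k] > after })
	end := start + size
	if end > len(ids) {
		end = len(ids)
	}
	return ids[start:end]
}

func main() {
	ids := make([]int, 25)
	for i := range ids {
		ids[i] = i + 1
	}
	cursor := 0 // smaller than every ID, so the first call returns page one
	for {
		page := nextPage(ids, cursor, 10)
		if len(page) == 0 {
			break
		}
		fmt.Println(page[0], "...", page[len(page)-1])
		cursor = page[len(page)-1]
	}
}
```

Each call costs the same regardless of how deep the page is, which is the property that offset-based paging lacks. The trade-off is that pages can only be walked forward in sort order, not jumped to by number.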

10. Summary

Elasticsearch, through its inverted index, distributed architecture, and powerful full-text search capabilities, addresses many of the problems traditional databases face in search scenarios. Proper use of Elasticsearch can:

  1. Improve search performance: From seconds to milliseconds
  2. Enhance search quality: Through relevance scoring and smart tokenization
  3. Support complex analysis: Real-time aggregation and statistical analysis
  4. Reduce operational costs: Fewer servers, higher efficiency

However, it’s important to note that Elasticsearch is not a replacement for MySQL, but rather a complement. In actual projects, the appropriate storage solution should be chosen based on specific scenarios, and typically a hybrid architecture of MySQL+Elasticsearch can leverage their respective strengths.