Go Engineer Training Program 009

User Center
Favorites
Manage Shipping Addresses (CRUD)
Messages

Copy inventory_srv to userop_srv, then find and replace every occurrence of inventory.

Elasticsearch In-depth Analysis Document

1. What is Elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene, capable of rapidly storing, searching, and analyzing massive amounts of data. It is a core component of the Elastic Stack (formerly ELK Stack).

2. Problems Faced by MySQL Search - In-depth Analysis

2.1 Detailed Explanation of Performance Issues

Problem Description:

-- With 1 million rows, the following query can take several seconds
SELECT * FROM products WHERE name LIKE '%手机%' OR description LIKE '%手机%';

Performance Comparison Data:

Data Volume	MySQL LIKE Query	Elasticsearch Full-Text Search	Performance Improvement
10,000 rows	50ms	10ms	5x
100,000 rows	500ms	15ms	33x
1 million rows	5000ms	20ms	250x
10 million rows	50000ms+	30ms	1,600x+

Root Causes:

Full Table Scan: A leading-wildcard pattern such as LIKE '%keyword%' cannot use a B+ tree index, so MySQL must scan every row
I/O Intensive: Each query requires reading a large amount of data from disk
CPU Intensive: Performs string matching operations on every row of data
Memory Pressure: Large amounts of data loaded into memory for processing

Real-world Case:

An e-commerce platform's product table holds 5 million records. Using MySQL fuzzy search to find "Apple phone":

- Query time: 8.3 seconds
- CPU utilization: soared to 85%
- With 10 concurrent queries, response time rose to over 30 seconds

2.2 Detailed Explanation of Lack of Relevance Ranking

Pain Points of MySQL Query Results:

-- MySQL can only sort by fixed rules
SELECT * FROM products
WHERE name LIKE '%手机%'
ORDER BY price DESC;  -- Can only sort by columns such as price or time

Elasticsearch’s Relevance Scoring Mechanism:

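Search terms: "小米手机" (Xiaomi phone)

Relevance score calculation:
┌──────────────────────────────────────────────────────────────┐
│ Document 1: "小米手机12 Pro"                                 │
│ • Term frequency (TF): both search terms appear              │
│ • Inverse document frequency (IDF): rarity of each term      │
│ • Field length: the title is short, so weight is higher      │
│ • Score: 9.8                                                 │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ Document 2: "这是一款性价比很高的手机"                       │
│ (English: "This is a very cost-effective phone")             │
│ • Term frequency (TF): only "手机" (phone) appears           │
│ • Inverse document frequency (IDF): "手机" is common         │
│ • Field length: the description is long, so weight is lower  │
│ • Score: 3.2                                                 │
└──────────────────────────────────────────────────────────────┘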

Detailed Explanation of Relevance Factors:

TF (Term Frequency): Frequency of keywords appearing in the document
IDF (Inverse Document Frequency): Rarity of keywords across all documents
Field Length Normalization: Matches in shorter fields have higher weight than in longer fields
Field Weight Boost: Can set title to be more important than content
Query-time Weight: Can specify certain query terms as more important
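
The way TF, IDF, and field-length normalization combine can be sketched in a few lines of Go. This is a minimal illustration using the classic TF-IDF form, not Elasticsearch's actual scoring (recent versions default to BM25). The function name score and the pre-tokenized input are assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
)

// score returns a simplified TF-IDF relevance score for one document.
// docs holds every document as a list of tokens (tokenization is assumed done).
func score(query []string, doc []string, docs [][]string) float64 {
	total := 0.0
	for _, term := range query {
		// TF: how often the term occurs in this document.
		tf := 0
		for _, tok := range doc {
			if tok == term {
				tf++
			}
		}
		if tf == 0 {
			continue
		}
		// DF: how many documents contain the term at least once.
		df := 0
		for _, d := range docs {
			for _, tok := range d {
				if tok == term {
					df++
					break
				}
			}
		}
		// IDF: rarer terms get a larger weight.
		idf := math.Log(1 + float64(len(docs))/float64(df))
		// Field-length normalization: matches in shorter fields weigh more.
		norm := 1 / math.Sqrt(float64(len(doc)))
		total += math.Sqrt(float64(tf)) * idf * norm
	}
	return total
}

func main() {
	docs := [][]string{
		{"小米", "手机", "12", "Pro"},
		{"这是", "一款", "性价比", "很高", "的", "手机"},
		{"小米", "电视"},
	}
	query := []string{"小米", "手机"}
	for i, d := range docs {
		fmt.Printf("Doc%d score: %.3f\n", i+1, score(query, d, docs))
	}
}
```

With this toy corpus, the short title that contains both terms scores higher than the long description that contains only one of them.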

2.3 Detailed Explanation of Inability to Perform Full-Text Search

Limitations of MySQL Full-Text Index:

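-- Creating a MySQL full-text index
ALTER TABLE products ADD FULLTEXT(name, description);

-- Problem 1: minimum token length (default 3 characters for InnoDB, 4 for MyISAM)
-- Terms shorter than the minimum, such as the single character "机", are not indexed

-- Problem 2: weak Chinese word segmentation
-- The default parser splits on whitespace and punctuation, so "苹果手机" (Apple phone)
-- is indexed as one token and a search for "苹果" (Apple) finds nothing.
-- The built-in ngram parser (MySQL 5.7.6+) cuts text into fixed-size character
-- n-grams rather than real words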

Elasticsearch Full-Text Search Capabilities:

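// ES analysis process example (tokens as produced by a Chinese analyzer such as IK)
Input text: "我想买一台苹果手机" (I want to buy an Apple phone)

Tokenization result:
[我] [想] [买] [一台] [苹果] [手机] [苹果手机]

Synonym expansion (requires a configured synonym filter):
[苹果] → [Apple, iPhone]
[手机] → [手机, 电话, mobile]

Spelling correction (via suggesters):
"苹果手击" → suggests "苹果手机"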

2.4 Detailed Explanation of Inaccurate Search and Lack of Word Segmentation

Problems with MySQL String Matching:

-- Search for "笔记本" (notebook / laptop)
SELECT * FROM products WHERE name LIKE '%笔记本%';
-- Result: finds "笔记本电脑" (laptop computer)
-- Problem: misses "笔记 本子", "notebook", and "手提电脑" (portable computer)

Elasticsearch Smart Word Segmentation Process:

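Original text: "ThinkPad X1 Carbon超轻薄笔记本电脑" (ThinkPad X1 Carbon ultra-thin laptop)

Standard analyzer (lowercases; splits Chinese into single characters):
[thinkpad] [x1] [carbon] [超] [轻] [薄] [笔] [记] [本] [电] [脑]

IK analyzer (Chinese, plugin):
[ThinkPad] [X1] [Carbon] [超] [轻薄] [超轻薄]
[笔记] [本] [笔记本] [电脑] [笔记本电脑]

Pinyin analyzer (plugin):
[bi] [ji] [ben] [dian] [nao] → enables search by pinyin

N-gram tokenizer:
[Thi] [hin] [ink] [nkP] ... → supports partial matching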
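
The N-gram idea is simple enough to sketch directly. The Go function below (ngrams is a name chosen for this example, not an Elasticsearch API) slides a fixed-size window over the characters of a string. It works on runes so that Chinese characters are not split in the middle of a multi-byte sequence.

```go
package main

import "fmt"

// ngrams returns all contiguous character n-grams of s.
// It operates on runes so multi-byte characters (e.g. Chinese) stay intact.
func ngrams(s string, n int) []string {
	runes := []rune(s)
	if n <= 0 || len(runes) < n {
		return nil
	}
	grams := make([]string, 0, len(runes)-n+1)
	for i := 0; i+n <= len(runes); i++ {
		grams = append(grams, string(runes[i:i+n]))
	}
	return grams
}

func main() {
	fmt.Println(ngrams("ThinkPad", 3)) // [Thi hin ink nkP kPa Pad]
	fmt.Println(ngrams("笔记本", 2))      // [笔记 记本]
}
```

Indexing every n-gram is what lets a partial input such as "ink" match "ThinkPad", at the cost of a larger index.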

3. What is Full-Text Search - Core Principle Analysis

3.1 Structured Data vs Unstructured Data

Structured Data (MySQL storage method):
┌──────┬────────┬────────┬────────┐
│  ID  │  Name  │ Price  │ Stock  │
├──────┼────────┼────────┼────────┤
│  1   │iPhone  │ 5999   │  100   │
│  2   │ 小米   │ 2999   │  200   │
└──────┴────────┴────────┴────────┘

Unstructured Data (Text content):
"This iPhone uses an A15 processor, with powerful performance,
excellent camera effects, and 20% improved battery life,
User review: 'Awesome, great value for money!'"

3.2 Detailed Explanation of Inverted Index Principle

Forward Index (MySQL):

Document ID → Content
Doc1 → "小米手机"
Doc2 → "苹果手机"
Doc3 → "小米电视"

Inverted Index (Elasticsearch):

Term → Document List
"小米" → [Doc1, Doc3]
"手机" → [Doc1, Doc2]
"苹果" → [Doc2]
"电视" → [Doc3]

Search "小米手机" (Xiaomi phone):
1. Look up "小米" (Xiaomi) → Get [Doc1, Doc3]
2. Look up "手机" (phone) → Get [Doc1, Doc2]
3. Calculate intersection → Doc1 (most relevant)
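
The lookup-and-intersect steps can be sketched in Go as follows. This is a minimal in-memory illustration, assuming documents are already tokenized. buildIndex and intersect are names invented for the example and do not correspond to Lucene internals.

```go
package main

import (
	"fmt"
	"sort"
)

// buildIndex maps each term to the sorted list of document IDs containing it.
func buildIndex(docs map[int][]string) map[string][]int {
	index := make(map[string][]int)
	for id, terms := range docs {
		seen := make(map[string]bool)
		for _, t := range terms {
			if !seen[t] {
				seen[t] = true
				index[t] = append(index[t], id)
			}
		}
	}
	for t := range index {
		sort.Ints(index[t])
	}
	return index
}

// intersect returns the IDs present in both sorted lists (two-pointer merge).
func intersect(a, b []int) []int {
	var out []int
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

func main() {
	docs := map[int][]string{
		1: {"小米", "手机"},
		2: {"苹果", "手机"},
		3: {"小米", "电视"},
	}
	index := buildIndex(docs)
	fmt.Println(index["小米"])                        // [1 3]
	fmt.Println(index["手机"])                        // [1 2]
	fmt.Println(intersect(index["小米"], index["手机"])) // [1]
}
```

Because each list is kept sorted, the intersection takes a single pass over both lists instead of a scan over every document.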

3.3 Detailed Structure of Inverted Index

Complete Inverted Index Structure:

Term: "手机"
├── Document Frequency (DF): 1000 documents contain this term
├── Inverted List:
│   ├── Doc1:
│   │   ├── Term Frequency (TF): 3 times
│   │   ├── Positions: [5, 28, 102]
│   │   └── Fields: [title, description]
│   ├── Doc2:
│   │   ├── Term Frequency (TF): 1 time
│   │   ├── Positions: [15]
│   │   └── Fields: [title]
│   └── ...
└── Statistics: Highest term frequency, average term frequency, etc.
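
A rough Go model of this structure is shown below. The Posting type and postingsFor function are hypothetical names for illustration. Real Lucene postings are stored in compressed on-disk formats, which this sketch does not attempt to reproduce.

```go
package main

import (
	"fmt"
	"sort"
)

// Posting is one entry in a term's inverted list.
type Posting struct {
	DocID     int
	TF        int   // term frequency in this document
	Positions []int // token offsets where the term occurs
}

// postingsFor scans tokenized documents and builds the inverted list for term.
// The length of the result is the term's document frequency (DF).
func postingsFor(term string, docs map[int][]string) []Posting {
	var list []Posting
	for id, tokens := range docs {
		var pos []int
		for i, tok := range tokens {
			if tok == term {
				pos = append(pos, i)
			}
		}
		if len(pos) > 0 {
			list = append(list, Posting{DocID: id, TF: len(pos), Positions: pos})
		}
	}
	sort.Slice(list, func(i, j int) bool { return list[i].DocID < list[j].DocID })
	return list
}

func main() {
	docs := map[int][]string{
		1: {"手机", "推荐", "小米", "手机"},
		2: {"苹果", "手机"},
		3: {"小米", "电视"},
	}
	list := postingsFor("手机", docs)
	fmt.Println("DF:", len(list))
	for _, p := range list {
		fmt.Printf("Doc%d TF=%d positions=%v\n", p.DocID, p.TF, p.Positions)
	}
}
```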

4. Elasticsearch Architecture Explained

4.1 Cluster Architecture

Elasticsearch Cluster Architecture Diagram:

┌─────────────── ES Cluster ──────────────┐
│                                         │
│  ┌─────────────────────────────────┐   │
│  │     Master Node                 │   │
│  │  • Cluster management           │   │
│  │  • Index creation/deletion      │   │
│  │  • Shard allocation             │   │
│  └─────────────────────────────────┘   │
│                                         │
│  ┌──────────┐  ┌──────────┐           │
│  │ Data     │  │ Data     │           │
│  │ Node 1   │  │ Node 2   │           │
│  │ ┌──────┐ │  │ ┌──────┐ │           │
│  │ │ P0   │ │  │ │ R0   │ │           │
│  │ ├──────┤ │  │ ├──────┤ │           │
│  │ │ R1   │ │  │ │ P1   │ │           │
│  │ └──────┘ │  │ └──────┘ │           │
│  └──────────┘  └──────────┘           │
│                                         │
│  P = Primary Shard                     │
│  R = Replica Shard                     │
└─────────────────────────────────────────┘

4.2 Data Write Process

Detailed Write Process:

Client → Coordinating Node → Primary Shard → Replica Shard

1. Client sends write request
   ↓
2. Coordinating node determines shard via hash routing
   ↓
3. Request forwarded to primary shard node
   ↓
4. Primary shard writes successfully
   ↓
5. Replicated to replica shards in parallel
   ↓
6. All replicas confirm
   ↓
7. Returns success response to client

Timeline:
T0 ──────→ T1 ────→ T2 ────────→ T3 ────────→ T4
Receive    Route    Primary      Replicas     Respond
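
Step 2, hash routing, boils down to hashing the routing value (the document ID by default) and taking it modulo the number of primary shards. The Go sketch below shows the idea. It uses FNV-1a from the standard library as a stand-in for the hash Elasticsearch really uses (Murmur3), so shardFor is illustrative only and its results will not match a real cluster.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks the primary shard for a routing value.
// Assumption: FNV-1a replaces Elasticsearch's Murmur3 hash for simplicity.
func shardFor(routing string, numPrimaryShards int) int {
	h := fnv.New32a()
	h.Write([]byte(routing))
	return int(h.Sum32() % uint32(numPrimaryShards))
}

func main() {
	for _, id := range []string{"1", "2", "1001"} {
		fmt.Printf("doc %s -> shard %d\n", id, shardFor(id, 2))
	}
}
```

Because the shard number depends on the primary shard count, changing that count would send existing IDs to different shards. This is why the number of primary shards is fixed when an index is created.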

4.3 Query Process

Query Execution Process:

Phase 1: Query
┌────────────────────────────────────────────────────────┐
│ Coordinating node sends query requests to all shards   │
│ Each shard returns Top N document IDs and scores       │
└────────────────────────────────────────────────────────┘
                            ↓
Phase 2: Fetch
┌────────────────────────────────────────────────────────┐
│ Coordinating node aggregates and sorts all results     │
│ Fetches the full content of the final documents        │
└────────────────────────────────────────────────────────┘
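
The two phases can be sketched in Go as a merge followed by a lookup. This is a simplified single-process model: Hit, mergeTopN, and fetch are names made up for the example, and the per-shard results are hard-coded rather than computed.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is what a shard returns in the query phase: an ID and a score, no body.
type Hit struct {
	ID    string
	Score float64
}

// mergeTopN combines per-shard hit lists and keeps the n best overall.
func mergeTopN(shardHits [][]Hit, n int) []Hit {
	var all []Hit
	for _, hits := range shardHits {
		all = append(all, hits...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Score > all[j].Score })
	if len(all) > n {
		all = all[:n]
	}
	return all
}

// fetch loads full documents only for the winning IDs (the fetch phase).
func fetch(store map[string]string, hits []Hit) []string {
	docs := make([]string, 0, len(hits))
	for _, h := range hits {
		docs = append(docs, store[h.ID])
	}
	return docs
}

func main() {
	shard0 := []Hit{{"a", 9.8}, {"b", 3.2}}
	shard1 := []Hit{{"c", 7.1}, {"d", 1.5}}
	store := map[string]string{
		"a": "小米手机12 Pro",
		"b": "这是一款性价比很高的手机",
		"c": "小米电视",
		"d": "苹果手机",
	}
	top := mergeTopN([][]Hit{shard0, shard1}, 2)
	fmt.Println(fetch(store, top))
}
```

Splitting the work this way means document bodies are transferred only for the final page of results, not for every candidate on every shard.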

5. Elasticsearch Core Features Explained

5.1 Detailed Explanation of Query Types

// 1. Match Query - Full-Text Search
{
  "query": {
    "match": {
      "title": {
        "query": "苹果手机",
        "operator": "and"  // Must contain all terms
      }
    }
  }
}

// 2. Term Query - Exact Match
{
  "query": {
    "term": {
      "category": "手机"  // keyword field: no tokenization, exact match
    }
  }
}

// 3. Range Query - Range Search
{
  "query": {
    "range": {
      "price": {
        "gte": 1000,
        "lte": 5000
      }
    }
  }
}

// 4. Bool Compound Query
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "手机"}}
      ],
      "filter": [
        {"range": {"price": {"lte": 5000}}}
      ],
      "should": [
        {"match": {"brand": "苹果"}}  // Optional: raises the score when it matches
      ],
      "must_not": [
        {"term": {"status": "discontinued"}}
      ]
    }
  }
}
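
In a Go service, a query body like the bool query above is often assembled from request parameters and serialized with encoding/json. The sketch below shows one way to do that using plain maps. productQuery and the obj alias are hypothetical helpers for this example, not part of any Elasticsearch client library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// obj is shorthand for a JSON object.
type obj = map[string]interface{}

// productQuery builds a bool query: keyword must match the title,
// maxPrice filters without affecting the score, and brand only boosts.
func productQuery(keyword, brand string, maxPrice float64) obj {
	return obj{
		"query": obj{
			"bool": obj{
				"must":     []obj{{"match": obj{"title": keyword}}},
				"filter":   []obj{{"range": obj{"price": obj{"lte": maxPrice}}}},
				"should":   []obj{{"match": obj{"brand": brand}}},
				"must_not": []obj{{"term": obj{"status": "discontinued"}}},
			},
		},
	}
}

func main() {
	body, err := json.MarshalIndent(productQuery("手机", "苹果", 5000), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

The resulting bytes can be sent as the request body of a search call with whichever HTTP or Elasticsearch client the project uses.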

5.2 Aggregation Analysis Function

// Sales Data Analysis Example
{
  "aggs": {
    "sales_per_category": {
      "terms": {
        "field": "category"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "total_sales": {
          "sum": {
            "field": "sales_count"
          }
        },
        "price_ranges": {
          "range": {
            "field": "price",
            "ranges": [
              {"to": 1000},
              {"from": 1000, "to": 5000},
              {"from": 5000}
            ]
          }
        }
      }
    }
  }
}

6. Real-world Application Case Studies

6.1 E-commerce Search Optimization Case Study

Comparison of a certain e-commerce platform’s search before and after optimization:

Metric	MySQL Solution	Elasticsearch Solution	Improvement Effect
Average Search Time	2.3 seconds	0.05 seconds	46x improvement
Search Accuracy	65%	92%	27 percentage point increase
Zero Result Rate	18%	3%	15 percentage point decrease
Number of Servers	8 servers	3 servers	62.5% fewer servers
Concurrency Capability	100 QPS	5000 QPS	50x improvement

Implementation Details:

Data Synchronization Architecture:

MySQL (primary data) → Binlog → Logstash → Elasticsearch
↓
Scheduled full synchronization (nightly)

Search Optimization Strategies:

Pinyin Search: Supports searching for "pinguo" to find "苹果" (Apple)
Synonyms: Configured "手机" (phone), "电话" (telephone), and "mobile" as synonyms
Search Suggestions: Suggests likely search terms as the user types
Correction Function: Automatically corrects common spelling errors

6.2 Log Analysis System Case Study

A certain internet company’s log analysis system:

Log Processing Flow:

Application Server → Filebeat → Logstash → Elasticsearch → Kibana

Application Server: Generates logs
Filebeat: Collects
Logstash: Processes and transforms
Elasticsearch: Stores and indexes
Kibana: Visualizes and displays

Processing Scale:
• Log volume: 100GB per day
• Number of log entries: 1 billion entries/day
• Query response: Millisecond level
• Retention period: 30 days hot data, 1 year cold data

7. Performance Optimization Best Practices

7.1 Index Design Optimization

// Optimized Mapping Design
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "fields": {
          "keyword": {
            "type": "keyword"  // Supports exact matching
          },
          "pinyin": {
            "type": "text",
            "analyzer": "pinyin"  // Supports Pinyin search
          }
        }
      },
      "price": {
        "type": "scaled_float",
        "scaling_factor": 100  // Price precision optimization
      },
      "category": {
        "type": "keyword"  // Category does not require tokenization
      },
      "description": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "created_time": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
      }
    }
  }
}

7.2 Query Performance Optimization Tips

1. Use Filter instead of Query (when scoring is not needed)

// Before optimization: using query (calculates score)
{"query": {"term": {"status": "active"}}}

// After optimization: using filter (does not calculate score, can be cached)
{"query": {"bool": {"filter": {"term": {"status": "active"}}}}}

2. Set a reasonable shard count

Shard count reference formula: Shard count = Data volume (GB) / 30GB

Example:
- 100GB data: 3-4 shards
- 1TB data: 35-40 shards
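
The rule of thumb translates into a one-line calculation. The Go sketch below rounds up and enforces a minimum of one shard. The 30GB target comes from the formula above, while shardCount and the constant name are chosen for this example.

```go
package main

import (
	"fmt"
	"math"
)

// targetShardSizeGB is the per-shard size from the rule of thumb.
const targetShardSizeGB = 30.0

// shardCount returns ceil(dataGB / 30), with a minimum of one shard.
func shardCount(dataGB float64) int {
	n := int(math.Ceil(dataGB / targetShardSizeGB))
	if n < 1 {
		return 1
	}
	return n
}

func main() {
	fmt.Println(shardCount(100))  // 4
	fmt.Println(shardCount(1024)) // 35
}
```

Treat the result as a starting point. Expected data growth and node count usually shift the final number.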

3. Batch operation optimization

// Use bulk API for batch indexing
POST _bulk
{"index": {"_index": "products", "_id": 1}}
{"name": "iPhone", "price": 5999}
{"index": {"_index": "products", "_id": 2}}
{"name": "小米", "price": 2999}
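
The bulk request body is newline-delimited JSON: an action line followed by a source line for each document, with a trailing newline at the end. A minimal Go sketch for producing such a body is shown below. Product and bulkBody are hypothetical names for this example, and sending the bytes to the _bulk endpoint is left to whichever client the project uses.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Product is a hypothetical document type matching the example fields.
type Product struct {
	ID    string `json:"-"` // sent in the action line, not in the source
	Name  string `json:"name"`
	Price int    `json:"price"`
}

// bulkBody renders an NDJSON payload: one action line and one source line
// per product, each terminated by a newline.
func bulkBody(index string, products []Product) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode appends "\n" after every value
	for _, p := range products {
		action := map[string]map[string]string{
			"index": {"_index": index, "_id": p.ID},
		}
		if err := enc.Encode(action); err != nil {
			return nil, err
		}
		if err := enc.Encode(p); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

func main() {
	body, err := bulkBody("products", []Product{
		{ID: "1", Name: "iPhone", Price: 5999},
		{ID: "2", Name: "小米", Price: 2999},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(string(body))
}
```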

8. Elasticsearch vs Traditional Databases

8.1 Applicable Scenarios Comparison

Scenario	MySQL	Elasticsearch	Recommended Choice
Full-Text Search	❌ Poor	✅ Excellent	ES
Transaction Support	✅ Full ACID	❌ No transactions	MySQL
Real-time Statistical Analysis	⚠️ Average	✅ Excellent	ES
Relational Queries	✅ Excellent	❌ Limited	MySQL
Geolocation Search	❌ Poor	✅ Excellent	ES
Log Analysis	❌ Not suitable	✅ Specialty	ES
Precise Numerical Calculation	✅ Precise	⚠️ Approximate	MySQL

8.2 Hybrid Architecture Solution

Recommended Hybrid Architecture:

                 User Request
                      ↓
    ┌────────────────────────────────────┐
    │         Application Layer          │
    └────────────────────────────────────┘
                      ↓
    ┌────────────────────────────────────┐
    │  Search Requests         → ES      │
    │  Transaction Operations  → MySQL   │
    │  Cache                   → Redis   │
    └────────────────────────────────────┘

Data Synchronization:
MySQL(Write) → Binlog → Canal/Debezium → Kafka → ES(Read)

9. Common Problems and Solutions

9.1 Data Consistency Issues

Problem: MySQL and ES data inconsistency

Solutions:

1. Dual-write strategy: Write to MySQL and ES simultaneously, and use message queues to ensure eventual consistency
2. CDC (Change Data Capture): Real-time synchronization via Binlog
3. Regular verification: Scheduled tasks compare data differences and fix them
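
The regular verification approach can be sketched as a comparison of two ID-to-version snapshots. The Go example below assumes each row carries a monotonically increasing version (for instance an update timestamp) that is also stored in ES. That column, and the reconcile function itself, are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// reconcile compares ID -> version snapshots taken from MySQL and ES.
// toIndex lists rows missing from ES or older there.
// toDelete lists ES documents that no longer have a MySQL row.
func reconcile(mysql, es map[int64]int64) (toIndex, toDelete []int64) {
	for id, v := range mysql {
		if esV, ok := es[id]; !ok || esV < v {
			toIndex = append(toIndex, id)
		}
	}
	for id := range es {
		if _, ok := mysql[id]; !ok {
			toDelete = append(toDelete, id)
		}
	}
	sort.Slice(toIndex, func(i, j int) bool { return toIndex[i] < toIndex[j] })
	sort.Slice(toDelete, func(i, j int) bool { return toDelete[i] < toDelete[j] })
	return toIndex, toDelete
}

func main() {
	mysql := map[int64]int64{1: 3, 2: 5, 3: 1}
	es := map[int64]int64{1: 3, 2: 4, 9: 1}
	toIndex, toDelete := reconcile(mysql, es)
	fmt.Println("reindex:", toIndex) // reindex: [2 3]
	fmt.Println("delete:", toDelete) // delete: [9]
}
```

In practice the snapshots would be read in batches by ID range rather than loaded whole, but the comparison logic stays the same.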

9.2 Deep Paging Problem

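Problem: Performance degrades sharply on deep pages such as page 10,000. With from/size paging, every shard must collect and sort from + size hits, and by default Elasticsearch rejects requests where from + size exceeds 10,000 (index.max_result_window).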

Solutions:

// 1. Use search_after (recommended)
{
  "size": 10,
  "sort": [{"_id": "asc"}],
  "search_after": [10000]  // Sort value of the last document on the previous page
}

// 2. Use scroll API (suitable for export)
POST /products/_search?scroll=1m
{
  "size": 100,
  "query": {"match_all": {}}
}
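
The idea behind search_after is cursor (keyset) pagination: instead of skipping N results, each request asks for the hits that sort after the last one already seen. The Go sketch below simulates this over an in-memory slice sorted by a unique ID. Doc and pageAfter are hypothetical names, and a real engine seeks through its index rather than through a slice.

```go
package main

import (
	"fmt"
	"sort"
)

// Doc is a hypothetical document with a unique, sortable key.
type Doc struct {
	ID   int
	Name string
}

// pageAfter returns up to size docs whose ID is greater than after.
// docs must be sorted by ID ascending. The binary search stands in for an
// index seek, so the cost does not grow with how deep the page is.
func pageAfter(docs []Doc, after, size int) []Doc {
	start := sort.Search(len(docs), func(i int) bool { return docs[i].ID > after })
	end := start + size
	if end > len(docs) {
		end = len(docs)
	}
	return docs[start:end]
}

func main() {
	var docs []Doc
	for i := 1; i <= 25; i++ {
		docs = append(docs, Doc{ID: i, Name: fmt.Sprintf("product-%d", i)})
	}
	after := 0 // no cursor yet: start before the smallest ID
	for {
		page := pageAfter(docs, after, 10)
		if len(page) == 0 {
			break
		}
		fmt.Printf("page of %d, IDs %d..%d\n", len(page), page[0].ID, page[len(page)-1].ID)
		after = page[len(page)-1].ID // sort value of the last hit becomes the cursor
	}
}
```

The cursor only works when the sort order is total, which is why the sort should include a unique field as a tiebreaker.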

10. Summary

Through its inverted index, distributed architecture, and full-text search capabilities, Elasticsearch addresses the main problems that traditional databases face in search scenarios. Used properly, Elasticsearch can:

1. Improve search performance: From seconds to milliseconds
2. Enhance search quality: Through relevance scoring and smart tokenization
3. Support complex analysis: Real-time aggregation and statistical analysis
4. Reduce operational costs: Fewer servers, higher efficiency

However, it’s important to note that Elasticsearch is not a replacement for MySQL, but rather a complement. In actual projects, the appropriate storage solution should be chosen based on specific scenarios, and typically a hybrid architecture of MySQL+Elasticsearch can leverage their respective strengths.