Go Engineer Training Program 009

Other Features

  • User Center
  • Favorites
  • Manage Shipping Addresses (CRUD)
  • Messages

Copy inventory_srv --> userop_srv, then search for and replace all occurrences of inventory

Elasticsearch In-depth Analysis Document

1. What is Elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene, capable of rapidly storing, searching, and analyzing massive amounts of data. It is a core component of the Elastic Stack (formerly ELK Stack).

2. Problems Faced by MySQL Search - In-depth Analysis

2.1 Detailed Explanation of Performance Issues

Problem Description:

-- With 1 million rows, the following query may take several seconds
SELECT * FROM products WHERE name LIKE '%手机%' OR description LIKE '%手机%';

Performance Comparison Data:

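┌─────────────────┬──────────────────┬────────────────────────────────┬─────────┐
│ Data Volume     │ MySQL LIKE Query │ Elasticsearch Full-Text Search │ Speedup │
├─────────────────┼──────────────────┼────────────────────────────────┼─────────┤
│ 10,000 rows     │ 50 ms            │ 10 ms                          │ 5x      │
│ 100,000 rows    │ 500 ms           │ 15 ms                          │ 33x     │
│ 1,000,000 rows  │ 5,000 ms         │ 20 ms                          │ 250x    │
│ 10,000,000 rows │ 50,000 ms+       │ 30 ms                          │ 1,600x+ │
└─────────────────┴──────────────────┴────────────────────────────────┴─────────┘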

Root Causes:

  1. Full Table Scan: The leading wildcard in LIKE '%keyword%' prevents MySQL from using a B+ tree index, so every row must be scanned
  2. I/O Intensive: Each query requires reading a large amount of data from disk
  3. CPU Intensive: Performs string matching operations on every row of data
  4. Memory Pressure: Large amounts of data loaded into memory for processing

Real-world Case:

A certain e-commerce platform's product table has 5 million records, using MySQL fuzzy search to find "Apple phone":
- Query time: 8.3 seconds
- CPU utilization: Soared to 85%
- When 10 concurrent queries were made, response time increased to over 30 seconds

2.2 Detailed Explanation of Lack of Relevance Ranking

Pain Points of MySQL Query Results:

-- MySQL can only sort by fixed rules
SELECT * FROM products
WHERE name LIKE '%手机%'
ORDER BY price DESC;  -- Can only sort by columns such as price or time

Elasticsearch's Relevance Scoring Mechanism:

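Search term: "小米手机" (Xiaomi phone)

Relevance score calculation:
┌────────────────────────────────────────────────────────┐
│ Document 1: "小米手机12 Pro" (Xiaomi Phone 12 Pro)     │
│ • Term frequency (TF): both query terms appear         │
│ • Inverse document frequency (IDF): term rarity        │
│ • Field length: short title, higher weight             │
│ • Score: 9.8                                           │
└────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────┐
│ Document 2: "这是一款性价比很高的手机"                 │
│   (This is a very cost-effective phone)                │
│ • Term frequency (TF): only "手机" appears             │
│ • Inverse document frequency (IDF): "手机" is common   │
│ • Field length: long description, lower weight         │
│ • Score: 3.2                                           │
└────────────────────────────────────────────────────────┘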

Detailed Explanation of Relevance Factors:

  1. TF (Term Frequency): Frequency of keywords appearing in the document
  2. IDF (Inverse Document Frequency): Rarity of keywords across all documents
  3. Field Length Normalization: Matches in shorter fields have higher weight than in longer fields
  4. Field Weight Boost: Can set title to be more important than content
  5. Query-time Weight: Can specify certain query terms as more important
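
The first three factors can be illustrated with a small BM25-style scorer. This is a minimal sketch, not Elasticsearch's actual implementation: the function name `bm25_score`, the pre-tokenized sample documents, and the parameter values `k1=1.2` and `b=0.75` (the commonly cited BM25 defaults) are assumptions chosen for illustration.

```python
import math

def bm25_score(query_terms, doc, docs, k1=1.2, b=0.75):
    """Score one tokenized document against the query with a BM25-style formula."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for d in docs if term in d)  # number of documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # rarer terms weigh more
        norm = 1 - b + b * len(doc) / avg_len  # longer fields are penalized
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score

# Hypothetical, already-tokenized documents
docs = [
    ["小米", "手机", "12", "Pro"],
    ["这是", "一款", "性价比", "很高", "的", "手机"],
    ["小米", "电视"],
]
query = ["小米", "手机"]
scores = [bm25_score(query, d, docs) for d in docs]
```

With this sample data, the short title that contains both query terms scores highest, and the long description that matches only "手机" scores lowest.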

2.3 Detailed Explanation of Inability to Perform Full-Text Search

Limitations of MySQL Full-Text Index:

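-- Create a MySQL full-text index
ALTER TABLE products ADD FULLTEXT(name, description);

-- Issue 1: Minimum token length (default 3 for InnoDB, 4 for MyISAM)
-- Tokens shorter than the minimum are not indexed, so short terms such as "机" cannot be found

-- Issue 2: Poor Chinese word segmentation
-- The default parser splits on spaces and punctuation, so "苹果手机" (Apple phone) is treated as a single token and searching for "苹果" (Apple) finds nothing
-- (the built-in ngram parser, available since MySQL 5.7.6, helps but only produces fixed-length n-grams)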

Elasticsearch Full-Text Search Capabilities:

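// ES analysis process example (with a Chinese analyzer such as IK)
Input text: "我想买一台苹果手机" (I want to buy an Apple phone)

Tokenization result:
[我] [想] [买] [一台] [苹果] [手机] [苹果手机]

Synonym expansion (requires a configured synonym filter):
[苹果] → [Apple, iPhone]
[手机] → [手机, 电话, mobile]

Spelling correction (via suggesters):
"苹果手击" → suggests "苹果手机"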
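
The tokenize-then-expand pipeline can be sketched in a few lines. This is a simplified illustration, not how IK or Elasticsearch is implemented: it uses forward maximum matching against a tiny hypothetical dictionary (`DICTIONARY`), and the `segment` and `expand_synonyms` helpers and the `SYNONYMS` table are names invented for this example.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

def expand_synonyms(tokens, synonyms):
    """Emit each token followed by its configured synonyms."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(synonyms.get(token, []))
    return expanded

# Hypothetical dictionary and synonym table
DICTIONARY = {"一台", "苹果", "手机"}
SYNONYMS = {"苹果": ["Apple", "iPhone"], "手机": ["电话", "mobile"]}

tokens = segment("我想买一台苹果手机", DICTIONARY)
expanded = expand_synonyms(tokens, SYNONYMS)
```

A production analyzer relies on a much larger dictionary and can emit overlapping tokens such as "苹果手机"; the sketch only shows the basic flow from raw text to searchable terms.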

2.4 Detailed Explanation of Inaccurate Search and Lack of Word Segmentation

Problems with MySQL String Matching:

-- Search for "笔记本" (notebook/laptop)
SELECT * FROM products WHERE name LIKE '%笔记本%';
-- Result: finds "笔记本电脑" (laptop computer)
-- Problem: misses "笔记 本子", "notebook", and "手提电脑" (portable computer)

Elasticsearch Smart Word Segmentation Process:

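Original text: "ThinkPad X1 Carbon超轻薄笔记本电脑" (ThinkPad X1 Carbon ultra-thin laptop)

Standard analyzer (lowercases; splits Chinese into single characters):
[thinkpad] [x1] [carbon] [超] [轻] [薄] [笔] [记] [本] [电] [脑]

IK analyzer (Chinese, ik_max_word):
[ThinkPad] [X1] [Carbon] [超] [轻薄] [超轻薄]
[笔记] [本] [笔记本] [电脑] [笔记本电脑]

Pinyin analyzer:
[bi] [ji] [ben] [dian] [nao] → enables search by Pinyin

N-gram tokenizer:
[Thi] [hin] [ink] [nkP] → supports partial matching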
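
Character n-grams are simple enough to reproduce directly. The following sketch (the helper name `char_ngrams` is hypothetical) generates the trigrams of "ThinkPad" to show why a fragment such as "nkP" becomes searchable.

```python
def char_ngrams(text, n=3):
    """Return every contiguous substring of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

grams = char_ngrams("ThinkPad")
# A query fragment matches if it is one of the indexed grams
partial_match = "nkP" in grams
```

The trade-off is index size: every extra gram is another term in the index, so n-gram fields are usually reserved for the fields that need partial matching.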

3. What is Full-Text Search - Core Principle Analysis

3.1 Structured Data vs Unstructured Data

Structured Data (MySQL storage method):
┌──────┬────────┬────────┬────────┐
│  ID  │  Name  │ Price  │ Stock  │
├──────┼────────┼────────┼────────┤
│  1   │iPhone  │ 5999   │  100   │
│  2   │ 小米   │ 2999   │  200   │
└──────┴────────┴────────┴────────┘

Unstructured Data (Text content):
"This iPhone uses an A15 processor, with powerful performance,
excellent camera effects, and 20% improved battery life,
User review: 'Awesome, great value for money!'"

3.2 Detailed Explanation of Inverted Index Principle

Forward Index (MySQL):

Document ID → Content
Doc1 → "小米手机"
Doc2 → "苹果手机"
Doc3 → "小米电视"

Inverted Index (Elasticsearch):

Term → Document List
"小米" → [Doc1, Doc3]
"手机" → [Doc1, Doc2]
"苹果" → [Doc2]
"电视" → [Doc3]

Search "小米手机" (Xiaomi phone):
1. Look up "小米" → [Doc1, Doc3]
2. Look up "手机" → [Doc1, Doc2]
3. Intersect the two lists → Doc1 (the only document containing both terms, so the most relevant)
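
The lookup-and-intersect steps translate almost directly into code. This is a minimal in-memory sketch with pre-tokenized documents; the function names `build_inverted_index` and `search_all` are assumptions for illustration, and a real engine stores compressed posting lists on disk instead of Python sets.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def search_all(index, terms):
    """Return the IDs of documents that contain every query term (AND semantics)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    "Doc1": ["小米", "手机"],
    "Doc2": ["苹果", "手机"],
    "Doc3": ["小米", "电视"],
}
index = build_inverted_index(docs)
hits = search_all(index, ["小米", "手机"])
```

Each query term costs one dictionary lookup regardless of how many documents exist, which is why this structure avoids the full scan that LIKE '%keyword%' requires.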

3.3 Detailed Structure of Inverted Index

Complete Inverted Index Structure:

Term: "手机"
├── Document Frequency (DF): 1000 documents contain this term
├── Inverted List:
│   ├── Doc1:
│   │   ├── Term Frequency (TF): 3 times
│   │   ├── Positions: [5, 28, 102]
│   │   └── Fields: [title, description]
│   ├── Doc2:
│   │   ├── Term Frequency (TF): 1 time
│   │   ├── Positions: [15]
│   │   └── Fields: [title]
│   └── ...
└── Statistics: Highest term frequency, average term frequency, etc.

4. Elasticsearch Architecture Explained

4.1 Cluster Architecture

Elasticsearch Cluster Architecture Diagram:

┌─────────────── ES Cluster ──────────────┐
│                                         │
│  ┌─────────────────────────────────┐   │
│  │     Master Node                 │   │
│  │  • Cluster management           │   │
│  │  • Index creation/deletion      │   │
│  │  • Shard allocation             │   │
│  └─────────────────────────────────┘   │
│                                         │
│  ┌──────────┐  ┌──────────┐           │
│  │ Data     │  │ Data     │           │
│  │ Node 1   │  │ Node 2   │           │
│  │ ┌──────┐ │  │ ┌──────┐ │           │
│  │ │ P0   │ │  │ │ R0   │ │           │
│  │ ├──────┤ │  │ ├──────┤ │           │
│  │ │ R1   │ │  │ │ P1   │ │           │
│  │ └──────┘ │  │ └──────┘ │           │
│  └──────────┘  └──────────┘           │
│                                         │
│  P = Primary Shard                     │
│  R = Replica Shard                     │
└─────────────────────────────────────────┘

4.2 Data Write Process

Detailed Write Process:

Client → Coordinating Node → Primary Shard → Replica Shard

1. Client sends write request
   ↓
2. Coordinating node determines shard via hash routing
   ↓
3. Request forwarded to primary shard node
   ↓
4. Primary shard writes successfully
   ↓
5. Replicated to replica shards in parallel
   ↓
6. All replicas confirm
   ↓
7. Returns success response to client

Timeline:
T0 ──────→ T1 ────→ T2 ────────────→ T3 ────────────→ T4
Receive    Route    Primary write    Replica write    Respond
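
Step 2, hash routing, can be sketched as a hash of the routing value taken modulo the number of primary shards. This is a simplified illustration: Elasticsearch hashes the routing value (the document ID by default) with Murmur3, while the sketch substitutes `zlib.crc32` only because it is in the Python standard library. The function name `route_to_shard` is hypothetical.

```python
import zlib

def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    """Pick a primary shard from a hash of the routing value."""
    # Stand-in hash: crc32 replaces Murmur3 for illustration only
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

# Route six document IDs across two primary shards (P0 and P1)
shards = [route_to_shard(str(i), 2) for i in range(1, 7)]
```

Because the shard number depends on the modulus, changing the number of primary shards would send existing IDs to different shards, which is why the primary shard count is fixed when an index is created.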

4.3 Query Process

Query Execution Process:

Phase 1: Query
┌──────────────────────────────────────────────────────┐
│ Coordinating node sends query requests to all shards │
│ Each shard returns Top N document IDs and scores     │
└──────────────────────────────────────────────────────┘
                           ↓
Phase 2: Fetch
┌──────────────────────────────────────────────────────┐
│ Coordinating node aggregates and sorts all results   │
│ Then fetches the full content of the final documents │
└──────────────────────────────────────────────────────┘
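
The two phases can be simulated in memory to show why only IDs and scores travel in the first phase. This is a sketch under stated assumptions: each shard is modeled as a plain dict, the score is a hypothetical `sales` field, and `query_phase` and `search` are illustrative names, not Elasticsearch APIs.

```python
import heapq

def query_phase(shard_docs, score_fn, top_n):
    """Each shard returns only (score, doc_id) pairs for its local top N."""
    scored = [(score_fn(doc), doc_id) for doc_id, doc in shard_docs.items()]
    return heapq.nlargest(top_n, scored)

def search(shards, score_fn, top_n):
    # Query phase: collect lightweight (score, doc_id) pairs from every shard
    candidates = []
    for shard in shards:
        candidates.extend(query_phase(shard, score_fn, top_n))
    winners = heapq.nlargest(top_n, candidates)  # global merge and sort
    # Fetch phase: load full documents only for the global winners
    store = {doc_id: doc for shard in shards for doc_id, doc in shard.items()}
    return [store[doc_id] for _, doc_id in winners]

shards = [
    {"a": {"title": "小米手机", "sales": 90}, "b": {"title": "小米电视", "sales": 10}},
    {"c": {"title": "苹果手机", "sales": 50}, "d": {"title": "华为手机", "sales": 70}},
]
results = search(shards, score_fn=lambda doc: doc["sales"], top_n=2)
```

Every shard must return its own top N because the coordinating node cannot know in advance which shard holds the global winners; this is also the reason deep pagination becomes expensive.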

5. Elasticsearch Core Features Explained

5.1 Detailed Explanation of Query Types

// 1. Match Query - Full-Text Search
{
  "query": {
    "match": {
      "title": {
        "query": "苹果手机",
        "operator": "and"  // Must contain all terms
      }
    }
  }
}

// 2. Term Query - Exact Match
{
  "query": {
    "term": {
      "category.keyword": "手机"  // No tokenization, exact match
    }
  }
}

// 3. Range Query - Range Search
{
  "query": {
    "range": {
      "price": {
        "gte": 1000,
        "lte": 5000
      }
    }
  }
}

// 4. Bool Compound Query
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "手机"}}
      ],
      "filter": [
        {"range": {"price": {"lte": 5000}}}
      ],
      "should": [
        {"match": {"brand": "苹果"}}  // Bonus item
      ],
      "must_not": [
        {"term": {"status": "discontinued"}}
      ]
    }
  }
}
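
In application code, a request body like the bool query above is usually assembled from user input. The following sketch builds the same structure as a Python dict and serializes it to JSON; the helper name `build_product_query` and its parameters are hypothetical, and the field names are taken from the examples above.

```python
import json

def build_product_query(keyword, max_price=None, preferred_brand=None):
    """Assemble a bool query: must match the keyword, optionally filter and boost."""
    bool_query = {
        "must": [{"match": {"title": keyword}}],
        "must_not": [{"term": {"status": "discontinued"}}],
    }
    if max_price is not None:
        # filter clauses restrict results without affecting the score
        bool_query["filter"] = [{"range": {"price": {"lte": max_price}}}]
    if preferred_brand is not None:
        # should clauses only add to the score of documents that match
        bool_query["should"] = [{"match": {"brand": preferred_brand}}]
    return {"query": {"bool": bool_query}}

body = build_product_query("手机", max_price=5000, preferred_brand="苹果")
payload = json.dumps(body, ensure_ascii=False)
```

The resulting `payload` string is what would be sent as the body of a `_search` request.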

5.2 Aggregation Analysis Function

// Sales Data Analysis Example
{
  "aggs": {
    "sales_per_category": {
      "terms": {
        "field": "category"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "total_sales": {
          "sum": {
            "field": "sales_count"
          }
        },
        "price_ranges": {
          "range": {
            "field": "price",
            "ranges": [
              {"to": 1000},
              {"from": 1000, "to": 5000},
              {"from": 5000}
            ]
          }
        }
      }
    }
  }
}

6. Real-world Application Case Studies

6.1 E-commerce Search Optimization Case Study

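Comparison of a certain e-commerce platform's search before and after optimization:

┌────────────────────────┬────────────────┬────────────────────────┬───────────────────────┐
│ Metric                 │ MySQL Solution │ Elasticsearch Solution │ Improvement           │
├────────────────────────┼────────────────┼────────────────────────┼───────────────────────┤
│ Average Search Time    │ 2.3 s          │ 0.05 s                 │ 46x faster            │
│ Search Accuracy        │ 65%            │ 92%                    │ +27 percentage points │
│ Zero Result Rate       │ 18%            │ 3%                     │ -15 percentage points │
│ Number of Servers      │ 8              │ 3                      │ 62.5% fewer servers   │
│ Concurrency Capability │ 100 QPS        │ 5,000 QPS              │ 50x higher            │
└────────────────────────┴────────────────┴────────────────────────┴───────────────────────┘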

Implementation Details:

  1. Data Synchronization Architecture:
    MySQL (primary data) → Binlog → Logstash → Elasticsearch

    Scheduled full synchronization (nightly)

  2. Search Optimization Strategies:
    • Pinyin Search: Supports searching for "pinguo" to find "苹果" (Apple)
    • Synonyms: Configured "手机" (phone), "电话" (telephone), "mobile" as synonyms
    • Search Suggestions: Real-time prompts for possible search terms to users
    • Correction Function: Automatically corrects common spelling errors

6.2 Log Analysis System Case Study

A certain internet company's log analysis system:

Log Processing Flow:

Application Server → Filebeat → Logstash → Elasticsearch → Kibana

  Application Server: generates logs
  Filebeat:           collects logs
  Logstash:           processes and transforms
  Elasticsearch:      stores and indexes
  Kibana:             visualizes and displays

Processing Scale:
• Log volume: 100GB per day
• Number of log entries: 1 billion entries/day
• Query response: Millisecond level
• Retention period: 30 days hot data, 1 year cold data

7. Performance Optimization Best Practices

7.1 Index Design Optimization

// Optimized Mapping Design
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "fields": {
          "keyword": {
            "type": "keyword"  // Supports exact matching
          },
          "pinyin": {
            "type": "text",
            "analyzer": "pinyin"  // Supports Pinyin search
          }
        }
      },
      "price": {
        "type": "scaled_float",
        "scaling_factor": 100  // Price precision optimization
      },
      "category": {
        "type": "keyword"  // Category does not require tokenization
      },
      "description": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "created_time": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
      }
    }
  }
}

7.2 Query Performance Optimization Tips

  1. Use Filter instead of Query (when scoring is not needed)
    ```json
    // Before optimization: using query (calculates score)
    {"query": {"term": {"status": "active"}}}

    // After optimization: using filter (does not calculate score, can be cached)
    {"query": {"bool": {"filter": {"term": {"status": "active"}}}}}
    ```

  2. Set the shard count reasonably
    ```
    Shard count reference formula (a rule of thumb, not a hard limit):
    Shard count = Data volume (GB) / 30 GB per shard

    Example:
    - 100GB data: 3-4 shards
    - 1TB data: 35-40 shards
    ```
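
The reference formula is easy to wrap in a helper. This is a minimal sketch of the rule of thumb above; the function name `estimate_primary_shards` and the default of 30 GB per shard are taken from the formula, not from any Elasticsearch API.

```python
import math

def estimate_primary_shards(data_gb: float, target_shard_gb: float = 30.0) -> int:
    """Divide the data volume by the target shard size and round up (minimum 1)."""
    return max(1, math.ceil(data_gb / target_shard_gb))

small = estimate_primary_shards(100)    # 100 GB of data
large = estimate_primary_shards(1024)   # 1 TB of data
```

Rounding up gives the lower end of the ranges in the examples; leaving headroom for growth explains the upper end.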

  3. Batch operation optimization
    // Use bulk API for batch indexing
    POST _bulk
    {"index": {"_index": "products", "_id": 1}}
    {"name": "iPhone", "price": 5999}
    {"index": {"_index": "products", "_id": 2}}
    {"name": "小米", "price": 2999}
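
A bulk request body is newline-delimited JSON: one action line, then one source line, for each document, with a trailing newline at the end. The sketch below only builds that body; the helper name `build_bulk_body` is hypothetical, and sending the request (for example with an HTTP client and the `application/x-ndjson` content type) is left out.

```python
import json

def build_bulk_body(index_name, docs):
    """Serialize (doc_id, source) pairs as newline-delimited JSON for the _bulk API."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

body = build_bulk_body("products", [
    (1, {"name": "iPhone", "price": 5999}),
    (2, {"name": "小米", "price": 2999}),
])
```

Batching many documents into one request saves a network round trip per document, which is where most of the indexing speedup comes from.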

8. Elasticsearch vs Traditional Databases

8.1 Applicable Scenarios Comparison

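┌────────────────────────────────┬─────────────────┬────────────────────┬────────────────────┐
│ Scenario                       │ MySQL           │ Elasticsearch      │ Recommended Choice │
├────────────────────────────────┼─────────────────┼────────────────────┼────────────────────┤
│ Full-Text Search               │ ❌ Poor         │ ✅ Excellent       │ ES                 │
│ Transaction Support            │ ✅ Full ACID    │ ❌ No transactions │ MySQL              │
│ Real-time Statistical Analysis │ ⚠️ Average      │ ✅ Excellent       │ ES                 │
│ Relational Queries             │ ✅ Excellent    │ ❌ Limited         │ MySQL              │
│ Geolocation Search             │ ❌ Poor         │ ✅ Excellent       │ ES                 │
│ Log Analysis                   │ ❌ Not suitable │ ✅ Specialty       │ ES                 │
│ Precise Numerical Calculation  │ ✅ Precise      │ ⚠️ Approximate     │ MySQL              │
└────────────────────────────────┴─────────────────┴────────────────────┴────────────────────┘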

8.2 Hybrid Architecture Solution

Recommended Hybrid Architecture:

                  User Request
                       ↓
            ┌─────────────────────┐
            │  Application Layer  │
            └─────────────────────┘
                       ↓
    ┌──────────────────────────────────────┐
    │  Search Requests         → ES        │
    │  Transaction Operations  → MySQL     │
    │  Cache                   → Redis     │
    └──────────────────────────────────────┘

Data Synchronization:
MySQL(Write) → Binlog → Canal/Debezium → Kafka → ES(Read)

9. Common Problems and Solutions

9.1 Data Consistency Issues

Problem: MySQL and ES data inconsistency

Solutions:
1. Dual-write strategy: Write to MySQL and ES simultaneously, use message queues to ensure eventual consistency
2. CDC (Change Data Capture): Real-time synchronization via Binlog
3. Regular verification: Scheduled tasks compare data differences and fix them
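
The regular verification in solution 3 boils down to comparing two maps of ID to version and classifying the differences. The sketch below assumes both sides can be exported as `{id: version}` dicts; the function name `diff_stores` and the integer version values are hypothetical.

```python
def diff_stores(mysql_rows, es_docs):
    """Compare id -> version maps and report what must be repaired in ES."""
    mysql_ids, es_ids = set(mysql_rows), set(es_docs)
    return {
        "missing": sorted(mysql_ids - es_ids),    # in MySQL but never indexed
        "orphaned": sorted(es_ids - mysql_ids),   # still in ES after deletion in MySQL
        "stale": sorted(i for i in mysql_ids & es_ids
                        if mysql_rows[i] != es_docs[i]),  # indexed, but outdated
    }

mysql_rows = {1: 1, 2: 3, 3: 1}   # hypothetical id -> version
es_docs = {1: 1, 2: 2, 4: 1}
report = diff_stores(mysql_rows, es_docs)
```

A scheduled job would then re-index the `missing` and `stale` IDs from MySQL and delete the `orphaned` ones from Elasticsearch.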

9.2 Deep Paging Problem

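Problem: Deep pages (for example, page 10,000) perform extremely poorly, because with from + size paging every shard must collect and sort all hits up to the requested offset. By default, Elasticsearch also rejects requests where from + size exceeds 10,000 (index.max_result_window).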

Solutions:

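// 1. Use search_after (recommended)
// Note: the sort needs a unique tiebreaker. Recent Elasticsearch versions restrict
// sorting on _id, so in practice sort on a unique field that has doc_values.
{
  "size": 10,
  "sort": [{"_id": "asc"}],
  "search_after": [10000]  // Sort values of the last document on the previous page
}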

// 2. Use scroll API (suitable for export)
POST /products/_search?scroll=1m
{
  "size": 100,
  "query": {"match_all": {}}
}
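
The idea behind search_after is cursor-based paging: instead of skipping N hits, each request asks for the hits that come after the last sort value already seen. The in-memory sketch below simulates this over a list sorted by a unique `id` field; `search_after_page` is a hypothetical helper, not a client API.

```python
def search_after_page(rows, size, after=None):
    """Return the next `size` rows whose sort key is strictly greater than `after`."""
    ordered = sorted(rows, key=lambda r: r["id"])  # sort by the unique key
    if after is not None:
        ordered = [r for r in ordered if r["id"] > after]
    return ordered[:size]

rows = [{"id": i} for i in range(1, 8)]
pages, cursor = [], None
while True:
    page = search_after_page(rows, size=3, after=cursor)
    if not page:
        break
    pages.append([r["id"] for r in page])
    cursor = page[-1]["id"]  # sort value of the last hit becomes the next cursor
```

Each request costs roughly the same no matter how deep the page is, at the price of not being able to jump directly to an arbitrary page number.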

10. Summary

Through its inverted index, distributed architecture, and full-text search capabilities, Elasticsearch addresses many of the problems traditional databases face in search scenarios. Proper use of Elasticsearch can:

  1. Improve search performance: From seconds to milliseconds
  2. Enhance search quality: Through relevance scoring and smart tokenization
  3. Support complex analysis: Real-time aggregation and statistical analysis
  4. Reduce operational costs: Fewer servers, higher efficiency

However, it's important to note that Elasticsearch is not a replacement for MySQL, but rather a complement. In actual projects, the appropriate storage solution should be chosen based on specific scenarios, and typically a hybrid architecture of MySQL+Elasticsearch can leverage their respective strengths.

Theme test article, for testing purposes only. Published by: Walker. Please credit the source when reposting: https://walker-learn.xyz/archives/6756
