Go Engineer Comprehensive Course 009 [Study Notes]


Other Features

  • Personal Center
  • Favorites
  • Manage Shipping Addresses (Add, Delete, Modify, Query)
  • Messages

Copy inventory_srv to userop_srv, then find and replace every occurrence of inventory with userop.

Elasticsearch In-depth Analysis Document

1. What is Elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene, capable of rapidly storing, searching, and analyzing massive amounts of data. It is a core component of the Elastic Stack (formerly ELK Stack).

2. Problems Faced by MySQL Search - In-depth Analysis

2.1 Detailed Explanation of Low Performance Issues

Problem Phenomenon:

-- With about 1 million rows, the following query can take several seconds (手机 = "mobile phone")
SELECT * FROM products WHERE name LIKE '%手机%' OR description LIKE '%手机%';

Performance Comparison Data:

| Data Volume | MySQL LIKE Query | Elasticsearch Full-Text Search | Performance Improvement |
|---|---|---|---|
| 10,000 records | 50 ms | 10 ms | 5x |
| 100,000 records | 500 ms | 15 ms | 33x |
| 1 million records | 5,000 ms | 20 ms | 250x |
| 10 million records | 50,000+ ms | 30 ms | 1,600x+ |

Root Causes:

  1. Full Table Scan: A pattern with a leading wildcard, LIKE '%keyword%', cannot use a B+ tree index, so MySQL has to scan every row.
  2. I/O Intensive: Each query needs to read a large amount of data from disk.
  3. CPU Intensive: String matching operations are performed on every row of data.
  4. Memory Pressure: A large amount of data is loaded into memory for processing.

Real-world Case:

An e-commerce platform's product table has 5 million records. Using MySQL fuzzy search for "Apple phone":
- Query time: 8.3 seconds
- CPU usage: Spiked to 85%
- With 10 concurrent queries, response time increased to over 30 seconds

2.2 No Relevance Ranking in Detail

Pain Points of MySQL Query Results:

-- MySQL can only sort by fixed rules
SELECT * FROM products
WHERE name LIKE '%手机%'
ORDER BY price DESC;  -- can only sort by columns such as price or time

Elasticsearch's Relevance Scoring Mechanism:

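Search term: "小米手机" (Xiaomi phone)

Relevance score calculation (scores are illustrative):
┌──────────────────────────────────────────────────────────┐
│ Document 1: "小米手机12 Pro"                             │
│ • Term frequency (TF): both query terms appear           │
│ • Inverse document frequency (IDF): measures term rarity │
│ • Field length: short title, higher weight               │
│ • Score: 9.8                                             │
└──────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────┐
│ Document 2: "这是一款性价比很高的手机"                   │
│ • Term frequency (TF): only "手机" appears               │
│ • Inverse document frequency (IDF): "手机" is common     │
│ • Field length: long description, lower weight           │
│ • Score: 3.2                                             │
└──────────────────────────────────────────────────────────┘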

Detailed Explanation of Relevance Factors:

  1. TF (Term Frequency): The frequency of a keyword appearing in a document.
  2. IDF (Inverse Document Frequency): The rarity of a keyword across all documents.
  3. Field Length Normalization: Matches in shorter fields have higher weight than in longer fields.
  4. Field Weight Boost: Allows setting titles to be more important than content.
  5. Query Time Weight: Allows specifying certain query terms as more important.
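
Since version 5.0, Elasticsearch scores with BM25 by default, which combines the first three factors above (TF, IDF, and field-length normalization). The following is a minimal sketch of the classic BM25 formula, not Lucene's exact implementation. The function name, the pre-tokenized toy corpus, and the parameter values k1 = 1.2 and b = 0.75 are assumptions chosen for illustration.

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against the query with the classic BM25 formula."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)                      # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)  # documents that contain the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = 1 - b + b * len(doc) / avg_len  # longer fields are penalized
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score

# Hypothetical, already-tokenized documents.
corpus = [
    ["小米", "手机", "12", "pro"],
    ["这", "是", "一款", "性价比", "很", "高", "的", "手机"],
    ["小米", "电视"],
]
query = ["小米", "手机"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
```

In this toy corpus the short title that contains both query terms scores highest, and the long description that contains only one term scores lowest, which mirrors the two boxes above.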

2.3 Limited Full-Text Search in Detail

Limitations of MySQL Full-Text Indexing:

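-- Create a MySQL full-text index
ALTER TABLE products ADD FULLTEXT(name, description);

-- Problem 1: minimum token length
-- (default 3 for InnoDB via innodb_ft_min_token_size, 4 for MyISAM via ft_min_word_len)
-- Tokens shorter than the limit are not indexed, so short terms cannot be found

-- Problem 2: weak Chinese word segmentation
-- The default parser splits on whitespace and punctuation, so "苹果手机" (Apple phone)
-- is indexed as one token and a search for "苹果" (Apple) does not find it.
-- The built-in ngram parser (WITH PARSER ngram) helps, but it is not dictionary-aware.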

Elasticsearch Full-Text Search Capabilities:

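// Example of the ES analysis process
Input text: "我想买一台苹果手机" (I want to buy an Apple phone)

Tokens (with a Chinese analyzer such as IK):
[我] [想] [买] [一台] [苹果] [手机] [苹果手机]

Synonym expansion (with a configured synonym filter):
[苹果] → [Apple, iPhone]
[手机] → [手机, 电话, mobile]

Spelling correction:
"苹果手击" → suggests "苹果手机"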

2.4 Inaccurate Matching and No Word Segmentation in Detail

Problems with MySQL String Matching:

-- Search for "笔记本" (notebook / laptop)
SELECT * FROM products WHERE name LIKE '%笔记本%';
-- Result: finds "笔记本电脑" (laptop computer)
-- Problem: misses "笔记 本子", "notebook", and "手提电脑" (portable computer)

Elasticsearch Smart Word Segmentation Process:

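Original text: "ThinkPad X1 Carbon超轻薄笔记本电脑" (ThinkPad X1 Carbon ultra-thin laptop)

Standard analyzer (lowercases; splits Chinese text into single characters):
[thinkpad] [x1] [carbon] [超] [轻] [薄] [笔] [记] [本] [电] [脑]

IK analyzer (Chinese, ik_max_word; exact tokens depend on the dictionary):
[ThinkPad] [X1] [Carbon] [超] [轻薄] [超轻薄]
[笔记] [本] [笔记本] [电脑] [笔记本电脑]

Pinyin analyzer (plugin):
[bi] [ji] [ben] [dian] [nao] → enables searching by pinyin

N-gram tokenizer (3-grams of "ThinkPad"):
[Thi] [hin] [ink] [nkP] ... → supports partial matching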
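
The N-gram idea above can be sketched in a few lines of Python. This is an illustration of the technique with a fixed gram size of 3, not the Elasticsearch tokenizer itself (which has configurable minimum and maximum gram sizes); the function names are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return every contiguous substring of length n (a sliding window over characters)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_match(query, indexed_text, n=3):
    """A query matches if every one of its n-grams was produced when indexing the text."""
    indexed = set(char_ngrams(indexed_text, n))
    grams = char_ngrams(query, n)
    return bool(grams) and all(g in indexed for g in grams)

grams = char_ngrams("ThinkPad")
partial_hit = ngram_match("inkP", "ThinkPad")
```

Because the index stores the fragments rather than the whole word, a query for the middle of a word can still match, at the cost of a larger index.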

3. What is Full-Text Search - Core Principle Analysis

3.1 Structured Data vs Unstructured Data

Structured data (how MySQL stores it):
┌──────┬────────┬────────┬────────┐
│  ID  │  Name  │ Price  │ Stock  │
├──────┼────────┼────────┼────────┤
│  1   │iPhone  │ 5999   │  100   │
│  2   │ 小米   │ 2999   │  200   │
└──────┴────────┴────────┴────────┘

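Unstructured data (free text):
"This iPhone uses the A15 chip and delivers strong performance,
excellent photo quality, and 20% longer battery life.
User review: 'Fantastic, great value for money!'"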

3.2 Detailed Explanation of Inverted Index Principle

Forward Index (MySQL):

Document ID → Content
Doc1 → "小米手机" (Xiaomi phone)
Doc2 → "苹果手机" (Apple phone)
Doc3 → "小米电视" (Xiaomi TV)

Inverted Index (Elasticsearch):

Term → Posting list
"小米" → [Doc1, Doc3]
"手机" → [Doc1, Doc2]
"苹果" → [Doc2]
"电视" → [Doc3]

Searching for "小米手机":
1. Look up "小米" → [Doc1, Doc3]
2. Look up "手机" → [Doc1, Doc2]
3. Intersect the lists → Doc1 (most relevant)
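
The lookup-and-intersect procedure above can be sketched as follows. This is a toy model under the assumption that documents are already tokenized; the function names are illustrative and not part of any Elasticsearch API.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search_all(index, query_terms):
    """Return the documents that contain every query term (posting-list intersection)."""
    postings = [set(index.get(term, [])) for term in query_terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

docs = {
    "Doc1": ["小米", "手机"],
    "Doc2": ["苹果", "手机"],
    "Doc3": ["小米", "电视"],
}
index = build_inverted_index(docs)
result = search_all(index, ["小米", "手机"])
```

The cost of a search depends on the length of the posting lists involved, not on the total number of documents, which is the key difference from a full table scan.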

3.3 Detailed Structure of Inverted Index

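Full inverted index structure:

Term: "手机"
├── Document frequency (DF): 1000 documents contain this term
├── Posting list:
│   ├── Doc1:
│   │   ├── Term frequency (TF): 3
│   │   ├── Positions: [5, 28, 102]
│   │   └── Fields: [title, description]
│   ├── Doc2:
│   │   ├── Term frequency (TF): 1
│   │   ├── Positions: [15]
│   │   └── Fields: [title]
│   └── ...
└── Statistics: maximum term frequency, average term frequency, etc.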
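
A simplified version of this structure, keeping document frequency, term frequency, and positions but leaving out the per-field breakdown, might look like the sketch below. The names and the sample documents are assumptions for illustration only.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build term -> {doc_id: [positions]} from pre-tokenized documents."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)

def term_stats(index, term):
    """Summarize a term: document frequency plus per-document TF and positions."""
    postings = index.get(term, {})
    return {
        "df": len(postings),
        "postings": {
            doc_id: {"tf": len(positions), "positions": positions}
            for doc_id, positions in postings.items()
        },
    }

docs = {
    "Doc1": ["手机", "推荐", "小米", "手机"],
    "Doc2": ["苹果", "手机"],
}
stats = term_stats(build_positional_index(docs), "手机")
```

Term frequency and document frequency feed the relevance score, while stored positions are what make phrase queries possible.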

4. Elasticsearch Architecture Explained

4.1 Cluster Architecture

Elasticsearch cluster architecture:

┌─────────────── ES Cluster ──────────────┐
│                                         │
│  ┌─────────────────────────────────┐    │
│  │     Master Node                 │    │
│  │  • Cluster management           │    │
│  │  • Index creation/deletion      │    │
│  │  • Shard allocation             │    │
│  └─────────────────────────────────┘    │
│                                         │
│  ┌──────────┐  ┌──────────┐             │
│  │ Data     │  │ Data     │             │
│  │ Node 1   │  │ Node 2   │             │
│  │ ┌──────┐ │  │ ┌──────┐ │             │
│  │ │ P0   │ │  │ │ R0   │ │             │
│  │ ├──────┤ │  │ ├──────┤ │             │
│  │ │ R1   │ │  │ │ P1   │ │             │
│  │ └──────┘ │  │ └──────┘ │             │
│  └──────────┘  └──────────┘             │
│                                         │
│  P = Primary Shard                      │
│  R = Replica Shard                      │
└─────────────────────────────────────────┘

4.2 Data Write Process

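Write path in detail:

Client → Coordinating node → Primary shard → Replica shards

1. The client sends a write request
   ↓
2. The coordinating node picks the target shard by hashing the routing value
   ↓
3. The request is forwarded to the node that holds the primary shard
   ↓
4. The primary shard applies the write
   ↓
5. The write is replicated in parallel to the replica shards
   ↓
6. All in-sync replicas acknowledge
   ↓
7. A success response is returned to the client

Timeline:
T0 ───→ T1 ───→ T2 ───→ T3 ───→ T4
Receive Route   Primary Replica Respond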
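
Step 2 of the write path boils down to hashing the routing value (the document ID by default) and taking the result modulo the number of primary shards. The sketch below shows that idea only. Elasticsearch uses a Murmur3 hash internally; `zlib.crc32` is a stand-in so the example runs on the standard library, and the function name and sample ID are hypothetical.

```python
import zlib

def route_to_shard(routing_value, num_primary_shards):
    """Pick a primary shard: hash the routing value, then take it modulo the shard count.

    crc32 is only a stand-in for the Murmur3 hash that Elasticsearch uses.
    """
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % num_primary_shards

shard_for_doc = route_to_shard("product-42", 2)
```

Because the shard number depends on the primary shard count, that count cannot simply be changed on a live index: the same ID would start mapping to a different shard.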

4.3 Query Process

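Query execution:

Phase 1: Query
┌─────────────────────────────────────────────────────────┐
│ The coordinating node sends the query to every shard    │
│ Each shard returns the IDs and scores of its top N docs │
└─────────────────────────────────────────────────────────┘
           ↓
Phase 2: Fetch
┌─────────────────────────────────────────────────────────┐
│ The coordinating node merges and sorts all results      │
│ It then fetches the full content of the final documents │
└─────────────────────────────────────────────────────────┘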
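
The two phases can be modeled with a small in-memory sketch. Scoring by raw substring count is a toy stand-in for real relevance scoring, and the shard contents and function names are assumptions made for illustration.

```python
import heapq

def query_phase(shard_docs, term, size):
    """Each shard scores its own documents and returns only (score, doc_id) for its top hits."""
    hits = [(doc["title"].count(term), doc_id)
            for doc_id, doc in shard_docs.items() if term in doc["title"]]
    return heapq.nlargest(size, hits)

def search(shards, term, size):
    """Coordinator: merge the per-shard top hits, then fetch full documents for the winners."""
    candidates = []
    for shard_no, shard_docs in enumerate(shards):
        for score, doc_id in query_phase(shard_docs, term, size):
            candidates.append((score, doc_id, shard_no))
    top = heapq.nlargest(size, candidates)
    # Fetch phase: only the final winners are read in full from their shards.
    return [shards[shard_no][doc_id] for _score, doc_id, shard_no in top]

shards = [
    {"a": {"title": "phone phone case"}, "b": {"title": "laptop"}},
    {"c": {"title": "phone"}, "d": {"title": "phone phone phone charger"}},
]
results = search(shards, "phone", 2)
```

Splitting the work this way keeps the first phase cheap: shards ship only IDs and scores, and full documents travel over the network just once, for the final page.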

5. Elasticsearch Core Features Explained

5.1 Query Types Explained

// 1. Match query - full-text search
{
  "query": {
    "match": {
      "title": {
        "query": "苹果手机",
        "operator": "and"  // every term must be present
      }
    }
  }
}

// 2. Term query - exact match
{
  "query": {
    "term": {
      "category.keyword": "手机"  // not analyzed, exact match (use "category" directly if the field is mapped as keyword)
    }
  }
}

// 3. Range query
{
  "query": {
    "range": {
      "price": {
        "gte": 1000,
        "lte": 5000
      }
    }
  }
}

// 4. Bool compound query
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "手机"}}
      ],
      "filter": [
        {"range": {"price": {"lte": 5000}}}
      ],
      "should": [
        {"match": {"brand": "苹果"}}  // optional clause that boosts the score
      ],
      "must_not": [
        {"term": {"status": "discontinued"}}
      ]
    }
  }
}
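
In application code, a bool query like the one above is usually assembled from optional user inputs. The helper below is one possible way to do that; the function name and its parameters are hypothetical, and it only builds the request body without contacting a cluster.

```python
import json

def build_product_query(keyword, max_price=None, preferred_brand=None, exclude_status=None):
    """Assemble a bool query body; only the clauses that were requested are included."""
    bool_query = {"must": [{"match": {"title": keyword}}]}
    if max_price is not None:
        # filter clauses restrict results without affecting the score
        bool_query["filter"] = [{"range": {"price": {"lte": max_price}}}]
    if preferred_brand is not None:
        # should clauses are optional and raise the score when they match
        bool_query["should"] = [{"match": {"brand": preferred_brand}}]
    if exclude_status is not None:
        bool_query["must_not"] = [{"term": {"status": exclude_status}}]
    return {"query": {"bool": bool_query}}

body = build_product_query("手机", max_price=5000, preferred_brand="苹果",
                           exclude_status="discontinued")
print(json.dumps(body, ensure_ascii=False, indent=2))
```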

5.2 Aggregation Analysis Functionality

// Sales data analysis example
{
  "aggs": {
    "sales_per_category": {
      "terms": {
        "field": "category"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "total_sales": {
          "sum": {
            "field": "sales_count"
          }
        },
        "price_ranges": {
          "range": {
            "field": "price",
            "ranges": [
              {"to": 1000},
              {"from": 1000, "to": 5000},
              {"from": 5000}
            ]
          }
        }
      }
    }
  }
}
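
To make the result of this request concrete, the sketch below computes the same kind of summary in plain Python: one bucket per category (the terms aggregation), each with an average price and a sales total (the nested metrics). The price_ranges sub-aggregation is left out, and the sample products are made up for illustration.

```python
from collections import defaultdict

def sales_per_category(products):
    """Group by category, then compute the average price and total sales per bucket."""
    buckets = defaultdict(list)
    for product in products:
        buckets[product["category"]].append(product)
    result = {}
    for category, items in buckets.items():
        result[category] = {
            "doc_count": len(items),
            "avg_price": sum(p["price"] for p in items) / len(items),
            "total_sales": sum(p["sales_count"] for p in items),
        }
    return result

products = [
    {"category": "phone", "price": 5999, "sales_count": 10},
    {"category": "phone", "price": 2999, "sales_count": 30},
    {"category": "tv", "price": 3999, "sales_count": 5},
]
summary = sales_per_category(products)
```

Elasticsearch performs this grouping on each shard and merges the partial results on the coordinating node, so the application receives only the summary rather than the raw documents.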

6. Practical Application Case Studies

6.1 E-commerce Search Optimization Case Study

Comparison of an E-commerce Platform's Search Optimization Before and After:

| Metric | MySQL Solution | Elasticsearch Solution | Improvement |
|---|---|---|---|
| Average Search Time | 2.3 seconds | 0.05 seconds | 46x faster |
| Search Accuracy | 65% | 92% | +27 percentage points |
| Zero Result Rate | 18% | 3% | -15 percentage points |
| Number of Servers | 8 | 3 | 62.5% fewer servers |
| Concurrency Capability | 100 QPS | 5000 QPS | 50x improvement |

Implementation Details:

  1. Data Synchronization Architecture:
    MySQL (primary data) → Binlog → Logstash → Elasticsearch

    Scheduled full synchronization (nightly)
  2. Search Optimization Strategies:
    • Pinyin Search: Supports searching for "pinguo" to find "苹果" (Apple)
    • Synonyms: Configured "手机" (phone), "电话" (telephone), "mobile" as synonyms
    • Search Suggestions: Real-time suggestions for possible search terms
    • Correction Function: Automatically corrects common spelling errors

6.2 Log Analysis System Case Study

Log Analysis System of an Internet Company:

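Log processing pipeline:

App servers → Filebeat → Logstash → Elasticsearch → Kibana
     ↓           ↓          ↓            ↓            ↓
 Emit logs    Collect   Transform   Store/index   Visualize

Scale:
• Log volume: 100 GB per day
• Log entries: 1 billion per day
• Query latency: milliseconds
• Retention: 30 days as hot data, 1 year as cold data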

7. Performance Optimization Best Practices

7.1 Index Design Optimization

// Optimized mapping design (the ik_* and pinyin analyzers require the IK and pinyin plugins)
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "fields": {
          "keyword": {
            "type": "keyword"  // supports exact matching
          },
          "pinyin": {
            "type": "text",
            "analyzer": "pinyin"  // supports pinyin search
          }
        }
      },
      "price": {
        "type": "scaled_float",
        "scaling_factor": 100  // stores the price as a scaled integer with two decimal places
      },
      "category": {
        "type": "keyword"  // categories do not need word segmentation
      },
      "description": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "created_time": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
      }
    }
  }
}

7.2 Query Performance Optimization Techniques

  1. Use Filter Instead of Query (when scoring is not needed)
    ```json
    // Before optimization: using query (calculates score)
    {"query": {"term": {"status": "active"}}}

    // After optimization: using filter (does not calculate score, can be cached)
    {"query": {"bool": {"filter": {"term": {"status": "active"}}}}}
    ```

  2. Set Shard Count Appropriately

    Shard count formula reference:
    Shard count = Data size (GB) / 30GB

    Example:
    - 100GB data: 3-4 shards
    - 1TB data: 35-40 shards

  3. Batch Operation Optimization
    // Use bulk API for batch indexing
    POST _bulk
    {"index": {"_index": "products", "_id": 1}}
    {"name": "iPhone", "price": 5999}
    {"index": {"_index": "products", "_id": 2}}
    {"name": "小米", "price": 2999}
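
The bulk request body is newline-delimited JSON: one action line followed by one source line per document, with a newline after the last line. A small helper for building such a body could look like this; the function name is hypothetical, and sending the body to the cluster (with the `application/x-ndjson` content type) is left out.

```python
import json

def build_bulk_body(index_name, docs):
    """Serialize (doc_id, source) pairs as NDJSON: an action line, then a source line."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline

body = build_bulk_body("products", [
    (1, {"name": "iPhone", "price": 5999}),
    (2, {"name": "小米", "price": 2999}),
])
```

Batching many documents into one request avoids paying the network and request overhead once per document.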

8. Elasticsearch vs Traditional Databases

8.1 Comparison of Applicable Scenarios

| Scenario | MySQL | Elasticsearch | Recommended Choice |
|---|---|---|---|
| Full-Text Search | ❌ Poor | ✅ Excellent | ES |
| Transaction Support | ✅ Full ACID | ❌ No Transactions | MySQL |
| Real-time Statistical Analysis | ⚠️ Average | ✅ Excellent | ES |
| Relational Queries | ✅ Excellent | ❌ Limited | MySQL |
| Geospatial Search | ❌ Poor | ✅ Excellent | ES |
| Log Analysis | ❌ Unsuitable | ✅ Specialty | ES |
| Precise Numerical Calculation | ✅ Precise | ⚠️ Approximate | MySQL |

8.2 Hybrid Architecture Solution

Recommended hybrid architecture:

      User request
           ↓
    ┌──────────────┐
    │  App layer   │
    └──────────────┘
           ↓
    ┌──────────────────────────┐
    │   Search requests → ES   │
    │   Transactions → MySQL   │
    │   Caching → Redis        │
    └──────────────────────────┘

Data synchronization:
MySQL (writes) → Binlog → Canal/Debezium → Kafka → ES (reads)

9. Common Problems and Solutions

9.1 Data Consistency Issues

Problem: Inconsistency between MySQL and ES data.

Solutions:
1. Dual-write Strategy: Write to both MySQL and ES simultaneously, using a message queue to ensure eventual consistency.
2. CDC (Change Data Capture): Real-time synchronization via Binlog.
3. Regular Verification: Scheduled tasks to compare data differences and fix them.
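
The regular verification step (solution 3) can be sketched as a comparison of two snapshots that map each record ID to a version or checksum. How those snapshots are read from MySQL and ES is not shown; the function name, the snapshot shape, and the sample data are assumptions for illustration.

```python
def diff_stores(mysql_rows, es_docs):
    """Compare {id: version} snapshots from both stores and report what needs repair."""
    missing_in_es = set(mysql_rows) - set(es_docs)
    orphaned_in_es = set(es_docs) - set(mysql_rows)
    stale_in_es = {
        doc_id for doc_id in set(mysql_rows) & set(es_docs)
        if mysql_rows[doc_id] != es_docs[doc_id]
    }
    return {
        "reindex": sorted(missing_in_es | stale_in_es),  # copy again from MySQL
        "delete": sorted(orphaned_in_es),                # no longer exists in MySQL
    }

mysql_rows = {1: "v3", 2: "v1", 3: "v2"}
es_docs = {1: "v3", 2: "v0", 4: "v1"}
plan = diff_stores(mysql_rows, es_docs)
```

MySQL is treated as the source of truth here, so every difference is resolved by changing the ES side.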

9.2 Deep Paging Issues

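Problem: Deep pages are very slow with from + size paging. To serve page 10,000, every shard has to collect and sort from + size hits before the coordinating node merges them, and by default Elasticsearch rejects requests where from + size exceeds 10,000 (index.max_result_window).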

Solutions:

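// 1. Use search_after (recommended)
// Sort on a field with unique values so the cursor is unambiguous;
// "product_id" is a hypothetical numeric field used for illustration
{
  "size": 10,
  "sort": [{"product_id": "asc"}],
  "search_after": [10000]  // The sort value of the last document on the previous page
}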

// 2. Use scroll API (suitable for exporting)
POST /products/_search?scroll=1m
{
  "size": 100,
  "query": {"match_all": {}}
}
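
The idea behind search_after is cursor-based paging: instead of skipping N results, each request continues from the sort value of the last hit already seen. The sketch below models that with a sorted list of unique sort values; it is an in-memory illustration with a hypothetical function name, not a client for the Elasticsearch API.

```python
import bisect

def search_after_page(sorted_ids, size, after=None):
    """Return one page plus the cursor (last sort value) to pass to the next call."""
    # Seek directly to the first value greater than the cursor; nothing is skipped by counting.
    start = 0 if after is None else bisect.bisect_right(sorted_ids, after)
    page = sorted_ids[start:start + size]
    cursor = page[-1] if page else None
    return page, cursor

ids = list(range(1, 26))  # 25 documents with unique sort values 1..25
page1, cursor = search_after_page(ids, 10)
page2, cursor = search_after_page(ids, 10, after=cursor)
page3, cursor = search_after_page(ids, 10, after=cursor)
```

Each call does the same amount of work no matter how deep the page is, which is why this approach scales where from + size does not.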

10. Summary

Elasticsearch addresses many of the problems that traditional databases face in search scenarios through its inverted index, distributed architecture, and full-text search capabilities. Proper use of Elasticsearch can:

  1. Improve search performance: From seconds to milliseconds.
  2. Enhance search quality: Through relevance scoring and smart word segmentation.
  3. Support complex analysis: Real-time aggregation and statistical analysis.
  4. Reduce operational costs: Fewer servers, higher efficiency.

However, it is important to note that Elasticsearch is not a replacement for MySQL, but rather a complement. In actual projects, the appropriate storage solution should be chosen based on specific scenarios, and a hybrid architecture of MySQL + Elasticsearch can usually leverage their respective advantages.

Theme test article, for testing purposes only. Published by Walker. When reposting, please credit the source: https://walker-learn.xyz/archives/4782
