Other Features
- User Center
- Favorites
- Manage Shipping Addresses (CRUD)
- Messages
inventory_srv-->userop_srvquery and replace allinventory
Elasticsearch In-depth Analysis Document
1. What is Elasticsearch
Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene, capable of rapidly storing, searching, and analyzing massive amounts of data. It is a core component of the Elastic Stack (formerly ELK Stack).
2. Problems Faced by MySQL Search - In-depth Analysis
2.1 Detailed Explanation of Performance Issues
Problem Description:
-- With 1 million rows, the following query can take several seconds
SELECT * FROM products WHERE name LIKE '%手机%' OR description LIKE '%手机%';
Performance Comparison Data:
| Data Volume | MySQL LIKE Query | Elasticsearch Full-Text Search | Performance Improvement |
|---|---|---|---|
| 10,000 rows | 50ms | 10ms | 5x |
| 100,000 rows | 500ms | 15ms | 33x |
| 1 million rows | 5000ms | 20ms | 250x |
| 10 million rows | 50000ms+ | 30ms | 1600x+ |
Root Causes:
- Full Table Scan: LIKE '%keyword%' starts with a wildcard, so it cannot use a B+ tree index and must scan every row
- I/O Intensive: Each query requires reading a large amount of data from disk
- CPU Intensive: Performs string matching operations on every row of data
- Memory Pressure: Large amounts of data loaded into memory for processing
Real-world Case:
An e-commerce platform with 5 million rows in its product table used MySQL fuzzy search to find "Apple phone":
- Query time: 8.3 seconds
- CPU utilization: spiked to 85%
- With 10 concurrent queries, response time rose to over 30 seconds
2.2 Detailed Explanation of Lack of Relevance Ranking
Pain Points of MySQL Query Results:
-- MySQL can only sort by fixed rules
SELECT * FROM products
WHERE name LIKE '%手机%'
ORDER BY price DESC; -- Can only sort by columns such as price or time
Elasticsearch's Relevance Scoring Mechanism:
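Search term: "小米手机" (Xiaomi phone)
Relevance score calculation:
┌────────────────────────────────────────────────────────────┐
│ Document 1: "小米手机12 Pro"                               │
│ • Term frequency (TF): both keywords appear                │
│ • Inverse document frequency (IDF): rarity of each term    │
│ • Field length: short title, higher weight                 │
│ • Score: 9.8                                               │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│ Document 2: "这是一款性价比很高的手机"                     │
│ • Term frequency (TF): only "手机" appears                 │
│ • Inverse document frequency (IDF): "手机" is common       │
│ • Field length: long description, lower weight             │
│ • Score: 3.2                                               │
└────────────────────────────────────────────────────────────┘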
Detailed Explanation of Relevance Factors:
- TF (Term Frequency): Frequency of keywords appearing in the document
- IDF (Inverse Document Frequency): Rarity of keywords across all documents
- Field Length Normalization: Matches in shorter fields have higher weight than in longer fields
- Field Weight Boost: Can set title to be more important than content
- Query-time Weight: Can specify certain query terms as more important
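To see how term frequency, inverse document frequency, and field length interact, here is a minimal sketch of the Okapi BM25 formula, which Elasticsearch has used as its default similarity since version 5.0. The toy corpus is pre-tokenized by hand, the function name `bm25_score` is hypothetical, and the parameters `k1=1.2` and `b=0.75` are the commonly cited defaults. Lucene's real implementation differs in details, so the numbers will not match the illustrative scores above.

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with Okapi BM25."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for term in query_terms:
        # Document frequency: how many documents contain the term
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        # Rare terms get a higher IDF than common ones
        idf = math.log(1 + (len(corpus) - df + 0.5) / (df + 0.5))
        tf = doc.count(term)
        # Length normalization: longer documents are penalized
        norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score

# Hand-tokenized toy corpus (illustrative data only)
corpus = [
    ["小米", "手机", "12", "Pro"],
    ["这是", "一款", "性价比", "很高", "的", "手机"],
    ["小米", "电视"],
]
query = ["小米", "手机"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
for doc, s in zip(corpus, scores):
    print(round(s, 3), doc)
```

The first document matches both terms and ranks highest. The short "小米电视" document outranks the long description even though each matches only one term, which is the field length effect at work.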
2.3 Detailed Explanation of Inability to Perform Full-Text Search
Limitations of MySQL Full-Text Index:
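-- Create a MySQL full-text index
ALTER TABLE products ADD FULLTEXT(name, description);
-- Issue 1: minimum token length (innodb_ft_min_token_size defaults to 3; ft_min_word_len defaults to 4 for MyISAM)
-- Tokens shorter than the minimum are not indexed, so a single character such as "机" cannot be found
-- Issue 2: poor Chinese word segmentation
-- The default parser splits on whitespace and punctuation, so "苹果手机" is indexed as a single token and searching for "苹果" finds nothing
-- (CJK text needs the ngram parser: ADD FULLTEXT(...) WITH PARSER ngram)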
Elasticsearch Full-Text Search Capabilities:
// ES analysis process example
Input text: "我想买一台苹果手机" (I want to buy an Apple phone)
Tokens:
[我] [想] [买] [一台] [苹果] [手机] [苹果手机]
Synonym expansion:
[苹果] → [Apple, iPhone]
[手机] → [手机, 电话, mobile]
Spelling correction:
"苹果手击" → suggests "苹果手机"
2.4 Detailed Explanation of Inaccurate Search and Lack of Word Segmentation
Problems with MySQL String Matching:
-- Search for "笔记本" (notebook/laptop)
SELECT * FROM products WHERE name LIKE '%笔记本%';
-- Result: finds "笔记本电脑" (laptop computer)
-- Problem: misses "笔记 本子", "notebook", and "手提电脑" (portable computer)
Elasticsearch Smart Word Segmentation Process:
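Original text: "ThinkPad X1 Carbon超轻薄笔记本电脑"
Standard analyzer (lowercases, splits Chinese into single characters):
[thinkpad] [x1] [carbon] [超] [轻] [薄] [笔] [记] [本] [电] [脑]
IK analyzer (Chinese):
[ThinkPad] [X1] [Carbon] [超] [轻薄] [超轻薄]
[笔记] [本] [笔记本] [电脑] [笔记本电脑]
Pinyin analyzer:
[si] [kao] [pad] → enables search by Pinyin
N-gram tokenizer:
[Thi] [hin] [ink] [nkP] → supports partial matching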
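The N-gram line above can be reproduced in a few lines. This is a minimal sketch of character n-gram tokenization only. In Elasticsearch the same idea is provided by the built-in `ngram` tokenizer (configured with `min_gram` and `max_gram`). The helper names `char_ngrams` and `partial_match` are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return every contiguous substring of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def partial_match(query, indexed_text, n=3):
    """A query matches if all of its n-grams were indexed for the text."""
    indexed = set(char_ngrams(indexed_text, n))
    grams = char_ngrams(query, n)
    return bool(grams) and all(g in indexed for g in grams)

tokens = char_ngrams("ThinkPad", 3)
print(tokens)  # ['Thi', 'hin', 'ink', 'nkP', 'kPa', 'Pad']
print(partial_match("inkPa", "ThinkPad"))  # True
```

The trade-off is index size: every extra n-gram is another term in the inverted index, so n-grams are usually applied only to fields that need substring matching.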
3. What is Full-Text Search - Core Principle Analysis
3.1 Structured Data vs Unstructured Data
Structured Data (MySQL storage method):
┌──────┬────────┬────────┬────────┐
│ ID │ Name │ Price │ Stock │
├──────┼────────┼────────┼────────┤
│ 1 │iPhone │ 5999 │ 100 │
│ 2 │ 小米 │ 2999 │ 200 │
└──────┴────────┴────────┴────────┘
Unstructured Data (Text content):
"This iPhone uses an A15 processor, with powerful performance,
excellent camera effects, and 20% improved battery life,
User review: 'Awesome, great value for money!'"
3.2 Detailed Explanation of Inverted Index Principle
Forward Index (MySQL):
Document ID → Content
Doc1 → "小米手机"
Doc2 → "苹果手机"
Doc3 → "小米电视"
Inverted Index (Elasticsearch):
Term → Document List
"小米" → [Doc1, Doc3]
"手机" → [Doc1, Doc2]
"苹果" → [Doc2]
"电视" → [Doc3]
Search "小米手机" (Xiaomi phone):
1. Look up "小米" → [Doc1, Doc3]
2. Look up "手机" → [Doc1, Doc2]
3. Intersect the two lists → Doc1 (contains both terms, so it is the most relevant)
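The lookup-and-intersect steps can be sketched with plain dictionaries and sets. This is an illustration of the idea, assuming documents are already tokenized. It is not how Lucene stores postings on disk, and the function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def search_all(index, terms):
    """Return the documents that contain every query term (AND semantics)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    "Doc1": ["小米", "手机"],
    "Doc2": ["苹果", "手机"],
    "Doc3": ["小米", "电视"],
}
index = build_inverted_index(docs)
result = search_all(index, ["小米", "手机"])
print(result)  # {'Doc1'}
```

Each term lookup is a single dictionary access, so query cost depends on the length of the posting lists rather than on the total number of rows, which is the key difference from a `LIKE '%...%'` scan.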
3.3 Detailed Structure of Inverted Index
Complete Inverted Index Structure:
Term: "手机"
├── Document Frequency (DF): 1000 documents contain this term
├── Inverted List:
│ ├── Doc1:
│ │ ├── Term Frequency (TF): 3 times
│ │ ├── Positions: [5, 28, 102]
│ │ └── Fields: [title, description]
│ ├── Doc2:
│ │ ├── Term Frequency (TF): 1 time
│ │ ├── Positions: [15]
│ │ └── Fields: [title]
│ └── ...
└── Statistics: Highest term frequency, average term frequency, etc.
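A minimal sketch of the structure above, assuming one field per document for brevity (the per-field list is omitted). The `Posting` class and `build_positional_index` function are hypothetical names. Document frequency and term frequency fall out of the data structure rather than being stored separately.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Posting:
    """One entry in a term's inverted list: where the term occurs in a document."""
    positions: list = field(default_factory=list)

    @property
    def tf(self):
        # Term frequency is simply the number of recorded positions
        return len(self.positions)

def build_positional_index(docs):
    """Map term -> {doc_id: Posting}, recording token positions."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token].setdefault(doc_id, Posting()).positions.append(pos)
    return index

docs = {
    "Doc1": ["手机", "壳", "适配", "手机"],
    "Doc2": ["小米", "手机"],
}
index = build_positional_index(docs)
df = len(index["手机"])  # document frequency: number of docs with the term
print(df, index["手机"]["Doc1"].tf, index["手机"]["Doc1"].positions)
```

Positions are what make phrase queries possible: two terms form a phrase only when their positions in the same document are adjacent.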
4. Elasticsearch Architecture Explained
4.1 Cluster Architecture
Elasticsearch Cluster Architecture Diagram:
┌─────────────── ES Cluster ──────────────┐
│ │
│ ┌─────────────────────────────────┐ │
│ │ Master Node │ │
│ │ • Cluster management │ │
│ │ • Index creation/deletion │ │
│ │ • Shard allocation │ │
│ └─────────────────────────────────┘ │
│ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Data │ │ Data │ │
│ │ Node 1 │ │ Node 2 │ │
│ │ ┌──────┐ │ │ ┌──────┐ │ │
│ │ │ P0 │ │ │ │ R0 │ │ │
│ │ ├──────┤ │ │ ├──────┤ │ │
│ │ │ R1 │ │ │ │ P1 │ │ │
│ │ └──────┘ │ │ └──────┘ │ │
│ └──────────┘ └──────────┘ │
│ │
│ P = Primary Shard │
│ R = Replica Shard │
└─────────────────────────────────────────┘
4.2 Data Write Process
Detailed Write Process:
Client → Coordinating Node → Primary Shard → Replica Shard
1. Client sends write request
↓
2. Coordinating node determines shard via hash routing
↓
3. Request forwarded to primary shard node
↓
4. Primary shard writes successfully
↓
5. Replicated to replica shards in parallel
↓
6. All replicas confirm
↓
7. Returns success response to client
Timeline:
T0 ──────→ T1 ────→ T2 ────────────→ T3 ──────→ T4
Receive    Route    Primary Shard    Replica    Respond
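Step 2 (hash routing) can be illustrated as follows. In its simplest form, Elasticsearch's routing rule is `shard = hash(_routing) % number_of_primary_shards`, where `_routing` defaults to the document ID. The sketch below uses `zlib.crc32` purely as a stand-in hash. Elasticsearch itself uses a Murmur3 hash and an additional routing factor, so the shard numbers here will not match a real cluster.

```python
import zlib

def route_to_shard(routing_key: str, num_primary_shards: int) -> int:
    """Pick a primary shard for a document (stand-in hash, not Murmur3)."""
    digest = zlib.crc32(routing_key.encode("utf-8"))
    return digest % num_primary_shards

# The same key always lands on the same shard
for doc_id in ["1", "2", "42", "product-1001"]:
    print(doc_id, "->", route_to_shard(doc_id, num_primary_shards=2))

# Changing the shard count changes the result of the modulo for many keys
moved = sum(
    route_to_shard(str(i), 2) != route_to_shard(str(i), 3) for i in range(1000)
)
print(f"{moved} of 1000 keys map to a different shard with 3 primaries")
```

Because the shard is derived from a modulo over the primary shard count, that count is fixed when the index is created. Changing it requires reindexing or the split/shrink APIs.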
4.3 Query Process
Query Execution Process:
Phase 1: Query
┌────────────────────────────────────────────────────────────────┐
│ Coordinating node sends query requests to all shards           │
│ Each shard returns Top N document IDs and scores               │
└────────────────────────────────────────────────────────────────┘
↓
Phase 2: Fetch
┌────────────────────────────────────────────────────────────────┐
│ Coordinating node aggregates and sorts all results             │
│ Retrieves the complete content of the final required documents │
└────────────────────────────────────────────────────────────────┘
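The two phases can be simulated in memory. This is a sketch under simplifying assumptions: each "shard" is a dictionary of `doc_id -> (score, source)` with scores already computed, and the titles are toy data. The point is that only IDs and scores travel in the query phase, and full documents are fetched only for the global winners.

```python
import heapq

def query_phase(shard, top_n):
    """Each shard returns only (score, doc_id) pairs for its local top N."""
    pairs = ((score, doc_id) for doc_id, (score, _source) in shard.items())
    return heapq.nlargest(top_n, pairs)

def search(shards, top_n):
    # Phase 1: scatter the query and gather lightweight candidates
    candidates = []
    for shard_id, shard in enumerate(shards):
        for score, doc_id in query_phase(shard, top_n):
            candidates.append((score, doc_id, shard_id))
    # The coordinating node merges and keeps the global top N
    winners = heapq.nlargest(top_n, candidates)
    # Phase 2: fetch full documents only for the winners
    return [
        (doc_id, score, shards[shard_id][doc_id][1])
        for score, doc_id, shard_id in winners
    ]

shards = [
    {"d1": (9.8, "小米手机12 Pro"), "d2": (3.2, "这是一款性价比很高的手机")},
    {"d3": (7.5, "小米手机 Note"), "d4": (1.1, "手机壳")},
]
hits = search(shards, top_n=2)
for doc_id, score, source in hits:
    print(doc_id, score, source)
```

With S shards and page size N, the coordinating node handles up to S × N candidates, which is also why deep pagination with a large offset becomes expensive.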
5. Elasticsearch Core Features Explained
5.1 Detailed Explanation of Query Types
// 1. Match Query - Full-Text Search
{
  "query": {
    "match": {
      "title": {
        "query": "苹果手机",
        "operator": "and" // Must contain all terms
      }
    }
  }
}
// 2. Term Query - Exact Match
{
  "query": {
    "term": {
      "category.keyword": "手机" // No tokenization, exact match
    }
  }
}
// 3. Range Query - Range Search
{
  "query": {
    "range": {
      "price": {
        "gte": 1000,
        "lte": 5000
      }
    }
  }
}
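// 4. Bool Compound Query
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "手机"}} // Required, contributes to the score
      ],
      "filter": [
        {"range": {"price": {"lte": 5000}}} // Required, not scored, cacheable
      ],
      "should": [
        {"match": {"brand": "苹果"}} // Optional, raises the score when it matches
      ],
      "must_not": [
        {"term": {"status": "discontinued"}} // Excludes matching documents
      ]
    }
  }
}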
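In application code, a bool query like the one above is usually assembled from optional user inputs. The following is a minimal sketch that only builds the request body as a Python dictionary. It does not call Elasticsearch, and the function name `product_search_body` and its parameters are hypothetical. The field names (`title`, `price`, `brand`, `status`) are taken from the example above.

```python
import json

def product_search_body(keyword, max_price=None, preferred_brand=None,
                        exclude_status=None):
    """Build a bool query body; clauses are added only when a value is given."""
    bool_query = {"must": [{"match": {"title": keyword}}]}
    if max_price is not None:
        # filter: must match, but is not scored and can be cached
        bool_query["filter"] = [{"range": {"price": {"lte": max_price}}}]
    if preferred_brand is not None:
        # should: optional, boosts the score of matching documents
        bool_query["should"] = [{"match": {"brand": preferred_brand}}]
    if exclude_status is not None:
        # must_not: excludes matching documents
        bool_query["must_not"] = [{"term": {"status": exclude_status}}]
    return {"query": {"bool": bool_query}}

body = product_search_body(
    "手机", max_price=5000, preferred_brand="苹果", exclude_status="discontinued"
)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The resulting dictionary can be passed as the request body to whichever client the project uses.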
5.2 Aggregation Analysis Function
// Sales Data Analysis Example
{
  "aggs": {
    "sales_per_category": {
      "terms": {
        "field": "category"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "total_sales": {
          "sum": {
            "field": "sales_count"
          }
        },
        "price_ranges": {
          "range": {
            "field": "price",
            "ranges": [
              {"to": 1000},
              {"from": 1000, "to": 5000},
              {"from": 5000}
            ]
          }
        }
      }
    }
  }
}
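To make the semantics of the nested aggregation concrete, the sketch below computes the same buckets over a small in-memory list. It is an emulation for illustration, not an Elasticsearch call. The sample documents and the category values are made up, and the function names are hypothetical. As in the `range` aggregation, `from` is inclusive and `to` is exclusive.

```python
from collections import defaultdict

# (from, to) pairs mirroring the "ranges" in the request; None means unbounded
PRICE_RANGES = [(None, 1000), (1000, 5000), (5000, None)]

def in_range(price, lo, hi):
    """from is inclusive, to is exclusive."""
    return (lo is None or price >= lo) and (hi is None or price < hi)

def sales_per_category(docs):
    # terms aggregation: one bucket per distinct category value
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc["category"]].append(doc)
    result = {}
    for category, items in buckets.items():
        result[category] = {
            "doc_count": len(items),
            "avg_price": sum(d["price"] for d in items) / len(items),
            "total_sales": sum(d["sales_count"] for d in items),
            "price_ranges": [
                sum(1 for d in items if in_range(d["price"], lo, hi))
                for lo, hi in PRICE_RANGES
            ],
        }
    return result

docs = [
    {"category": "phone", "price": 5999, "sales_count": 10},
    {"category": "phone", "price": 2999, "sales_count": 30},
    {"category": "tv", "price": 3999, "sales_count": 5},
]
report = sales_per_category(docs)
print(report["phone"])
```

Each sub-aggregation is evaluated inside its parent bucket, which is why `avg_price` and `total_sales` are reported per category rather than globally.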
6. Real-world Application Case Studies
6.1 E-commerce Search Optimization Case Study
An e-commerce platform's search, before and after optimization:
| Metric | MySQL Solution | Elasticsearch Solution | Improvement Effect |
|---|---|---|---|
| Average Search Time | 2.3 seconds | 0.05 seconds | 46x improvement |
| Search Accuracy | 65% | 92% | +27 percentage points |
| Zero Result Rate | 18% | 3% | -15 percentage points |
| Number of Servers | 8 servers | 3 servers | 62.5% fewer servers |
| Concurrency Capability | 100 QPS | 5000 QPS | 50x improvement |
Implementation Details:
- Data Synchronization Architecture:
  MySQL (primary data) → Binlog → Logstash → Elasticsearch
  ↓
  Scheduled full synchronization (nightly)
- Search Optimization Strategies:
  - Pinyin Search: Supports searching for "pinguo" to find "苹果" (Apple)
  - Synonyms: Configured "手机" (phone), "电话" (telephone), "mobile" as synonyms
  - Search Suggestions: Suggests possible search terms to users in real time
  - Correction Function: Automatically corrects common spelling errors
6.2 Log Analysis System Case Study
An internet company's log analysis system:
Log Processing Flow:
Application Server (generates logs)
  → Filebeat (collects)
  → Logstash (processes and transforms)
  → Elasticsearch (stores and indexes)
  → Kibana (visualizes and displays)
Processing Scale:
• Log volume: 100GB per day
• Number of log entries: 1 billion entries/day
• Query response: Millisecond level
• Retention period: 30 days hot data, 1 year cold data
7. Performance Optimization Best Practices
7.1 Index Design Optimization
// Optimized Mapping Design
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "fields": {
          "keyword": {
            "type": "keyword" // Supports exact matching
          },
          "pinyin": {
            "type": "text",
            "analyzer": "pinyin" // Supports Pinyin search
          }
        }
      },
      "price": {
        "type": "scaled_float",
        "scaling_factor": 100 // Price precision optimization
      },
      "category": {
        "type": "keyword" // Category does not require tokenization
      },
      "description": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "created_time": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
      }
    }
  }
}
7.2 Query Performance Optimization Tips
- 1. Use Filter instead of Query (when scoring is not needed)
```json
// Before optimization: using query (calculates score)
{"query": {"term": {"status": "active"}}}
// After optimization: using filter (does not calculate score, can be cached)
{"query": {"bool": {"filter": {"term": {"status": "active"}}}}}
```
- 2. Reasonably set shard count
```
Shard count reference formula:
Shard count = Data volume (GB) / 30GB
Example:
- 100GB data: 3-4 shards
- 1TB data: 35-40 shards
```
- 3. Batch operation optimization
```json
// Use bulk API for batch indexing
POST _bulk
{"index": {"_index": "products", "_id": 1}}
{"name": "iPhone", "price": 5999}
{"index": {"_index": "products", "_id": 2}}
{"name": "小米", "price": 2999}
```
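Tips 2 and 3 lend themselves to small helpers. The sketch below applies the 30GB-per-shard reference formula from tip 2, rounding up, and builds a `_bulk` request body in the newline-delimited JSON format shown in tip 3. The function names are hypothetical, and nothing here talks to a cluster. The one hard requirement captured in the code is that the bulk body must end with a newline.

```python
import json
import math

def estimate_shard_count(data_gb, target_shard_gb=30):
    """Reference formula from the text: data volume / 30GB, rounded up."""
    return max(1, math.ceil(data_gb / target_shard_gb))

def build_bulk_body(index_name, docs):
    """Build an NDJSON body: one action line followed by one source line per doc."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk API requires the body to end with a newline
    return "\n".join(lines) + "\n"

print(estimate_shard_count(100))   # 4
print(estimate_shard_count(1024))  # 35

bulk_body = build_bulk_body("products", [
    (1, {"name": "iPhone", "price": 5999}),
    (2, {"name": "小米", "price": 2999}),
])
print(bulk_body, end="")
```

The body would be sent with the `Content-Type: application/x-ndjson` header. Batching many documents per request avoids paying the per-request overhead for every document.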
8. Elasticsearch vs Traditional Databases
8.1 Applicable Scenarios Comparison
| Scenario | MySQL | Elasticsearch | Recommended Choice |
|---|---|---|---|
| Full-Text Search | ❌ Poor | ✅ Excellent | ES |
| Transaction Support | ✅ Full ACID | ❌ No transactions | MySQL |
| Real-time Statistical Analysis | ⚠️ Average | ✅ Excellent | ES |
| Relational Queries | ✅ Excellent | ❌ Limited | MySQL |
| Geolocation Search | ❌ Poor | ✅ Excellent | ES |
| Log Analysis | ❌ Not suitable | ✅ Specialty | ES |
| Precise Numerical Calculation | ✅ Precise | ⚠️ Approximate | MySQL |
8.2 Hybrid Architecture Solution
Recommended Hybrid Architecture:
User Request
↓
┌──────────────┐
│ Application Layer │
└──────────────┘
↓
┌──────────────────────────┐
│ Search Requests → ES │
│ Transaction Operations → MySQL │
│ Cache → Redis │
└──────────────────────────┘
Data Synchronization:
MySQL(Write) → Binlog → Canal/Debezium → Kafka → ES(Read)
9. Common Problems and Solutions
9.1 Data Consistency Issues
Problem: MySQL and ES data inconsistency
Solutions:
1. Dual-write strategy: Write to MySQL and ES simultaneously, use message queues to ensure eventual consistency
2. CDC (Change Data Capture): Real-time synchronization via Binlog
3. Regular verification: Scheduled tasks compare data differences and fix them
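The regular verification step (solution 3) boils down to a keyed diff between the two stores. A minimal sketch, assuming both sides have already been loaded into dictionaries keyed by primary key. In practice the comparison would run in batches and usually on a version or `updated_at` column rather than on full rows. The function name `diff_stores` is hypothetical.

```python
def diff_stores(mysql_rows, es_docs):
    """Compare two {id: record} snapshots and report what needs fixing."""
    mysql_ids, es_ids = mysql_rows.keys(), es_docs.keys()
    return {
        # In MySQL but never indexed: index them
        "missing_in_es": sorted(mysql_ids - es_ids),
        # In both but different: reindex from MySQL, the source of truth
        "stale_in_es": sorted(
            k for k in mysql_ids & es_ids if mysql_rows[k] != es_docs[k]
        ),
        # In ES but deleted from MySQL: delete from the index
        "orphaned_in_es": sorted(es_ids - mysql_ids),
    }

mysql_rows = {
    1: {"name": "iPhone", "price": 5999},
    2: {"name": "小米", "price": 2799},
    3: {"name": "ThinkPad", "price": 9999},
}
es_docs = {
    1: {"name": "iPhone", "price": 5999},
    2: {"name": "小米", "price": 2999},   # stale price
    4: {"name": "deleted item", "price": 1},
}
report = diff_stores(mysql_rows, es_docs)
print(report)
```

Treating MySQL as the source of truth keeps the repair direction unambiguous: every discrepancy is resolved by rewriting or deleting the Elasticsearch copy.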
9.2 Deep Paging Problem
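Problem: Deep pagination with from + size performs extremely poorly (for example, querying the 10,000th page), because every shard must collect and sort from + size hits. By default, requests that reach past 10,000 hits (index.max_result_window) are rejected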
Solutions:
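// 1. Use search_after (recommended)
// The sort must end in a unique tiebreaker. Sorting on _id is discouraged;
// copying the ID into a field with doc_values is the usual alternative.
{
  "size": 10,
  "sort": [{"_id": "asc"}],
  "search_after": ["10000"] // Sort values of the last document on the previous page (_id is a string)
}
// 2. Use scroll API (suitable for export)
POST /products/_search?scroll=1m
{
  "size": 100,
  "query": {"match_all": {}}
}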
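The reason `search_after` scales is that each request seeks directly to a position in the sort order instead of skipping `from` hits. The sketch below emulates that cursor loop over an in-memory sorted list of unique sort keys. It does not call Elasticsearch, and `search_page` is a hypothetical stand-in for one search request.

```python
import bisect

def search_page(sorted_keys, size, search_after=None):
    """Return the next `size` keys strictly after the cursor (or the first page)."""
    if search_after is None:
        start = 0
    else:
        # Seek to the first key greater than the cursor: O(log n), no skipping
        start = bisect.bisect_right(sorted_keys, search_after)
    return sorted_keys[start:start + size]

def iterate_all(sorted_keys, size):
    """Walk every page by feeding the last sort value back in as the cursor."""
    cursor = None
    while True:
        page = search_page(sorted_keys, size, cursor)
        if not page:
            break
        yield page
        cursor = page[-1]

keys = list(range(1, 26))  # 25 documents with unique sort values
pages = list(iterate_all(keys, size=10))
print([len(p) for p in pages])  # [10, 10, 5]
```

The cursor only works if the sort values are unique, which is why a tiebreaker field belongs at the end of the sort. It also means pages can only be walked forward, not jumped to by number.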
10. Summary
Through its inverted index, distributed architecture, and full-text search capabilities, Elasticsearch addresses many of the problems traditional databases face in search scenarios. Used properly, Elasticsearch can:
- 1. Improve search performance: From seconds to milliseconds
- 2. Enhance search quality: Through relevance scoring and smart tokenization
- 3. Support complex analysis: Real-time aggregation and statistical analysis
- 4. Reduce operational costs: Fewer servers, higher efficiency
However, it's important to note that Elasticsearch is not a replacement for MySQL, but rather a complement. In actual projects, the appropriate storage solution should be chosen based on specific scenarios, and typically a hybrid architecture of MySQL+Elasticsearch can leverage their respective strengths.
Theme test article, for testing purposes only. Published by: Walker. When reposting, please credit the source: https://walker-learn.xyz/archives/6756