Performance Optimization and pprof
1. Measure Before Optimizing
"Premature optimization is the root of all evil." — Donald Knuth
Optimization process:
1. Write correct code first
2. Use Benchmark to identify performance bottlenecks
3. Use pprof to pinpoint the exact location
4. Optimize → Measure again → Compare
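Step 2 of the loop above relies on Go's built-in benchmarks. A minimal self-contained sketch (processData is an illustrative stand-in for your own code); it uses testing.Benchmark so it runs as a plain program, though in practice you would write a BenchmarkXxx function in a _test.go file:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// processData stands in for the code under test (name is illustrative).
func processData(parts []string) string {
	return strings.Join(parts, ",")
}

func main() {
	// testing.Benchmark runs a benchmark function outside `go test`.
	res := testing.Benchmark(func(b *testing.B) {
		parts := []string{"a", "b", "c"}
		b.ReportAllocs() // include allocs/op and B/op in the result
		for i := 0; i < b.N; i++ {
			processData(parts)
		}
	})
	fmt.Println(res.String())
}
```

Once a benchmark like this shows a regression or a hotspot, steps 3–4 hand the investigation over to pprof.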
2. pprof Tool
2.1 Integration in HTTP Services
import (
    "net/http"
    _ "net/http/pprof" // importing is enough: handlers register on DefaultServeMux
)

func main() {
    // If you already serve HTTP via DefaultServeMux, pprof is registered automatically
    http.ListenAndServe(":8080", nil)
}

// With frameworks such as gin/echo, start a separate pprof server
func main() {
    go func() {
        http.ListenAndServe(":6060", nil) // dedicated pprof port
    }()
    // start the main service...
}
Visit http://localhost:6060/debug/pprof/ to view the overview.
2.2 Usage in Non-HTTP Programs
import (
    "os"
    "runtime/pprof"
)

func main() {
    // CPU profile
    cpuFile, _ := os.Create("cpu.prof")
    defer cpuFile.Close()
    pprof.StartCPUProfile(cpuFile)

    // business logic...
    doWork()
    pprof.StopCPUProfile()

    // Heap profile
    heapFile, _ := os.Create("heap.prof")
    defer heapFile.Close()
    pprof.WriteHeapProfile(heapFile)
}
2.3 Profile Types
| Profile | Description | HTTP Path |
|---|---|---|
| CPU | CPU usage hotspots | /debug/pprof/profile?seconds=30 |
| Heap | Heap memory allocation | /debug/pprof/heap |
| Allocs | Cumulative memory allocation | /debug/pprof/allocs |
| Goroutine | Goroutine stack | /debug/pprof/goroutine |
| Block | Blocking waits | /debug/pprof/block |
| Mutex | Lock contention | /debug/pprof/mutex |
| Threadcreate | Thread creation | /debug/pprof/threadcreate |
# Collect 30-second CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
# Collect heap memory
go tool pprof http://localhost:6060/debug/pprof/heap
# View goroutines (detect leaks)
go tool pprof http://localhost:6060/debug/pprof/goroutine
# Block and mutex profiles stay empty unless you enable them in code first:
runtime.SetBlockProfileRate(1)     // record every blocking event
runtime.SetMutexProfileFraction(1) // sample every mutex contention event
3. pprof Visualization
3.1 Command-line Interactive Mode
go tool pprof cpu.prof
# Or remote collection
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
# After entering interactive mode:
(pprof) top 10 # Top 10 hotspot functions
(pprof) top -cum # Sort by cumulative time
(pprof) list funcName # View function source-level time consumption
(pprof) web # Open SVG graph in browser (requires graphviz)
(pprof) png # Output PNG graph
(pprof) peek funcName # View callers and callees
Interpreting top output:
flat flat% sum% cum cum%
3.20s 40.00% 40.00% 3.20s 40.00% runtime.memclrNoHeapPointers
1.60s 20.00% 60.00% 4.80s 60.00% main.processData
0.80s 10.00% 70.00% 0.80s 10.00% runtime.memmove
- flat: Time spent in the function itself (excluding calls to other functions)
- cum (cumulative): Total time spent in the function, including all called sub-functions
- sum%: Running total of flat% down the list
3.2 Flame Graph
# Go 1.11+ built-in Web UI (recommended)
go tool pprof -http=:8081 cpu.prof
# Browser opens automatically, showing:
# - Top: Function ranking
# - Graph: Call graph
# - Flame Graph: Flame graph
# - Source: Source level
# - Peek: Upstream/downstream relationships
Flame Graph reading method:
- X-axis: Sampling proportion (wider = more time spent)
- Y-axis: Call stack depth (higher = deeper call stack)
- Color: No special meaning, only for differentiation
- Focus: Widest block at the top = most time-consuming leaf function
3.3 go tool trace
Finer granularity than pprof, showing goroutine scheduling, GC events, etc.
import (
    "os"
    "runtime/trace"
)

func main() {
    f, _ := os.Create("trace.out")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()
    // business logic...
}
go tool trace trace.out
# Browser opens, showing:
# - Goroutine analysis: goroutine count and execution timeline
# - Network/Sync blocking: Network and lock blocking
# - Syscall blocking: System call blocking
# - Scheduler latency: Scheduling latency
# - GC events: GC timeline
4. Benchmark-Driven Optimization
4.1 Combined benchmark + pprof
# Run benchmark and generate CPU profile
go test -bench=BenchmarkProcess -cpuprofile=cpu.prof -benchmem
# Analyze
go tool pprof -http=:8081 cpu.prof
# Run benchmark and generate memory profile
go test -bench=BenchmarkProcess -memprofile=mem.prof -benchmem
go tool pprof -http=:8081 mem.prof
4.2 benchstat for comparing test results
go install golang.org/x/perf/cmd/benchstat@latest
# Before optimization
go test -bench=. -count=10 > old.txt
# After modifying code
go test -bench=. -count=10 > new.txt
# Compare
benchstat old.txt new.txt
Output example:
name old time/op new time/op delta
Process-8 4.50ms ± 2% 1.20ms ± 1% -73.3% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
Process-8 1.20MB ± 0% 0.04MB ± 0% -96.7% (p=0.000 n=10+10)
5. Common Optimization Techniques
5.1 Reduce Memory Allocations
// Bad: Allocates on every call
func bad() []byte {
buf := make([]byte, 1024)
return buf
}
// Good: Use sync.Pool to reuse buffers
var bufPool = sync.Pool{
    New: func() interface{} { return make([]byte, 1024) },
}
func good() {
    buf := bufPool.Get().([]byte)
    defer bufPool.Put(buf) // return to the pool only after we are done with it
    // Use buf...
}
// Note: returning a pooled buffer to the caller while also putting it back
// is a use-after-put bug; keep Get/Put within the same scope of use.
// Good: Pre-allocate slice
func preallocSlice(n int) []int {
result := make([]int, 0, n) // Known capacity
for i := 0; i < n; i++ {
result = append(result, i)
}
return result
}
5.2 String Concatenation Optimization
// Benchmark comparison (concatenate 10000 times)
// Slow: + concatenation (~50ms, many memory allocations)
func concatPlus(n int) string {
s := ""
for i := 0; i < n; i++ {
s += "a"
}
return s
}
// Medium: fmt.Sprintf (~5ms; still O(n²) copying in a loop)
func concatSprintf(n int) string {
    s := ""
    for i := 0; i < n; i++ {
        s = fmt.Sprintf("%s%s", s, "a")
    }
    return s
}
// Fast: strings.Builder (~0.01ms, recommended)
func concatBuilder(n int) string {
var b strings.Builder
b.Grow(n) // Pre-allocate
for i := 0; i < n; i++ {
b.WriteString("a")
}
return b.String()
}
// Fast: bytes.Buffer (~0.01ms)
func concatBuffer(n int) string {
var buf bytes.Buffer
buf.Grow(n)
for i := 0; i < n; i++ {
buf.WriteString("a")
}
return buf.String()
}
// Special case: strings.Join (all segments known)
func concatJoin(parts []string) string {
return strings.Join(parts, "")
}
5.3 Avoid Unnecessary Reflection
// Slow: Reflection
func setFieldReflect(obj interface{}, name string, value interface{}) {
v := reflect.ValueOf(obj).Elem()
f := v.FieldByName(name)
f.Set(reflect.ValueOf(value))
}
// Fast: Direct assignment or use of interfaces
type Setter interface {
SetName(string)
}
func setField(obj Setter, name string) {
    obj.SetName(name) // interface method calls are typically orders of magnitude faster than reflection
}
5.4 Judicious Use of Goroutines
// Bad: Create a goroutine for each small task
for _, item := range items {
go process(item) // Millions of goroutines, high scheduling overhead
}
// Good: Worker Pool to control concurrency
func processAll(items []Item) {
sem := make(chan struct{}, runtime.NumCPU())
var wg sync.WaitGroup
for _, item := range items {
wg.Add(1)
sem <- struct{}{} // Limit concurrency
go func(it Item) {
defer wg.Done()
defer func() { <-sem }()
process(it)
}(item)
}
wg.Wait()
}
5.5 Reduce Lock Contention
// Bad: Global large lock
var mu sync.Mutex
var globalMap = make(map[string]int)
// Good: Sharded Map
const shardCount = 32
type ShardedMap struct {
shards [shardCount]struct {
sync.RWMutex
data map[string]int
}
}
func (m *ShardedMap) getShard(key string) int {
h := fnv.New32a()
h.Write([]byte(key))
return int(h.Sum32()) % shardCount
}
func (m *ShardedMap) Set(key string, val int) {
    idx := m.getShard(key)
    m.shards[idx].Lock()
    m.shards[idx].data[key] = val // each shard's data map must be initialized first (e.g. in a constructor)
    m.shards[idx].Unlock()
}
5.6 Struct Field Alignment
// Bad: Field order leads to memory waste (padding)
type Bad struct {
a bool // 1B + 7B padding
b int64 // 8B
c bool // 1B + 7B padding
} // Total 24B
// Good: Sort by size in descending order
type Good struct {
b int64 // 8B
a bool // 1B
c bool // 1B + 6B padding
} // Total 16B
// Check tool
// go install golang.org/x/tools/go/analysis/passes/fieldalignment/cmd/fieldalignment@latest
// fieldalignment -fix ./...
6. Practical Example: Optimizing a Slow API
Problem: GET /api/users average response 500ms
Step 1: Collect CPU Profile
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
(Simultaneously stress test the API with wrk/hey/ab)
Step 2: View Top Hotspots
(pprof) top
→ Found encoding/json.Marshal accounts for 40%
→ Found database/sql.Query accounts for 30%
Step 3: View Specific Code
(pprof) list handleUsers
→ Full user data serialized on every request
→ SELECT * executed on every request without pagination
Step 4: Optimize
1. Add pagination (LIMIT/OFFSET)
2. Use jsoniter instead of encoding/json
3. Add Redis cache for hot data
4. Use sync.Pool to reuse buffers
Step 5: Compare
Before optimization: 500ms, 100 allocs/op, 5MB/op
After optimization: 50ms, 20 allocs/op, 200KB/op
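The allocs/op and bytes/op figures above come from benchmark memory statistics. A self-contained way to reproduce such numbers (allocHeavy is an illustrative stand-in for the handler's allocation pattern):

```go
package main

import (
	"fmt"
	"testing"
)

// allocHeavy simulates code that allocates a fresh buffer on every call.
func allocHeavy() []byte {
	return make([]byte, 1024)
}

func main() {
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs() // collect memory statistics
		for i := 0; i < b.N; i++ {
			_ = allocHeavy()
		}
	})
	fmt.Printf("%d allocs/op, %d B/op\n", res.AllocsPerOp(), res.AllocedBytesPerOp())
}
```

Compare these numbers before and after an optimization (ideally via benchstat, as in section 4.2) to confirm the fix actually reduced allocation pressure.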
Quick Reference
| Command | Purpose |
|---|---|
| go build -gcflags="-m" | View escape analysis |
| go test -bench=. -benchmem | Run benchmarks with allocation stats |
| go test -bench=. -cpuprofile=cpu.prof | Generate CPU profile |
| go test -bench=. -memprofile=mem.prof | Generate memory profile |
| go tool pprof -http=:8081 cpu.prof | Web UI analysis |
| go tool pprof profile_url | Command-line analysis |
| go tool trace trace.out | Trace analysis |
| GODEBUG=gctrace=1 ./app | Print GC logs |
| GOGC=200 | Adjust GC trigger threshold |
| GOMEMLIMIT=1GiB | Set soft memory limit |