Performance Optimization and pprof
1. Measure Before Optimizing
"Premature optimization is the root of all evil." — Donald Knuth
Optimization process:
1. Write correct code first
2. Use Benchmark to identify performance bottlenecks
3. Use pprof to pinpoint the exact location
4. Optimize → Measure again → Compare
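Step 2 of the loop above relies on Go's built-in benchmarks. A minimal self-contained sketch (processData is an illustrative stand-in for your own code); it uses testing.Benchmark so it runs as a plain program, though in practice you would write a BenchmarkXxx function in a _test.go file:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// processData stands in for the code under test (name is illustrative).
func processData(parts []string) string {
	return strings.Join(parts, ",")
}

func main() {
	// testing.Benchmark runs a benchmark function outside `go test`.
	res := testing.Benchmark(func(b *testing.B) {
		parts := []string{"a", "b", "c"}
		b.ReportAllocs() // include allocs/op and B/op in the result
		for i := 0; i < b.N; i++ {
			processData(parts)
		}
	})
	fmt.Println(res.String())
}
```

Once a benchmark like this shows a regression or a hotspot, steps 3–4 hand the investigation over to pprof.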
2. pprof Tool
2.1 Integration in HTTP Services
import (
    "net/http"
    _ "net/http/pprof" // importing is enough: handlers register on DefaultServeMux
)

func main() {
    // If you already serve HTTP via DefaultServeMux, pprof is registered automatically
    http.ListenAndServe(":8080", nil)
}

// With frameworks such as gin/echo, start a separate pprof server
func main() {
    go func() {
        http.ListenAndServe(":6060", nil) // dedicated pprof port
    }()
    // start the main service...
}
Visit http://localhost:6060/debug/pprof/ to view the overview.
2.2 Usage in Non-HTTP Programs
import (
    "os"
    "runtime/pprof"
)

func main() {
    // CPU profile
    cpuFile, _ := os.Create("cpu.prof")
    defer cpuFile.Close()
    pprof.StartCPUProfile(cpuFile)

    // business logic...
    doWork()
    pprof.StopCPUProfile()

    // Heap profile
    heapFile, _ := os.Create("heap.prof")
    defer heapFile.Close()
    pprof.WriteHeapProfile(heapFile)
}
2.3 Profile Types
| Profile | Description | HTTP Path |
|---|---|---|
| CPU | CPU usage hotspots | /debug/pprof/profile?seconds=30 |
| Heap | Heap memory allocation | /debug/pprof/heap |
| Allocs | Cumulative memory allocation | /debug/pprof/allocs |
| Goroutine | Goroutine stack | /debug/pprof/goroutine |
| Block | Blocking waits | /debug/pprof/block |
| Mutex | Lock contention | /debug/pprof/mutex |
| Threadcreate | Thread creation | /debug/pprof/threadcreate |
# Collect 30-second CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
# Collect heap memory
go tool pprof http://localhost:6060/debug/pprof/heap
# View goroutines (detect leaks)
go tool pprof http://localhost:6060/debug/pprof/goroutine
# Block and mutex profiles stay empty unless you enable them in code first:
runtime.SetBlockProfileRate(1)     // record every blocking event
runtime.SetMutexProfileFraction(1) // sample every mutex contention event
3. pprof Visualization
3.1 Command-line Interactive Mode
go tool pprof cpu.prof
# Or remote collection
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
# After entering interactive mode:
(pprof) top 10 # Top 10 hotspot functions
(pprof) top -cum # Sort by cumulative time
(pprof) list funcName # View function source-level time consumption
(pprof) web # Open SVG graph in browser (requires graphviz)
(pprof) png # Output PNG graph
(pprof) peek funcName # View callers and callees
Interpreting top output:
flat flat% sum% cum cum%
3.20s 40.00% 40.00% 3.20s 40.00% runtime.memclrNoHeapPointers
1.60s 20.00% 60.00% 4.80s 60.00% main.processData
0.80s 10.00% 70.00% 0.80s 10.00% runtime.memmove
- flat: Time spent in the function itself (excluding calls to other functions)
- cum (cumulative): Total time spent in the function, including all called sub-functions
- sum%: Running total of flat% down the list
3.2 Flame Graph
# Go 1.11+ built-in Web UI (recommended)
go tool pprof -http=:8081 cpu.prof
# Browser opens automatically, showing:
# - Top: Function ranking
# - Graph: Call graph
# - Flame Graph: Flame graph
# - Source: Source level
# - Peek: Upstream/downstream relationships
Flame Graph reading method:
- X-axis: Sampling proportion (wider = more time spent)
- Y-axis: Call stack depth (higher = deeper call stack)
- Color: No special meaning, only for differentiation
- Focus: Widest block at the top = most time-consuming leaf function
3.3 go tool trace
Finer granularity than pprof, showing goroutine scheduling, GC events, etc.
import (
    "os"
    "runtime/trace"
)

func main() {
    f, _ := os.Create("trace.out")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()
    // business logic...
}
go tool trace trace.out
# Browser opens, showing:
# - Goroutine analysis: goroutine count and execution timeline
# - Network/Sync blocking: Network and lock blocking
# - Syscall blocking: System call blocking
# - Scheduler latency: Scheduling latency
# - GC events: GC timeline
4. Benchmark-Driven Optimization
4.1 Combined benchmark + pprof
# Run benchmark and generate CPU profile
go test -bench=BenchmarkProcess -cpuprofile=cpu.prof -benchmem
# Analyze
go tool pprof -http=:8081 cpu.prof
# Run benchmark and generate memory profile
go test -bench=BenchmarkProcess -memprofile=mem.prof -benchmem
go tool pprof -http=:8081 mem.prof
4.2 benchstat for comparing test results
go install golang.org/x/perf/cmd/benchstat@latest
# Before optimization
go test -bench=. -count=10 > old.txt
# After modifying code
go test -bench=. -count=10 > new.txt
# Compare
benchstat old.txt new.txt
Output example:
name old time/op new time/op delta
Process-8 4.50ms ± 2% 1.20ms ± 1% -73.3% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
Process-8 1.20MB ± 0% 0.04MB ± 0% -96.7% (p=0.000 n=10+10)
5. Common Optimization Techniques
5.1 Reduce Memory Allocations
// Bad: Allocates on every call
func bad() []byte {
buf := make([]byte, 1024)
return buf
}
// Good: Use sync.Pool to reuse buffers
var bufPool = sync.Pool{
    New: func() interface{} { return make([]byte, 1024) },
}
func good() {
    buf := bufPool.Get().([]byte)
    defer bufPool.Put(buf) // return to the pool only after we are done with it
    // Use buf...
}
// Note: returning a pooled buffer to the caller while also putting it back
// is a use-after-put bug; keep Get/Put within the same scope of use.
// Good: Pre-allocate slice
func preallocSlice(n int) []int {
result := make([]int, 0, n) // Known capacity
for i := 0; i < n; i++ {
result = append(result, i)
}
return result
}
5.2 String Concatenation Optimization
// Benchmark comparison (concatenate 10000 times)
// Slow: + concatenation (~50ms, many memory allocations)
func concatPlus(n int) string {
s := ""
for i := 0; i < n; i++ {
s += "a"
}
return s
}
// Medium: fmt.Sprintf (~5ms; still O(n²) copying in a loop)
func concatSprintf(n int) string {
    s := ""
    for i := 0; i < n; i++ {
        s = fmt.Sprintf("%s%s", s, "a")
    }
    return s
}
// Fast: strings.Builder (~0.01ms, recommended)
func concatBuilder(n int) string {
var b strings.Builder
b.Grow(n) // Pre-allocate
for i := 0; i < n; i++ {
b.WriteString("a")
}
return b.String()
}
// Fast: bytes.Buffer (~0.01ms)
func concatBuffer(n int) string {
var buf bytes.Buffer
buf.Grow(n)
for i := 0; i < n; i++ {
buf.WriteString("a")
}
return buf.String()
}
// Special case: strings.Join (all segments known)
func concatJoin(parts []string) string {
return strings.Join(parts, "")
}
5.3 Avoid Unnecessary Reflection
// Slow: Reflection
func setFieldReflect(obj interface{}, name string, value interface{}) {
v := reflect.ValueOf(obj).Elem()
f := v.FieldByName(name)
f.Set(reflect.ValueOf(value))
}
// Fast: Direct assignment or use of interfaces
type Setter interface {
SetName(string)
}
func setField(obj Setter, name string) {
    obj.SetName(name) // interface method calls are typically orders of magnitude faster than reflection
}
5.4 Judicious Use of Goroutines
// Bad: Create a goroutine for each small task
for _, item := range items {
go process(item) // Millions of goroutines, high scheduling overhead
}
// Good: Worker Pool to control concurrency
func processAll(items []Item) {
sem := make(chan struct{}, runtime.NumCPU())
var wg sync.WaitGroup
for _, item := range items {
wg.Add(1)
sem <- struct{}{} // Limit concurrency
go func(it Item) {
defer wg.Done()
defer func() { <-sem }()
process(it)
}(item)
}
wg.Wait()
}
5.5 Reduce Lock Contention
// Bad: Global large lock
var mu sync.Mutex
var globalMap = make(map[string]int)
// Good: Sharded Map
const shardCount = 32
type ShardedMap struct {
shards [shardCount]struct {
sync.RWMutex
data map[string]int
}
}
func (m *ShardedMap) getShard(key string) int {
h := fnv.New32a()
h.Write([]byte(key))
return int(h.Sum32()) % shardCount
}
func (m *ShardedMap) Set(key string, val int) {
    idx := m.getShard(key)
    m.shards[idx].Lock()
    m.shards[idx].data[key] = val // each shard's data map must be initialized first (e.g. in a constructor)
    m.shards[idx].Unlock()
}
5.6 Struct Field Alignment
// Bad: Field order leads to memory waste (padding)
type Bad struct {
a bool // 1B + 7B padding
b int64 // 8B
c bool // 1B + 7B padding
} // Total 24B
// Good: Sort by size in descending order
type Good struct {
b int64 // 8B
a bool // 1B
c bool // 1B + 6B padding
} // Total 16B
// Check tool
// go install golang.org/x/tools/go/analysis/passes/fieldalignment/cmd/fieldalignment@latest
// fieldalignment -fix ./...
6. Practical Example: Optimizing a Slow API
Problem: GET /api/users average response 500ms
Step 1: Collect CPU Profile
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
(Simultaneously stress test the API with wrk/hey/ab)
Step 2: View Top Hotspots
(pprof) top
→ Found encoding/json.Marshal accounts for 40%
→ Found database/sql.Query accounts for 30%
Step 3: View Specific Code
(pprof) list handleUsers
→ Full user data serialized on every request
→ SELECT * executed on every request without pagination
Step 4: Optimize
1. Add pagination (LIMIT/OFFSET)
2. Use jsoniter instead of encoding/json
3. Add Redis cache for hot data
4. Use sync.Pool to reuse buffers
Step 5: Compare
Before optimization: 500ms, 100 allocs/op, 5MB/op
After optimization: 50ms, 20 allocs/op, 200KB/op
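The allocs/op and bytes/op figures above come from benchmark memory statistics. A self-contained way to reproduce such numbers (allocHeavy is an illustrative stand-in for the handler's allocation pattern):

```go
package main

import (
	"fmt"
	"testing"
)

// allocHeavy simulates code that allocates a fresh buffer on every call.
func allocHeavy() []byte {
	return make([]byte, 1024)
}

func main() {
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs() // collect memory statistics
		for i := 0; i < b.N; i++ {
			_ = allocHeavy()
		}
	})
	fmt.Printf("%d allocs/op, %d B/op\n", res.AllocsPerOp(), res.AllocedBytesPerOp())
}
```

Compare these numbers before and after an optimization (ideally via benchstat, as in section 4.2) to confirm the fix actually reduced allocation pressure.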
Quick Reference
| Command | Purpose |
|---|---|
| go build -gcflags="-m" | View escape analysis |
| go test -bench=. -benchmem | Run benchmarks with allocation stats |
| go test -bench=. -cpuprofile=cpu.prof | Generate CPU profile |
| go test -bench=. -memprofile=mem.prof | Generate memory profile |
| go tool pprof -http=:8081 cpu.prof | Web UI analysis |
| go tool pprof profile_url | Command-line analysis |
| go tool trace trace.out | Trace analysis |
| GODEBUG=gctrace=1 ./app | Print GC logs |
| GOGC=200 | Adjust GC trigger threshold |
| GOMEMLIMIT=1GiB | Set soft memory limit |