最近写了一段代码,精简一下大概是这样:
type config struct {
fk2Opts map[string]string
}
type doublebuffer struct {
c1 config
c2 config
currentIdx int64
l sync.RWMutex
}
var option doublebuffer
func readConfig(key string) bool {
option.l.RLock()
defer option.l.RUnlock()
idx := option.currentIdx
if idx == 0 {
_, ok1 := option.c1.fk2Opts[key]
return ok1
}
_, ok1 := option.c2.fk2Opts[key]
return ok1
}
func writeToConfig() {
fk2Opts := map[string]string{}
for i := 0; i < 10; i++ {
fk2Opts[fmt.Sprint(i)] = fmt.Sprint(i)
}
option.l.Lock()
defer option.l.Unlock()
idx := option.currentIdx
if idx == 0 {
option.c1.fk2Opts = fk2Opts
} else {
option.c2.fk2Opts = fk2Opts
}
option.currentIdx = 1 - option.currentIdx
}
一个全局的变量,内容是双 buffer,然后每过一段固定的时间,从 db 里加载内容,更新到下一个缓冲区,更新完毕之后,切换索引。整个操作一气呵成,毫无破绽。
然后程序竟然无征兆地崩溃了。
unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x484a0f]
goroutine 5337 [running]:
runtime.throw(0xdab912, 0x5)
/usr/local/go1.9.2/src/runtime/panic.go:605 +0x95 fp=0xc4219109e8 sp=0xc4219109c8 pc=0x454485
runtime.sigpanic()
/usr/local/go1.9.2/src/runtime/signal_unix.go:374 +0x227 fp=0xc421910a38 sp=0xc4219109e8 pc=0x46be87
runtime.aeshashbody()
/usr/local/go1.9.2/src/runtime/asm_amd64.s:1127 +0x38f fp=0xc421910a40 sp=0xc421910a38 pc=0x484a0f
runtime.mapaccess2(0xce03a0, 0xc420702ff0, 0xc421910b18, 0x0, 0xf)
/usr/local/go1.9.2/src/runtime/hashmap.go:416 +0x85 fp=0xc421910ac8 sp=0xc421910a40 pc=0x42dd35
git.xiaojukeji.com/gulfstream/ufs/dict.GetFeatureOptionByFeatureKey(0x11c6e20, 0xc42068e6c0, 0x1, 0xefcec37200000000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/go-build1536727247/src/git.xiaojukeji.com/gulfstream/ufs/dict/api.go:36 +0x3bd fp=0xc421910e28 sp=0xc421910ac8 pc=0xa9df1d
同时还有一些 GC 时崩溃的日志:
runtime: nelems=170 nalloc=21 previous allocCount=20 nfreed=65535
fatal error: sweep increased allocation count
惊慌地找到了 runtime throw 的堆栈,看到了 0x0,本能地觉得应该是一个 map 的并发读写的问题,于是在线下开始找系统中所有的 map DATA RACE。修复一波,重新上线~
竟然还是崩溃。。这下是得认真看一下了。先构造一个 map 的并发读写的例子:
package main
import (
"time"
)
var m = map[string]string{}
func main() {
go func() {
for {
m["a"] = "1"
}
}()
go func() {
for {
_ = m["b"]
}
}()
time.Sleep(time.Minute)
}
go run -race map_race.go
这样一定可以先看到有 DATA RACE,然后程序崩溃:
DATA RACE:
WARNING: DATA RACE
Read at 0x00c420096060 by goroutine 6:
runtime.mapaccess1_faststr()
/usr/local/go/src/runtime/hashmap_fast.go:172 +0x0
main.main.func2()
/Users/xargin/map_race.go:17 +0x6b
Previous write at 0x00c420096060 by goroutine 5:
runtime.mapassign_faststr()
/usr/local/go/src/runtime/hashmap_fast.go:694 +0x0
main.main.func1()
/Users/xargin/map_race.go:12 +0x6f
然后崩溃:
fatal error: concurrent map read and map write
goroutine 19 [running]:
runtime.throw(0x10a0940, 0x21)
/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc42003ef00 sp=0xc42003eee0 pc=0x104f141
runtime.mapaccess1_faststr(0x108d3a0, 0xc420096060, 0x109d4a2, 0x1, 0x1112c40)
/usr/local/go/src/runtime/hashmap_fast.go:181 +0x706 fp=0xc42003efa0 sp=0xc42003ef00 pc=0x1032546
main.main.func2()
/Users/xargin/map_race.go:17 +0x6c fp=0xc42003efe0 sp=0xc42003efa0 pc=0x107e98c
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc42003efe8 sp=0xc42003efe0 pc=0x1077951
created by main.main
/Users/xargin/map_race.go:15 +0x5b
显然我们遇到的不是一个 map 并发读写的错误,而是类似内存 corruption 的错误,仔细看一下报错时的堆栈对应的 Go 源代码:
hashmap.go:416
415 alg := t.key.alg
416 hash := alg.hash(key, uintptr(h.hash0))
417 m := uintptr(1)<<h.B - 1
418 b := (*bmap)(unsafe.Pointer(uintptr(h.buckets) + (hash&m)*uintptr(t.bucketsize)))
asm_amd64.s
1126 // start with last (possibly overlapping) block
1127 MOVOU -128(AX)(CX*1), X8 <--- 崩溃在这里
1128 MOVOU -112(AX)(CX*1), X9
1129 MOVOU -96(AX)(CX*1), X10
1130 MOVOU -80(AX)(CX*1), X11
1131 MOVOU -64(AX)(CX*1), X12
1132 MOVOU -48(AX)(CX*1), X13
1133 MOVOU -32(AX)(CX*1), X14
1134 MOVOU -16(AX)(CX*1), X15
在计算 aes hash 的时候引用了一块无效的内存,然后就收到了操作系统发来的 SIGSEGV 信号,程序崩溃。
可这里的问题是,这里 aeshash 的对象是我们函数的输入参数,也就是 m[key] 里的这个 key,这个 string 所在的内存怎么可能为空呢?按照 Go 的设计理念,你不可能通过一个 string 类型的变量构造出一个访问非法内存地址的 case 的,所以这是一个 Go 本身的 bug。翻阅了半天 Go 的升级日志,最后从官方的日志中找到一个类似的 case:
package main
import "runtime/debug"
type T struct {
// >= 16 bytes to avoid tiny alloc.
a, b int
}
func main() {
debug.SetGCPercent(1)
for i := 0; i < 100000; i++ {
m := make(map[*T]struct{}, 0)
for j := 0; j < 20; j++ {
// During the call to mapassign_fast64, the key argument
// was incorrectly treated as a uint64. If the stack was
// scanned during that call, the only pointer to k was
// missed, leading to *k being collected prematurely.
k := new(T)
m[k] = struct{}{}
}
}
}
可以看到,循环中的指针变量做 map 的 key 使用,会在 GC 时触发该 bug。在 1.9.2 上是必现的,
目前线上的构建系统中 1.9 版本的 Go 只有 1.9.2,所以使用该版本的同学如果不幸碰到了和我们类似的问题,可以考虑升一下级了。