Go 1.9.2 的 bug

最近写了一段代码,精简一下大概是这样:


type config struct {
	fk2Opts map[string]string
}

type doublebuffer struct {
	c1         config
	c2         config
	currentIdx int64
	l          sync.RWMutex
}

var option doublebuffer

func readConfig(key string) bool {
	option.l.RLock()
	defer option.l.RUnlock()

	idx := option.currentIdx
	if idx == 0 {
		_, ok1 := option.c1.fk2Opts[key]
		return ok1
	}

	_, ok1 := option.c2.fk2Opts[key]
	return ok1
}

func writeToConfig() {
	fk2Opts := map[string]string{}
	for i := 0; i < 10; i++ {
		fk2Opts[fmt.Sprint(i)] = fmt.Sprint(i)
	}

	option.l.Lock()
	defer option.l.Unlock()

	idx := option.currentIdx
	if idx == 0 {
		option.c1.fk2Opts = fk2Opts
	} else {
		option.c2.fk2Opts = fk2Opts
	}

	option.currentIdx = 1 - option.currentIdx
}

一个全局的变量,内容是双 buffer,然后每过一段固定的时间,从 db 里加载内容,更新到下一个缓冲区,更新完毕之后,切换索引。整个操作一气呵成,毫无破绽。

然后程序竟然无征兆地崩溃了。


unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x484a0f]

goroutine 5337 [running]:
runtime.throw(0xdab912, 0x5)
        /usr/local/go1.9.2/src/runtime/panic.go:605 +0x95 fp=0xc4219109e8 sp=0xc4219109c8 pc=0x454485
runtime.sigpanic()
        /usr/local/go1.9.2/src/runtime/signal_unix.go:374 +0x227 fp=0xc421910a38 sp=0xc4219109e8 pc=0x46be87
runtime.aeshashbody()
        /usr/local/go1.9.2/src/runtime/asm_amd64.s:1127 +0x38f fp=0xc421910a40 sp=0xc421910a38 pc=0x484a0f
runtime.mapaccess2(0xce03a0, 0xc420702ff0, 0xc421910b18, 0x0, 0xf)
        /usr/local/go1.9.2/src/runtime/hashmap.go:416 +0x85 fp=0xc421910ac8 sp=0xc421910a40 pc=0x42dd35
git.xiaojukeji.com/gulfstream/ufs/dict.GetFeatureOptionByFeatureKey(0x11c6e20, 0xc42068e6c0, 0x1, 0xefcec37200000000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /tmp/go-build1536727247/src/git.xiaojukeji.com/gulfstream/ufs/dict/api.go:36 +0x3bd fp=0xc421910e28 sp=0xc421910ac8 pc=0xa9df1d

同时还有一些 GC 时崩溃的日志:

runtime: nelems=170 nalloc=21 previous allocCount=20 nfreed=65535
fatal error: sweep increased allocation count

惊慌地找到了 runtime throw 的堆栈,看到了 0x0,本能地觉得应该是一个 map 的并发读写的问题,于是在线下开始找系统中所有的 map DATA RACE。修复一波,重新上线~

竟然还是崩溃。。这下是得认真看一下了。先构造一个 map 的并发读写的例子:

package main

import (
	"time"
)

var m = map[string]string{}

func main() {
	go func() {
		for {
			m["a"] = "1"
		}
	}()
	go func() {
		for {
			_ = m["b"]
		}
	}()
	time.Sleep(time.Minute)
}

go run -race map_race.go

这样一定可以先看到有 DATA RACE,然后程序崩溃:

DATA RACE:

WARNING: DATA RACE
Read at 0x00c420096060 by goroutine 6:
  runtime.mapaccess1_faststr()
      /usr/local/go/src/runtime/hashmap_fast.go:172 +0x0
  main.main.func2()
      /Users/xargin/map_race.go:17 +0x6b

Previous write at 0x00c420096060 by goroutine 5:
  runtime.mapassign_faststr()
      /usr/local/go/src/runtime/hashmap_fast.go:694 +0x0
  main.main.func1()
      /Users/xargin/map_race.go:12 +0x6f

然后崩溃:

fatal error: concurrent map read and map write

goroutine 19 [running]:
runtime.throw(0x10a0940, 0x21)
	/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc42003ef00 sp=0xc42003eee0 pc=0x104f141
runtime.mapaccess1_faststr(0x108d3a0, 0xc420096060, 0x109d4a2, 0x1, 0x1112c40)
	/usr/local/go/src/runtime/hashmap_fast.go:181 +0x706 fp=0xc42003efa0 sp=0xc42003ef00 pc=0x1032546
main.main.func2()
	/Users/xargin/map_race.go:17 +0x6c fp=0xc42003efe0 sp=0xc42003efa0 pc=0x107e98c
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc42003efe8 sp=0xc42003efe0 pc=0x1077951
created by main.main
	/Users/xargin/map_race.go:15 +0x5b

显然我们遇到的不是一个 map 并发读写的错误,而是类似内存 corruption 的错误,仔细看一下报错时的堆栈对应的 Go 源代码:

hashmap.go:416

 415         alg := t.key.alg
 416         hash := alg.hash(key, uintptr(h.hash0))
 417         m := uintptr(1)<<h.B - 1
 418         b := (*bmap)(unsafe.Pointer(uintptr(h.buckets) + (hash&m)*uintptr(t.bucketsize)))

asm_amd64.s

1126         // start with last (possibly overlapping) block
1127         MOVOU   -128(AX)(CX*1), X8   <--- 崩溃在这里
1128         MOVOU   -112(AX)(CX*1), X9
1129         MOVOU   -96(AX)(CX*1), X10
1130         MOVOU   -80(AX)(CX*1), X11
1131         MOVOU   -64(AX)(CX*1), X12
1132         MOVOU   -48(AX)(CX*1), X13
1133         MOVOU   -32(AX)(CX*1), X14
1134         MOVOU   -16(AX)(CX*1), X15

在计算 aes hash 的时候引用了一块无效的内存,然后就收到了操作系统发来的 SIGSEGV 信号,程序崩溃。

可这里的问题是,这里 aeshash 的对象是我们函数的输入参数,也就是 m[key] 里的这个 key,这个 string 所在的内存怎么可能为空呢?按照 Go 的设计理念,你不可能通过一个 string 类型的变量构造出一个访问非法内存地址的 case 的,所以这是一个 Go 本身的 bug。翻阅了半天 Go 的升级日志,最后从官方的日志中找到一个类似的 case:

package main

import "runtime/debug"

type T struct {
  // >= 16 bytes to avoid tiny alloc.
  a, b int
}

func main() {
  debug.SetGCPercent(1)
  for i := 0; i < 100000; i++ {
    m := make(map[*T]struct{}, 0)
    for j := 0; j < 20; j++ {
      // During the call to mapassign_fast64, the key argument
      // was incorrectly treated as a uint64. If the stack was
      // scanned during that call, the only pointer to k was
      // missed, leading to *k being collected prematurely.
      k := new(T)
      m[k] = struct{}{}
    }
  }
}

可以看到,循环中的指针变量做 map 的 key 使用,会在 GC 时触发该 bug。在 1.9.2 上是必现的,

目前线上的构建系统中 1.9 版本的 Go 只有 1.9.2,所以使用该版本的同学如果不幸碰到了和我们类似的问题,可以考虑升一下级了。

Xargin

Xargin

If you don't keep moving, you'll quickly fall behind
Beijing