go-redis 和 redis server 版本错位导致的高延时问题一例

公司内有多个 go-redis 的 client 和多种 redis cluster 版本,业务压测发现即使只访问 redis 的接口也可能会延迟达到秒级,非常反直觉。

我们用不同版本的 go-redis 和不同版本的 redis cluster 来简单做个压测,redis 命令用简单的 get,kv 大小都是一个字节,这里把数据先放出来:

Client 版本
Server 版本
平均延迟
qps
v6
5.0
2.58ms
42138.85
v8
5.0
2.29ms
43567.68
v6
6.0
549.22ms
117
v8
6.0
2.48ms
41794.86

可以看到第三行数据异常,访问 redis 竟然有 500ms,说明如果 client 和 server 版本没没对上确实会有比较麻烦的问题。

下面我们来分析一下这个问题。

压测环境准备

安装 redis-server,没啥可说的,不要用 mac 的 brew,因为 brew 安装的版本没有 create-cluster 脚本工具。从官网下载分发版本就可以,在 mac 上到 redis 根目录执行一次 make 就好。

这里我们只测试server 5.0 和 6.0。

进入到 redis 的 utils/create-cluster 目录,直接执行

./create-cluster start
./create-cluster create

默认应该是启动三主三从,启动了以后随便连一个实例就可以进去了

~ ❯❯❯ redis-cli -c -p 30001
127.0.0.1:30002> set 1 1

提前把 kv 存好,然后用 30001,30002,30003 这三个实例来创建 client。

client 端要分别准备 go-redis/redis 和 go-redis/redis/v8。

v6 版本:

package main

import (
        "fmt"
        "log"
        "net/http"

        _ "net/http/pprof"

        "github.com/go-redis/redis"
)

var cli = redis.NewClusterClient(&redis.ClusterOptions{
        Addrs:        []string{"127.0.0.1:30001", "127.0.0.1:30002", "127.0.0.1:30003"},
})

func sayhello(wr http.ResponseWriter, r *http.Request) {
        s := cli.Get( "1")
        err := s.Err()
        if err != nil {
                fmt.Println(err)
        }
}

func main() {
        http.HandleFunc("/", sayhello)
        err := http.ListenAndServe(":10003", nil)
        if err != nil {
                log.Fatal("ListenAndServe:", err)
        }
}

v8 版本:

package main

import (
        "context"
        "fmt"
        "log"
        "net/http"

        _ "net/http/pprof"

        "github.com/go-redis/redis/v8"
)

var cli = redis.NewClusterClient(&redis.ClusterOptions{
        Addrs:        []string{"127.0.0.1:30001", "127.0.0.1:30002", "127.0.0.1:30003"},
})

func sayhello(wr http.ResponseWriter, r *http.Request) {
        s := cli.Get(context.TODO(), "1")
        err := s.Err()
        if err != nil {
                fmt.Println(err)
        }
}

func main() {
        http.HandleFunc("/", sayhello)
        err := http.ListenAndServe(":10003", nil)
        if err != nil {
                log.Fatal("ListenAndServe:", err)
        }
}

启动之后直接用 wrk 就可以了:

~ ❯❯❯ wrk -t10 -c100 -d60 http://localhost:10003

分别对不同版本的 client 和 server 端排列组合压测,就可以得到本文开头的数据表格了。

问题分析

为啥 v6 的 go-redis/redis 配合 server 6.0 cluster 会有这么大的延迟?先用 pprof 看一下,压测的时候 goroutine 都在干什么:

goroutine profile: total 205
100 @ 0x1047b2c88 0x1047ac364 0x1047dc484 0x10481f31c 0x10481fe54 0x10481fe45 0x1048a511c 0x1048af4b8 0x104944678 0x1047e1c94
#	0x1047dc483	internal/poll.runtime_pollWait+0xa3		/usr/local/go/src/runtime/netpoll.go:302
#	0x10481f31b	internal/poll.(*pollDesc).wait+0x2b		/usr/local/go/src/internal/poll/fd_poll_runtime.go:83
#	0x10481fe53	internal/poll.(*pollDesc).waitRead+0x1e3	/usr/local/go/src/internal/poll/fd_poll_runtime.go:88
#	0x10481fe44	internal/poll.(*FD).Read+0x1d4			/usr/local/go/src/internal/poll/fd_unix.go:167
#	0x1048a511b	net.(*netFD).Read+0x2b				/usr/local/go/src/net/fd_posix.go:55
#	0x1048af4b7	net.(*conn).Read+0x37				/usr/local/go/src/net/net.go:183
#	0x104944677	net/http.(*connReader).backgroundRead+0x47	/usr/local/go/src/net/http/server.go:672

69 @ 0x1047b2c88 0x1047c43a4 0x1047c4381 0x1047dddc8 0x1047f27a0 0x10499b324 0x10499b2bd 0x1049ab318 0x1049a226c 0x1049a25dc 0x1049a2830 0x1049a2748 0x1049ab744 0x1049b00ec 0x10494b06c 0x10494c744 0x10494d95c 0x10494a05c 0x1047e1c94
#	0x1047dddc7	sync.runtime_SemacquireMutex+0x27				/usr/local/go/src/runtime/sema.go:71
#	0x1047f279f	sync.(*Mutex).lockSlow+0x17f					/usr/local/go/src/sync/mutex.go:162
#	0x10499b323	sync.(*Mutex).Lock+0xa3						/usr/local/go/src/sync/mutex.go:81
#	0x10499b2bc	github.com/go-redis/redis/internal.(*Once).Do+0x3c		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/internal/once.go:50
#	0x1049ab317	github.com/go-redis/redis.(*cmdsInfoCache).Get+0x47		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/command.go:1963
#	0x1049a226b	github.com/go-redis/redis.(*ClusterClient).cmdInfo+0x2b		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:754
#	0x1049a25db	github.com/go-redis/redis.(*ClusterClient).cmdSlotAndNode+0x5b	/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:790
#	0x1049a282f	github.com/go-redis/redis.(*ClusterClient).defaultProcess+0xaf	/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:918
#	0x1049a2747	github.com/go-redis/redis.(*ClusterClient).Process+0x37		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:905
#	0x1049ab743	github.com/go-redis/redis.(*cmdable).Get+0xf3			/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/commands.go:756
#	0x1049b00eb	main.sayhello+0x3b						/Users/sg00186ml/test/go/redis/goredis/v6.go:25
#	0x10494b06b	net/http.HandlerFunc.ServeHTTP+0x3b				/usr/local/go/src/net/http/server.go:2084
#	0x10494c743	net/http.(*ServeMux).ServeHTTP+0x143				/usr/local/go/src/net/http/server.go:2462
#	0x10494d95b	net/http.serverHandler.ServeHTTP+0x3fb				/usr/local/go/src/net/http/server.go:2916
#	0x10494a05b	net/http.(*conn).serve+0x56b					/usr/local/go/src/net/http/server.go:1966

29 @ 0x1047b2c88 0x1047c43a4 0x1047c4381 0x1047dddc8 0x1047f27a0 0x10499b324 0x10499b2bd 0x1049ab318 0x1049a226c 0x1049a24f8 0x1049a25f0 0x1049a2830 0x1049a2748 0x1049ab744 0x1049b00ec 0x10494b06c 0x10494c744 0x10494d95c 0x10494a05c 0x1047e1c94
#	0x1047dddc7	sync.runtime_SemacquireMutex+0x27				/usr/local/go/src/runtime/sema.go:71
#	0x1047f279f	sync.(*Mutex).lockSlow+0x17f					/usr/local/go/src/sync/mutex.go:162
#	0x10499b323	sync.(*Mutex).Lock+0xa3						/usr/local/go/src/sync/mutex.go:81
#	0x10499b2bc	github.com/go-redis/redis/internal.(*Once).Do+0x3c		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/internal/once.go:50
#	0x1049ab317	github.com/go-redis/redis.(*cmdsInfoCache).Get+0x47		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/command.go:1963
#	0x1049a226b	github.com/go-redis/redis.(*ClusterClient).cmdInfo+0x2b		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:754
#	0x1049a24f7	github.com/go-redis/redis.(*ClusterClient).cmdSlot+0x147	/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:780
#	0x1049a25ef	github.com/go-redis/redis.(*ClusterClient).cmdSlotAndNode+0x6f	/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:791
#	0x1049a282f	github.com/go-redis/redis.(*ClusterClient).defaultProcess+0xaf	/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:918
#	0x1049a2747	github.com/go-redis/redis.(*ClusterClient).Process+0x37		/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/cluster.go:905
#	0x1049ab743	github.com/go-redis/redis.(*cmdable).Get+0xf3			/Users/sg00186ml/go/pkg/mod/github.com/go-redis/redis@v6.15.9+incompatible/commands.go:756
#	0x1049b00eb	main.sayhello+0x3b						/Users/sg00186ml/test/go/redis/goredis/v6.go:25
#	0x10494b06b	net/http.HandlerFunc.ServeHTTP+0x3b				/usr/local/go/src/net/http/server.go:2084
#	0x10494c743	net/http.(*ServeMux).ServeHTTP+0x143				/usr/local/go/src/net/http/server.go:2462
#	0x10494d95b	net/http.serverHandler.ServeHTTP+0x3fb				/usr/local/go/src/net/http/server.go:2916
#	0x10494a05b	net/http.(*conn).serve+0x56b					/usr/local/go/src/net/http/server.go:1966

还是我以前文章写的,业务的性能问题,大部分都是在锁上,这次的锁是基础库的锁。需要我们简单读一下 go-redis 的代码。

在启动 cluster client 的时候,每一个 cluster client 都有一个叫 cmdsInfoCache 的对象:


// http://redis.io/topics/cluster-spec.
func NewClusterClient(opt *ClusterOptions) *ClusterClient {
   .....

   c := &ClusterClient{
      opt:   opt,
      nodes: newClusterNodes(opt),
   }
   .....
   c.cmdsInfoCache = newCmdsInfoCache(c.cmdsInfo) // 这里
   ......

   return c
}

这个对象在这个 client 上第一次执行任意命令时(我们这里的代码例子是 redis 的 Get 命令),会进行初始化:

// get 命令的实现
func (c *cmdsInfoCache) Get() (map[string]*CommandInfo, error) {
   err := c.once.Do(func() error {
      cmds, err := c.fn()
      if err != nil {
         return err
      }
      c.cmds = cmds
      return nil
   })
   return c.cmds, err
}

可以看到,每个命令的前面几行都要执行这个 c.once.Do 逻辑,这个到底是个啥?

func (c *ClusterClient) cmdsInfo() (map[string]*CommandInfo, error) {
   addrs, err := c.nodes.Addrs() // 获取节点信息,没有网络交互
   if err != nil {
      return nil, err
   }

   var firstErr error
   for _, addr := range addrs {
      node, err := c.nodes.Get(addr)
      ......

      info, err := node.Client.Command().Result()
      ......
   }
   return nil, firstErr
}

这里会向 cluster 中的任意一个节点发送 command 命令,只要有一个节点响应了,就把结果返回,并将其缓存在 c.cmds 对象中了。

函数返回的 err 是空的话,cmdsInfoCache 就初始化完成了,根据 sync.Once 的性质,下次所有命令的 once.Do 就只是简单执行一个 atomic.Load,成本不高。

然而我们在长时间压测之后,发现每次命令还是会卡在锁上,说明这里的 err 是非空的,这是怎么回事?

简单来说,redis server 6.0 给 client 返回的 command 命令的响应,在 go-redis/redis v6 版本 parse 会出错:

Fix performance degradation due to non-caching of command info in Redis 6 Cluster by yuuki · Pull Request #1355 · go-redis/redis
What is the problem?I observed performance degradation when using redis.ClusterClient in Redis 6.pprof reported the following bottleneck caused by mutex lock ( redis/command.go ...

每次 parse 都出错,那自然每次 once.Do 都会进 slow path 了,redis cluster 的 client 是全局公用,所以这里的锁是个全局锁,并且锁内有较慢的网络调用。

从 PR 中的描述来看,这个修复直接进入了最新的两个 client 版本。按照大多开源社区项目惯例,官方只维护最新的两个版本以避免太多版本同时维护造成开发过重的负担。

Go 语言本身也是这样的,如果你用的不是最新的两个版本之一,给官方报一些诡异的问题,一般也会被劝诫先升到新版本再试试。

Xargin

Xargin

If you don't keep moving, you'll quickly fall behind
Beijing