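This is a problem I ran into shortly after joining Shopee as a new graduate. The WAF on Shopee's private cloud lets internal users configure IP allowlist and blocklist rules, all of which are stored in MySQL. I inherited the system from a senior colleague who had already left the company. I soon noticed that for about 5 minutes after every rule change, reads were unstable: a new rule would sometimes show up and sometimes not, and users reported this regularly. The cause turned out to be an in-memory cache in the service code. The service was deployed as two instances, and writes were not synchronized between them. If a write and the read that followed it were routed to different instances, the read could not see the latest data until the stale cache entry expired, and the in-memory cache TTL was set to 5 minutes.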

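I checked the service's operations log: before I joined, it had been scaled out from one instance to two. While the previous developer maintained the WAF, it always ran as a single instance, so the problem never showed up. After he left, the colleague who scaled it out probably did not realize this would cause inconsistency. And so the problem landed on me.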

Introducing Redis

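My first idea was to replace the in-memory cache with Redis. During the canary rollout, however, Redis bandwidth was saturated. It turned out that some rules carried very long lists of blocked IPs, so each read transferred a very large amount of data.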


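Because WAF rules are read far more often than they are written, the data read from Redis is unchanged most of the time. An experienced colleague suggested keeping only a version number in Redis and leaving the rule data in the in-memory cache. After several rounds of refinement, the final design works as follows.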


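  • Writes are simple. Use the current microsecond timestamp as the new version number, then do four things: write the DB, update the version number in Redis, and update both the data and the version number in the local in-memory cache. The order of these four steps is interchangeable.
  • Reads are slightly more involved. First read the version number from Redis. If the local version number is still current (the overwhelmingly common case), serve the data straight from the local in-memory cache. Two cases need separate handling: the Redis version differs from the in-memory version, and the version is missing from Redis (expired). The pseudocode shows the handling logic.
  • If more than one write arrives within the same microsecond, inconsistency is still possible. Shopee WAF rules are not updated anywhere near that often in practice, so I did not handle this case. The timestamp is only ever compared for equality, never for order, so it can be replaced with any distributed unique ID scheme.
  • The version number is not exposed to users. In fact, reads under the same version number may return different rule data, but this does not break eventual consistency.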
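Since the version number is only compared for equality, one way to sidestep the same-microsecond collision is a random token. The sketch below is an assumption, not what the WAF service used: `NewVersion` is a hypothetical helper that returns 128 random bits as a hex string.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// NewVersion returns a random 128-bit token encoded as 32 hex characters.
// Tokens carry no ordering; callers may only compare them for equality.
func NewVersion() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	return hex.EncodeToString(b[:]), nil
}

func main() {
	v1, err := NewVersion()
	if err != nil {
		panic(err)
	}
	v2, err := NewVersion()
	if err != nil {
		panic(err)
	}
	// Two writes in the same microsecond still get distinct versions.
	fmt.Println(len(v1), v1 == v2) // 32 false (a collision is astronomically unlikely)
}
```

The trade-off is that a token no longer says when the write happened, which the design never relied on anyway.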
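func Set(key, data) {
    // Microsecond timestamp, used only as an opaque version (compared for equality)
    newVer := time()

    WriteMySQL(key, data)
    redis.Set(key, newVer, expire=5min)
    // Refresh this instance's local cache with the new data and version
    localCacheData.Store(data)
    localCacheVer.Store(newVer)
}

func Read(key) Data {
    ver := redis.Get(key)
    if ver != nil {
        if localCacheVer.Load() == ver {
            // Local cache is up-to-date, just use it
            return localCacheData.Load()
        }
        // Version mismatch: fall through and reload from MySQL
    } else {    // This version has expired
        ver = time()
        res := redis.SetNX(key, ver, expire=5min)
        if res == false {
            // Another instance has proceeded, use that version
            ver = redis.Get(key)
        }
    }
    // Reload from MySQL and record the version the data was loaded under
    data := ReadFromMySQL(key)
    localCacheData.Store(data)
    localCacheVer.Store(ver)
    return data
}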

Formal verification with TLA+

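I happened to be teaching myself TLA+ at the time, so I wrote a TLA+ specification of this design, and it passed the eventual consistency check as expected. While writing this summary, I suspect the design is actually linearizable, but I have not verified that.
The original problem, an API returning inconsistent data for 5 minutes after each rule change, was resolved.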

// ================ tla file ================

---- MODULE waf ----
EXTENDS Integers, TLC

VARIABLE redisVer, localVer, pc, threadVer, DBData, localData, threadData
CONSTANTS DataDomain, ProcSet, r1, r2, r3, t1, t2, t3

vars == << redisVer, localVer, pc, threadVer, localData, threadData, DBData>>

Init == /\ redisVer = -1 /\ localVer = -1 /\ localData = "" /\ DBData = ""
        /\ threadVer = [self \in ProcSet |-> -1]
        /\ pc = [self \in ProcSet |-> "A"]
        /\ threadData = [self \in ProcSet |-> ""]

RedisExpire == /\ threadData = [self \in ProcSet |-> DBData]
               /\ redisVer' = -1
               /\ DBData' \in DataDomain
               /\ UNCHANGED <<localVer, threadVer, localData, threadData, pc>>

ReadRedis(self) == /\ pc[self] = "A"
                   /\ threadVer' = [threadVer EXCEPT ![self] = redisVer]
                   /\ \/ /\ redisVer = -1
                         /\ pc' = [pc EXCEPT ![self] = "C"]
                      \/ /\ redisVer # -1
                         /\ pc' = [pc EXCEPT ![self] = "F"]
                   /\ UNCHANGED <<localVer, redisVer, localData, threadData, DBData>>

SetRedis(self) == /\ pc[self] = "C"
                  /\ \/ /\ redisVer # -1    \* SetNX failed => use existing redis
                        /\ redisVer' = redisVer
                        /\ threadVer' = [threadVer EXCEPT ![self] = redisVer] \* Not strictly the same!
                     \/ /\ redisVer = -1    \* SetNX ok => change redis
                        /\ redisVer' \in 1600012345..1600012350
                        /\ threadVer' = [threadVer EXCEPT ![self] = redisVer']
                  /\ pc' = [pc EXCEPT ![self] = "I"]
                  /\ UNCHANGED <<localVer, localData, threadData, DBData>>

CheckLocal(self) == /\ pc[self] = "F"
                    /\ \/ /\ localVer = threadVer[self]    \* Normal case
                          /\ threadData' = [threadData EXCEPT ![self] = localData]
                          /\ pc' = [pc EXCEPT ![self] = "H"]
                       \/ /\ localVer # threadVer[self]
                          /\ pc' = [pc EXCEPT ![self] = "I"]
                          /\ threadData' = threadData
                    /\ UNCHANGED <<redisVer, localVer, localData, threadVer, DBData>>

SetLocal(self) == /\ pc[self] = "I"
                  /\ localVer' = threadVer[self]
                  /\ localData' = DBData
                  /\ threadData' = [threadData EXCEPT ![self] = DBData]
                  /\ pc' = [pc EXCEPT ![self] = "H"]
                  /\ UNCHANGED <<redisVer, threadVer, DBData>>

ReturnResult(self) == /\ pc[self] = "H"
                      /\ pc' = [pc EXCEPT ![self] = "Done"]
                      /\ UNCHANGED <<redisVer, localVer, threadVer, localData, threadData, DBData>>

Again(self) == /\ pc[self] = "Done"
               /\ pc' = [pc EXCEPT ![self] = "A"]
               /\ UNCHANGED <<redisVer, localVer, threadVer, localData, threadData, DBData>>

Terminating == /\ \A self \in ProcSet: pc[self] = "Done"
               /\ UNCHANGED vars

Proceed(t) == ReadRedis(t) \/ SetRedis(t) \/ CheckLocal(t) \/ SetLocal(t) \/ ReturnResult(t) \/ Again(t)

Next == \/ RedisExpire
        \/ \E t \in ProcSet: Proceed(t)

FairForEveryone == \A t \in ProcSet: SF_vars(Proceed(t))

Spec == /\ Init /\ [][Next]_vars /\ FairForEveryone

symm == Permutations({r1, r2, r3}) \union Permutations({t1, t2, t3})

EventualCons == \A v \in DataDomain: DBData = v ~> threadData = [t \in ProcSet |-> v]
ECSpec == Spec /\ EventualCons

====

// ======= cfg file ========

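CONSTANTS
    DataDomain = {r1, r2}
    r1 = r1
    r2 = r2
    r3 = r3
    ProcSet = {t1, t2, t3}
    t1 = t1
    t2 = t2
    t3 = t3
\* Assumed: the original listing shows only the constants.
SPECIFICATION Spec
PROPERTY EventualCons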


