放牛的

放牛的日子,是人生初恋的诗...

0%

Raft共识算法

关于Raft共识算法,网上资料很多,看完了往往还是一头雾水。

有2个很好的站点,可以动态的观察Raft执行过程:

关于为什么取名为Raft,特地查了一下,在Google groups里面找到了答案:

People ask me why it’s called “Raft” every now and then, and I didn’t have anything public on that until now. I wrote this up earlier in the year in a private email. I want a URL for it, so I’m sending it here and spamming all of you.

There’s a few reasons we came up with the name Raft:

- It’s not quite an acronym, but we were thinking about the words ‘reliable’, ‘replicated’, ‘redundant’, and ‘fault-tolerant’.

- We were thinking about logs and what can be built using them.

- We were thinking about the island of Paxos and how to escape it.

As a plus, we were using the randomly generated name Cheesomi in the paper before we came up with the name Raft in September 2012. The name appeared just over 100 times in our paper submission back then, so switching to the shorter name actually helped shrink the paper down quite a bit.

If you want even more detail, we had trouble coming up with a good name, so we made it an explicit agenda item during a RAMCloud meeting. I found two photos of the whiteboard during/after that meeting that I attached here. Looks like the top contenders were Raft, Knox (as in Fort Knox, I guess), Redundo, and Cloudsense (no clue). I don’t remember the details of how we ended up with Raft, since it didn’t obviously win the vote, but I do remember the name caught on really quickly. People seemed to like it almost right away. I’m so fucking glad it’s not called Redundo.

Thanks for listening,

Diego

1. Raft共识算法用来解决什么问题?

简言之,用来解决分布式存储一致性的问题。用来协调集群节点答成共识。

2. Raft怎么做到的?

主要通过2件事:Leader选举 和 日志复制。

2.1 Leader选举

  • 节点状态
    • 角色(state)
      • Follower
      • Candidate
      • Leader
    • 任期(term)
      • 递增的数字,每一次选举,增1
    • 日志索引(log index)
      • 递增的数字,每一次请求增1
  • 2个超时时间
    • Candidate timeout
      • Follower、Candidate没有收到心跳时
    • heartbeat timeout
      • Leader每隔该时间,固定发送心跳包
  • 角色变更
    • 开始:全部是Follower
    • Follower
      • 选举超时:从Follower变成Candidate
    • Candidate
      • 收到多数节点票数:从Candidate变成Leader
      • 发现当前Leader 或 更新的Term:从Candidate 变成 Follower
    • Leader
      • 发现拥有更高Term或更大Index的节点:从Leader变成Follower
  • 消息
    • Request Vote(Candidate发起,请求投票)
      • 发起方:Candidate
      • 请求:term,index
      • 返回:投票 1 或 0
    • Append Entries(Leader发起,心跳)
      • 发起方:Leader
      • 请求:term,index,entries
      • 返回:

2.2 日志复制

  • Leader收到Client请求
  • Leader先写自己的Uncommitted日志
  • 下一次心跳将数据发送到Follower
  • Leader收到多数Follower写uncommitted成功,Leader 改为Committed,并返回客户端成功
  • 下一次心跳,Follower改为Committed