Hello! We have a kafka/zookeeper setup on kubernetes, and I’ve noticed that whenever we do a rolling restart of the zookeeper nodes, we see a brief spike in Under Replicated Partitions and Offline Partitions in our kafka cluster. Have y’all seen this before? I’ll write down everything I’ve observed/tried in a thread here
• Deleting just the zookeeper leader does not lead to the urp spike
• Deleting just a zookeeper follower replica does not lead to the urp spike
• The zookeeper leader comes back up correctly after a restart. We have a readiness probe configured that checks that the `srvr` four-letter command returns correctly.
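For reference, the probe is roughly the following (I’m paraphrasing from memory, so treat the port and exact command as a sketch rather than our real manifest):
```yaml
readinessProbe:
  exec:
    # send the ZooKeeper "srvr" four-letter command to the local server and
    # fail the probe if nothing comes back (client port 2181 assumed here)
    command:
      - sh
      - -c
      - '[ -n "$(echo srvr | nc -w 2 127.0.0.1 2181)" ]'
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```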
• I am trying to slow down the restarts between zookeeper pods. Kubernetes readiness probes have an initialDelaySeconds field that controls how long after container start the first readiness check runs. I’ve set this value to 300s, and I’m still seeing some urps when the second zookeeper pod is restarted
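Concretely that’s just an initialDelaySeconds on the same probe, assuming the zookeeper pods are managed by a StatefulSet (300 is the value I’ve been testing with, not something we’ve settled on):
```yaml
readinessProbe:
  # wait 5 minutes after the container starts before the first readiness check;
  # with a StatefulSet rolling update this also delays the restart of the next pod
  initialDelaySeconds: 300
  exec:
    command:
      - sh
      - -c
      - '[ -n "$(echo srvr | nc -w 2 127.0.0.1 2181)" ]'
```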
• I confirmed that we do not override the zookeeper session timeout (zookeeper.session.timeout.ms, 18s by default) in the broker configs
• When the zookeeper pod exits, I see some logs on the other replicas indicating the connection isn’t shut down cleanly:
```
2022-01-19T17:47:21Z WARNING org.apache.zookeeper.server.quorum.QuorumCnxManager RecvWorker:3 Interrupting SendWorker thread from RecvWorker. sid: 3. myId: 2
2022-01-19T17:47:21Z ERROR org.apache.zookeeper.server.quorum.LearnerHandler LearnerHandler-/<IP> Unexpected exception causing shutdown while sock still open
```
However, restarting just _one_ zookeeper pod does not result in the urp spike; it only happens when I restart two pods in succession.
Are you running 3 pods each for the zookeeper and kafka services?
We have 3 zookeeper pods and 12 kafka brokers