Understanding the rebalancing process in Kafka and optimizing using CooperativeStickyAssignor

Hi Team, I need help understanding the rebalancing process in Kafka. I am seeing a complete stop-the-world situation for my applications, and the rebalance is taking a very long time.

I have an application running in Kubernetes that can scale up to 20 pods. During scale up/down, as consumers are added or removed, I see all the consumers stop and messages go unconsumed for 1-5 minutes, until the rebalance completes.

I have read multiple articles and followed the suggestion of using CooperativeStickyAssignor as the partition assignment strategy, but that does not seem to solve the problem. I also added a rebalance listener, but it does not seem to help.

Please help me understand this problem and the behaviour of CooperativeStickyAssignor: https://www.confluent.io/blog/incremental-cooperative-rebalancing-in-kafka/ https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol
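
For reference, this is roughly how I wire up the assignor and the rebalance listener (a simplified sketch; the broker address, group id, and topic names are placeholders, not my actual values):

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class CooperativeConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");        // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Switch from the default eager assignors to the cooperative protocol
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        // Rebalance listener: with the cooperative protocol, onPartitionsRevoked
        // only receives the partitions that are actually being moved away.
        consumer.subscribe(List.of("topic-a", "topic-b"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                System.out.println("Revoked: " + partitions);
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println("Assigned: " + partitions);
            }
        });

        while (true) {
            consumer.poll(Duration.ofMillis(500)).forEach(record ->
                    System.out.printf("%s-%d@%d%n", record.topic(), record.partition(), record.offset()));
        }
    }
}
```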

Yeah, group rebalances are stop-the-world events, and this is a known problem when scaling up and down in K8s. How many consumers are in the entire consumer group?

There could be 8-20 consumers

And how many topic partitions do they subscribe to?

They have around 8 partitions, 5 topics.

I can see the CooperativeStickyAssignor strategy working correctly, but it is still sometimes taking a very long time to rebalance.

I also found that CooperativeStickyAssignor is causing duplicates.

So, is that 8 partitions across 5 topics or 5 topics with 8 partitions each?

It would be good to capture some logs and/or metrics on this to see what’s going on. Which Kafka client library are you using?

5 topics, each with 8 partitions. I am using the kafka-clients 3.5.1 Java library. I have tried multiple things, including CooperativeStickyAssignor, but nothing seems to work out, as it is causing duplicates. I also saw some open bugs around this, where it is said that it does not work correctly when a rebalance is in progress and we call commitSync, etc.
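
Roughly what my poll/commit loop looks like, for context (a simplified sketch with processing elided; retrying the commit after the next poll when a rebalance is in progress is just the workaround I tried, not a confirmed fix for the linked bugs):

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.RebalanceInProgressException;

public class SafeCommitLoop {

    // Drive the poll/commit loop for an already-configured consumer.
    static void run(KafkaConsumer<String, String> consumer) {
        Map<TopicPartition, OffsetAndMetadata> pending = new HashMap<>();
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            records.forEach(record -> {
                // process(record) ...
                pending.put(new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
            });
            try {
                if (!pending.isEmpty()) {
                    consumer.commitSync(pending);
                    pending.clear();
                }
            } catch (RebalanceInProgressException e) {
                // A cooperative rebalance is in flight; keep the pending offsets
                // and retry the commit after the next poll() completes the rebalance.
            }
        }
    }
}
```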

As a fix, I have reverted to the default range assignor. I am now using a separate consumer group per topic to limit the number of consumers in each group and to get faster rebalances.
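
Something like this per topic (again just a sketch; the broker address and group naming are placeholders):

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.RangeAssignor;

public class PerTopicGroups {

    // Build a consumer that belongs to a group dedicated to a single topic,
    // so a rebalance in one group does not pause consumption of the other topics.
    static KafkaConsumer<String, String> consumerFor(String topic) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "app-" + topic);            // one group per topic (placeholder naming)
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Back to the eager range assignor instead of cooperative-sticky.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                RangeAssignor.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of(topic));
        return consumer;
    }
}
```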

Ah interesting, which bugs, out of curiosity? Glad the range assignor worked for you, but cooperative-sticky shouldn't be slower, so it might be a bug I can work on.

https://github.com/reactor/reactor-kafka/issues/314

I found 1-2 more Stack Overflow links with a similar problem.