Hi team, I need help understanding the rebalancing process in Kafka. I am seeing a complete stop-the-world situation in my applications, and rebalances are taking a very long time.
I have an application running in Kubernetes that can scale up to 20 pods. During scale-up/down, as the number of consumers increases or decreases, I see all the consumers stop, and no messages are consumed for 1-5 minutes, until the rebalance completes.
I read multiple articles and followed the suggestion to use CooperativeStickyAssignor as the partition assignment strategy, but that does not seem to solve the problem. I also added a rebalance listener, but that does not seem to be of any use either.
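For reference, a minimal sketch of the kind of consumer configuration described above, assuming the cooperative assignor is set through the standard `partition.assignment.strategy` key. The broker address, group id, and String deserializers are placeholders, not values from this thread, and the class name `CooperativeConsumerConfig` is hypothetical. Plain string keys are used so the snippet compiles without kafka-clients on the classpath.

```java
import java.util.Properties;

class CooperativeConsumerConfig {
    // Builds consumer properties that opt in to cooperative (incremental) rebalancing.
    // bootstrapServers and groupId are placeholders supplied by the caller.
    static Properties build(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Assumption: String keys and values; swap in the deserializers your topics need.
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // The assignor mentioned in the thread, referenced by its fully qualified class name.
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092", "example-group");
        System.out.println(props.getProperty("partition.assignment.strategy"));
    }
}
```

The resulting `Properties` object would be passed to the `KafkaConsumer` constructor. Every member of the group needs to list the same assignor, otherwise the group cannot agree on a rebalance protocol.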
Yeah, group rebalances are stop-the-world events, and this is a known problem when scaling up and down in K8s. How many consumers are in the entire consumer group?
5 topics, each with 8 partitions. I am using the kafka-clients-3.5.1 Java lib. I tried multiple things, including CooperativeStickyAssignor, but nothing seems to work, and it causes duplicates. I also saw some open bugs around this, which say that it does not work when a rebalance is in progress and we are calling commitSync, etc.
As a fix, I have reverted to the default range assignor. I am now using topic-wise consumer groups to limit the number of consumers per group and get faster rebalances.
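A rough sketch of what a topic-wise consumer group layout could look like, assuming "topic-wise CG" means one consumer group per topic. The `<prefix>-<topic>` naming scheme, the topic names, and the class `TopicWiseGroups` are hypothetical. The helper `activeConsumers` reflects the fact that a partition is assigned to at most one consumer within a group, so a group reading a single 8-partition topic has at most 8 consumers doing work.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopicWiseGroups {
    // Maps each topic to its own consumer group id, "<prefix>-<topic>".
    // The naming scheme is an illustration, not something stated in the thread.
    static Map<String, String> groupIdsFor(String prefix, List<String> topics) {
        Map<String, String> ids = new LinkedHashMap<>();
        for (String topic : topics) {
            ids.put(topic, prefix + "-" + topic);
        }
        return ids;
    }

    // Within one group, consumers beyond the partition count receive no partitions.
    static int activeConsumers(int partitions, int consumers) {
        return Math.min(partitions, consumers);
    }

    public static void main(String[] args) {
        // Hypothetical topic names standing in for the 5 topics mentioned above.
        List<String> topics = List.of("topic-a", "topic-b", "topic-c", "topic-d", "topic-e");
        groupIdsFor("my-app", topics)
                .forEach((topic, groupId) -> System.out.println(topic + " -> " + groupId));
        System.out.println(activeConsumers(8, 20));
    }
}
```

With this layout, a membership change in one group triggers a rebalance only for that group's topic, which is one plausible reason the rebalances became faster.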
Ah interesting, which bugs, out of curiosity? Glad the range assignor worked for you, but cooperative-sticky shouldn’t be slower, so this might be a bug I can work on.