I'm having trouble with rebalancing in my Kafka Streams app. I'm using 6.0.1-ce (Kafka 2.6.x).
A batch job writes about 100 million messages into the input topic. My Kafka Streams app usually runs with only one node; once the messages arrive, the system scales up to 5 nodes.
The input topic has 30 partitions. The Kafka Streams app does a group-by and an aggregation into an in-memory state store with changelogging enabled, and the repartition topic also has 30 partitions.
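For reference, the topology is roughly this shape (a minimal sketch only; the topic name, store name, and the aggregation itself are placeholders, not my real code):

```java
import java.util.Collections;

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.Stores;

public class TopologySketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        builder.<String, String>stream("input-topic")   // 30 partitions
               // group-by on a non-key value -> creates the repartition topic (also 30 partitions)
               .groupBy((key, value) -> value)
               // aggregation backed by an in-memory store with its changelog topic enabled
               .count(Materialized.<String, Long>as(Stores.inMemoryKeyValueStore("agg-store"))
                                  .withLoggingEnabled(Collections.emptyMap()));

        // print the topology, including the generated repartition and changelog topics
        System.out.println(builder.build().describe());
    }
}
```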
The problem: initially all 60 partitions (input + repartition) are assigned to the single node. After scaling up, the 30 input-topic partitions are spread across the nodes and their messages are consumed quickly; within two hours, the lag on every input-topic partition is zero.
After 14 hours (a rebalance happens every 15 minutes), only about 8 repartition-topic partitions have been reassigned to the other 4 nodes; the original node still holds 22 of them. CPU metrics show the original node stays busy while the other 4 nodes are almost idle. kafka-consumer-groups shows that only the repartition partitions still assigned to the original node have lag; every other partition of the input and repartition topics has zero lag.
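For context, these are the task-assignment settings I understand govern warmup and probing rebalances in 2.6.x, shown at what I believe are the defaults (treat the exact values as an assumption; my deployment's values may differ):

```java
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class AssignmentConfigSketch {
    public static void main(String[] args) {
        // KIP-441 (high-availability task assignment) settings that control how
        // quickly stateful tasks move to new nodes; values are the 2.6 defaults
        // as I understand them, not necessarily what my app runs with.
        Properties props = new Properties();
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 0);                     // no standby replicas
        props.put(StreamsConfig.MAX_WARMUP_REPLICAS_CONFIG, 2);                      // at most 2 warmup tasks at a time
        props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 10_000L);            // task moves only once a node is within this lag
        props.put(StreamsConfig.PROBING_REBALANCE_INTERVAL_MS_CONFIG, 10 * 60_000L); // probing rebalance interval
        System.out.println(props);
    }
}
```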
Why aren't the repartition-topic partitions reassigned to the other nodes? I thought cooperative rebalancing takes lag into account.
How can I fix it?