Hey, we have a number of Kafka clusters where the first broker consistently shows above-average CPU usage. Looking deeper: it doesn’t have more partitions or more messages than the other brokers, but its network thread usage is high. I’m assuming this is mainly caused by bootstrapping, with all of our clients (we have many!) targeting the first broker in the list first. We’ve eliminated most of the cases of clients bootstrapping excessively, but overall there are simply too many clients to make sure everyone is integrating optimally, so we need to accept some overhead.
I’m thinking about what could be done about this. My thoughts:
- Actually make sure the busiest clients have a “random” order of bootstrap brokers
  a. Talking to ~10 teams would likely make the spread “random”
- Revisit our Cruise Control setup to ensure the first brokers get less load
  a. If so: how?
- Actually run a larger instance type for the first broker to cater for the load
  a. Seems like bad practice, but it also doesn’t make sense to scale up all brokers given the load spread
- There’s a risk I am wrong and it’s not bootstrapping. Any other ideas on why the "first" broker would consistently be hit harder?
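
For the first idea, here is a rough sketch of what a client team could do, assuming a Python client that takes a comma-separated `bootstrap.servers` string in its config. The helper name `shuffled_bootstrap_servers` and the broker addresses are made up for illustration, and whether the order matters at all depends on how the client library picks its first broker:

```python
import random


def shuffled_bootstrap_servers(servers, rng=None):
    """Return the same comma-separated broker list in a random order."""
    rng = rng or random.Random()
    # Split on commas and drop stray whitespace or empty entries.
    hosts = [h.strip() for h in servers.split(",") if h.strip()]
    rng.shuffle(hosts)
    return ",".join(hosts)


# Hypothetical broker addresses, not taken from any real cluster.
BOOTSTRAP = "broker-1:9092,broker-2:9092,broker-3:9092"

# The shuffled string is passed wherever the client expects its bootstrap list.
client_config = {"bootstrap.servers": shuffled_bootstrap_servers(BOOTSTRAP)}
```

Shuffling once at process start keeps the change to a single line per client, which is about as much as can be asked of ~10 teams.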