Managing Kafka clusters to prevent disk storage issues using read-only mode

Hi everyone,

My team and I provide managed Kafka clusters for our organization, and we’ve encountered an issue where clients sometimes leave producers running on non-production clusters, causing the Kafka data disk to run out of storage. This has led to brokers failing due to the lack of available disk space. Our current remedy is to increase the data disk size, then delete messages and shorten the retention period to free up space.

We’re looking for a “read-only” solution to prevent this. Specifically, we want to disable producing to the cluster once the VM’s disk usage reaches a certain threshold (e.g., 80%), while still allowing consuming. Our proposed solution is to raise min.insync.replicas above the number of brokers in the cluster. Produce requests would then fail because the required number of in-sync replicas can never be met, while consuming would keep working. We plan to apply this change dynamically through the broker configuration, and once disk usage drops back below a lower threshold (e.g., 70%), we would disable the read-only mode and revert min.insync.replicas to its normal value.
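As a rough sketch of what we have in mind, using the Java Admin API (the broker address and the three-broker example are placeholders, and this assumes min.insync.replicas is dynamically updatable cluster-wide on our Kafka version):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class ReadOnlyToggle {

    // An empty resource name targets the cluster-wide default broker config
    // (same scope as kafka-configs.sh --entity-type brokers --entity-default).
    private static final ConfigResource CLUSTER_DEFAULT =
            new ConfigResource(ConfigResource.Type.BROKER, "");

    // Disk usage >= 80%: raise min.insync.replicas above the broker count so
    // acks=all produce requests fail with NOT_ENOUGH_REPLICAS.
    static void enableReadOnly(Admin admin, int brokerCount) throws Exception {
        AlterConfigOp raise = new AlterConfigOp(
                new ConfigEntry("min.insync.replicas", String.valueOf(brokerCount + 1)),
                AlterConfigOp.OpType.SET);
        admin.incrementalAlterConfigs(Map.of(CLUSTER_DEFAULT, List.of(raise)))
             .all().get();
    }

    // Disk usage back under 70%: delete the dynamic override so the static
    // broker default applies again.
    static void disableReadOnly(Admin admin) throws Exception {
        AlterConfigOp revert = new AlterConfigOp(
                new ConfigEntry("min.insync.replicas", ""),
                AlterConfigOp.OpType.DELETE);
        admin.incrementalAlterConfigs(Map.of(CLUSTER_DEFAULT, List.of(revert)))
             .all().get();
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092"); // placeholder address

        try (Admin admin = Admin.create(props)) {
            enableReadOnly(admin, 3); // e.g. a 3-broker cluster -> min.isr = 4
            disableReadOnly(admin);
        }
    }
}
```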

We understand that this solution may not be considered best practice, as it relies on broker configurations and could potentially be overridden by topic-level configurations. However, it’s the best solution we’ve found so far.

Does anyone have thoughts on this approach or suggestions for a better solution?

Thanks in advance for your input!

I have not thought of this before, but setting min.insync.replicas would only work if producers are sending with acks=all; with acks=0 or acks=1 they would still be able to produce.
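To illustrate the gap, a quick producer sketch (broker address and topic name are made up) that would keep writing regardless of min.insync.replicas:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AcksBypassDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092"); // placeholder
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // acks=1: only the partition leader has to acknowledge, so
        // min.insync.replicas is never checked and the "read-only" trick is
        // bypassed. Only acks=all (or acks=-1) enforces the ISR minimum.
        props.put("acks", "1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("some-topic", "key", "value"));
        }
    }
}
```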

Another option is ACLs: remove publishing permissions when the threshold is met. That way you can target the most troublesome publisher instead of punishing all producers.
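Roughly like this, as a sketch with the Java Admin API, assuming an authorizer is enabled and producers authenticate as principals (“User:noisy-producer” is a made-up name):

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.acl.AccessControlEntryFilter;
import org.apache.kafka.common.acl.AclBindingFilter;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePatternFilter;
import org.apache.kafka.common.resource.ResourceType;

public class RevokeProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Match every ALLOW Write ACL held by the offending principal,
            // on any topic (null name / null host match anything).
            AclBindingFilter filter = new AclBindingFilter(
                    new ResourcePatternFilter(ResourceType.TOPIC, null, PatternType.ANY),
                    new AccessControlEntryFilter("User:noisy-producer", null,
                            AclOperation.WRITE, AclPermissionType.ALLOW));

            // Deleting the allow rule revokes produce rights for that client
            // only; consumers and other producers are untouched.
            admin.deleteAcls(List.of(filter)).all().get();
        }
    }
}
```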

You could establish quotas and throttle production down as well (again, just thinking out loud on this one).
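Something along these lines with the Java Admin API; it’s only a sketch, the 1 MB/s default producer rate is arbitrary, and note that quotas throttle producers rather than reject them outright:

```java
import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class ThrottleProducers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // A null entity value means "the default for all client IDs",
            // i.e. every producer without a more specific quota.
            ClientQuotaEntity allClients = new ClientQuotaEntity(
                    Collections.singletonMap(ClientQuotaEntity.CLIENT_ID, null));

            // Throttle produce throughput to ~1 MB/s per broker; the broker
            // delays responses instead of refusing writes.
            ClientQuotaAlteration.Op throttle =
                    new ClientQuotaAlteration.Op("producer_byte_rate", 1_048_576.0);
            admin.alterClientQuotas(
                    List.of(new ClientQuotaAlteration(allClients, List.of(throttle))))
                 .all().get();

            // To lift the throttle later, set the value to null, which
            // removes the quota entry.
            ClientQuotaAlteration.Op clear =
                    new ClientQuotaAlteration.Op("producer_byte_rate", null);
            admin.alterClientQuotas(
                    List.of(new ClientQuotaAlteration(allClients, List.of(clear))))
                 .all().get();
        }
    }
}
```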

I would do the ACL thing mentioned, if for no other reason than that changing the default min.insync.replicas is either a topic-level configuration (with LOTS of topics, that could take a long time) or requires a broker restart (unless your Kafka version supports dynamic cluster-wide broker configs).

+1 to throttling and quotas, this is what we do.

And blocking publishing via ACLs as a last resort also sounds good.

Whether it’s easier to educate your users or you need to implement automation for this will depend on your organization. We run a performance environment with low retention and accept that things can get a bit messy there in times of high test rates, while other, stable testing runs in a different environment. We don’t accept any performance testing being run in the stable testing environment.

Hey all! Thanks for your answers.
Do you have an example of throttling & quota limits?
From what we tried, it didn’t work as expected and didn’t manage to block producing the way we needed, but maybe we didn’t configure it properly.

An ACL-based solution is less suitable because it places a significant portion of the responsibility on the client side, unless there’s a global ACL that can be applied and removed to affect everyone. Do you know of something like that?

Since these are non-production clusters, the goal is essentially to be able to reach an “out of disk” state without Kafka crashing and without needing to increase storage.

ACLs do have a global scope: you can DENY produce (Write) requests for everyone. And remember that deny rules take precedence, so a global DENY will block producing even where a global or topic-level ALLOW exists; removing it restores normal behavior.
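As a sketch with the Java Admin API (the broker address is a placeholder; “User:*” is the wildcard principal and the literal topic name “*” matches all topics):

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class GlobalProduceDeny {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // DENY Write on every topic for every principal. Because deny
            // rules win over allow rules, this blocks all producing while
            // leaving Read (consume) ACLs untouched.
            AclBinding denyAll = new AclBinding(
                    new ResourcePattern(ResourceType.TOPIC, "*", PatternType.LITERAL),
                    new AccessControlEntry("User:*", "*",
                            AclOperation.WRITE, AclPermissionType.DENY));

            admin.createAcls(List.of(denyAll)).all().get();

            // Reverting read-only mode would delete this binding again via
            // admin.deleteAcls(List.of(denyAll.toFilter())).
        }
    }
}
```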