Hello everyone, I have an issue for which I was not able to find a solution or any kind of workaround.
I want to upgrade k8s with a blue/green approach - the idea is to create a new cluster on the higher version (green) and, if everything works fine there, decommission the old cluster (blue).
However, I have MongoDB and RabbitMQ StatefulSets, which complicates the process.
For RabbitMQ I can probably use federation and the shovel plugin (though it is not possible to keep an active-active connection to both blue and green).
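For illustration, this is roughly the kind of dynamic shovel I have in mind, declared against the blue broker's management API. Hostnames, credentials and the queue name are placeholders, and it assumes the shovel plugin and management API are enabled:

```python
# Sketch: declare a dynamic shovel on the old (blue) broker that drains a queue
# into the new (green) broker, via the RabbitMQ management HTTP API.
# All hostnames, credentials and the queue name below are placeholders.
import requests

BLUE_MGMT = "http://rabbitmq.blue.example.com:15672"                 # assumed management endpoint
BLUE_AMQP = "amqp://user:pass@localhost:5672/%2F"                    # source (blue) broker
GREEN_AMQP = "amqp://user:pass@rabbitmq.green.example.com:5672/%2F"  # destination (green) broker
QUEUE = "orders"                                                     # example queue to migrate

shovel = {
    "value": {
        "src-protocol": "amqp091",
        "src-uri": BLUE_AMQP,
        "src-queue": QUEUE,
        "dest-protocol": "amqp091",
        "dest-uri": GREEN_AMQP,
        "dest-queue": QUEUE,
        # remove the shovel once the queue's initial backlog has been transferred
        "src-delete-after": "queue-length",
    }
}

# PUT /api/parameters/shovel/{vhost}/{name} creates or updates the shovel
resp = requests.put(
    f"{BLUE_MGMT}/api/parameters/shovel/%2F/migrate-{QUEUE}",
    json=shovel,
    auth=("user", "pass"),
)
resp.raise_for_status()
```

With "src-delete-after" set to "queue-length" the shovel drains whatever is in the queue when it starts and then removes itself, which fits a one-way cut-over rather than active-active.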
On the MongoDB side I thought I would create a separate instance in green (the new cluster), add it as a member of the MongoDB replica set in blue (the old EKS cluster), and then promote that instance to primary in green. However, I am not sure how MongoDB will handle the communication between the two sides…
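Roughly what I'm picturing for the replica set part, assuming the green mongod is reachable from blue and vice versa (hostnames and the replica set name are made up):

```python
# Sketch: add the MongoDB instance running in the green cluster as a member of the
# blue replica set, let it catch up, then force an election so it becomes primary.
# Hostnames and the replica set name are placeholders.
from pymongo import MongoClient
from pymongo.errors import AutoReconnect

GREEN_HOST = "mongo-0.green.example.com:27017"

client = MongoClient("mongodb://mongo-0.blue.example.com:27017/?replicaSet=rs0")

# Fetch the current replica set configuration and append the green member,
# non-voting and with priority 0 until it has caught up.
cfg = client.admin.command("replSetGetConfig")["config"]
cfg["version"] += 1
cfg["members"].append({
    "_id": max(m["_id"] for m in cfg["members"]) + 1,
    "host": GREEN_HOST,
    "priority": 0,
    "votes": 0,
})
client.admin.command("replSetReconfig", cfg)

# ... once rs.status() shows the green member as SECONDARY and caught up,
# raise its priority and step the old primary down so green wins the election.
cfg = client.admin.command("replSetGetConfig")["config"]
cfg["version"] += 1
for m in cfg["members"]:
    if m["host"] == GREEN_HOST:
        m["priority"] = 10
        m["votes"] = 1
client.admin.command("replSetReconfig", cfg)

try:
    client.admin.command("replSetStepDown", 120)  # seconds the old primary stays secondary
except AutoReconnect:
    pass  # older servers drop connections on stepDown; this is expected
```

My worry is exactly the networking part: every member has to resolve and reach every other member's advertised hostname, so the green instance would have to be exposed to blue (and the blue members to green) before any of this works.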
Perhaps someone with extensive k8s knowledge could guide me a little or advise on anything here?
This seems like a nightmare migration to me, and you would definitely want to test it out in dev/test before doing it in a prod cluster. Which raises the question of whether there is any benefit to doing it that way versus just testing the in-place upgrade in the same order. Given the new version-skew policy (https://kubernetes.io/releases/version-skew-policy/) you can run your control plane up to 3 minor versions ahead of your worker nodes and either gradually upgrade your workers, or add new ones at the new version and gradually roll to them.
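To make the "add new nodes and roll to them" part concrete, here is a rough sketch with the official Python client. It just cordons nodes still on the old kubelet version and deletes their pods so the controllers reschedule them onto the new node group; the version string is an example, and kubectl drain does this more carefully (PDBs, eviction API, etc.):

```python
# Sketch: cordon nodes still on the old kubelet version, then delete their pods
# with a grace period so Deployments/StatefulSets reschedule them onto the new
# node group. The version prefix is an example value.
from kubernetes import client, config

OLD_VERSION_PREFIX = "v1.26"  # nodes still on the old kubelet version

config.load_kube_config()
v1 = client.CoreV1Api()

old_nodes = [
    n for n in v1.list_node().items
    if n.status.node_info.kubelet_version.startswith(OLD_VERSION_PREFIX)
]

for node in old_nodes:
    # cordon: mark the node unschedulable so nothing new lands on it
    v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})

for node in old_nodes:
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node.metadata.name}"
    )
    for pod in pods.items:
        # skip pods owned by DaemonSets; they are bound to the node anyway
        owners = pod.metadata.owner_references or []
        if any(o.kind == "DaemonSet" for o in owners):
            continue
        v1.delete_namespaced_pod(
            pod.metadata.name, pod.metadata.namespace, grace_period_seconds=30
        )
```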
FWIW, we have done in-place upgrades from 1.15 through the current 1.27 in 3-version increments with no issues or downtime. YMMV. I don't think your stateful workloads have much interaction with the k8s control plane, and that is the main thing you would be "blue-greening" across versions.
Thank you for taking the effort to provide some insights. It is really valuable.
The reason I chose this approach is the possibility of rollback. As I understand it, you can't roll back once your control plane is upgraded; you can only run your node groups at a lower or higher version than the control plane. Am I correct? Or is there a way around this?
It's true, rollback is messy - basically a restore from backup. On the other hand, standing still means falling out of support. Upgrading is inevitable, so you have to fix the issues. Our approach is to find the issues in lower environments before doing it in prod.
We always test the upgrade multiple times first in ephemeral/sandbox clusters to get the automation down. Then we do it in customer dev and see whether issues surface with customer deployments, etc. If issues are found, we focus on fixing them at the deployment level, since standing still isn't really an option.
Then we move to test, and by the time we do prod we are pretty confident we won't have to roll back.
If you aren't already using it, have a look at kubent (kube-no-trouble) for finding deprecated API object versions in your cluster.
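For a feel of what kubent automates, here is a very rough sketch that inspects the last-applied-configuration annotation on Deployments and flags apiVersions from a small example removal table. The table is a tiny, partial example rather than an authoritative list, and kubent covers far more object kinds and sources:

```python
# Sketch: flag objects whose last-applied manifest still uses a removed apiVersion.
# The REMOVED table below holds a few example entries only.
import json
from kubernetes import client, config

REMOVED = {
    "extensions/v1beta1": "Deployment/Ingress removed in 1.16/1.22",
    "policy/v1beta1": "PodSecurityPolicy/PodDisruptionBudget removed in 1.25",
    "batch/v1beta1": "CronJob removed in 1.25",
}

config.load_kube_config()
apps = client.AppsV1Api()

for deploy in apps.list_deployment_for_all_namespaces().items:
    annotations = deploy.metadata.annotations or {}
    last_applied = annotations.get("kubectl.kubernetes.io/last-applied-configuration")
    if not last_applied:
        continue
    api_version = json.loads(last_applied).get("apiVersion", "")
    if api_version in REMOVED:
        print(f"{deploy.metadata.namespace}/{deploy.metadata.name}: "
              f"{api_version} ({REMOVED[api_version]})")
```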