Postgres 15.2 (AWS Aurora RDS) autovacuum issue - dead tuples not cleaning up

I am seeing some weird behaviour with our Postgres 15.2 (AWS Aurora RDS) database. We have pretty heavy load during specific times of the day, and at some point (auto)vacuuming stops working properly on any table. We have 1 writer and 4 reader instances, and we have phases of downtime where autovacuum should be able to catch up on cleaning the tables. At some point, though, neither a manual vacuum nor autovacuum is able to clean up dead tuples anymore.
While it was in this state I created a completely new table, added some rows and ran some delete and update operations. Nothing other than me was accessing it, and the dead tuples still remained after all vacuum attempts. This behaviour causes the number of dead tuples to grow to unsustainable levels. It is not just large tables that cannot be vacuumed anymore, but all tables, even one with only 4 live tuples and 1 dead tuple. Has anyone ever experienced anything like this? (more info in the thread)

• I don’t see any long running queries, stale replication slots, … Basically anything mentioned in this post:
• After a restart I can, for a short while, vacuum some of the bigger tables and clean them up before we get into this weird lock state again
• We increased our autovacuum_naptime to 1800 (we had the same problems at 60 and 600) because our ACU utilization was causing us problems
• We increased our autovacuum_freeze_max_age and vacuum_freeze_table_age to 750 million and 600 million respectively
• Also, I haven’t seen any long running vacuum jobs. Even cleaning up 23 million dead tuples took only around 30 seconds when I did it manually.
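For anyone wanting to repeat the checks in the first bullet, here is a rough sketch of how the usual suspects that hold back the vacuum cutoff could be listed in one go. It only uses standard PostgreSQL catalog views (pg_stat_activity, pg_replication_slots, pg_prepared_xacts). The names XMIN_HOLDER_QUERIES, rank_holders and oldest_xmin_holders are made up for this example, and conn is assumed to be an already open DB-API connection (for example from psycopg2) to the writer instance. It only sees sessions on the instance it is connected to, so it is a starting point rather than a complete diagnosis.

```python
# Each query returns rows of (source, detail, xmin_age), where xmin_age is
# how many transaction IDs old the held-back snapshot is.
XMIN_HOLDER_QUERIES = {
    # Sessions whose snapshot still needs old row versions.
    "backend": """
        SELECT 'backend', pid::text || ' ' || coalesce(state, ''),
               age(backend_xmin)
        FROM pg_stat_activity
        WHERE backend_xmin IS NOT NULL
    """,
    # Replication slots pinning an old xmin or catalog_xmin.
    "replication_slot": """
        SELECT 'replication_slot', slot_name::text,
               greatest(age(xmin), age(catalog_xmin))
        FROM pg_replication_slots
        WHERE xmin IS NOT NULL OR catalog_xmin IS NOT NULL
    """,
    # Prepared (two-phase) transactions that were never committed.
    "prepared_xact": """
        SELECT 'prepared_xact', gid, age(transaction)
        FROM pg_prepared_xacts
    """,
}


def rank_holders(rows, limit=5):
    """Return the rows with the largest xmin age first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:limit]


def oldest_xmin_holders(conn, limit=5):
    """Run all three queries on conn (assumed DB-API) and rank the results."""
    rows = []
    with conn.cursor() as cur:
        for sql in XMIN_HOLDER_QUERIES.values():
            cur.execute(sql)
            rows.extend(cur.fetchall())
    return rank_holders(rows, limit)
```

If the top entry has an age in the millions while vacuum removes nothing, that holder is a plausible candidate for what is blocking cleanup. If all three come back empty or young, the cause is likely somewhere this sketch does not look.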

Yes, but on Aurora Serverless v1, which was utter garbage. Since moving to v2 most of our DB problems went away. I think the underlying cause was probably in the replication logic, but that’s a guess rather than anything I’ve got strong evidence for.

Also take a look at - might give you a pointer or two.

Do manual vacuums always work? Or do they also fail to clean anything up, or take so long as to be useless?

Manual vacuums did not work either. They don’t take long, but they don’t clean up anything. We have it under control for now after changing the table freeze and other parameters a bit, but it is still weird that vacuum could not clean up any table.

We are already using serverless v2.