Hi. We’re hitting an issue on RDS (Aurora Serverless, Postgres) where a certain job hits the database, the DB capacity ramps up to 100% utilization and (this is the annoying bit) then stays there, even after all the load has disappeared. Failing over the cluster and promoting the reader to writer seems to wake it up, and it goes back to normal, but I’m wondering if anyone else has run into a similar thing? And does anyone have any interesting thoughts on solving it, other than taking the load off so it doesn’t hit 100% capacity in the first place?
I think the answer is to contact AWS support and ask them what is causing the issue, but it isn’t clear what “DB capacity” refers to. (It sounds like CPU, but if it is, then it isn’t clear what “all the load disappears” means, because CPU is a component of load.)
Capacity as in ACU usage, reported as ACUUtilization or ServerlessDatabaseCapacity - the former is really just the latter expressed as a percentage of total allocation (in this case 100% of 64 ACUs). No, it’s not CPU - that spikes very high (~70%) but then comes down to less than 2%, whilst the ACU capacity (i.e. the thing we’re paying for) sticks at 100% (i.e. 64 ACUs). Ditto for DBLoadNonCPU, which in our case is roughly correlated with the CPU load.
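For anyone wanting to check the same thing on their own cluster, here's a minimal boto3 sketch that pulls both metrics mentioned above. The cluster identifier is a placeholder; it assumes credentials are configured and the cluster publishes to the `AWS/RDS` namespace.

```python
# Sketch: pull ACUUtilization and ServerlessDatabaseCapacity for an
# Aurora Serverless cluster. "my-cluster" is a placeholder.
from datetime import datetime, timedelta, timezone


def build_acu_queries(cluster_id):
    """Build CloudWatch GetMetricData queries for the two ACU metrics."""
    dims = [{"Name": "DBClusterIdentifier", "Value": cluster_id}]
    return [
        {
            "Id": metric.lower(),  # query ids must start lowercase
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": metric,
                    "Dimensions": dims,
                },
                "Period": 300,       # 5-minute datapoints
                "Stat": "Maximum",   # catch the sticky 100% plateau
            },
        }
        for metric in ("ACUUtilization", "ServerlessDatabaseCapacity")
    ]


def fetch_acu_metrics(cluster_id, hours=3):
    """Fetch the metrics; needs boto3 and AWS credentials."""
    import boto3  # imported here so the query builder stays dependency-free
    cw = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    return cw.get_metric_data(
        MetricDataQueries=build_acu_queries(cluster_id),
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
```

Plotting Maximum rather than Average makes the "stuck at 64 ACUs after load drops" pattern obvious at a glance.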
What does the P99 metric tell you?
Since this is serverless, I would ping AWS about it.
Interesting question. How would I even see that?
In the RDS console under monitoring
Click “View in CloudWatch” (important for percentiles)
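If the console view doesn't surface it, the same percentile data can be requested straight from the CloudWatch API. A sketch, with the metric name and cluster id as placeholders - note that percentile statistics are only returned for metrics where CloudWatch can actually compute them:

```python
# Sketch: request a p99 statistic for an RDS metric via the CloudWatch
# API (the same data the console's "View in CloudWatch" link shows).
from datetime import datetime, timedelta, timezone


def build_p99_request(cluster_id, metric_name="ReadLatency"):
    """Keyword arguments for cloudwatch.get_metric_statistics(**params)."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/RDS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "DBClusterIdentifier", "Value": cluster_id}],
        "StartTime": end - timedelta(hours=1),
        "EndTime": end,
        "Period": 60,
        "ExtendedStatistics": ["p99"],  # percentiles go here, not Statistics
    }


# Usage (needs boto3 and credentials):
#   import boto3
#   cw = boto3.client("cloudwatch")
#   resp = cw.get_metric_statistics(**build_p99_request("my-cluster"))
#   for dp in resp["Datapoints"]:
#       print(dp["Timestamp"], dp["ExtendedStatistics"]["p99"])
```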
Maybe I’m being dumb, but all I see in either the Monitoring tab or in CloudWatch for RDS are these 40 metrics.
Wanna jump on a screen share?