Anyone have any idea why AWS Aurora chooses to restart replicas if replication lag goes above a certain threshold? I'm semi-ignorant here, but it seems like if you had confidence in the system you wouldn't want to restart a replica under those conditions, because the restart would presumably just add to the lag in the end.
Could it be that the replicas fell so far behind that there aren't enough WAL segments retained for them to catch up, so they have to be recovered from the latest snapshot entirely? IDK if Aurora can suffer from this issue, but I've seen similar behavior with RDS.
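On stock RDS PostgreSQL you can at least watch for this from the primary by comparing each standby's replay position against the current WAL position. A minimal sketch, assuming a psycopg2 connection to the primary (the hostname is a placeholder):

```python
import psycopg2

# Connect to the *primary*; pg_stat_replication is only populated there.
conn = psycopg2.connect("host=primary.example.com dbname=postgres user=monitor")

with conn.cursor() as cur:
    # Bytes of WAL each standby still has to replay. If this grows past what
    # wal_keep_size / the replication slot retains, the standby can no longer
    # catch up from WAL and has to be rebuilt from a base backup or snapshot.
    cur.execute("""
        SELECT application_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
        FROM pg_stat_replication
    """)
    for name, lag_bytes in cur.fetchall():
        print(f"{name}: {lag_bytes} bytes behind")
```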
I'm not sure if that's actually the case. Aurora has a different architecture underneath and basically disaggregates storage from compute on top of a globally replicated WAL... so I'm not sure how much it differs from the original PG here.
Maybe it doesn't need to retain segments for replication at all if the storage itself is the WAL, but I don't remember the details.
> If the log record refers to a page in the reader’s buffer cache, it uses the log applicator to apply the specified redo operation to the page in the cache
From section 4.2.2 of “Amazon Aurora: Design considerations for high throughput cloud-native relational databases”
I assume said restart clears the buffer cache, making every read after that hit distributed storage, where the lag is effectively zero.
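Roughly what section 4.2.2 describes, as a loose sketch (names like `buffer_cache` and `apply_redo` are mine, not from the paper): the reader only has to apply redo for pages it already caches, and can drop everything else, because a later read will fetch the page from shared storage anyway.

```python
# Loose sketch of the reader-node log stream handling described in
# section 4.2.2 of the Aurora paper. All names here are illustrative.

buffer_cache: dict[int, bytes] = {}  # page_id -> cached page image

def apply_redo(page: bytes, record) -> bytes:
    """Apply one redo record to a cached page image (details elided)."""
    ...

def on_log_record(record) -> None:
    page_id = record.page_id
    if page_id in buffer_cache:
        # Page is cached: replay the redo op so the cached copy stays
        # consistent with what the write node produced.
        buffer_cache[page_id] = apply_redo(buffer_cache[page_id], record)
    else:
        # Page not cached: discard the record. A future read of this page
        # goes to the shared storage layer, which applies the log itself.
        pass
```

Which would explain the restart trick: an empty cache means there are no stale pages waiting on redo, so reads served straight from storage reflect the latest durable state.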
Consider that when it restarts, it's (likely) switching readers behind the scenes to an instance that is fully caught up, so this gives them a catch-all fix regardless of what is causing the replica lag. On a side note, you can use the secret function aurora_replica_status() to monitor replica lag if you want. It only provides visibility at the node layer, not the underlying storage architecture, but it's probably better than the other options.
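Something like this, against any Aurora PostgreSQL endpoint (the hostname is a placeholder, and the column names are from memory, so verify them with `SELECT * FROM aurora_replica_status()` on your engine version first):

```python
import psycopg2

# aurora_replica_status() reports on all replicas in the cluster, so any
# endpoint works. Column names below are assumptions; check your version.
conn = psycopg2.connect("host=my-cluster.cluster-xyz.us-east-1.rds.amazonaws.com "
                        "dbname=postgres user=monitor")

with conn.cursor() as cur:
    cur.execute("""
        SELECT server_id, replica_lag_in_msec
        FROM aurora_replica_status()
    """)
    for server_id, lag_ms in cur.fetchall():
        print(f"{server_id}: {lag_ms} ms behind")
```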