I’ve got an ETL process that consumes messages from an SQS queue, and there will be cases where the data fails validation, so I want to do secondary processing. Would it make sense to use a dead-letter queue redrive, letting the message drop into a dead-letter queue and having redrive send it to the secondary queue? The other option is to have the ETL process forward it to the secondary queue when validation fails
I would have the ETL process forward it explicitly, because it lets you distinguish between three outcomes:
• Data passes validation, is successfully processed
• Data fails validation, is sent to the secondary queue
• Data passes validation, but fails processing for an unexpected reason
If you dump everything from the dead-letter queue to secondary processing, you might not spot issues in (3)
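For what it’s worth, the forwarding branch is only a few lines. A minimal sketch (the queue URL, `validate`, and `process` are placeholders, and the SQS client is passed in so the logic is easy to test):

```python
import json

# Hypothetical queue URL -- substitute your own
SECONDARY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/etl-secondary"


def validate(record):
    """Stand-in for the real validation logic."""
    return "required_field" in record


def handle_message(sqs, queue_url, message, process):
    """Consume one SQS message, forwarding validation failures explicitly.

    `sqs` is a boto3 SQS client (injected here so the logic is testable).
    """
    record = json.loads(message["Body"])
    if not validate(record):
        # Outcome 2: validation failed -> route to secondary processing
        sqs.send_message(QueueUrl=SECONDARY_QUEUE_URL, MessageBody=message["Body"])
    else:
        process(record)  # outcome 1; an unexpected exception here is outcome 3
    # Delete only once we've made a decision, so an unexpected processing
    # error (outcome 3) retries and eventually lands in the real DLQ
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```

Because the delete only happens on the two “decided” paths, outcome (3) stays visible in the real DLQ instead of being mixed in with validation failures.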
Could you not have 3 queues:
… and set it up so the primary DLQs to the secondary and the secondary DLQs to the failed queue, then you’d just configure the maxReceiveCount on each redrive policy appropriately.
Never needed to do something like that, but I’d assume you can layer them this way (you’ll want to do a quick test to make sure messages filter through as expected).
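If you go that route, the layering is just two redrive policies. A rough sketch of building them (the ARNs and maxReceiveCount values here are made up):

```python
import json


def redrive_policy(dlq_arn, max_receives):
    """Build the RedrivePolicy attribute value for SQS set_queue_attributes."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receives),
    })


# Hypothetical ARNs for the chain: primary -> secondary -> failed
SECONDARY_ARN = "arn:aws:sqs:us-east-1:123456789012:etl-secondary"
FAILED_ARN = "arn:aws:sqs:us-east-1:123456789012:etl-failed"

# Applied with a boto3 SQS client, something like:
# sqs.set_queue_attributes(QueueUrl=primary_url,
#     Attributes={"RedrivePolicy": redrive_policy(SECONDARY_ARN, 3)})
# sqs.set_queue_attributes(QueueUrl=secondary_url,
#     Attributes={"RedrivePolicy": redrive_policy(FAILED_ARN, 5)})
```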
Of course the other option would be just to attempt primary & secondary processing in the same request, but there may be a good reason why you can’t do that.
It does feel like using a DLQ for secondary processing would make the process a little opaque. The plan is that anything that fails validation goes into secondary processing; if it passes, it continues down the pipeline, and if it fails, the data ends up in an S3 bucket. So for that final hard failure I could use a final queue and have a Lambda consume it and pass it on to Kinesis
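That last hop could be a small Lambda with an SQS event source mapping on the final queue. A sketch (the stream name is a placeholder, and the Kinesis client is injectable so the logic is testable outside AWS):

```python
STREAM_NAME = "etl-hard-failures"  # hypothetical stream name


def make_handler(kinesis=None, stream_name=STREAM_NAME):
    """Build the Lambda handler; the Kinesis client is injectable for tests."""
    if kinesis is None:
        import boto3  # only needed inside the real Lambda environment
        kinesis = boto3.client("kinesis")

    def handler(event, context):
        # An SQS event source mapping delivers messages under event["Records"]
        for record in event["Records"]:
            kinesis.put_record(
                StreamName=stream_name,
                Data=record["body"].encode("utf-8"),
                PartitionKey=record["messageId"],  # spreads records across shards
            )
        return {"forwarded": len(event["Records"])}

    return handler


# In the Lambda module itself: handler = make_handler(), set as the entry point
```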
> It does feel like using DLQ for secondary processing would make the process a little opaque.
How so? You’d have clear, easy-to-access stats on how often something fails primary processing, since it’s its own queue :shrug: and you could define its own SQS parameters etc.
I’d have metrics, but not necessarily why messages are going into the DLQ; sending validation failures to a separate queue would give me better visibility. As it stands, I could see the number of messages in the DLQ going up and alarm on it, but then an engineer would need to dig into the logs to see why
That sounds like just a case of making the logging more comprehensive/visible to those who need it
Oh definitely. We use Datadog and don’t forward CloudWatch metrics, so I’ll need to check with a team on whether we just rely on log-based metrics. That’s the sticking point, as we rely on dashboard metrics for transparency
This is what I came up with:
- Data comes in and populates incoming queue.
- Airflow DAG consumes messages, pipeline has a step for handling validation failure. If failure is handled, continue as usual but log a warning.
- If failure cannot be handled, log error and dump message into DLQ. Once failure is resolved by magic, redrive back to incoming queue.
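The consume step in that plan could look roughly like this (every callable is a stand-in; `handle_failure` returning None means the failure couldn’t be handled):

```python
import logging

log = logging.getLogger("etl")


def consume_step(messages, validate, handle_failure, continue_pipeline, send_to_dlq):
    """One pass of the DAG's consume step; all callables are placeholders.

    `handle_failure` returns a repaired record, or None if the failure
    can't be handled.
    """
    for msg in messages:
        record = msg
        if not validate(record):
            record = handle_failure(record)
            if record is None:
                # Unhandled failure: log an error and dump into the DLQ
                log.error("unhandled validation failure, sending %s to DLQ", msg.get("id"))
                send_to_dlq(msg)  # redrive back to incoming once resolved
                continue
            # Handled failure: continue as usual but log a warning
            log.warning("validation failure handled for %s", msg.get("id"))
        continue_pipeline(record)
```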