I am listening to S3 object creation events in my data lake. I have an SNS → SQS → Lambda setup. This works great for new objects. However, some data was already in the data lake before we implemented this processing architecture. Is there a way to replay processing for these older, existing S3 objects?
Replaying events: I don’t believe that's possible, but you could look into S3 Inventory and then publish your own “synthetic” S3-style events from that big list to your SNS topic. You’d of course need to filter that list down to the objects that haven’t had an event yet.
Can you elaborate on what you mean by S3 Inventory?
Do you have something in mind when you say synthetic S3-style events?
There is one possible solution that I'm thinking about to replay the processing.
The idea is to make use of an EventBridge rule for the Object Tags Added event.
- In my Lambda I would look for this event coming through and get the tags of the object.
- I bulk-tag a set of S3 objects with something like back-date-processing: true
- Any objects that come through that event with that tag I can process as usual.
Have you come across this before?
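A rough sketch of the check the Lambda would do for that tag-based idea. It assumes the event arrives in the EventBridge shape for S3 (source "aws.s3", detail-type "Object Tags Added") and that, as far as I know, the event itself does not carry the tag values, so they have to be looked up. `should_replay` and `get_tags` are hypothetical names; `get_tags` is passed in so the logic can be tried without AWS access.

```python
# Tag from the example above; adjust to whatever you bulk-tag with.
REPLAY_TAG_KEY = "back-date-processing"
REPLAY_TAG_VALUE = "true"


def should_replay(event, get_tags):
    """Return (bucket, key) if this event marks an object for reprocessing, else None.

    get_tags is a hypothetical callable: (bucket, key) -> dict of tag name to value.
    """
    # Ignore anything that is not an S3 "Object Tags Added" event.
    if event.get("source") != "aws.s3":
        return None
    if event.get("detail-type") != "Object Tags Added":
        return None

    detail = event["detail"]
    bucket = detail["bucket"]["name"]
    key = detail["object"]["key"]

    # Assumption: the tags are not in the event, so fetch them separately.
    tags = get_tags(bucket, key)
    if tags.get(REPLAY_TAG_KEY) == REPLAY_TAG_VALUE:
        return bucket, key
    return None
```

In a real Lambda, `get_tags` could wrap boto3's `get_object_tagging(Bucket=..., Key=...)` and turn the returned `TagSet` list into a dict. The bucket also needs EventBridge notifications turned on for these events to show up at all.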
Google “S3 Inventory”. It can produce a set of CSVs that represent an index of your objects in S3. You could then decide programmatically which of those objects to send your own event for (one that is structured like an S3 event), as a way of generating events from that CSV report.
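To make the "synthetic event" part concrete, here is a minimal sketch. It assumes the inventory CSV has no header row, with bucket and key as the first two columns and the key percent-encoded (check this against your own report and its manifest). `synthetic_s3_event`, `replay_from_inventory`, `publish` and `already_processed` are hypothetical names, and the event only includes the fields a typical consumer reads, not everything a real S3 notification contains.

```python
import csv
import io
import json
from urllib.parse import quote_plus, unquote


def synthetic_s3_event(bucket, key, size=0, region="us-east-1"):
    """Build a dict shaped like an S3 ObjectCreated notification (partial)."""
    return {
        "Records": [
            {
                "eventVersion": "2.1",
                "eventSource": "aws:s3",
                "awsRegion": region,  # assumption: set this to your bucket's region
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "s3SchemaVersion": "1.0",
                    "bucket": {"name": bucket, "arn": f"arn:aws:s3:::{bucket}"},
                    # Real notifications URL-encode the key, so mirror that here.
                    "object": {"key": quote_plus(key, safe="/"), "size": size},
                },
            }
        ]
    }


def replay_from_inventory(csv_text, publish, already_processed=frozenset()):
    """Publish one synthetic event per inventory row that was not processed yet.

    publish is a hypothetical callable that takes the JSON message string.
    Returns the number of events sent.
    """
    sent = 0
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        bucket, key = row[0], unquote(row[1])
        if key in already_processed:
            continue  # this object already had a real event
        publish(json.dumps(synthetic_s3_event(bucket, key)))
        sent += 1
    return sent
```

In practice `publish` could be something like `lambda msg: sns.publish(TopicArn=topic_arn, Message=msg)` with a boto3 SNS client, so the synthetic events go through the same SNS → SQS → Lambda path as the real ones.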