Listening to S3 object creation events and replaying older files

edward · February 20, 2024, 12:25am

I am listening to S3 object creation events in my data lake. I have an SNS → SQS → Lambda setup. This works great for new objects. However, some data was already in the data lake before we implemented this architecture for processing files. Is there a way to replay processing for these older, existing S3 objects?

jennifer · February 20, 2024, 1:00am

Replaying events? I don’t believe so. But you could look into S3 Inventory and then publish your own “synthetic” S3-style events from that list to your SNS topic. You’d of course need to filter the list down to the objects that haven’t had an event yet.

edward · February 20, 2024, 1:15am

Can you elaborate on what you mean by S3 Inventory?

edward · February 20, 2024, 1:30am

Do you have something in mind when you say synthetic S3-style events?

edward · February 20, 2024, 1:38am

There is one possible solution that I'm thinking about to replay the processing.

The idea is to make use of the Object Tags Added EventBridge rule.

In my Lambda I would look for this event coming through and get the tags of the object.
I would bulk tag a set of S3 objects with something like back-date-processing: true
Any object that comes through that event with that tag I can process as usual.

Have you come across this before?
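
For illustration, the tag check described above might look roughly like the sketch below. It assumes the EventBridge event has detail-type "Object Tags Added" with the bucket name and object key under "detail", and that the tags themselves have to be fetched with a separate get_object_tagging call because they are not carried in the event. The function name should_backfill and its parameters are hypothetical. The S3 client is passed in so the logic can be tried without AWS access.

```python
def should_backfill(event, s3_client,
                    tag_key="back-date-processing", tag_value="true"):
    """Return (bucket, key) if the tagged object is marked for backfill, else None.

    `event` is assumed to be an EventBridge event dict from S3.
    `s3_client` is assumed to behave like boto3.client("s3").
    """
    # Ignore anything that is not a tagging event.
    if event.get("detail-type") != "Object Tags Added":
        return None

    detail = event["detail"]
    bucket = detail["bucket"]["name"]
    key = detail["object"]["key"]

    # The event does not carry the tags, so look them up on the object.
    response = s3_client.get_object_tagging(Bucket=bucket, Key=key)
    tags = {t["Key"]: t["Value"] for t in response["TagSet"]}

    if tags.get(tag_key) == tag_value:
        return bucket, key
    return None
```

In a real Lambda the handler would create the client with boto3.client("s3") and hand the returned bucket and key to the existing processing code. One thing to weigh: this costs one extra S3 request per tagged object.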

jennifer · February 20, 2024, 2:30am

Google “S3 Inventory”. It can produce a set of CSV files that act as an index of your objects in S3. You could then decide programmatically which of those objects to send your own event for (one structured like an S3 event), as a way of generating events from that CSV report.
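
A minimal sketch of that idea, under some assumptions: the inventory CSV has no header row and its columns are Bucket, Key, Size in that order (the real order comes from the inventory manifest), keys in the CSV are URL-encoded, and the downstream Lambda only reads the bucket name and object key from the record. The function names, the already_processed set, and the default region are all hypothetical. The SNS client is passed in, so nothing here needs AWS credentials to try out.

```python
import csv
import io
import json
from urllib.parse import quote_plus, unquote


def synthetic_s3_event(bucket, key, size=None, region="us-east-1"):
    """Build a dict shaped like an S3 ObjectCreated notification for one object."""
    # S3 notifications URL-encode the key, with spaces written as '+'.
    obj = {"key": quote_plus(key, safe="/")}
    if size is not None:
        obj["size"] = size
    return {
        "Records": [{
            "eventVersion": "2.1",
            "eventSource": "aws:s3",
            "awsRegion": region,  # assumption: replace with the bucket's region
            "eventName": "ObjectCreated:Put",
            "s3": {
                "s3SchemaVersion": "1.0",
                "bucket": {"name": bucket, "arn": f"arn:aws:s3:::{bucket}"},
                "object": obj,
            },
        }]
    }


def events_from_inventory(csv_text, columns=("Bucket", "Key", "Size"),
                          already_processed=frozenset()):
    """Yield one synthetic event per inventory row that still needs processing."""
    for row in csv.reader(io.StringIO(csv_text)):
        record = dict(zip(columns, row))
        key = unquote(record["Key"])  # assumption: CSV keys are URL-encoded
        if key in already_processed:
            continue
        size = int(record["Size"]) if record.get("Size") else None
        yield synthetic_s3_event(record["Bucket"], key, size)


def publish_events(sns_client, topic_arn, events):
    """Publish each event as a JSON message; returns how many were sent."""
    count = 0
    for event in events:
        sns_client.publish(TopicArn=topic_arn, Message=json.dumps(event))
        count += 1
    return count
```

With boto3 this would be called as publish_events(boto3.client("sns"), topic_arn, events_from_inventory(csv_text)). Since the messages go through the same SNS → SQS → Lambda path as real notifications, the Lambda should not need changes as long as it only relies on the fields filled in here.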