Cost-effective approach for Lambda-SQS: Single or multiple message processing

Hi friends,

I am trying to determine what is more cost effective for a Lambda function with an SQS trigger: processing a single message at a time or multiple messages per invocation.

I have an SQS queue that receives around 40k messages every hour, and a Lambda function that processes the messages on this queue. Currently the Lambda function has batchSize=1 and maxConcurrency=50. Is there some way I can determine if it's more cost effective for each execution to process multiple messages, without actually making this change?

Also interested if you’ve done this kind of testing before and what you found out if anything.

There are way too many variables to determine this up front. How much setup work you do OUTSIDE the main handler is a big component of how much better it will be to change the batch size. But in-batch error rate is important - if every other batch will be thrown away because of a failure, that won’t be good. Your best bet is to ensure your lambda is coded properly and then just test different batch sizes and concurrency counts. The right answer (batch size/concurrency) could be 1/50 or 100/100.

the other variable you aren’t noting is how much memory (and so by proxy how much CPU) you have allocated to the lambda (and whether the underlying runtime can make use of it). There are times a 128MB lambda works, but a 512MB lambda works more than 4x faster, so it’s cheaper.
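To get a rough feel for the batch size and memory tradeoff before testing, here is a back-of-the-envelope sketch. Everything in it is an assumption: the prices are the x86 on-demand rates as I remember them for us-east-1 (check the current pricing page), and overhead_ms and per_message_ms are hypothetical numbers you would need to measure for your own function. It ignores the free tier, SQS request charges, and rounding of billed duration.

```python
import math

# Assumed on-demand prices (x86, us-east-1); verify against the current pricing page.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_REQUEST = 0.20 / 1_000_000


def hourly_cost(messages, batch_size, memory_mb, overhead_ms, per_message_ms):
    """Estimate Lambda cost for one hour of traffic.

    overhead_ms:    fixed time paid on every invocation (hypothetical, measure it)
    per_message_ms: time to process a single message (hypothetical, measure it)
    """
    invocations = math.ceil(messages / batch_size)
    # Fixed overhead is paid once per invocation, per-message work once per message.
    total_seconds = (invocations * overhead_ms + messages * per_message_ms) / 1000
    gb_seconds = total_seconds * (memory_mb / 1024)
    return gb_seconds * PRICE_PER_GB_SECOND + invocations * PRICE_PER_REQUEST


if __name__ == "__main__":
    for batch_size in (1, 10):
        cost = hourly_cost(40_000, batch_size, 512, overhead_ms=20, per_message_ms=200)
        print(f"batch={batch_size}: ${cost:.4f}/hour")
```

In this model, batching only saves the per-invocation overhead and the request charge, so the gain depends on how large that fixed overhead is relative to the per-message work. It also shows the memory point: if 512MB makes each message more than 4x faster than 128MB, the GB-seconds go down even though the memory went up.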

Yeah, I imagined this would be complex to figure out without trial and error.

I wish there was an easy way to measure or graph the execution time, the amount of memory used, etc.

I had to increase the memory from 128MB to 512MB because around 40% of the messages (for example) can require me to fetch a large amount of data, so roughly 4 out of 10 times the function needs more memory to buffer the data before storing it in S3.

But I don't know if it's going to need 512MB of RAM until I've fetched the data (at which point the Lambda is already executing). So I've had to set the default amount of RAM much higher than what the other 60% of the messages need.

This 40/60% split is something I'm guessing based on logs, as I've not found a good way to measure it in CloudWatch metrics.

How much setup work you do OUTSIDE the main handler is a big component of how much better it will be to change the batch size.

Can you go into more detail on what you mean by this? For context, I have a Prisma binary that my Lambda requires before execution.

But in-batch error rate is important - if every other batch will be thrown away because of a failure, that won't be good.

Are you suggesting that if one message in the batch fails, the entire batch is thrown out until retried?

there is a graph of execution time and memory used, you use this to get those: https://repost.aws/knowledge-center/lambda-function-memory-usage-monitoring
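If you'd rather crunch exported logs yourself, every invocation ends with a REPORT line that includes the duration, the configured memory, and the max memory used. A minimal sketch for measuring that 40/60 split from those lines; the sample lines and the 128MB threshold below are made up for illustration.

```python
import re

# Matches the tail of a Lambda REPORT log line.
REPORT_RE = re.compile(
    r"Duration: (?P<duration>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed>[\d.]+) ms\s+"
    r"Memory Size: (?P<size>\d+) MB\s+"
    r"Max Memory Used: (?P<used>\d+) MB"
)


def summarize(lines, threshold_mb):
    """Return the share of invocations whose max memory exceeded threshold_mb."""
    used = []
    for line in lines:
        match = REPORT_RE.search(line)
        if match:  # skip anything that is not a REPORT line
            used.append(int(match.group("used")))
    if not used:
        return {"invocations": 0, "share_over": 0.0, "max_used_mb": 0}
    over = sum(1 for mb in used if mb > threshold_mb)
    return {
        "invocations": len(used),
        "share_over": over / len(used),
        "max_used_mb": max(used),
    }


# Made-up sample lines in the REPORT format, for illustration only.
SAMPLE = [
    "REPORT RequestId: a1\tDuration: 120.50 ms\tBilled Duration: 121 ms\tMemory Size: 512 MB\tMax Memory Used: 96 MB\t",
    "REPORT RequestId: a2\tDuration: 950.10 ms\tBilled Duration: 951 ms\tMemory Size: 512 MB\tMax Memory Used: 410 MB\t",
    "REPORT RequestId: a3\tDuration: 110.00 ms\tBilled Duration: 110 ms\tMemory Size: 512 MB\tMax Memory Used: 90 MB\t",
    "START RequestId: a4 Version: $LATEST",
    "REPORT RequestId: a4\tDuration: 870.75 ms\tBilled Duration: 871 ms\tMemory Size: 512 MB\tMax Memory Used: 380 MB\t",
    "REPORT RequestId: a5\tDuration: 130.20 ms\tBilled Duration: 131 ms\tMemory Size: 512 MB\tMax Memory Used: 101 MB\t",
]

if __name__ == "__main__":
    print(summarize(SAMPLE, threshold_mb=128))
```

Keep in mind that max memory used is reported per execution environment, so on a warm environment one big message can inflate the number for the small ones that follow it.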

if you can interrogate the size of the data you need BEFORE fetching it (i.e. like doing a HEAD on the object before GET in S3) then maybe you want to migrate this to a step functions workflow - the first step just figures out whether you need the “BIG” lambda or the “TINY” lambda and farms it off to the appropriate SQS queue. Then you run a 128MB lambda to categorize, another 128MB lambda to process “small” batches, and a 512MB lambda to process big batches (or maybe the batch size trigger on the large is only 2 or 3 instead of 10, and you can keep it 128MB or 256MB).
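The categorize step could look roughly like this. It is only a sketch and assumes the size really can be read up front with an S3 HEAD, which may not match your data source. The queue URLs and the threshold are hypothetical, and s3 and sqs are expected to be boto3 clients passed in by the caller (created outside the handler).

```python
# Hypothetical queue URLs and cutoff; replace with your own values.
SMALL_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/small-jobs"
BIG_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/big-jobs"
THRESHOLD_BYTES = 50 * 1024 * 1024


def choose_queue(size_bytes, threshold=THRESHOLD_BYTES):
    """Pick the queue whose consumer has enough memory for this payload."""
    return BIG_QUEUE_URL if size_bytes > threshold else SMALL_QUEUE_URL


def route(s3, sqs, bucket, key, message_body):
    """HEAD the object to learn its size, then forward the message.

    s3 and sqs are boto3 clients, e.g. boto3.client("s3") and boto3.client("sqs").
    """
    head = s3.head_object(Bucket=bucket, Key=key)  # metadata only, no download
    queue_url = choose_queue(head["ContentLength"])
    sqs.send_message(QueueUrl=queue_url, MessageBody=message_body)
    return queue_url
```

Passing the clients in keeps the routing logic easy to test with fakes, and the categorizer itself never holds the payload in memory, so it can stay at 128MB.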

How much setup work you do OUTSIDE the main handler - Lambda execution environments are reused across invocations for a limited time, so any heavy setup you do outside the main handler gets reused by later invocations. Setting up SDK clients, fetching secrets, connecting to DB connection pools (or making one), computing a complex hash, downloading binaries to /tmp, whatever else you’re doing: if you can put it outside the handler, you save yourself that time on the next invocation. You only pay the tax for what’s outside the main handler on cold start; once it’s warm, you keep reusing it. Even s3 = boto3.client("s3") takes a few ms, no need to do it inside the handler.
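A minimal sketch of that pattern. expensive_setup() is a hypothetical stand-in for whatever your real init is (boto3.client("s3"), fetching secrets, opening a DB pool, unpacking a binary); the sleep just simulates the cost.

```python
import time

SETUP_CALLS = 0  # counter only here to show how often setup runs


def expensive_setup():
    """Hypothetical stand-in for real init work (SDK clients, secrets, DB pool)."""
    global SETUP_CALLS
    SETUP_CALLS += 1
    time.sleep(0.05)  # simulate slow initialization
    return {"ready": True}


# Module level: runs once per execution environment, i.e. on cold start only.
RESOURCES = expensive_setup()


def handler(event, context):
    # Warm invocations reuse RESOURCES instead of rebuilding it.
    records = event.get("Records", [])
    return {"processed": len(records), "ready": RESOURCES["ready"]}
```

Calling handler twice in the same process runs expensive_setup() once, which is what happens on warm invocations. With batchSize=1 you still pay the per-invocation overhead inside the handler for every message, which is where a larger batch helps.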

in-batch error rate - yes. https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html

By default, if your function encounters an error while processing a batch, all messages in that batch become visible in the queue again. For this reason, your function code must be able to process the same message multiple times without unintended side effects.

but you can fix that if you write some more code: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
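The "more code" is roughly this: catch failures per message and return only the failed message IDs, so the rest of the batch is not retried. This is a sketch; process() is a hypothetical placeholder for your own logic, and it only works if ReportBatchItemFailures is enabled on the event source mapping.

```python
import json


def process(body):
    """Hypothetical per-message work; raises when a message cannot be handled."""
    if body.get("fail"):
        raise ValueError("cannot process message")


def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Report only this message as failed; the others are treated as done.
            failures.append({"itemIdentifier": record["messageId"]})
    # An empty list tells Lambda the whole batch succeeded.
    return {"batchItemFailures": failures}
```

Without that setting the return value is ignored, and a handler that swallows exceptions like this would silently drop the failed messages, so enable it before deploying this shape of handler.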