On Wednesday, June 27, 2018, Amazon Web Services released SQS triggers for Lambda functions. Those of you who have been building serverless applications with AWS Lambda probably know how big of a deal this is. Until now, the AWS Simple Queue Service (SQS) was generally a pain to deal with for serverless applications. Communicating with SQS is simple and straightforward, but there was no way to automatically consume messages without implementing a series of hacks. In general, these hacks “worked” and were fairly manageable. However, as your services became more complex, dealing with concurrency and managing fan out made your applications brittle and error prone. SQS triggers solve all of these problems. 👊
Update September 3, 2020: There are a number of important recommendations available in the AWS Developer Guide for using SQS as a Lambda Trigger: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html.
The most important are:
“To allow your function time to process each batch of records, set the source queue’s visibility timeout to at least 6 times the timeout that you configure on your function. The extra time allows for Lambda to retry if your function execution is throttled while your function is processing a previous batch.”
“To give messages a better chance to be processed before sending them to the dead-letter queue, set the maxReceiveCount on the source queue’s redrive policy to at least 5.”
It is also imperative that you set a minimum concurrency of 5 on your processing Lambda function due to the initial scaling behavior of the SQS Poller.
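If you manage your queue configuration with the SDK, here's a quick sketch of applying those recommendations (the queue URL, DLQ ARN, and timeout value are placeholders for your own values):

```javascript
const AWS = require('aws-sdk')
const SQS = new AWS.SQS()

const FUNCTION_TIMEOUT = 30 // seconds; match your Lambda function's timeout

SQS.setQueueAttributes({
  QueueUrl: 'https://sqs.us-east-1.amazonaws.com/XXXXXXXXXX/my-source-queue', // placeholder
  Attributes: {
    // At least 6x the function timeout, per the AWS recommendation
    VisibilityTimeout: String(FUNCTION_TIMEOUT * 6),
    // maxReceiveCount of at least 5, per the AWS recommendation
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: 'arn:aws:sqs:us-east-1:XXXXXXXXXX:my-dlq', // placeholder
      maxReceiveCount: '5'
    })
  }
}).promise()
```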
Update November 19, 2019: AWS announced support for SQS FIFO queues as a Lambda event source (announcement here). FIFO queues guarantee message order, which means only one Lambda function is invoked per MessageGroupId.
Update December 6, 2018: At some point over the last few months, AWS fixed the issue with the concurrency limits and the redrive policy. See Additional experiments with concurrency and redrive policies below.
Attaching Consumers to Message Brokers
It’s a common architecture design pattern to attach consumers to message brokers in distributed applications. This becomes even more important with microservices as it allows communication between different components. Depending on the type of work to be done (high versus low priority), message brokers can be passive, simply storing messages and waiting for a message-oriented middleware to poll it, or active, where it will route the message for you.
RabbitMQ, for example, allows you to create bindings that attach workers to queues. You can run workers in the background with something like supervisor, which will get messages pushed to them as RabbitMQ receives them. Until now, SQS has lacked the ability to do this type of “push” and instead required constant polling to achieve a similar effect. This constant polling might make sense for high volume queues, but for smaller, occasional jobs, even running a Lambda function every minute would be a waste of resources.
Setting up an SQS trigger in Lambda is simple through the AWS Console. SAM templates also support this (https://github.com/becloudway/aws-lambda-sqs-sam), so you can set it up that way as well, and the team at Serverless is working to add support too. The only required settings are the queue you want to consume and the “batch size”, which is the maximum number of messages that will be read from your queue at once (up to 10 at a time). Be sure to configure your IAM permissions properly; you need both read and write privileges.
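If you'd rather wire this up programmatically, here's a rough sketch using the Node.js SDK (the queue ARN and function name are placeholders):

```javascript
const AWS = require('aws-sdk')
const lambda = new AWS.Lambda()

lambda.createEventSourceMapping({
  EventSourceArn: 'arn:aws:sqs:us-east-1:XXXXXXXXXX:test-sqs-trigger-queue', // placeholder
  FunctionName: 'my-sqs-receiver', // placeholder
  BatchSize: 10, // max messages passed per invocation (1-10 for SQS)
  Enabled: true
}).promise()
```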
Running Some Experiments
I set up two test functions to run some experiments. The first was the function that was triggered by SQS and received the messages:
```javascript
let counter = 1
let messageCount = 0
let funcId = 'id' + parseInt(Math.random() * 1000)

exports.handler = async (event) => {
  // Record number of messages received
  if (event.Records) {
    messageCount += event.Records.length
  }
  console.log(funcId + ' REUSE: ', counter++)
  console.log('Message Count: ', messageCount)
  console.log(JSON.stringify(event))
  return 'done'
};
```
The second was a “queue flooder” that just generated random messages and sent them to the queue. Remember that SQS’s sendMessageBatch can only send up to 10 messages at a time:
```javascript
const AWS = require('aws-sdk')
const SQS = new AWS.SQS()

const queue = 'https://sqs.us-east-1.amazonaws.com/XXXXXXXXXX/test-sqs-trigger-queue'

exports.handler = async (event) => {
  // Flood SQS Queue
  for (let i = 0; i < 50; i++) {
    await SQS.sendMessageBatch({ Entries: flooder(), QueueUrl: queue }).promise()
  }
  return 'done'
}

const flooder = () => {
  let entries = []
  for (let i = 0; i < 10; i++) {
    entries.push({
      Id: 'id' + parseInt(Math.random() * 1000000),
      MessageBody: 'value' + Math.random()
    })
  }
  return entries
}
```
I then ran my queue flooder. It sent 500 messages to SQS, which triggered my receiver function and drained the queue! 🙌
The event looks like this:
{ "Records": [ { "messageId": "9cf06c9b-e919-4ef9-8485-3d13c347a4d1", "receiptHandle": "AQEBJRZxkQUWQYAwBMPpN4...rVCoU70HTdEVH4eKZXuPUVBw==", "body": "value0.6888803697786434", "attributes": { "ApproximateReceiveCount": "1", "SentTimestamp": "1530189332727", "SenderId": "AROAI62MWIO3S4UBJVPVG:sqs-flooder", "ApproximateFirstReceiveTimestamp": "1530189332728" }, "messageAttributes": {}, "md5OfBody": "7ce3453347fd9bd30281384c304a1f9d", "eventSource": "aws:sqs", "eventSourceARN": "arn:aws:sqs:us-east-1:XXXXXXXX:test-sqs-trigger-queue", "awsRegion": "us-east-1" } ] } |
The SQS trigger spawned 5 concurrent Lambda functions that took less than 2 seconds to process all of the messages. Pretty sweet! This got me thinking about how Lambda would handle thousands of messages at once. Would it spawn hundreds of concurrent functions? 🤔
Concurrency Control for Managing Throughput
My first thought was that Lambda’s default scaling behavior would continue to fan out to process messages in the queue. This behavior makes sense if you’re doing something like processing images and there are no bottlenecks to contend with. However, what if you’re writing to a database, calling an external API, or throttling requests to some other backend service? Unlimited scaling is sure to cause issues if hundreds or thousands of Lambda functions are competing for the same resource. Luckily for us, we have concurrency control! 🤘🏻
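Concurrency control is just the “Reserved Concurrency” setting on your function. You can set it in the console, or with a quick SDK sketch like this (the function name is a placeholder):

```javascript
const AWS = require('aws-sdk')
const lambda = new AWS.Lambda()

lambda.putFunctionConcurrency({
  FunctionName: 'my-sqs-receiver', // placeholder
  ReservedConcurrentExecutions: 2 // maximum number of simultaneous "workers"
}).promise()
```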
I ran a few experiments to see exactly how Lambda would handle throttled requests. I used my queue flooder again and set the concurrency to 1. I sent 100 messages to the queue. Here are the log results:
It ultimately used two functions to handle the workload, but they executed serially, never exceeding my concurrency limit. As you can see, 100 messages were eventually processed over the course of 40 total invocations. Brilliant! 😀
I ran a second experiment to scale up my workload. I set the concurrency to 2 and flooded the queue with 200 messages this time.
Four functions in total were used, with only two running concurrently. And all 200 messages were processed successfully! 😎
Additional experiments with concurrency and redrive policies
Update December 6, 2018: The “Lambda service” now takes concurrency into account and no longer considers throttled invocations to be failed delivery attempts. This means that you can set a redrive policy on your queue and the system will only forward messages to a DLQ if there is an error while processing the message! 🙌
I’ve left the original information below to preserve the history, but most of this section is no longer relevant.
I had a really great comment that pointed out how adding redrive policies to SQS queues causes issues when you set a low concurrency. I ran some additional tests and discovered some really interesting behaviors.
It appears that the “Lambda service” polls the queue, and puts messages “in flight” without consideration of the concurrency limits. I added a 100ms wait time to my receiver function and noticed that messages towards the end of processing have ApproximateReceiveCounts of 4 or more. I’m sure this would be exacerbated by longer execution times and higher message volumes. The good news is that each message was only processed by my function one time.
I then set a redrive policy with a maximum receive count of 1 and ran my “queue flooder” again. This time a large percentage of the messages ended up in the Dead Letter Queue. This makes sense given how the Lambda service polls for messages. But now this got me thinking about error handling. So I added some code to trigger an error in my receiver (throw new Error('some error')) for every 10 requests. This turned out to be a really bad idea! I set my redrive policy to 1,000 Maximum Receives and then I sent 100 messages into the queue. 90 of the messages got processed (as expected), but the other 10 just kept on firing my Lambda function. Over and over again. 😳 I assume that it would have eventually stopped after those 10 messages had been tried 1,000 times each.
I think the concurrency control needs a bit of work. However, I believe that Lambda triggers are still usable if you manage the redrive policy yourself. If returning errors from the Lambda function doesn’t tell the Lambda service to DLQ the message, then perhaps this just needs to be handled by our receiver functions. I don’t think I really care how many times a message needs to be retried as long as it is eventually processed. If there is an issue with the message (e.g. it throws some error and needs to be inspected later), then I can handle that in a number of ways, including pushing it to a DLQ myself or logging it somewhere else.
It probably comes down to your particular use case, but throttling still seems possible, so long as you don’t rely on SQS’s built in redrive functionality.
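If you do go the self-managed route, here’s a rough sketch of what that receiver might look like (untested; the DLQ URL and processing logic are placeholders):

```javascript
const AWS = require('aws-sdk')
const SQS = new AWS.SQS()

const myDLQ = 'https://sqs.us-east-1.amazonaws.com/XXXXXXXXXX/my-own-dlq' // placeholder

// Placeholder for your actual business logic
const processMessage = async (body) => { /* do some work */ }

exports.handler = async (event) => {
  for (let record of event.Records) {
    try {
      await processMessage(record.body)
    } catch (err) {
      // Don't rethrow (that would requeue the whole batch); push the bad
      // message to our own DLQ for later inspection instead
      await SQS.sendMessage({ QueueUrl: myDLQ, MessageBody: record.body }).promise()
    }
  }
  // A successful return tells the Lambda service to remove the batch from the source queue
  return 'done'
}
```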
This is a Game Changer 🚀
As you probably can tell, I’m a bit excited about this. Being able to use the Simple Queue Service as a true message broker that you can attach consumers to changes the way we’ll build serverless applications. One of the biggest challenges has been throttling backend resources to avoid hitting service limits. With SQS triggers and concurrency control we can now offload expensive (or service limited) jobs without the need to hack something with scheduled tasks or CloudWatch log triggers. There’s no longer a need to manage our own fan out operations to process queues. We can now simply choose how many “workers” we want processing our queued requests.
Think of the possibilities! We can trigger SQS from Dead Letter Queues to create our own redrive processes. We could queue data mutations, knowing that they will be processed almost instantaneously by our workers (reducing latency and costs of API responses). We can batch process and throttle remote API calls. My brain is spinning. 🤯
The team at AWS has once again pushed the envelope for serverless computing. Congrats! Know that your work is appreciated. You’re awesome. 🙇♂️
Update: Read the official announcement by AWS.
Interested in learning more about serverless? Check out 10 Things You Need To Know When Building Serverless Applications to jumpstart your serverless knowledge.
If you want to learn more about serverless security, read my posts Securing Serverless: A Newbie’s Guide and Event Injection: A New Serverless Attack Vector.
Tags: aws, aws lambda, message broker, serverless, sqs
Did you like this post? 👍 Do you want more? 🙌 Follow me on Twitter or check out some of the projects I’m working on.
When you set the concurrency limit on the lambdas reading from the SQS, did their throttling metric increase during the flooder test, or did they just not try and spin up new lambdas beyond their limit without giving any errors about concurrency limits being hit?
The throttling metric did go up, but it didn’t throw any errors. It appears that they’ve designed it to handle concurrency so that it will process messages serially as the queue is worked down.
Thanks for such a great post. I have a really dumb question that I think you answered several different way already, but for clarification:
Does setting a concurrency limit on the Lambda fn mean that the queue will be completed however quickly the Lambda functions can work? Is the below example what you’re saying:
===
– Silly Lambda fn: “sleep(1000).then(succeed)”
– Lambda fn triggered by SQS – batch size 1
– Lambda fn set to concurrency of 1
– Add 100 messages to queue
– SQS queue is drained after 100 invocations, approximately 100 seconds
(Adjusting to concurrency 2 is approximately 50s, etc.)
===
The “amount of time” isn’t necessarily my concern, but perhaps it helps illustrate “work” happening — I’m mainly curious how the queue gets drained. We’re currently polling SQS and then, if there are messages, invoking other Lambdas so we don’t DDOS ourselves.
But what I *think* I’m excited about here is that we don’t have to do gymnastics to drain the queue — it Just Works(TM). Am I understanding correctly?
Hi Tom,
Great question. The docs aren’t very specific (https://docs.aws.amazon.com/lambda/latest/dg/invoking-lambda-function.html#supported-event-source-sqs), but I’m pretty sure my tests confirm your assumptions.
According to Amazon’s announcement:
“The Lambda service monitors the number of inflight messages, and when it detects that this number is trending up, it will increase the polling frequency by 20 ReceiveMessage requests per minute and the function concurrency by 60 calls per minute. As long as the queue remains busy it will continue to scale until it hits the function concurrency limits. As the number of inflight messages trends down Lambda will reduce the polling frequency by 10 ReceiveMessage requests per minute and decrease the concurrency used to invoke our function by 30 calls per-minute.”
What I gather from that is that the Lambda service is still technically polling SQS (just more efficiently than we could). So if you hit your concurrency limit, it will process as many messages as possible with the available functions, and then when a function completes, it will invoke the function again and again until all messages are drained.
– Jeremy
Great post! Really good timing too!
Question. So far, I’ve been using SNS to do this kind of job. I publish to a SNS topic and that triggers a lambda function. If a backend service has rate limits, I just tune the concurrency limits. Could you maybe explain the differences between using SNS and SQS with Lambda?
Thanks!
Hi Sebastian,
That is a really great question. The difference between SQS and SNS comes down to retry behavior (more info here). SNS events are not stream-based and are invoked asynchronously. This means that after 3 attempts, the message will fail and either be discarded or sent to a Dead Letter Queue. If you are setting concurrency limits on Lambdas that are processing SNS messages, it is possible that high volumes could cause messages to fail 3 times and therefore be discarded.
SQS, on the other hand, is a poll-based event source that is not stream-based. This means it will continue to poll the SQS queue with the resources (number of functions) available to it. Even if you set a concurrency limit to your functions, the messages will simply remain in the queue until there is a function available to process it. This is different from stream-based events, since those are BLOCKING, whereas SQS polling is not. According to the updated documentation: “If you don’t require ordered processing of events, the advantage of using Amazon SQS queues is that AWS Lambda will continue to process new messages, regardless of a failed invocation of a previous message. In other words, processing of new messages will not be blocked.”
Another important factor is the retry intervals. If an SNS message fails due to concurrency issues, the message is tried again “with delays between retries”. This means that your throttling could significantly delay an SNS message getting processed by your function. SQS would just keep on chugging through the queue as soon as it had capacity to handle more messages.
Hope that makes sense!
– Jeremy
Hi Jeremy,
Thanks for a great post. I have a question though.
You wrote in one of the comments “What I gather from that is that the Lambda service is still technically polling SQS (just more efficiently than we could). ”
Does it mean that our function is being executed in some (dynamic) time intervals and we PAY for each invocation no matter if there were messages in the queue or not?
My next question regards notifying SQS when you’re done with the message. I don’t see any code in your first function for deleting a message from a queue. I understand that you need to do that after consuming it, right?
Hi Pawel,
According to the SQS event documentation, you can “configure AWS Lambda to poll for these messages as they arrive and then pass the event to a Lambda function invocation.” The announcement also states that the “Lambda service monitors the number of inflight messages”. I don’t know exactly what they are doing internally, but my tests show that the Lambda functions begin consuming messages almost instantaneously once they’re added to the queue. To answer your question, you DO NOT pay for this monitoring service, only when messages are passed to your Lambda is it counted as an “invocation”.
One of the other great things about this service is that you do not need to manage messages in SQS directly. The monitoring service (or whatever AWS calls it), grabs messages from the queue and passes them to your Lambda function. If your function executes successfully, the service will remove the messages for you. If your function fails or takes longer to execute than the message retention period, it will either be returned to the queue or sent to a Dead Letter Queue.
Hope that helps,
Jeremy
Your experience with concurrency is a lot different than what I am observing.
When I set the concurrency to 1 and drop multiple items into the queue, the “Throttled invocations” metric goes up and those messages that triggered the throttling never get processed, but instead go straight to my dead queue. Example: I drop 5 things into the queue. Lambda only ever tries to process 2 of them, I get 3 throttled invocations, and 3 items end up in my dead queue.
This is with a queue redrive policy set to have maximum receives of 1, so that behavior is expected if throttling counts as a failed processing attempt. But the way you describe it is that it shouldn’t count as an error and will intelligently process the messages serially, which from my experience is not the case.
Dave,
Thanks for the comment. I ran some additional tests and got some very interesting findings.
First of all, you are right about the redrive policies. It appears that the “Lambda service” polls the queue, and puts messages “in flight” without consideration of concurrency limits set. I added a 100ms wait time to my receiver function and noticed that messages towards the end of processing have ApproximateReceiveCounts of 4 or more. I’m sure this would be exacerbated by longer execution times and higher message volumes. The good news is that each message was only processed by my function one time.

When I set a redrive policy, like you mention, then (depending on the execution time and message volume) a large percentage end up in the Dead Letter Queue. This got me thinking about error handling. So I added some code to trigger an error in my receiver (throw new Error('some error')) for every 10 requests. This turned out to be a really bad idea! I set my redrive policy to 1,000 Maximum Receives and then I sent 100 messages into the queue. 90 of the messages got processed (as expected), but the other 10 just kept on firing my Lambda function. Over and over again. I assume that it would have eventually stopped after those 10 messages had been tried 1,000 times each.

I think the concurrency control needs a bit of work. However, I believe that Lambda triggers are still usable if you manage the redrive policy yourself. If returning errors from the Lambda function doesn’t tell the Lambda service to DLQ the message, then perhaps this just needs to be handled by our receiver functions. I don’t think I really care how many times a message needs to be retried as long as it eventually is. If there is an issue with the message (e.g. it throws some error and needs to be inspected later), then I can handle that in a number of ways, including pushing it to a DLQ myself or logging it somewhere else.
It probably comes down to your particular use case, but throttling still seems possible, so long as you don’t rely on SQS’s built in redrive functionality.
Good luck,
Jeremy
As you mention above, “if your function executes successfully, the service will remove the messages for you.” Otherwise they’re all re-queued.
I’m wondering if you can avoid re-queueing/processing the whole batch of received messages (since it will send up to 10 at a time by default) if you’re unable to process, say, the 7th message in the batch.
One suggestion I’ve seen is to have your Lambda manually delete messages in the batch as they’re successfully handled. You’d then get automagic retries on any failed message and those not yet processed, without having to re-process those that were already successfully handled.
That suggestion is made in this article: https://serverless.com/blog/aws-lambda-sqs-serverless-integration/#batch-size-and-error-handling-with-the-sqs-integration
However, I haven’t yet verified that empirically or seen any other mentions of this technique.
As long as the function doesn’t error, the messages are removed from the queue. I think the better thing to do with batches is to configure your function to write to your own DLQ if a message isn’t processed correctly. This way you avoid requeuing all the messages. Not the ideal scenario, since you lose out on automatic retries, but you could configure a trigger on the DLQ as well.
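Here’s a rough sketch of that manual-delete approach (untested; the queue URL and processing logic are placeholders):

```javascript
const AWS = require('aws-sdk')
const SQS = new AWS.SQS()

const queue = 'https://sqs.us-east-1.amazonaws.com/XXXXXXXXXX/test-sqs-trigger-queue' // placeholder

// Placeholder for your actual business logic
const processMessage = async (body) => { /* do some work */ }

exports.handler = async (event) => {
  for (let record of event.Records) {
    await processMessage(record.body)
    // Delete this message immediately so it won't be redelivered
    // if a later message in the batch throws an error
    await SQS.deleteMessage({
      QueueUrl: queue,
      ReceiptHandle: record.receiptHandle
    }).promise()
  }
  return 'done'
}
```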
This is great, but I have a question. I’m just beginning to study and build projects with AWS services, so let me ask something I want to know. The SQS trigger spawned 5 concurrent Lambda functions, so I received 5 event payloads when I pushed data to the queue. I want to send bulk campaign email using the AWS SES service without looping. How can I do that? Please give me some suggestions.
Hi Steven,
If you only want to receive ONE message per Lambda invocation, then you need to set the “Batch Size” to 1 when you configure your SQS trigger. This may result in more concurrent Lambdas, unless you set an invocation throttle.

However, I would suggest (especially if you are building an app that sends messages via SES) setting your batch size to 10 and using the sendBulkTemplatedEmail method from the SDK (https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/SES.html#sendBulkTemplatedEmail-property). This would allow you to reduce the number of Lambda invocations (saving you money) and reduce the number of calls to SES (which would be much faster).

Hope that helps,
Jeremy
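P.S. Here’s a rough, untested sketch of that batched approach (the source address, template name, and message format are placeholders):

```javascript
const AWS = require('aws-sdk')
const SES = new AWS.SES()

exports.handler = async (event) => {
  // Map the batch of up to 10 SQS messages to SES destinations,
  // assuming each message body is JSON like { "email": "...", "name": "..." }
  let destinations = event.Records.map(record => {
    let msg = JSON.parse(record.body)
    return {
      Destination: { ToAddresses: [msg.email] },
      ReplacementTemplateData: JSON.stringify({ name: msg.name })
    }
  })

  // One SES call for the whole batch instead of looping over sendEmail
  await SES.sendBulkTemplatedEmail({
    Source: 'no-reply@example.com', // placeholder
    Template: 'MyCampaignTemplate', // placeholder
    DefaultTemplateData: JSON.stringify({ name: 'friend' }),
    Destinations: destinations
  }).promise()

  return 'done'
}
```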
I’m having a hard time finding any specific examples/docs about best practices for configuring DLQs in conjunction with SQS queues that are being worked by Lambda functions. I want to be able to scale my DynamoDb r/w capacity units based on when there are messages in the queue, which means if anything goes wrong, I don’t want failing messages to stay in the main queue after being retried a few times. Maybe it’s supposed to be obvious but it’s not clear to me how DLQ configs play with the “Lambda service”. Any tips?
Hi Andrew,
SQS Triggers with Lambda are still a bit new, so there is still quite a bit of experimentation going on. As I mention in my post, if you are throttling at all, don’t use DLQs directly with SQS. The throttling is likely to DLQ messages that are valid.
My suggestion would be to use a Lambda function along with an SQS trigger. You could set a CloudWatch alarm to monitor your SQS queue that would then trigger a Lambda function that could scale your DynamoDB throughput (although I think auto-scaling would solve this part for you). If your SQS consumer Lambda is getting errors because your DynamoDB capacity isn’t high enough, fail gracefully and let the message go back to the queue. If a message is bad, e.g. it is incorrectly formatted, use the Lambda function to move it to a DLQ and remove it from the main SQS queue.
Hope that helps,
Jeremy
Hello, Jeremy!
Thanks for the excellent post. I have a question, that you may be able to help me:
Scenario :
– We may have a high volume of messages on our queue.
– When a message is received it’s integrated with an endpoint which can cope with a limited amount of transactions per second.
– We need to authenticate with the endpoint before communicating (authentication token will be valid for 24 hours)
– The endpoint may not be available 24/7, having expected downtime due to maintenance sporadically.
We believe using SQS triggers we will be able to handle the high-volume Queue coping with the limited amount of transactions allowed by the endpoint.
Authentication probably won’t be an issue as we could keep the token between different Lambda executions, using the approach you described on https://www.jeremydaly.com/reuse-database-connections-aws-lambda/
Our question is how to cope with endpoint downtime as we don’t want to invoke our lambda function millions of times when there is a downtime on the endpoint (what would increase our costs). We would like to retry only after a period of time (We can avoid the communication with the endpoint by using variable values shared between lambda executions, however, we wouldn’t like to have our function triggered – as we would still pay for triggering the lambda function)
We thought about maybe when getting a service unavailable, trying in someway disable SQS trigger via lambda code (we can’t have manual work involved) and then created a scheduled event which would check the endpoint health enabling SQS trigger again.
Could you let me know, what do you think would be the best solution to our problem?
Thanks a lot for you help
Hi Mauricio,
Great question. If you have to achieve near-realtime throughput to the processing endpoint, then SQS triggers make sense. If you start to see a high number of failures (either via a CloudWatch alarm monitoring your queue or with some other error counter), you could trigger another Lambda function and use the updateEventSourceMapping method in the AWS SDK to “disable” the event source temporarily. Then a CloudWatch rule could run every few minutes to check whether the endpoint is live again and reenable the event source. There might be a bit of a delay with that method, but it could work.
What I might do instead, is use the circuit breaker pattern (explained here) to fail gracefully if the endpoint is down. Then send all new messages to a dead letter queue so that the Lambda function doesn’t keep hanging. Then you could periodically check the dead letter queue and work that down once your endpoint was live again.
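Here’s a rough sketch of toggling the event source with the SDK (untested; the function name is a placeholder, and your scheduled health check would call it with true to reenable):

```javascript
const AWS = require('aws-sdk')
const lambda = new AWS.Lambda()

const toggleTrigger = async (enabled) => {
  // Look up the SQS event source mapping for the consumer function
  let mappings = await lambda.listEventSourceMappings({
    FunctionName: 'my-sqs-consumer' // placeholder
  }).promise()

  // Disable (or reenable) polling on the queue
  await lambda.updateEventSourceMapping({
    UUID: mappings.EventSourceMappings[0].UUID,
    Enabled: enabled
  }).promise()
}
```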
Hope that gives you some ideas.
– Jeremy
Awesome post Jeremy. Exactly what I was looking for; it answers a lot of my questions already, between the post and the useful comments.

On setting a concurrency limit on my Lambda function: I was hoping the polling by the Lambda service on my SQS queue would be in accordance with the concurrency I set. Say I have a concurrency limit of 1. I was hoping the Lambda service would understand this and poll less frequently based on the concurrency limit, and maybe also on my function’s timeout period. But what I saw was the number of in-flight messages ballooning, and most of the messages having a very high receive count when they finally got processed. I guess this amounts to lots of polls by the Lambda service that will not be used by my Lambda function (I read, and also confirmed by experimenting, that throttles are considered failures). They all eventually get processed, but with lots of retries because of the concurrency limits. I was wondering if this is expected. The doc says polling right now scales up based on the number of in-flight messages: it scales up the polling and function concurrency until function concurrency hits the max. I have yet to check how low the poll frequency can get for a function with a concurrency of 1. It would be awesome if the Lambda service intelligently reduced the number of unwanted polls based on function throttling.

Any thoughts on whether Kinesis would be better suited for this? The doc clearly says the Lambda service will not pull new records from the Kinesis shards if the function is throttled.
Hi Vignesh,
Unfortunately, throttling is not well integrated with SQS polling. There is a good post here that explains how the scaling works. It also confirms that throttled requests will still incur charges for the SQS polling. Kinesis isn’t much different. Lambda will still continue to poll the stream and just keep failing until there is concurrency available to process the shard. Neither is preferable for LONG-TERM workloads. If you are consistently streaming events (in Kinesis or SQS) that exceed your throughput processing capabilities, then you’re going to end up with a continuous backlog.
My advice is to size your backend system capacity (e.g your RDS cluster) to handle typical load, and then rely on throttling as a means to handle traffic spikes. That will minimize hitting concurrency limits and allow you to process most requests in near realtime. If you DO NOT need near realtime, you might be better off batching data instead of trying to stream it.
Hope that helps,
Jeremy
Thanks Jeremy. And for workloads that aren’t near realtime, you mentioned it’s better to batch data instead of streaming it. What do you mean by batching data?
Hi Vignesh,
Rather than having consumers try to keep up with incoming data (and provide enough concurrency to handle the throughput), you could have a function that periodically polled the queue or Kinesis stream and attempted to load hundreds or thousands of messages at once. You have to make multiple calls to SQS since the maximum batch size is 10, but this could then reduce strain on downstream resources by limiting the number of simultaneous connections.
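Here’s a rough sketch of that kind of periodic batch consumer (untested; the queue URL, loop size, and processing logic are placeholders):

```javascript
const AWS = require('aws-sdk')
const SQS = new AWS.SQS()

const queue = 'https://sqs.us-east-1.amazonaws.com/XXXXXXXXXX/my-batch-queue' // placeholder

// Placeholder for your actual business logic
const processMessage = async (body) => { /* do some work */ }

exports.handler = async (event) => {
  // SQS caps each receive at 10 messages, so loop to work through a larger batch
  for (let i = 0; i < 50; i++) {
    let result = await SQS.receiveMessage({
      QueueUrl: queue,
      MaxNumberOfMessages: 10
    }).promise()
    if (!result.Messages || result.Messages.length === 0) break // queue is drained
    for (let message of result.Messages) {
      await processMessage(message.Body)
      await SQS.deleteMessage({
        QueueUrl: queue,
        ReceiptHandle: message.ReceiptHandle
      }).promise()
    }
  }
  return 'done'
}
```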
Hope that helps,
Jeremy
A bit of experimentation on SQS Lambda triggers, with throttling and concurrency. I was having issues with messages ending up in the DLQ.
Lambda: Concurrency = 2, BatchSize = 10
SQS: Receive Message Wait Time = 0, Delivery Delay = 0, Visibility Timeout = 30, DLQ set after 3 retries
Sent 10 SQS messages (in a for-loop one at a time, very close together but not batch).
What seems to happen here is many of the messages show up in flight immediately.
The end result is ~6 messages get processed by the lambda, ~4 end up in the DLQ.
The lambda throttle metric goes up.
Each lambda execution log shows only ONE record in event.Records.length – even if batchSize is 10.
If I bump the “Maximum Receives” before DLQ from 3 to 10, then all 10 are processed and none end up in the DLQ.
Some of the ‘later’ messages, (i.e. 6-10) have ApproximateReceiveCount = 5.
Ran another experiment where I set the lambda concurrency to 0 then all messages end up in DLQ.
If I turn off the SQS trigger on the lambda, but all 10 SQS messages in the queue then turn on the lambda SQS trigger then I successfully get event.Records.length === 10.
I guess it seems like the poller is running all the time and bringing messages in-flight too quickly. This ‘uses up’ my two function concurrency limit too quickly and that also means my lambda is called without batching too well (even with various delay/wait time settings tried on SQS queue). The follow-up messages continue to be re-received but since the lambda is throttled (and processing some message), after 3 attempts they end up in the DLQ.
Hi Eamonn,
As I suggest in my article, you should not rely on an automatic DLQ if you are using SQS triggers with function concurrency. The poller doesn’t interface well with the concurrency setting and will keep trying to process messages based on the volume of the queue, not the available processing concurrency. IF you need to limit concurrency, I suggest handling the DLQing yourself via your triggered Lambda.
In regards to the low batch sizes, this has to do with the number of messages in the queue. When I flooded the queue with thousands of messages, I often got 10 messages in a batch. If the queue is lower volume with a slower population rate, it is expected that the batches will be smaller.
Hope that helps,
Jeremy
hey,
thanks for the explanation.
I still don’t get one thing: SQS triggers the Lambda function once there is a visible message in the queue. After the message is delivered, it becomes invisible for the time we configured, while the service waits for a response from our function in order to delete it from the queue. If we don’t notify the queue, the message becomes available again. How can we notify SQS from our Lambda function that we handled the message and want to delete it from the queue?
thanks!
The message is invisible if you are trying to “read” it again, but you are still able to delete it using the message ID and the AWS SDK.
How do I get the message ID? My function is triggered automatically as an event source from SQS. I didn’t “pull” the message; it was sent to the handler (Lambda) that I configured. Thus, I don’t have a Message object.

I followed this article:
https://docs.aws.amazon.com/lambda/latest/dg/java-handler-using-predefined-interfaces.html
The event that triggers the Lambda function will have the messageId in it. See here: https://docs.aws.amazon.com/lambda/latest/dg/eventsources.html#eventsources-sqs

SQSClient deleteMessage only supports the receipt handle and not the messageId, and when I try to delete an SQSMessage in Lambda by its receipt handle I receive this error:
com.amazonaws.services.sqs.model.ReceiptHandleIsInvalidException: The input receipt handle is invalid. (Service: AmazonSQS; Status Code: 404; Error Code: ReceiptHandleIsInvalid; Request ID: …)
Hi Soheil,
That’s strange. I have no problem with the Node SDK deleting messages when pulling from the Lambda poller. Might you be manipulating the receiptHandle in some way that is causing it to fail?

– Jeremy
Hey Jeremy,
I’m having a problem just like this one here:
https://serverfault.com/questions/937950/lambda-scheduling-gets-stuck
In my case I have a Lambda trigger with 30 reserved concurrency, and I sent 1,000 messages to SQS. After a while it gets stuck, not processing any messages, and then after a while it starts processing again. So if my process duration is 1 second, it should take about 1,000 seconds to process them all, sending the first 30 and, as soon as some of those finish, initiating more, right? But in my case it’s taking 50x more time because of this pause between processes. =(
I was thinking that this behaviour has something to do with the visibility timeout, but I played this already with no sucess.
There’s an alternative for this kind of behaviour?
Thanks!
Great info, Pedro. I will do some investigating and see if I can reproduce this.
Here is how AWS describes the intended behavior (from https://docs.aws.amazon.com/lambda/latest/dg/scaling.html):
Poll-based event sources that are not stream-based: For Lambda functions that process Amazon SQS queues, AWS Lambda will automatically scale the polling on the queue until the maximum concurrency level is reached, where each message batch can be considered a single concurrent unit. AWS Lambda’s automatic scaling behavior is designed to keep polling costs low when a queue is empty while simultaneously enabling you to achieve high throughput when the queue is being used heavily.
When an Amazon SQS event source mapping is initially enabled, Lambda begins long-polling the Amazon SQS queue. Long polling helps reduce the cost of polling Amazon Simple Queue Service by reducing the number of empty responses, while providing optimal processing latency when messages arrive.
As the influx of messages to a queue increases, AWS Lambda automatically scales up polling activity until the number of concurrent function executions reaches 1000, the account concurrency limit, or the (optional) function concurrency limit, whichever is lower. Amazon Simple Queue Service supports an initial burst of 5 concurrent function invocations and increases concurrency by 60 concurrent invocations per minute.
I would expect maybe a bit of a delay, but what you’re describing seems like a bug. You should also report this behavior to the Lambda team at AWS.
– Jeremy
Hi Jeremy,
On December 6, 2018 you reported that “The “Lambda service” now takes concurrency into account and no longer considers throttled invocations to be failed delivery attempts.”.
Unfortunately I’m still seeing throttled invocations going to my DLQ with an SQS queue and a Lambda in us-east-1. I’m wondering whether there was some configuration you needed to change, or whether you are working in a different region that might have had an update applied which hasn’t been rolled out to us-east-1. Or maybe (hopefully) I’ve missed something else important, because I really want to use this capability but this issue is a real blocker! Any extra info you have about why you thought the issue had been resolved would be greatly appreciated.
FYI: The test case I’ve been working with is as follows:
Primary Queue:
– default visibility time out = 3 seconds
– redrive policy, max receives = 1
Lambda function
– timeout = 3 seconds
– code is node.js which just waits for 2.9 seconds before succeeding
– max concurrency = 1
So this test is designed to enforce that only one function executes at a time, and it takes the full amount of time available to the function, and the queue’s visibility timeout, but always succeeds, and therefore should never result in DLQ items.
My test:
– Add 20 items to the queue.
– Enable the SQS event source for the lambda
Observed results:
– All 20 messages moved to “Messages in Flight” pretty much straight away
– The Lambda monitoring reports 19 throttles
– After 3 seconds 1 message is successfully processed and 19 messages are in the dead letter queue.
– There are no errors / time outs logged to cloud watch
– There are no errors reported in the lambda monitoring
So the throttles still appear to be treated as failed invocations from what I can tell.
Hi James,
I’m not doing anything special and I’m in us-east-1. I’m going to run my experiments again and use your numbers above. I’ll report back soon.
– Jeremy
James,
I ran my experiments again and I was able to duplicate your issue using the same settings you had. I think this has to do with your low concurrency setting and maxReceiveCount. According to the documentation on Lambda throttling, “Amazon Simple Queue Service supports an initial burst of 5 concurrent function invocations and increases concurrency by 60 concurrent invocations per minute.” If you watch the “Messages in Flight” count as you process a large batch, this appears to be the case. Therefore, regardless of your concurrency setting, it will start with 5 concurrent ones. This, coupled with a maxReceiveCount of 1 and a low VisibilityTimeout, appears to conflict with the built-in throttling behavior of the SQS Poller.

I tried a number of combinations, and it appears that the best performance (with no DLQs) comes from a minimum concurrency of 5, a VisibilityTimeout of 30 seconds, and a maxReceiveCount of 3. I believe this has to do with some of the limitations of the service, namely that as much as it may try to throttle the polling, it’s possible that it will occasionally requeue a message. With a maxReceiveCount of only 1, this immediately sends the message to a DLQ. The same holds true with the VisibilityTimeout. When the poller oversubscribes, the throttled messages need to wait longer in flight to be processed. Hence a higher likelihood of requeuing and eventual DLQ.

So it appears that these settings must be tweaked for your use case, which is a bit frustrating. Throttling is a popular reason to use a queue, and if you can’t find the right settings, you may need to go back to handling your own DLQ as I had originally suggested. I’m going to bring this up with the Lambda team and see if I can get some more clarity.
– Jeremy
Hi Jeremy,
The behaviour of SQS seems a little counter-intuitive when using multiple Lambda triggers. When creating a single message, only the first Lambda seems to actually be triggered. This is not clear from the UI, which will happily allow you to create multiple Lambda triggers with the suggestion that they will be processed in parallel, which doesn’t appear to be the case. The workaround appears to be to use Kinesis or SNS, which in my opinion somewhat negates using SQS. Perhaps the clue is in the name: “simple” queue service, but I do think that the docs could make the behaviour a bit clearer. What’s your experience of using SQS with multiple Lambda triggers?
Hi Mark,
It is strange how you can have multiple Lambda functions subscribed to the same SQS queue. Once a message is “in flight”, it is no longer available, so if you have multiple Lambdas consuming the same queue, the first one to grab the message wins. If you need more than one Lambda to process a message, then SNS, Kinesis, or EventBridge would be your best bet.
Hi Jeremy,
I have one quick question. I’m attempting to append new records to a single S3 object, and I want to avoid concurrent writes to the S3 object, e.g. avoid two or more Lambdas writing new records to the same S3 object which would cause a race condition. My first idea was to use Firehose, but Firehose can’t use a single S3 object as an output. So my second idea was to use SQS, where each new record is added to SQS and then a Lambda would be triggered that would add all the new records to the single S3 object. The problem I’m seeing with SQS is that if I add 4 records close to each other, then I will see 4 Lambda invocations, so basically 4 Lambdas will concurrently write a single record to the same S3 object. Is there a way to have SQS wait for a few seconds for more messages before sending it in a batch to Lambda? I tried playing with the SQS Receive Message Wait Time and Delivery Delay, but it still doesn’t seem to send the messages in a batch, even if the Batch Size is set to 10 on the trigger.
My last idea is basically to run Lambda every minute that will process all the SQS messages, but I would much rather have it only run when there are messages, and every 10 seconds or so instead of every minute.
Any ideas on what approach I could take to solve this? Thanks!
Hi Jon,
How are you appending new records to an S3 object exactly? If the throughput supports it, you could use a single Kinesis shard to process each record in order, as opposed to having multiple Lambdas triggered with SQS. The other option (and this depends on your S3 appending technique) would be to upload separate objects with SQS, and then run some sort of sweep process every so often to combine those into your main S3 file.
– Jeremy
Hi Jeremy,
Thanks for this post (and all of your work). I’m trying to set up the same experiment you did, but I don’t understand the results. I’m setting the concurrency to 1, but I see four functions being executed in the CloudWatch events. How do I know that just one of them is executed at a time? By looking at the start and end times? Why is AWS using four different Lambdas to do this job?
Thanks!
Hi Vicenç,
What are your other settings for things like redrive, message visibility, function timeout, and batch size? Throttled Lambdas with SQS need a minimum concurrency setting of 5, otherwise the polling service will too aggressively take messages from the queue.
– Jeremy
Hi Jeremy,
Thanks for the post. Not 100% related to this article, but I’m hoping you can help me anyway. My queue worker needs to run for about 2 minutes. I’m happy with this time as there is lots to do. However, I’m not sure how to configure things so that the queue allows such a long time for the Lambda worker function. Instead, what I’m getting is retries.
Thanks, Rishad