Serverless Consumers with Lambda and SQS Triggers

On Wednesday, June 27, 2018, Amazon Web Services released SQS triggers for Lambda functions. Those of you who have been building serverless applications with AWS Lambda probably know how big of a deal this is. Until now, the AWS Simple Queue Service (SQS) was generally a pain to deal with for serverless applications. Communicating with SQS is simple and straightforward, but there was no way to automatically consume messages without implementing a series of hacks. In general, these hacks “worked” and were fairly manageable. However, as your services became more complex, dealing with concurrency and managing fan out made your applications brittle and error prone. SQS triggers solve all of these problems. 👊


Attaching Consumers to Message Brokers

It’s a common architecture design pattern to attach consumers to message brokers in distributed applications. This becomes even more important with microservices, since it allows communication between different components. Depending on the type of work to be done (high versus low priority), message brokers can be passive, simply storing messages and waiting for message-oriented middleware to poll them, or active, routing messages to consumers for you.

RabbitMQ, for example, allows you to create bindings that attach workers to queues. You can run workers in the background with something like supervisor, which will get messages pushed to them as RabbitMQ receives them. Until now, SQS has lacked the ability to do this type of “push” and instead required constant polling to achieve a similar effect. This constant polling might make sense for high volume queues, but for smaller, occasional jobs, even running a Lambda function every minute would be a waste of resources.

Setting up an SQS trigger in Lambda is simple through the AWS Console. SAM templates also support it (https://github.com/becloudway/aws-lambda-sqs-sam), and the team at Serverless is working on adding support as well. The only required settings are the queue you want to consume and the “batch size”, which is the maximum number of messages that will be read from your queue at once (up to 10 at a time). Be sure to configure your IAM permissions properly: the function’s execution role needs both read and write privileges on the queue.
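If you’d rather script the setup than click through the console, a minimal sketch with the AWS SDK for JavaScript might look like this (the queue ARN, function name, and region are placeholders, not values from the post):

```javascript
// Sketch: attaching an SQS queue to a Lambda function programmatically.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda({ region: 'us-east-1' });

lambda.createEventSourceMapping({
  EventSourceArn: 'arn:aws:sqs:us-east-1:123456789012:my-queue', // placeholder queue ARN
  FunctionName: 'myReceiverFunction',                            // placeholder function name
  BatchSize: 10,  // maximum messages handed to each invocation (1-10 for SQS)
  Enabled: true
}).promise()
  .then((mapping) => console.log('Created event source mapping:', mapping.UUID))
  .catch(console.error);

// The function's execution role needs sqs:ReceiveMessage, sqs:DeleteMessage,
// and sqs:GetQueueAttributes on the queue.
```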

Running Some Experiments

I set up two test functions to run some experiments. The first was the function that was triggered by SQS and received the messages:
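The original snippet isn’t reproduced here, but a minimal receiver along these lines matches the description (the logging is purely illustrative):

```javascript
// Sketch of the SQS-triggered receiver.
exports.handler = async (event) => {
  // Each invocation receives up to "batch size" messages in event.Records.
  for (const record of event.Records) {
    console.log('Received message:', record.messageId, record.body);
  }
  // Returning without throwing tells the Lambda service the batch succeeded,
  // so it deletes the messages from the queue for you.
  return `Processed ${event.Records.length} messages`;
};
```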

The second was a “queue flooder” that just generated random messages and sent them to the queue. Remember that SQS can only handle batches of 10:
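Again, a sketch rather than the original code; the queue URL is a placeholder, and sendMessageBatch accepts at most 10 entries per call:

```javascript
// Sketch of the "queue flooder".
const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ region: 'us-east-1' });
const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'; // placeholder

exports.handler = async () => {
  const total = 500;
  for (let i = 0; i < total; i += 10) {
    const entries = Array.from({ length: 10 }, (_, j) => ({
      Id: `msg-${i + j}`,
      MessageBody: JSON.stringify({ index: i + j, value: Math.random() })
    }));
    await sqs.sendMessageBatch({ QueueUrl: QUEUE_URL, Entries: entries }).promise();
  }
  return `Sent ${total} messages`;
};
```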

I then ran my queue flooder. It sent 500 messages to SQS, which triggered my receiver function and drained the queue! 🙌

The event looks like this:
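The original payload isn’t reproduced here, but SQS events delivered to Lambda generally have this shape (the values below are illustrative):

```json
{
  "Records": [
    {
      "messageId": "059f36b4-87a3-44ab-83d2-661975830a7d",
      "receiptHandle": "AQEBwJnKyrHigUMZj6rYigCgxlaS3SLy0a...",
      "body": "{\"index\":0,\"value\":0.8675309}",
      "attributes": {
        "ApproximateReceiveCount": "1",
        "SentTimestamp": "1530086400000",
        "SenderId": "AIDAIENQZJOLO23YVJ4VO",
        "ApproximateFirstReceiveTimestamp": "1530086400123"
      },
      "messageAttributes": {},
      "md5OfBody": "e4e68fb7bd0e697a0ae8f1bb342846b3",
      "eventSource": "aws:sqs",
      "eventSourceARN": "arn:aws:sqs:us-east-1:123456789012:my-queue",
      "awsRegion": "us-east-1"
    }
  ]
}
```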

The SQS trigger spawned 5 concurrent Lambda functions that took less than 2 seconds to process all of the messages. Pretty sweet! This got me thinking about how Lambda would handle thousands of messages at once. Would it spawn hundreds of concurrent functions? 🤔

Concurrency Control for Managing Throughput

My first thought was that Lambda’s default scaling behavior would continue to fan out to process messages in the queue. This behavior makes sense if you’re doing something like processing images and there are no bottlenecks to contend with. However, what if you’re writing to a database, calling an external API, or throttling requests to some other backend service? Unlimited scaling is sure to cause issues if hundreds or thousands of Lambda functions are competing for the same resource. Luckily for us, we have concurrency control! 🤘🏻
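Reserved concurrency can be set from the console, or with a quick call like this sketch (the function name is a placeholder):

```javascript
// Sketch: capping the receiver's concurrency so it can't overwhelm downstream resources.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda({ region: 'us-east-1' });

lambda.putFunctionConcurrency({
  FunctionName: 'myReceiverFunction',  // placeholder function name
  ReservedConcurrentExecutions: 1      // at most one copy of the function runs at a time
}).promise()
  .then(() => console.log('Concurrency limit set'))
  .catch(console.error);
```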

I ran a few experiments to see exactly how Lambda handles throttled requests. I used my queue flooder again, set the concurrency to 1, and sent 100 messages to the queue. Here are the log results:

It ultimately used two functions to handle the workload, but they executed serially, never exceeding my concurrency limit. As you can see, 100 messages were eventually processed over the course of 40 total invocations. Brilliant! 😀

I ran a second experiment to scale up my workload. I set the concurrency to 2 and flooded the queue with 200 messages this time.

Four functions in total were used, with only two running concurrently. And all 200 messages were processed successfully! 😎

Additional experiments with concurrency and redrive policies

I had a really great comment that pointed out how adding redrive policies to SQS queues causes issues when you set a low concurrency. I ran some additional tests and discovered some really interesting behaviors.

It appears that the “Lambda service” polls the queue, and puts messages “in flight” without consideration of the concurrency limits. I added a 100ms wait time to my receiver function and noticed that messages towards the end of processing have ApproximateReceiveCounts of 4 or more. I’m sure this would be exacerbated by longer execution times and higher message volumes. The good news is that each message was only processed by my function one time.

I then set a redrive policy of 1 Maximum Receive and ran my “queue flooder” again. This time a large percentage of the messages ended up in the Dead Letter Queue. This makes sense given how the Lambda service polls for messages. But now this got me thinking about error handling. So I added some code to throw an error in my receiver (throw new Error('some error')) for every tenth request, which turned out to be a really bad idea! I set my redrive policy to 1,000 Maximum Receives and then sent 100 messages into the queue. 90 of the messages got processed (as expected), but the other 10 just kept on firing my Lambda function. Over and over again. 😳 I assume that it would have eventually stopped after those 10 messages had been tried 1,000 times each.

I think the concurrency control needs a bit of work. However, I believe that Lambda triggers are still usable if you manage the redrive policy yourself. If returning errors from the Lambda function doesn’t tell the Lambda service to DLQ the message, then perhaps this just needs to be handled by our receiver functions. I don’t think I really care how many times a message needs to be retried as long as it is eventually processed. If there is an issue with the message (e.g. it throws some error and needs to be inspected later), then I can handle that in a number of ways, including pushing it to a DLQ myself or logging it somewhere else.

It probably comes down to your particular use case, but throttling still seems possible, so long as you don’t rely on SQS’s built in redrive functionality.
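If you go the handle-it-yourself route, the receiver might catch per-message errors and push the bad ones to its own DLQ, roughly like this sketch (the DLQ URL and the processMessage helper are placeholders):

```javascript
// Sketch: per-message error handling instead of relying on SQS's redrive policy.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ region: 'us-east-1' });
const MY_DLQ_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq'; // placeholder

exports.handler = async (event) => {
  for (const record of event.Records) {
    try {
      await processMessage(record); // hypothetical business logic
    } catch (err) {
      // Park the bad message in our own DLQ and keep going,
      // so one failure doesn't requeue (and retry) the entire batch.
      await sqs.sendMessage({
        QueueUrl: MY_DLQ_URL,
        MessageBody: record.body,
        MessageAttributes: {
          error: { DataType: 'String', StringValue: err.message }
        }
      }).promise();
    }
  }
  // Return normally so the Lambda service deletes the whole batch from the source queue.
  return `Processed ${event.Records.length} messages`;
};
```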

This is a Game Changer 🚀

As you probably can tell, I’m a bit excited about this. Being able to use the Simple Queue Service as a true message broker that you can attach consumers to changes the way we’ll build serverless applications. One of the biggest challenges has been throttling backend resources to avoid hitting service limits. With SQS triggers and concurrency control we can now offload expensive (or service limited) jobs without the need to hack something with scheduled tasks or CloudWatch log triggers. There’s no longer a need to manage our own fan out operations to process queues. We can now simply choose how many “workers” we want processing our queued requests.

Think of the possibilities! We can trigger SQS from Dead Letter Queues to create our own redrive processes. We could queue data mutations, knowing that they will be processed almost instantaneously by our workers (reducing latency and costs of API responses). We can batch process and throttle remote API calls. My brain is spinning. 🤯

The team at AWS has once again pushed the envelope for serverless computing. Congrats! Know that your work is appreciated. You’re awesome. 🙇‍♂️

Update: Read the official announcement by AWS.


Interested in learning more about serverless? Check out 10 Things You Need To Know When Building Serverless Applications to jumpstart your serverless knowledge.

If you want to learn more about serverless security, read my posts Securing Serverless: A Newbie’s Guide and Event Injection: A New Serverless Attack Vector.



Did you like this post? 👍  Do you want more? 🙌  Follow me on Twitter or check out some of the projects I’m working on. You can sign up for my WEEKLY newsletter too. You'll get links to my new posts (like this one), industry happenings, project updates and much more! 📪


24 thoughts on “Serverless Consumers with Lambda and SQS Triggers”

  1. When you set the concurrency limit on the lambdas reading from the SQS, did their throttling metric increase during the flooder test, or did they just not try and spin up new lambdas beyond their limit without giving any errors about concurrency limits being hit?

    1. The throttling metric did go up, but it didn’t throw any errors. It appears that they’ve designed it to handle concurrency so that it will process messages serially as the queue is worked down.

  2. Thanks for such a great post. I have a really dumb question that I think you answered several different ways already, but for clarification:

    Does setting a concurrency limit on the Lambda fn mean that the queue will still be fully drained, just only as quickly as the Lambda functions can work? Is the below example what you’re saying:

    ===

    – Silly Lambda fn: “sleep(1000).then(succeed)”
    – Lambda fn triggered by SQS – batch size 1
    – Lambda fn set to concurrency of 1
    – Add 100 messages to queue
    – SQS queue is drained after 100 invocations, approximately 100 seconds

    (Adjusting to concurrency 2 is approximately 50s, etc.)

    ===

    The “amount of time” isn’t necessarily my concern, but perhaps it helps illustrate “work” happening — I’m mainly curious how the queue gets drained. We’re currently polling SQS and then, if there are messages, invoking other Lambdas so we don’t DDOS ourselves.

    But what I *think* I’m excited about here is that we don’t have to do gymnastics to drain the queue — it Just Works(TM). Am I understanding correctly?

    1. Hi Tom,

      Great question. The docs aren’t very specific (https://docs.aws.amazon.com/lambda/latest/dg/invoking-lambda-function.html#supported-event-source-sqs), but I’m pretty sure my tests confirm your assumptions.

      According to Amazon’s announcement:
      “The Lambda service monitors the number of inflight messages, and when it detects that this number is trending up, it will increase the polling frequency by 20 ReceiveMessage requests per minute and the function concurrency by 60 calls per minute. As long as the queue remains busy it will continue to scale until it hits the function concurrency limits. As the number of inflight messages trends down Lambda will reduce the polling frequency by 10 ReceiveMessage requests per minute and decrease the concurrency used to invoke our function by 30 calls per-minute.”

      What I gather from that is that the Lambda service is still technically polling SQS (just more efficiently than we could). So if you hit your concurrency limit, it will process as many messages as possible with the available functions, and then when a function completes, it will invoke the function again and again until all messages are drained.

      – Jeremy

  3. Great post! Really good timing too!

    Question. So far, I’ve been using SNS to do this kind of job. I publish to an SNS topic and that triggers a Lambda function. If a backend service has rate limits, I just tune the concurrency limits. Could you maybe explain the differences between using SNS and SQS with Lambda?

    Thanks!

    1. Hi Sebastian,

      That is a really great question. The difference between SQS and SNS comes down to retry behavior (more info here). SNS events are not stream-based and are invoked asynchronously. This means that after 3 attempts, the message will fail and either be discarded or sent to a Dead Letter Queue. If you are setting concurrency limits on Lambdas that are processing SNS messages, it is possible that high volumes could cause messages to fail 3 times and therefore be discarded.

      SQS, on the other hand, is a poll-based event source that is not stream-based. This means it will continue to poll the SQS queue with the resources (number of functions) available to it. Even if you set a concurrency limit on your functions, the messages will simply remain in the queue until there is a function available to process them. This is different from stream-based events, since those are BLOCKING, whereas SQS polling is not. According to the updated documentation: “If you don’t require ordered processing of events, the advantage of using Amazon SQS queues is that AWS Lambda will continue to process new messages, regardless of a failed invocation of a previous message. In other words, processing of new messages will not be blocked.”

      Another important factor is the retry intervals. If an SNS message fails due to concurrency issues, the message is tried again “with delays between retries”. This means that your throttling could significantly delay an SNS message getting processed by your function. SQS would just keep on chugging through the queue as soon as it had capacity to handle more messages.

      Hope that makes sense!
      – Jeremy

  4. Hi Jeremy,

    Thanks for a great post. I have a question though.

    You wrote in one of the comments “What I gather from that is that the Lambda service is still technically polling SQS (just more efficiently than we could). ”

    Does it mean that our function is being executed in some (dynamic) time intervals and we PAY for each invocation no matter if there were messages in the queue or not?

    My next question regards notifying SQS when you’re done with the message. I don’t see any code in your first function for deleting a message from a queue. I understand that you need to do that after consuming it, right?

    1. Hi Pawel,

      According to the SQS event documentation, you can “configure AWS Lambda to poll for these messages as they arrive and then pass the event to a Lambda function invocation.” The announcement also states that the “Lambda service monitors the number of inflight messages”. I don’t know exactly what they are doing internally, but my tests show that the Lambda functions begin consuming messages almost instantaneously once they’re added to the queue. To answer your question, you DO NOT pay for this monitoring service; it only counts as an “invocation” when messages are passed to your Lambda.

      One of the other great things about this service is that you do not need to manage messages in SQS directly. The monitoring service (or whatever AWS calls it) grabs messages from the queue and passes them to your Lambda function. If your function executes successfully, the service will remove the messages for you. If your function fails or takes longer to execute than the queue’s visibility timeout, the message will either be returned to the queue or sent to a Dead Letter Queue.

      Hope that helps,
      Jeremy

  5. Your experience with concurrency is a lot different than what I am observing.

    When I set the concurrency to 1 and drop multiple items into the queue, the “Throttled invocations” metric goes up and those messages that triggered the throttling never get processed, but instead go straight to my dead queue. Example: I drop 5 things into the queue. Lambda only ever tries to process 2 of them, I get 3 throttled invocations, and 3 items end up in my dead queue.

    This is with a queue redrive policy set to have maximum receives of 1, so that behavior is expected if throttling counts as a failed processing attempt. But the way you describe it is that it shouldn’t count as an error and will intelligently process the messages serially, which from my experience is not the case.

    1. Dave,

      Thanks for the comment. I ran some additional tests and got some very interesting findings.

      First of all, you are right about the redrive policies. It appears that the “Lambda service” polls the queue, and puts messages “in flight” without consideration of concurrency limits set. I added a 100ms wait time to my receiver function and noticed that messages towards the end of processing have ApproximateReceiveCounts of 4 or more. I’m sure this would be exacerbated by longer execution times and higher message volumes. The good news is that each message was only processed by my function one time.

      When I set a redrive policy, like you mention, then (depending on the execution time and message volume) a large percentage of the messages end up in the Dead Letter Queue. This got me thinking about error handling. So I added some code to throw an error in my receiver (throw new Error('some error')) for every tenth request, which turned out to be a really bad idea! I set my redrive policy to 1,000 Maximum Receives and then sent 100 messages into the queue. 90 of the messages got processed (as expected), but the other 10 just kept on firing my Lambda function. Over and over again. I assume that it would have eventually stopped after those 10 messages had been tried 1,000 times each.

      I think the concurrency control needs a bit of work. However, I believe that Lambda triggers are still usable if you manage the redrive policy yourself. If returning errors from the Lambda function doesn’t tell the Lambda service to DLQ the message, then perhaps this just needs to be handled by our receiver functions. I don’t think I really care how many times a message needs to be retried as long as it eventually gets processed. If there is an issue with the message (e.g. it throws some error and needs to be inspected later), then I can handle that in a number of ways, including pushing it to a DLQ or logging it somewhere else.

      It probably comes down to your particular use case, but throttling still seems possible, so long as you don’t rely on SQS’s built in redrive functionality.

      Good luck,
      Jeremy

  6. As you mention above, “if your function executes successfully, the service will remove the messages for you.” Otherwise they’re all re-queued.

    I’m wondering if you can avoid re-queueing/processing the whole batch of received messages (since it will send up to 10 at a time by default) if you’re unable to process, say, the 7th message in the batch.

    One suggestion I’ve seen is to have your Lambda manually delete messages in the batch as they’re successfully handled. You’d then get automagic retries on any failed message and those not yet processed, without having to re-process those that were already successfully handled.

    That suggestion is made in this article: https://serverless.com/blog/aws-lambda-sqs-serverless-integration/#batch-size-and-error-handling-with-the-sqs-integration

    However, I haven’t yet verified this empirically or seen any other mentions of the technique (a rough sketch of it follows below).
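    Assuming a hypothetical processMessage helper and a placeholder queue URL, that delete-as-you-go approach could look something like this:

    ```javascript
    // Sketch: delete each message as soon as it's handled, then throw if anything failed,
    // so only the unhandled messages are retried.
    const AWS = require('aws-sdk');
    const sqs = new AWS.SQS({ region: 'us-east-1' });
    const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'; // placeholder

    exports.handler = async (event) => {
      let failures = 0;
      for (const record of event.Records) {
        try {
          await processMessage(record); // hypothetical business logic
          await sqs.deleteMessage({
            QueueUrl: QUEUE_URL,
            ReceiptHandle: record.receiptHandle
          }).promise();
        } catch (err) {
          failures++;
        }
      }
      if (failures > 0) {
        // Failing the invocation returns only the undeleted messages to the queue.
        throw new Error(`${failures} message(s) failed and will be retried`);
      }
    };
    ```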

    1. As long as the function doesn’t error, the messages are removed from the queue. I think the better thing to do with batches is to configure your function to write to your own DLQ if a message isn’t processed correctly. This way you avoid requeuing all the messages. Not the ideal scenario, since you lose out on automatic retries, but you could configure a trigger on the DLQ as well.

  7. It’s great! But I have a question. I’m just beginning to study and build projects with AWS services, so let me ask something I’d like to know. The SQS trigger spawned 5 concurrent Lambda functions, so I received 5 event record payloads when I pushed data to the queue. I want to send bulk email as a campaign using the AWS SES service without looping. How can I do that? Please give me some suggestions.

    1. Hi Steven,

      If you only want to receive ONE message per Lambda invocation, then you need to set the “Batch Size” to 1 when you configure your SQS trigger. This may result in more concurrent Lambdas, unless you set an invocation throttle.

      However, I would suggest (especially if you are building an app that sends messages via SES), to set your batch size to 10 and use the sendBulkTemplatedEmail method from the SDK (https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/SES.html#sendBulkTemplatedEmail-property). This would allow you to reduce the number of Lambda invocations (saving you money) and reduce the number of calls to SES (which would be much faster).
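      A rough sketch of that batched approach, assuming the v2 JavaScript SDK, a hypothetical SES template, a placeholder sender address, and a message body like {"email":"...","name":"..."}:

      ```javascript
      // Sketch: turn a batch of up to 10 SQS messages into a single SES bulk send.
      const AWS = require('aws-sdk');
      const ses = new AWS.SES({ region: 'us-east-1' });

      exports.handler = async (event) => {
        const destinations = event.Records.map((record) => {
          const msg = JSON.parse(record.body); // assumed message format
          return {
            Destination: { ToAddresses: [msg.email] },
            ReplacementTemplateData: JSON.stringify({ name: msg.name })
          };
        });

        await ses.sendBulkTemplatedEmail({
          Source: 'campaigns@example.com',   // placeholder sender
          Template: 'MyCampaignTemplate',    // placeholder SES template
          DefaultTemplateData: JSON.stringify({ name: 'friend' }),
          Destinations: destinations
        }).promise();

        return `Sent ${destinations.length} emails`;
      };
      ```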

      Hope that helps,
      Jeremy

  8. I’m having a hard time finding any specific examples/docs about best practices for configuring DLQs in conjunction with SQS queues that are being worked by Lambda functions. I want to be able to scale my DynamoDb r/w capacity units based on when there are messages in the queue, which means if anything goes wrong, I don’t want failing messages to stay in the main queue after being retried a few times. Maybe it’s supposed to be obvious but it’s not clear to me how DLQ configs play with the “Lambda service”. Any tips?

    1. Hi Andrew,

      SQS Triggers with Lambda are still a bit new, so there is still quite a bit of experimentation going on. As I mention in my post, if you are throttling at all, don’t use DLQs directly with SQS. The throttling is likely to DLQ messages that are valid.

      My suggestion would be to use a Lambda function along with an SQS trigger. You could set a CloudWatch alarm to monitor your SQS queue that would then trigger a Lambda function that could scale your DynamoDB throughput (although I think auto-scaling would solve this part for you). If your SQS consumer Lambda is getting errors because your DynamoDB capacity isn’t high enough, fail gracefully and let the message go back to the queue. If a message is bad, e.g. it is incorrectly formatted, use the Lambda function to move it to a DLQ and remove it from the main SQS queue.

      Hope that helps,
      Jeremy

  9. Hello, Jeremy!
    Thanks for the excellent post. I have a question, that you may be able to help me:
    Scenario :
    – We may have a high volume of messages on our queue.
    – When a message is received it’s integrated with an endpoint which can cope with a limited amount of transactions per second.
    – We need to authenticate with the endpoint before communicating (authentication token will be valid for 24 hours)
    – The endpoint may not be available 24/7, having expected downtime due to maintenance sporadically.

    We believe that using SQS triggers we will be able to handle the high-volume queue while coping with the limited number of transactions allowed by the endpoint.
    Authentication probably won’t be an issue as we could keep the token between different Lambda executions, using the approach you described on https://www.jeremydaly.com/reuse-database-connections-aws-lambda/
    Our question is how to cope with endpoint downtime, as we don’t want to invoke our Lambda function millions of times while the endpoint is down (which would increase our costs). We would like to retry only after a period of time. (We could avoid communicating with the endpoint by using variable values shared between Lambda executions; however, we wouldn’t like to have our function triggered at all, as we would still pay for each Lambda invocation.)
    We thought that when we get a “service unavailable” response, we could somehow disable the SQS trigger via Lambda code (we can’t have manual work involved) and then create a scheduled event that would check the endpoint health and re-enable the SQS trigger.

    Could you let me know what you think would be the best solution to our problem?

    Thanks a lot for your help

    1. Hi Mauricio,

      Great question. If you have to achieve near realtime throughput to the processing endpoint, then SQS triggers make sense. If you start to see a high number of failures (either via a CloudWatch alarm monitoring your queue or with some other error counter), you could trigger another Lambda function and use the updateEventSourceMapping method in the AWS SDK to “disable” the event source temporarily (a rough sketch is below). Then a CloudWatch rule that runs every few minutes could check whether the endpoint is live again and re-enable the event source. There might be a bit of a delay with that method, but it could work.
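      A rough sketch of toggling the event source with the SDK (the mapping UUID is a placeholder; you get it when the trigger is created or from listEventSourceMappings):

      ```javascript
      // Sketch: disable the SQS trigger while the endpoint is down, re-enable it later.
      const AWS = require('aws-sdk');
      const lambda = new AWS.Lambda({ region: 'us-east-1' });
      const MAPPING_UUID = '12345678-1234-1234-1234-123456789012'; // placeholder

      async function setTriggerState(enabled) {
        return lambda.updateEventSourceMapping({
          UUID: MAPPING_UUID,
          Enabled: enabled
        }).promise();
      }

      // e.g. await setTriggerState(false) on repeated failures,
      // and await setTriggerState(true) from a scheduled health-check function.
      ```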

      What I might do instead, is use the circuit breaker pattern (explained here) to fail gracefully if the endpoint is down. Then send all new messages to a dead letter queue so that the Lambda function doesn’t keep hanging. Then you could periodically check the dead letter queue and work that down once your endpoint was live again.

      Hope that gives you some ideas.

      – Jeremy

  10. Awesome post Jeremy. Exactly what I was looking for. It answers a lot of my questions already, between the post and the useful comments. So, on setting a concurrency limit on my Lambda function: I was hoping the polling by the AWS Lambda service on my SQS queue would be in accordance with the concurrency I set. Say I have a concurrency limit of 1. I was hoping the Lambda service would understand this and poll less frequently based on the concurrency limit, and maybe also my function’s timeout period. But what I saw was the number of in-flight messages bloating, and most of the messages having a very high receive count when they finally get processed. This, I guess, can be considered a lot of polls by the Lambda service that will not be used by my Lambda function (I read, and also confirmed by experimenting, that throttles are considered failures). They all will eventually be processed, but with lots of retries because of the concurrency limits. I was wondering if this was expected. The docs say polling right now scales up based on the number of in-flight messages; it scales up the polling and function concurrency until the function concurrency hits the max. I have yet to check how low the poll frequency can get for a function with concurrency 1. It would be awesome if the Lambda service intelligently reduced the number of unwanted polls based on function throttling.

    Your thoughts on whether Kinesis would be better suited for this? The docs clearly say the Lambda service will not pull new records from the Kinesis shards if the function is throttled.

    1. Hi Vignesh,

      Unfortunately, throttling is not well integrated with SQS polling. There is a good post here that explains how the scaling works. It also confirms that throttled requests will still incur charges for the SQS polling. Kinesis isn’t much different. Lambda will still continue to poll the stream and just keep failing until there is concurrency available to process the shard. Neither is preferable for LONG-TERM workloads. If you are consistently streaming events (in Kinesis or SQS) that exceed your throughput processing capabilities, then you’re going to end up with a continuous backlog.

      My advice is to size your backend system capacity (e.g. your RDS cluster) to handle typical load, and then rely on throttling as a means to handle traffic spikes. That will minimize hitting concurrency limits and allow you to process most requests in near realtime. If you DO NOT need near realtime, you might be better off batching data instead of trying to stream it.

      Hope that helps,
      Jeremy

    2. Thanks Jeremy. And for workloads that are not near realtime, you mentioned it’s better to batch data instead of streaming it. What do you mean by batching data?

    3. Hi Vignesh,

      Rather than having consumers try to keep up with incoming data (and provide enough concurrency to handle the throughput), you could have a function that periodically polled the queue or Kinesis stream and attempted to load hundreds or thousands of messages at once. You have to make multiple calls to SQS since the maximum batch size is 10, but this could then reduce strain on downstream resources by limiting the number of simultaneous connections.
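      As a rough sketch (the queue URL and the processBatch helper are placeholders), a scheduled function could drain the queue in larger chunks like this:

      ```javascript
      // Sketch: pull up to 1,000 messages per run, 10 at a time, and delete them after processing.
      const AWS = require('aws-sdk');
      const sqs = new AWS.SQS({ region: 'us-east-1' });
      const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'; // placeholder

      exports.handler = async () => {
        let processed = 0;
        for (let i = 0; i < 100; i++) {
          const { Messages } = await sqs.receiveMessage({
            QueueUrl: QUEUE_URL,
            MaxNumberOfMessages: 10
          }).promise();
          if (!Messages || Messages.length === 0) break; // queue is empty

          await processBatch(Messages); // hypothetical bulk write to the downstream resource

          // We're polling ourselves, so we also delete the messages ourselves.
          await sqs.deleteMessageBatch({
            QueueUrl: QUEUE_URL,
            Entries: Messages.map((m) => ({ Id: m.MessageId, ReceiptHandle: m.ReceiptHandle }))
          }).promise();
          processed += Messages.length;
        }
        return `Processed ${processed} messages`;
      };
      ```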

      Hope that helps,
      Jeremy

  11. A bit of experimentation on SQS Lambda triggers, with throttling and concurrency. I was having issues with messages ending up in the DLQ.

    Lambda: Concurrency = 2, BatchSize = 10
    SQS: Receive Message Wait Time = 0, Delivery Delay = 0, Visibility Timeout = 30, DLQ set after 3 retries

    Sent 10 SQS messages (in a for-loop one at a time, very close together, but not batched).
    What seems to happen here is many of the messages show up in flight immediately.
    The end result is ~6 messages get processed by the lambda, ~4 end up in the DLQ.
    The lambda throttle metric goes up.
    Each lambda execution log shows only ONE record in event.Records.length – even if batchSize is 10.

    If I bump the “Maximum Receives” before DLQ from 3 to 10, then all 10 are processed and none end up in the DLQ.
    Some of the ‘later’ messages, (i.e. 6-10) have ApproximateReceiveCount = 5.

    Ran another experiment where I set the lambda concurrency to 0 then all messages end up in DLQ.

    If I turn off the SQS trigger on the Lambda, put all 10 SQS messages in the queue, and then turn the SQS trigger back on, then I successfully get event.Records.length === 10.

    I guess it seems like the poller is running all the time and bringing messages in-flight too quickly. This ‘uses up’ my concurrency limit of two too quickly, and it also means my Lambda is called without batching very well (even with various delay/wait time settings tried on the SQS queue). The follow-up messages continue to be re-received, but since the Lambda is throttled (and processing some message), after 3 attempts they end up in the DLQ.

    1. Hi Eamonn,

      As I suggest in my article, you should not rely on an automatic DLQ if you are using SQS triggers with function concurrency. The poller doesn’t interface well with the concurrency setting and will keep trying to process messages based on the volume of the queue, not the available processing concurrency. IF you need to limit concurrency, I suggest handling the DLQing yourself via your triggered Lambda.

      In regard to the low batch sizes, this has to do with the number of messages in the queue. When I flooded the queue with thousands of messages, I often got 10 messages in a batch. If the queue is lower volume with a slower population rate, it is expected that the batches would be smaller.

      Hope that helps,
      Jeremy
