Serverless Consumers with Lambda and SQS Triggers

On Wednesday, June 27, 2018, Amazon Web Services released SQS triggers for Lambda functions. Those of you who have been building serverless applications with AWS Lambda probably know how big a deal this is. Until now, the AWS Simple Queue Service (SQS) was generally a pain to deal with in serverless applications. Communicating with SQS is simple and straightforward, but there was no way to automatically consume messages without implementing a series of hacks. In general, these hacks “worked” and were fairly manageable. However, as your services became more complex, dealing with concurrency and managing fan-out made your applications brittle and error-prone. SQS triggers solve all of these problems. 👊

Attaching Consumers to Message Brokers

It’s a common architecture design pattern to attach consumers to message brokers in distributed applications. This becomes even more important with microservices, as it allows communication between different components. Depending on the type of work to be done (high versus low priority), message brokers can be passive, simply storing messages and waiting for message-oriented middleware to poll them, or active, routing messages for you.

RabbitMQ, for example, allows you to create bindings that attach workers to queues. You can run workers in the background with something like supervisor, which will get messages pushed to them as RabbitMQ receives them. Until now, SQS has lacked the ability to do this type of “push” and instead required constant polling to achieve a similar effect. This constant polling might make sense for high volume queues, but for smaller, occasional jobs, even running a Lambda function every minute would be a waste of resources.

Setting up an SQS trigger in Lambda is simple through the AWS Console. SAM templates support it as well (https://github.com/becloudway/aws-lambda-sqs-sam), and the team at Serverless is working on adding support too. The only required settings are the queue you want to access and the “batch size”, which is the maximum number of messages that will be read from your queue at once (up to 10 at a time). Be sure to configure your IAM permissions properly; you need both read and write privileges on the queue (specifically sqs:ReceiveMessage, sqs:DeleteMessage, and sqs:GetQueueAttributes for the trigger).

Running Some Experiments

I set up two test functions to run some experiments. The first was the function that was triggered by SQS and received the messages:
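Here’s a minimal sketch of that receiver (the handler just logs each message; the names and logging are illustrative, not the exact code I ran):

```javascript
// Illustrative receiver: logs each SQS message in the batch.
// Assumes the Node.js 8.10 runtime (async handlers).
exports.handler = async (event) => {
  // event.Records contains up to "batch size" messages from the queue
  for (const record of event.Records) {
    console.log('Received message:', record.messageId, record.body);
  }
  // Returning normally tells the Lambda service the batch succeeded,
  // so it deletes the messages from the queue for us.
  return {};
};
```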

The second was a “queue flooder” that just generated random messages and sent them to the queue. Remember that SQS’s SendMessageBatch can only accept up to 10 messages per call:
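A rough sketch of a flooder along those lines (the queue URL and message bodies are placeholders, not the ones from my tests):

```javascript
// Illustrative "queue flooder": sends N random messages in batches of 10
// (SendMessageBatch accepts at most 10 entries per call).
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// Placeholder queue URL
const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-test-queue';

exports.handler = async () => {
  const total = 500;
  const batches = [];

  for (let i = 0; i < total; i += 10) {
    const entries = [];
    for (let j = i; j < Math.min(i + 10, total); j++) {
      entries.push({
        Id: `msg-${j}`,
        MessageBody: JSON.stringify({ id: j, value: Math.random() })
      });
    }
    batches.push(sqs.sendMessageBatch({ QueueUrl: QUEUE_URL, Entries: entries }).promise());
  }

  await Promise.all(batches);
  return { sent: total };
};
```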

I then ran my queue flooder. It sent 500 messages to SQS, which triggered my receiver function and drained the queue! 🙌

The event looks like this:
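Something along these lines, with placeholder values (the exact IDs and timestamps will differ, but the shape follows the documented SQS event format):

```javascript
// Rough shape of the event object passed to the receiver (values are placeholders)
const exampleEvent = {
  Records: [
    {
      messageId: '059f36b4-87a3-44ab-83d2-661975830a7d',
      receiptHandle: 'AQEBwJnKyrHigUMZj6rYigCgxlaS3SLy0a...',
      body: '{"id":0,"value":0.123}',
      attributes: {
        ApproximateReceiveCount: '1',
        SentTimestamp: '1530027246974',
        SenderId: 'AIDAIENQZJOLO23YVJ4VO',
        ApproximateFirstReceiveTimestamp: '1530027246983'
      },
      messageAttributes: {},
      md5OfBody: '7b270e59b47ff90a553787216d55d91d',
      eventSource: 'aws:sqs',
      eventSourceARN: 'arn:aws:sqs:us-east-1:123456789012:my-test-queue',
      awsRegion: 'us-east-1'
    }
    // ...up to "batch size" records per invocation
  ]
};
```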

The SQS trigger spawned 5 concurrent Lambda functions that took less than 2 seconds to process all of the messages. Pretty sweet! This got me thinking about how Lambda would handle thousands of messages at once. Would it spawn hundreds of concurrent functions? 🤔

Concurrency Control for Managing Throughput

My first thought was that Lambda’s default scaling behavior would continue to fan out to process messages in the queue. This behavior makes sense if you’re doing something like processing images and there are no bottlenecks to contend with. However, what if you’re writing to a database, calling an external API, or throttling requests to some other backend service? Unlimited scaling is sure to cause issues if hundreds or thousands of Lambda functions are competing for the same resource. Luckily for us, we have concurrency control! 🤘🏻
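You can set a reserved concurrency on the function in the console, or with a quick SDK call. A minimal sketch (the function name is a placeholder):

```javascript
// Illustrative: reserve a concurrency of 1 for the receiver function
// so only one instance processes queue messages at a time.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

lambda.putFunctionConcurrency({
  FunctionName: 'sqs-receiver',      // placeholder function name
  ReservedConcurrentExecutions: 1
}).promise()
  .then(() => console.log('Concurrency limit set'))
  .catch(console.error);
```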

I ran a few experiments to see exactly how Lambda would handle throttling requests. I used my queue flooder again and set the concurrency to 1. I sent 100 messages to the queue. Here are the log results:

It ultimately used two functions to handle the workload, but they executed serially, never exceeding my concurrency limit. As you can see, 100 messages were eventually processed over the course of 40 total invocations. Brilliant! 😀

I ran a second experiment to scale up my workload. I set the concurrency to 2 and flooded the queue with 200 messages this time.

Four functions in total were used, with only two running concurrently. And all 200 messages were processed successfully! 😎

Additional Experiments with Concurrency and Redrive Policies

I had a really great comment that pointed out how adding redrive policies to SQS queues causes issues when you set a low concurrency. I ran some additional tests and discovered some really interesting behaviors.

It appears that the “Lambda service” polls the queue, and puts messages “in flight” without consideration of the concurrency limits. I added a 100ms wait time to my receiver function and noticed that messages towards the end of processing have ApproximateReceiveCounts of 4 or more. I’m sure this would be exacerbated by longer execution times and higher message volumes. The good news is that each message was only processed by my function one time.

I then set a redrive policy with a Maximum Receives of 1 and ran my “queue flooder” again. This time a large percentage of the messages ended up in the Dead Letter Queue. This makes sense given how the Lambda service polls for messages. But now this got me thinking about error handling. So I added some code to throw an error in my receiver (throw new Error('some error')) for every 10 requests. This turned out to be a really bad idea! I set my redrive policy to 1,000 Maximum Receives and then sent 100 messages into the queue. 90 of the messages got processed (as expected), but the other 10 just kept on firing my Lambda function. Over and over again. 😳 I assume that it would have eventually stopped after those 10 messages had been tried 1,000 times each.

I think the concurrency control needs a bit of work. However, I believe that Lambda triggers are still usable if you manage the redrive policy yourself. If returning errors from the Lambda function doesn’t tell the Lambda service to DLQ the message, then perhaps this just needs to be handled by our receiver functions. I don’t think I really care how many times a message needs to be retried as long as it is eventually processed. If there is an issue with the message (e.g. it throws some error and needs to be inspected later), then I can handle that in a number of ways, including pushing it to a DLQ myself or logging it somewhere else.
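For example, a receiver could catch per-message errors itself and push the failures to its own DLQ. A rough sketch of that idea (the DLQ URL and the processMessage helper are placeholders, not code from my tests):

```javascript
// Illustrative receiver that handles bad messages itself instead of
// relying on the queue's built-in redrive policy.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// Placeholder DLQ URL
const DLQ_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq';

exports.handler = async (event) => {
  for (const record of event.Records) {
    try {
      await processMessage(JSON.parse(record.body)); // your business logic
    } catch (err) {
      // Push the problem message to our own DLQ (or log it) and move on,
      // then return success so the Lambda service deletes it from the source queue.
      await sqs.sendMessage({
        QueueUrl: DLQ_URL,
        MessageBody: record.body,
        MessageAttributes: {
          error: { DataType: 'String', StringValue: err.message }
        }
      }).promise();
    }
  }
  return {};
};

async function processMessage(msg) {
  // placeholder for real work
  console.log('Processing', msg);
}
```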

It probably comes down to your particular use case, but throttling still seems possible, so long as you don’t rely on SQS’s built-in redrive functionality.

This is a Game Changer 🚀

As you probably can tell, I’m a bit excited about this. Being able to use the Simple Queue Service as a true message broker that you can attach consumers to changes the way we’ll build serverless applications. One of the biggest challenges has been throttling backend resources to avoid hitting service limits. With SQS triggers and concurrency control we can now offload expensive (or service limited) jobs without the need to hack something with scheduled tasks or CloudWatch log triggers. There’s no longer a need to manage our own fan out operations to process queues. We can now simply choose how many “workers” we want processing our queued requests.

Think of the possibilities! We can add SQS triggers to Dead Letter Queues to create our own redrive processes. We could queue data mutations, knowing that they will be processed almost instantaneously by our workers (reducing latency and costs of API responses). We can batch process and throttle remote API calls. My brain is spinning. 🤯

The team at AWS has once again pushed the envelope for serverless computing. Congrats! Know that your work is appreciated. You’re awesome. 🙇‍♂️

Update: Read the official announcement by AWS.


Interested in learning more about serverless? Check out 10 Things You Need To Know When Building Serverless Applications to jumpstart your serverless knowledge.

If you want to learn more about serverless security, read my posts Securing Serverless: A Newbie’s Guide and Event Injection: A New Serverless Attack Vector.



Did you like this post? 👍  Do you want more? 🙌  Follow me on Twitter or check out some of the projects I’m working on. You can join my mailing list too. I’ll email you when I post more stuff like this! 📪



10 thoughts on “Serverless Consumers with Lambda and SQS Triggers”

  1. When you set the concurrency limit on the lambdas reading from the SQS, did their throttling metric increase during the flooder test, or did they just not try and spin up new lambdas beyond their limit without giving any errors about concurrency limits being hit?

    1. The throttling metric did go up, but it didn’t throw any errors. It appears that they’ve designed it to handle concurrency so that it will process messages serially as the queue is worked down.

  2. Thanks for such a great post. I have a really dumb question that I think you answered several different ways already, but for clarification:

    Does setting a concurrency limit on the Lambda fn mean that the queue will be completed however quickly the Lambda functions can work? Is the below example what you’re saying:

    ===

    – Silly Lambda fn: “sleep(1000).then(succeed)”
    – Lambda fn triggered by SQS – batch size 1
    – Lambda fn set to concurrency of 1
    – Add 100 messages to queue
    – SQS queue is drained after 100 invocations, approximately 100 seconds

    (Adjusting to concurrency 2 is approximately 50s, etc.)

    ===

    The “amount of time” isn’t necessarily my concern, but perhaps it helps illustrate “work” happening — I’m mainly curious how the queue gets drained. We’re currently polling SQS and then, if there are messages, invoking other Lambdas so we don’t DDOS ourselves.

    But what I *think* I’m excited about here is that we don’t have to do gymnastics to drain the queue — it Just Works(TM). Am I understanding correctly?

    1. Hi Tom,

      Great question. The docs aren’t very specific (https://docs.aws.amazon.com/lambda/latest/dg/invoking-lambda-function.html#supported-event-source-sqs), but I’m pretty sure my tests confirm your assumptions.

      According to Amazon’s announcement:
      “The Lambda service monitors the number of inflight messages, and when it detects that this number is trending up, it will increase the polling frequency by 20 ReceiveMessage requests per minute and the function concurrency by 60 calls per minute. As long as the queue remains busy it will continue to scale until it hits the function concurrency limits. As the number of inflight messages trends down Lambda will reduce the polling frequency by 10 ReceiveMessage requests per minute and decrease the concurrency used to invoke our function by 30 calls per-minute.”

      What I gather from that is that the Lambda service is still technically polling SQS (just more efficiently than we could). So if you hit your concurrency limit, it will process as many messages as possible with the available functions, and then when a function completes, it will invoke the function again and again until all messages are drained.

      – Jeremy

  3. Great post! Really good timing too!

    Question. So far, I’ve been using SNS to do this kind of job. I publish to a SNS topic and that triggers a lambda function. If a backend service has rate limits, I just tune the concurrency limits. Could you maybe explain the differences between using SNS and SQS with Lambda?

    Thanks!

    1. Hi Sebastian,

      That is a really great question. The difference between SQS and SNS comes down to retry behavior (more info here). SNS events are not stream-based and are invoked asynchronously. This means that after 3 attempts, the message will fail and either be discarded or sent to a Dead Letter Queue. If you are setting concurrency limits on Lambdas that are processing SNS messages, it is possible that high volumes could cause messages to fail 3 times and therefore be discarded.

      SQS, on the other hand, is a poll-based event source that is not stream-based. This means it will continue to poll the SQS queue with the resources (number of functions) available to it. Even if you set a concurrency limit to your functions, the messages will simply remain in the queue until there is a function available to process it. This is different from stream-based events, since those are BLOCKING, whereas SQS polling is not. According to the updated documentation: “If you don’t require ordered processing of events, the advantage of using Amazon SQS queues is that AWS Lambda will continue to process new messages, regardless of a failed invocation of a previous message. In other words, processing of new messages will not be blocked.”

      Another important factor is the retry intervals. If an SNS message fails due to concurrency issues, the message is tried again “with delays between retries”. This means that your throttling could significantly delay an SNS message getting processed by your function. SQS would just keep on chugging through the queue as soon as it had capacity to handle more messages.

      Hope that makes sense!
      – Jeremy

  4. Hi Jeremy,

    Thanks for a great post. I have a question though.

    You wrote in one of the comments “What I gather from that is that the Lambda service is still technically polling SQS (just more efficiently than we could). ”

    Does it mean that our function is being executed at some (dynamic) time interval and we PAY for each invocation whether or not there were messages in the queue?

    My next question regards notifying SQS when you’re done with the message. I don’t see any code in your first function for deleting a message from a queue. I understand that you need to do that after consuming it, right?

    1. Hi Pawel,

      According to the SQS event documentation, you can “configure AWS Lambda to poll for these messages as they arrive and then pass the event to a Lambda function invocation.” The announcement also states that the “Lambda service monitors the number of inflight messages”. I don’t know exactly what they are doing internally, but my tests show that the Lambda functions begin consuming messages almost instantaneously once they’re added to the queue. To answer your question, you DO NOT pay for this monitoring service, only when messages are passed to your Lambda is it counted as an “invocation”.

      One of the other great things about this service is that you do not need to manage messages in SQS directly. The monitoring service (or whatever AWS calls it) grabs messages from the queue and passes them to your Lambda function. If your function executes successfully, the service will remove the messages for you. If your function fails or takes longer to execute than the visibility timeout, the message will either be returned to the queue or sent to a Dead Letter Queue.

      Hope that helps,
      Jeremy

  5. Your experience with concurrency is a lot different than what I am observing.

    When I set the concurrency to 1 and drop multiple items into the queue, the “Throttled invocations” metric goes up and those messages that triggered the throttling never get processed, but instead go straight to my dead queue. Example: I drop 5 things into the queue. Lambda only ever tries to process 2 of them, I get 3 throttled invocations, and 3 items end up in my dead queue.

    This is with a queue redrive policy set to have maximum receives of 1, so that behavior is expected if throttling counts as a failed processing attempt. But the way you describe it is that it shouldn’t count as an error and will intelligently process the messages serially, which from my experience is not the case.

    1. Dave,

      Thanks for the comment. I ran some additional tests and got some very interesting findings.

      First of all, you are right about the redrive policies. It appears that the “Lambda service” polls the queue, and puts messages “in flight” without consideration of concurrency limits set. I added a 100ms wait time to my receiver function and noticed that messages towards the end of processing have ApproximateReceiveCounts of 4 or more. I’m sure this would be exacerbated by longer execution times and higher message volumes. The good news is that each message was only processed by my function one time.

      When I set a redrive policy, like you mention, then (depending on the execution time and message volume) a large percentage end up in the Dead Letter Queue. This got me thinking about error handling. So I added some code to trigger an error in my receiver (throw new Error('some error')) for every 10 requests. This turned out to be a really bad idea! I set my redrive policy to 1,000 Maximum Receives and then I sent 100 messages into the queue. 90 of the messages got processed (as expected), but the other 10 just kept on firing my Lambda function. Over and over again. I assume that it would have eventually stopped after those 10 messages had been tried 1,000 times each.

      I think the concurrency control needs a bit of work. However, I believe that Lambda triggers are still usable if you manage the redrive policy yourself. If returning errors from the Lambda function doesn’t tell the Lambda service to DLQ the message, then perhaps this just needs to be handled by our receiver functions. I don’t think I really care how many times a message needs to be retried as long as it eventually does. If there is an issue with the message (e.g. it throws some error and needs to be inspected later), then I can handle that in a number of ways, including pushing it to a DLQ or logging it somewhere else.

      It probably comes down to your particular use case, but throttling still seems possible, so long as you don’t rely on SQS’s built in redrive functionality.

      Good luck,
      Jeremy
