How To: Use SNS and SQS to Distribute and Throttle Events

An extremely useful AWS serverless microservice pattern is to distribute an event to one or more SQS queues using SNS. This gives us the ability to use multiple SQS queues to “buffer” events so that we can throttle queue processing to alleviate pressure on downstream resources. For example, if we have an event that needs to write information to a relational database AND trigger another process that needs to call a third-party API, this pattern would be a great fit.

This is a variation of the Distributed Trigger Pattern, but in this example, the SNS topic AND the SQS queues are contained within a single microservice. It is certainly possible to subscribe other microservices to this SNS topic as well, but we’ll stick with intra-service subscriptions for now. The diagram below represents a high-level view of how we might trigger an SNS topic (API Gateway → Lambda → SNS), with SNS then distributing the message to the SQS queues. Let’s call it the Distributed Queue Pattern.

Distributed Queue Pattern

This post assumes you know the basics of setting up a serverless application, and will focus on just the SNS topic subscriptions, permissions, and implementation best practices. Let’s get started!

SNS + SQS = 👍

The basic idea of an SQS queue subscribed to an SNS topic is quite simple. SNS is essentially just a pub-sub system that allows us to publish a single message that gets distributed to multiple subscribed endpoints. Subscription endpoints can be email, SMS, HTTP, mobile applications, Lambda functions, and, of course, SQS queues. When an SQS queue is subscribed to an SNS topic, any message sent to the SNS topic will be added to the queue (unless it’s filtered, but we’ll get to that later 😉). This includes both the raw Message Body, and any other Message Attributes that you include in the SNS message.

It is almost guaranteed that your SNS messages will eventually be delivered to your subscribed SQS queues. From the SNS FAQ:

SQS: If an SQS queue is not available, SNS will retry 10 times immediately, then 100,000 times every 20 seconds for a total of 100,010 attempts over more than 23 days before the message is discarded from SNS.

It is highly unlikely that your SQS queues would be unavailable for 23 days, which is why SQS queues are recommended for critical message processing tasks. Also from the SNS FAQ:

If it is critical that all published messages be successfully processed, developers should have notifications delivered to an SQS queue (in addition to notifications over other transports).

So not only do we get the benefit of near guaranteed delivery, but we also get the benefit of throttling our messages. If we attempted to deliver SNS messages directly to Lambda functions or HTTP endpoints, it is likely that we could overwhelm the downstream resources they interact with.

It’s also possible that we could lose events if we don’t set up Dead Letter Queues (DLQs) to capture failed invocations when the services go down. And even if we did capture these failed events, we’d need a way to replay them. SQS basically does this automatically for us. Plus we may be the reason WHY the service went down in the first place, so using an HTTP retry policy might exacerbate the problem.

So don’t think about queues between services as added complexity, think of it as adding an additional layer of durability.

Creating Subscriptions

AWS has a tutorial that shows you how to set up an SQS to SNS subscription via the console, but we want to automate this as part of our serverless.yml or SAM templates. AWS also provides a sample CloudFormation template that you can use, but this one doesn’t create the appropriate SQS permissions for you. You can also use this CloudFormation template, but it creates IAM users and groups, which is overkill for what we’re trying to accomplish.

Let’s look at a real world example so we can better understand the context.

Sample SNS to SQS Serverless Application

In the example above, our serverless application has two SQS queues subscribed to our SNS topic. Each queue has a Lambda function subscribed, which will automatically process messages as they are received. One Lambda function is responsible for writing some information to an RDS Aurora table, and the other is responsible for calling the Twilio API. Notice that we are using additional SQS queues as Dead Letter Queues (DLQs) for our main queues. We are using a redrive policy for our queues instead of attaching DLQs directly to our Lambda functions because SQS events are processed synchronously by Lambda. We would also be throttling our Lambdas by setting our Reserved Concurrency to an appropriate level for our downstream services.

Serverless Configuration

Now that we know what we’re building, let’s write some configuration files! 🤓 I’m going to use the Serverless Framework for the examples below, but they could easily be adapted for SAM.

Let’s start with our resources. This is just straight CloudFormation (with a few Serverless Framework variables), but you could essentially just copy this into your SAM template. Take a look at this and we’ll discuss some of the highlights below.

Part One is quite simple. We create our AWS::SNS::Topic and our two AWS::SQS::Queue resources, and add a RedrivePolicy to each queue that sends failed messages to its deadLetterTargetArn.
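Since the original snippet isn't reproduced here, a minimal sketch of Part One might look like the following (logical IDs such as `SnsTopic`, `firstQueue`, and the queue names are assumptions):

```yaml
resources:
  Resources:
    SnsTopic:
      Type: AWS::SNS::Topic
      Properties:
        TopicName: ${self:service}-${self:provider.stage}-events

    firstQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: ${self:service}-${self:provider.stage}-first-queue
        # After 3 failed receives, send the message to the DLQ
        RedrivePolicy:
          deadLetterTargetArn: !GetAtt firstQueueDLQ.Arn
          maxReceiveCount: 3

    firstQueueDLQ:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: ${self:service}-${self:provider.stage}-first-queue-dlq

    # secondQueue and secondQueueDLQ would follow the same shape
```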

Part Two creates an AWS::SQS::QueuePolicy for each of our queues. This is necessary to allow our SNS topic to send messages to them. For you security sticklers out there, you may have noticed that we are using a * for our Principal setting. 😲 Don’t worry, we are using a Condition that ensures the "aws:SourceArn" equals our SnsTopic, so we’re good.
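As a sketch, the queue policy for the first queue could look like this (logical IDs like `firstQueue` and `SnsTopic` are assumptions carried over from the sketch above; the second queue would get an identical policy):

```yaml
firstQueuePolicy:
  Type: AWS::SQS::QueuePolicy
  Properties:
    Queues:
      - !Ref firstQueue
    PolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal: '*'
          Action: sqs:SendMessage
          Resource: !GetAtt firstQueue.Arn
          # Only our SNS topic may send messages to this queue
          Condition:
            ArnEquals:
              'aws:SourceArn': !Ref SnsTopic
```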

Part Three creates the actual subscriptions to the SNS topic. Be sure to set RawMessageDelivery to 'true' (note the single quotes) so that no JSON formatting is added to our messages.
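A subscription for the first queue might be sketched like this (again, logical IDs are assumptions):

```yaml
firstQueueSubscription:
  Type: AWS::SNS::Subscription
  Properties:
    TopicArn: !Ref SnsTopic
    Protocol: sqs
    Endpoint: !GetAtt firstQueue.Arn
    # Deliver the raw message body without the SNS JSON envelope
    RawMessageDelivery: 'true'
```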

That takes care of our topic, queues and subscriptions, now let’s configure our two functions. I’ve only included the pertinent configuration settings below.

This is quite straightforward. We are creating two functions, each subscribed to its respective SQS queue. Notice we are using the !GetAtt CloudFormation intrinsic function to retrieve the Arn of our queues. Also, notice that we are setting the reservedConcurrency setting to throttle our functions. We can adjust this setting based on the capacity of our downstream resource. SQS and Lambda’s concurrency settings work together, so messages in the queue will remain in the queue until there are Lambda instances available to process them. If our Lambda fails to process the messages, they’ll go into our DLQs.
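A sketch of the functions section might look like the following (the handler paths and concurrency values are hypothetical; pick values that match your downstream capacity):

```yaml
functions:
  writeToRDS:
    handler: handlers/writeToRDS.handler  # hypothetical handler path
    reservedConcurrency: 5  # throttle to protect the database
    events:
      - sqs:
          arn: !GetAtt firstQueue.Arn
          batchSize: 1
  sendSMS:
    handler: handlers/sendSMS.handler  # hypothetical handler path
    reservedConcurrency: 10  # throttle to respect the third-party API
    events:
      - sqs:
          arn: !GetAtt secondQueue.Arn
          batchSize: 1
```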

Message Filtering for the Win! 👊

You may be asking yourself, why don’t I just use Kinesis for this? That’s a great question. Kinesis is awesome for handling large message streams and maintaining message order. It is highly durable and we can even control our throttling by selecting the number of shards. Yan Cui also points out that SNS can actually be considerably more expensive than Kinesis when you get to high levels of sustained message throughput. This is certainly true for some applications, but for spiky workloads, it would be less of a problem.

However, there’s one thing that SNS can do that Kinesis can’t: Message Filtering. If you are subscribed to a Kinesis stream, but you’re only interested in certain messages, you still need to inspect ALL the messages in the stream. A common pattern is to use a single Kinesis stream as your “event stream” and subscribe several functions to it. Each function must load every message and then determine if it needs to do something with it. Depending on the complexity of your application, this may create a lot of wasted Lambda invocations.

SNS, on the other hand, allows us to add some intelligence to our “dumb pipes” by applying a FilterPolicy to our subscriptions. Now we can send just the events we care about to our SQS queues. This way we can minimize our Lambda invocations and precisely control our throttling. With Kinesis, every subscribed function is subject to the concurrency set by the shard count.

Setting up our filters is as simple as adding a FilterPolicy to our AWS::SNS::Subscriptions in our CloudFormation resources. If we only want messages that have an action of “sendSMS”, our updated resource would look like this:
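Since the original snippet isn't shown here, a sketch of the filtered subscription (with assumed logical IDs, matching the earlier sketches) could be:

```yaml
secondQueueSubscription:
  Type: AWS::SNS::Subscription
  Properties:
    TopicArn: !Ref SnsTopic
    Protocol: sqs
    Endpoint: !GetAtt secondQueue.Arn
    RawMessageDelivery: 'true'
    # Only deliver messages whose "action" attribute equals "sendSMS"
    FilterPolicy:
      action:
        - sendSMS
```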

There are several types of filtering available, including prefix matching, ranges, blacklisting, and more. You can view all the different options and the associated syntax in the official AWS docs. The important thing to remember is that filters work on Message Attributes, so you will need to make sure that messages sent to your SNS topic contain the necessary information. And that’s it. Now our secondQueue will only receive messages that have an action of “sendSMS”, which means our Lambda function will only be invoked to process “sendSMS” actions.

Final Thoughts 🤔

This is an incredibly powerful pattern, especially if you need to throttle downstream services. However, like with anything, applying this pattern to your application depends on your requirements. This pattern does not guarantee the order of messages, so it’s possible that new messages could be processed before older ones. Also, if you have very high message rates, SNS can get a bit expensive. Be sure to evaluate other Serverless Microservice Patterns for AWS to determine which patterns make the most sense for your use case.

If you’re interested in learning more about serverless microservices, check out my Introduction to Serverless Microservices and Mixing VPC and Non-VPC Lambda Functions for Higher Performing Microservices posts.


Did you like this post? 👍  Do you want more? 🙌  Follow me on Twitter or check out some of the projects I’m working on.

24 thoughts on “How To: Use SNS and SQS to Distribute and Throttle Events”

  1. So, we have run into an issue with this for a product that we are working on. If you take this to its logical extreme and set the reserved concurrency on the Lambda side to 1 and blast the queue with messages, the queue will TRY to deliver the message to the Lambda function, get a throttled exception, leave it in flight until the visibility timeout expires, and eventually send most of the messages into the dead letter queue.

    1. Hi Jeff,

      The Lambda poller for SQS immediately assumes a concurrency of 5 and then scales up after that based on your actual concurrency. Setting your concurrency to 1 will cause the redrive policies to kick in because the poller will attempt to load too many messages to start with. If you NEED to set a concurrency of 1, disable the redrive policy and manage DLQs through your Lambda function instead.

      Hope that helps,

    2. Hey Jeremy,

      Thanks for that tip, I must have missed the SQS poller lower limit in the documentation; that explains what I have been seeing. I don’t need to go down to 1 in the reserved concurrency, but it probably has to be below the limit. Appreciate the help, it was a good article too!


    3. Hi Jeremy,

      I am having the same problem as Jeff. I have designed a backend that utilises SQS triggers but I need the queue worker to be limited to 3 concurrent executions at a time. I had originally connected a redrive policy to my SQS queue but after reading your response to Jeff I removed the redrive policy and added a “DLQ resource” directly to the Lambda.

      Test case:
      – SQS queue (redrive policy: none; visibility timeout: 30s)
      – SQS DLQ (DLQ resource for Lambda)
      – Lambda (concurrency: 3, timeout: 30s; collects a single message from the queue and waits 20s before resolving successfully)
      – “script” (sends 100 msgs to SQS queue)

      – I still see far more than 3 messages visible on the queue (upwards of 25)
      – My lambda is still being throttled (over 500 times for the 100 messages)

      Your assistance would be greatly appreciated.


    4. Hi Zanda,

      I don’t think attaching the DLQ to the Lambda function will do anything in this case. DLQs on Lambda functions only work for “asynchronous” invocations and the Lambda SQS poller delivers messages “synchronously”. You should be catching errors in your Lambda and then using the aws-sdk to move messages to your DLQ manually in your situation.

      Also, I don’t think there is a way to get around the throttling and “in flight” message counts if your concurrency is less than 5. That doesn’t seem to be the way the system is designed. Although, other than some additional costs, throttles and requeues shouldn’t really be a problem. If there is no redrive policy, then the messages will just keep getting recycled.

      In your case, I’m not sure SQS triggers are the right way to go. I wrote a recent post to show how you can throttle calls to third-party APIs using Lambda as an orchestrator. Using something like that (or Step Functions) might be a better fit.

      Let me know how you make out.

      – Jeremy

  2. Great article! I have implemented a metrics collector with kinesis as the main stream. The need for ordering is the real reason I chose kinesis. Basically run a handler per type of message inside the single lambda instead of a lambda per message type. Batch vs ECS cloudwatch events for example of 2 message types. Does that seem reasonable to you? Also, does a DLQ make sense here considering the lambda will just keep retrying batch of messages till error is fixed…given there is an error? The lambda commits each batch of messages as an atomic transaction.

    1. Hi Braun,

      Hmm, I’m not a huge fan of using the same Lambda function to perform different tasks. Coupling, independent scalability, and maintenance may become an issue as the system grows. Rob Gruhl has a great article about Event Sourcing that shows the benefits of using events as a single source of truth. This allows different consumers to use the data independently without needing to worry about what the rest of the system is doing. With only a few consumers, the coupling doesn’t seem like an issue, but it could be.

      Also, in terms of the DLQ, Kinesis invokes Lambda functions synchronously, so a DLQ attached to a Lambda function would have no effect. If an error occurs while trying to process an event, it’s best to move that message to an SQS queue, and if you get an ack that the message was successfully put in the queue, return successfully from your function so Kinesis removes the records and continues to process. That is IF you can process a few failed events out of order.

      Hope that makes sense,

  3. This is a great post. How would you compare this to SNS + Lambda (fixed concurrency) + DLQ? You still get the throttling and similar reliability guarantees but perhaps less control over the retry policy?

    1. Hi Tekumara,

      You can use SNS + Lambda, but you would have no visibility into the queued items, only the failed ones. Adding the SQS queues gives you the ability to see the size of the queue, which you can use to better understand load and pressure. Interesting approach though, especially given that SNS will retry a message over 100,000 times.

      – Jeremy

  4. With the serverless framework and this design, what happens when deploying a new upgrade? Does serverless delete/re-create aws resources?

    E.g. What if I changed my serverless config in a way that affected an existing SQS or Lambda and then re-deployed. Would the deploy re-create the SQS, Lambda function but leave the SNS resource as is, meaning that the SNS would retry messages that could not be received by SQS, effectively ensuring no messages are lost?

  5. This was really helpful, thankyou so much for your efforts on writing about aws serverless architecture. I’m looking forward to reading your future posts 🙂

  6. Hi Jeremy,
    Thank you for this article. Very helpful. I am a learner in AWS, and this article helped me understand the SNS SQS pattern better.

  7. Hey Jeremy,

    thanks for the article, it’s a great pattern for decoupling with huge flexibility.

    Do you have any idea about the big but very important issue of observing and tracing the data flow? We, for example, are using X-Ray for our Lambdas, wrapping with captureAWS(), as it has minimal impact on our production code. But here (as in other use cases, e.g. subscribing to DynamoDB streams) we find it hard to trace through the SNS-SQS fan out. Do you have any ideas or workarounds for that?

    Thanks a lot and all the best

  8. Hello Jeremy,

    Thanks for the post, it was really helpful for me immersing myself into the serverless world. Just a quick question, is it possible to “unify” the same policy to both queues. I’m asking, because in my case I’ll be publishing to many queues, so I wonder if there’s a way to have a single policy and attach it to all the defined queues.

    Thank you in advance for your response.

  9. Hi Jeremy, thanks for the article. I was wondering what are the reasons this setup has SNS before SQS? Is it to fan out to multiple queues and also the message filtering you mentioned? What would be other reasons to have SNS before the queue?

    1. There are multiple reasons, but for this implementation, the idea is to use the durability of SQS on the other end of an SNS fan out. As long as the SNS call succeeds, the messages are pretty much guaranteed to get delivered to SQS. You can then throttle each SQS queue using Lambda to adjust the pressure on downstream systems. For example, if you fan out a message, it might have very high throughput when writing to a DynamoDB table. But if there is an API on the other end, then maybe we need to limit it to 50 concurrent executions. Hope that makes sense.

  10. Hi Jeremy, this is a great write up and matches the pattern I wanted to create (though at the moment there’s a single SQS subscriber).

    My remaining question is how I get the SNS ARN into a Lambda without doing a lookup on every execution. I figure I can use the environment, but this is the best I can do. I don’t seem to be able to reference the ARN directly:

    MYSNSTOPIC_ARN: {Fn::Join: [':', ['arn:aws:sns', "${opt:region, self:provider.region}", Ref: 'AWS::AccountId', "${self:service}-${self:provider.stage}-mySnsTopic"]]}

    1. You can export the ARN as an Output in the yml where it is created…something like this:
      Description: 'SNS Topic'
      Value: !Ref SnsTopic
      Name: !Sub '${self:provider.stage}-${self:service}-sns'

      …then just import it using !ImportValue 'dev-myservice-sns' # or whatever the above "Name" would resolve to.

  11. I know it’s a tad late jumping on this discussion now 🙂 as I work hard to really get a solid foundation for serverless, but this post was the best I’ve found online. Thanks Jeremy. I have defined my Topics in a separate AWS command script since I found it tricky to define/know exactly where those creations would reside in any given serverless module, as many queues can subscribe to a topic. Just my personal approach, any input on that? Having the queues defined in serverless has been great though, thanks again!

  12. Am I wrong, or does Lambda reserved concurrency actually not have any impact on the speed at which messages are delivered to Lambdas? For example, if your reserved concurrency is 1 and SQS batch size is 1 record, then a spike in SQS queue size to 100k records will lead to most messages being sent to the DLQ, as Lambda won’t be able to keep up with message delivery from SQS?

    From what I understand you cannot throttle consumption of SQS messages using lambdas in push mode efficiently.
    Either poll messages or use Kinesis.
