How To: Use SNS and SQS to Distribute and Throttle Events

An extremely useful AWS serverless microservice pattern is to distribute an event to one or more SQS queues using SNS. This gives us the ability to use multiple SQS queues to “buffer” events so that we can throttle queue processing to alleviate pressure on downstream resources. For example, if we have an event that needs to write information to a relational database AND trigger another process that needs to call a third-party API, this pattern would be a great fit.

This is a variation of the Distributed Trigger Pattern, but in this example, the SNS topic AND the SQS queues are contained within a single microservice. It is certainly possible to subscribe other microservices to this SNS topic as well, but we’ll stick with intra-service subscriptions for now. The diagram below represents a high-level view of how we might trigger an SNS topic (API Gateway → Lambda → SNS), with SNS then distributing the message to the SQS queues. Let’s call it the Distributed Queue Pattern.

Distributed Queue Pattern

This post assumes you know the basics of setting up a serverless application, and will focus on just the SNS topic subscriptions, permissions, and implementation best practices. Let’s get started!

SNS + SQS = 👍

The basic idea of an SQS queue subscribed to an SNS topic is quite simple. SNS is essentially just a pub-sub system that allows us to publish a single message that gets distributed to multiple subscribed endpoints. Subscription endpoints can be email, SMS, HTTP, mobile applications, Lambda functions, and, of course, SQS queues. When an SQS queue is subscribed to an SNS topic, any message sent to the SNS topic will be added to the queue (unless it’s filtered, but we’ll get to that later 😉). This includes both the raw Message Body, and any other Message Attributes that you include in the SNS message.

It is almost guaranteed that your SNS messages will eventually be delivered to your subscribed SQS queues. From the SNS FAQ:

SQS: If an SQS queue is not available, SNS will retry 10 times immediately, then 100,000 times every 20 seconds for a total of 100,010 attempts over more than 23 days before the message is discarded from SNS.

It is highly unlikely that your SQS queues would be unavailable for 23 days, so this is why SQS queues are recommended for critical message processing tasks. Also from the SNS FAQ:

If it is critical that all published messages be successfully processed, developers should have notifications delivered to an SQS queue (in addition to notifications over other transports).

So not only do we get the benefit of near guaranteed delivery, but we also get the benefit of throttling our messages. If we attempted to deliver SNS messages directly to Lambda functions or HTTP endpoints, it is likely that we could overwhelm the downstream resources they interact with.

It’s also possible that we could lose events if we don’t set up Dead Letter Queues (DLQs) to capture failed invocations when the services go down. And even if we did capture these failed events, we’d need a way to replay them. SQS basically does this automatically for us. Plus we may be the reason WHY the service went down in the first place, so using an HTTP retry policy might exacerbate the problem.

So don’t think of queues between services as added complexity; think of them as an additional layer of durability.

Creating Subscriptions

AWS has a tutorial that shows you how to set up an SQS to SNS subscription via the console, but we want to automate this as part of our serverless.yml or SAM templates. AWS also provides a sample CloudFormation template that you can use, but this one doesn’t create the appropriate SQS permissions for you. You can also use this CloudFormation template, but it creates IAM users and groups, which is overkill for what we’re trying to accomplish.

Let’s look at a real world example so we can better understand the context.

Sample SNS to SQS Serverless Application

In the example above, our serverless application has two SQS queues subscribed to our SNS topic. Each queue has a Lambda function subscribed, which will automatically process messages as they are received. One Lambda function is responsible for writing some information to an RDS Aurora table, and the other is responsible for calling the Twilio API. Notice that we are using additional SQS queues as Dead Letter Queues (DLQs) for our main queues. We are using a redrive policy for our queues instead of attaching DLQs directly to our Lambda functions because SQS events are processed synchronously by Lambda. We would also be throttling our Lambdas by setting our Reserved Concurrency to an appropriate level for our downstream services.

Serverless Configuration

Now that we know what we’re building, let’s write some configuration files! 🤓 I’m going to use the Serverless Framework for the examples below, but they could easily be adapted for SAM.

Let’s start with our resources. This is just straight CloudFormation (with a few Serverless Framework variables), but you could essentially just copy this into your SAM template. Take a look at this and we’ll discuss some of the highlights below.

Part One is quite simple. We create our AWS::SNS::Topic and our two AWS::SQS::Queue resources, and add a RedrivePolicy to each queue that sends failed messages to its deadLetterTargetArn.
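Here is a minimal sketch of what Part One might look like. The logical IDs (snsTopic, firstQueue, secondQueue, and the two DLQs) and the maxReceiveCount of 3 are assumptions for illustration, so adjust them to fit your own service.

```yaml
# serverless.yml (resources section)
# Logical IDs and maxReceiveCount are illustrative assumptions.
resources:
  Resources:
    # The SNS topic that every event is published to
    snsTopic:
      Type: AWS::SNS::Topic

    # First queue: buffers events for the database writer
    firstQueue:
      Type: AWS::SQS::Queue
      Properties:
        RedrivePolicy:
          deadLetterTargetArn: !GetAtt firstQueueDLQ.Arn
          maxReceiveCount: 3   # assumed: move to the DLQ after 3 failed receives

    # Dead letter queue for firstQueue
    firstQueueDLQ:
      Type: AWS::SQS::Queue

    # Second queue: buffers events for the third-party API caller
    secondQueue:
      Type: AWS::SQS::Queue
      Properties:
        RedrivePolicy:
          deadLetterTargetArn: !GetAtt secondQueueDLQ.Arn
          maxReceiveCount: 3   # assumed value

    # Dead letter queue for secondQueue
    secondQueueDLQ:
      Type: AWS::SQS::Queue
```

Leaving out explicit names lets CloudFormation generate them, which keeps the sketch short. You would likely want to add TopicName and QueueName properties in a real service.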

Part Two creates an AWS::SQS::QueuePolicy for each of our queues. This is necessary to allow our SNS topic to send messages to them. For you security sticklers out there, you may have noticed that we are using a * for our Principal setting. 😲 Don’t worry, we are using a Condition that makes sure the "aws:SourceArn" equals the ARN of our SNS topic, so only that topic can send messages to the queues.
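As a rough sketch, the policy for one queue could look like the following. It assumes a topic with the logical ID snsTopic and a queue with the logical ID firstQueue; the policy's own logical ID and the Sid are made up for this example. The second queue would need an equivalent policy.

```yaml
# serverless.yml (resources section)
# Assumes resources named snsTopic and firstQueue are defined elsewhere in the template.
resources:
  Resources:
    snsToFirstQueuePolicy:   # hypothetical logical ID
      Type: AWS::SQS::QueuePolicy
      Properties:
        # QueuePolicy expects queue URLs, which is what !Ref returns for a queue
        Queues:
          - !Ref firstQueue
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Sid: allow-sns-messages   # hypothetical statement ID
              Effect: Allow
              Principal: "*"
              Action: sqs:SendMessage
              Resource: !GetAtt firstQueue.Arn
              # Only accept messages that originate from our topic
              Condition:
                ArnEquals:
                  aws:SourceArn: !Ref snsTopic   # !Ref on a topic returns its ARN
```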

Part Three does the actual subscriptions to the SNS topic. Be sure to set your RawMessageDelivery to 'true' (note the single quotes) so that no JSON formatting is added to our messages.
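A subscription for one queue might be sketched like this, again assuming the logical IDs snsTopic and firstQueue. The subscription's own logical ID is hypothetical, and the second queue gets a matching resource.

```yaml
# serverless.yml (resources section)
# Assumes resources named snsTopic and firstQueue are defined elsewhere in the template.
resources:
  Resources:
    firstQueueSubscription:   # hypothetical logical ID
      Type: AWS::SNS::Subscription
      Properties:
        TopicArn: !Ref snsTopic
        Protocol: sqs
        # For the sqs protocol, the endpoint is the queue ARN
        Endpoint: !GetAtt firstQueue.Arn
        # Deliver the message as published, without the SNS JSON envelope
        RawMessageDelivery: 'true'
```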

That takes care of our topic, queues and subscriptions, now let’s configure our two functions. I’ve only included the pertinent configuration settings below.

This is quite straightforward. We are creating two functions, each subscribed to its respective SQS queue. Notice we are using the !GetAtt CloudFormation intrinsic function to retrieve the Arn of our queues. Also, notice that we are setting reservedConcurrency to throttle our functions. We can adjust this setting based on the capacity of our downstream resource. SQS and Lambda’s concurrency setting work together, so messages will remain in the queue until there are Lambda instances available to process them. If our Lambda repeatedly fails to process a message, the redrive policy will move it to the corresponding DLQ once the queue’s maxReceiveCount is exceeded.
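The function configuration could be sketched along these lines. The function names, handler paths, batch sizes, and concurrency values are all assumptions for illustration, and the queues are assumed to have the logical IDs firstQueue and secondQueue.

```yaml
# serverless.yml (functions section)
# Names, handler paths, and numeric values are illustrative assumptions.
functions:
  # Writes event data to the RDS Aurora table
  rdsWriter:
    handler: handlers/rdsWriter.handler   # hypothetical path
    reservedConcurrency: 5                # assumed; tune to what the database can absorb
    events:
      - sqs:
          arn: !GetAtt firstQueue.Arn
          batchSize: 1                    # assumed; one message per invocation

  # Calls the Twilio API
  twilioCaller:
    handler: handlers/twilioCaller.handler   # hypothetical path
    reservedConcurrency: 10                  # assumed; tune to the third-party rate limit
    events:
      - sqs:
          arn: !GetAtt secondQueue.Arn
          batchSize: 1                       # assumed
```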

Message Filtering for the Win! 👊

You may be asking yourself, why don’t I just use Kinesis for this? That’s a great question. Kinesis is awesome for handling large message streams and maintaining message order. It is highly durable and we can even control our throttling by selecting the number of shards. Yan Cui also points out that SNS can actually be considerably more expensive than Kinesis when you get to high levels of sustained message throughput. This is certainly true for some applications, but for spiky workloads, it would be less of a problem.

However, there’s one thing that SNS can do that Kinesis can’t: Message Filtering. If you are subscribed to a Kinesis stream, but you’re only interested in certain messages, you still need to inspect ALL the messages in the stream. A common pattern is to use a single Kinesis stream as your “event stream” and subscribe several functions to it. Each function must load every message and then determine if it needs to do something with it. Depending on the complexity of your application, this may create a lot of wasted Lambda invocations.

SNS, on the other hand, allows us to add some intelligence to our “dumb pipes” by applying a FilterPolicy to our subscriptions. Now we can send just the events we care about to our SQS queues. This way we can minimize our Lambda invocations and precisely control our throttling. With Kinesis, every subscribed function is subject to the concurrency set by the shard count.

Setting up our filters is as simple as adding a FilterPolicy to our AWS::SNS::Subscriptions in our CloudFormation resources. If we only want messages that have an action of “sendSMS”, our updated resource would look like this:
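A sketch of such a subscription, assuming the logical IDs snsTopic and secondQueue and a message attribute named action (the subscription's logical ID is hypothetical):

```yaml
# serverless.yml (resources section)
# Assumes resources named snsTopic and secondQueue are defined elsewhere in the template.
resources:
  Resources:
    secondQueueSubscription:   # hypothetical logical ID
      Type: AWS::SNS::Subscription
      Properties:
        TopicArn: !Ref snsTopic
        Protocol: sqs
        Endpoint: !GetAtt secondQueue.Arn
        RawMessageDelivery: 'true'
        # Only deliver messages whose "action" message attribute equals "sendSMS"
        FilterPolicy:
          action:
            - sendSMS
```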

There are several types of filtering available, including prefix matching, ranges, blacklisting, and more. You can view all the different options and the associated syntax in the official AWS docs. The important thing to remember is that filters work on Message Attributes, so you will need to make sure that messages sent to your SNS topic contain the necessary information. And that’s it. Now our secondQueue will only receive messages that have an action of “sendSMS”, which means our Lambda function will only be invoked to process “sendSMS” actions.
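For reference, a few of the other filter types might be written like this. The attribute names (action, environment, priority) and their values are hypothetical. This fragment only shows the FilterPolicy property, which would sit under the Properties of an AWS::SNS::Subscription.

```yaml
# Fragment of an AWS::SNS::Subscription's Properties.
# Attribute names and values are hypothetical.
FilterPolicy:
  # Prefix matching: any action that starts with "send"
  action:
    - prefix: send
  # Exclusion: any environment except "test"
  environment:
    - anything-but:
        - test
  # Numeric range: priority between 1 and 5 (requires a Number message attribute)
  priority:
    - numeric:
        - ">="
        - 1
        - "<="
        - 5
```

When a policy lists several attributes, a message has to match all of them to be delivered, while the values listed under a single attribute are treated as alternatives.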

Final Thoughts 🤔

This is an incredibly powerful pattern, especially if you need to throttle downstream services. However, as with anything, whether this pattern fits your application depends on your requirements. This pattern does not guarantee the order of messages, so it’s possible that new messages could be processed before older ones. Also, if you have very high message rates, SNS can get a bit expensive. Be sure to evaluate other Serverless Microservice Patterns for AWS to determine which patterns make the most sense for your use case.

If you’re interested in learning more about serverless microservices, check out my Introduction to Serverless Microservices and Mixing VPC and Non-VPC Lambda Functions for Higher Performing Microservices posts.


9 thoughts on “How To: Use SNS and SQS to Distribute and Throttle Events”

  1. So, we have run into an issue with this for a product that we are working on. If you take this to its logical extreme and set the reserved concurrency on the lambda side to 1 and blast the queue with messages, the queue will TRY to deliver the message to the lambda function, get a throttled exception, leave it in flight until the visibility timeout expires, and eventually send most of the messages into the dead letter queue.

    1. Hi Jeff,

      The Lambda poller for SQS immediately assumes a concurrency of 5 and then scales up after that based on your actual concurrency. Setting your concurrency to 1 will cause the redrive policies to kick in because the poller will attempt to load too many messages to start with. If you NEED to set a concurrency of 1, disable the redrive policy and manage DLQs through your Lambda function instead.

      Hope that helps,
      Jeremy

    2. Hey Jeremy,

      Thanks for that tip, I must have missed the SQS poller lower limit in the documentation, that explains what I have been seeing. I don’t need to go down to 1 in the reserved concurrency, but it probably has to be below the limit. Appreciate the help, it was a good article too!

      Jeff

    3. Hi Jeremy,

      I am having the same problem as Jeff. I have designed a backend that utilises SQS triggers but I need the queue worker to be limited to 3 concurrent executions at a time. I had originally connected a redrive policy to my SQS queue but after reading your response to Jeff I removed the redrive policy and added a “DLQ resource” directly to the Lambda.

      Test case:
      – SQS queue (redrive policy: none; visibility timeout: 30s)
      – SQS DLQ (DLQ resource for Lambda)
      – Lambda (concurrency: 3; timeout: 30s; collects a single message from the queue and waits 20s before resolving successfully)
      – “script” (sends 100 msgs to SQS queue)

      Problem:
      – I still see far more than 3 messages visible on the queue (upwards of 25)
      – My lambda is still being throttled (over 500 times for the 100 messages)

      Your assistance would be greatly appreciated.

      Thanks,
      Zanda

    4. Hi Zanda,

      I don’t think attaching the DLQ to the Lambda function will do anything in this case. DLQs on Lambda functions only work for “asynchronous” invocations and the Lambda SQS poller delivers messages “synchronously”. You should be catching errors in your Lambda and then using the aws-sdk to move messages to your DLQ manually in your situation.

      Also, I don’t think there is a way to get around the throttling and “in flight” message counts if your concurrency is less than 5. That doesn’t seem to be the way the system is designed. Although, other than some additional costs, throttles and requeues shouldn’t really be a problem. If there is no redrive policy, then the messages will just keep getting recycled.

      In your case, I’m not sure SQS triggers are the right way to go. I wrote a recent post to show how you can throttle calls to third-party APIs using Lambda as an orchestrator. Using something like that (or Step Functions) might be a better fit.

      Let me know how you make out.

      – Jeremy

  2. Great article! I have implemented a metrics collector with kinesis as the main stream. The need for ordering is the real reason I chose kinesis. Basically run a handler per type of message inside the single lambda instead of a lambda per message type. Batch vs ECS cloudwatch events for example of 2 message types. Does that seem reasonable to you? Also, does a DLQ make sense here considering the lambda will just keep retrying batch of messages till error is fixed…given there is an error? The lambda commits each batch of messages as an atomic transaction.

    1. Hi Braun,

      Hmm, I’m not a huge fan of using the same Lambda function to perform different tasks. Coupling, independent scalability, and maintenance may become an issue as the system grows. Rob Gruhl has a great article about Event Sourcing that shows the benefits of using events as a single source of truth. This allows different consumers to use the data independently without needing to worry about what the rest of the system is doing. With only a few consumers, the coupling doesn’t seem like an issue, but it could be.

      Also, in terms of the DLQ, Kinesis invokes Lambda functions synchronously, so a DLQ attached to a Lambda function would have no effect. If an error occurs while trying to process an event, it’s best to move that message to an SQS queue, and if you get an ack that the message was successfully put in the queue, return successfully from your function so Kinesis removes the records and continues to process. That is IF you can process a few failed events out of order.

      Hope that makes sense,
      Jeremy

  3. This is a great post. How would you compare this to SNS + Lambda (fixed concurrency) + DLQ? You still get the throttling and similar reliability guarantees but perhaps less control over the retry policy?

    1. Hi Tekumara,

      You can use SNS + Lambda, but you would have no visibility into the queued items, only the failed ones. Adding the SQS queues gives you the ability to see the size of the queue, so you can use that to better understand load and pressure. Interesting approach though, especially given that SNS will retry a message over 100,000 times.

      – Jeremy
