The Dynamic Composer (an AWS serverless pattern)

I’m a big fan of following the Single Responsibility Principle when creating Lambda functions in my serverless applications. The idea of each function doing “one thing well” allows you to easily separate discrete pieces of business logic into reusable components. In addition, the Lambda concurrency model, along with the ability to add fine-grained IAM permissions per function, gives you a tremendous amount of control over the security, scalability, and cost of each part of your application.

However, there are several drawbacks with this approach that often attract criticism. These include things like increased complexity, higher likelihood of cold starts, separation of log files, and the inability to easily compose functions. I think there is merit to these criticisms, but I have personally found the benefits to far outweigh any of the negatives. A little bit of googling should help you find ways to mitigate many of these concerns, but I want to focus on the one that seems to trip most people up: function composition.

The need for function composition

At my current startup, we have several Lambda functions that handle very specific pieces of functionality as part of our article processing pipeline. We have a function that crawls a web page and then extracts and parses the content. We have a function that runs text comparisons and string similarity against existing articles to detect duplicates, syndicated articles, and story updates. We have one that runs content through a natural language processing (NLP) engine to extract entities, keywords, sentiment, and more. We have a function that performs inference and applies tagging and scores to each article. Plus many, many, many more.

In most cases, the data needs to flow through all of these functions in order for it to be usable in our system. It’s also imperative that the data be processed in the correct order, otherwise some functions might not have the data they need to perform their specific task. That means we need to “orchestrate” these functions to ensure that each step completes successfully.

If you’re familiar with the AWS ecosystem, you’re probably thinking, “why not just use Step Functions?” I love Step Functions, and we initially started going down that path. But then we realized a few things that started making us question that decision.

First, we were composing a lot of steps, so the number of possible transitions required for each article (including failure states) became a bit unwieldy and possibly cost-prohibitive. Second, the success guarantee requirement for this process is very low. Failing to process an article every now and then is not a make-or-break situation, and since most of our functions provide data enrichment features, there is little to no need for issuing rollbacks. And finally, while the most utilized workflow is complete end-to-end article processing, this isn’t always the case. Sometimes we need to execute just a portion of the workflow, like rerunning the NLP or the tagger. Sometimes we’ll even run part of the workflow for debugging purposes, so having a lot of control is really important.

Could we have built a whole bunch of Step Functions to handle all of these workflows? I’m sure we could have. But after some research and experimentation, we found a pattern that was simple, composable, flexible, and still had all the retry and error handling guarantees that we would likely have used with Step Functions. I call it the Dynamic Composer.

The Dynamic Composer Pattern

The key to this pattern is the utilization of asynchronous function invocations (yes, Lambdas calling Lambdas). Below is a simple diagram that shows the composition of five functions (Extractor, Comparator, NLP Analyzer, Tagger, and Persistor). This is only a small subset of our process for illustration purposes. This pattern can be used to compose as many functions as necessary.

Each of the thick black arrows represents an asynchronous call to the next Lambda function in the workflow. Each Lambda function has an SQS queue for its Dead Letter Queue (DLQ), and each DLQ is attached to a CloudWatch Alarm to issue an alert if the queue count is greater than zero. Now for the “dynamic” part.

When we invoke the first function in our workflow (this can be from another function, API Gateway, directly from a client, etc.), we pass an array of function names with our event. For example, to kick off the full workflow above, the payload for our extractor function might look like this:
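The original payload snippet didn’t survive extraction, but based on the description it would be something along these lines (the field and function names here are illustrative assumptions, not from the post):

```javascript
// A hypothetical invocation payload for the extractor. Since the extractor
// itself is being invoked directly, _compose lists only the remaining steps.
const event = {
  url: 'https://example.com/some-article', // the article to process (assumed field)
  _compose: [
    'comparator-function',   // next step after the extractor
    'nlp-analyzer-function',
    'tagger-function',
    'persistor-function'
  ]
}
```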

The _compose array contains the order of composition, and can be dynamically generated based on your invocation. If you only wanted to run the tagger and the persistor, you would adjust your _compose array as necessary.
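For instance, to run only those last two steps, you would invoke the tagger directly and pass just the steps that follow it (again, the names and fields are hypothetical):

```javascript
// A partial run: invoke the tagger directly; _compose holds only the
// steps that come after it.
const partialEvent = {
  articleId: 'abc123', // hypothetical: whatever input the tagger needs
  _compose: ['persistor-function']
}
```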

In order for our functions to properly handle the _compose array, we need to include a small library script in each of our functions. Below is a sample composer.js script that accepts the event from the function, as well as the payload that it should return or pass to the next function. If there are no more functions in the composition list (or it doesn’t exist at all), then the payload is returned.
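The script itself isn’t reproduced here, but a minimal sketch of what it might look like follows. The exact implementation is an assumption on my part; this version uses the Node.js AWS SDK v2, which was current when this pattern was described.

```javascript
// composer.js — a minimal sketch of the helper described above.

// Pull the next function name off the composition list, if there is one.
const nextStep = (compose) => {
  if (!Array.isArray(compose) || compose.length === 0) {
    return { next: null, remaining: [] }
  }
  const [next, ...remaining] = compose
  return { next, remaining }
}

// Accepts the incoming event and the payload this function produced.
// If more steps remain in _compose, invoke the next function asynchronously
// and hand it the rest of the list; otherwise just return the payload.
const compose = async (event, payload) => {
  const { next, remaining } = nextStep(event && event._compose)
  if (!next) return payload
  // Required lazily so the routing logic above can be tested without the SDK.
  const lambda = new (require('aws-sdk').Lambda)()
  await lambda.invoke({
    FunctionName: next,
    InvocationType: 'Event', // 'Event' makes the invocation asynchronous
    Payload: JSON.stringify({ ...payload, _compose: remaining })
  }).promise()
  return payload
}

module.exports = { compose, nextStep }
```

Note that any error from `lambda.invoke()` is deliberately allowed to bubble up, which matters for the error handling guarantees discussed later.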

In order to use this within your functions, you would include it like this:

Yes, we have to add a bit of helper code to our functions, but we still have the ability to call each function synchronously, and our composer.js script will just return the function output.

Wouldn’t Step Functions be more reliable?

That entirely depends on the use case and whether or not you need to add things like rollbacks and parallelism, or have more control over the retry logic. For us, this pattern works really well and gives us all kinds of amazing guarantees. For example:

Automatic Retry Handling
Because we are invoking the next function asynchronously, the Lambda service will automatically retry the event twice for us. The first retry is typically within a minute, and the second is after about two minutes.

Error Handling
There are two ways in which this pattern will automatically handle errors for us. The first is if the lambda.invoke() call at the end of our function fails. This is highly unlikely, but since we allow this error to bubble up to the Lambda function, a failed invocation will fail this function, causing it to retry as mentioned above. If all three attempts fail, the event will be saved in our DLQ. The other scenario is when the Lambda service accepts the invocation, but then that function fails. Just like in the first scenario, it will be retried twice and then moved to the DLQ.

Durability and Replay
Since any failed event will be moved to the appropriate DLQ, that event will be available for inspection and replay for up to 14 days. Our CloudWatch Alarms will notify us if there is an item in one of these queues, and we can easily replay the event without losing the information about the workflow (since it is part of the event JSON). If there were a particular use case that needed to be automatically replayed, an extra Lambda function polling the DLQs could handle that for you.

Automatic Throttling
There are a lot of great throttling control use cases for SQS to Lambda, but many people don’t realize that asynchronous Lambda invocations give you built-in throttling management for free! If you invoke a Lambda function asynchronously and the current reserved concurrency is exceeded, the Lambda service will retry for up to six hours before eventually moving the event to your DLQ.

Should you use this pattern?

I have to reiterate that Step Functions are amazing, and that for complex workflows that require rollbacks, or very specific retry policies, they are absolutely the way to go. However, if your workflows are fairly simple, meaning that it is mostly passing the results of one function to the next, and if you want to have the flexibility to alter your workflows on the fly, then this may work really well for you.

Your functions shouldn’t require any sort of significant rewrites to implement this. Just plug in your own version of the composer.js script and change the return. You could even build a Lambda Layer that wraps the functions and automatically adds the composer for you. The functions themselves will still be independently callable and can still be included in other Step Functions workflows. But you always have the ability to compose them together by simply passing a _compose array.

And there’s one more very cool thing (well, at least I think it’s cool)! Even though this sort of violates the Single Responsibility Principle that I mentioned at the beginning, we couldn’t help adding some “composition manipulation” within our functions. For example, if our comparator function detects that an article is a duplicate, then we will alter the _compose array to remove parts of the workflow. This adds a bit of extra logic to our functions, but it gives us some really handy controls over the workflow. Of course, this logic is conditioned on whether or not _compose exists, so we can still call the functions independently.
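As a sketch of that idea (the step names and the duplicate check are hypothetical), the comparator might prune the list like this:

```javascript
// Inside the hypothetical comparator: when an article is a duplicate,
// skip the remaining enrichment steps and keep only persistence.
const pruneForDuplicate = (event, isDuplicate) => {
  // Called standalone (no _compose), or not a duplicate: leave it alone.
  if (!isDuplicate || !Array.isArray(event._compose)) return event
  return {
    ...event,
    _compose: event._compose.filter((fn) => fn === 'persistor-function')
  }
}
```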

So, should you use this pattern? Well, we use it to process thousands of articles a day and it works like a charm for us. If you have a workflow that this might work for, I’d love to hear about it.

Hope you find this useful. Happy serverlessing! 😉




9 thoughts on “The Dynamic Composer (an AWS serverless pattern)”

  1. Hi Jeremy,

    this is a really nice post. Thank you for sharing it!

    One note on the pattern. I think it has already been named: it’s “Routing Slip” (https://www.enterpriseintegrationpatterns.com/patterns/messaging/RoutingTable.html). The only difference that I noticed is that the workflow invocation in your implementation returns only when all the steps complete. There is no such assumption in the Routing Slip. I would treat this as an implementation detail though.

    Cheers!

  2. Why not make all of those functions into one function? If each function does minimal work, let’s say 20ms of work, you would be paying 100ms (since that is the minimum Lambda pricing unit) × 5 = 500ms of Lambda time. Compared to a single function that would only cost 100ms, the Dynamic Composer would then be 5× more expensive?

    1. Hi Rehan,

      I think I mentioned this in another comment, but these functions don’t run for less than 100ms. Oftentimes they need to run for multiple seconds because they are accessing complex third-party APIs or doing a significant amount of processing. Tuning each function actually saves us money depending on the process, plus we can rely on the cloud to handle errors and retries for us.

      – Jeremy

  3. Hi Jeremy, very nice pattern! It caught my attention because I’m in exactly that situation.

    I want to try it out, and for that I would like to ask if you could share a more detailed example, maybe on Bitbucket?

    And, correct me if I’m wrong, every function will still run until the last in the chain is finished?

    Thanks

    Carlos

  4. Cool article!

    Just wanted to point out that the key detail is that when you invoke the Lambda you use “InvocationType: ‘Event’”, otherwise the call would be synchronous. Just in case someone missed this.

  5. Hi Jeremy,
    I find this idea very useful in some scenarios.
    To decrease function coupling, have you thought about using an API gateway, queue or pub/sub instead of having functions wired directly to each other?

    1. Hi Renato,

      For the purpose of this workflow, introducing additional complexity didn’t make sense. As far as decoupling is concerned, this approach handles that quite well without needing to use any intermediary services. I typically would never use an API Gateway to invoke functions from other functions given the added latency, and adding more queues, SNS, or EventBridge would have added unnecessary costs. Like you said, this works in “some” scenarios, so I wouldn’t recommend it for many use cases.

      – Jeremy
