Aurora Serverless Data API: A First Look

On Tuesday, November 20, 2018, AWS announced the release of the new Aurora Serverless Data API. This has been a long awaited feature and has been at the top of many a person’s #awswishlist. As you can imagine, there was quite a bit of fanfare over this on Twitter.

Obviously, I too was excited. The prospect of not needing to use VPCs with Lambda functions to access an RDS database is pretty compelling. Think about all those cold start savings. Plus, connection management with serverless and RDBMS has been quite tricky. I even wrote an NPM package to help deal with the max_connections issue and the inevitable zombies 🧟‍♂️ roaming around your RDS cluster. So AWS’s RDS via HTTP seems like the perfect solution, right? Well, not so fast. 😞

Before I go any further, I want to make sure that I clarify a few things. First, the Data API is in BETA, so this is definitely not the final product. Second, AWS has a great track record with things like this, so I’m betting that this will get a heck of a lot better before it reaches GA. And finally, I am a huge AWS fan (and I think they know that 😉), but this first version is really rough, so I’m not going to pull any punches here. I can see this being a complete #gamechanger once they iron out the kinks, so they can definitely benefit from constructive feedback from the community.

Enabling the Data API

Before we dive into performance (honestly, I want to avoid telling you about it for as long as possible), let’s look at the setup. There is an AWS guide that tells you how to switch on the Data API. The guide is pretty thin right now, so I’ll give you the basics.

NOTE: The Data API only works with Aurora Serverless clusters AND it is only available in the us-east-1 region. If you haven’t played around with Aurora Serverless yet, check out my post Aurora Serverless: The Good, the Bad and the Scalable.

Enable Data API in Network & Security settings of your cluster

You need to modify your Aurora Serverless cluster by clicking “ACTIONS” and then “Modify Cluster”. Just check the Data API box in the Network & Security section and you’re good to go. Remember that your Aurora Serverless cluster still runs in a VPC, even though you don’t need to run your Lambdas in a VPC to access it via the Data API.

Next you need to set up a secret in the Secrets Manager. 🤫 This is actually quite straightforward. Enter the user name and password, pick an encryption key (the default is probably fine for you), and select the database you want to access with the secret.

Enter database credentials and select database to access

Next we give it a name. This is important because the name becomes part of the ARN when we set up permissions later. You can give it a description as well so you don’t forget what this secret is for when you look at it in a few weeks.

Give your secret a name and add a description

You can then configure your rotation settings, if you want, and then review and create your secret. Then you can click on your newly created secret and grab the ARN; we’re gonna need that next.

Click on your secret to get the ARN.

Using the AWS SDK’s RDSDataService

If you were looking for this in the AWS guide for the Data API, you probably won’t find it. As of this writing, it isn’t in there. You may have stumbled across the SDK docs and found Class: AWS.RDSDataService. But there are a bunch of options that bury the lede. Right now we just care about executeSql(). Here is the snippet from the docs:
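(Reproduced here from the beta SDK reference, so treat the parameter names like awsSecretStoreArn and dbClusterOrInstanceArn as likely to change before GA.)

```javascript
// From Class: AWS.RDSDataService in the SDK docs (beta, so names may change)
const AWS = require('aws-sdk')
const rdsdataservice = new AWS.RDSDataService()

const params = {
  awsSecretStoreArn: 'STRING_VALUE',      /* required - your Secrets Manager ARN */
  dbClusterOrInstanceArn: 'STRING_VALUE', /* required - your Aurora Serverless cluster ARN */
  sqlStatements: 'STRING_VALUE',          /* required - the SQL to run (yes, a plain string) */
  database: 'STRING_VALUE',
  schema: 'STRING_VALUE'
}

rdsdataservice.executeSql(params, function (err, data) {
  if (err) console.log(err, err.stack) // an error occurred
  else console.log(data)               // successful response
})
```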

Easy enough. Looks like we’re going to need the ARN from the secret we just created, the ARN of our Aurora Serverless cluster (you can find that in the cluster details), and then our SQL statements. Before we take this out for a drive, we need some data to query. I set up a database with a single table and started by inserting five rows:

A really, really, really complex MySQL table 😂

Now let’s set up a simple Lambda function and give it a try.
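Here’s a stripped-down version of what I used (the ARNs, database, and table names are placeholders, and I’ve skipped the error handling):

```javascript
const AWS = require('aws-sdk')
const RDS = new AWS.RDSDataService()

exports.handler = async (event) => {
  // Both ARNs come from the setup steps above (placeholders here)
  const data = await RDS.executeSql({
    awsSecretStoreArn: 'arn:aws:secretsmanager:us-east-1:XXXXXXXXXXXX:secret:mySecret-XXXXXX',
    dbClusterOrInstanceArn: 'arn:aws:rds:us-east-1:XXXXXXXXXXXX:cluster:my-serverless-cluster',
    database: 'myDB',
    sqlStatements: 'SELECT * FROM myTable'
  }).promise() // .promise() lets us await the call instead of using callbacks

  console.log(JSON.stringify(data, null, 2))
  return data
}
```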

Notice above that I’m using async/await, so I’m taking advantage of the .promise() method that AWS provides to promisify their services. You can use callbacks if you really want to. But I wouldn’t.

I used the Serverless framework to publish this to AWS, but those are just details. Let’s give this a try and see what happens when we publish and run it.

Hmm, looks like the version of aws-sdk running on Lambdas in us-east-1 isn’t the latest version. Let’s repackage our function with the aws-sdk and try it again.
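Quick aside: if you want to check which SDK version the Lambda runtime is actually giving you, logging it from inside the function works fine:

```javascript
const AWS = require('aws-sdk')

exports.handler = async () => {
  // If this logs a version that predates RDSDataService, you'll need to
  // bundle the aws-sdk in your deployment package instead of relying on
  // the one baked into the Lambda runtime.
  console.log('aws-sdk version:', AWS.VERSION)
}
```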

AccessDeniedException: User is not authorized to perform: rds-data:ExecuteSql

Okay, looks like we need some IAM permissions. Let’s add those:
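Since I’m deploying with the Serverless framework, these go in iamRoleStatements in serverless.yml. Roughly this (the runtime is just what I happened to use, and I’m using a wildcard resource for brevity):

```yaml
provider:
  name: aws
  runtime: nodejs8.10
  region: us-east-1
  iamRoleStatements:
    - Effect: Allow
      Action:
        - rds-data:ExecuteSql
      Resource: "*" # scope this down to your cluster ARN in real life
```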

And try it again.

BadRequestException: User is not authorized to perform: secretsmanager:GetSecretValue

Crap. Okay, we need some more IAM permissions:
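Back into serverless.yml, appended to the same iamRoleStatements block (swap in the ARN of the secret you created earlier):

```yaml
    - Effect: Allow
      Action:
        - secretsmanager:GetSecretValue
      Resource: "arn:aws:secretsmanager:us-east-1:XXXXXXXXXXXX:secret:mySecret-XXXXXX"
```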

Okay, now we should be good to go! Let’s run it again.

What the?
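Here’s roughly what came back, heavily abbreviated (I’m reconstructing the shape from the beta response, so don’t treat the exact field list as gospel):

```javascript
// Abbreviated sketch of the response (one value of one record shown).
// Every single value comes back wrapped in an object carrying a field for
// every possible type, which is where most of the bloat comes from.
const response = {
  sqlStatementResults: [{
    numberOfRecordsUpdated: 0,
    resultFrame: {
      resultSetMetadata: {
        columnCount: 5,
        columnMetadata: [ /* ...name, type, etc. for every column... */ ]
      },
      records: [{
        values: [{
          bigIntValue: 1,
          bitValue: false,
          doubleValue: 0,
          intValue: 0,
          isNull: false,
          realValue: 0,
          stringValue: ''
          // ...and so on, for every value of every row
        }]
      }]
    }
  }]
}
```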

This is querying our tiny little table of 5 rows with 5 columns with very little data. The Data API returned this monstrous JSON response that was over 11 KB and took 228 ms to run! Okay, we can’t put this off any longer. Let’s look at the performance. 😿

Data API Performance Metrics

Alright, so let’s just rip off the bandaid here. The performance is not good. I added a few more rows to the table and ran a comparison of the Data API versus a MySQL connection (using the mysql package) in a VPC. Here’s what I got:

Selecting 175 rows via the DATA API versus a MySQL connection in a VPC

This was the same query run against the same cluster and table. You can see that the Data API took 204 ms to query and return 175 rows versus the MySQL connection that only took 5 ms. Something to note here is that the 5 ms was after the function was warm and the initial MySQL connection was established. Obviously VPCs have a higher cold start time, so the first query will be a bit slower (about 150 ms plus the cold start). After that though, the speed is lightning fast. However, the Data API averaged over 200 ms every time it ran, warm or not.

Also, the sizes of the responses were radically different. The Data API returned another monster JSON blob weighing in at 152.5 KB. The direct MySQL connection returned essentially the same data in under 30 KB. I’m sure there will be optimizations in the future that will allow us to reduce the size of this response. There is a bunch of stuff in there that we don’t need.

Next I tried some INSERTs. I ran 10 simple INSERT statements with just one column of data. Once again I compared the Data API to a MySQL connection in a VPC.

10 serial INSERTs via the DATA API versus a MySQL Connection in a VPC

Once again, the direct MySQL connection blew away the Data API in terms of response times. Same caveats as before: these were warm functions, so the MySQL connection was already established and being reused. But as you can see, each Data API call suffers from the same high latency as the one before it, which means, as is to be expected with an HTTP endpoint, that there is no performance gain from reusing the same const RDS = new AWS.RDSDataService() instance.
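For reference, the Data API half of that test was essentially just this (timing boilerplate trimmed; the table and column names are made up for the example):

```javascript
const AWS = require('aws-sdk')
const RDS = new AWS.RDSDataService() // reused across calls - it doesn't help

exports.handler = async () => {
  for (let i = 0; i < 10; i++) {
    const start = Date.now()
    await RDS.executeSql({
      awsSecretStoreArn: process.env.SECRET_ARN, // set via environment variables
      dbClusterOrInstanceArn: process.env.CLUSTER_ARN,
      database: 'myDB',
      sqlStatements: `INSERT INTO myTable (someColumn) VALUES('test${i}')`
    }).promise()
    console.log(`INSERT ${i}: ${Date.now() - start} ms`) // ~200 ms, every time
  }
}
```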

Another thing to note, however, is that the performance wasn’t impacted by more complex queries or larger data retrievals. The underlying MySQL engine performs as expected, so if AWS can fix this roundtrip latency issue, then hopefully all these performance issues go away.

Tweaking the knobs

I’ve been waiting for HTTP endpoints for RDS for a loooong time, so I didn’t want to let a few bad experiments ruin my day. I decided to turn some knobs to see if that would affect the performance. First thing I did was turn up the memory on my function. Higher memory equals higher CPU and throughput (I think), so I gave that a try. Unfortunately, there was no impact.

Then I thought, maybe if I beef up the database cluster, it might shave off some milliseconds. This was obviously wishful thinking, but I tried it anyway. I cranked up my cluster to 64 ACUs and… nothing. 😖 Oh well, it was worth a shot.

What about security?

So another thing I noticed when I first looked at the docs is that the sqlStatements parameter expects a string. Yup, a plain old string. Not only that, you can separate MULTIPLE SQL statements with semicolons! Did I mention this parameter only accepts a string? If you’re not sure why this is a big deal, read up on SQL Injection or have a look at my Securing Serverless: A Newbie’s Guide.
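To make that concrete, here’s a hypothetical handler doing the naive thing (please don’t):

```javascript
const AWS = require('aws-sdk')
const RDS = new AWS.RDSDataService()

exports.handler = async (event) => {
  // DON'T DO THIS - untrusted input goes straight into the SQL string
  const sql = `SELECT * FROM users WHERE name = '${event.name}'`

  // A malicious value like:  bob'; DROP TABLE users; --
  // turns this single query into two statements, and the second one hurts.
  return RDS.executeSql({
    awsSecretStoreArn: process.env.SECRET_ARN,
    dbClusterOrInstanceArn: process.env.CLUSTER_ARN,
    database: 'myDB',
    sqlStatements: sql
  }).promise()
}
```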

Don’t want to listen to me? Fine, but you should definitely take Ory Segal’s advice. He’s one of those people who knows what he’s talking about.

But seriously, this is a huge security issue if you aren’t properly escaping values. The mysql package he referenced actually disables multiple statements by default because they can be so dangerous. Let’s hope that some additional features are added that will do some of the escaping for us.
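In the meantime, you can at least borrow the escaping from the mysql package (its escaping lives in the standalone sqlstring module) before building the string yourself. A minimal sketch:

```javascript
const SqlString = require('sqlstring') // the same escaping the mysql package uses

function buildQuery(name) {
  // SqlString.escape() quotes the value and escapes dangerous characters,
  // so the injection attempt becomes a harmless string literal
  return `SELECT * FROM users WHERE name = ${SqlString.escape(name)}`
}

console.log(buildQuery("bob'; DROP TABLE users; --"))
// SELECT * FROM users WHERE name = 'bob\'; DROP TABLE users; --'
```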

Some final thoughts

This thing is still in beta, and it really shows. There is a lot of work to be done, but I have faith that the amazing team at AWS will eventually turn the Data API into pure gold. The latency seems to be entirely due to the overhead of setting up and tearing down the VPC connections behind the scenes. DynamoDB is HTTP-based and has single-digit latency, so I’m guessing that HTTP isn’t the major issue.

Anyway, here are a few of the things that I’d like to see before the Data API goes GA:

  • Increased performance: I’d much rather suffer through a few cold starts now and then to enjoy 5 ms queries than to suffer through 200 ms for every query. Right now, these speeds make it unusable for synchronous use cases.
  • Response formatting: I get the value in returning the different data types, but it is overkill for 99% of queries. Besides simplifying that (and getting the JSON size down a bit), optionally returning the column information would be helpful too. I don’t need it most of the time.
  • Prepared queries: The current sqlStatements parameter is too dangerous. I know developers should take some responsibility here, but needing another library to escape SQL queries is unnecessary overhead. Some simple features of the mysql package (maybe a new params field that accepts an array and replaces ? in the queries) would go a long way (see the sketch after this list).
  • Disable multiple statements by default: Having the ability to send multiple queries is really powerful (especially over HTTP), but it’s also super dangerous. It would be safer if you needed to expressly enable multiple statement support. Even better, require multiple statements to be sent in as an array.
  • IAM Role-based access: The secrets manager thing is okay, but it would be better if we could access Aurora Serverless using just the IAM role. I know that Aurora Serverless doesn’t support that yet, but this would be a helpful addition.
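To illustrate that prepared queries idea: the params field below does not exist in today’s Data API; it’s purely a sketch of what I’d like to see.

```javascript
// HYPOTHETICAL - the params field below does not exist in the Data API
const AWS = require('aws-sdk')
const RDS = new AWS.RDSDataService()

async function safeQuery() {
  return RDS.executeSql({
    awsSecretStoreArn: process.env.SECRET_ARN,
    dbClusterOrInstanceArn: process.env.CLUSTER_ARN,
    database: 'myDB',
    sqlStatements: 'SELECT * FROM users WHERE name = ? AND status = ?',
    params: ['bob', 'active'] // values would be escaped and substituted for the ?s
  }).promise()
}
```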

I have to say that I really love this concept. Yes, I’m underwhelmed by the initial implementation, but again, it is still very early. When (and I’m confident that it is a when, not an if) the AWS team works through these issues, this will help dramatically with serverless adoption. There are still plenty of use cases for RDBMS, so making it easier to use them is a huge win for serverless.

Finally, since I’ve offered a lot of criticism, I figured I’d end this on a bit of a positive note. The latency is a killer for synchronous applications, BUT, even in its current state, I can see this being extremely useful for asynchronous workflows. If you are running ETLs, for example, firing off some bulk loads into your reporting database without needing to run the Lambdas in a VPC would be quite handy.

What are your thoughts on the new Aurora Serverless Data API? I’d love to know what you think. Hit me up on Twitter or share your thoughts in the comments below.


7 thoughts on “Aurora Serverless Data API: A First Look”

  1. This seems not to have a transaction model. Without that, it will be useful for just a small selection of use cases where an RDBMS makes sense without transactions, instead of using, for example, a NoSQL DB.

    1. I have a few more things I want to experiment with, including transactions and load testing to see how well it handles the underlying connection management. Did you try transactions?

  2. Thanks a lot. I’ve tried Aurora Serverless too and I am facing the same problem using the NodeJS SDK. Could you please tell me how you did the “Let’s repackage our function with the aws-sdk and try it again.” step?

    Thanks in advance
    Sven

    1. Hi Sven,

      I just ran npm install aws-sdk and installed it as a regular dependency instead of a “dev dependency”. Some frameworks might not package it on deploy, so make sure you check the documentation so that it will be included in your node_modules directory.

      – Jeremy

    2. Yeah, thanks for your answer – that’s what I finally did ;-). I misunderstood your repackaging. Deploying the aws-sdk manually works fine, but of course it makes my deployment package much bigger… let’s see how fast the Lambda SDK in us-east-1 becomes up-to-date.

      -Sven

  3. Thanks for the article/tutorial, you saved me some time. Providing the correct IAM configuration for serverless is also super useful. I was actually thinking this could work for me even with the performance issues, but the lack of being able to run a transaction is probably more of a problem.

  4. Thanks for the great article. Have been playing with it myself and I couldn’t agree more with the list of things that they have to improve. The response is just too much and the performance is pretty low.
    I am still sticking with a connection to the db for now, but I’m building a wrapper on the DB communication and expect to change only that when this is fully ready.
    This is a great feature though! Not having to put my lambdas inside my VPC is such a lifesaver.
