<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://aaronstuyvenberg.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://aaronstuyvenberg.com/" rel="alternate" type="text/html" /><updated>2026-01-30T18:25:42+00:00</updated><id>https://aaronstuyvenberg.com/feed.xml</id><title type="html">AJ Stuyvenberg</title><subtitle>The internet home of Aaron (AJ) Stuyvenberg. Software Engineering blog, BASE Jumping videos, misc ramblings.</subtitle><author><name>AJ Stuyvenberg</name></author><entry><title type="html">Clawdbot bought me a car</title><link href="https://aaronstuyvenberg.com/posts/clawd-bought-a-car" rel="alternate" type="text/html" title="Clawdbot bought me a car" /><published>2026-01-24T00:00:00+00:00</published><updated>2026-01-24T00:00:00+00:00</updated><id>https://aaronstuyvenberg.com/posts/clawd-bought-a-car</id><content type="html" xml:base="https://aaronstuyvenberg.com/posts/clawd-bought-a-car"><![CDATA[<h2 id="car-buying-in-2026-still-sucks">Car buying in 2026 still sucks</h2>
<p>Buying a car from a dealership is an objectively awful experience. There’s a long history behind why manufacturers can’t sell directly to customers (without certain workarounds like Tesla/Rivian), so unless you’re going that route you’ll inevitably need to talk with someone trying to sell you a car ASAP. Salespeople are typically paid on commission so they’re incentivized to get you out of the test drive and into the finance office as quickly as possible.</p>

<p>It’s also typically a low-trust endeavor. Manufacturers change incentives every few weeks. Loan rates change constantly. You’ll negotiate a price and learn they didn’t include expensive dealer add-ons which can’t be removed, or an offer made today is gone tomorrow. Then when you’re exhausted and at the end of your patience, they’ll slide over a prepaid maintenance contract or key replacement service. It’s awful.</p>

<p>So when my family needed to replace our trusty old Subaru, I thought it’d be a good opportunity to say “Claude, take the wheel” and handed over the keys for my digital life to a chatbot.</p>

<h2 id="clawdbot-then-moltbot-now-openclaw">Clawdbot, then Moltbot, now OpenClaw</h2>
<p><a href="https://clawd.bot">Clawdbot</a>, recently renamed Moltbot and now OpenClaw to avoid any trademark issues with Anthropic’s Claude, is the internet’s latest obsession after, well, Claude Code. It’s an <a href="https://github.com/clawdbot/clawdbot">open source</a> project which pairs an LLM with long-running processes to do things like read and write email (and monitor for replies), manage your calendar, and drive a browser to great effect. Unlike ChatGPT or Claude Code, Clawdbot doesn’t start from a blank slate every time it runs. It saves files, breadcrumbs, and your chat histories, so it can handle tasks which span a few days without much issue:</p>

<p><span class="image half"><a href="/assets/images/clawd_car/clawd_bot.png" target="_blank"><img src="/assets/images/clawd_car/clawd_bot.png" alt="Clawdbot logo" /></a></span></p>

<p>I’ve been dying to try it out on something <em>real</em> and useful, so buying a new car seemed like a good first task.</p>

<p>You can prompt Clawdbot from a web browser just like ChatGPT, or from the terminal CLI like Claude Code. The real power comes when you link it to a messaging service: messages sent via WhatsApp (or iMessage, Signal, or Telegram) become prompts for Clawdbot to take action on your behalf. I chose a combination of the browser and WhatsApp. It took a bit of fiddling around with Google Cloud to set up <code class="language-plaintext highlighter-rouge">gog</code> and access gmail/gdrive/gcal, but soon enough Clawdbot was able to access basically my entire digital life.</p>

<p>I installed Clawdbot on my M1 MacBook and named it <code class="language-plaintext highlighter-rouge">Icarus</code>, for reasons which became obvious to me in hindsight.</p>

<h2 id="the-car">The car</h2>
<p>For a variety of reasons we landed on a Hyundai Palisade.
I’m not interested in explaining the entire rationale, but YouTuber Doug DeMuro gives a good explanation of why this car stood out for him <a href="https://youtu.be/q5J1JHlcLvE?t=1815">here</a>. After a few test drives and lots of research we moved from the <code class="language-plaintext highlighter-rouge">looking</code> phase to the <code class="language-plaintext highlighter-rouge">buying</code> phase.</p>

<p>Ask anyone in sales and they’ll tell you that walking into a negotiation with a bit of extra knowledge is often the edge you need to win. So I decided to kick things off with a bit of price discovery. Car prices are very local, so I wanted to see what people in my area were paying for the vehicle/trim that we wanted.</p>

<p>I began with a simple enough prompt:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Search reddit.com/r/hyundaipalisade and find the typical and lowest prices people paid for a 2026 palisade hybrid in Massachusetts
</code></pre></div></div>

<p>Clawdbot churned away and flipped through several browser windows. Interestingly enough it hit a few roadblocks including an error message saying <code class="language-plaintext highlighter-rouge">Your request was blocked by network security</code>, but Clawdbot would not be denied.</p>

<p>After a few minutes it found that most people paid around $58k (plus tax/title/licensing):
<span class="image half"><a href="/assets/images/clawd_car/price_discovery.png" target="_blank"><img src="/assets/images/clawd_car/price_discovery.png" alt="Price Discovery" /></a></span></p>

<p>So that left us with a target price of hopefully $57k.</p>

<h2 id="finding-the-car">Finding the car</h2>
<p>My wife had picked out a specific color combination which was a bit rare. Blue (or green), with a brown interior. I didn’t want to browse every dealer site or call anyone, so I used an <a href="https://hexorcism.com/HyundaiApp/inventory.php">online inventory tool</a> and gave Clawdbot the following prompt:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Use https://hexorcism.com/HyundaiApp/inventory.php to search dealers for a Palisade Hybrid in the Calligraphy trim with a green or blue exterior and brown (code ISB) interior. Stay within 50 miles of Boston. Then find the car using the VIN number on each dealers website and contact them asking for the best out-the-door price
</code></pre></div></div>

<p>Clawdbot churned away at this for some time. It popped up several browser tabs, and started filling out forms with my contact information. Clawdbot already had my email address (because I gave it gmail access). Since I had also set up whatsapp, Clawdbot had my phone number too.</p>

<p><span class="image half"><a href="/assets/images/clawd_car/inquiry_submitted.png" target="_blank"><img src="/assets/images/clawd_car/inquiry_submitted.png" alt="Inquiry Submitted" /></a></span></p>

<p>I typically never want to negotiate for a car on the phone; it’s easier to cut through noise and fluff in writing. Most dealers do require a phone number to complete their contact page, but not all. Clawdbot pre-filled my real number onto the form without prompting me at all! Suddenly the automated texts and calls started trickling in.</p>

<p>This was my first jaw-dropping moment with Clawdbot. I prompted this language model hooked up to a browser and email, and moments later it did something very useful to me in the “real world”!</p>

<p>But the next day messages started pouring in from actual salespeople, and the real work began.</p>

<h2 id="negotiating">Negotiating</h2>
<p>My simple negotiation strategy is to send each dealer the lowest quote and ask them to beat it. This works best if you don’t care about the color or specifications, as you can find vehicles which have been sitting on the lot for 30+ days, which salespeople are more inclined to discount. It’s a bit riskier if you want a less common, more sought-after color; those tend to move more quickly.</p>

<p>Clawd had found 3 area dealers which had the car. By the second day all had emailed us back, so I asked it to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Check my emails every few minutes for messages from dealers. Negotiate for the lowest sale price possible, do not negotiate any trade in or interest rate. Just the lowest price. Prompt me before replying to anything consequential.
</code></pre></div></div>

<p>This set up a cron task within Clawdbot. It quickly played people off each other, sending the quote PDF files from dealer 1 to dealer 2. I got a few text messages here as well, but at this point I hadn’t quite gotten iMessage set up correctly, so when those came in I just asked the salespeople to email me and let Clawdbot take over.</p>

<p><span class="image half"><a href="/assets/images/clawd_car/cron_running.png" target="_blank"><img src="/assets/images/clawd_car/cron_running.png" alt="Cron Running" /></a></span></p>

<p>Clawdbot also made a couple of mistakes in this phase. When dealers would call, my flow was to politely decline and answer as many questions as I could via email with Clawdbot. At one point I got an inbound call and an email at the same time, so I asked Clawdbot to reply and say <code class="language-plaintext highlighter-rouge">I can't talk, I'm in a condo board meeting. Email them back with our search parameters</code> and in a timeless blunder, Clawd picked the wrong email thread and sent this to someone we were already negotiating with:</p>

<p><span class="image half"><a href="/assets/images/clawd_car/email_mistake.png" target="_blank"><img src="/assets/images/clawd_car/email_mistake.png" alt="Email Mistake" /></a></span></p>

<p>That was the only minor slip-up by Clawdbot during this process. I didn’t allow Clawd to be fully autonomous, which I’m sure would have caused additional issues.</p>

<h2 id="closing-the-deal">Closing the deal</h2>
<p>Eventually one dealer stopped responding, but two were very eager to make a deal. The emails kept flying, we had a bidding war!</p>

<p><span class="image half"><a href="/assets/images/clawd_car/bidding_war.png" target="_blank"><img src="/assets/images/clawd_car/bidding_war.png" alt="Bidding War" /></a></span></p>

<p>Finally one dealer replied and said they’d take an additional $500 off if we closed that night. Clawdbot managed to negotiate a <strong>$4200 dealer discount</strong> which put us below our target and down to <strong>$56k!</strong></p>

<p>At this point credit applications were being sent around so I asked Clawd to stop and took over the actual communications. Thankfully this dealer had an entirely online process so I was able to e-sign everything and pick up the car the next day.</p>

<p><span class="image half"><a href="/assets/images/clawd_car/deal_made.png" target="_blank"><img src="/assets/images/clawd_car/deal_made.png" alt="Deal Made" /></a></span></p>

<h2 id="wrapping-up">Wrapping up</h2>
<p>My experience with Clawdbot made me feel like I’m living in the future. It’s the first big “leap” I’ve felt since Claude Code launched. I’ve already found a dozen additional use cases, including politely declining inbound recruiter messages via email or LinkedIn. It’s also exceedingly good at setting up little cronjobs for web tasks, which is going to be my primary use case going forward.</p>

<p>Running those cronjobs constantly made Clawdbot pretty annoying to keep on a laptop that I also used for other things. Since I needed a home desktop anyway, I picked up a new Mac Mini for Clawd (a popular trend on the internet these past few weeks):</p>

<p><span class="image fit"><a href="/assets/images/clawd_car/clawd_home.jpg" target="_blank"><img src="/assets/images/clawd_car/clawd_home.jpg" alt="Clawd's new home" /></a></span></p>

<p>If you like this type of nonsense (or more technical stuff) you can follow me on <a href="https://twitter.com/astuyve">twitter</a> and send me any questions or comments.</p>]]></content><author><name>AJ Stuyvenberg</name></author><category term="posts" /><summary type="html"><![CDATA[Outsourcing the painful aspects of a car purchase to AI was refreshingly nice, and sold me on the vision of Clawdbot (now OpenClaw)]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://aaronstuyvenberg.com/assets/images/clawd_car/clawd_home.jpg" /><media:content medium="image" url="https://aaronstuyvenberg.com/assets/images/clawd_car/clawd_home.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Does AWS Lambda have a silent crash in the runtime?</title><link href="https://aaronstuyvenberg.com/posts/does-lambda-have-a-silent-crash" rel="alternate" type="text/html" title="Does AWS Lambda have a silent crash in the runtime?" /><published>2025-07-16T00:00:00+00:00</published><updated>2025-07-16T00:00:00+00:00</updated><id>https://aaronstuyvenberg.com/posts/does-lambda-have-a-silent-crash</id><content type="html" xml:base="https://aaronstuyvenberg.com/posts/does-lambda-have-a-silent-crash"><![CDATA[<p>A <a href="https://web.archive.org/web/20250707165527/https://lyons-den.com/whitepapers/aws-lambda-silent-crash.pdf">blog post</a> went very viral in the AWS space recently which asserts that there’s a silent crash in AWS Lambda’s NodeJS runtime when HTTP calls are made from a Lambda function. The post is nearly 23 pages long and mostly pertains to the handling of the issue by AWS (which seems like it could have been better), but ultimately my focus here is on the technical aspects of the post.</p>

<p>This post has been updated to the archive link, as the original has been experiencing a hug of death and is <a href="https://lyons-den.com/whitepapers/aws-lambda-silent-crash.pdf">unavailable</a> at the time of publishing.</p>

<h2 id="background">Background</h2>

<p>The author begins by explaining that they investigated this issue thoroughly, provided reproducible code, and even confirmed that this code worked fine on EC2 but somehow failed in Lambda. Here’s the summary:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Over a seven-week investigation, I — as CTO and principal engineer for a healthcare-focused AWS
Activate startup — diagnosed and proved a fatal runtime flaw in AWS Lambda that:
  • Affected Node.js functions in a VPC
  • Caused silent crashes during outbound HTTPS calls
  • Produced no logs, no exceptions, and no catchable errors
  • Was fully reproducible using minimal test harnesses
</code></pre></div></div>

<h2 id="reproducing-the-issue">Reproducing the issue</h2>
<p>Here’s the first snippet of code they provide. The author states this is a NestJS app, but that doesn’t really matter for the purpose of the issue.</p>
<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@</span><span class="nd">Post</span><span class="p">(</span><span class="dl">'</span><span class="s1">/debug-test-email</span><span class="dl">'</span><span class="p">)</span>
<span class="k">async</span> <span class="nx">sendTestEmail</span><span class="p">()</span> <span class="p">{</span>
  <span class="k">this</span><span class="p">.</span><span class="nx">eventEmitter</span><span class="p">.</span><span class="nx">emit</span><span class="p">(</span><span class="nx">events</span><span class="p">.</span><span class="nx">USER_REGISTERED</span><span class="p">,</span> <span class="p">{</span>
    <span class="na">name</span><span class="p">:</span> <span class="nx">Joe</span> <span class="nx">Bloggs</span><span class="p">,</span>
    <span class="na">email</span><span class="p">:</span> <span class="dl">'</span><span class="s1">email@foo.com</span><span class="dl">'</span><span class="p">,</span> <span class="c1">// legitimate email was used for testing</span>
    <span class="na">token</span><span class="p">:</span> <span class="dl">'</span><span class="s1">dummy-token-123</span><span class="dl">'</span><span class="p">,</span>
  <span class="p">});</span>
  <span class="k">return</span> <span class="p">{</span> <span class="na">message</span><span class="p">:</span> <span class="dl">'</span><span class="s1">Manual test triggered</span><span class="dl">'</span> <span class="p">};</span>
<span class="p">}</span>

</code></pre></div></div>

<p>When the handler runs, the author explains, the result is immediately a 201 with the expected success message, but no email is ever sent:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>It emits an event, then immediately returns a response — meaning it always reports success (201),
regardless of whether the downstream email handler succeeds or fails.

But here’s what happened:
  • I received the HTTP response
  • No email arrived
  • No logs appeared in CloudWatch
  • No errors fired
  • And the USER_REGISTERED event handler was never called

The Lambda simply stopped executing — silently, mid-flight.

The 201 response was intentional — and critical. It allowed the controller to return before downstream
failures occurred, revealing that Lambda wasn’t completing execution even after responding
successfully.
A response was returned, but the function NEVER completed its actual work
</code></pre></div></div>

<p>Before we move on, I want to add that <strong>this is exactly what I’d expect to happen</strong>.</p>

<h2 id="the-lifecycle-of-lambda">The lifecycle of Lambda</h2>
<p>So what’s happening here? And why is it expected?</p>

<p>Lambda is famous for “scaling to zero”, where your function code is executed when a request is made, and then “frozen” when the response is completed and there are no other requests to serve. It’s “thawed” again when a new request arrives. Today, a sandbox can only serve one request at a time, and may be reused for subsequent invocations.</p>

<p>After some amount of time, some number of invocations, or for any of several possible reasons, Lambda will shut down the sandbox and reap its resources back into the worker pool.</p>

<p>The issue described by the author is rooted in how Lambda handles this lifecycle, specifically the invoke phase. There are two parts to disambiguate here: the Lambda-managed runtime (which is Node.js in this case) and Lambda’s Runtime API. We’ll start by examining the Runtime API.</p>

<h2 id="the-runtime-api">The Runtime API</h2>
<p>Lambda exposes an HTTP-based Runtime API, hosted at the link-local address found in the <code class="language-plaintext highlighter-rouge">AWS_LAMBDA_RUNTIME_API</code> environment variable. This is a local server which provides the incoming event or request to the Lambda function in JSON format and receives the response from the function once it’s complete. Two of the endpoints are relevant here:
<code class="language-plaintext highlighter-rouge">/runtime/invocation/next</code>
and
<code class="language-plaintext highlighter-rouge">/runtime/invocation/&lt;AwsRequestId&gt;/response</code>.</p>

<p>For the ease of discussion we’ll call them <code class="language-plaintext highlighter-rouge">/next</code> and <code class="language-plaintext highlighter-rouge">/response</code>.</p>

<p>Lambda operates as a state machine. Functions call the <code class="language-plaintext highlighter-rouge">/next</code> endpoint to receive the next request. When a function completes its request, it sends the result to the <code class="language-plaintext highlighter-rouge">/response</code> endpoint, and then calls <code class="language-plaintext highlighter-rouge">/next</code> again to get the next request and so on.</p>
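<p>To make this concrete, here’s a stripped-down sketch of one turn of that loop in plain JavaScript. This is a hypothetical illustration, not the real runtime interface client: <code class="language-plaintext highlighter-rouge">processOneEvent</code> and the injectable <code class="language-plaintext highlighter-rouge">http</code> parameter are my inventions, and real clients add error reporting and more headers.</p>

```javascript
// Hypothetical sketch of one turn of the Runtime API loop.
// AWS_LAMBDA_RUNTIME_API is set by Lambda; `http` defaults to global fetch
// but is injectable so the sketch can be exercised outside Lambda.
async function processOneEvent(handler, http = fetch) {
  const base = `http://${process.env.AWS_LAMBDA_RUNTIME_API}/2018-06-01`;

  // Blocks until an invocation arrives -- this is where the sandbox
  // may be frozen when there are no pending requests.
  const next = await http(`${base}/runtime/invocation/next`);
  const requestId = next.headers.get('lambda-runtime-aws-request-id');
  const event = await next.json();

  // Await the handler. Any work the handler schedules but doesn't await
  // is still pending when the response is reported below.
  const result = await handler(event);

  // Report the result; a real runtime then loops back to /next.
  await http(`${base}/runtime/invocation/${requestId}/response`, {
    method: 'POST',
    body: JSON.stringify(result),
  });
  return result;
}
```

<p>A real runtime wraps this in an endless loop. Note that the single <code class="language-plaintext highlighter-rouge">await handler(event)</code> line is the only thing keeping your async work alive before the response is reported.</p>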

<p>The call to <code class="language-plaintext highlighter-rouge">/next</code> has three possible return states:</p>
<ol>
  <li>You receive an invocation response containing a request payload.</li>
  <li>You receive the shutdown event, indicating the sandbox will shut down (this only applies to extensions, not your handler, but it is part of the Runtime API).</li>
  <li><strong>Lambda freezes the CPU because there are no pending requests.</strong> When a request arrives, Lambda thaws the CPU and the call to <code class="language-plaintext highlighter-rouge">/next</code> returns.</li>
</ol>

<p>This is easy to see in the state machine image for Extension development. For now, ignore the extension columns:</p>

<p><span class="image half"><a href="/assets/images/silent_crash/freeze.png" target="_blank"><img src="/assets/images/silent_crash/freeze.png" alt="Lambda's runtime lifecycle" /></a></span></p>

<h2 id="lambdas-node-runtime">Lambda’s Node runtime</h2>
<p>The NodeJS runtime isn’t really much of a secret: you can either extract it from the container base images they publish <a href="https://gist.github.com/astuyve/d6052a696658214de98f7ebe91daf0bd">like this</a>, or read the <a href="https://github.com/aws/aws-lambda-nodejs-runtime-interface-client">runtime interface client</a> code, which interacts with the Runtime API.</p>

<p>When you provide a Node.js function, Lambda looks it up based on the handler setting configured for the function, imports it, and passes it events from the Runtime API. From there the runtime effectively acts as a state machine, ferrying requests to your code, awaiting the results, and sending them back to the Runtime API.</p>

<h2 id="putting-it-all-together">Putting it all together</h2>
<p>So here is how the Node runtime executes your function:</p>
<ol>
  <li>It calls <a href="https://github.com/aws/aws-lambda-nodejs-runtime-interface-client/blob/a5ae1c2a92708e81c9df4949c60fd9e1e6e46bed/src/Runtime.js#L60">/next</a> to receive the invocation. At this time, the sandbox could receive a new invocation or be frozen!</li>
  <li>After the call to <code class="language-plaintext highlighter-rouge">/next</code> returns, it <a href="https://github.com/aws/aws-lambda-nodejs-runtime-interface-client/blob/a5ae1c2a92708e81c9df4949c60fd9e1e6e46bed/src/Runtime.js#L74-L84">awaits your handler code</a>,</li>
  <li>Then it returns the result via the <code class="language-plaintext highlighter-rouge">/response</code> endpoint through the <code class="language-plaintext highlighter-rouge">markCompleted</code> <a href="https://github.com/aws/aws-lambda-nodejs-runtime-interface-client/blob/main/src/Runtime.js#L72C60-L72C73">callback</a>, which is called via <a href="https://github.com/aws/aws-lambda-nodejs-runtime-interface-client/blob/main/src/Runtime.js#L82">result.then</a>.</li>
</ol>

<p>Now when we look back at the original code snippet, we see the issue:</p>
<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@</span><span class="nd">Post</span><span class="p">(</span><span class="dl">'</span><span class="s1">/debug-test-email</span><span class="dl">'</span><span class="p">)</span>
<span class="k">async</span> <span class="nx">sendTestEmail</span><span class="p">()</span> <span class="p">{</span>
  <span class="k">this</span><span class="p">.</span><span class="nx">eventEmitter</span><span class="p">.</span><span class="nx">emit</span><span class="p">(</span><span class="nx">events</span><span class="p">.</span><span class="nx">USER_REGISTERED</span><span class="p">,</span> <span class="p">{</span>
    <span class="na">name</span><span class="p">:</span> <span class="nx">Joe</span> <span class="nx">Bloggs</span><span class="p">,</span>
    <span class="na">email</span><span class="p">:</span> <span class="dl">'</span><span class="s1">email@foo.com</span><span class="dl">'</span><span class="p">,</span> <span class="c1">// legitimate email was used for testing</span>
    <span class="na">token</span><span class="p">:</span> <span class="dl">'</span><span class="s1">dummy-token-123</span><span class="dl">'</span><span class="p">,</span>
  <span class="p">});</span>
  <span class="k">return</span> <span class="p">{</span> <span class="na">message</span><span class="p">:</span> <span class="dl">'</span><span class="s1">Manual test triggered</span><span class="dl">'</span> <span class="p">};</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The listener waiting for the <code class="language-plaintext highlighter-rouge">USER_REGISTERED</code> event will never run unless subsequent invocations occur frequently enough that Node’s scheduler runs that task! And given that this result is returned basically instantly, that may never happen!</p>

<h2 id="how-to-actually-do-this">How to actually do this</h2>

<p>Now that we’ve jumped through the Lambda Runtime API and Node Runtime and see why this code wouldn’t work, how <em>could</em> you do something like this in Lambda if you wanted to? There are three pretty good options:</p>
<ol>
  <li>Use Lambda’s NodeJS response streaming to separate the response from the handler’s promise resolution.</li>
  <li>Use a custom runtime</li>
  <li>Use a Lambda extension (internal or external, but internal is easier).</li>
</ol>

<h2 id="response-streaming">Response Streaming</h2>
<p>If your client can receive a chunked response, you can easily return the lightweight response using the <code class="language-plaintext highlighter-rouge">streaming</code> API and then perform the async work and resolve your handler’s promise when the work completes.</p>

<p>AWS even published a great blog about it <a href="https://aws.amazon.com/blogs/compute/running-code-after-returning-a-response-from-an-aws-lambda-function/">here</a>, but here’s the relevant section:</p>
<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">export</span> <span class="kd">const</span> <span class="nx">handler</span> <span class="o">=</span> <span class="nx">awslambda</span><span class="p">.</span><span class="nx">streamifyResponse</span><span class="p">(</span><span class="k">async</span> <span class="p">(</span><span class="nx">event</span><span class="p">,</span> <span class="nx">responseStream</span><span class="p">,</span> <span class="nx">_context</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nx">logger</span><span class="p">.</span><span class="nx">info</span><span class="p">(</span><span class="dl">"</span><span class="s2">[Function] Received event: </span><span class="dl">"</span><span class="p">,</span> <span class="nx">event</span><span class="p">);</span>

    <span class="c1">// Do some stuff with event</span>
    <span class="kd">let</span> <span class="nx">response</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">calc_response</span><span class="p">(</span><span class="nx">event</span><span class="p">);</span>

    <span class="c1">// Return response to client</span>
    <span class="nx">logger</span><span class="p">.</span><span class="nx">info</span><span class="p">(</span><span class="dl">"</span><span class="s2">[Function] Returning response to client</span><span class="dl">"</span><span class="p">);</span>
    <span class="nx">responseStream</span><span class="p">.</span><span class="nx">setContentType</span><span class="p">(</span><span class="dl">'</span><span class="s1">application/json</span><span class="dl">'</span><span class="p">);</span>
    <span class="nx">responseStream</span><span class="p">.</span><span class="nx">write</span><span class="p">(</span><span class="nx">response</span><span class="p">);</span>
    <span class="nx">responseStream</span><span class="p">.</span><span class="nx">end</span><span class="p">();</span>

    <span class="k">await</span> <span class="nx">async_task</span><span class="p">(</span><span class="nx">response</span><span class="p">);</span>   
<span class="p">});</span>
</code></pre></div></div>

<p>This works great, but there’s an even easier way:</p>

<h2 id="use-a-custom-runtime">Use a custom runtime.</h2>
<p>You can fork the <code class="language-plaintext highlighter-rouge">runtime-interface-client</code> and then drive your async tasks to completion after providing the response via <code class="language-plaintext highlighter-rouge">/response</code> but before calling the <code class="language-plaintext highlighter-rouge">/next</code> endpoint. Bref, the extremely popular PHP runtime for Lambda, already supports this out of the box. <a href="https://github.com/brefphp/bref/blob/4272eebda4933b729a9c3af384c2e84488f72d7b/src/Runtime/LambdaRuntime.php#L81-L122">Here</a> we can see that Bref will get the response from next, return the result (via <code class="language-plaintext highlighter-rouge">sendResponse</code>), and then call the <code class="language-plaintext highlighter-rouge">afterInvoke</code> hooks to run any async work you may have queued up:</p>

<div class="language-php highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">public</span> <span class="k">function</span> <span class="n">processNextEvent</span><span class="p">(</span><span class="kt">Handler</span> <span class="o">|</span> <span class="nc">RequestHandlerInterface</span> <span class="o">|</span> <span class="n">callable</span> <span class="nv">$handler</span><span class="p">):</span> <span class="kt">bool</span>
    <span class="p">{</span>
        <span class="p">[</span><span class="nv">$event</span><span class="p">,</span> <span class="nv">$context</span><span class="p">]</span> <span class="o">=</span> <span class="nv">$this</span><span class="o">-&gt;</span><span class="nf">waitNextInvocation</span><span class="p">();</span>

        <span class="c1">// Expose the context in an environment variable</span>
        <span class="nv">$this</span><span class="o">-&gt;</span><span class="nf">setEnv</span><span class="p">(</span><span class="s1">'LAMBDA_INVOCATION_CONTEXT'</span><span class="p">,</span> <span class="nb">json_encode</span><span class="p">(</span><span class="nv">$context</span><span class="p">,</span> <span class="no">JSON_THROW_ON_ERROR</span><span class="p">));</span>

        <span class="k">try</span> <span class="p">{</span>
            <span class="nc">ColdStartTracker</span><span class="o">::</span><span class="nf">invocationStarted</span><span class="p">();</span>

            <span class="nc">Bref</span><span class="o">::</span><span class="nf">triggerHooks</span><span class="p">(</span><span class="s1">'beforeInvoke'</span><span class="p">);</span>
            <span class="nc">Bref</span><span class="o">::</span><span class="nf">events</span><span class="p">()</span><span class="o">-&gt;</span><span class="nf">beforeInvoke</span><span class="p">(</span><span class="nv">$handler</span><span class="p">,</span> <span class="nv">$event</span><span class="p">,</span> <span class="nv">$context</span><span class="p">);</span>

            <span class="nv">$this</span><span class="o">-&gt;</span><span class="nf">ping</span><span class="p">();</span>

            <span class="nv">$result</span> <span class="o">=</span> <span class="nv">$this</span><span class="o">-&gt;</span><span class="n">invoker</span><span class="o">-&gt;</span><span class="nf">invoke</span><span class="p">(</span><span class="nv">$handler</span><span class="p">,</span> <span class="nv">$event</span><span class="p">,</span> <span class="nv">$context</span><span class="p">);</span>

            <span class="nv">$this</span><span class="o">-&gt;</span><span class="nf">sendResponse</span><span class="p">(</span><span class="nv">$context</span><span class="o">-&gt;</span><span class="nf">getAwsRequestId</span><span class="p">(),</span> <span class="nv">$result</span><span class="p">);</span>
        <span class="p">}</span> <span class="k">catch</span> <span class="p">(</span><span class="nc">Throwable</span> <span class="nv">$e</span><span class="p">)</span> <span class="p">{</span>
            <span class="nv">$this</span><span class="o">-&gt;</span><span class="nf">signalFailure</span><span class="p">(</span><span class="nv">$context</span><span class="o">-&gt;</span><span class="nf">getAwsRequestId</span><span class="p">(),</span> <span class="nv">$e</span><span class="p">);</span>

            <span class="k">try</span> <span class="p">{</span>
                <span class="nc">Bref</span><span class="o">::</span><span class="nf">events</span><span class="p">()</span><span class="o">-&gt;</span><span class="nf">afterInvoke</span><span class="p">(</span><span class="nv">$handler</span><span class="p">,</span> <span class="nv">$event</span><span class="p">,</span> <span class="nv">$context</span><span class="p">,</span> <span class="kc">null</span><span class="p">,</span> <span class="nv">$e</span><span class="p">);</span>
            <span class="p">}</span> <span class="k">catch</span> <span class="p">(</span><span class="nc">Throwable</span> <span class="nv">$e</span><span class="p">)</span> <span class="p">{</span>
                <span class="nv">$this</span><span class="o">-&gt;</span><span class="nf">logError</span><span class="p">(</span><span class="nv">$e</span><span class="p">,</span> <span class="nv">$context</span><span class="o">-&gt;</span><span class="nf">getAwsRequestId</span><span class="p">());</span>
            <span class="p">}</span>

            <span class="k">return</span> <span class="kc">false</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="c1">// Any error in the afterInvoke hook happens after the response has been sent,</span>
        <span class="c1">// we can no longer mark the invocation as failed. Instead we log the error.</span>
        <span class="k">try</span> <span class="p">{</span>
            <span class="nc">Bref</span><span class="o">::</span><span class="nf">events</span><span class="p">()</span><span class="o">-&gt;</span><span class="nf">afterInvoke</span><span class="p">(</span><span class="nv">$handler</span><span class="p">,</span> <span class="nv">$event</span><span class="p">,</span> <span class="nv">$context</span><span class="p">,</span> <span class="nv">$result</span><span class="p">);</span>
        <span class="p">}</span> <span class="k">catch</span> <span class="p">(</span><span class="nc">Throwable</span> <span class="nv">$e</span><span class="p">)</span> <span class="p">{</span>
            <span class="nv">$this</span><span class="o">-&gt;</span><span class="nf">logError</span><span class="p">(</span><span class="nv">$e</span><span class="p">,</span> <span class="nv">$context</span><span class="o">-&gt;</span><span class="nf">getAwsRequestId</span><span class="p">());</span>

            <span class="k">return</span> <span class="kc">false</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="k">return</span> <span class="kc">true</span><span class="p">;</span>
    <span class="p">}</span>

</code></pre></div></div>

<p>Vercel also added <code class="language-plaintext highlighter-rouge">waitUntil</code> support on top of Lambda to achieve a similar end.</p>

<p>This technique looks quite simple, but of course the downside is that you’re responsible for maintaining the Node.js distribution you’re packaging. I find that’s pretty low overhead, and something Dependabot can help keep updated.</p>

<p><strong>I’d like to see AWS offer this natively.</strong></p>

<h2 id="use-an-extension">Use an extension</h2>
<p>Lambda Extensions offer a relatively low-lift way to add async processing to your Lambda function. You can use an internal or external extension; AWS recommends an internal extension in their <a href="https://aws.amazon.com/blogs/compute/running-code-after-returning-a-response-from-an-aws-lambda-function/">post</a>, and the rest is pretty straightforward.</p>

<p>Configure the handler, and provide an in-memory queue to pass jobs between the handler and the job runner:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">async_processor</span> <span class="k">as</span> <span class="n">ap</span>
<span class="kn">from</span> <span class="nn">aws_lambda_powertools</span> <span class="kn">import</span> <span class="n">Logger</span>

<span class="n">logger</span> <span class="o">=</span> <span class="n">Logger</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">calc_response</span><span class="p">(</span><span class="n">event</span><span class="p">):</span>
    <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"[Function] Calculating response"</span><span class="p">)</span>
    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># Simulate sync work
</span>    <span class="k">return</span> <span class="p">{</span>
        <span class="s">"message"</span><span class="p">:</span> <span class="s">"hello from extension"</span>
    <span class="p">}</span>

<span class="c1"># This function is performed after the handler code calls submit_async_task 
# and it can continue running after the function returns
</span><span class="k">def</span> <span class="nf">async_task</span><span class="p">(</span><span class="n">response</span><span class="p">):</span>
    <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"[Async task] Starting async task: </span><span class="si">{</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">response</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>  <span class="c1"># Simulate async work
</span>    <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"[Async task] Done"</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">handler</span><span class="p">(</span><span class="n">event</span><span class="p">,</span> <span class="n">context</span><span class="p">):</span>
    <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"[Function] Received event: </span><span class="si">{</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">event</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="c1"># Calculate response
</span>    <span class="n">response</span> <span class="o">=</span> <span class="n">calc_response</span><span class="p">(</span><span class="n">event</span><span class="p">)</span>

    <span class="c1"># Done calculating response
</span>    <span class="c1"># call async processor to continue
</span>    <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"[Function] Invoking async task in extension"</span><span class="p">)</span>
    <span class="n">ap</span><span class="p">.</span><span class="n">start_async_task</span><span class="p">(</span><span class="n">async_task</span><span class="p">,</span> <span class="n">response</span><span class="p">)</span>

    <span class="c1"># Return response to client
</span>    <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"[Function] Returning response to client"</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">{</span>
        <span class="s">"statusCode"</span><span class="p">:</span> <span class="mi">200</span><span class="p">,</span>
        <span class="s">"body"</span><span class="p">:</span> <span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>Then configure the job runner:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">import</span> <span class="nn">threading</span>
<span class="kn">import</span> <span class="nn">queue</span>
<span class="kn">from</span> <span class="nn">aws_lambda_powertools</span> <span class="kn">import</span> <span class="n">Logger</span>

<span class="n">logger</span> <span class="o">=</span> <span class="n">Logger</span><span class="p">()</span>
<span class="n">LAMBDA_EXTENSION_NAME</span> <span class="o">=</span> <span class="s">"AsyncProcessor"</span>

<span class="c1"># An internal queue used by the handler to notify the extension that it can
# start processing the async task.
</span><span class="n">async_tasks_queue</span> <span class="o">=</span> <span class="n">queue</span><span class="p">.</span><span class="n">Queue</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">start_async_processor</span><span class="p">():</span>
    <span class="c1"># Register internal extension
</span>    <span class="n">logger</span><span class="p">.</span><span class="n">debug</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">LAMBDA_EXTENSION_NAME</span><span class="si">}</span><span class="s">] Registering with Lambda service..."</span><span class="p">)</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">post</span><span class="p">(</span>
        <span class="n">url</span><span class="o">=</span><span class="sa">f</span><span class="s">"http://</span><span class="si">{</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'AWS_LAMBDA_RUNTIME_API'</span><span class="p">]</span><span class="si">}</span><span class="s">/2020-01-01/extension/register"</span><span class="p">,</span>
        <span class="n">json</span><span class="o">=</span><span class="p">{</span><span class="s">'events'</span><span class="p">:</span> <span class="p">[</span><span class="s">'INVOKE'</span><span class="p">]},</span>
        <span class="n">headers</span><span class="o">=</span><span class="p">{</span><span class="s">'Lambda-Extension-Name'</span><span class="p">:</span> <span class="n">LAMBDA_EXTENSION_NAME</span><span class="p">}</span>
    <span class="p">)</span>
    <span class="n">ext_id</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">headers</span><span class="p">[</span><span class="s">'Lambda-Extension-Identifier'</span><span class="p">]</span>
    <span class="n">logger</span><span class="p">.</span><span class="n">debug</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">LAMBDA_EXTENSION_NAME</span><span class="si">}</span><span class="s">] Registered with ID: </span><span class="si">{</span><span class="n">ext_id</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">process_tasks</span><span class="p">():</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="c1"># Call /next to get notified when there is a new invocation and let
</span>            <span class="c1"># Lambda know that we are done processing the previous task.
</span>
            <span class="n">logger</span><span class="p">.</span><span class="n">debug</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">LAMBDA_EXTENSION_NAME</span><span class="si">}</span><span class="s">] Waiting for invocation..."</span><span class="p">)</span>
            <span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span>
                <span class="n">url</span><span class="o">=</span><span class="sa">f</span><span class="s">"http://</span><span class="si">{</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'AWS_LAMBDA_RUNTIME_API'</span><span class="p">]</span><span class="si">}</span><span class="s">/2020-01-01/extension/event/next"</span><span class="p">,</span>
                <span class="n">headers</span><span class="o">=</span><span class="p">{</span><span class="s">'Lambda-Extension-Identifier'</span><span class="p">:</span> <span class="n">ext_id</span><span class="p">},</span>
                <span class="n">timeout</span><span class="o">=</span><span class="bp">None</span>
            <span class="p">)</span>

            <span class="c1"># Get next task from internal queue
</span>            <span class="n">logger</span><span class="p">.</span><span class="n">debug</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">LAMBDA_EXTENSION_NAME</span><span class="si">}</span><span class="s">] Woke up, waiting for async task from handler"</span><span class="p">)</span>
            <span class="n">async_task</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <span class="n">async_tasks_queue</span><span class="p">.</span><span class="n">get</span><span class="p">()</span>
            
            <span class="k">if</span> <span class="n">async_task</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
                <span class="c1"># No task to run this invocation
</span>                <span class="n">logger</span><span class="p">.</span><span class="n">debug</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">LAMBDA_EXTENSION_NAME</span><span class="si">}</span><span class="s">] Received null task. Ignoring."</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="c1"># Invoke task
</span>                <span class="n">logger</span><span class="p">.</span><span class="n">debug</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">LAMBDA_EXTENSION_NAME</span><span class="si">}</span><span class="s">] Received async task from handler. Starting task."</span><span class="p">)</span>
                <span class="n">async_task</span><span class="p">(</span><span class="n">args</span><span class="p">)</span>
            
            <span class="n">logger</span><span class="p">.</span><span class="n">debug</span><span class="p">(</span><span class="sa">f</span><span class="s">"[</span><span class="si">{</span><span class="n">LAMBDA_EXTENSION_NAME</span><span class="si">}</span><span class="s">] Finished processing task"</span><span class="p">)</span>

    <span class="c1"># Start processing extension events in a separate thread
</span>    <span class="n">threading</span><span class="p">.</span><span class="n">Thread</span><span class="p">(</span><span class="n">target</span><span class="o">=</span><span class="n">process_tasks</span><span class="p">,</span> <span class="n">daemon</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'AsyncProcessor'</span><span class="p">).</span><span class="n">start</span><span class="p">()</span>

<span class="c1"># Used by the function to indicate that there is work that needs to be 
# performed by the async task processor
</span><span class="k">def</span> <span class="nf">start_async_task</span><span class="p">(</span><span class="n">async_task</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="n">async_tasks_queue</span><span class="p">.</span><span class="n">put</span><span class="p">((</span><span class="n">async_task</span><span class="p">,</span> <span class="n">args</span><span class="p">))</span>

<span class="c1"># Starts the async task processor
</span><span class="n">start_async_processor</span><span class="p">()</span>
</code></pre></div></div>

<p>One downside to this solution, which is <strong>not</strong> handled in this example code, is the <code class="language-plaintext highlighter-rouge">shutdown</code> event response from <code class="language-plaintext highlighter-rouge">/next</code>. In that case you’ll want to work the queue to exhaustion and then exit the process, but presumably this is left as an exercise for you, dear reader.</p>
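<p>For reference, the drain step might look something like the sketch below. This is a hedged sketch, not the AWS sample’s API: it assumes you also registered for the <code class="language-plaintext highlighter-rouge">SHUTDOWN</code> event (<code class="language-plaintext highlighter-rouge">'events': ['INVOKE', 'SHUTDOWN']</code>) and branch on the <code class="language-plaintext highlighter-rouge">eventType</code> field of the <code class="language-plaintext highlighter-rouge">/next</code> response:</p>

```python
import queue

# Hypothetical drain step, run when /next returns an event with
# eventType == "SHUTDOWN": work the queue to exhaustion, then exit.
def drain_queue_and_exit(async_tasks_queue):
    while True:
        try:
            async_task, args = async_tasks_queue.get_nowait()
        except queue.Empty:
            break  # queue exhausted
        if async_task is not None:
            async_task(args)
    # Extensions only get a short grace period after SHUTDOWN, so exit promptly.
    raise SystemExit(0)
```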

<p>If you run this type of logic across multiple language runtimes, it may be worthwhile to write an External Lambda Extension which is runtime-agnostic. You might consider Rust, which has pretty incredible performance characteristics in Lambda, as I learned when rewriting <a href="https://www.datadoghq.com/blog/engineering/datadog-lambda-extension-rust/">Datadog’s Next-Generation Lambda Extension</a>.</p>

<h2 id="should-aws-add-support-for-this">Should AWS add support for this?</h2>
<p>Running async code in Lambda is such a common request that I’d like to see AWS support it natively, as the value prop of the entire product is anchored in AWS managing the runtime for you.</p>

<p>That said, I don’t think I’d recommend this solution generally. Instead, for the author’s stated use case, I’d prefer a direct API Gateway -&gt; SQS integration, which enqueues the message and lets me write a Lambda function that processes messages in batches, handles retries and downstream provider backpressure, and generally builds a more robust system.</p>

<p>Presumably that’s why AWS hasn’t done this yet.</p>
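<p>To make the SQS alternative concrete, here’s a hedged sketch of the consuming side. The <code class="language-plaintext highlighter-rouge">send_to_provider</code> function is a hypothetical stand-in for your real downstream call, and partial batch responses require enabling <code class="language-plaintext highlighter-rouge">ReportBatchItemFailures</code> on the event source mapping:</p>

```python
import json

# Sketch: consume SQS messages in batches and report only the failed
# message IDs back to Lambda, so SQS redelivers just those messages.
def send_to_provider(body):
    # Hypothetical downstream call; replace with the real provider client.
    payload = json.loads(body)
    if payload.get("fail"):
        raise RuntimeError("provider rejected message")

def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            send_to_provider(record["body"])
        except Exception:
            # Listing the messageId marks only this message for retry
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```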

<h2 id="what-the-author-got-wrong">What the author got wrong</h2>
<p>Beyond a simple misunderstanding of how Lambda works, the author also expected Lambda to work <strong>exactly</strong> like EC2. But it doesn’t, and it shouldn’t. The opinionated nature of Lambda exists specifically to NOT be EC2. Shipping a whole web framework to Lambda does work and can be useful, but the expectations of the runtime are simply not the same as on EC2.</p>

<p>For the author to have that, they’ll need to write their own runtime, or look somewhere else.</p>

<p>If you like this type of content please subscribe to my <a href="https://www.youtube.com/channel/UCsWwWCit5Y_dqRxEFizYulw">YouTube</a> channel and follow me on <a href="https://twitter.com/astuyve">twitter</a> to send me any questions or comments. You can also ask me questions directly if I’m <a href="https://twitch.tv/aj_stuyvenberg">streaming on Twitch</a>.</p>]]></content><author><name>AJ Stuyvenberg</name></author><category term="posts" /><summary type="html"><![CDATA[Understanding what's happening in the "AWS Lambda Silent Crash" blog post, what went wrong, and how to fix it]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://aaronstuyvenberg.com/assets/images/silent_crash/silent_crash_header.png" /><media:content medium="image" url="https://aaronstuyvenberg.com/assets/images/silent_crash/silent_crash_header.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Avoiding the Lambda Doom Loop</title><link href="https://aaronstuyvenberg.com/posts/lambda-timeout-doom-loop" rel="alternate" type="text/html" title="Avoiding the Lambda Doom Loop" /><published>2024-10-03T00:00:00+00:00</published><updated>2024-10-03T00:00:00+00:00</updated><id>https://aaronstuyvenberg.com/posts/lambda-timeout-doom-loop</id><content type="html" xml:base="https://aaronstuyvenberg.com/posts/lambda-timeout-doom-loop"><![CDATA[<p>There have been a number of recent changes in the Lambda sandbox environment, mostly transparent ones like changing the <a href="https://x.com/astuyve/status/1825676633673769334">Runtime API IP address and port</a> to a link-local IP. But recently I noticed a change in how Lambda handles function crashes and re-initialization, and after confirming this behavior with the Lambda team I wanted to take some time to help explain how it works now and why.</p>

<p>In a <a href="https://aaronstuyvenberg.com/posts/ice-cold-starts">previous post</a> I demonstrated that not all cold starts are identical. Specifically, a runtime crash, function timeout, or out-of-memory error causes the Lambda function to re-initialize, producing a <code class="language-plaintext highlighter-rouge">mini cold start</code>, which AWS calls a <a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html#runtimes-lifecycle-invoke-with-errors">suppressed init</a>. It’s this case that we’re going to focus on today. As of October 4th, 2024 this is now <a href="https://docs.aws.amazon.com/lambda/latest/dg/troubleshooting-invocation.html#troubleshooting-timeouts">documented on AWS as well</a>.</p>

<p>If your Lambda functions have an especially short <code class="language-plaintext highlighter-rouge">timeout</code> configuration, you’ll want to pay close attention.</p>

<h2 id="background">Background</h2>
<p>AWS Lambda Functions permit <a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html">up to 10 seconds</a> for the function code to initialize. Previously we’ve exploited this fact to uncover how AWS pre-warms your function in my post about <a href="https://aaronstuyvenberg.com/posts/understanding-proactive-initialization">Proactive Initialization</a>, but it’s important to note that historically, this ten-second init duration is evaluated <em>separately</em> from the configured function timeout.</p>

<p>Today? Apart from the <em>first</em> initialization of a sandbox, <em>re-initialization</em> time for suppressed initializations is counted against the overall function timeout. This may seem like a minor detail, but it can cause a serious outage for your function.</p>

<p>Before your eyes glaze over, let me explain.</p>

<h2 id="example">Example</h2>
<p>Let’s consider a Lambda function serving an API with a 3 second timeout configured. Imagine that the function also requires a database connection along with some credential fetching, so the cold start time is approximately 3 seconds. Today your Lambda function will still initialize successfully after those 3 seconds and go on to serve many other serial Lambda invocations with no issues.</p>

<p><span class="image half"><a href="/assets/images/doom_loop/doom_loop_init.png" target="_blank"><img src="/assets/images/doom_loop/doom_loop_init.png" alt="Part one - a normal initialization" /></a></span></p>

<p>But now imagine that function crashes on the next invocation. Maybe it times out, or runs out of memory.
<span class="image half"><a href="/assets/images/doom_loop/doom_loop_crash.png" target="_blank"><img src="/assets/images/doom_loop/doom_loop_crash.png" alt="Part two - the function crashes" /></a></span></p>

<p>When Lambda re-initializes your function under a suppressed init, it won’t complete re-initialization before the timeout arrives, and it’s now <strong>permanently</strong> stuck in a retry loop. <strong>Function invocations will fail until Lambda decides to kill the sandbox and start a new one.</strong></p>

<p><span class="image half"><a href="/assets/images/doom_loop/doom_loop_suppressed.png" target="_blank"><img src="/assets/images/doom_loop/doom_loop_suppressed.png" alt="Part three - the function crashes permanently" /></a></span></p>

<h2 id="reproducing-the-issue">Reproducing the issue</h2>
<p>This one is super easy to reproduce. You can pull down this <a href="https://github.com/astuyve/lambda-new-timeout-crash">repo</a>, but the logic is simple:</p>
<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="kd">function</span> <span class="nx">delay</span><span class="p">(</span><span class="nx">millis</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">return</span> <span class="k">new</span> <span class="nb">Promise</span><span class="p">((</span><span class="nx">resolve</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nx">setTimeout</span><span class="p">(</span><span class="nx">resolve</span><span class="p">,</span> <span class="nx">millis</span><span class="p">);</span>
  <span class="p">});</span>
<span class="p">}</span>
<span class="c1">// Simulate a longer init duration</span>
<span class="k">await</span> <span class="nx">delay</span><span class="p">(</span><span class="mi">3000</span><span class="p">);</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">init done</span><span class="dl">'</span><span class="p">);</span>
<span class="k">export</span> <span class="k">async</span> <span class="kd">function</span> <span class="nx">hello</span><span class="p">(</span><span class="nx">event</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">if</span> <span class="p">(</span><span class="nx">event</span><span class="p">.</span><span class="nx">queryStringParameters</span> <span class="o">&amp;&amp;</span> <span class="nx">event</span><span class="p">.</span><span class="nx">queryStringParameters</span><span class="p">.</span><span class="nx">crash</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// simulate timeout</span>
    <span class="c1">// After this the function will no longer run, permanently</span>
    <span class="k">await</span> <span class="nx">delay</span><span class="p">(</span><span class="mi">5000</span><span class="p">);</span>
  <span class="p">}</span>

  <span class="k">return</span> <span class="p">{</span>
    <span class="na">statusCode</span><span class="p">:</span> <span class="mi">200</span><span class="p">,</span>
    <span class="na">body</span><span class="p">:</span> <span class="nx">JSON</span><span class="p">.</span><span class="nx">stringify</span><span class="p">({</span><span class="na">message</span><span class="p">:</span> <span class="dl">'</span><span class="s1">Hello from Lambda!</span><span class="dl">'</span><span class="p">})</span>
  <span class="p">};</span>
<span class="p">}</span>
</code></pre></div></div>

<ol>
  <li>Curl the endpoint to call the function normally. It’ll require 3 seconds to initialize as per the REPORT log:
<code class="language-plaintext highlighter-rouge">REPORT RequestId: bdace18c-8f63-48f0-b44a-c909b6b134a0	Duration: 2.85 ms	Billed Duration: 3 ms	Memory Size: 1024 MB	Max Memory Used: 64 MB	Init Duration: 3152.18 ms</code></li>
  <li>Force a suppressed init by passing <code class="language-plaintext highlighter-rouge">&lt;url&gt;?crash=true</code>. This causes the function to timeout.</li>
  <li>Now call it again, with the <code class="language-plaintext highlighter-rouge">crash</code> parameter removed.
The function will continue to crash as it cannot re-initialize. It’s dead until a new sandbox comes along, or you re-deploy the function.</li>
</ol>

<p>If you open the logs you’ll now see the <code class="language-plaintext highlighter-rouge">Status: timeout</code> field, which is new:
<code class="language-plaintext highlighter-rouge">REPORT RequestId: 13222b1e-f16b-4550-89df-869ab0a9806d	Duration: 3000.00 ms	Billed Duration: 3000 ms	Memory Size: 1024 MB	Max Memory Used: 64 MB	Status: timeout</code></p>

<h2 id="how-to-avoid-the-doom-loop">How to avoid the doom loop</h2>
<p>Ultimately avoiding this is simple and there are several options.</p>

<ol>
  <li>Increase the timeout value so it covers the longest possible function execution <em>plus</em> your expected Init Duration time.</li>
  <li>If your function initialization is mostly caused by interpreting code, you can increase the configured memory size up to 1769MB, where you’ll receive one full vCPU.</li>
  <li>Optimize your function initialization! I gave a long talk about this at <a href="https://www.youtube.com/watch?v=2EDNcPvR45w">re:Invent 2023</a>, check it out for specific tips and be sure to consider <a href="https://aaronstuyvenberg.com/posts/lambda-lazy-loading">lazy-loading</a>!</li>
  <li>Finally, modify your function code so that a timeout won’t cause the environment to error (and thus re-initialize). You can do this by racing the deadline provided by the <code class="language-plaintext highlighter-rouge">getRemainingTimeInMillis()</code> method on the <a href="https://docs.aws.amazon.com/lambda/latest/dg/nodejs-context.html">context object</a>.</li>
</ol>
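<p>The fourth option can be sketched in a few lines. This is a hedged illustration in Python (where the context method is <code class="language-plaintext highlighter-rouge">get_remaining_time_in_millis()</code>); the <code class="language-plaintext highlighter-rouge">run_with_deadline</code> wrapper, the 500ms safety buffer, and the thread pool are my assumptions, not an official pattern. In Node.js you’d race promises instead:</p>

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=1)

# Sketch: bound the handler's work by the invocation's remaining time
# (minus a safety buffer) so a slow request returns an error response
# instead of timing out and forcing a suppressed re-init.
def run_with_deadline(work, event, context, buffer_ms=500):
    budget_s = max(context.get_remaining_time_in_millis() - buffer_ms, 0) / 1000.0
    future = _executor.submit(work, event)
    try:
        return {"statusCode": 200, "body": future.result(timeout=budget_s)}
    except FutureTimeout:
        # Note: the worker thread keeps running; this only bounds the response.
        return {"statusCode": 503, "body": "deadline exceeded"}
```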

<p>These tips are in the <a href="https://docs.aws.amazon.com/lambda/latest/dg/troubleshooting-invocation.html#troubleshooting-timeouts">help docs</a> as well.
Although it’s unfortunate this couldn’t be factored in for us when creating Lambda functions, it seems this change is deeply tied to other intractable changes underpinning Lambda, so it’s one we’ll need to live with.</p>

<p>It’s important to note that this suppressed-initialization behavior <em>did</em> already exist in some cases, beginning around 2021 for functions configured with a Lambda Extension or SnapStart. Now it’s the default behavior for all functions.</p>

<h2 id="key-takeaways">Key takeaways</h2>
<p>If you’ve <a href="https://twitter.com/astuyve">followed me</a> for any period of time I hope I’ve given you the tools necessary to minimize the impact of cold starts, but the fact remains that some initialization time is necessary.</p>

<p>This is especially true for customers loading heavy AI or ML libraries, negotiating TCP connections to databases and older caches which don’t offer HTTP APIs like <a href="https://www.gomomento.com/platform/cache/">Momento</a> (not sponsored, it’s just good tech). With the recent proliferation of LLMs, I’ve noticed developers choosing to bring heavier libraries to Lambda, so I expect cold start times to be generally longer these days.</p>

<p>If you like this type of content please subscribe to my <a href="https://www.youtube.com/channel/UCsWwWCit5Y_dqRxEFizYulw">YouTube</a> channel and follow me on <a href="https://twitter.com/astuyve">twitter</a> to send me any questions or comments. You can also ask me questions directly if I’m <a href="https://twitch.tv/aj_stuyvenberg">streaming on Twitch</a>.</p>]]></content><author><name>AJ Stuyvenberg</name></author><category term="posts" /><summary type="html"><![CDATA[Heads up serverless developers! A recent change in the Lambda sandbox environment changes how timeouts are handled, potentially causing your function to enter a permanent doom loop. This post will explain the change, how to spot it, and how to avoid the doom loop.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://aaronstuyvenberg.com/assets/images/doom_loop/doom_loop_logo.png" /><media:content medium="image" url="https://aaronstuyvenberg.com/assets/images/doom_loop/doom_loop_logo.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">BASE Jumps &amp;amp; Backups - how I use Synology and AWS to store my data</title><link href="https://aaronstuyvenberg.com/posts/base-jump-backup" rel="alternate" type="text/html" title="BASE Jumps &amp;amp; Backups - how I use Synology and AWS to store my data" /><published>2024-07-08T00:00:00+00:00</published><updated>2024-07-08T00:00:00+00:00</updated><id>https://aaronstuyvenberg.com/posts/base-jump-backup</id><content type="html" xml:base="https://aaronstuyvenberg.com/posts/base-jump-backup"><![CDATA[<p>If you mostly know me because of this blog or my <a href="https://www.youtube.com/watch?v=2EDNcPvR45w">cloud talks</a>, it may surprise you to learn that I’m also an avid parachutist. I’ve been skydiving since 2010 and BASE jumping since 2012, and have more than 1200 combined jumps all over the world. It’s a neat hobby! Contrary to popular belief, it’s not as dangerous as you might think.</p>

<p><span class="image half"><a href="/assets/images/backups/gopro_1.jpg" target="_blank"><img src="/assets/images/backups/gopro_1.jpg" alt="Me with an early GoPro" /></a></span></p>

<p>Starting in 2010 also means I’m a child of the GoPro era. These were the early days of YouTube. Like so many others, I was inspired by videos of people soaring down <a href="https://www.youtube.com/watch?v=GASFa7rkLtM">cliffs</a>. So against the guidance of literally everyone, I strapped a GoPro to my head and zipped up a wingsuit <a href="https://www.youtube.com/watch?v=2MMXDcrpxQE">as soon as I possibly could</a>. Thankfully I managed to develop into a reasonably competent BASE jumper and enjoyed about 10 years of frequent BASE trips, new experiences, and of course several thousand video files.</p>

<p>The fear of losing these files always burned in the back of my mind. I backed everything up to an external HDD, but had no other copies of the data. In case it’s not clear <em>this is a bad thing</em>. Typically, you’d want to have a <a href="https://www.backblaze.com/blog/the-3-2-1-backup-strategy/">3-2-1</a> backup pattern with an original data set, an on-site backup, and an off-site backup. Since this data isn’t “production” data, I mostly need the original and an off-site backup.</p>

<h2 id="the-video-files-pile-up">The video files pile up</h2>
<p>At the same time, I’ve also been spending more time streaming on <a href="https://www.twitch.tv/aj_stuyvenberg">twitch</a> and <a href="https://www.youtube.com/channel/UCsWwWCit5Y_dqRxEFizYulw">youtube</a>. It’s been fun to poke around serverless platforms, ship toy applications on the weekend, and learn new languages with a small audience. Recently I’d written a few simple benchmarking scripts collecting cold start metrics from AWS Lambda as well as Vercel. I wanted to host these scripts on my local network to simulate what a “real” user may experience, so I knew I’d need a solution which primarily acts as a network attached storage device, but also has a bit of compute available to run my projects. Nothing too crazy, but a unix-like environment would be ideal.</p>

<p>Finally in May, I asked <a href="https://x.com/astuyve/status/1788591437421892010">twitter</a> about their recommendations and received a lot of comments. Virtually everyone recommended <a href="https://x.com/raesene/status/1788617687922356479">Synology NAS systems</a>, or had an insane homelab, like my colleague <a href="https://x.com/Frichette_n/status/1788618306049483149">Nick Frichette</a>.</p>

<p><span class="image half"><a href="/assets/images/backups/nick_homelab.png" target="_blank"><img src="/assets/images/backups/nick_homelab.png" alt="Nick's insane homelab" /></a></span></p>

<h2 id="synology-diskstation">Synology DiskStation</h2>
<p>I was introduced to the kind folks at Synology who offered to ship me their <a href="https://www.synology.com/en-us/products/DS923+">DS923+</a>, a couple drives, and the 10GbE upgraded NIC!</p>

<p><span class="image half"><a href="https://x.com/astuyve/status/1799456793791468011" target="_blank"><img src="/assets/images/backups/synology_1.jpg" alt="Synology Gear" /></a></span></p>

<p>After everything arrived, I fired up my live stream and got to work. You can view the whole setup process from start to finish <a href="https://www.youtube.com/watch?v=uFwxZYyLT7g">here</a>, but I’ll run you through my major choices.</p>

<p>Synology provided 2x 4TB HDDs, which I opted to store in a fully-redundant setup. This left me around 3.6TB of storage after opting for the <a href="https://kb.synology.com/en-br/DSM/tutorial/What_is_Synology_Hybrid_RAID_SHR">Hybrid RAID setup</a>. I chose hybrid raid because I plan to expand the storage further with additional drives, and like the flexibility to mix and match drive size within the same pool.</p>

<p>Setting up the drive pool was a breeze, and after I plugged in the correct network cable, I had things up and running quite easily. I copied my entire external hard drive of archived BASE jumping footage over USB 3, but opted to mount the NAS as an SMB share to copy archives of my live streams over the 10GbE line. This seemed to run as fast as the disks could write!</p>

<p><span class="image fit"><a href="/assets/images/backups/synology_smb.png" target="_blank"><img src="/assets/images/backups/synology_smb.png" alt="Synology SMB setup" /></a></span></p>

<h2 id="backing-up-to-the-cloud">Backing up to the cloud</h2>
<p>Within a few hours, I had the entire system unboxed, running and had made 2 full copies of my treasured BASE jumping memories! RAID is great, but it still leaves me with a single point of failure. To prevent this, I knew I’d need to back up this data somewhere else entirely. For this, I chose AWS.</p>

<p>AWS has a dizzying number of storage options, but after some careful thought I realized my choice boiled down to S3 (and its Infrequent Access tier) and Glacier. Both are blob-storage systems, but the main difference is that S3 is geared toward on-demand file access, whereas Glacier is meant for archival data which may only be retrieved after creating a retrieval request and waiting a few hours for it to be ready. Both services have multiple storage tiers, but at their slowest/coldest options, Glacier Deep Archive is $0.00099/GB, while S3 Infrequent Access is $0.0125/GB.</p>

<p>Because I already have local copies of my data, if I wanted to watch some videos or edit a new one, I wouldn’t need to use my cloud backup. This meant that Glacier was the right choice for my use case.</p>
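<p>To put those rates in perspective, here’s the back-of-the-envelope math for roughly 4TB of footage (my approximate archive size), using the per-GB prices above:</p>

```python
footage_gb = 4000  # roughly 4TB of BASE jumping footage

# Monthly storage cost at each tier's published per-GB rate
deep_archive_monthly = footage_gb * 0.00099  # Glacier Deep Archive: $0.00099/GB
s3_ia_monthly = footage_gb * 0.0125          # S3 Infrequent Access: $0.0125/GB

print(f"Deep Archive: ${deep_archive_monthly:.2f}/mo")  # → Deep Archive: $3.96/mo
print(f"S3-IA: ${s3_ia_monthly:.2f}/mo")                # → S3-IA: $50.00/mo
```

About a 12x difference per month, which is why the slow retrieval tradeoff was easy for me to accept.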

<p>Luckily, Synology provides an out-of-the-box package for Glacier support. Setting it up was pretty easy; my one complaint is that the Glacier package on Synology could be a bit more user-friendly in terms of setting up the IAM policy. To start, I ended up granting pretty broad Glacier access via IAM. I’m not too worried though. I only leaked the key 5-6 times live on stream! (and rotated it, of course).</p>

<p><span class="image fit"><a href="/assets/images/backups/glacier_backup.png" target="_blank"><img src="/assets/images/backups/glacier_backup.png" alt="Screenshot of the Glacier package successfully creating an archive from my DSM" /></a></span></p>

<p>After the backup finished, I consulted CloudTrail to get the specific permissions required. You’ll notice that two archives are created, with one specifically called a <code class="language-plaintext highlighter-rouge">mapping</code> archive. I suspect this holds metadata about the backup itself.</p>

<p>At any rate, you can skip this step because I’ve done it for you. Here is the full IAM policy for the Synology Glacier backup package:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
    </span><span class="nl">"Version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2012-10-17"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"Statement"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
            </span><span class="nl">"Sid"</span><span class="p">:</span><span class="w"> </span><span class="s2">"VisualEditor0"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"Effect"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Allow"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"Action"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
                </span><span class="s2">"glacier:GetJobOutput"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"glacier:InitiateJob"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"glacier:UploadArchive"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"glacier:ListVaults"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"glacier:DeleteArchive"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"glacier:UploadMultipartPart"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"glacier:CompleteMultipartUpload"</span><span class="p">,</span><span class="w">
                </span><span class="s2">"glacier:InitiateMultipartUpload"</span><span class="w">
            </span><span class="p">],</span><span class="w">
            </span><span class="nl">"Resource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"*"</span><span class="w">
        </span><span class="p">}</span><span class="w">
    </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>You can further limit the two resources to <code class="language-plaintext highlighter-rouge">arn:aws:glacier:us-west-2:123456789012:vaults/your-vault-name</code> and <code class="language-plaintext highlighter-rouge">arn:aws:glacier:us-west-2:123456789012:vaults/your-vault-name_mapping</code> if you want to be more specific, but I don’t believe the Synology package lets you choose the vault name up front, so you’ll need to use a wildcard to start.</p>

<p><span class="image fit"><a href="/assets/images/backups/glacier_mappings.png" target="_blank"><img src="/assets/images/backups/glacier_mappings.png" alt="Glacier archive and archive mapping" /></a></span></p>

<p>After backing up everything, the costs rolled in. It cost me around $9 to initially back up the data, and will be about $4/month to store it.</p>

<p><span class="image fit"><a href="/assets/images/backups/glacier_storage.png" target="_blank"><img src="/assets/images/backups/glacier_storage.png" alt="Glacier charges" /></a></span></p>

<p>I want to take a minute to cover Erasure Coding and why it helps make the web work so well. Building reliable systems means building fault-tolerant systems. For data systems, this means ensuring that the inevitable failing hard drive won’t lead to data loss. But keeping multiple complete copies of data around is both inefficient and risky: a drive could be stolen or lost in a move, leaking your data, and maintaining those complete copies is expensive.</p>

<h2 id="how-erasure-coding-works">How Erasure Coding works</h2>
<p>Enter <a href="https://en.wikipedia.org/wiki/Erasure_code">Erasure Coding</a>. Erasure coding allows us to divide a piece of data like my video files into <code class="language-plaintext highlighter-rouge">K</code> slices (or shards in distributed systems parlance). Then, instead of duplicating every shard (thus increasing the backup size by 2x or 3x), an encoding function expands those <code class="language-plaintext highlighter-rouge">K</code> shards into <code class="language-plaintext highlighter-rouge">N</code> total shards, each <code class="language-plaintext highlighter-rouge">1/K</code> the size of the original data. Now the original file can be reconstructed from any <code class="language-plaintext highlighter-rouge">K</code> of the <code class="language-plaintext highlighter-rouge">N</code> shards, so we can lose up to <code class="language-plaintext highlighter-rouge">N-K</code> shards without losing data!</p>

<p>For a <code class="language-plaintext highlighter-rouge">[3, 2]</code> code, this means we can fetch any 2 of the 3 slices to fully retrieve our data. This helps improve the tail latency of distributed systems: we can make requests to all 3 nodes, but only need the 2 fastest to succeed to get the data back.</p>
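<p>Here’s a toy sketch of a <code class="language-plaintext highlighter-rouge">[3, 2]</code> code using simple XOR parity: two data shards plus one parity shard, where any two of the three recover the original. (Real systems like S3 use far more sophisticated codes, e.g. Reed-Solomon; this just illustrates the idea.)</p>

```python
def encode(data):
    """Split data into 2 data shards plus 1 XOR parity shard ([3, 2] code)."""
    if len(data) % 2:
        data += b"\x00"  # pad to an even length for simplicity
    half = len(data) // 2
    a, b = data[:half], data[half:]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]

def decode(shards):
    """Reconstruct the original data from any 2 of the 3 shards."""
    a, b, parity = shards
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, parity))  # rebuild a from b + parity
    elif b is None:
        b = bytes(x ^ y for x, y in zip(a, parity))  # rebuild b from a + parity
    return a + b

shards = encode(b"gopro_footage!")  # 14 bytes, no padding needed
shards[0] = None                    # lose an entire shard
print(decode(shards))               # → b'gopro_footage!'
```

Each shard is half the size of the original, so the total overhead is 1.5x instead of the 2x a full mirror costs, while still surviving the loss of any single shard.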

<p>This example is dramatically simplified; to learn more, I’d suggest this excellent post on <a href="https://towardsdatascience.com/erasure-coding-for-the-masses-2c23c74bf87e">Towards Data Science</a>.</p>

<p>If you want to learn more about S3 itself - I highly recommend Andy Warfield’s talk from FAST’23: <a href="https://www.youtube.com/watch?v=sc3J4McebHE">Building and Operating a Pretty Big Storage System</a>.</p>

<p>Erasure coding is a powerful concept because our backup system can withstand losing an entire storage node and still maintain a full copy of the data. It pairs very nicely with the fact that distributed systems increase reliability exponentially while costs increase linearly. <a href="https://brooker.co.za/blog/2023/09/08/exponential.html">It’s true!</a> This is how AWS can run S3 with <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html">11 9’s</a> of durability!</p>

<h2 id="key-takeaways">Key takeaways</h2>
<p>My goal when I chose a NAS was to have a simple and reliable network storage system which could also moonlight as a small homelab, and Synology delivers all that and more. The available packages are solid, and the community-supported offerings are extensive. It’s become a critical part of my workflow both as a live-streaming software developer, and as a BASE jumper with loads of footage to store.</p>

<p>What most surprised me was how useful and intuitive the web-based operating system is. I thought I’d need to configure a remote desktop or VPN, but instead it’s so simple to use any browser to manage the NAS or even drop files onto it. Theo was right, it’s <a href="https://x.com/Synology/status/1806811442454389244">annoyingly good</a>.</p>

<p>I generally sleep well, but I sleep even better knowing all that local storage power is combined with cloud-based archival storage, giving my adventure videos many, many 9’s of durability.</p>

<p>If you like this type of content, please subscribe to my <a href="https://aaronstuyvenberg.com">blog</a> or follow me on <a href="https://twitter.com/astuyve">twitter</a> and send me any questions or comments. You can also ask me questions directly if I’m <a href="https://twitch.tv/aj_stuyvenberg">streaming on Twitch</a> or <a href="https://www.youtube.com/channel/UCsWwWCit5Y_dqRxEFizYulw">YouTube</a>.</p>]]></content><author><name>AJ Stuyvenberg</name></author><category term="posts" /><summary type="html"><![CDATA[Erasure coding and multi-tier backups can help you store your data safely and cheaply. Here's how I use a Synology DiskStation and AWS Glacier to store my BASE jumping videos, and my opinions on both after a bit of use.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://aaronstuyvenberg.com/assets/images/backups/backups_post.png" /><media:content medium="image" url="https://aaronstuyvenberg.com/assets/images/backups/backups_post.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Ultimate guide to secrets in Lambda</title><link href="https://aaronstuyvenberg.com/posts/ultimate-lambda-secrets-guide" rel="alternate" type="text/html" title="Ultimate guide to secrets in Lambda" /><published>2024-03-27T00:00:00+00:00</published><updated>2024-03-27T00:00:00+00:00</updated><id>https://aaronstuyvenberg.com/posts/ultimate-lambda-secrets-guide</id><content type="html" xml:base="https://aaronstuyvenberg.com/posts/ultimate-lambda-secrets-guide"><![CDATA[<p>We all have secrets. Some are small secrets which we barely hide (sometimes I roll through stop signs on my bike). Others are so sensitive that we don’t even want to think about them <span class="spoiler">(<em>serverless actually has servers</em>).</span></p>

<p>The secrets in your applications span similar dimensions of sensitivity! As a result, handling a random 3rd party API key is different from handling the root signing key for an operating system or nuclear launch codes.</p>

<p>This work is a fundamental requirement for any production-quality software system. Unfortunately, AWS doesn’t make it easy to select a secrets management tool within their ecosystem. For Serverless developers, this is even more difficult! Lambda is simply one service in a constellation of multiple supporting services which you can use to control application secrets. This guide lays out the most common ways to store and manage secrets for Lambda, the performance impacts of each option, and a framework for considering your specific use cases.</p>

<h2 id="quick-best-practices-primer">Quick best practices primer</h2>
<p>Plaintext secrets should <strong>NEVER</strong> be hardcoded in your application code or source control. Typically you want to follow the <code class="language-plaintext highlighter-rouge">principle of least privilege</code> and limit the access of any runtime secret to only the runtime environment (Lambda, in this case).</p>

<p>This means passing <em>references</em> or <em>encrypted</em> data to configuration files or infrastructure as code tools whenever possible. It also means that decrypting or fetching secrets from a secure storage system at runtime will be the most secure option. This post is geared to deploying your Lambda applications along this dimension.</p>

<h2 id="lambda-secret-options">Lambda Secret Options</h2>

<p>Within Lambda, there are four major options for storing configuration parameters and secrets. They are:</p>
<ol>
  <li>Lambda Environment Variables</li>
  <li>AWS Systems Manager Parameter Store (Formerly known as Simple Systems Manager, or SSM)</li>
  <li>AWS Secrets Manager</li>
  <li>AWS Key Management Service</li>
</ol>

<p>This post will rate each option along the following dimensions:</p>
<ol>
  <li>Ease of use</li>
  <li>Cost</li>
  <li>Auditability</li>
  <li>Rotation Complexity</li>
  <li>Capability</li>
</ol>

<p>We’ll also cover the <a href="https://aws.amazon.com/blogs/compute/using-the-aws-parameter-and-secrets-lambda-extension-to-cache-parameters-and-secrets/">AWS Lambda Parameter and Secret extension</a>, which is used to retrieve secrets from both Parameter Store and Secrets Manager from within a Lambda function.</p>

<p>Then, we’ll consider several example secrets with various blast radii, and decide which service best suits our needs.</p>

<h2 id="service-breakdown-tldr">Service breakdown Tl;dr</h2>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Ease of Use</th>
      <th>Cost</th>
      <th>Auditability</th>
      <th>Rotation Complexity</th>
      <th>Capability</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="#lambda-environment-variables">Environment Variables</a></td>
      <td>Easiest</td>
      <td><strong>Free!</strong></td>
      <td>Poor</td>
      <td>Requires UpdateFunctionConfiguration or deployment</td>
      <td>Encrypted at rest.<br />Decrypted when GetFunctionConfiguration is called.<br />Limited to 4KB total</td>
    </tr>
    <tr>
      <td><a href="#aws-systems-manager-parameter-store">Parameter Store Standard</a></td>
      <td>Some assembly required</td>
      <td><strong>Free storage</strong><br /><br />Free calls up to 40 calls/second.<br />$0.05/10,000 calls after</td>
      <td>Good</td>
      <td>Easy manual rotation, not automatic</td>
      <td>4KB size limit</td>
    </tr>
    <tr>
      <td><a href="#aws-systems-manager-parameter-store">Parameter Store Advanced</a></td>
      <td>Some assembly required</td>
      <td>$0.05 per month per secret.<br /><br />$0.05/10,000 calls</td>
      <td>Good</td>
      <td>Easy manual rotation, not automatic</td>
      <td>Supports TTL for secrets. 8KB size limit</td>
    </tr>
    <tr>
      <td><a href="#aws-secrets-manager">Secrets Manager</a></td>
      <td>Some assembly required</td>
      <td>$0.40 per secret per month.<br />$0.05/10,000 calls.<br />30-day free tier.</td>
      <td>Good</td>
      <td>Easiest &amp; Automatic<br />Built into the product</td>
      <td>Largest payload size: 64KB per secret</td>
    </tr>
    <tr>
      <td><a href="#key-management-service">Key Management Service</a> (KMS)</td>
      <td>Most work</td>
      <td>$1 per key per month.<br />$0.03/10,000 requests</td>
      <td>Good</td>
      <td>Depends on ciphertext storage.<br />Easy with DynamoDB/S3, more manual with env vars.</td>
      <td>Most flexible option.<br /> 4KB per <code class="language-plaintext highlighter-rouge">encrypt</code> operation.<br />Binary size is limited by storage mechanism.<br />Roll your own Secrets Manager or Parameter Store.</td>
    </tr>
  </tbody>
</table>

<h2 id="lambda-environment-variables">Lambda Environment Variables</h2>
<p>Environment variables in Lambda are where most folks start out in their journey. They’re baked right in, and can be fetched easily (using something like <code class="language-plaintext highlighter-rouge">process.env.MY_SECRET</code> for Node or <code class="language-plaintext highlighter-rouge">os.environ.get('MY_SECRET')</code> for Python). Unfortunately they are not the <em>most</em> secure option.</p>

<p>However one common misconception is that environment variables are <code class="language-plaintext highlighter-rouge">stored as plain text</code> by AWS Lambda. This is <strong>false</strong>.</p>

<p>Lambda environment variables are <a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html">encrypted at rest</a>, and only decrypted when the Lambda function initializes, or when you take an action resulting in a call to <code class="language-plaintext highlighter-rouge">GetFunctionConfiguration</code>. This includes visiting the <code class="language-plaintext highlighter-rouge">Environment Variables</code> section of the Lambda page in the AWS Console. It startles some people to see their secrets on this page, but you can easily prevent this by denying <code class="language-plaintext highlighter-rouge">lambda:GetFunctionConfiguration</code> and <code class="language-plaintext highlighter-rouge">kms:Decrypt</code> permissions to your AWS console user.</p>

<p>Auditability is another challenge of Lambda environment variables. For the principle of least privilege to be effective, we should limit access to secrets only to when they are needed. To ensure this is followed, or investigate and remediate a leaked secret, we need to know which Lambda function used a specific secret and at what time.</p>

<p>Environment variables are automatically decrypted and injected into every function sandbox upon initialization. Given that CloudTrail reflects one call to <code class="language-plaintext highlighter-rouge">kms:Decrypt</code>, I presume the entire 4KB environment variable package is encrypted together. This means you lack the ability to audit an individual secret - it’s all or nothing.</p>

<p>If you’re in a regulated environment, or otherwise distrust Amazon, you can create a Customer Managed Key (CMK) and use that to encrypt your environment variables instead.</p>

<p>It’s important to note that when you update environment variables, you will trigger a cold start (as long as you’re invoking the <code class="language-plaintext highlighter-rouge">$LATEST</code> function version). Existing function sandboxes are permanently shut down, and when the next request arrives you’ll experience a cold start as the new sandbox pulls the latest environment variables into scope.</p>

<p>Environment variables are also the best-performing option. Systems Manager Parameter Store, Secrets Manager, Lambda environment variables, and KMS all fundamentally rely on KMS and thus a call to <code class="language-plaintext highlighter-rouge">kms:Decrypt</code> at some point.</p>

<p>Lambda Function environment variables add around 25ms to your cold start duration, according to an article David Behroozi <a href="https://speedrun.nobackspacecrew.com/blog/2024/03/13/lambda-environment-variables-impact-on-coldstarts.html">just wrote</a>. These calls are logged in CloudTrail whenever your function starts.</p>

<p>However, purely storing secrets as environment variables is not the most secure option. Although they are encrypted at rest, environment variables and <code class="language-plaintext highlighter-rouge">lambda:GetFunctionConfiguration</code> permissions are treated by Lambda as part of the <code class="language-plaintext highlighter-rouge">ReadOnly</code> policy used internally by AWS, by auditors, and by cloud security SaaS products. This broadens your risk: if a vendor or 3rd party auditor becomes compromised, your secrets leak too.</p>

<p>One risk is that you may accidentally leak a secret when sharing your screen while viewing or modifying a Lambda environment variable. It’s unfortunate that AWS automatically decrypts and displays these values in plain text. AWS has no excuse for this, and should absolutely hide environment variable values unless toggled on, which is how Parameter Store and Secrets Manager both work.</p>

<p>Furthermore, CloudFormation treats environment variables as regular parts of a template, so they are available when looking at the full template or historical templates for a given stack. Additionally, AWS does not recommend storing <a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html">anything secret in an environment variable</a>.</p>

<p>You can improve that somewhat for no (or little) cost using a pattern I lay out <a href="#safely-securing-environment-variables">further on</a>. Before we get there, you should be familiar with the first-class products AWS offers to store your secrets.</p>

<h2 id="aws-systems-manager-parameter-store">AWS Systems Manager Parameter Store</h2>
<p>The title is a mouthful, and the service is equally Byzantine. It includes features for managing nodes, patching systems, handling feature flags, and so much more. It was formerly called Simple Systems Manager; it’s truly anything but simple.</p>

<p>Today we’ll focus only on Lambda and exclusively on the Parameter Store feature which allows us to store a plaintext or secure string either as a simple value or structured item.</p>

<p>You <strong>always want to use SecureString</strong> for secrets.</p>

<p>Parameter Store offers the choice between Standard and Advanced Parameters. Standard Parameters are free to store, Advanced Parameters incur a $0.05 per month per parameter charge.</p>

<p>Standard parameters are limited to 4KB in size (each), with 10,000 total per region. Advanced Parameters have higher limits of 8KB per item and 100,000 total per region. They come with the bonus of attaching <a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/parameter-store-policies.html">Parameter Policies</a>, which are effectively TTLs for a given parameter.</p>

<p>Standard Parameters are free up to 40 requests per second (for all values stored in Parameter Store). Beyond that, the cost is $0.05 per 10,000 Parameter Store API interactions. Advanced Parameters are always billed at $0.05/10,000 requests. Fetching each parameter counts as an interaction, so fetching 10 parameters triggers 10 interactions. Parameters are individually versioned, and you can fetch a specific version or default to the latest.</p>

<p>Historically one major advantage of Secrets Manager over Parameter Store is the ability to share secrets across AWS accounts using a resource-based policy. This is now <a href="https://aws.amazon.com/about-aws/whats-new/2024/02/aws-systems-manager-parameter-store-cross-account-sharing/">supported by Parameter Store for Advanced Parameters</a> as well.</p>

<p>Finally, individual Parameter calls are auditable in CloudTrail so you can prove who accessed a Parameter and when.</p>
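<p>Fetching a SecureString at runtime is only a couple of lines with an AWS SDK. Here’s a minimal sketch using boto3 with a hypothetical parameter name; the client is passed in, which also makes it easy to stub in tests:</p>

```python
def fetch_parameter(ssm_client, name):
    """Fetch a SecureString from Parameter Store, decrypted via KMS."""
    resp = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return resp["Parameter"]["Value"]

# In a Lambda function, do this once during init and reuse the value
# across invocations to avoid paying the API call on every request:
#
# import boto3
# DB_PASSWORD = fetch_parameter(boto3.client("ssm"), "/prod/db/password")
```

Note <code class="language-plaintext highlighter-rouge">WithDecryption=True</code>: without it, a SecureString comes back as ciphertext.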

<h3 id="performance">Performance</h3>
<p>For a new TCP connection, Parameter Store fetched a parameter in around 217ms, including 99ms to set up the connection itself:
<span class="image fit"><a href="/assets/images/secrets/ssm_cold.png" target="_blank"><img src="/assets/images/secrets/ssm_cold.png" alt="Systems Manager Parameter Store cold request" /></a></span></p>

<p>With an existing connection, fetching the parameter took around 39.3ms:
<span class="image fit"><a href="/assets/images/secrets/ssm_warm.png" target="_blank"><img src="/assets/images/secrets/ssm_warm.png" alt="Systems Manager Parameter Store warm request" /></a></span></p>

<h2 id="aws-secrets-manager">AWS Secrets Manager</h2>
<p>Secrets Manager is purpose-built for encrypting and storing secrets for your application. It also has the largest cost at $0.40 per secret per month. This cost is multiplied by the number of regions you choose to replicate each secret to, so this can add up quickly. Fetching a secret costs $0.05 per 10,000 API calls, and there is a free 30-day trial.</p>

<p>The big features you’ll gain over Parameter Store are the ability to automatically replicate secrets across regions and to automatically (or manually) rotate secrets. These features often satisfy requirements for applications subject to regulations like PCI-DSS or HIPAA. If these are must-have features for your application, it makes sense to use Secrets Manager.</p>

<p>Secret values can be up to 64KB in size, which is far larger than environment variables or Parameter Store allow. Like Parameter Store, calls to <code class="language-plaintext highlighter-rouge">GetSecretValue</code> are logged in CloudTrail. The big advantage Secrets Manager has over Parameter Store is the ability to simply rotate or change a secret everywhere it’s used. You can do this on a schedule if you’re in an environment which demands it, or ad-hoc.</p>
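<p>The runtime fetch looks nearly identical to Parameter Store. A minimal boto3 sketch (the secret name is hypothetical, and I’m assuming the common convention of storing a JSON key/value payload):</p>

```python
import json

def fetch_secret(sm_client, secret_id):
    """Fetch a secret from Secrets Manager and parse its JSON payload."""
    resp = sm_client.get_secret_value(SecretId=secret_id)
    return json.loads(resp["SecretString"])

# Once at init, then reuse across invocations:
#
# import boto3
# creds = fetch_secret(boto3.client("secretsmanager"), "prod/db-credentials")
# user, password = creds["username"], creds["password"]
```

Because rotation can swap the value underneath you, long-lived sandboxes should be prepared to re-fetch if a stale credential stops working.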

<h3 id="performance-1">Performance</h3>
<p>Similar to Parameter Store, it takes Secrets Manager a bit to warm up. 177ms was the duration to create this TCP connection and make the request:
<span class="image fit"><a href="/assets/images/secrets/secrets_manager_cold.png" target="_blank"><img src="/assets/images/secrets/secrets_manager_cold.png" alt="Secrets Manager cold request" /></a></span></p>

<p>With a warm connection, fetching a secret from Secrets Manager took only 29.4ms:
<span class="image fit"><a href="/assets/images/secrets/secrets_manager_warm.png" target="_blank"><img src="/assets/images/secrets/secrets_manager_warm.png" alt="Secrets Manager warm request" /></a></span></p>

<h2 id="key-management-service">Key Management Service</h2>
<p>AWS Key Management Service (KMS) is the system which underpins <em>all of these other services</em>. If you look carefully at either the documentation or CloudTrail logs, you’ll see KMS!</p>

<p>KMS allows us to create an encryption key, securely store it within AWS, and then use IAM and key policies to grant your Lambda function permission to decrypt ciphertext when it runs. Instead of passing around a reference to a secret, you’ll pass your Lambda function the encrypted ciphertext itself.</p>

<p>Storing and fetching the ciphertext can be implemented many ways, and should generally track the size of the encrypted blob. Small strings can be easily encrypted and stored as environment variables. If you need to share the same secret, you can store the ciphertext in DynamoDB. For large shared secrets, ciphertexts can be stored in S3.</p>

<p>Most often these secrets are decrypted during the initialization phase of a Lambda function. Fun fact: you don’t need to store or pass the ID of the key used to encrypt data. That key ID is <a href="https://docs.aws.amazon.com/kms/latest/APIReference/API_Decrypt.html">encoded</a> right along with the encrypted data in the ciphertext! Simply call <code class="language-plaintext highlighter-rouge">kms:Decrypt</code> on the blob, and KMS takes care of the rest. Neat!</p>
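<p>A minimal sketch of that pattern, assuming the ciphertext is stored base64-encoded in a hypothetical environment variable (the client is injected so the function is easy to stub):</p>

```python
import base64
import os

def decrypt_env_secret(kms_client, env_var="ENCRYPTED_API_KEY"):
    """Decrypt a base64-encoded KMS ciphertext stored in an env var.

    Note there's no KeyId argument: the key ID is encoded in the
    ciphertext itself, so KMS knows which key to use.
    """
    blob = base64.b64decode(os.environ[env_var])
    resp = kms_client.decrypt(CiphertextBlob=blob)
    return resp["Plaintext"].decode("utf-8")

# Once at init:
#
# import boto3
# API_KEY = decrypt_env_secret(boto3.client("kms"))
```

Anyone who can read the function configuration sees only ciphertext; the plaintext exists solely inside the running sandbox.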

<p>KMS bills $1 per key per month. There is no charge for the keys created and used by Parameter Store, Secrets Manager, or AWS Lambda. You’re also charged $0.03 per 10,000 requests to <code class="language-plaintext highlighter-rouge">kms:Decrypt</code> (or other API actions). These calls are individually auditable in CloudTrail.</p>

<p>You’ll have to implement rotation yourself, but if you store ciphertexts in DynamoDB, this can be relatively straightforward and cheaper than either Parameter Store or Secrets Manager, especially if you want to distribute a secret across multiple regions.</p>

<p>I see KMS used most frequently to encrypt slowly changing items like certificates, .PEM files, or to securely store signing keys.</p>

<h3 id="performance-2">Performance</h3>
<p>Decrypting one small (~200b) ciphertext with KMS is notably faster than Parameter Store or Secrets Manager. This request took 64.4ms, including creating the TCP connection:
<span class="image fit"><a href="/assets/images/secrets/kms_cold.png" target="_blank"><img src="/assets/images/secrets/kms_cold.png" alt="KMS cold request" /></a></span></p>

<p>With a warm connection, KMS decrypted my secret in a blistering <strong>6.45ms</strong>: 
<span class="image fit"><a href="/assets/images/secrets/kms_warm.png" target="_blank"><img src="/assets/images/secrets/kms_warm.png" alt="KMS warm request" /></a></span></p>

<p>Presumably a big advantage here is that my ciphertext was already present in Lambda (as an environment variable) and didn’t need to be fetched from a remote datastore call. KMS merely needed to decrypt the ciphertext and return!</p>

<h2 id="aws-parameter-and-secrets-lambda-extension">AWS Parameter and Secrets Lambda Extension</h2>
<p>To more easily use either Parameter Store or Secrets Manager in Lambda, AWS has published a <a href="https://docs.aws.amazon.com/secretsmanager/latest/userguide/retrieving-secrets_lambda.html">Lambda extension</a> which handles API calls to the underlying services for you, along with caching and refreshing secrets. You can <a href="https://docs.aws.amazon.com/secretsmanager/latest/userguide/retrieving-secrets_lambda.html">tune</a> these parameters to your liking as well.</p>

<p>Your function interacts with this extension via a lightweight API running on <code class="language-plaintext highlighter-rouge">localhost</code>. It’s reasonably well designed, although I find it a bit clumsy overall. This really feels like the type of feature Lambda should implement themselves, and then <code class="language-plaintext highlighter-rouge">magically</code> make secrets appear in your function runtime. In contrast, ECS <a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/specifying-sensitive-data.html">has this behavior built in</a> and I find the experience far superior compared to Lambda.</p>
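
<p>For reference, here’s roughly what talking to that localhost API looks like from Python. The default port (2773) and the session-token header come from the AWS documentation linked above; the secret name is made up:</p>

```python
import json
import os
import urllib.request


def extension_url(secret_id, port=2773):
    # 2773 is the extension's default port; it only listens on localhost.
    return f"http://localhost:{port}/secretsmanager/get?secretId={secret_id}"


def fetch_secret(secret_id):
    request = urllib.request.Request(
        extension_url(secret_id),
        # The extension authenticates callers with the sandbox session token.
        headers={"X-Aws-Parameters-Secrets-Token": os.environ["AWS_SESSION_TOKEN"]},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["SecretString"]
```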

<p>Furthermore, this extension isn’t open source. Because extensions are indistinguishable from your own function code, it leaves a bit of a foul taste in my mouth that I’m completely blessing a random extension with carte-blanche access to both my function code and secrets.</p>

<p>I’m of the firm opinion that we as users shouldn’t seriously consider any Lambda Extension unless the code is open source (and can be built/published to my own account if I choose). If AWS changes this behavior, I’ll happily update the post.</p>

<p>For these reasons, I prefer interacting with the Parameter Store or Secrets Manager APIs instead, using the <code class="language-plaintext highlighter-rouge">aws-sdk</code>. The (excellent) AWS Lambda <a href="https://github.com/aws-powertools">PowerTools project</a> also supports fetching parameters from <a href="https://docs.powertools.aws.dev/lambda/python/latest/utilities/parameters/">multiple sources</a> and is absolutely worth considering.</p>

<p>Now let’s consider four example secrets. We’ll look at the attack vectors, the blast radius for a leak/compromise, and identify the best cost/benefit solution for each.</p>

<h2 id="patterns-and-practices">Patterns and Practices</h2>

<h3 id="safely-securing-environment-variables">Safely securing environment variables</h3>
<p>AWS Lambda environment variables are encrypted at rest, so the biggest issue with storing sensitive data in environment variables isn’t Lambda itself - it’s the AWS Console and CloudFormation (and your CI pipeline)! When your stack is created or updated, those environment variables <strong>are</strong> plaintext values in the CloudFormation stack template. Templates are also stored and retrievable in the CloudFormation UI, as well as in the AWS Lambda console.</p>

<p>Unfortunately you’re not able to use secure <a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/dynamic-references.html">dynamic references</a> (<code class="language-plaintext highlighter-rouge">{{resolve:ssm-secure}}</code>) to pass a <em>reference</em> to your secret to CloudFormation, because they aren’t yet supported for Lambda environment variables. Plain <code class="language-plaintext highlighter-rouge">{{resolve:ssm}}</code> references work, but CloudFormation resolves them to plaintext at deploy time. You should complain about this to your AWS TAM.</p>

<p>The downside is that your secrets are still viewable in the Lambda Console via <code class="language-plaintext highlighter-rouge">lambda:GetFunctionConfiguration</code>, and if you update your secret in Parameter Store, it won’t be updated in Lambda until you redeploy your functions.</p>

<h3 id="envelope-encryption">Envelope Encryption</h3>
<p>Consider a case where you may have ~100kb of secrets to store. A handful of signing keys, a couple tokens, maybe an mTLS certificate. Here’s where you can use a technique called <a href="https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#enveloping">envelope encryption</a> to secure your data.</p>

<ol>
  <li>Create a KMS key</li>
  <li>Generate a 256-bit AES key for each customer, application, or secrets payload</li>
  <li>Encrypt all of your secrets with the AES key. This is the “envelope”</li>
  <li>Include the encrypted secrets in your function zip.</li>
  <li>Finally, encrypt the AES key with your KMS key and pass the encrypted key to your function in an environment variable.</li>
</ol>

<p>You’ve just encrypted an envelope, and passed the encrypted key to your Lambda Function securely! This also helps save money on KMS keys, as you can re-use one KMS key for multiple AES keys. This pattern is also useful if you need to secure keys for customers in a multi-tenant environment, but laying that out is beyond the scope of this post.</p>
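
<p>The five steps above can be sketched like this. A toy XOR keystream stands in for AES-256-GCM so the example is self-contained, and <code class="language-plaintext highlighter-rouge">kms_encrypt</code>/<code class="language-plaintext highlighter-rouge">kms_decrypt</code> are injected callables standing in for the real KMS API calls - don’t use the toy cipher for actual secrets:</p>

```python
import secrets


def toy_cipher(key: bytes, data: bytes) -> bytes:
    # XOR keystream standing in for AES-256-GCM; for illustration only.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def seal_envelope(kms_encrypt, secrets_blob: bytes):
    data_key = secrets.token_bytes(32)             # step 2: per-app 256-bit key
    envelope = toy_cipher(data_key, secrets_blob)  # step 3: the "envelope"
    wrapped_key = kms_encrypt(data_key)            # step 5: encrypt the key
    # Ship `envelope` inside the function zip, `wrapped_key` in an env var.
    return envelope, wrapped_key


def open_envelope(kms_decrypt, envelope: bytes, wrapped_key: bytes) -> bytes:
    data_key = kms_decrypt(wrapped_key)    # at runtime: unwrap the data key
    return toy_cipher(data_key, envelope)  # XOR is its own inverse
```

<p>Only the 32-byte data key ever touches KMS, regardless of how large the envelope grows.</p>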

<h2 id="sensitive-data-exercise">Sensitive Data Exercise</h2>
<p>We’ve covered the fundamental building blocks for securing sensitive information within AWS and using it within Lambda. We’ve also composed a few patterns you can use to reduce costs or handle specific use cases.</p>

<p>Now, let’s consider 4 common secrets used in Lambda and think about how best to secure them.</p>

<h3 id="telemetry-api-key">Telemetry API Key</h3>
<p>First up is a telemetry API key. Consider an ELK stack, or any provider you prefer. These keys are free to create, so it’s best to create one key per application to limit blast radius and, as a bonus, better track costs. Telemetry keys are also usually write-only: leaking one only lets an attacker send additional data to the API.</p>

<p>With this in mind, <em>environment variables</em> are likely a good enough option here. They have minimal performance overhead, no cost, and minimal blast radius.</p>

<p>Keys can be easily created for exactly one Lambda function, or CloudFormation stack. If someone peers over your shoulder at a coffee shop, or inadvertently leaks the environment variable - it’s simple to change with a few clicks and a re-deploy.</p>

<p>You can also use <a href="#safely-securing-environment-variables">dynamic references</a> and limit the read permissions for console users or 3rd party roles to further prevent access.</p>

<p>Using a SecureString with Parameter Store would also be a good option, as it would likely be free - especially at low request volumes.</p>

<p>In this case, the blast-radius is small, the rotation complexity is easy, and a key encrypted at rest is likely more than suitable for our use case.</p>

<h3 id="database-username-and-password">Database Username and Password</h3>
<p>Your RDBMS may only allow one username and password, shared across all applications - or maybe you just need to share a secret for the sake of simplicity. If you’re not using a stateful connection pooler (like <code class="language-plaintext highlighter-rouge">pgbouncer</code>), you may need to share this secret with all your functions.</p>

<p>Here’s where Parameter Store is probably also a great fit. If you ever have to change the secret, your functions can reference an unversioned Parameter and always get the latest value. For one key, it’s pretty affordable. However this math changes if you have a larger bundle of secrets that exceeds the 4KB (standard) or 8KB (advanced tier) size limits of Parameter Store.</p>
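
<p>A sketch of that lookup with an injected SSM client (the parameter name is hypothetical; in a real function you’d pass <code class="language-plaintext highlighter-rouge">boto3.client("ssm")</code>):</p>

```python
def get_db_credentials(ssm_client, name="/prod/db/credentials"):
    # Referencing the parameter without a version pins you to the latest
    # value, so a rotation only needs a parameter update, not a redeploy.
    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]
```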

<h3 id="github-application-private-key">GitHub Application Private Key</h3>
<p>For our third example, consider building and deploying a GitHub Application. Authenticating as a GitHub Application is not quite as simple as presenting a 128-bit UUID.</p>

<p>Instead, you must download and save an <a href="https://docs.github.com/en/apps/creating-github-apps/authenticating-with-a-github-app/managing-private-keys-for-github-apps">application key in PEM format</a>. These keys can be a bit large - around 2KB - which may push you close to the 4KB environment variable limit.</p>

<p>You <em>can</em> create multiple keys for the same application at no cost, so deploying one key per stack is still tenable.</p>

<p>If the key were to be leaked, someone could conceivably authenticate as your application and access <strong>ANY</strong> of the repositories your application is installed into (with whatever permissions your application is configured to use). This is risky!</p>

<p>In this case, you’d probably want to use something like Parameter Store if you choose to create multiple keys and rotate them yourself. You’ll avoid the size limit for Lambda environment variables, and it won’t be too costly.</p>

<p>If you’re dealing with a larger key but don’t want to eat the cost of Secrets Manager, KMS or DynamoDB can make sense as well.</p>

<p>I’d be remiss if I didn’t mention that, like Lambda environment variables, DynamoDB records are also <a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EncryptionAtRest.html">encrypted at rest</a>, optionally with your own customer-managed key. I assume this is mostly at the hardware (disk) level, so data in memory may not be encrypted. But generally if you’re also concerned with someone peeking over your shoulder as you browse DynamoDB items in the AWS console, you could also encrypt them with your own key.</p>

<h3 id="pci-dss-or-hipaa-credential-rotation">PCI-DSS or HIPAA credential rotation</h3>
<p>If you’re in a regulated environment with mandated credential rotation, Secrets Manager makes this so easy. As this post has mentioned several times, it’s certainly possible to build this yourself. However - it’s often worth the cost of $0.40 per secret to have the peace of mind that Secrets Manager will automatically rotate your secrets on a regular cadence. Your auditor will thank you as well.</p>

<h2 id="wrapping-up">Wrapping up</h2>
<p>My hot take after writing this guide is that Lambda environment variables are generally fine for a one-off API key with a small blast radius. They’re fast, free, and easy to use.</p>

<p>For secrets with larger blast radii, use SecureStrings from Parameter Store. If you’re working in a regulated environment or you’d like to regularly rotate a secret, it’s probably easiest to use Secrets Manager.</p>

<p>Reach for KMS and another storage mechanism if your use case doesn’t quite fit into these boxes, or if doing so would be prohibitively expensive.</p>

<p>Ultimately security is a balancing act. I realize best practices are all about limiting risks at every turn, but it still feels wrong to fret about environment variables when so many developers run around with <code class="language-plaintext highlighter-rouge">Administrator</code> IAM roles (and can easily read any secret anyway).</p>

<p>At the same time, AWS should do more to gate access to environment variable values behind a permission more granular than <code class="language-plaintext highlighter-rouge">lambda:GetFunctionConfiguration</code>.</p>

<p>This post would not exist without <a href="https://speedrun.nobackspacecrew.com/blog/index.html">David Behroozi</a> challenging me to finish it, and helping out with his CloudTrail digging. You should follow him on <a href="https://twitter.com/rooToTheZ">twitter</a>. Thanks, David!</p>

<p><a href="https://twitter.com/Frichette_n">Nick Frichette</a>, <a href="https://twitter.com/alexbdebrie">Alex DeBrie</a>, and <a href="http://awsteele.com/">Aidan Steele</a> also helped review this, thanks friends!</p>

<p>If you like this type of content please subscribe to my <a href="https://aaronstuyvenberg.com">blog</a> or follow me on <a href="https://twitter.com/astuyve">twitter</a> and send me any questions or comments. You can also ask me questions directly if I’m <a href="twitch.tv/aj_stuyvenberg">streaming on Twitch</a> or <a href="https://www.youtube.com/channel/UCsWwWCit5Y_dqRxEFizYulw">YouTube</a>.</p>]]></content><author><name>AJ Stuyvenberg</name></author><category term="posts" /><summary type="html"><![CDATA[Securing your API Keys, database passwords, or SSH keys for Lambda Functions is tricky. This post compares Systems Manager, Secrets Manager, Key Management Service, and environment variables for handling your secrets in Lambda. We'll cover costs, features, performance, and more. Then we'll lay out a framework for considering the risk of your particular secret, so that you know what's best for your application's secrets.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://aaronstuyvenberg.com/assets/images/secrets/secrets_in_lambda.png" /><media:content medium="image" url="https://aaronstuyvenberg.com/assets/images/secrets/secrets_in_lambda.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">How Lambda starts containers 15x faster (deep dive)</title><link href="https://aaronstuyvenberg.com/posts/containers-on-lambda-pt-two" rel="alternate" type="text/html" title="How Lambda starts containers 15x faster (deep dive)" /><published>2024-01-09T00:00:00+00:00</published><updated>2024-01-09T00:00:00+00:00</updated><id>https://aaronstuyvenberg.com/posts/containers-on-lambda-pt-two</id><content type="html" xml:base="https://aaronstuyvenberg.com/posts/containers-on-lambda-pt-two"><![CDATA[<p>In the <a href="https://aaronstuyvenberg.com/posts/containers-on-lambda">first post</a> of this series, we demonstrated that container-based Lambda functions can initialize as fast or faster than zip-based functions. 
This is counterintuitive, as zip-based functions are usually much smaller (up to 250mb), while container images typically contain far more data and are supported up to 10gb in size. So how is this technically possible?</p>

<p>“On demand container loading on AWS Lambda” was <a href="https://arxiv.org/abs/2305.13162">published</a> on May 23rd, 2023 by Marc Brooker et al. I suggest you read the full paper, as it’s quite approachable and extremely interesting, but I’ll break it down here.</p>

<p>The key to this performance improvement can be summarized in four steps, all performed during <strong>function creation</strong>.</p>

<ol>
  <li>Deterministically serialize container layers (which are tar.gz files) onto an ext4 file system</li>
  <li>Divide filesystem into 512kb chunks</li>
  <li>Encrypt each chunk</li>
  <li>Cache the chunks and share them <em>across all customers</em></li>
</ol>

<p>With these chunks stored and shared safely in a multi-tier cache, they can be fetched more quickly during <strong>function cold start</strong>.</p>

<p>But how can one safely encrypt, cache, and share actual bits of a container image <em>between</em> users?!</p>

<h2 id="container-images-are-sparse">Container images are sparse</h2>
<p>One interesting fact about container images is that they’re an objectively inefficient method for distributing software applications. It’s true!</p>

<p>Container images are sparse blobs, with only a fraction of the contained bytes required to actually run the packaged application. <a href="https://www.usenix.org/conference/fast16/technical-sessions/presentation/harter">Harter et al</a> found that only 6.5% of bytes on average were needed at startup.</p>

<p>When we consider a collection of container images, the frequency and quantity of similar bytes is very high between images. This means there are lots of duplicated bytes copied over the wire every time you push or pull an image!</p>

<p>This is attributed to the fact that container images include a ton of stuff that doesn’t vary between us as users. These are things like the kernel, the operating system, system libraries like libc or curl, and runtimes like the jvm, python, or nodejs.</p>

<p>Not to mention all of the code in your app which you copied from Chat GPT (like everyone else).</p>

<p>The reality is that we’re all shipping ~80% of the same code.</p>

<h2 id="deterministic-serialization-onto-ext4">Deterministic serialization onto ext4</h2>
<p>Container images are stacks of tarballs, layered on top of each other to form a filesystem like the one on your own computer. This process is typically done at container runtime, using a <a href="https://docs.docker.com/storage/storagedriver/">storage driver</a> like <a href="https://docs.docker.com/storage/storagedriver/overlayfs-driver/">overlayfs</a>.</p>

<p><span class="image fit"><a href="/assets/images/lambda_containers/container_layers.png" target="_blank"><img src="/assets/images/lambda_containers/container_layers.png" alt="Containers are layers of tarballs" /></a></span></p>

<p>In a typical filesystem, this process of copying files from the tar.gz file to the filesystem’s underlying block device is <em>nondeterministic</em>. Files always land in the same directory, but their bytes may land on different parts of the block device over the course of multiple instantiations of the container.<br />
This is a concurrency-based performance optimization used by filesystems, and it introduces nondeterminism.</p>

<p>In order to de-duplicate and cache function container images, Lambda also needs a filesystem. This process is done when a function is created or updated. But for Lambda to efficiently cache chunks of a function container image, this process needed to be deterministic. So they made filesystem creation a serial operation, and thus the creation of Lambda filesystem blocks is deterministic.</p>

<p><span class="image fit"><a href="/assets/images/lambda_containers/lambda_filesystem.png" target="_blank"><img src="/assets/images/lambda_containers/lambda_filesystem.png" alt="An example filesystem created by the tarballs" /></a></span></p>

<h2 id="filesystem-chunking">Filesystem chunking</h2>
<p>Now that each byte of a container image will land in the same block each time a function is created, Lambda can divide the blocks into 512kb chunks. They specifically call out that larger chunks reduce metadata duplication, and smaller chunks lead to better deduplication and thus cache hit rate, so they expect this exact value to change over time.</p>
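
<p>A toy version of the chunk-and-hash step, with naming and hash choice of my own (the paper’s chunk size is 512kb):</p>

```python
import hashlib

CHUNK_SIZE = 512 * 1024  # 512kb, per the paper


def chunk_and_hash(filesystem_image: bytes):
    """Split a deterministically laid-out filesystem image into fixed-size
    chunks, naming each chunk by its content hash."""
    chunks = {}
    for offset in range(0, len(filesystem_image), CHUNK_SIZE):
        chunk = filesystem_image[offset:offset + CHUNK_SIZE]
        chunks[hashlib.sha256(chunk).hexdigest()] = chunk
    # Identical chunks collapse into a single entry: that's the deduplication.
    return chunks
```

<p>Because the layout is deterministic, the same bytes always produce the same chunk hashes, run after run.</p>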

<p><span class="image fit"><a href="/assets/images/lambda_containers/chunked_filesystem.png" target="_blank"><img src="/assets/images/lambda_containers/chunked_filesystem.png" alt="The Lambda filesystem divided into chunks and hashed" /></a></span></p>

<p>The next two steps are the most important.</p>

<h2 id="convergent-encryption">Convergent encryption</h2>
<p>Lambda code is considered unsafe, as any customer can upload anything they want. But then how can AWS deduplicate and share chunks of function code between customers?<br />
The answer is something called Convergent Encryption, which sounds scarier than it is:</p>
<ol>
  <li>Hash each 512kb chunk, and from that, derive an encryption key.</li>
  <li>Encrypt each block with the derived key.</li>
  <li>Create a manifest file containing a SHA256 hash of each chunk, the key, and file offset for the chunk.</li>
  <li>Encrypt the keys list in the manifest file using a per-customer key managed by KMS.</li>
</ol>
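
<p>Here’s the core trick of steps 1-2 in miniature - the key is derived from the chunk’s own hash, so identical bytes always produce identical ciphertext. The XOR keystream and salt below are illustrative stand-ins, not Lambda’s actual scheme:</p>

```python
import hashlib


def derive_key(chunk: bytes) -> bytes:
    # Step 1: hash the chunk, and derive the encryption key from that hash.
    return hashlib.sha256(b"illustrative-salt" + chunk).digest()


def encrypt_chunk(chunk: bytes):
    key = derive_key(chunk)
    # Step 2: encrypt the chunk with the derived key (toy XOR keystream here).
    ciphertext = bytes(b ^ key[i % len(key)] for i, b in enumerate(chunk))
    # The manifest records the chunk's SHA256 hash alongside its key (step 3).
    return hashlib.sha256(chunk).hexdigest(), ciphertext
```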

<p><span class="image fit"><a href="/assets/images/lambda_containers/encrypted_manifest.png" target="_blank"><img src="/assets/images/lambda_containers/encrypted_manifest.png" alt="The encrypted chunks and manifest file for a Lambda container function" /></a></span></p>

<p>These chunks are then de-duplicated and stored in S3 when a Lambda function is created.</p>

<p>Now that each block is hashed and encrypted, they can be efficiently de-duplicated and shared across customers. The manifest and chunk key list are decrypted by the Lambda worker during a cold start, and only chunks matching those keys are downloaded and decrypted.<br />
This is safe because, for any customer’s manifest to contain a chunk hash (and the key derived from it), that customer’s function must have created and sent that exact chunk of bytes to Lambda.</p>

<p>Put another way, all users with an identical chunk of bytes also all share the identical key.</p>

<p>This is key to sharing chunks of container images without trust. Now if you and I both run a node20.x container on Lambda, the bytes for nodejs itself (and its dependencies like libuv) can be shared, so they may already be on the worker before my function runs or is even created!</p>

<h2 id="multi-tiered-cache-strategy">Multi-tiered cache strategy</h2>
<p>The last component to this performance improvement is creating a multi-tiered cache. Tier three is the source cache, and lives in an S3 bucket controlled by AWS.</p>

<p>The second tier is an AZ-level cache, which is replicated and separated into an in-memory system for hot data, and flash storage for colder chunks.
Fun fact - to reduce p99 outliers, this cache data is stored using erasure coding in a 4-of-5 code strategy. This is the same sharding technique <a href="https://youtu.be/v3HfUNQ0JOE?t=508">used in s3</a>.</p>

<p>This allows workers to make redundant requests to this cache while fetching chunks, and abandon the slowest request as soon as 4 of the 5 chunks return. This is a <a href="https://dl.acm.org/doi/10.1145/2796314.2745873">common pattern</a>, which AWS also uses when fetching zip-based Lambda function code from s3 (among many other applications).</p>
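
<p>The pattern of abandoning the slowest of several redundant requests can be sketched like this (names and delays are mine; a real client would issue network reads rather than call injected fetchers):</p>

```python
import concurrent.futures


def fetch_first_k(fetchers, k=4):
    """Issue every redundant request, return once k of them complete,
    and abandon the stragglers."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(fetchers))
    futures = [pool.submit(fetch) for fetch in fetchers]
    results = []
    for future in concurrent.futures.as_completed(futures):
        results.append(future.result())
        if len(results) == k:
            break
    # Don't wait on the slowest request; cancel anything not yet started.
    pool.shutdown(wait=False, cancel_futures=True)
    return results
```

<p>The p99 latency of the whole fetch becomes the 4th-fastest response rather than the slowest one.</p>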

<p>Finally the tier-one cache lives on each Lambda worker and is entirely in-memory. This is the fastest cache, and most performant to read from when initializing a new Lambda function.</p>

<p>In a given week, 67% of chunks were served from on-worker caches!
<span class="image fit"><a href="/assets/images/lambda_containers/cache_level_comparison.png" target="_blank"><img src="/assets/images/lambda_containers/cache_level_comparison.png" alt="For a given week, 67% of chunks were served from the worker" /></a></span></p>

<h2 id="putting-it-together">Putting it together</h2>
<p>During a cold start, these chunk IDs are looked up using the manifest, and then fetched from the cache(s) and decrypted. The Lambda worker reassembles the chunks and then the function initialization begins. It doesn’t matter who uploaded the chunk, they’re all shared safely!</p>

<p><span class="image fit"><a href="/assets/images/lambda_containers/cold_start_cache.png" target="_blank"><img src="/assets/images/lambda_containers/cold_start_cache.png" alt="The encrypted chunks fetched from the cache during a cold start and reassembled." /></a></span></p>

<h2 id="crazy-stat">Crazy stat</h2>
<p>This leads to a staggering statistic. If (after subscribing and sharing this post), you close this page and create a brand new container-based Lambda function right now, there is an <strong>80% chance</strong> that new container image will contain <em>zero unique bytes</em> compared to what Lambda already has seen.</p>

<p>AWS has seen the code and dependencies you are likely to deploy before you have even deployed it.</p>

<h2 id="wrapping-up">Wrapping up</h2>
<p>The whole paper is excellent and includes many other interesting topics like cache eviction, and how this was implemented (in Rust!), so I suggest you <a href="https://arxiv.org/abs/2305.13162">read the full paper</a> to learn more. The Lambda team even had to contend with some cache fragments being <strong>too popular</strong>, so they had to salt the chunk hashes!</p>

<p>It’s interesting to me that the Fargate team went a totally different direction here with <a href="https://aws.amazon.com/about-aws/whats-new/2023/07/aws-fargate-container-startup-seekable-oci/">SOCI</a>. My understanding is that SOCI is less effective for images smaller than 1GB, so I’d be curious if some lessons from this paper could further improve Fargate launches.</p>

<p>At the same time, I’m curious if this type of multi-tenant cache would make sense to improve launch performance of something like GCP Cloud Run, or Azure Container Instances.</p>

<p>If you like this type of content please subscribe to my <a href="https://aaronstuyvenberg.com">blog</a> or reach out on <a href="https://twitter.com/astuyve">twitter</a> with any questions. You can also ask me questions directly if I’m <a href="twitch.tv/aj_stuyvenberg">streaming on Twitch</a> or <a href="https://www.youtube.com/channel/UCsWwWCit5Y_dqRxEFizYulw">YouTube</a>.</p>]]></content><author><name>AJ Stuyvenberg</name></author><category term="posts" /><summary type="html"><![CDATA[We've seen how containers on Lambda initialize as fast or faster than their zip-based counterparts. This post examines exactly how the Lambda team did this, and the performance advantages of everyone shipping the same code.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://aaronstuyvenberg.com/assets/images/lambda_containers/containers_deep_dive.png" /><media:content medium="image" url="https://aaronstuyvenberg.com/assets/images/lambda_containers/containers_deep_dive.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">The case for containers on Lambda (with benchmarks)</title><link href="https://aaronstuyvenberg.com/posts/containers-on-lambda" rel="alternate" type="text/html" title="The case for containers on Lambda (with benchmarks)" /><published>2024-01-02T00:00:00+00:00</published><updated>2024-01-02T00:00:00+00:00</updated><id>https://aaronstuyvenberg.com/posts/containers-on-lambda</id><content type="html" xml:base="https://aaronstuyvenberg.com/posts/containers-on-lambda"><![CDATA[<p>Note: the second part of this post is available <a href="https://aaronstuyvenberg.com/posts/containers-on-lambda-pt-two">here</a>.</p>

<p>When AWS Lambda first introduced support for container-based functions, the initial reactions from the community were mostly negative. Lambda isn’t meant to run large applications, it is meant to run small bits of code, scaled widely by executing many functions simultaneously.</p>

<p>Containers were not only antithetical to the philosophy of Lambda and the serverless mindset writ large, they were also far slower to initialize (or cold start) compared with their zip-based function counterparts.</p>

<p>If we’re being honest, I think the <strong>biggest roadblock to adoption</strong> was the cold start performance penalty associated with using containers. That penalty has now all but evaporated.</p>

<p>The AWS Lambda team put in tremendous amounts of work and improved the cold-start times by a shocking <strong>15x</strong>, according to the paper and <a href="https://www.youtube.com/watch?v=Wden61jKWvs">talk given by Marc Brooker</a>.</p>

<p>This post focuses on analyzing the performance of container-based Lambda functions with simple, reproducible tests. It also lays out the pros and cons for containers on Lambda. The next post will delve into how the Lambda team pulled off this performance win.</p>

<h2 id="performance-tests">Performance Tests</h2>
<p>I set off to test this new container image strategy by creating several identical functions across zip and container-based packaging schemes. These varied from 0mb of additional dependencies, up to the 250mb limit of zip-based Lambda functions. I’m <strong>not</strong> directly comparing the size of the final image with the size of the zip file, because containers include an OS and system libraries, so they are natively much larger than zip files.</p>

<p>As usual, I’m testing the <strong>round trip</strong> request time for a cold start from within the same region. I’m not using init duration, which <a href="https://youtu.be/2EDNcPvR45w?t=1421">does not include the time to load bytes into the function sandbox</a>.</p>

<p>I created a cold start by updating the function configuration (setting a new environment variable), and then sending a simple test request. The code for this project is <a href="https://github.com/astuyve/cold-start-benchmarker">open source</a>. I also streamed this entire process <a href="https://twitch.tv/aj_stuyvenberg">live on twitch</a>.</p>

<p>These results were based on the p99 response time, but I’ve included the p50 times for python below.</p>

<p>This first test contains a set of NodeJS functions running Node18.x. After several days and thousands of invocations, we see the final result. The top row represents zip-based Lambda functions, and the bottom row reports container-based Lambda functions (lower is better):
<span class="image fit"><a href="/assets/images/lambda_containers/container_metrics.png" target="_blank"><img src="/assets/images/lambda_containers/container_metrics.png" alt="Round trip cold start request time for thousands of invocations over several days" /></a></span>
An earlier version of this post reversed the rows. I’ve changed this to be consistent with the python result format. Thanks to those who corrected me!</p>

<p>It’s easier to read a bar chart:
<span class="image fit"><a href="/assets/images/lambda_containers/container_bar_chart.png" target="_blank"><img src="/assets/images/lambda_containers/container_bar_chart.png" alt="Round trip cold start request time for thousands of invocations over several days, as a bar chart" /></a></span></p>

<p>The second test was similar and performed with Python functions running Python 3.11. We see a very similar pattern, with slightly more variance and overlap on the lower end of function sizes. Here is the p99:
<span class="image fit"><a href="/assets/images/lambda_containers/python_container_p99.png" target="_blank"><img src="/assets/images/lambda_containers/python_container_p99.png" alt="Round trip cold start request time for python functions, p99" /></a></span></p>

<p>and here is the p50:
<span class="image fit"><a href="/assets/images/lambda_containers/python_container_p50.png" target="_blank"><img src="/assets/images/lambda_containers/python_container_p50.png" alt="Round trip cold start request time for python functions, p50" /></a></span></p>

<p>Here it is in chart form, once again looking at p99 over a week:
<span class="image fit"><a href="/assets/images/lambda_containers/python_rtt_chart.png" target="_blank"><img src="/assets/images/lambda_containers/python_rtt_chart.png" alt="Round trip cold start request time for python functions, p99, in chart form" /></a></span></p>

<p>We can see the closer variance at the 100mb and 150mb marks. For the 150mb test I was using Pandas, Flask, and Psycopg as dependencies. I’m not familiar with the internals of these libraries, so I don’t want to speculate on why these results are slightly unexpected.</p>

<p>My simplest answer is that this is a “real world” test using real dependencies. On top of a managed service like Lambda as well as some amount of network latency in a shared multi-tenant system - many variables could be confounding here.</p>

<h2 id="performance-takeaways">Performance Takeaways</h2>
<p>For NodeJS, beyond ~30mb, container images <em>outperform</em> zip based Lambda functions in cold start performance.</p>

<p>For Python, container images <strong>vastly outperform</strong> zip based Lambda functions beyond 200mb in size.</p>

<p>This result is incredible, because Lambda container images (in total) are much much larger than the comparative zip files.</p>

<p>I want to stress that the size of dependencies is only one factor that plays into cold starts. Besides size, other factors impact static initialization time including:</p>
<ul>
  <li>Size and number of heap allocations</li>
  <li>Computations performed during init</li>
  <li>Network requests made during init</li>
</ul>
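<p>To make the list above concrete, here’s a minimal, hypothetical Python sketch (not from the benchmark repo) contrasting eager module-scope work, which is paid during the init phase, with lazy initialization deferred to the first invocation:</p>

```python
import time

# Module scope runs during the INIT phase, once per sandbox, so heavy work
# here adds directly to cold start latency. (Illustrative sketch only.)
_init_started = time.monotonic()

# Eager: a large allocation at import time is paid during INIT.
EAGER_LOOKUP = {i: str(i) for i in range(100_000)}

_cached_model = None

def get_model():
    """Lazy alternative: defer expensive work until first use in the handler."""
    global _cached_model
    if _cached_model is None:
        _cached_model = {i: str(i) for i in range(100_000)}  # stand-in for real work
    return _cached_model

def handler(event, context=None):
    # Only the first invocation on this sandbox pays the lazy-init cost;
    # later invocations reuse the cached result.
    model = get_model()
    return {"keys": len(model)}
```

The same trade-off applies to network calls and file reads made at module scope: deferring them shrinks init duration at the cost of a slightly slower first invocation.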

<p>These nuances are covered in my <a href="https://youtu.be/2EDNcPvR45w">talk at AWS re:Invent</a> if you want to dig deeper on the topic of cold starts.
All of these individual projects are <a href="https://github.com/astuyve/benchmarks">available on GitHub</a>.</p>

<h2 id="should-you-use-containers-on-lambda">Should you use containers on Lambda?</h2>
<p>I am not advocating that you choose containers as a packaging mechanism for your Lambda function based <em>solely</em> on cold start performance.</p>

<p>That said, <strong>you should be using containers on Lambda</strong> anyway. With these cold start performance improvements, there are very few reasons <em>not</em> to.</p>

<p>While it’s technically true that container images are an objectively less efficient means of deploying software applications, container images should be the standard for Lambda functions going forward.</p>

<p>Pros:</p>
<ul>
  <li>Containers are ubiquitous in software development, and so many tools and developer workflows already revolve around them. It’s easy to find and hire developers who already know how to use containers.</li>
  <li>Multi-stage builds are clear and easy to understand, allowing you to easily create the lightest and smallest image possible.</li>
  <li>Graviton on Lambda is quickly becoming the preferred architecture, and container images make x86/ARM cross-compilation easy. This is even more relevant now, as Apple silicon becomes a popular choice for developers.</li>
  <li>Base images for Lambda are updated frequently, and it’s easy enough to auto-deploy the latest image version containing security updates</li>
  <li>Containers support larger functions, up to 10gb</li>
  <li>You can use custom runtimes like Bun or Deno, and adopt new runtime versions more easily</li>
  <li>Using the excellent <a href="https://github.com/awslabs/aws-lambda-web-adapter">Lambda web adapter extension</a> with a container, you can very easily move a function from Lambda to Fargate or App Runner if cost becomes an issue. This optionality is of high value, and shouldn’t be overlooked.</li>
  <li>AWS and the broader software development community continues to invest heavily in the container image standard. These improvements to Lambda represent the result of this investment, and I expect that to continue.</li>
</ul>

<p>Cons:</p>
<ul>
  <li>To update dependencies managed by Lambda runtimes, you’ll need to re-build your container image and re-deploy your function occasionally. This is something dependabot can easily do, but it could be painful if you have thousands of functions. These updates come free with managed runtimes anyway.</li>
  <li>You do pay for the init duration of container images. Today, the Lambda documentation claims that init duration is <a href="https://aws.amazon.com/lambda/pricing/">always billed</a>, but in practice the init duration for managed runtimes is not included in the billed duration reported in the REPORT log line at the end of every execution.</li>
  <li>Slower deployment speeds</li>
  <li>The very first cold start for a new function or function update seems to be quite slow (p99 ~5+ seconds for a large function). This makes the iterate + test loop feel slow. In any production environment, this should be mitigated by invoking an alias (other than <code class="language-plaintext highlighter-rouge">$LATEST</code>). In practice I’ve noticed this goes away if I wait a bit between deployment and invocation. This isn’t great and ideally the Lambda team fixes it soon, but in production it shouldn’t be a problem.</li>
</ul>
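<p>On the billing point: those timing fields land in the <code>REPORT</code> line Lambda writes after every invocation, and <code>Init Duration</code> only appears when the sandbox was initialized for that invocation. Here’s a small sketch (my own helper, not an AWS API) that pulls them out:</p>

```python
import re

def parse_report(line: str) -> dict:
    """Extract timing fields from a Lambda REPORT log line (sketch).

    Returns floats in milliseconds; a field missing from the line maps to None.
    """
    fields = {
        # Negative lookbehinds keep the bare "Duration" from matching
        # inside "Billed Duration" or "Init Duration".
        "duration_ms": r"(?<!Billed )(?<!Init )Duration: ([\d.]+) ms",
        "billed_ms": r"Billed Duration: ([\d.]+) ms",
        "init_ms": r"Init Duration: ([\d.]+) ms",
    }
    out = {}
    for name, pattern in fields.items():
        m = re.search(pattern, line)
        out[name] = float(m.group(1)) if m else None
    return out
```

Comparing <code>init_ms</code> against <code>billed_ms</code> across your own logs is an easy way to verify what you are actually being charged for.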

<p>If all of your functions are under 30mb and your team is comfortable with zip files, then it may be worth continuing with zip files.
For me personally, all new Lambda-backed APIs I create are based on container images using the Lambda web adapter.</p>

<p>Ultimately your team and anyone you hire likely <strong>already knows how to use containers</strong>. Containers start as fast or faster than zip functions, have more powerful build configurations, and more easily support existing workflows. Finally, containers make it easy to optionally move your application to something like Fargate or AppRunner if costs become a primary concern.</p>

<p>It’s time to use containers on Lambda.</p>

<h2 id="thanks-for-reading">Thanks for reading!</h2>
<p>The next post in this series explores how this performance improvement was designed. It’s an example of excellent systems engineering work, and it represents why I’m so bullish on serverless in the long term.</p>

<p>If you like this type of content please subscribe to my <a href="https://aaronstuyvenberg.com">blog</a> or reach out on <a href="https://twitter.com/astuyve">twitter</a> with any questions. You can also ask me questions directly if I’m <a href="twitch.tv/aj_stuyvenberg">streaming on Twitch</a> or <a href="https://www.youtube.com/channel/UCsWwWCit5Y_dqRxEFizYulw">YouTube</a>.</p>]]></content><author><name>AJ Stuyvenberg</name></author><category term="posts" /><summary type="html"><![CDATA[Lambda recently improved the cold start performance of container images by up to 15x, but this isn't the only reason you should use them. The tooling, ecosystem, and entire developer culture has moved to container images and you should too.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://aaronstuyvenberg.com/assets/images/lambda_containers/containers_on_lambda.png" /><media:content medium="image" url="https://aaronstuyvenberg.com/assets/images/lambda_containers/containers_on_lambda.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">You shouldn’t use Lambda layers</title><link href="https://aaronstuyvenberg.com/posts/why-you-should-not-use-lambda-layers" rel="alternate" type="text/html" title="You shouldn’t use Lambda layers" /><published>2023-11-08T00:00:00+00:00</published><updated>2023-11-08T00:00:00+00:00</updated><id>https://aaronstuyvenberg.com/posts/why-you-should-not-use-lambda-layers</id><content type="html" xml:base="https://aaronstuyvenberg.com/posts/why-you-should-not-use-lambda-layers"><![CDATA[<h2 id="why-you-shouldnt-use-lambda-layers">Why you shouldn’t use Lambda layers</h2>
<p><a href="https://docs.aws.amazon.com/lambda/latest/dg/chapter-layers.html">Lambda layers</a> are a special packaging mechanism provided by AWS Lambda to manage dependencies for zip-based Lambda functions. Layers themselves are nothing more than a <em>sparkling</em> zip file, but they have a few interesting properties which prove useful in some cases. Unfortunately Lambda layers are also difficult to work with as a developer, tricky to deploy safely, and typically don’t offer benefits over native package managers. These downsides frequently outweigh the upsides, and we’ll examine both in detail.</p>

<p>By the end of this post, you’ll understand the pitfalls of general Lambda layer use as well as the niche cases where layers may make sense.</p>

<h2 id="busting-lambda-layer-myths">Busting Lambda layer Myths</h2>
<p>When I ask developers why they are using Lambda layers, I often learn the underlying reasons are misguided. It’s not entirely their fault: the <a href="https://docs.aws.amazon.com/lambda/latest/dg/chapter-layers.html">documentation</a> makes some imprecise claims which may perpetuate these myths.</p>

<h3 id="lambda-layers-do-not-circumvent-the-250mb-size-limit">Lambda layers do not circumvent the 250mb size limit</h3>
<p>I frequently hear folks say they are leveraging Lambda layers to “raise the 250mb limit placed on zip-based Lambda functions”. That’s simply <em>not true</em>. The size of the unzipped function <em>and all attached layers</em> <a href="https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html">must be less than 250mb</a>.</p>
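<p>If you want to verify this yourself before a deployment fails, the check is simple: sum the <em>uncompressed</em> sizes of the function package and every attached layer. A sketch (my own helper, using only the Python standard library):</p>

```python
import zipfile

# The unzipped function plus all attached layers must stay under this limit.
LIMIT_BYTES = 250 * 1024 * 1024

def unzipped_size(path: str) -> int:
    """Sum of the uncompressed file sizes inside a zip archive."""
    with zipfile.ZipFile(path) as zf:
        return sum(info.file_size for info in zf.infolist())

def within_limit(function_zip: str, layer_zips: list) -> bool:
    """True if function + layers fit within the 250mb unzipped limit."""
    total = unzipped_size(function_zip) + sum(unzipped_size(p) for p in layer_zips)
    return total < LIMIT_BYTES
```

Running this in CI against your build artifacts catches the limit violation before CloudFormation does.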

<p>This misunderstanding springs from the very first point in the documentation which states that Lambda layers “reduce the size of your deployment packages”. While technically it is true that the specific <em>function code</em> you deploy can be reduced with layers, the overall size of the function when it runs in Lambda does not change.</p>

<p>This leads me to my next point.</p>

<h3 id="lambda-layers-do-not-improve-or-reduce-cold-start-initialization-duration">Lambda layers do not improve or reduce cold start initialization duration</h3>
<p>Developers often mistake that a “reduced deployment package” size will reduce cold start latency. This is also untrue, as we already know that the <a href="https://twitter.com/astuyve/status/1716125268060860768">code you load</a> is the single largest contributor to cold start latency. Whether or not these bytes come from a layer or simply the function zip itself is irrelevant to the resulting initialization duration.</p>

<h2 id="development-pain-with-layers">Development pain with Layers</h2>
<p>One of the biggest challenges for developers leveraging Lambda layers is that they appear <code class="language-plaintext highlighter-rouge">magically</code> when a handler executes. While that feat is impressive technically, it poses an issue for developers as text editors and IDEs expect dependencies to be locally available, as do bundlers, test runners, and lint tools. If you run your function code locally or use an emulator, only a subset of those tools cooperate with layers. Although solving these issues is possible, external dependencies provided by Lambda layers require special consideration and handling for limited benefit.</p>

<p>Often, the process of building and deploying Layers separately is enough to avoid them, but there are other reasons to avoid Lambda layers.</p>

<h2 id="cross-architecture-woes">Cross-architecture woes</h2>
<p>We’re writing software for a world which is increasingly powered by ARM chips. It may be your shiny new M3 laptop, or Amazon’s own (admittedly excellent) <a href="https://aws.amazon.com/blogs/aws/aws-lambda-functions-powered-by-aws-graviton2-processor-run-your-functions-on-arm-and-get-up-to-34-better-price-performance/">Graviton</a> processor. Your Lambda functions are likely running on x86 or a combination of ARM and x86 processors today.</p>

<p>Lambda layers <em>do</em> support metadata attributes called “supported runtimes” and “supported architectures”, but these are merely <em>labels</em>. They don’t prevent or enforce any runtime or deployment time compatibility. Imagine your surprise when you attach a binary compiled for x86 to your arm-based Lambda function and receive <code class="language-plaintext highlighter-rouge">exec format</code> errors!</p>
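<p>Since Lambda won’t catch the mismatch for you, it’s worth checking binaries yourself before publishing a layer. An ELF binary encodes its target architecture in the <code>e_machine</code> header field, so a few bytes are enough. A hypothetical pre-publish check (assumes little-endian encoding, which holds for both Lambda architectures):</p>

```python
import struct

# ELF e_machine values (from the ELF specification).
EM_X86_64 = 0x3E
EM_AARCH64 = 0xB7

def elf_architecture(path: str) -> str:
    """Return 'x86_64' or 'arm64' for an ELF binary, or raise (sketch)."""
    with open(path, "rb") as f:
        header = f.read(20)
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF binary")
    # e_machine is a 16-bit field at offset 18.
    (machine,) = struct.unpack_from("<H", header, 18)
    if machine == EM_X86_64:
        return "x86_64"
    if machine == EM_AARCH64:
        return "arm64"
    raise ValueError(f"unexpected e_machine: {machine:#x}")
```

Asserting <code>elf_architecture(...)</code> matches your function’s configured architecture in CI turns a runtime <code>exec format</code> error into a build failure.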

<p><a href="https://youtu.be/LrenCkwFhZs?t=4917">I demonstrated this failure live</a>.</p>

<h2 id="deployment-difficulties">Deployment difficulties</h2>
<p>Lambda layers do not support semantic versioning. Instead, they are immutable and versioned incrementally. While this does help prevent unintentional upgrades, incremental versioning offers no clues as to backwards compatibility or changes in the updated layer package. Additionally, Lambda layers are completely runtime agnostic and offer no manifest, lockfile, or packaging hints. Layers don’t provide a <code class="language-plaintext highlighter-rouge">package.json</code>, <code class="language-plaintext highlighter-rouge">pyproject.toml</code>, or <code class="language-plaintext highlighter-rouge">gemspec</code> file to ensure adequate dependency resolution. Instead it’s incumbent on the authors to only package compatible code.</p>

<p>One of the main selling points of Lambda layers is that they can share common dependencies between many functions, which is great if every function requires exactly the same compatible version of a dependency. But what happens when you want to upgrade a major version?</p>

<p>You’ll need to release a new version of the layer with the new major version, ensure that no developer accidentally applies the incrementally-adjusted layer (remember – no semantic versioning, manifest files, or lockfiles!), and then simultaneously upgrade the Lambda function code and layer at the same time.</p>

<p>But even <em>that</em> doesn’t work out automatically, as I’ve <a href="https://aaronstuyvenberg.com/posts/lambda-arch-switch">already documented</a>. Deploying a function + layer results in two separate, asynchronous API calls. <code class="language-plaintext highlighter-rouge">updateFunction</code> updates the function <em>code</em> while <code class="language-plaintext highlighter-rouge">updateFunctionConfiguration</code> updates the <em>configured layers</em>, and both of these are <em>separate</em> control plane operations which can happen in parallel. This means that invoking <code class="language-plaintext highlighter-rouge">$LATEST</code> will fail until both calls complete. To avoid this you’ll need to create a new function <em>version</em>, apply the new layer, and then update your integration (eg: ApiGateway) to point to the new alias, after both steps are complete.</p>
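<p>In boto3 terms, the safe ordering looks roughly like the sketch below. The client is injected so it can be a real boto3 Lambda client or a test double; the call signatures and the <code>function_updated_v2</code> waiter come from the boto3 documentation, but treat this as an outline rather than production code:</p>

```python
def deploy_function_and_layer(client, function_name, zip_bytes, layer_arns, alias="live"):
    """Update code and layers, publish an immutable version, then repoint the alias.

    `client` is expected to look like a boto3 Lambda client. Because callers
    invoke the alias (never $LATEST), traffic only shifts after both the code
    and configuration updates have completed.
    """
    client.update_function_code(FunctionName=function_name, ZipFile=zip_bytes)
    client.get_waiter("function_updated_v2").wait(FunctionName=function_name)

    client.update_function_configuration(FunctionName=function_name, Layers=layer_arns)
    client.get_waiter("function_updated_v2").wait(FunctionName=function_name)

    # Snapshot the now-consistent code + layer combination as a version,
    # then atomically point the alias at it.
    version = client.publish_version(FunctionName=function_name)["Version"]
    client.update_alias(FunctionName=function_name, Name=alias, FunctionVersion=version)
    return version
```

The key property is that the alias update is the only step visible to callers, and it happens last.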

<p>Now semantic versioning is not perfect, and flexible specification (eg: <code class="language-plaintext highlighter-rouge">~</code> or <code class="language-plaintext highlighter-rouge">^</code> for relative versions) means that the combination of bits executing your Lambda function may run together for the very first time in a staging or production environment. This has caused enough issues that package managers have solutions like <code class="language-plaintext highlighter-rouge">npm shrinkwrap</code>, but this can be even worse with Lambda layers.</p>

<p>And that’s the gist of my point – this is what your package manager should be doing.</p>

<h2 id="dependency-collisions">Dependency collisions</h2>
<p>Lambda layers can cause a particularly nasty bug, and it stems from how Lambda creates a filesystem from your deployment artifacts. If you’ve followed this blog, you know that <a href="https://aaronstuyvenberg.com/posts/impossible-assumptions">zip archives themselves</a> can already create interesting edge cases when unpacking a zip file onto a file system, and Lambda is not immune to that. When a Lambda function sandbox is created, the main function package is copied into the sandbox and then each layer is copied <a href="https://docs.aws.amazon.com/lambda/latest/dg/adding-layers.html">in order</a> into the same filesystem directory. This means that layers containing files with the same path and filename are squashed.</p>

<p>Although Lambda handler code is copied into a different directory than layer code, the runtime will decide where to look <em>first</em> for dependencies. This is typically handled by the order of directories listed in the <code class="language-plaintext highlighter-rouge">PATH</code> environment variable, or the runtime-specific variant like <code class="language-plaintext highlighter-rouge">NODE_PATH</code>, Ruby’s <code class="language-plaintext highlighter-rouge">GEM_PATH</code>, or Java’s <code class="language-plaintext highlighter-rouge">CLASS_PATH</code> as <a href="https://docs.aws.amazon.com/lambda/latest/dg/packaging-layers.html">documented here</a>.</p>
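<p>A toy simulation makes both behaviors easy to see: layers merge into one directory in order (last write wins), and the runtime then resolves a module by walking a search path (first match wins). The library name and versions here are made up for illustration:</p>

```python
def merge_layers(function_files: dict, layers: list) -> dict:
    """Simulate sandbox assembly: layers are extracted in order into the same
    directory (/opt), so a later layer silently overwrites same-named files
    from an earlier one."""
    opt = {}
    for layer in layers:
        opt.update(layer)  # last writer wins
    return {"task": dict(function_files), "opt": opt}

def resolve(module: str, search_path: list, fs: dict) -> str:
    """The first directory on the search path containing the module wins,
    mirroring how PATH / NODE_PATH / GEM_PATH style lookups behave."""
    for directory in search_path:
        if module in fs[directory]:
            return f"{directory}/{module} @ {fs[directory][module]}"
    raise ImportError(module)
```

Neither the squash during assembly nor the shadowing during resolution produces a warning, which is exactly what makes this class of bug hard to spot.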

<p>Consider a Lambda function and two layers which all depend on different versions of the same library. Layers don’t provide lockfiles or content metadata, so as a developer you may not be aware of this dependency conflict at build time or deployment time.
<span class="image fit"><a href="/assets/images/lambda_layers/layer_deploy_time.png" target="_blank"><img src="/assets/images/lambda_layers/layer_deploy_time.png" alt="Lambda function code requiring A @ 1.0, layer 1 requiring A @ 2.0, and layer 2 requiring A @ 3.0" /></a></span></p>

<p>At runtime, the layer code and function code are copied to their respective directories, but when the handler begins processing a request, it crashes with a syntax error! But your code ran fine locally?! What happened?</p>

<p>The code and dependencies in the Lambda layer expect to have access to version 2 of library ABC, but the runtime has already loaded version 1 of library ABC from the function zip file!
<span class="image fit"><a href="/assets/images/lambda_layers/layer_run_time.png" target="_blank"><img src="/assets/images/lambda_layers/layer_run_time.png" alt="Lambda function code loading library A @ 3.0!" /></a></span></p>

<p>If this seems farfetched, it can happen to you – because it <a href="https://github.com/DataDog/serverless-plugin-datadog/issues/321#issuecomment-1349044506">happened to me</a>.</p>

<h2 id="what-lambda-layers-can-do-for-you">What Lambda layers can do for you</h2>

<h3 id="lambda-layers-can-improve-function-deployment-speeds-but-so-can-your-ci-pipeline">Lambda layers <em>can</em> improve function deployment speeds (but so can your CI pipeline)</h3>
<p>Consider two Lambda functions with identical dependencies, one using layers (A), and one without (B).
It’s true that you can expect relatively shorter deployments for A, provided you aren’t also modifying and deploying the associated layer(s). However, the vast majority of CI/CD pipelines support dependency caching, so most users have clear paths towards fast deployments regardless of their use of layers. Yes, your CloudFormation deployment will be a bit longer, but ultimately there is not a distinct advantage here.</p>

<h3 id="lambda-layers-can-share-code-across-functions">Lambda layers can share code across functions</h3>
<p>Within the same region, one layer can be used across different Lambda functions. This admittedly can be super useful to share libraries for authentication or other cross-functional dependencies. This is especially useful if you (like me) need to <a href="https://github.com/datadog/datadog-lambda-extension">share layers</a> for other users, even publicly.</p>

<p>I don’t really agree with the other two points in the <a href="https://docs.aws.amazon.com/lambda/latest/dg/chapter-layers.html">documentation</a>. Layers may “separate core function logic from dependencies”, but only as much as putting that dependency in another file and <code class="language-plaintext highlighter-rouge">import</code>ing it. Your runtime does this already so this point falls a bit flat.</p>

<p>Finally, I don’t think it’s best to edit your production Lambda function code live in the console editor, and I <em>especially</em> don’t think you should modify your software development process to support this. (Cloud9 IDE is a good product, just don’t use the version in the Lambda console.)</p>

<h2 id="where-you-should-use-lambda-layers">Where you should use Lambda layers</h2>
<p>Lambda layers aren’t all bad; they’re a tool with some sharp edges (which AWS should fix!). There are a couple of exceptions where you can and should use Lambda layers.</p>

<ul>
  <li>Shared binaries</li>
</ul>

<p>If you have a commonly used binary like <code class="language-plaintext highlighter-rouge">ffmpeg</code> or <code class="language-plaintext highlighter-rouge">sharp</code>, it may be easier to compile those projects once and deploy them as a layer. It’s handy to share them across functions, and this specific layer will rarely need to be rebuilt and updated. Layers are best with established binaries containing solid API contracts, so you won’t need to deal with the deployment difficulties I listed earlier pertaining to major version upgrades.</p>
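<p>At runtime, layer contents are extracted under <code>/opt</code>, so your handler just needs to locate the binary there. A small lookup helper, sketched with my own conventions (<code>/opt/bin</code> is a common place for layer executables, not a requirement):</p>

```python
import shutil

def find_layer_binary(name: str, search_dirs=("/opt/bin", "/opt")) -> str:
    """Locate an executable shipped in a Lambda layer.

    Checks the conventional layer directories first, then falls back to the
    system PATH. Raises FileNotFoundError if the binary isn't anywhere.
    """
    for directory in search_dirs:
        candidate = shutil.which(name, path=directory)
        if candidate:
            return candidate
    found = shutil.which(name)  # fall back to the regular PATH
    if found is None:
        raise FileNotFoundError(name)
    return found
```

Your handler can then pass the resolved path straight to <code>subprocess.run</code>.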

<ul>
  <li>Custom runtimes</li>
</ul>

<p>The immensely popular <a href="https://bref.sh/docs/runtimes#aws-lambda-layers">Bref</a> PHP runtime is available as a Layer. Bref is available precompiled for both arm and x86, so it can make sense to use as a layer. The same is true for the <a href="https://bun.sh">Bun</a> javascript runtime. That being said - container images have become <a href="https://twitter.com/astuyve/status/1715789135804354734">far more performant</a> recently and are worth reconsidering, but that’s a subject for another post.</p>

<ul>
  <li>Lambda Extensions</li>
</ul>

<p>Extensions are a special type of Layer but have access to extra lifecycle events, async work, and post processing which regular Lambda handlers cannot access. Extensions can perform work asynchronously from the main handler function, and can execute code <em>after</em> the handler has returned a result to the caller. This makes Lambda Extensions a worthwhile exception to the above risks, especially if they are also pre-compiled, statically linked binary executables which won’t suffer from dependency collisions.</p>
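<p>To give a feel for that lifecycle, here’s a stripped-down event loop in the shape of an external extension. In a real extension, <code>register</code> and <code>next_event</code> would be HTTP calls to the Lambda Extensions API; they’re injected here as plain callables so the control flow stands on its own:</p>

```python
def run_extension(register, next_event, on_invoke, on_shutdown):
    """Minimal extension event loop (sketch).

    An extension registers for lifecycle events, then repeatedly asks for the
    next event. INVOKE events can be processed alongside (or after) the main
    handler; the SHUTDOWN event is the last chance to flush or post-process.
    """
    register(events=["INVOKE", "SHUTDOWN"])
    while True:
        event = next_event()  # blocks until the next lifecycle event
        if event["eventType"] == "SHUTDOWN":
            on_shutdown(event)  # runs after the final invocation has returned
            return
        on_invoke(event)
```

This after-the-response window is what telemetry extensions use to ship data without adding latency to the handler itself.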

<h2 id="wrapping-up">Wrapping up</h2>
<p>In specific cases it can be worthwhile to use Lambda layers. Specifically for Lambda extensions, or heavy compiled binaries. However Lambda layers should not replace the runtime-specific packaging and ecosystem you already have. Layers don’t offer semantic versioning, make breaking changes difficult to synchronize, cause headaches during development, and leave your software susceptible to dependency collisions.</p>

<p>If or when AWS offers semantic versioning, support for layer lockfiles, and integration with native package managers, I’ll happily reconsider these thoughts.</p>

<p>Use your package manager wherever you can, it’s a more capable tool and already solves these issues for you.</p>

<p>If you like this type of content please subscribe to my <a href="https://aaronstuyvenberg.com">blog</a> or reach out on <a href="https://twitter.com/astuyve">twitter</a> with any questions. You can also ask me questions directly if I’m <a href="twitch.tv/aj_stuyvenberg">streaming on Twitch</a> or <a href="https://www.youtube.com/channel/UCsWwWCit5Y_dqRxEFizYulw">YouTube</a>.</p>]]></content><author><name>AJ Stuyvenberg</name></author><category term="posts" /><summary type="html"><![CDATA[AWS Lambda layers can help in certain, narrow use cases. But they don't help reduce overall function size, they don't improve cold starts, and they leave you vulnerable to a particularly nasty bug.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://aaronstuyvenberg.com/assets/images/lambda_layers/lambda_layers_title.png" /><media:content medium="image" url="https://aaronstuyvenberg.com/assets/images/lambda_layers/lambda_layers_title.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Understanding AWS Lambda Proactive Initialization</title><link href="https://aaronstuyvenberg.com/posts/understanding-proactive-initialization" rel="alternate" type="text/html" title="Understanding AWS Lambda Proactive Initialization" /><published>2023-07-13T00:00:00+00:00</published><updated>2023-07-13T00:00:00+00:00</updated><id>https://aaronstuyvenberg.com/posts/understanding-proactive-initialization</id><content type="html" xml:base="https://aaronstuyvenberg.com/posts/understanding-proactive-initialization"><![CDATA[<p>This post is both longer and more popular than I anticipated, so I’ve decided to add a quick summary:</p>

<h2 id="tldr">TL;DR</h2>
<ul>
  <li>Lambda occasionally pre-initializes execution environments to reduce the number of cold start invocations.</li>
  <li>This does <em>NOT</em> mean you’ll never have a cold start</li>
  <li>The percentage of true cold start initializations to proactive initializations varies depending on many factors, but you can clearly observe it.</li>
  <li>Depending on your workload and latency tolerances, you may need Provisioned Concurrency.</li>
</ul>

<h2 id="lambda-proactive-initialization">Lambda Proactive Initialization</h2>

<p>In March 2023, AWS updated the documentation for the <a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html">Lambda Function Lifecycle</a>, and included this interesting new statement:</p>

<p>“For functions using unreserved (on-demand) concurrency, Lambda may proactively initialize a function instance, even if there’s no invocation.”</p>

<p>It goes on to say:</p>

<p>“When this happens, you can observe an unexpected time gap between your function’s initialization and invocation phases. This gap can appear similar to what you would observe when using provisioned concurrency.”</p>

<p>This sentence, buried in the docs, indicates something not widely known about AWS Lambda; that AWS may warm your functions to reduce the impact and frequency of cold starts, even when used on-demand!</p>

<p>Today, July 13th - they clarified this <a href="https://docs.aws.amazon.com/lambda/latest/dg/troubleshooting-invocation.html#troubleshooting-invocation-initialization-gap">further</a>:
“For functions using unreserved (on-demand) concurrency, Lambda occasionally pre-initializes execution environments to reduce the number of cold start invocations. For example, Lambda might initialize a new execution environment to replace an execution environment that is about to be shut down. If a pre-initialized execution environment becomes available while Lambda is initializing a new execution environment to process an invocation, Lambda can use the pre-initialized execution environment.”</p>

<p>This update is no accident. In fact it’s the result of several months I spent working closely with the AWS Lambda service team:</p>

<p><span class="image fit"><a href="/assets/images/proactive_init/proactive_init_support_ticket.png" target="_blank"><img src="/assets/images/proactive_init/proactive_init_support_ticket.png" alt="Screenshot of a support ticket I filed with AWS, showing that they've added documentation about Proactive Initialization" /></a></span></p>

<p><a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html">1 - Execution environments (see ‘Init Phase’ section)</a>, and <a href="https://docs.aws.amazon.com/lambda/latest/dg/troubleshooting-invocation.html#troubleshooting-invocation-initialization-gap">2 - Invocation Initialization gap</a></p>

<p>In this post we’ll define what a Proactively Initialized Lambda Sandbox is, how they differ from cold starts, and measure how frequently they occur.</p>

<h2 id="distributed-tracing--aws-lambda-proactive-initialization">Distributed Tracing &amp; AWS Lambda Proactive Initialization</h2>

<p>This adventure began when I noticed what appeared to be a bug in a distributed trace. The trace correctly measured the Lambda initialization phase, but appeared to show the first invocation occurring several minutes after initialization. This can happen with SnapStart, or Provisioned Concurrency - but this function wasn’t using either of these capabilities and was otherwise entirely unremarkable.</p>

<p>Here’s what the flamegraph looks like:</p>

<p><span class="image fit"><a href="/assets/images/proactive_init/flamegraph.png" target="_blank"><img src="/assets/images/proactive_init/flamegraph.png" alt="Screenshot of a flamegraph showing a large gap between initialization and invocation" /></a></span></p>

<p>We can see a massive gap between function initialization and invocation - in this case the invocation request wasn’t even made by the client until ~12 seconds after the sandbox was warmed up.</p>

<p>We’ve also observed cases where Initialization occurs several minutes before the first invocation, in this case the gap was nearly 6 minutes:</p>

<p><span class="image fit"><a href="/assets/images/proactive_init/flamegraph_long.png" target="_blank"><img src="/assets/images/proactive_init/flamegraph_long.png" alt="Screenshot of a flamegraph showing an even larger gap between initialization and invocation" /></a></span></p>

<p>After much discussion with the AWS Lambda Support team - I learned that I was observing a Proactively Initialized Lambda Sandbox.</p>

<p>It’s difficult to discuss Proactive Initialization at a technical level without first defining a cold start, so let’s start there.</p>

<h2 id="defining-a-cold-start">Defining a Cold Start</h2>
<p>AWS Lambda defines a cold start in the <a href="https://aws.amazon.com/blogs/compute/operating-lambda-performance-optimization-part-1/">documentation</a> as the time taken to download your application code and start the application runtime.</p>

<p><span class="image fit"><a href="/assets/images/proactive_init/cold_start_diagram.png" target="_blank"><img src="/assets/images/proactive_init/cold_start_diagram.png" alt="AWS's diagram showing the Lambda initialization phase" /></a></span></p>

<p>Until now, it was understood that cold starts would happen for any function invocation where there is no idle, initialized sandbox ready to receive the request (absent using SnapStart or Provisioned Concurrency).</p>

<p>When a function invocation experiences a cold start, users experience something ranging from 100ms to several additional seconds of latency, and developers observe an <code class="language-plaintext highlighter-rouge">Init Duration</code> reported in the CloudWatch logs for the invocation.</p>

<p>With cold starts defined, let’s expand this to understand the definition of Proactive Initialization.</p>

<h2 id="technical-definition-of-proactive-initialization">Technical Definition of Proactive Initialization</h2>
<p>Proactive Initialization occurs when a Lambda Function Sandbox is initialized without a pending Lambda invocation.</p>

<p>As a developer this is desirable, because each proactively initialized sandbox means one less painful cold start which otherwise a user would experience.</p>

<p>As a user of the application powered by Lambda, it’s as if there were never any cold starts at all.</p>

<p>When a function is proactively initialized, the user making the first request to the sandbox does not experience a cold start (similar to Provisioned Concurrency, but for free).</p>

<h2 id="aligned-interests-in-the-shared-responsibility-model">Aligned interests in the Shared Responsibility Model</h2>
<p>Proactive Initialization serves the interests of both the team running AWS Lambda and developers running applications on Lambda.</p>

<p>We know that from an economic perspective, AWS Lambda wants to run as many functions on the same server as possible (yes, serverless has servers…). We also know that developers want their cold starts to be as infrequent and fast as possible.</p>

<p>Understanding that cold starts absorb valuable CPU time (time which is currently not billed) in a shared, multi-tenant system, it’s clear that any option AWS has to minimize this time is mutually beneficial.</p>

<p>AWS Lambda is a distributed service. Worker fleets need to be redeployed, scaled out, scaled in, and respond to failures in the underlying hardware. After all - <a href="/assets/images/proactive_init/vogels.png">everything fails all the time</a>.</p>

<p>This means that even with steady-state throughput, Lambda will need to rotate function sandboxes for users over the course of hours or days. AWS does not publish minimum or maximum lease durations for a function sandbox, although in practice I’ve observed ~7 minutes on the low side and several hours on the high side.</p>

<blockquote>
  <p>Update: An <a href="https://docs.aws.amazon.com/pdfs/whitepapers/latest/security-overview-aws-lambda/security-overview-aws-lambda.pdf">AWS whitepaper</a> states that the maximum lease lifetime for a worker sandbox is 14 hours. Thanks to <a href="https://twitter.com/philandstuff/status/1693579220021108817">Philip Potter</a> for pointing this out!</p>
</blockquote>

<p>The service also needs to run efficiently, combining as many functions onto one machine as possible. In distributed systems parlance, this is known as <code class="language-plaintext highlighter-rouge">bin packing</code> (aka shoving as much stuff as possible into the same bucket).</p>

<p>The less time spent initializing functions which AWS <em>knows</em> will serve invocations, the better for everyone.</p>

<h2 id="when-lambda-will-proactively-initialize-your-function">When Lambda will Proactively Initialize your function</h2>
<p>There are some logical conditions which can lead to Proactive Initialization - deployments and eager assignments.</p>

<p>Consider a function which, at steady state, experiences 100 concurrent invocations. When you deploy a change to your function (or function configuration), AWS can make a pretty reasonable guess that you’ll continue to invoke that same function 100 times concurrently after the deployment finishes.</p>

<p>Instead of waiting for each invocation to trigger a cold start, AWS will automatically re-provision (roughly) 100 sandboxes to absorb that load when the deployment finishes. Some users will still experience the full cold start duration, but some won’t (depending on the request duration and when requests arrive).</p>

<p>This can similarly occur when Lambda needs to rotate or roll out new Lambda Worker hosts.</p>

<p>These aren’t novel optimizations in the realm of distributed systems, but this is the first time AWS has confirmed they make these optimizations.</p>

<h2 id="proactive-initialization-due-to-eager-assignments">Proactive Initialization due to Eager Assignments</h2>
<p>In certain cases, Proactive Initialization is a consequence of natural traffic patterns in your application. An internal system called the AWS Lambda Placement Service assigns pending Lambda invocation requests to sandboxes as they become available.</p>

<p>Here’s how it works:</p>

<p>Consider a running Lambda function which is currently processing a request. In this case, only one sandbox is running. When a new request triggers a Lambda function, AWS’s Lambda Control Plane will check for available <code class="language-plaintext highlighter-rouge">warm</code> sandboxes to run your request.</p>

<p>If none are available, a new sandbox is initialized by the Control Plane:</p>

<p><span class="image fit"><a href="/assets/images/proactive_init/proactive_seq_1.png" target="_blank"><img src="/assets/images/proactive_init/proactive_seq_1.png" alt="Step one where the Lambda control plane has assigned a pending request to a warm sandbox" /></a></span></p>

<p>However, it’s possible that in this time a warm sandbox completes its request and is ready to receive a new one.
In this case, Lambda will assign the request to the newly-freed warm sandbox.</p>

<p><span class="image fit"><a href="/assets/images/proactive_init/proactive_seq_2.png" target="_blank"><img src="/assets/images/proactive_init/proactive_seq_2.png" alt="Step two where the Lambda control plane has assigned a pending request to a newly-freed sandbox" /></a></span></p>

<p>The new sandbox which was created now has no request to serve. It is still kept warm and can serve new requests - but no user waited for it to warm up.</p>

<p><span class="image fit"><a href="/assets/images/proactive_init/proactive_seq_3.png" target="_blank"><img src="/assets/images/proactive_init/proactive_seq_3.png" alt="Proactive init after being assigned a warm sandbox!" /></a></span></p>

<p>This is a proactive initialization.</p>

<p>When a new request arrives, it can be routed to this warm container with no delay!</p>
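The sequence above boils down to a race between sandbox initialization and the busy sandbox finishing its current request. Here’s a toy model of that race (all timings and the `simulate` helper are hypothetical; the real Placement Service internals aren’t public):

```javascript
// Toy model of eager assignment. Timings in ms are invented for
// illustration; this is not how Lambda is actually implemented.
function simulate({ initDuration, busyUntil, requestArrivesAt }) {
  // A request arrives while the only warm sandbox is busy, so the
  // control plane starts initializing a new sandbox.
  const newSandboxReadyAt = requestArrivesAt + initDuration;
  if (busyUntil < newSandboxReadyAt) {
    // The busy sandbox freed up first: it takes the request, and the
    // new sandbox becomes a proactively initialized spare.
    return { servedBy: 'existing', proactiveInit: true, waitedMs: busyUntil - requestArrivesAt };
  }
  // The new sandbox won the race: the request pays the full cold start.
  return { servedBy: 'new', proactiveInit: false, waitedMs: initDuration };
}

// Init takes 800ms, but the busy sandbox frees up after 300ms: the
// request waits only 300ms and the new sandbox sits warm, request-free.
console.log(simulate({ initDuration: 800, busyUntil: 300, requestArrivesAt: 0 }));
```

The request still waited (300ms here), just not for the full cold start - which is exactly what the next paragraph cautions about.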

<p>Request B did spend some time waiting for a sandbox (but less than the full duration of a cold start). This latency is not reflected in the duration metric, which is why it’s important to monitor the end-to-end latency of any synchronous request through the calling service (like API Gateway)!</p>
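One simple way to see that gap is to time the request from the caller’s side and compare it against the function’s reported duration metric. A minimal sketch, assuming Node 18+ (for the global `fetch`) and a placeholder API Gateway URL:

```javascript
// Time a request end-to-end from the caller's perspective. The gap
// between this number and the function's Duration metric includes
// placement wait, cold start, and network time.
async function timeRequest(url) {
  const start = process.hrtime.bigint();
  const res = await fetch(url);
  await res.arrayBuffer(); // drain the body so we time the full response
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  return { status: res.status, elapsedMs };
}

// Hypothetical endpoint - substitute your own API Gateway URL:
// timeRequest('https://example.execute-api.us-east-1.amazonaws.com/prod/hello')
//   .then(({ status, elapsedMs }) => console.log(status, elapsedMs.toFixed(1)));
```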

<h2 id="detecting-proactive-initializations">Detecting Proactive Initializations</h2>
<p>We can leverage the fact that AWS Lambda functions must <a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html">initialize within 10 seconds</a>, otherwise the Lambda runtime is re-initialized from scratch. Using this fact, we can safely infer that a Lambda Sandbox is proactively initialized when:</p>
<ol>
  <li>More than 10 seconds has passed between the earliest part of function initialization and the first invocation processed, and</li>
  <li>We’re processing the first invocation for a sandbox.</li>
</ol>

<p>Both of these are easily tested; here’s the code for Node:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">coldStartSystemTime</span> <span class="o">=</span> <span class="k">new</span> <span class="nb">Date</span><span class="p">()</span>
<span class="kd">let</span> <span class="nx">functionDidColdStart</span> <span class="o">=</span> <span class="kc">true</span>

<span class="k">export</span> <span class="k">async</span> <span class="kd">function</span> <span class="nx">handler</span><span class="p">(</span><span class="nx">event</span><span class="p">,</span> <span class="nx">context</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">if</span> <span class="p">(</span><span class="nx">functionDidColdStart</span><span class="p">)</span> <span class="p">{</span>
    <span class="kd">const</span> <span class="nx">handlerWrappedTime</span> <span class="o">=</span> <span class="k">new</span> <span class="nb">Date</span><span class="p">()</span>
    <span class="kd">const</span> <span class="nx">proactiveInitialization</span> <span class="o">=</span> <span class="nx">handlerWrappedTime</span> <span class="o">-</span> <span class="nx">coldStartSystemTime</span> <span class="o">&gt;</span> <span class="mi">10000</span> <span class="p">?</span> <span class="kc">true</span> <span class="p">:</span> <span class="kc">false</span>
    <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">({</span><span class="nx">proactiveInitialization</span><span class="p">})</span>
    <span class="nx">functionDidColdStart</span> <span class="o">=</span> <span class="kc">false</span>
  <span class="p">}</span>
  <span class="k">return</span> <span class="p">{</span>
    <span class="na">statusCode</span><span class="p">:</span> <span class="mi">200</span><span class="p">,</span>
    <span class="na">body</span><span class="p">:</span> <span class="nx">JSON</span><span class="p">.</span><span class="nx">stringify</span><span class="p">({</span><span class="na">success</span><span class="p">:</span> <span class="kc">true</span><span class="p">})</span> 
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>and for Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">time</span>

<span class="n">init_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time_ns</span><span class="p">()</span> <span class="o">//</span> <span class="mi">1_000_000</span>
<span class="n">cold_start</span> <span class="o">=</span> <span class="bp">True</span>

<span class="k">def</span> <span class="nf">hello</span><span class="p">(</span><span class="n">event</span><span class="p">,</span> <span class="n">context</span><span class="p">):</span>
    <span class="k">global</span> <span class="n">cold_start</span>
    <span class="k">if</span> <span class="n">cold_start</span><span class="p">:</span>
        <span class="n">now</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time_ns</span><span class="p">()</span> <span class="o">//</span> <span class="mi">1_000_000</span>
        <span class="n">cold_start</span> <span class="o">=</span> <span class="bp">False</span>
        <span class="n">proactive_initialization</span> <span class="o">=</span> <span class="bp">False</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">now</span> <span class="o">-</span> <span class="n">init_time</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">10_000</span><span class="p">:</span>
            <span class="n">proactive_initialization</span> <span class="o">=</span> <span class="bp">True</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'{{proactiveInitialization: </span><span class="si">{</span><span class="n">proactive_initialization</span><span class="si">}</span><span class="s">}}'</span><span class="p">)</span>
    <span class="n">body</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"message"</span><span class="p">:</span> <span class="s">"Go Serverless v1.0! Your function executed successfully!"</span><span class="p">,</span>
        <span class="s">"input"</span><span class="p">:</span> <span class="n">event</span>
    <span class="p">}</span>

    <span class="n">response</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"statusCode"</span><span class="p">:</span> <span class="mi">200</span><span class="p">,</span>
        <span class="s">"body"</span><span class="p">:</span> <span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">body</span><span class="p">)</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="n">response</span>
</code></pre></div></div>

<h2 id="frequency-of-proactive-initializations">Frequency of Proactive Initializations</h2>
<p>At low throughput, there are virtually no proactive initializations for AWS Lambda functions. But I called this function over and over in an endless loop (thanks to AWS credits provided by the AWS Community Builder program), and noticed that almost <em>65%</em> of my cold starts were actually proactive initializations, and did not contribute to user-facing latency.</p>

<p>Here’s the query:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fields @timestamp, @message.proactiveInitialization
| filter proactiveInitialization == 0 or proactiveInitialization == 1
| stats count() by proactiveInitialization
</code></pre></div></div>

<p>Here’s the detailed breakdown; note that each bar reflects the sum of initializations:</p>

<p><span class="image fit"><a href="/assets/images/proactive_init/proactive_init_counts_1.png" target="_blank"><img src="/assets/images/proactive_init/proactive_init_counts_1.png" alt="Count of proactively initialized Lambda Sandboxes showing 56 proactive initializations and 33 cold starts." /></a></span></p>

<p>Running this query over several days across multiple runtimes and invocation methods, I observed between 50% and 75% of initializations were Proactive (versus 50% to 25% which were true Cold Starts):</p>

<p><span class="image fit"><a href="/assets/images/proactive_init/proactive_init_counts_2.png" target="_blank"><img src="/assets/images/proactive_init/proactive_init_counts_2.png" alt="Count of proactively initialized Lambda Sandboxes across node and python (including API Gateway)." /></a></span></p>

<p>We can see this reflected in the cumulative sum of invocations for a one day window. Here’s a python function invoked at a very high frequency:</p>

<p><span class="image fit"><a href="/assets/images/proactive_init/cumulative_sum_proactive_init.png" target="_blank"><img src="/assets/images/proactive_init/cumulative_sum_proactive_init.png" alt="Count of proactively initialized Lambda Sandboxes versus cold starts for a python function" /></a></span></p>

<p>After one day, we’ve had 63 Proactively Initialized Lambda Sandboxes and only 11 Cold Starts - 85% of initializations were proactive!</p>

<p>AWS Serverless Hero <a href="https://github.com/metaskills">Ken Collins</a> maintains a very popular <a href="https://github.com/rails-lambda">Rails-Lambda</a> package. After some discussion, he <a href="https://github.com/rails-lambda/lamby/pull/169">added the capability</a> to track Proactive Initializations and came to a similar conclusion - in his case after a 3-day test using Ruby with a custom runtime, 80% of initializations were proactive:</p>

<p><span class="image fit"><a href="/assets/images/proactive_init/lamby_count.png" target="_blank"><img src="/assets/images/proactive_init/lamby_count.png" alt="Count of proactively initialized Lambda Sandboxes versus cold starts for a ruby function" /></a></span></p>

<h2 id="confirming-what-we-suspected">Confirming what we suspected</h2>
<p>This post confirms what we’ve all speculated but never knew with certainty - AWS Lambda is warming your functions. We’ve demonstrated how you can observe this behavior, and followed this through until the public documentation was updated.</p>

<p>But that raises the question - what should you do about AWS Lambda Proactive Initialization?</p>

<h2 id="what-you-should-do-about-proactive-initialization">What you should do about Proactive Initialization</h2>
<p>Nothing.</p>

<p>This is the fulfillment of the promise of Serverless in a big way. You’ll get to focus on your own application while AWS improves the underlying infrastructure. Cold starts become something the cloud provider manages away, and you never have to think about them.</p>

<p>We use Serverless services because we offload undifferentiated heavy lifting to cloud providers. Your autoscaling needs and my autoscaling needs probably aren’t that similar, but with workloads taken in aggregate - millions of functions across thousands of customers - AWS can predictively scale out functions and improve performance for everyone involved.</p>

<h2 id="wrapping-it-up">Wrapping it up</h2>
<p>I hope you enjoyed this first look at Proactive Initialization, and learned a bit more about how to observe and understand your workloads on AWS Lambda. If you want to track metrics and/or APM traces for proactively initialized functions, it’s available for anyone using Datadog.</p>

<p>This was also my first post as an <a href="https://aws.amazon.com/developer/community/heroes/aj-stuyvenberg/">AWS Serverless Hero!</a> So if you like this type of content please subscribe to my <a href="https://aaronstuyvenberg.com">blog</a> or reach out on <a href="https://twitter.com/astuyve">twitter</a> with any questions.</p>]]></content><author><name>AJ Stuyvenberg</name></author><category term="posts" /><summary type="html"><![CDATA[AWS Lambda warms up your functions, such that 50%-85% of Lambda Sandbox initializations don't increase latency for users. In this article we'll define Proactive Initialization, observe its frequency, and help you identify invocations where your cold starts weren't really that cold.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://aaronstuyvenberg.com/assets/images/server_smile.png" /><media:content medium="image" url="https://aaronstuyvenberg.com/assets/images/server_smile.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Thawing your Lambda Cold Starts with Lazy Loading</title><link href="https://aaronstuyvenberg.com/posts/lambda-lazy-loading" rel="alternate" type="text/html" title="Thawing your Lambda Cold Starts with Lazy Loading" /><published>2023-05-26T00:00:00+00:00</published><updated>2023-05-26T00:00:00+00:00</updated><id>https://aaronstuyvenberg.com/posts/lambda-lazy-loading</id><content type="html" xml:base="https://aaronstuyvenberg.com/posts/lambda-lazy-loading"><![CDATA[<p>If you’ve heard anything about Serverless Applications or AWS Lambda Functions, you’ve certainly heard of the dreaded Cold Start. I’ve written a lot about Cold Starts, and I spend a great deal of time measuring and comparing various <a href="https://aaronstuyvenberg.com/posts/aws-sdk-comparison">Cold Start Benchmarks</a>.</p>

<p>In this post we’ll recap what a Cold Start is, then we’ll define a technique called Lazy Loading, show you when and how to use it, and measure the outcome!</p>

<h2 id="what-is-a-cold-start">What is a Cold Start?</h2>
<p>Lambda sandboxes are created on demand when a new request arrives, but live for multiple sequential invocations of a function. When an application experiences an increase in traffic, Lambda must create additional sandboxes.</p>

<p>The additional latency caused by this sandbox creation (which the user also experiences) is known as a Cold Start:</p>

<p><span class="image fit"><a href="/assets/images/cold_start.jpg" target="_blank"><img src="/assets/images/cold_start.jpg" alt="Cold Start diagram" /></a></span></p>

<h2 id="sample-app">Sample App</h2>
<p>This application is a Todo list, which is built for multiple tenants. This application is built using AWS Lambda, API Gateway, and DynamoDB.</p>

<p>One particular user (we can pick on me, AJ, in this case) demands that he be notified by SNS any time a new <code class="language-plaintext highlighter-rouge">Todo item</code> is added to his list.
The architecture of this application looks like this:</p>

<p><span class="image fit"><a href="/assets/images/lazy_load_arch.jpg" target="_blank"><img src="/assets/images/lazy_load_arch.jpg" alt="Lazy Load Todo Architecture" /></a></span></p>

<h2 id="eager-loading">Eager Loading</h2>
<p>Eager loading happens when you load a dependency by calling <code class="language-plaintext highlighter-rouge">require</code>, or <code class="language-plaintext highlighter-rouge">import</code> at the top of your function code.</p>

<p>Normally, dependencies in your function are eagerly loaded - that is, loaded during initialization. For the Node, Python, and Ruby runtimes, your dependencies are loaded when the runtime begins reading your handler files, processing each <code class="language-plaintext highlighter-rouge">require</code> or <code class="language-plaintext highlighter-rouge">import</code> in the order they are written. If you’re writing Rust or Go, this is the default behavior as well, because binaries are statically compiled into one file.</p>

<p>This code is very typical and you’ve probably seen it many times. At the top of the file, we load a DynamoDB client along with a SNS client, then we move on to process the payload:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="dl">'</span><span class="s1">use strict</span><span class="dl">'</span><span class="p">;</span>

<span class="kd">const</span> <span class="p">{</span> <span class="nx">DynamoDBClient</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">"</span><span class="s2">@aws-sdk/client-dynamodb</span><span class="dl">"</span><span class="p">);</span>
<span class="kd">const</span> <span class="p">{</span> <span class="nx">DynamoDBDocumentClient</span><span class="p">,</span> <span class="nx">PutCommand</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">"</span><span class="s2">@aws-sdk/lib-dynamodb</span><span class="dl">"</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">dynamoClient</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">DynamoDBClient</span><span class="p">({</span> <span class="na">region</span><span class="p">:</span> <span class="nx">process</span><span class="p">.</span><span class="nx">env</span><span class="p">.</span><span class="nx">AWS_REGION</span> <span class="p">});</span>
<span class="kd">const</span> <span class="nx">ddbClient</span> <span class="o">=</span> <span class="nx">DynamoDBDocumentClient</span><span class="p">.</span><span class="k">from</span><span class="p">(</span><span class="nx">dynamoClient</span><span class="p">);</span>

<span class="kd">const</span> <span class="p">{</span> <span class="nx">SNSClient</span><span class="p">,</span> <span class="nx">PublishBatchCommand</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">"</span><span class="s2">@aws-sdk/client-sns</span><span class="dl">"</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">snsClient</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">SNSClient</span><span class="p">({</span> <span class="na">region</span><span class="p">:</span> <span class="nx">process</span><span class="p">.</span><span class="nx">env</span><span class="p">.</span><span class="nx">AWS_REGION</span> <span class="p">});</span>
<span class="kd">const</span> <span class="p">{</span> <span class="na">v4</span><span class="p">:</span> <span class="nx">uuidv4</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">"</span><span class="s2">uuid</span><span class="dl">"</span><span class="p">);</span>

<span class="c1">// handler code in gist</span>
</code></pre></div></div>

<p>The full code is available <a href="https://gist.github.com/astuyve/2e7fe4b39a7ffcfa0646deb9e147802d">here</a>.</p>

<h2 id="eager-loading-cold-start">Eager Loading Cold Start</h2>
<p>We can measure the duration of this Cold Start Trace and see that the DynamoDB client loads in around 360ms. The DynamoDB client also depends on the AWS STS client, which is true of SNS and most other services. The trace looks like this:</p>

<p><span class="image fit"><a href="/assets/images/eager_load_dynamodb.png" target="_blank"><img src="/assets/images/eager_load_dynamodb.png" alt="Eager Load DynamoDB Cold Start Trace" /></a></span></p>

<p>Further down the flamegraph we see SNS loads in another 50ms:</p>

<p><span class="image fit"><a href="/assets/images/eager_load_sns.png" target="_blank"><img src="/assets/images/eager_load_sns.png" alt="Eager Load SNS Cold Start Trace" /></a></span></p>

<h2 id="lazy-loading-to-improve-performance">Lazy Loading to improve performance</h2>
<p>If we have hundreds or thousands of users, AJ’s <code class="language-plaintext highlighter-rouge">todo</code> items may represent only 5% or 1% of calls to this endpoint. However, we load the SNS client on <em>every single initialization</em>, regardless of whether we’ll use SNS!</p>

<p>Let’s fix this!</p>

<p>To improve this performance we can move our <code class="language-plaintext highlighter-rouge">require</code> statement into a method which we’ll call only when a <code class="language-plaintext highlighter-rouge">Todo item</code> from AJ is received. Don’t worry that we reassign this variable - in NodeJS, calls to <code class="language-plaintext highlighter-rouge">require</code> are cached, so this module load will only occur once, on the first call to <code class="language-plaintext highlighter-rouge">loadSns()</code>. We could also check whether the snsClient variable is undefined before calling the method, but brevity is preferred here.</p>

<p>This strategy is also effective for Ruby and Python (as well as Java and other languages).</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="dl">'</span><span class="s1">use strict</span><span class="dl">'</span><span class="p">;</span>

<span class="kd">const</span> <span class="p">{</span> <span class="nx">DynamoDBClient</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">"</span><span class="s2">@aws-sdk/client-dynamodb</span><span class="dl">"</span><span class="p">);</span>
<span class="kd">const</span> <span class="p">{</span> <span class="nx">DynamoDBDocumentClient</span><span class="p">,</span> <span class="nx">PutCommand</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">"</span><span class="s2">@aws-sdk/lib-dynamodb</span><span class="dl">"</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">dynamoClient</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">DynamoDBClient</span><span class="p">({</span> <span class="na">region</span><span class="p">:</span> <span class="nx">process</span><span class="p">.</span><span class="nx">env</span><span class="p">.</span><span class="nx">AWS_REGION</span> <span class="p">});</span>
<span class="kd">const</span> <span class="nx">ddbClient</span> <span class="o">=</span> <span class="nx">DynamoDBDocumentClient</span><span class="p">.</span><span class="k">from</span><span class="p">(</span><span class="nx">dynamoClient</span><span class="p">);</span>

<span class="kd">let</span> <span class="nx">snsClient</span><span class="p">,</span> <span class="nx">PublishBatchCommand</span><span class="p">,</span> <span class="nx">SNSClient</span>
<span class="kd">const</span> <span class="p">{</span> <span class="na">v4</span><span class="p">:</span> <span class="nx">uuidv4</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">"</span><span class="s2">uuid</span><span class="dl">"</span><span class="p">);</span>

<span class="kd">const</span> <span class="nx">loadSns</span> <span class="o">=</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span>
  <span class="p">({</span> <span class="nx">SNSClient</span><span class="p">,</span> <span class="nx">PublishBatchCommand</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">"</span><span class="s2">@aws-sdk/client-sns</span><span class="dl">"</span><span class="p">));</span>
  <span class="nx">snsClient</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">SNSClient</span><span class="p">({</span> <span class="na">region</span><span class="p">:</span> <span class="nx">process</span><span class="p">.</span><span class="nx">env</span><span class="p">.</span><span class="nx">AWS_REGION</span> <span class="p">});</span>
<span class="p">}</span>

<span class="nx">module</span><span class="p">.</span><span class="nx">exports</span><span class="p">.</span><span class="nx">addItem</span> <span class="o">=</span> <span class="k">async</span> <span class="p">(</span><span class="nx">event</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
  <span class="kd">const</span> <span class="nx">body</span> <span class="o">=</span> <span class="nx">JSON</span><span class="p">.</span><span class="nx">parse</span><span class="p">(</span><span class="nx">event</span><span class="p">.</span><span class="nx">body</span><span class="p">);</span>
  <span class="kd">const</span> <span class="nx">promises</span> <span class="o">=</span> <span class="p">[]</span>
  <span class="kd">const</span> <span class="nx">newItemId</span> <span class="o">=</span> <span class="nx">uuidv4</span><span class="p">();</span>
  <span class="c1">// It's for AJ - load the SNS client!</span>
  <span class="k">if</span> <span class="p">(</span><span class="nx">body</span><span class="p">.</span><span class="nx">userId</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">aj</span><span class="dl">'</span><span class="p">)</span> <span class="p">{</span>
    <span class="nx">loadSns</span><span class="p">();</span>
    <span class="c1">// ... rest of handler code in gist</span>
</code></pre></div></div>

<p>The full code is available <a href="https://gist.github.com/astuyve/94029d6206eaf144903579cb5d1ea843">here</a>.</p>

<p>Lazy Loading means that we only load the <code class="language-plaintext highlighter-rouge">SNS</code> client when we need it - so let’s take a look at the Cold Start Trace when a normal user creates a <code class="language-plaintext highlighter-rouge">Todo item</code>:</p>

<p><span class="image fit"><a href="/assets/images/lazy_load_dynamodb.png" target="_blank"><img src="/assets/images/lazy_load_dynamodb.png" alt="Lazy Load DynamoDB Cold Start Trace" /></a></span></p>

<p>We can see that the handler loads in 401ms compared to the previous 478ms - that’s a 16% decrease in latency for normal users experiencing a Cold Start!</p>

<p>So what happens when a <code class="language-plaintext highlighter-rouge">Todo item</code> is created for AJ? You can see that the ~80ms is shifted to the AWS Lambda Handler function span, where AJ has to wait for the SNS client to load:</p>

<p><span class="image fit"><a href="/assets/images/lazy_load_sns.png" target="_blank"><img src="/assets/images/lazy_load_sns.png" alt="Lazy Load SNS Cold Start Trace" /></a></span></p>

<p>Subsequent invocations for AJ won’t result in any additional latency, as modules are cached by the Node process (or Ruby, or Python), so subsequent calls to <code class="language-plaintext highlighter-rouge">loadSns()</code> are effectively a no-op. If additional <code class="language-plaintext highlighter-rouge">Todo items</code> are created for AJ after the initial load from <code class="language-plaintext highlighter-rouge">loadSns()</code>, we only see the parallel calls to SNS and DynamoDB in the trace:</p>

<p><span class="image fit"><a href="/assets/images/lazy_load_sns_second.png" target="_blank"><img src="/assets/images/lazy_load_sns_second.png" alt="Lazy Load SNS Cold Start Trace, second call" /></a></span></p>

<p>We could clean up the implementation to codify this behavior, but I think that exercise is best left to the reader.</p>
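One way that cleanup might look (a sketch built around a hypothetical `lazy` helper, not canonical code from this post) is a small memoizing wrapper that hides the load-once check entirely:

```javascript
// Sketch of a memoized lazy loader: the factory runs on first use only,
// and every later call returns the cached instance.
const lazy = (factory) => {
  let instance;
  return () => {
    if (instance === undefined) {
      instance = factory();
    }
    return instance;
  };
};

// Hypothetical usage at the top of the handler module:
// const getSnsClient = lazy(() => {
//   const { SNSClient } = require('@aws-sdk/client-sns');
//   return new SNSClient({ region: process.env.AWS_REGION });
// });
// ...then, only on AJ's requests: getSnsClient().send(...)

// Demonstration with a counting factory:
let loads = 0;
const getClient = lazy(() => { loads += 1; return { name: 'client' }; });
getClient();
getClient();
console.log(loads); // 1 - the factory ran exactly once
```

This keeps the lazy-loading behavior of `loadSns()` while removing the mutable module-level flags from the handler body.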

<h2 id="wrapping-up">Wrapping up</h2>
<p>Keen observers would point out that the <code class="language-plaintext highlighter-rouge">init</code> portion of a Lambda execution lifecycle is free. And they’re right! For now. AWS doesn’t promise that the init duration is free (although this is <a href="https://bitesizedserverless.com/bite/when-is-the-lambda-init-phase-free-and-when-is-it-billed/">widely observed</a> and has been for some time).</p>

<p>Cost in dollars shouldn’t really be a factor here, as the overall number of cold starts is limited, and shifting this dependency load onto one special-cased user is worth saving every other user the initialization time.</p>

<p>This technique is especially applicable to <a href="https://aaronstuyvenberg.com/posts/monolambda-vs-individual-function-api">mono-lambda APIs</a> where dependencies can vary by route, or specific users like in this simple example. I’d also make a strong case that this type of atypical behavior ought to be refactored out into a separate Lambda Function, but that will be a topic for a different day.</p>

<p>As you embark on your Serverless journey, keep an eye out for opportunities to be lazy!</p>

<p>Hopefully you enjoyed this post. If you’re interested in other Serverless minutia, be sure to check out the rest of my <a href="https://aaronstuyvenberg.com">blog</a> and <a href="https://twitter.com/astuyve">twitter feed</a>!</p>]]></content><author><name>AJ Stuyvenberg</name></author><category term="posts" /><summary type="html"><![CDATA[This post will show you how to identify opportunities where Lazy Loading dependencies can help you reduce Cold Start Latency. We'll walk through a demo application and measure the performance impact of Lazy Loading in AWS Lambda!]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://aaronstuyvenberg.com/assets/images/lazy_load_article.jpg" /><media:content medium="image" url="https://aaronstuyvenberg.com/assets/images/lazy_load_article.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>