Yesterday our hosting provider, AWS, experienced a long-running incident. It affected thousands of businesses worldwide (including ours). Incidents happen regularly, and we should be prepared for them. Below I explain why this one led to downtime for us.

We received the first alerts from our monitoring at 07:18 UTC. It quickly looked like an infrastructure-provider issue, so we checked the AWS Status Page and saw reports of a widespread outage in the us-east-1 region. Within minutes, AWS identified the service at fault: DNS issues had made DynamoDB inaccessible across the entire region. This was particularly unfortunate for us because DynamoDB hosts all of our short links.

Our first step was to update our status page, but Atlassian Statuspage is hosted in us-east-1 as well. The public page stayed up, but we couldn’t push updates to it, which is why it showed the classic “all green while nothing works” picture. We also couldn’t support customers through Intercom, which is… also hosted in us-east-1. Our only option was to post updates on LinkedIn and X.

Next, we tried to find a workaround while AWS worked on a fix. Although short links are stored in DynamoDB, we also maintain an alternative store in Cassandra, so we considered releasing a hotfix to read from it. We soon realized we couldn’t: us-east-1 is not only AWS’s busiest region, it’s also its primary one, and because we couldn’t sign in or refresh access tokens, we had no way to access or modify our infrastructure. Our only option was to wait.

Around 09:00 UTC, DynamoDB returned to normal. We still weren’t able to access our infrastructure, update Statuspage, or provide chat support, but links began working again. We were able to update our status page around 09:30 UTC.

The incident was over for us at that point, but not for AWS. Around 13:00 UTC, a second wave of issues began, affecting AWS serverless and autoscaling services. We don’t use those; we run on reserved EC2 instances.

Between 15:14 UTC and 21:00 UTC, AWS routed all US traffic for our short links to our servers in Frankfurt. This didn’t affect availability but did slightly increase latency (an extra 300–500 ms), which was acceptable given the broader impact elsewhere.
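
For context, this kind of regional shift is what latency-based DNS routing with health checks normally produces: when the US targets fail their health checks, US resolvers are handed the Frankfurt record instead. The sketch below illustrates such a setup with the AWS SDK; the hosted-zone ID, domain, health-check ID, and IP addresses are hypothetical and not our actual configuration.

```ts
// Illustrative only: latency-based A records with a health check, so US
// traffic fails over to Frankfurt when the us-east-1 target is unhealthy.
// Zone ID, domain, health-check ID, and IPs are hypothetical placeholders.
import {
  Route53Client,
  ChangeResourceRecordSetsCommand,
} from "@aws-sdk/client-route-53";

const route53 = new Route53Client({});

export async function upsertLatencyRecords(): Promise<void> {
  await route53.send(
    new ChangeResourceRecordSetsCommand({
      HostedZoneId: "Z123EXAMPLE", // hypothetical zone
      ChangeBatch: {
        Changes: [
          {
            Action: "UPSERT",
            ResourceRecordSet: {
              Name: "links.example.com",
              Type: "A",
              SetIdentifier: "us-east-1",
              Region: "us-east-1",
              TTL: 60,
              ResourceRecords: [{ Value: "203.0.113.10" }], // hypothetical IP
              HealthCheckId: "hc-us-east-1-example",        // hypothetical check
            },
          },
          {
            Action: "UPSERT",
            ResourceRecordSet: {
              Name: "links.example.com",
              Type: "A",
              SetIdentifier: "eu-central-1",
              Region: "eu-central-1",
              TTL: 60,
              ResourceRecords: [{ Value: "198.51.100.20" }], // hypothetical IP
            },
          },
        ],
      },
    })
  );
}
```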

We chose not to release any fixes that day to avoid making the situation worse.

Our next steps on handling such incidents

We shouldn’t host our status page with the same provider as our main infrastructure. We’re deploying a secondary cluster with another hosting provider for a self-hosted status page and a fallback, self-hosted chat support tool. The status page will live at shortiostatus.com so it doesn’t share a domain with the main website.

Our engineers will implement a Cassandra fallback for cases where a link cannot be retrieved from DynamoDB. We could have built this earlier, but we considered a DynamoDB outage highly unlikely. This is now our top priority.
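
As a rough illustration of what that fallback read path could look like: try DynamoDB first with a short timeout and no long retries, and if the call fails or times out, serve the link from the Cassandra replica. This is a sketch only; the table, keyspace, column, and host names are hypothetical, not our production schema.

```ts
// Sketch of a read path that prefers DynamoDB and falls back to Cassandra.
// Hypothetical table, keyspace, column, and host names throughout.
import { DynamoDBClient, GetItemCommand } from "@aws-sdk/client-dynamodb";
import { NodeHttpHandler } from "@smithy/node-http-handler";
import { Client as CassandraClient } from "cassandra-driver";

const dynamo = new DynamoDBClient({
  region: "us-east-1",
  maxAttempts: 1, // fail fast into the fallback instead of retrying for long
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 300,
    requestTimeout: 500,
  }),
});

const cassandra = new CassandraClient({
  contactPoints: ["cassandra.internal.example"], // hypothetical hosts
  localDataCenter: "eu-central",                 // hypothetical DC name
  keyspace: "shortlinks",                        // hypothetical keyspace
});

export async function resolveLink(slug: string): Promise<string | null> {
  try {
    // Primary store: DynamoDB.
    const res = await dynamo.send(
      new GetItemCommand({
        TableName: "links",          // hypothetical table
        Key: { slug: { S: slug } },
      })
    );
    const url = res.Item?.url?.S;
    if (url) return url;
  } catch {
    // DynamoDB unreachable or timed out; fall through to the replica.
  }

  // Fallback store: Cassandra replica of the same links.
  const rs = await cassandra.execute(
    "SELECT url FROM links_by_slug WHERE slug = ?", // hypothetical table
    [slug],
    { prepare: true }
  );
  const row = rs.first();
  return row ? row.get("url") : null;
}
```

The point of the sketch is only the degradation behavior: the redirect path keeps serving links from the replica instead of returning errors when DynamoDB is unreachable.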

We will give our secondary regions, Frankfurt and Sydney, more autonomy so they can keep operating if us-east-1 goes down again, while staying within our budget.