
Resilient by Design: Why Zerberus Survives What Brought AWS Down

📅 Updated 21 October 2025, 17:20 BST 

🟡 Developing incident: AWS continues gradual recovery across its US-EAST-1 region following widespread DNS degradation affecting DynamoDB, EC2, and Lambda.


Resilient by Design: How Zerberus Stayed Unfazed During the AWS Outage 2025

The AWS Outage 2025: What Actually Happened

At 12:11 AM PDT on 20 October 2025, Amazon Web Services confirmed an “operational issue” in its US-EAST-1 (Northern Virginia) region. The root cause was a DNS resolution failure within DynamoDB, cascading to services such as EC2, Lambda, RDS, Glue, ECS, and CloudFormation.

For almost five hours, key internet applications slowed or stalled. Based on incident dashboards and Downdetector reports, the fallout included:

  • Social & Communication: Snapchat, Reddit, Facebook (partial), T-Mobile, Verizon.

  • Gaming: Roblox and Fortnite sign-ins failing globally.

  • Entertainment: Disney+ streaming blackouts.

  • Finance: Coinbase, Robinhood, and Venmo login disruptions (no data loss).

  • Commerce & Daily Apps: Amazon, Canva, McDonald’s, Lyft, United Airlines, Duolingo, The New York Times.

By 5:10 AM PDT, AWS reported recovery, but residual throttling persisted across EC2 and Lambda. At its peak, more than 15,000 users reported issues, translating to millions of indirect service interruptions worldwide.


Developers on X (formerly Twitter) summarised the mood best:

“It’s 2025, and one AWS region can still take half the internet offline.”


Why Cloud Outages Keep Happening

This incident is neither unique nor unexpected. The pattern is clear: tight interdependencies across a cloud’s control plane, DNS, and identity systems amplify a local fault into a global outage.

We’ve seen it before:

  • Fastly’s CDN crash (2021) that knocked out global news sites,

  • AWS S3’s metadata fault (2017) that stalled half the web,

  • Azure AD outages (2020–2021) from authentication key rotation bugs,

  • Google Cloud’s 2020 global auth failure that silenced Workspace.

Each event reinforces the same lesson: redundancy within one provider is not the same as multi-cloud resilience.


Cloud Outage Highlights: 2020–2025

| Year | Provider / Region | Root Cause | Duration | Major Impact |
|------|-------------------|------------|----------|--------------|
| 2020 (Dec) | Google Cloud / Global | Central auth quota fault | ~45 min | Gmail, YouTube, Drive down worldwide |
| 2020 (Sep) | Azure / Americas | AD deployment misconfiguration | ~3 h | Microsoft 365, Teams sign-in failure |
| 2021 (Mar–Apr) | Azure / Global | Key rotation & DNS code defect | 2–4 h | Azure, Exchange, Outlook affected |
| 2021 (Dec) | AWS / US-EAST-1 | Network device impairment | ~6 h | Netflix, Disney+, Slack outage |
| 2023 (Apr) | Google Cloud / Paris | Data-centre fire after water leak | >12 h | GCP regional loss |
| 2023 (Jun) | AWS / US-EAST-1 | Capacity-management fault | 2 h | Lambda, API Gateway, AWS Console |
| 2025 (Jun) | Google Cloud / Multi-region | Policy update broke Service Control | ~3 h | Workspace, Spotify, Discord |
| 2025 (Oct) | AWS / US-EAST-1 | DNS resolution fault in DynamoDB | ~5 h | Global service degradation |

Across providers, the control plane (identity, DNS, quotas, service routing) remains the single point of systemic failure.

The pattern is clear: centralised control planes are single points of systemic risk.


How We Got Here: From Mainframes to Trust Automation

Before 2018, compliance software meant monolithic, on-prem GRC suites from giants like IBM OpenPages and Oracle GRC Cloud: black-box systems deployed by banks and Fortune 500s. They were powerful but expensive, closed, and rigid.

Then came Vanta, Drata, and later Secureframe. These companies invented the Trust Management sector, bringing what was once enterprise-only automation to startups and SMBs. They replaced audit spreadsheets with real-time dashboards, automated evidence collection, and compliance workflows that any engineering team could adopt.

This shift democratised trust for the modern SaaS economy.


How the Pioneers Built the Modern Stack

Platforms like Vanta, Drata, and Secureframe transformed compliance from a manual, consultant-driven exercise into real-time, scalable trust management. Their work effectively created the compliance automation category, replacing black-box systems from IBM and Oracle with transparent, developer-friendly SaaS.

Vanta’s architecture exemplifies this new model: a multi-tenant, microservice platform hosted on AWS, powered by Terraform-based IaC for reproducibility and MongoDB Atlas for scalable storage.

The company further innovated with a browser extension that automates vendor questionnaire ingestion and an MCP Server that links compliance data with AI agents such as Claude, enabling contextual audit automation. Its AI retrieval layer uses vector-based embeddings, semantic search, and a “quality hill-climbing” feedback loop to improve precision in audit responses.
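
The retrieval pattern described here, embedding documents and ranking them by semantic similarity to a query, is a general one. The Python sketch below is a generic illustration of that pattern, not Vanta's implementation; the embeddings are assumed to come from whatever model the platform uses, which is outside this sketch.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray,
             corpus: list[tuple[str, np.ndarray]],
             top_k: int = 3) -> list[tuple[str, float]]:
    """Rank evidence snippets by semantic similarity to the query embedding.

    `corpus` is a list of (snippet_text, embedding) pairs produced by an
    embedding model chosen elsewhere; this function only does the ranking.
    """
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

A feedback loop of the kind Vanta describes would then score which retrieved snippets actually helped answer audit questions and use those signals to tune the ranking over time.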

Behind the scenes, Vanta also invested heavily in reliability engineering: optimising frontend performance sevenfold, resolving JWT race conditions, and introducing structured error frameworks. Controlled dependency rollouts and SBOM-style validation underpin their continuous GRC Trust Centre.

Source: Engineering at Vanta (Terraform Migration, MCP Server, Questionnaire Extension, Smarter Retrieval System)

Drata and Secureframe followed similar architectural paths: multi-tenant SaaS systems on AWS with strong automation pipelines and agent frameworks. Collectively, they made compliance accessible, fast, and measurable for thousands of SaaS companies.


Yet, as the AWS outage 2025 reminds us, even world-class trust platforms built atop a single cloud provider inherit its operational risks. Their engineering excellence remains unquestioned, but architectural independence is the next frontier.


Why Zerberus Chose the Harder Road

When we designed Zerberus, we assumed failure as the baseline. Our objective was to build autonomous compliance that functions across outages, regions, and providers.

Architecture diagram: how Zerberus.ai achieves resilience.

1. Multi-Cloud Resilience Core

  • Five production regions across AWS & Azure.

  • DigitalOcean as Pilot-Light (active-active) nodes.

  • Cloudflare Edge performs DNS, WAF, and load balancing outside any single cloud.

  • Each hot standby receives live traffic to ensure parity and instant scalability (a configuration sketch follows below).
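
To make the layout concrete, here is a minimal Python sketch of how origin pools across the three providers might be modelled, with traffic re-weighting when a pool drops out. The hostnames, regions, and weights are illustrative placeholders, not our production values; in production this behaviour lives in the Cloudflare load-balancing configuration rather than application code.

```python
# Illustrative origin-pool layout: real hostnames, regions, and weights differ.
ORIGIN_POOLS = {
    "aws-us-east":    {"endpoint": "aws-use1.example.internal", "weight": 0.35, "healthy": True},
    "aws-eu-west":    {"endpoint": "aws-euw1.example.internal", "weight": 0.25, "healthy": True},
    "azure-uk-south": {"endpoint": "az-uks.example.internal",   "weight": 0.25, "healthy": True},
    "do-lon1":        {"endpoint": "do-lon1.example.internal",  "weight": 0.15, "healthy": True},  # pilot-light pool
}

def eligible_origins(pools: dict) -> dict:
    """Return healthy origins with weights renormalised, mimicking how an
    edge load balancer redistributes traffic when a pool drops out."""
    healthy = {name: p for name, p in pools.items() if p["healthy"]}
    total = sum(p["weight"] for p in healthy.values()) or 1.0
    return {name: {**p, "weight": p["weight"] / total} for name, p in healthy.items()}
```

If the AWS pools were marked unhealthy, the remaining weight would shift to the Azure and DigitalOcean origins automatically, which is exactly the behaviour we rely on the edge layer to provide.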


2. Externalised Control Plane

Routing, identity sync, and health logic operate independently of cloud IAM or DNS. If a region fails, Cloudflare redirects within seconds to a healthy zone or provider.
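
A simplified sketch of that health-and-redirect loop is below. The probe URLs are hypothetical, and the edge update is reduced to a callback placeholder; the real integration goes through the Cloudflare API, which is not reproduced here.

```python
import requests

# Hypothetical per-region health endpoints, probed from outside any one cloud.
REGIONS = {
    "aws-us-east":    "https://use1.health.example.com/healthz",
    "azure-uk-south": "https://uks.health.example.com/healthz",
    "do-lon1":        "https://lon1.health.example.com/healthz",
}

def probe(url: str, timeout: float = 2.0) -> bool:
    """Treat anything other than a fast HTTP 200 as unhealthy."""
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

def reconcile_routing(update_edge) -> dict:
    """Probe every region and hand the healthy set to the edge layer.

    `update_edge` stands in for the call that adjusts edge routing; in
    production this is a Cloudflare load-balancer update, not shown here.
    """
    status = {name: probe(url) for name, url in REGIONS.items()}
    update_edge([name for name, ok in status.items() if ok])
    return status
```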


3. Predictive Security and AI

Azure-hosted models trained on open breach datasets and live telemetry generate continuous risk signals for Trace-AI and trigger Remed-AI for automated fixes.
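
In outline: telemetry in, a risk score out, remediation triggered above a threshold. The sketch below uses hypothetical feature names and a deliberately simple weighted score as a stand-in for the trained models.

```python
# Hypothetical telemetry features and weights; the production models are ML-based.
RISK_WEIGHTS = {"failed_logins": 0.4, "anomalous_api_calls": 0.35, "drifted_controls": 0.25}
REMEDIATION_THRESHOLD = 0.7

def risk_score(telemetry: dict) -> float:
    """Combine normalised telemetry signals (each in [0, 1]) into one score."""
    return sum(RISK_WEIGHTS[k] * telemetry.get(k, 0.0) for k in RISK_WEIGHTS)

def maybe_remediate(telemetry: dict, trigger_fix) -> float:
    """Call the remediation hook when the score crosses the threshold."""
    score = risk_score(telemetry)
    if score >= REMEDIATION_THRESHOLD:
        trigger_fix(score, telemetry)  # stand-in for the Remed-AI trigger
    return score
```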


4. Just-in-Time Remediation & Zero-Knowledge Proof Audits

Temporary STS tokens grant short-lived access; after execution, permissions are destroyed. Each action produces a Zero-Knowledge Proof recorded in the Zerberus Evidence Plane, ensuring privacy and verifiable auditability.
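
The short-lived access half of this maps directly onto AWS STS. Below is a minimal sketch: the role ARN is a placeholder, and the SHA-256 commitment at the end stands in for the actual Zero-Knowledge Proof, which uses a proper proof system rather than a bare hash.

```python
import hashlib
import json
import boto3

def run_remediation(role_arn: str, action: dict) -> str:
    """Assume a short-lived role, perform the fix, and record a commitment.

    The credentials expire on their own after 15 minutes, so nothing
    long-lived is stored. The SHA-256 digest returned here is only an
    illustrative placeholder for the proof written to the evidence plane.
    """
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName="jit-remediation",
        DurationSeconds=900,  # the shortest session STS allows
    )["Credentials"]

    session = boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    # ... perform the remediation `action` with `session` here ...

    return hashlib.sha256(json.dumps(action, sort_keys=True).encode()).hexdigest()
```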

5. Data Sovereignty by Default

Every client operates within its own VPC boundary.


Only hashed artefacts (pass/fail logs, evidence summaries) leave the jurisdiction, supporting DORA, NIS2, and Cyber Resilience Act compliance.
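
In practice, that means evidence is reduced to a verdict plus a digest before anything crosses a boundary. A minimal sketch of that export step, with illustrative field names rather than our actual schema:

```python
import hashlib
import json

def export_artefact(evidence: dict) -> dict:
    """Produce the only representation allowed to leave the client's VPC:
    a pass/fail verdict plus a SHA-256 digest of the full evidence record.
    Raw logs and source data stay inside the jurisdictional boundary."""
    digest = hashlib.sha256(
        json.dumps(evidence, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {
        "control_id": evidence.get("control_id"),  # illustrative field names
        "result": "pass" if evidence.get("passed") else "fail",
        "evidence_sha256": digest,
    }
```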


Zerberus vs the AWS Outage 2025: Zero Impact

| Failure Vector | Typical Industry Effect | Zerberus Outcome |
|----------------|-------------------------|------------------|
| DNS Resolution (US-EAST-1) | Cascading API failures | Cloudflare rerouted edge traffic to Azure/DO nodes |
| Control Plane Latency | IAM timeouts | Local token stores kept auth autonomous |
| Lambda / Compute Throttle | Service degradation | Workloads scaled via Azure App Services and DO Droplets |
| Audit Data Persistence | Evidence loss | ZKP ledger continued uninterrupted |

While many world-renowned services reported degradation, Zerberus maintained 100 % availability and full audit continuity.

UptimeRobot dashboard: Zerberus.ai uptime and response times over the last 24 hours.

Note: the spike in response time reflects our threshold at the edge and traffic distribution, not an availability issue.


Lessons for the Industry

The pioneers of trust automation proved compliance could be continuous. The next chapter is about making it resilient.

  1. Resilience is not an SLA metric; it is an architectural philosophy.

  2. Multi-region is good; multi-cloud resilience with independent control planes is better.

  3. Regulations like the Cyber Resilience Act will soon require exactly that.


Ready to future-proof your cybersecurity compliance automation?

Visit https://zerberus.ai today and request a demo.


Further Reading & References

  • AWS Post-Event Summary — US-EAST-1 Outage (20 Oct 2025)

  • Downdetector Insights: Outage impact data and user report density

  • Vanta Engineering Blog (“Terraform Migration”, “Questionnaire Automation”, “MCP Server”, “Smarter Retrieval System”)

  • Drata Architecture Overview (2024 Engineering Spotlight)

  • Google Cloud Incident Archive (Dec 2020, Jun 2025)

  • Microsoft Azure Post-Incident Reports (2020–2021)

  • EU Cyber Resilience Act, NIS2 Directive & DORA Framework
