Resilient by Design: Why Zerberus Survives What Brought AWS Down
- Ramkumar Sundarakalatharan
📅 Updated 21 October 2025, 17:20 BST
🟡 Developing incident: AWS continues gradual recovery across its US-EAST-1 region following widespread DNS degradation affecting DynamoDB, EC2, and Lambda.

The AWS Outage 2025: What Actually Happened
At 12:11 AM PDT on 20 October 2025, Amazon Web Services confirmed an “operational issue” in its US-EAST-1 (Northern Virginia) region. The root cause was a DNS resolution failure within DynamoDB, cascading to services such as EC2, Lambda, RDS, Glue, ECS, and CloudFormation.
For almost five hours, key internet applications slowed or stalled. Based on incident dashboards and Downdetector reports, the fallout included:
- Social & Communication: Snapchat, Reddit, Facebook (partial), T-Mobile, Verizon
- Gaming: Roblox and Fortnite sign-ins failing globally
- Entertainment: Disney+ streaming blackouts
- Finance: Coinbase, Robinhood, and Venmo login disruptions (no data loss)
- Commerce & Daily Apps: Amazon, Canva, McDonald’s, Lyft, United Airlines, Duolingo, The New York Times
By 5:10 AM PDT, AWS reported recovery, though residual throttling persisted across EC2 and Lambda. At its peak, more than 15,000 users filed outage reports, with the indirect impact reaching millions of service interruptions worldwide.
Developers on X (formerly Twitter) summarised the mood best:
“It’s 2025, and one AWS region can still take half the internet offline.”
Why Cloud Outages Keep Happening
This incident is neither unique nor unexpected. The pattern is clear: tight interdependencies across a cloud’s control plane, DNS, and identity systems amplify a local fault into a global outage.
We’ve seen it before:
- Fastly’s CDN crash (2021), which knocked out global news sites
- AWS S3’s metadata fault (2017), which stalled half the web
- Azure AD outages (2020–2021) from authentication key-rotation bugs
- Google Cloud’s 2020 global auth failure, which silenced Workspace
Each event reinforces the same lesson: redundancy within one provider is not the same as multi-cloud resilience.
Cloud Outage Highlights: 2020–2025
| Year | Provider / Region | Root Cause | Duration | Major Impact |
| --- | --- | --- | --- | --- |
| 2020 (Dec) | Google Cloud / Global | Central auth quota fault | ~45 min | Gmail, YouTube, Drive down worldwide |
| 2020 (Sep) | Azure / Americas | AD deployment misconfiguration | ~3 h | Microsoft 365, Teams sign-in failure |
| 2021 (Mar–Apr) | Azure / Global | Key rotation & DNS code defect | 2–4 h | Azure, Exchange, Outlook affected |
| 2021 (Dec) | AWS / US-EAST-1 | Network device impairment | ~6 h | Netflix, Disney+, Slack outage |
| 2023 (Apr) | Google Cloud / Paris | Data-centre fire after water leak | >12 h | GCP regional loss |
| 2023 (Jun) | AWS / US-EAST-1 | Capacity-management fault | 2 h | Lambda, API Gateway, AWS Console |
| 2025 (Jun) | Google Cloud / Multi-region | Policy update broke Service Control | ~3 h | Workspace, Spotify, Discord |
| 2025 (Oct) | AWS / US-EAST-1 | DNS resolution fault in DynamoDB | ~5 h | Global service degradation |
Across providers, the control plane (identity, DNS, quotas, service routing) remains the single point of systemic failure. The pattern is clear: a centralised control plane is a centralised risk.
How We Got Here: From Mainframes to Trust Automation
Before 2018, compliance software meant monolithic, on-prem GRC suites from giants like IBM OpenPages and Oracle GRC Cloud: black-box systems deployed by banks and Fortune 500s. They were powerful but expensive, closed, and rigid.
Then came Vanta, Drata, and later Secureframe. These companies invented the Trust Management sector, bringing what was once enterprise-only automation to startups and SMBs. They replaced audit spreadsheets with real-time dashboards, automated evidence collection, and compliance workflows that any engineering team could adopt.
This shift democratised trust for the modern SaaS economy.
How the Pioneers Built the Modern Stack
Platforms like Vanta, Drata, and Secureframe transformed compliance from a manual, consultant-driven exercise into real-time, scalable trust management. Their work effectively created the compliance automation category, replacing black-box systems from IBM and Oracle with transparent, developer-friendly SaaS.
Vanta’s architecture exemplifies this new model: a multi-tenant, microservice platform hosted on AWS, powered by Terraform-based IaC for reproducibility and MongoDB Atlas for scalable storage.
The company further innovated with a browser extension that automates vendor questionnaire ingestion and an MCP Server that links compliance data with AI agents such as Claude, enabling contextual audit automation. Its AI retrieval layer uses vector-based embeddings, semantic search, and a “quality hill-climbing” feedback loop to improve precision in audit responses.
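At its core, vector-based retrieval of this kind means embedding documents and ranking them by similarity to a query. A toy sketch of the idea (the three-dimensional "embeddings" stand in for learned ones, and the document IDs are invented for illustration, not Vanta's actual data):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, corpus, k=2):
    """Rank (doc_id, vector) pairs by similarity to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

# Toy corpus of compliance snippets with invented embeddings.
corpus = [
    ("soc2-access-control", [0.9, 0.1, 0.0]),
    ("gdpr-data-retention", [0.1, 0.9, 0.2]),
    ("iso27001-backups",    [0.2, 0.3, 0.9]),
]
best = top_k([0.8, 0.2, 0.1], corpus, k=1)
```

A production system layers learned embeddings, an approximate-nearest-neighbour index, and relevance feedback (the "hill-climbing" loop) on top of exactly this ranking step.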
Behind the scenes, Vanta also invested heavily in reliability engineering, optimising frontend performance sevenfold, resolving JWT race conditions, and introducing structured error frameworks. Controlled dependency rollouts and SBOM-style validation underpin their continuous GRC Trust Centre.
Source: Engineering at Vanta (Terraform Migration, MCP Server, Questionnaire Extension, Smarter Retrieval System)
Drata and Secureframe followed similar architectural paths: multi-tenant SaaS systems on AWS, with strong automation pipelines and agent frameworks. Collectively, they made compliance accessible, fast, and measurable for thousands of SaaS companies.
Yet, as the AWS outage 2025 reminds us, even world-class trust platforms built atop a single cloud provider inherit its operational risks. Their engineering excellence remains unquestioned, but architectural independence is the next frontier.
Why Zerberus Chose the Harder Road
When we designed Zerberus, we assumed failure as the baseline. Our objective was to build autonomous compliance that functions across outages, regions, and providers.

1. Multi-Cloud Resilience Core
- Five production regions across AWS & Azure.
- DigitalOcean pilot-light nodes kept in active-active rotation.
- Cloudflare Edge performs DNS, WAF, and load balancing outside any single cloud.
- Each hot standby receives live traffic to ensure parity and instant scalability.
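As a rough illustration, a topology of that shape might be modelled like this; the providers, region names, and traffic weights below are invented for the sketch, not Zerberus's actual deployment:

```python
from dataclasses import dataclass

@dataclass
class Region:
    provider: str
    name: str
    role: str      # "primary" or "pilot-light"
    weight: int    # share of live traffic, as a percentage

# Illustrative topology: even the pilot-light standby takes live traffic,
# so its configuration parity is exercised continuously.
TOPOLOGY = [
    Region("aws",   "eu-west-2",  "primary",     30),
    Region("aws",   "us-east-1",  "primary",     20),
    Region("aws",   "ap-south-1", "primary",     15),
    Region("azure", "uksouth",    "primary",     15),
    Region("azure", "westeurope", "primary",     15),
    Region("do",    "lon1",       "pilot-light",  5),
]

def validate(topology):
    """Weights must sum to 100 and every node must see some traffic."""
    assert sum(r.weight for r in topology) == 100
    assert all(r.weight > 0 for r in topology)
    return True
```

The `weight > 0` invariant is the point: a standby that never receives traffic is a standby whose failover behaviour is untested.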
2. Externalised Control Plane
Routing, identity sync, and health logic operate independently of cloud IAM or DNS. If a region fails, Cloudflare redirects within seconds to a healthy zone or provider.
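The failover decision itself reduces to a small piece of logic. A minimal Python sketch, assuming the edge exposes a boolean health probe per target (the target names are illustrative):

```python
def healthy_targets(probes):
    """Filter targets whose latest health probe passed."""
    return {name for name, ok in probes.items() if ok}

def route(probes, preference):
    """Send traffic to the most-preferred healthy target."""
    healthy = healthy_targets(probes)
    for target in preference:
        if target in healthy:
            return target
    raise RuntimeError("no healthy target available")

# During a US-EAST-1 DNS fault, the AWS probe fails and traffic
# lands on the next target in the preference list.
probes = {"aws:us-east-1": False, "azure:uksouth": True, "do:lon1": True}
chosen = route(probes, ["aws:us-east-1", "azure:uksouth", "do:lon1"])
# chosen == "azure:uksouth"
```

The crucial property is where this logic runs: at the edge, outside the failing provider, so it keeps working when the provider's own DNS and IAM do not.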
3. Predictive Security and AI
Azure-hosted models trained on open breach datasets and live telemetry generate continuous risk signals for Trace-AI and trigger Remed-AI for automated fixes.
4. Just-in-Time Remediation & Zero-Knowledge Proof Audits
Temporary STS tokens grant short-lived access; after execution, permissions are destroyed. Each action produces a Zero-Knowledge Proof recorded in the Zerberus Evidence Plane, ensuring privacy and verifiable auditability.
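In outline, that lifecycle looks something like the sketch below. The hash-chained ledger is a deliberate simplification: a real zero-knowledge proof lets an auditor verify an action without seeing its contents, which plain hashing does not. Class and field names here are invented for illustration, not Zerberus's API:

```python
import hashlib
import json
import time

class JITToken:
    """Short-lived credential: valid only inside its TTL window."""
    def __init__(self, scope, ttl_seconds=300):
        self.scope = scope
        self.expires_at = time.time() + ttl_seconds

    def is_valid(self):
        return time.time() < self.expires_at

def record_action(ledger, action):
    """Append a commitment chained to the previous ledger entry.

    Chaining makes the ledger tamper-evident: altering any past
    entry changes every digest after it. (A simplified stand-in
    for a zero-knowledge proof, which would also hide contents.)
    """
    prev = ledger[-1] if ledger else "genesis"
    payload = json.dumps(action, sort_keys=True)
    ledger.append(hashlib.sha256((prev + payload).encode()).hexdigest())
    return ledger[-1]

token = JITToken(scope="remediate:s3-public-acl", ttl_seconds=60)
ledger = []
record_action(ledger, {"scope": token.scope, "result": "fixed"})
```

After the remediation completes, the token simply expires; nothing long-lived remains to revoke or to steal.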
5. Data Sovereignty by Default
- Every client operates within its own VPC boundary.
- Only hashed artefacts (pass/fail logs, evidence summaries) leave the jurisdiction, supporting DORA, NIS2, and Cyber Resilience Act compliance.
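A minimal sketch of that export boundary, assuming evidence records are JSON-serialisable dicts (the field names are invented for illustration):

```python
import hashlib
import json

def export_artefact(evidence):
    """Build the jurisdiction-safe summary of an evidence record.

    Only the control ID, the pass/fail outcome, and a SHA-256
    digest cross the boundary; raw evidence stays inside the VPC.
    """
    digest = hashlib.sha256(
        json.dumps(evidence, sort_keys=True).encode()
    ).hexdigest()
    return {
        "control": evidence["control"],
        "status": evidence["status"],
        "sha256": digest,
    }

evidence = {
    "control": "CRA-7.2",
    "status": "pass",
    "raw_log": "customer-identifying details never leave the VPC",
}
artefact = export_artefact(evidence)
```

The digest still lets an auditor later confirm that an in-VPC record matches what was attested, without the record itself ever having left the jurisdiction.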
Zerberus vs the AWS Outage 2025: Zero Impact
| Failure Vector | Typical Industry Effect | Zerberus Outcome |
| --- | --- | --- |
| DNS Resolution (US-EAST-1) | Cascading API failures | Cloudflare rerouted edge traffic to Azure/DO nodes |
| Control Plane Latency | IAM timeouts | Local token stores kept auth autonomous |
| Lambda / Compute Throttle | Service degradation | Workloads scaled via Azure App Services and DO Droplets |
| Audit Data Persistence | Evidence loss | ZKP ledger continued uninterrupted |
While many world-renowned services reported degradation, Zerberus maintained 100% availability and full audit continuity.

Note: the brief spike in response time reflects our edge threshold kicking in as traffic was redistributed, not an outage.
Lessons for the Industry
The pioneers of trust automation proved compliance could be continuous. The next chapter is about making it resilient.
- Resilience is not an SLA metric; it’s an architectural philosophy.
- Multi-region is good; multi-cloud resilience with independent control planes is better.
- Regulations like the Cyber Resilience Act will soon require exactly that.
Ready to future-proof your cybersecurity compliance automation?
Visit https://zerberus.ai today and request a demo.
Further Reading & References
- AWS Post-Event Summary: US-EAST-1 outage (20 Oct 2025)
- Downdetector Insights: outage impact data and user report density
- Vanta Engineering Blog (“Terraform Migration”, “Questionnaire Automation”, “MCP Server”, “Smarter Retrieval System”)
- Drata Architecture Overview (2024 Engineering Spotlight)
- Google Cloud Incident Archive (Dec 2020, Jun 2025)
- Microsoft Azure Post-Incident Reports (2020–2021)
- EU Cyber Resilience Act, NIS2 Directive & DORA Framework