Postmortem Incident Report

Incident Summary

Date: 2025-04-01
Duration: 08:40 - 09:35 (UTC)
Impact:
- Services in the EU cluster and the Telemetry cluster were disrupted.
- Affected services included delays in alerting, degraded application performance, and partial outages.
- Root cause stemmed from Azure AKS node failures in the northeurope-2 datacenter, compounded by a stuck Redis instance due to disk allocation issues.

We sincerely apologize for the disruption this caused. We understand the importance of service availability and deeply regret any inconvenience to our users and teams.

Resolution

The issue was resolved by migrating workloads to a healthy datacenter and restoring dependencies, including CosmosDB and Redis.

Remediation Actions Taken:

✅ Manual failover of services – Migrated critical workloads out of the affected AKS cluster.

✅ Telemetry rerouting – Reconnected and validated monitoring systems in the alternate region.

✅ Redis instance recovery – Identified and resolved a stuck disk allocation that prevented Redis failover.

✅ CosmosDB connectivity restored – Waited for Azure region-wide services to stabilize post-outage.

Root Cause Analysis

What Happened

At 08:46 UTC, Microsoft experienced a power maintenance event in the North Europe region, affecting Availability Zone 2.
This caused widespread disruption to Azure Virtual Machines, Storage, CosmosDB, and other services.
Our AKS nodes in northeurope-2 became unavailable at 08:46 UTC, leading to service downtime.
A Redis instance did not failover automatically due to a stuck disk allocation, causing further delays.
CosmosDB services degraded and then gradually recovered by late morning.

Why It Happened

Power event in Azure North Europe – Caused loss of infrastructure in AZ2, including our AKS and dependent services.
Redis disk allocation failure – Blocked automatic failover, extending downtime until manual intervention.
CosmosDB regional service instability – Contributed to degraded app performance and telemetry alert delays.

Impact

Services Affected: EU production workloads, Telemetry/alerting systems, Redis-backed services.
Consequences: Downtime, delayed alerts, increased service latency, and degraded observability during incident window.

Corrective & Preventive Measures

✅ Completed Actions (Post-Incident Remediation)

Manual migration playbook executed – Ensured reliable transition of workloads.
Redis instance recovered – Cleared disk allocation and restored operations.
Telemetry alerts validated – Reestablished healthy monitoring flow.

📌 Long-Term Improvements

Implement Redis Sentinel with automated failover
- Planned maintenance will introduce Redis Sentinel to eliminate single-node failures.
Conduct power-failure disaster recovery simulations
- Simulate AZ-specific outages to improve readiness and reduce recovery time.

Apology & Commitment to Improvement

We deeply regret the impact this incident had on our users and teams. Ensuring seamless service availability is our priority, and we acknowledge that this issue could have been avoided with better Redis failover strategy and multi-region preparedness.

We are committed to preventing similar incidents by implementing Redis Sentinel, region-aware resilience practices, and faster automated recovery. Thank you for your patience and trust as we continue to harden our infrastructure.

Owner: Matus Szepe

Azure AKS Regional Failure Impacting EU & Telemetry Clusters