Azure AKS Regional Failure Impacting EU & Telemetry Clusters

Uptime Impact: 13 minutes and 3 seconds
Resolved

Postmortem Incident Report

Incident Summary

  • Date: 2025-04-01
  • Duration: 08:40 - 09:35 (UTC)
  • Impact:
    • Services in the EU cluster and the Telemetry cluster were disrupted.
    • Impact included delayed alerting, degraded application performance, and partial outages.
    • The root cause was Azure AKS node failures in the northeurope-2 datacenter, compounded by a Redis instance stuck on a disk allocation issue.

We sincerely apologize for the disruption this caused. We understand the importance of service availability and deeply regret any inconvenience to our users and teams.

Resolution

The issue was resolved by migrating workloads to a healthy datacenter and restoring dependencies, including CosmosDB and Redis.

Remediation Actions Taken:

  • Manual failover of services – Migrated critical workloads out of the affected AKS cluster (a sketch of the cordon step follows this list).
  • Telemetry rerouting – Reconnected and validated monitoring systems in the alternate region.
  • Redis instance recovery – Identified and cleared the stuck disk allocation that was preventing automatic Redis failover.
  • CosmosDB connectivity restored – Monitored Azure region-wide services and confirmed connectivity once they stabilized post-outage.
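For reference, the first step of the manual failover follows the standard cordon-and-reschedule pattern. Below is a minimal sketch of that step, assuming the official `kubernetes` Python client and the standard `topology.kubernetes.io/zone` node label; it is illustrative rather than our exact runbook tooling.

```python
# Sketch: cordon every node in the failed availability zone so Kubernetes
# reschedules workloads onto healthy nodes. Assumes the official `kubernetes`
# Python client; the zone value matches the AZ that failed in this incident.
from kubernetes import client, config

FAILED_ZONE = "northeurope-2"

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    zone = (node.metadata.labels or {}).get("topology.kubernetes.io/zone")
    if zone == FAILED_ZONE:
        # Setting spec.unschedulable is the API equivalent of `kubectl cordon`:
        # no new pods are scheduled onto the node.
        v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
        print(f"cordoned {node.metadata.name}")
```

Cordoning only stops new pods from landing on the affected nodes; pod eviction and traffic cutover were handled by the subsequent runbook steps.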

Root Cause Analysis

What Happened

  • At 08:46 UTC, Microsoft experienced a power maintenance event in the North Europe region, affecting Availability Zone 2 (AZ2).
  • This caused widespread disruption to Azure Virtual Machines, Storage, CosmosDB, and other services.
  • Our AKS nodes in northeurope-2 became unavailable at 08:46 UTC, leading to service downtime.
  • A Redis instance did not fail over automatically due to a stuck disk allocation, causing further delays.
  • CosmosDB services degraded and then gradually recovered by late morning.

Why It Happened

  • Power event in Azure North Europe – Caused loss of infrastructure in AZ2, including our AKS nodes and dependent services.
  • Redis disk allocation failure – Blocked automatic failover, extending downtime until manual intervention.
  • CosmosDB regional service instability – Contributed to degraded app performance and delayed telemetry alerts (a client-side mitigation pattern is sketched below).
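During regional instability like this, the dependent services that fared best were those that retried transient CosmosDB failures. As a minimal, illustrative sketch, assuming the `azure-cosmos` Python SDK (the endpoint, key, and database/container names are placeholders, and the SDK also ships its own configurable retry policy):

```python
# Sketch: exponential backoff around a CosmosDB point read during regional
# instability. All names and credentials below are placeholders.
import time

from azure.cosmos import CosmosClient
from azure.cosmos.exceptions import CosmosHttpResponseError

client = CosmosClient("https://example-account.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("appdb").get_container_client("items")

def read_with_backoff(item_id: str, partition_key: str, attempts: int = 5):
    """Retry a point read with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(attempts):
        try:
            return container.read_item(item=item_id, partition_key=partition_key)
        except CosmosHttpResponseError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(2 ** attempt)
```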

Impact

  • Services Affected: EU production workloads, Telemetry/alerting systems, Redis-backed services.
  • Consequences: Downtime, delayed alerts, increased service latency, and degraded observability during the incident window.

Corrective & Preventive Measures

Completed Actions (Post-Incident Remediation)

  1. Manual migration playbook executed – Ensured reliable transition of workloads.
  2. Redis instance recovered – Cleared disk allocation and restored operations.
  3. Telemetry alerts validated – Reestablished healthy monitoring flow.

📌 Long-Term Improvements

  • Implement Redis Sentinel with automated failover
    • Planned maintenance will introduce Redis Sentinel to eliminate single-node failures (see the client-side sketch after this list).
  • Conduct power-failure disaster recovery simulations
    • Simulate AZ-specific outages to improve readiness and reduce recovery time.
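To illustrate what Sentinel changes on the client side, here is a minimal sketch assuming the `redis-py` library; the Sentinel hostnames and the `mymaster` service name are illustrative:

```python
# Sketch: resolving the current Redis master through Sentinel so that an
# automatic failover is transparent to application code. Hostnames and the
# "mymaster" service name are placeholders.
from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("sentinel-0", 26379), ("sentinel-1", 26379), ("sentinel-2", 26379)],
    socket_timeout=0.5,
)

# master_for() returns a client bound to whichever node is master right now;
# after a failover, the same handle resolves to the newly promoted replica.
master = sentinel.master_for("mymaster", socket_timeout=0.5)
master.set("healthcheck", "ok")

# Reads that tolerate slight staleness can be sent to a replica instead.
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)
print(replica.get("healthcheck"))
```

The point of the design is that clients never hard-code a master address, so a single node loss like the one in this incident no longer requires manual intervention.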

Apology & Commitment to Improvement

We deeply regret the impact this incident had on our users and teams. Ensuring seamless service availability is our priority, and we acknowledge that this issue could have been avoided with a better Redis failover strategy and multi-region preparedness.

We are committed to preventing similar incidents by implementing Redis Sentinel, region-aware resilience practices, and faster automated recovery. Thank you for your patience and trust as we continue to harden our infrastructure.

Owner: Matus Szepe

Resolved

We have now fully resolved the incident that began earlier today due to Azure AKS node failures in the northeurope-2 datacenter.

✅ All core services have been restored

✅ Telemetry and alerting pipelines are operational

✅ Azure CosmosDB is now fully functional, and dependent services are behaving as expected

We will proceed with a full post-incident review to analyze the root causes, validate recovery steps, and implement long-term improvements to prevent recurrence.

Thank you for your patience throughout this disruption.

Recovering

🚧 Incident Update – Recovery in Progress

We have mitigated the initial impact from the Azure AKS node failures in the northeurope-2 datacenter, and most services have been restored.

However, we are still experiencing degraded performance and intermittent connectivity issues with Azure CosmosDB, which is affecting some components dependent on database access.

We are monitoring recovery progress closely and will provide updates as we learn more.

Current Status:

✅ Core services operational

⚠️ CosmosDB access partially degraded

🛠 Recovery in progress

Thank you for your patience as we continue to restore full service reliability.

Investigating

🚨 We are currently investigating a service disruption affecting parts of our EU and Telemetry clusters. Initial signs point to infrastructure issues within the Azure North Europe region. Some services may be intermittently unavailable or degraded. Our team is actively working to identify the scope and restore normal operations as quickly as possible.

We will provide further updates as we learn more.

Affected components
  • EU DataCenter
    • API
  • Telemetry