• Date: February 19, 2026
• Duration: 12:30 – 13:00 CET
• Impact:
  • Full AMQP service outage affecting all connected services and event-driven workflows.
  • RabbitMQ cluster instability following an Out-Of-Memory (OOM) event on a single node, which escalated into a cluster-wide failure.
  • Approximately 30 minutes of downtime and partial message loss due to the required cluster recovery.
We sincerely apologize for the disruption this caused. Reliable messaging is a critical component of our platform, and we deeply regret the inconvenience experienced by our users and teams.
The issue was resolved by temporarily detaching RabbitMQ from the load balancer, restoring the cluster to a clean, stable state, and re-enabling inter-node communication before resuming traffic.
✅ Detached RabbitMQ from load balancer – Reduced connection churn and stabilized recovery efforts.
✅ Performed clean-state cluster restore – Allowed all nodes to return to a healthy state.
✅ Restored inter-node cluster communication – Verified proper synchronization across nodes before reattaching traffic (see the health-check sketch below).
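For future incidents, a pre-reattach health check can confirm that every node is running and free of memory alarms before traffic is restored. The sketch below is one possible approach using the RabbitMQ management HTTP API; the host, port, and credentials are placeholders, not values from our environment.

```python
import requests

# Placeholder endpoint and credentials for the RabbitMQ management API.
MGMT_URL = "http://rabbitmq.internal:15672"
AUTH = ("monitor", "changeme")

def cluster_ready() -> bool:
    """Return True only if every node is running with no memory alarm."""
    nodes = requests.get(f"{MGMT_URL}/api/nodes", auth=AUTH, timeout=5).json()
    return all(node["running"] and not node["mem_alarm"] for node in nodes)

if __name__ == "__main__":
    print("safe to reattach" if cluster_ready() else "cluster not ready")
```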
At 12:28 CET, one RabbitMQ node experienced a sudden memory spike, climbing from approximately 3 GiB to over 5 GiB within seconds. Kubernetes terminated the Erlang VM process due to the resulting Out-Of-Memory (OOM) event.
Following the restart, the node was unable to fully stabilize due to the large number of existing queues, which significantly slowed the recovery process.
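One way to catch a spike like this before the kernel does is to poll each node's memory usage against the broker's configured limit. This is a minimal sketch against the management API, which reports `mem_used` and `mem_limit` per node; the 80% threshold is an illustrative choice, not a value from this incident.

```python
import requests

MGMT_URL = "http://rabbitmq.internal:15672"   # placeholder host/port
AUTH = ("monitor", "changeme")                # placeholder credentials
WARN_RATIO = 0.8                              # illustrative alert threshold

def memory_headroom() -> None:
    """Warn when any node approaches its configured memory limit."""
    nodes = requests.get(f"{MGMT_URL}/api/nodes", auth=AUTH, timeout=5).json()
    for node in nodes:
        ratio = node["mem_used"] / node["mem_limit"]
        if ratio >= WARN_RATIO:
            print(f"{node['name']}: {ratio:.0%} of memory limit used")

memory_headroom()
```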
Contributing Factors
• Elevated queue volume increased memory pressure during node recovery.
• Client reconnection attempts amplified load on remaining nodes (a backoff sketch follows this list).
• Recovery required controlled traffic isolation to prevent further cascading effects.
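To illustrate the reconnection factor above: client-side exponential backoff with jitter spreads reconnection attempts out over time instead of sending them to the surviving nodes in synchronized waves. This is a sketch using the pika client; the host and retry bounds are assumptions, not our production settings.

```python
import random
import time

import pika
from pika.exceptions import AMQPConnectionError

def connect_with_backoff(host="rabbitmq.internal", max_delay=60.0):
    """Retry the AMQP connection with exponential backoff plus jitter,
    so that mass client reconnects do not arrive as a thundering herd."""
    delay = 1.0
    while True:
        try:
            return pika.BlockingConnection(pika.ConnectionParameters(host=host))
        except AMQPConnectionError:
            # Sleep a randomized fraction of the current delay, then grow it.
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, max_delay)
```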
Impact
• 30 minutes of full AMQP service unavailability.
• All services dependent on RabbitMQ messaging were affected.
• Some in-flight or queued messages were not recoverable following the cluster restore (see the durability sketch after this list).
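Message loss of this kind can be narrowed, though not eliminated, by combining durable queues, persistent delivery, and publisher confirms. The sketch below shows the pattern with the pika client; the queue and host names are placeholders.

```python
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbitmq.internal")  # placeholder host
)
channel = connection.channel()

# A durable queue survives a broker restart; publisher confirms make the
# publish block until the broker has taken responsibility for the message.
channel.queue_declare(queue="orders", durable=True)
channel.confirm_delivery()

channel.basic_publish(
    exchange="",
    routing_key="orders",
    body=b"example payload",
    properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
)
connection.close()
```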
Completed Actions (Post-Incident Remediation)
1. Stabilized recovery procedure – Defined safe detach/attach process for the load balancer.
2. Validated cluster configuration consistency – Ensured required services for inter-node communication are present.
3. Initiated queue lifecycle policy review – Strengthened governance around queue management (an example expires policy follows this list).
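As one possible outcome of the lifecycle review in item 3, an `expires` policy can remove queues that sit unused, keeping the total queue count (and therefore node restart time) bounded. The sketch below applies such a policy through the management API; the pattern, TTL, policy name, and credentials are illustrative, not the values we adopted.

```python
import requests

MGMT_URL = "http://rabbitmq.internal:15672"  # placeholder host/port
AUTH = ("admin", "changeme")                 # placeholder credentials

# Delete matching queues after one hour (3600000 ms) without use.
# "%2F" is the URL-encoded default vhost "/".
policy = {
    "pattern": "^tmp\\.",        # illustrative: only temporary queues
    "definition": {"expires": 3600000},
    "apply-to": "queues",
}
resp = requests.put(
    f"{MGMT_URL}/api/policies/%2F/expire-idle-queues",
    json=policy,
    auth=AUTH,
    timeout=5,
)
resp.raise_for_status()
```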
Long-Term Improvements
• Enhance queue lifecycle controls
  • Reinforce automated policy enforcement during cluster provisioning and migration.
• Automate recovery workflows
  • Implement protective measures to reduce reconnection storms during partial outages.
What Went Well
• Monitoring detected the node failure immediately and triggered alerts.
• Traffic isolation from the load balancer helped prevent prolonged cascading failure.
• Team response time was rapid and coordinated.
What Went Wrong
• Recovery was more complex than anticipated under load.
• Manual intervention under pressure increased operational risk.
We deeply regret the impact this incident had on our users and teams. Ensuring resilient and highly available messaging infrastructure remains a top priority.
We are strengthening monitoring, governance, and recovery automation to further reduce the likelihood and impact of similar events in the future.
Owner: DevOps Team