Postmortem Incident Report

Incident Summary

• Date: June 2, 2026
• Incident window: 09:35 – 12:38 CET
• Total service downtime: ~1 hour 48 minutes (across two separate outage periods)
• Impact:
  • Full AMQP service outage affecting all connected services and event-driven workflows on the EU2 environment.
  • RabbitMQ cluster instability following an out-of-memory (OOM) event on a single node, which escalated into a cluster-wide failure.
  • Services were fully operational between the two outage periods.

We sincerely apologize for the disruption this caused. Reliable messaging is a critical component of our platform, and we deeply regret the inconvenience experienced by our users and teams.

Resolution

The issue was resolved by spinning up additional nodes to accelerate cluster recovery, restoring the cluster to a stable state, and verifying full service availability before confirming resolution.

Remediation Actions Taken:

✅ Deployed additional infrastructure nodes – Accelerated cluster recovery and reduced time to restoration.
✅ Performed cluster recovery – Restored all nodes to a healthy, synchronized state.
✅ Verified full service availability – Confirmed all EU2 services operational before closing the incident.

Root Cause Analysis

What Happened

At 09:35 CET, memory usage on a RabbitMQ node in the EU2 cluster spiked suddenly and exceeded its configured threshold. The node's Erlang VM process was then terminated by an out-of-memory (OOM) event.

Following the initial node failure, the remaining cluster nodes were unable to maintain quorum, triggering a cascading failure that resulted in a full cluster outage. Recovery required deploying additional nodes and performing a controlled cluster restoration.
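For readers less familiar with quorum, the arithmetic behind a majority-based cluster can be sketched as follows. This is an illustration only: the report does not state the EU2 cluster size, so the node counts in the usage note are hypothetical, and the function names are ours rather than part of RabbitMQ.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, healthy_nodes: int) -> bool:
    """True when the healthy nodes still form a strict majority."""
    return healthy_nodes >= majority(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """Number of nodes that can fail before the majority is lost."""
    return cluster_size - majority(cluster_size)
```

Under these assumptions, a hypothetical three-node cluster tolerates one failed node (tolerated_failures(3) returns 1), so a second node becoming unavailable while the first is still down is enough to lose quorum.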

Contributing Factors

• Elevated queue volume increased memory pressure on the affected node.
• Client reconnection attempts amplified load on remaining nodes during recovery.
• A known cluster configuration limitation allows a single-node failure to propagate to the entire cluster.
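The reconnection amplification listed above is commonly reduced on the client side with exponential backoff and random jitter, so that clients do not all retry at the same moment. The sketch below shows that general technique; it is not a description of how our clients currently behave, and the names backoff_delays and reconnect, as well as the default values, are hypothetical.

```python
import random
import time


def backoff_delays(base=1.0, cap=60.0, attempts=8, rng=None):
    """Yield one randomized wait (in seconds) per reconnection attempt.

    The upper bound doubles with each attempt up to `cap`, and the actual
    wait is drawn uniformly between 0 and that bound ("full jitter").
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)


def reconnect(connect, sleep=time.sleep, **backoff_kwargs):
    """Call `connect` until it succeeds, waiting a jittered delay after each failure.

    `connect` is any zero-argument callable that raises ConnectionError on
    failure; it stands in for the real client library's connect call.
    """
    for delay in backoff_delays(**backoff_kwargs):
        try:
            return connect()
        except ConnectionError:
            sleep(delay)
    raise ConnectionError("gave up after all reconnection attempts")
```

Spreading retries over a randomized window trades slightly slower individual reconnects for a lower peak load on the nodes that are still recovering.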

Impact

• ~1 hour 48 minutes of total AMQP service unavailability on EU2 (across two separate outage periods).
• All services dependent on RabbitMQ messaging were affected.
• All EU2 customers experienced complete service disruption during the outage windows.

Corrective & Preventive Measures

Completed Actions (Post-Incident Remediation)

1. Stabilized recovery procedure – Validated and executed the node deployment process for accelerated recovery.
2. Verified cluster health – Ensured all nodes were fully synchronized and operational before restoring traffic.
3. Initiated review of memory monitoring thresholds – Identified gaps in early-warning alerting for node memory pressure.

Ongoing Improvements

• Architectural hardening of the messaging layer to improve cluster resilience and fault isolation.
• Enhanced monitoring and alerting to enable earlier detection of resource pressure.
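As an illustration of what earlier detection of memory pressure could look like, the sketch below classifies a node's memory usage against its limit and raises a warning well before the limit is reached. The 70% and 90% ratios and the function name are hypothetical placeholders, not the thresholds we have chosen, and the inputs are assumed to come from whatever per-node metrics the monitoring system already collects.

```python
def memory_alert_level(mem_used, mem_limit, warn_ratio=0.7, critical_ratio=0.9):
    """Classify node memory pressure as "ok", "warning", or "critical".

    `mem_used` and `mem_limit` are byte counts for a single node; the
    ratios are illustrative defaults and should be tuned per environment.
    """
    if mem_limit <= 0:
        raise ValueError("mem_limit must be positive")
    if not 0 < warn_ratio < critical_ratio:
        raise ValueError("expected 0 < warn_ratio < critical_ratio")
    ratio = mem_used / mem_limit
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"
```

With these placeholder values, a node using 750 MB of a 1000 MB limit would raise a warning, leaving time to act before an OOM event.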

Lessons Learned

What Went Well

• External monitoring (Pingdom) detected the failure within minutes.
• Team response was rapid: DevOps engaged within 2 minutes of the alert.
• Deploying additional nodes accelerated the recovery process.

What Went Wrong

• The known cascading-failure limitation had not been addressed before this incident.
• No proactive memory alerting was in place to warn before the OOM threshold was reached.
• Recovery complexity extended the total downtime beyond initial expectations.

Apology & Commitment to Improvement

We deeply regret the impact this incident had on our users and teams. Ensuring resilient and highly available messaging infrastructure remains a top priority.

We are implementing architectural changes to our messaging layer, strengthening monitoring, and improving operational safeguards to significantly reduce the likelihood and impact of similar events in the future.

Owner: DevOps Team

EU2 Cluster connectivity issue