• Date: June 2, 2026
• Incident window: 09:35 – 12:38 CET
• Total service downtime: ~1 hour 48 minutes (across two separate outage periods)
• Impact:
• Full AMQP service outage affecting all connected services and event-driven workflows on the EU2 environment.
• RabbitMQ cluster instability following an Out-Of-Memory (OOM) event on a single node, which escalated into a cluster-wide failure.
• Services were fully operational between the two outage periods.
We sincerely apologize for the disruption this caused. Reliable messaging is a critical component of our platform, and we deeply regret the inconvenience experienced by our users and teams.
The issue was resolved by spinning up additional nodes to accelerate cluster recovery, restoring the cluster to a stable state, and verifying full service availability before confirming resolution.
✅ Deployed additional infrastructure nodes – Accelerated cluster recovery and reduced time to restoration.
✅ Performed cluster recovery – Restored all nodes to a healthy, synchronized state.
✅ Verified full service availability – Confirmed all EU2 services operational before closing the incident.
At 09:35 CET, a RabbitMQ node on the EU2 cluster experienced a sudden memory spike, exceeding its configured threshold. The Erlang VM process was terminated due to an Out-Of-Memory (OOM) event.
Following the initial node failure, the remaining cluster nodes were unable to maintain quorum, triggering a cascading failure that resulted in a full cluster outage. Recovery required deploying additional nodes and performing a controlled cluster restoration.
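The quorum loss described above can be illustrated with a minimal sketch. It assumes a simple majority rule, which is a common approach for clustered message brokers. The report does not state the EU2 cluster size or the exact quorum mechanism in use, so the function names and the node counts in the usage note are hypothetical.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def has_quorum(cluster_size, healthy_nodes):
    """True when enough nodes are healthy to form a majority."""
    return healthy_nodes >= majority(cluster_size)


def tolerated_failures(cluster_size):
    """How many nodes can fail before the majority is lost."""
    return cluster_size - majority(cluster_size)
```

Under this assumption, a hypothetical three-node cluster tolerates one failed node, so `has_quorum(3, 2)` holds but `has_quorum(3, 1)` does not. A second node becoming unavailable, for example under reconnection load, would be enough to lose the majority.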
Contributing Factors
• Elevated queue volume increased memory pressure on the affected node.
• Client reconnection attempts amplified load on remaining nodes during recovery.
• A known cluster configuration limitation allows a single-node failure to propagate to the entire cluster.
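One common way to limit the reconnection amplification listed above is client-side exponential backoff with jitter. The following is a minimal sketch, not a description of how EU2 clients currently behave. The function name `backoff_delays` and the `base` and `cap` values are assumptions chosen for illustration.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return reconnect delays in seconds using exponential backoff with full jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # Exponential ceiling: base * 2^attempt, capped so waits stay bounded.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: draw uniformly from [0, ceiling] so clients do not
        # all retry at the same instant after a broker restart.
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

A client would sleep for each delay in turn between connection attempts. Randomizing the wait spreads retries over time, which reduces the synchronized load on nodes that are still recovering.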
Impact
• ~1 hour 48 minutes of total AMQP service unavailability on EU2 (across two separate outage periods).
• All services dependent on RabbitMQ messaging were affected.
• All EU2 customers experienced complete service disruption during the outage windows.
Completed Actions (Post-Incident Remediation)
1. Stabilized recovery procedure – Validated and executed node deployment process for accelerated recovery.
2. Verified cluster health – Ensured all nodes fully synchronized and operational before restoring traffic.
3. Initiated review of memory monitoring thresholds – Identified gaps in early-warning alerting for node memory pressure.
Ongoing Improvements
• Architectural hardening of the messaging layer to improve cluster resilience and fault isolation.
• Enhanced monitoring and alerting to enable earlier detection of resource pressure.
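As an illustration of earlier detection of memory pressure, the sketch below classifies each node by how close its memory use is to its configured limit. The per-node fields `name`, `mem_used`, and `mem_limit` are modeled on the node statistics that RabbitMQ's management interface reports, but the input shape, the function names, and the 70% and 90% thresholds are assumptions for illustration, not the values chosen by the team.

```python
def memory_pressure(mem_used, mem_limit, warn_ratio=0.7, critical_ratio=0.9):
    """Classify memory use relative to the configured limit."""
    if mem_limit <= 0:
        raise ValueError("mem_limit must be positive")
    ratio = mem_used / mem_limit
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


def nodes_needing_attention(nodes, warn_ratio=0.7, critical_ratio=0.9):
    """Map node name to alert level for every node that is not 'ok'."""
    flagged = {}
    for node in nodes:
        level = memory_pressure(
            node["mem_used"], node["mem_limit"], warn_ratio, critical_ratio
        )
        if level != "ok":
            flagged[node["name"]] = level
    return flagged
```

Alerting at a warning level well below the limit gives operators time to act before a node reaches the point where the process is terminated.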
What Went Well
• External monitoring (Pingdom) detected the failure within minutes.
• Team response was rapid: DevOps engaged within 2 minutes of the alert.
• Deploying additional nodes accelerated the recovery process.
What Went Wrong
• The known cascading failure limitation had not been addressed before this incident.
• No proactive memory alerting was in place to warn before the OOM threshold was reached.
• Recovery complexity extended the total downtime beyond initial expectations.
We deeply regret the impact this incident had on our users and teams. Ensuring resilient and highly available messaging infrastructure remains a top priority.
We are implementing architectural changes to our messaging layer, strengthening monitoring, and improving operational safeguards to significantly reduce the likelihood and impact of similar events in the future.
Owner: DevOps Team
We've now resolved the incident. Thanks for your patience.
✅ We’ve completed our investigation! If there was an issue, it should now be resolved, and everything should be back up and running. If this was a false alarm, we’ll be reviewing our monitoring setup to minimize these hiccups in the future.
🔄 If you're still experiencing problems, please try refreshing the page or reaching out to our support team. Thanks for your patience! 🚀
We're still working to resolve it.
We've fixed the core issue, and are waiting for things to recover.
We've confirmed there is a problem and are working to resolve it.
⚠️ We are aware of an issue affecting the EU2 environment. Some services are currently unavailable due to a messaging infrastructure failure. Our team is actively working on recovery.
We will provide updates as the situation progresses.