• Date: February 19, 2026
• Duration: 12:30 – 13:00 CET
• Impact:
  • Full AMQP service outage affecting all connected services and event-driven workflows.
  • RabbitMQ cluster instability following an Out-Of-Memory (OOM) event on a single node, which escalated into a cluster-wide failure.
  • Approximately 30 minutes of downtime and partial message loss due to the required cluster recovery.
We sincerely apologize for the disruption this caused. Reliable messaging is a critical component of our platform, and we deeply regret the inconvenience experienced by our users and teams.
The issue was resolved by temporarily detaching RabbitMQ from the load balancer, restoring the cluster to a clean, stable state, and re-enabling inter-node communication before resuming traffic.
✅ Detached RabbitMQ from load balancer – Reduced connection churn and stabilized recovery efforts.
✅ Performed clean-state cluster restore – Allowed all nodes to return to a healthy state.
✅ Restored inter-node cluster communication – Verified proper synchronization across nodes before reattaching traffic (see the health-check sketch below).
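For future incidents, a pre-reattach health check can confirm that every node is running and free of memory alarms before traffic is restored. The sketch below is one possible approach using the RabbitMQ management HTTP API; the host, port, and credentials are placeholders, not values from our environment.

```python
import requests

# Placeholder endpoint and credentials for the RabbitMQ management API.
MGMT_URL = "http://rabbitmq.internal:15672"
AUTH = ("monitor", "changeme")

def cluster_ready() -> bool:
    """Return True only if every node is running with no memory alarm."""
    nodes = requests.get(f"{MGMT_URL}/api/nodes", auth=AUTH, timeout=5).json()
    return all(node["running"] and not node["mem_alarm"] for node in nodes)

if __name__ == "__main__":
    print("safe to reattach" if cluster_ready() else "cluster not ready")
```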
At 12:28 CET, one RabbitMQ node experienced a sudden memory spike, climbing from approximately 3 GiB to over 5 GiB within seconds. Kubernetes terminated the Erlang VM process due to the resulting Out-Of-Memory (OOM) event.
Following the restart, the node was unable to fully stabilize due to the large number of existing queues, which significantly slowed the recovery process.
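One way to catch a spike like this before the kernel does is to poll each node's memory usage against the broker's configured limit. This is a minimal sketch against the management API, which reports `mem_used` and `mem_limit` per node; the 80% threshold is an illustrative choice, not a value from this incident.

```python
import requests

MGMT_URL = "http://rabbitmq.internal:15672"   # placeholder host/port
AUTH = ("monitor", "changeme")                # placeholder credentials
WARN_RATIO = 0.8                              # illustrative alert threshold

def memory_headroom() -> None:
    """Warn when any node approaches its configured memory limit."""
    nodes = requests.get(f"{MGMT_URL}/api/nodes", auth=AUTH, timeout=5).json()
    for node in nodes:
        ratio = node["mem_used"] / node["mem_limit"]
        if ratio >= WARN_RATIO:
            print(f"{node['name']}: {ratio:.0%} of memory limit used")

memory_headroom()
```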
Contributing Factors
• Elevated queue volume increased memory pressure during node recovery.
• Client reconnection attempts amplified load on remaining nodes (a backoff sketch follows this list).
• Recovery required controlled traffic isolation to prevent further cascading effects.
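To illustrate the reconnection factor above: client-side exponential backoff with jitter spreads reconnection attempts out over time instead of sending them to the surviving nodes in synchronized waves. This is a sketch using the pika client; the host and retry bounds are assumptions, not our production settings.

```python
import random
import time

import pika
from pika.exceptions import AMQPConnectionError

def connect_with_backoff(host="rabbitmq.internal", max_delay=60.0):
    """Retry the AMQP connection with exponential backoff plus jitter,
    so that mass client reconnects do not arrive as a thundering herd."""
    delay = 1.0
    while True:
        try:
            return pika.BlockingConnection(pika.ConnectionParameters(host=host))
        except AMQPConnectionError:
            # Sleep a randomized fraction of the current delay, then grow it.
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, max_delay)
```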
Impact
• 30 minutes of full AMQP service unavailability.
• All services dependent on RabbitMQ messaging were affected.
• Some in-flight or queued messages were not recoverable following the cluster restore (see the durability sketch after this list).
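Message loss of this kind can be narrowed, though not eliminated, by combining durable queues, persistent delivery, and publisher confirms. The sketch below shows the pattern with the pika client; the queue and host names are placeholders.

```python
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbitmq.internal")  # placeholder host
)
channel = connection.channel()

# A durable queue survives a broker restart; publisher confirms make the
# publish block until the broker has taken responsibility for the message.
channel.queue_declare(queue="orders", durable=True)
channel.confirm_delivery()

channel.basic_publish(
    exchange="",
    routing_key="orders",
    body=b"example payload",
    properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
)
connection.close()
```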
Completed Actions (Post-Incident Remediation)
1. Stabilized recovery procedure – Defined safe detach/attach process for the load balancer.
2. Validated cluster configuration consistency – Ensured required services for inter-node communication are present.
3. Initiated queue lifecycle policy review – Strengthened governance around queue management (an example expires policy follows this list).
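As one possible outcome of the lifecycle review in item 3, an `expires` policy can remove queues that sit unused, keeping the total queue count (and therefore node restart time) bounded. The sketch below applies such a policy through the management API; the pattern, TTL, policy name, and credentials are illustrative, not the values we adopted.

```python
import requests

MGMT_URL = "http://rabbitmq.internal:15672"  # placeholder host/port
AUTH = ("admin", "changeme")                 # placeholder credentials

# Delete matching queues after one hour (3600000 ms) without use.
# "%2F" is the URL-encoded default vhost "/".
policy = {
    "pattern": "^tmp\\.",        # illustrative: only temporary queues
    "definition": {"expires": 3600000},
    "apply-to": "queues",
}
resp = requests.put(
    f"{MGMT_URL}/api/policies/%2F/expire-idle-queues",
    json=policy,
    auth=AUTH,
    timeout=5,
)
resp.raise_for_status()
```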
Long-Term Improvements
• Enhance queue lifecycle controls
  • Reinforce automated policy enforcement during cluster provisioning and migration.
• Automate recovery workflows
  • Implement protective measures to reduce reconnection storms during partial outages.
What Went Well
• Monitoring detected the node failure immediately and triggered alerts.
• Traffic isolation from the load balancer helped prevent prolonged cascading failure.
• Team response time was rapid and coordinated.
What Went Wrong
• Recovery was more complex than anticipated under load.
• Manual intervention under pressure increased operational risk.
We deeply regret the impact this incident had on our users and teams. Ensuring resilient and highly available messaging infrastructure remains a top priority.
We are strengthening monitoring, governance, and recovery automation to further reduce the likelihood and impact of similar events in the future.
Owner: DevOps Team