EU2 RabbitMQ Unexpected Downtime

Uptime Impact: 13 minutes and 44 seconds
Resolved

Incident Summary

  • Date: Today
  • Duration: 12:40 - 13:20 (Time Zone)
  • Impact:
    • RabbitMQ services experienced memory depletion, leading to the locking of virtual host exchanges.
    • Connection load on port 5671 (the AMQPS listener) exacerbated the instability.
    • Removing the load balancer to mitigate the connection issues resulted in customer-visible downtime.

We sincerely apologize for the disruption this caused. We understand the importance of service availability and deeply regret any inconvenience to our users and teams.

Resolution

The issue was resolved by completely shutting down RabbitMQ servers, clearing caches, and restarting nodes sequentially.

Remediation Actions Taken:

  1. Attempted service restarts - Initial troubleshooting step; it did not resolve the issue.
  2. Removed the load balancer - Mitigated the connection issues on port 5671, at the cost of temporary customer-visible downtime.
  3. Complete RabbitMQ shutdown and sequential restart - Ensured a clean start and re-established cluster connections (a minimal sketch of this sequence appears after this list).
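
For reference, the sequence below is a minimal sketch of the full-shutdown-and-sequential-restart procedure, assuming SSH access to the broker hosts and the node names from this report (rabbitmq-0/1/2). The cache-clearing step is environment-specific and omitted; this is not the exact runbook used during the incident.

```python
"""Sketch of a full-stop, sequential-start RabbitMQ recovery."""
import subprocess

NODES = ["rabbitmq-0", "rabbitmq-1", "rabbitmq-2"]  # node names from this report

def ssh(host: str, *cmd: str) -> None:
    # Run a command on a broker host over SSH; fail loudly on error.
    subprocess.run(["ssh", host, *cmd], check=True)

# 1. Stop the RabbitMQ application on every node before starting any.
#    A full stop avoids the split-brain seen when individual nodes were
#    restarted while their peers were still running in a degraded state.
for host in reversed(NODES):
    ssh(host, "rabbitmqctl", "stop_app")

# 2. Start nodes one at a time. The last node stopped (rabbitmq-0 here)
#    starts first, so it does not block waiting for peers that hold a
#    newer view of the cluster.
for host in NODES:
    ssh(host, "rabbitmqctl", "start_app")
    ssh(host, "rabbitmqctl", "await_startup")  # block until fully booted

# 3. Confirm the cluster re-formed with all three nodes and no partitions.
ssh(NODES[0], "rabbitmqctl", "cluster_status")
```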

Root Cause Analysis

What Happened

  • A RabbitMQ alert reported high memory usage at around 12:00.
  • Memory depletion led to the locking of virtual host exchanges, preventing normal operations.
  • Restarting services did not resolve the issue and resulted in a split-brain scenario, in which rabbitmq-1 could not connect to rabbitmq-0 and rabbitmq-2 (a partition check is sketched after this list).
  • Connection load on port 5671 was contributing further instability, leading to the decision to remove the load balancer; this, however, resulted in customer-visible downtime.
  • The final resolution required a full shutdown, cache clearance, and sequential restart of all RabbitMQ nodes.
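
A split-brain like the one between rabbitmq-1 and its peers surfaces as a network partition in the node status. The snippet below is a minimal check against the RabbitMQ management API, assuming the management plugin is enabled; the host and credentials are placeholders.

```python
"""Minimal partition check via the RabbitMQ management API."""
import requests

API = "http://rabbitmq-0:15672/api/nodes"  # placeholder host
AUTH = ("monitoring", "secret")            # placeholder credentials

for node in requests.get(API, auth=AUTH, timeout=10).json():
    status = "running" if node.get("running") else "DOWN"
    partitions = node.get("partitions", [])
    # A non-empty partitions list means this node cannot reach the
    # listed peers -- the split-brain state described above.
    print(f'{node["name"]}: {status}, partitions={partitions}')
```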

Why It Happened

  • High Memory Usage - RabbitMQ exhausted available memory, triggering the locking of exchanges. This is consistent with RabbitMQ's memory alarm, which blocks publishing connections once the configured high watermark is crossed.
  • Ineffective Restarts - Restarting services did not address the underlying memory issue and led to a split-brain cluster state.
  • Uncontrolled Connection Load - Incoming traffic through port 5671 may have exacerbated the issue.

Impact

  • RabbitMQ services were unavailable, leading to failed message exchanges.
  • Customer-visible downtime occurred when the load balancer was removed.
  • Cluster instability resulted in delayed recovery time.

Corrective & Preventive Measures

Completed Actions (Post-Incident Remediation)

  1. Cleared caches and monitored memory usage - Confirmed that RabbitMQ restarted without excessive memory consumption.
  2. Reconfigured the load balancer reattachment process - Reduced the customer impact of future load balancer changes.

📌 Long-Term Improvements

  • Improve Memory Monitoring & Alerts
    • Implement automated alerts and threshold-based actions so memory depletion is caught before critical levels are reached (a monitoring sketch follows this list).
  • Refine RabbitMQ Restart Procedures
    • Update documentation to prevent split-brain scenarios and ensure clean recovery steps.
  • Optimize Load Balancer Failover Strategy
    • Implement failover mechanisms to minimize customer impact during RabbitMQ failures.
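
As a starting point for the memory-monitoring improvement above, the sketch below polls the management API's per-node memory figures and alerts at a configurable threshold, well before RabbitMQ's own alarm trips. The host, credentials, and alert hook are placeholders.

```python
"""Threshold-based memory check via the RabbitMQ management API."""
import requests

API = "http://rabbitmq-0:15672/api/nodes"  # placeholder host
AUTH = ("monitoring", "secret")            # placeholder credentials
WARN_RATIO = 0.8  # alert at 80% of the node's memory limit

def alert(message: str) -> None:
    # Placeholder: page the on-call or post to the incident channel.
    print("ALERT:", message)

def check_memory() -> None:
    for node in requests.get(API, auth=AUTH, timeout=10).json():
        used, limit = node["mem_used"], node["mem_limit"]
        ratio = used / limit if limit else 0.0
        if node.get("mem_alarm"):
            alert(f'{node["name"]}: memory alarm ACTIVE ({used}/{limit} bytes)')
        elif ratio >= WARN_RATIO:
            alert(f'{node["name"]}: memory at {ratio:.0%} of limit')

if __name__ == "__main__":
    check_memory()
```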

Apology & Commitment to Improvement

We deeply regret the impact this incident had on our users and teams. Ensuring seamless service availability is our priority, and we acknowledge that this issue could have been avoided with better memory management and failover procedures.

We are committed to preventing similar incidents in the future by implementing proactive monitoring, structured restart guidelines, and improved load balancing strategies. Thank you for your patience and trust as we work to strengthen our systems.

Owner: Matus Szepe

Resolved

We've now resolved the incident. Thanks for your patience.

Recovering

We've fixed the core issue, and are waiting for things to recover.

Investigating

We are currently investigating the issue. Updates will be shared as soon as possible.

Affected components
  • EU2 DataCenter
    • AMQP