EU Datacenter AMQP outage

Uptime Impact: 19 minutes and 46 seconds
Resolved
Updated

Postmortem Incident Report – AMQP Connectivity

Incident Summary

• Date: January 7, 2026
• Duration: 13:22 – 14:20 CET
• Affected Service: AMQP (RabbitMQ)
• Impact: Partial to full service unavailability during peak traffic hours

On January 7, 2026, we experienced an incident affecting AMQP connectivity. The incident started as a partial degradation and escalated into a full service outage during peak usage hours.

The service was fully restored by 14:20 CET. We sincerely apologize for the disruption and appreciate your patience.

What Happened

• 13:22 CET – One backend messaging node became unavailable, reducing overall system capacity.
• Some clients experienced intermittent connection issues.
• During peak hours, traffic shifted to the remaining nodes, significantly increasing system load.
• 13:54 CET – AMQP connectivity became unavailable for all customers.
• Recovery procedures were initiated immediately, and service was fully restored by 14:20 CET.

Customer Impact

• Some customers experienced connection failures or delays earlier in the incident.
• From 13:54 to 14:20 CET, AMQP connectivity was unavailable for all customers.
• Message processing was temporarily interrupted during the outage period.

Resolution

To restore service as quickly and safely as possible:

• The underlying infrastructure was stabilized.
• The messaging service was restarted using a fresh deployment to minimize recovery time.
• All data volumes were backed up prior to recovery to preserve investigation options.
• Service availability and message flow were verified before reopening access.

The system has been operating normally since recovery.

Data Integrity

✔️ No data loss has been identified as a result of this incident. Messages were temporarily unavailable while the service was down; normal processing resumed after recovery.

Corrective & Preventive Measures

We are implementing the following improvements to reduce the likelihood and impact of similar incidents:

• Increased capacity to better handle peak traffic, even during partial component failures.
• Improved early detection of partial service degradation before it escalates.
• Enhanced monitoring and alerting to ensure visibility remains available under high system load.
• Additional safeguards to reduce the impact of reconnect storms during partial outages.

Apology & Commitment to Improvement

We understand how critical reliable messaging is for your systems and workflows. We deeply regret the impact this incident had and take full responsibility for improving our resilience and detection capabilities.

We are committed to learning from this event and strengthening our systems to prevent similar incidents in the future. Thank you for your continued trust.

Owner: Platform Operations Team

Avatar for
Resolved

We've now resolved the incident. Thanks for your patience.

Avatar for
Recovering

We've fixed the core issue, and are waiting for things to recover.

Avatar for
Investigating

We are currently experiencing an outage affecting RabbitMQ. The service is down, and the team is actively investigating the issue.

Further updates will be provided as more information becomes available.

Avatar for
Began at:

Affected components
  • EU DataCenter
    • AMQP