EU Datacenter Roger API Memory Exhaustion

Resolved

Postmortem Incident Report

Incident Summary

  • Date: November 9, 2024
  • Duration: 5:12 PM - 8:20 PM (UTC), approximately 3 hours and 8 minutes
  • Impact:
    • The Roger API became unresponsive due to memory exhaustion caused by a slow memory leak.
    • Memory depletion caused application downtime until an automatic restart was triggered, resulting in service disruption.
    • Requests to the Roger API failed with HTTP 500 errors.
    • All clients on the EU cluster that interacted with Roger API over the weekend were impacted.
    • The SynchronizeCloudFilesWithDbRecordsBackgroundJob failed during the outage.

We sincerely apologize for the disruption this caused. We understand the importance of service availability and deeply regret any inconvenience to our users and teams.

Timeline

Detection

  • The incident was detected via alerts sent by Alertmanager to a Teams channel.
  • The first alert about missing Roger API deployment replicas was received on Saturday, Nov 9, 2024, at 5:12 PM (UTC).

Investigation

  • Grafana and Graylog/ES were the main tools used for investigation.
  • One of the Roger API pods was restarted around 6:00 PM (UTC) on Nov 9, 2024, to diagnose why replicas were missing.

Resolution

  • The Kubernetes deployment of Roger API was restarted to create new pods and free the allocated memory.
  • Services were fully restored by 8:20 PM (UTC).
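
For reference, the restart followed the standard Kubernetes rollout-restart pattern: patching the pod-template annotation forces new pods to be created, which releases the leaked memory. The sketch below shows this with the Python Kubernetes client; the deployment name "roger-api" and namespace "eu" are illustrative assumptions, not the actual resource names.

```python
# Minimal sketch of the restart step, assuming deployment "roger-api" in
# namespace "eu" (both names are illustrative, not the real resources).
# Patching the pod-template annotation is the same mechanism that
# `kubectl rollout restart` uses: Kubernetes recreates the pods, which
# releases the leaked memory.
from datetime import datetime, timezone

from kubernetes import client, config


def rollout_restart(name: str = "roger-api", namespace: str = "eu") -> None:
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    patch = {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt":
                            datetime.now(timezone.utc).isoformat()
                    }
                }
            }
        }
    }
    client.AppsV1Api().patch_namespaced_deployment(name, namespace, patch)


if __name__ == "__main__":
    rollout_restart()
```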

Root Cause Analysis

What Happened

  • Due to a memory leak in Roger API, memory usage in the pods reached the limit, causing health check endpoints to report an unhealthy state.
  • Readiness probes detected the unhealthy state and marked the pods as unready.
  • Unready pods were removed from the Kubernetes service load balancer, causing requests to fail with HTTP 500 errors.
  • The garbage collector could not free enough allocated objects to bring memory usage back below the health check threshold.
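
To illustrate the failure mode above: the sketch below is a minimal, hypothetical health endpoint (not the actual Roger API code) that reports healthy until process memory crosses a threshold, at which point the readiness probe starts failing and Kubernetes pulls the pod out of the Service. The 512 MiB threshold and the psutil dependency are illustrative assumptions.

```python
# Minimal, hypothetical health endpoint (not the actual Roger API code):
# it reports healthy until process memory crosses a threshold, at which
# point the Kubernetes readiness probe starts failing and the pod is
# removed from the Service endpoints. psutil and the 512 MiB threshold
# are illustrative assumptions.
from http.server import BaseHTTPRequestHandler, HTTPServer

import psutil  # assumed dependency for reading process memory

MEMORY_THRESHOLD_BYTES = 512 * 1024 * 1024  # hypothetical limit


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        rss = psutil.Process().memory_info().rss  # current resident memory
        if rss < MEMORY_THRESHOLD_BYTES:
            self.send_response(200)  # probe passes; pod stays in rotation
        else:
            self.send_response(503)  # probe fails; pod is marked unready
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```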

Why It Happened

  • Memory leak in application code - Unreleased objects caused memory to build up gradually over time.
  • Lack of early detection mechanisms - No real-time monitoring caught the leak before it became critical (a monitoring sketch follows this list).
  • Alerts lacked actionable insights - Notifications did not contain enough information to diagnose or resolve the issue efficiently.
  • Delayed response - The incident began over the weekend, which slowed the response.
  • Known issue but delayed fix - The memory leak was first reported on Oct 17, 2024, but the fix took time to test and merge.
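
As referenced above, this is a minimal sketch of the kind of early-detection check that was missing: sample process memory on an interval and warn when it keeps growing over a window, long before the pod limit is reached. The sampling interval, growth threshold, and psutil dependency are illustrative assumptions, not the monitoring that was actually deployed.

```python
# Minimal sketch of an early-warning memory trend check. The interval,
# window size, growth threshold, and psutil dependency are illustrative
# assumptions rather than the monitoring that was actually deployed.
import time
from collections import deque

import psutil

WINDOW = 30                       # number of samples kept (e.g. 30 minutes at 60 s)
INTERVAL_S = 60                   # sampling interval in seconds
GROWTH_BYTES = 50 * 1024 * 1024   # warn if memory grows > 50 MiB over the window


def watch() -> None:
    samples = deque(maxlen=WINDOW)
    while True:
        samples.append(psutil.Process().memory_info().rss)
        if len(samples) == WINDOW and samples[-1] - samples[0] > GROWTH_BYTES:
            # In a real setup this would page on-call or push to Alertmanager.
            print(f"WARNING: memory grew {samples[-1] - samples[0]} bytes "
                  f"over the last {WINDOW * INTERVAL_S} s - possible leak")
        time.sleep(INTERVAL_S)


if __name__ == "__main__":
    watch()
```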

Impact

  • Affected Services: Primary application services (Roger API) were unavailable.
  • User Impact: Clients on the EU cluster were unable to interact with Roger API.
  • Business Impact: Potential revenue loss and operational disruption.
  • Background job failure: The SynchronizeCloudFilesWithDbRecordsBackgroundJob failed during the outage.

Corrective & Preventive Measures

Completed Actions (Post-Incident Remediation)

  1. Memory leak identified and fixed - Root cause resolved through profiling and patching.
  2. Regular server restarts (temporary fix) - Automated restarts implemented to prevent memory exhaustion.
  3. Enhanced monitoring and alerting - Introduced real-time memory usage checks to detect anomalies earlier.
  4. Restarted Kubernetes deployment - Cleared memory allocation and restored services.

📌 Long-Term Improvements

  • Improved memory management checks before releases
    • Memory leak detection added to sandbox testing before deployments (a sketch of such a check follows this list).
  • Enhanced monitoring and logging
    • Proactive alerting for unusual memory consumption with detailed diagnostic data.
  • Optimized restart mechanisms
    • Reduced restart time to minimize service downtime.
  • Clearer and more actionable alerts
    • Alert messages now include resolution steps and escalation procedures.
  • Weekend on-call rotation review
    • Improved alert handling and response strategy for weekend incidents.
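
As a concrete illustration of the sandbox memory-leak detection mentioned above, the following is a minimal sketch (not the actual test suite) using Python's standard tracemalloc module to flag code paths whose allocations keep growing across repeated iterations. The exercise_endpoint function and the growth threshold are hypothetical placeholders.

```python
# Minimal sketch of a pre-release leak check using the standard library's
# tracemalloc. exercise_endpoint() and the allowed-growth threshold are
# hypothetical placeholders for whatever workload the sandbox tests drive.
import tracemalloc


def exercise_endpoint() -> None:
    # Placeholder: in a real test this would call the code path under test
    # (e.g. issue a request against a sandbox deployment of the API).
    _ = [object() for _ in range(1000)]


def test_no_memory_growth(iterations: int = 50,
                          allowed_growth_bytes: int = 1 * 1024 * 1024) -> None:
    tracemalloc.start()
    exercise_endpoint()                      # warm-up so one-time caches don't count
    baseline = tracemalloc.take_snapshot()

    for _ in range(iterations):
        exercise_endpoint()

    current = tracemalloc.take_snapshot()
    growth = sum(stat.size_diff for stat in current.compare_to(baseline, "lineno"))
    tracemalloc.stop()

    assert growth < allowed_growth_bytes, (
        f"allocations grew by {growth} bytes over {iterations} iterations - "
        "possible memory leak"
    )


if __name__ == "__main__":
    test_no_memory_growth()
    print("no significant memory growth detected")
```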

Apology & Commitment to Improvement

We deeply regret the impact this incident had on our users and teams. Ensuring seamless service availability is our priority, and we acknowledge that this issue could have been avoided with better monitoring, alerting, and response strategies.

We are committed to preventing similar incidents in the future by implementing enhanced memory monitoring, optimized restart mechanisms, clearer alerts, and improved weekend response procedures. Thank you for your patience and trust as we work to strengthen our systems.

Owner: Matus Szepe

Began at: November 9, 2024, 5:12 PM (UTC)

Affected components
  • EU DataCenter
    • API