On Wednesday, August 24th around 17:00 UTC Cloudera SRE team received reports of high error rates for many internal services in the CDP Public cloud control plane in the US-WEST-2 region. Upon investigation it was found that it was caused due to increased error rates and latencies from API Gateway endpoints. Approximately after 40 minutes the issue from AWS was resolved.
However, most of our internal services were still experiencing high error rates, hence the team investigated the issue further. After multiple debugging attempts, it was found that the service used to store the key service information was stuck waiting for additional resources. On restarting the service, the same was resolved. This was confirmed by reduced errors rate from internal services.
As a mitigation item the team is enhancing the service so that we have better error handling and also self-heal in possible scenarios.