Intermittent issues in US-West Control Plane
Incident Report for Cloudera
Postmortem

On Wednesday, August 24th around 17:00 UTC Cloudera SRE team received reports of high error rates for many internal services in the CDP Public cloud control plane in the US-WEST-2 region. Upon investigation it was found that it was caused due to increased error rates and latencies from API Gateway endpoints. Approximately after 40 minutes the issue from AWS was resolved. 

However, most of our internal services were still experiencing high error rates, hence the team investigated the issue further. After multiple debugging attempts, it was found that the service used to store the key service information was stuck waiting for additional resources. On restarting the service, the same was resolved. This was confirmed by reduced errors rate from internal services. 

As a mitigation item the team is enhancing the service so that we have better error handling and also self-heal in possible scenarios.

Posted Aug 31, 2022 - 09:50 UTC

Resolved
Due to issues in AWS us-west-2, some CDP services were intermittently unavailable or slow to respond. For more information, see https://status.aws.amazon.com/

We will investigate further to determine if any mitigation could be done if this issue occurs again, and if automatic notifications can be posted when there are AWS issues.
Posted Aug 24, 2022 - 20:24 UTC