SSO was offline impacting all CDP experiences
Incident Report for Cloudera
Postmortem

On Wednesday, July 21, 2021, at 08:58 US Pacific time, Cloudera Single Sign On experienced an outage affecting access to all CDP products. Our SREs became aware of the issue via automated monitoring and alerting. Users attempting to authenticate would have seen "SKY 002" error messages. Other users would have seen different effects impeding their ability to use the platform. The total downtime was 60 minutes.

The service appears to have stopped because it ran out of available memory during a high-usage period. Restarting the instances hosting the service brought it back online.

Corrective actions:

  • Completed: The service configuration has been changed to increase the available memory.
  • Longer term: The SSO team will work with SRE on autoscaling the service.
Posted Jul 30, 2021 - 16:45 UTC

Resolved
At 08:58 US Pacific time (15:38 UTC), CDP SSO went down. We found the root cause and resolved it. Total outage time was 60 minutes, according to remote monitoring. During that time, all CDP experiences would have seen an impact. Logins were showing the error message 'SKY 002 - One of our systems is down for maintenance and we were unable to process your request.'
Posted Jul 21, 2021 - 16:00 UTC