We are experiencing intermittent login issues with CDP.
Incident Report for Cloudera
Postmortem

On July 6, 2021 at 08:50 PDT, the CDP Control Plane became inaccessible to new sessions. User login completed successfully, but the Control Plane welcome screen did not render and users saw a 503 error instead. This condition persisted for 57 minutes.

Analysis revealed that the active Vault Secrets Store was serving an expired certificate. Although the Vault Secrets Store certificate had been renewed prior to the expiration date, the running Vault service did not recognize the new certificate. In previous years, Vault had been restarted for other reasons between the renewal and the expiration date, masking the process defect. An expired Vault certificate disables access to the Secrets Store. The Audit Service relies on the Secrets Store to function. Auditing is central to our application architecture. By design, a loss of the Audit Service shuts down access to the Control Plane. The blast radius of this incident was limited to Control Plane and cluster management operations. No customer workload cluster was impacted by this incident.

Restarting the Vault Secrets Store Service allowed the service to recognize the new (valid) certificate, which resolved the incident.

Cloudera has identified the following corrective actions:

  • Improved Monitoring of both Vault and Audit. Both of these services shut down without triggering automated alerting.
  • Improved Monitoring of the Control Plane login process. The current monitoring is inadequate, as it failed to detect the error.
  • Improved documentation for Vault Certificate renewals that indicates a restart is required.
  • Automated Vault Certificate renewal.
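
The monitoring and renewal actions above both depend on knowing which certificate a running service actually presents, rather than which file was renewed on disk. The following is a minimal sketch of such a check, not Cloudera's tooling: the function names, the 30-day warning window, and the optional `cafile` parameter are all assumptions made for illustration, and the host and port would be whatever endpoint the service listens on.

```python
import socket
import ssl
import time
from typing import Optional


def seconds_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Seconds left before a certificate's notAfter time (negative if expired)."""
    current = time.time() if now is None else now
    return ssl.cert_time_to_seconds(not_after) - current


def classify(remaining_seconds: float, warn_days: int = 30) -> str:
    """Map remaining lifetime to a status; warn_days is an illustrative threshold."""
    if remaining_seconds <= 0:
        return "expired"
    if remaining_seconds < warn_days * 86400:
        return "expiring"
    return "ok"


def served_not_after(host: str, port: int, cafile: Optional[str] = None,
                     timeout: float = 5.0) -> str:
    """Return notAfter of the certificate the live endpoint presents.

    This inspects what the running process serves, so a certificate that
    was renewed on disk but never loaded still shows its old expiry date.
    """
    context = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_endpoint(host: str, port: int, cafile: Optional[str] = None,
                   warn_days: int = 30) -> str:
    """Classify the served certificate; a failed verification counts as a failure."""
    try:
        not_after = served_not_after(host, port, cafile=cafile)
    except ssl.SSLCertVerificationError:
        # An already expired (or untrusted) certificate fails the handshake.
        return "invalid"
    return classify(seconds_until_expiry(not_after), warn_days)
```

Run on a schedule against the live endpoint, a check of this shape would report "expiring" well before the expiration date if a renewed certificate had not been picked up, whether or not the service happened to be restarted in the meantime.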
Posted Jul 15, 2021 - 14:03 UTC

Resolved
This incident has been resolved.
Posted Jul 06, 2021 - 17:03 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 06, 2021 - 16:43 UTC
Identified
The issue has been identified and is being resolved.
Posted Jul 06, 2021 - 16:33 UTC
Investigating
We are currently investigating this issue.
Posted Jul 06, 2021 - 15:50 UTC
This incident affected: Cloudera Data Platform (US) (CDP Management Console) and Cloudera SSO.