On Monday, February 13th at 04:18 UTC, Cloudera SRE detected a spike in errors related to the CDP Control plane management console. After investigation, it was determined that a recent production change had caused the outage. Although Cloudera follows strict software development lifecycle standards, an unforeseen bug due to complex dependencies was still encountered. The issue was resolved by rolling back to the previous version.
To reduce the risk of similar issues in the future, we are improving our test suites, monitoring and dependency tracking to detect such scenarios as early in the development process as possible. Alongside also reviewing our rollback process to reduce mean time to recovery.