TL;DR: Golden Paths are standardised, automated procedures for managing cloud infrastructure that help prevent system failures like the one caused by the recent CrowdStrike update. By using predefined processes for tasks such as deploying virtual machines (VMs), creating boot images and rotating images, Golden Paths ensure controlled, secure updates and minimise the risk of uncontrolled changes that can lead to system failures. They provide a structured approach to managing servers, including automated testing, backups and staged deployment, enhancing both security and operational efficiency.
What began as a seemingly routine CrowdStrike update left Microsoft users stranded and system administrators scrambling for answers. The breakdown put security patching squarely in the news headlines, but make no mistake – this issue has been lurking in the background for some time.
The CrowdStrike update that sent Windows into chaos was fundamentally an issue with an uncontrolled change – in this case, to boot images. The software update had access to the Windows OS kernel, and the resulting change was catastrophic for the machine and the business functions relying on it. In many cases, there was no automated rollback, and the update was so uncontrolled that many machines had already been updated before the problem became obvious. As the earth turned, the updates continued to brick machines at a frightening rate on a follow-the-sun basis.
But this type of issue has been waiting to happen for a long time. A desktop PC crashing is a big issue for a user, but a critical server crashing and being unable to reboot is a business-threatening event. Patching OS images is a frequent task, and many companies either handle parts of it as a manual process or allow automatic patching by their software providers. In the CrowdStrike incident, the patch was applied automatically for most users.
Disaster recovery (DR) and business continuity management (BCM) processes should have kicked in, but it’s likely that many companies – if they had active BCM – automatically applied the patch to both the primary and secondary nodes of their clusters, bringing everything to a stop until they could be manually restarted.
So, keeping the headline’s question in mind, how do Golden Paths help prevent and limit the impact of these types of public and private events? We’ll use this blog to unearth that answer.
What value does a Golden Path bring to your cloud security setup?
For now, you can think of Golden Paths as predefined routes (or instructions) in a platform that outline the best practices, steps and required tools to achieve a specific goal – for example, deploying a VM for a developer.
They are well-structured, repeatable pathways that guide developers through building, deploying and managing their cloud infrastructure. Golden Paths should cover Day 0, Day 1 and Day 2 operations. What are the differences? Put simply, Day 0 is the design, Day 1 is the provisioning (deployment) and Day 2 is the ongoing operation of that resource.
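To make that split concrete, here is a minimal, purely illustrative sketch – the names and fields below are hypothetical, not our actual tooling – of how a Golden Path might be modelled as data, with Day 0, Day 1 and Day 2 concerns kept separate:

```python
# Hypothetical sketch: one way to model a Golden Path as data,
# splitting Day 0 (design), Day 1 (provisioning) and Day 2 (operations).
from dataclasses import dataclass


@dataclass
class GoldenPath:
    name: str
    day0_design: dict         # sizing, network layout, labels to enforce
    day1_provisioning: list   # ordered, automated deployment steps
    day2_operations: list     # recurring tasks: patching, backups, rotation


vm_golden_path = GoldenPath(
    name="virtual-machine",
    day0_design={"size": "standard-4", "business_criticality": "required-label"},
    day1_provisioning=["validate request", "select stable boot image", "deploy VM"],
    day2_operations=["rotate boot image", "back up boot disk", "patch OS"],
)
```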
When we use Golden Paths at Endava, we automate all of those processes. For a simple VM, this would appear to be straightforward. But the more we discussed the topic, the larger it became.
In the case of the CrowdStrike issue, a Golden Path for the VM would have prevented or limited the impact on virtualised servers. Why? Because our VM Golden Path pulls the boot image out as its own Golden Path. When we deploy a VM, we specify the version of the Boot Image Golden Path that is required, and we only allow teams to select from a small number of Golden Path images.
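As a rough illustration of that constraint – the image names and helper function below are hypothetical – a VM request might be validated against a short allow-list of approved Golden Path boot images:

```python
# Illustrative only: a VM request must reference one of a small, approved set of
# Boot Image Golden Path versions; anything else is rejected.
APPROVED_BOOT_IMAGES = {
    "ubuntu-22.04-golden:v41",   # hypothetical image names and versions
    "ubuntu-22.04-golden:v42",
    "windows-2022-golden:v17",
}


def validate_vm_request(requested_image: str) -> None:
    """Reject any VM request whose boot image is not an approved Golden Path image."""
    if requested_image not in APPROVED_BOOT_IMAGES:
        raise ValueError(
            f"Image '{requested_image}' is not an approved Golden Path boot image"
        )


validate_vm_request("ubuntu-22.04-golden:v42")    # passes
# validate_vm_request("hand-built-image:latest")  # would raise ValueError
```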
We create images in a secured service known as the Image Bakery, an automated and codified environment that contains all the processes and tooling needed to create a standard, customised image for an organisation. Only images that have been baked in this process can be used as boot disks. The process also includes automated testing of images before they are promoted to being 'stable' images. We employ an approach that labels each Golden Path as Experimental (used for dev), Preview (used for testing) and Stable (used in Live).
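A simplified sketch of such a promotion gate – the stages mirror the labels above, while the function and image names are hypothetical – might look like this:

```python
# Hypothetical model of a promotion gate: a freshly baked image only earns the next
# stage label once the automated test for that stage has passed.
from enum import Enum


class Stage(Enum):
    CANDIDATE = 0      # just baked, not yet usable anywhere
    EXPERIMENTAL = 1   # dev only
    PREVIEW = 2        # testing
    STABLE = 3         # live


def promote(image: dict, boot_test_passed: bool) -> dict:
    """Move an image up one stage, but only if its automated tests passed."""
    if not boot_test_passed:
        return image                      # stays where it is; never reaches dev or prod
    next_stage = Stage(min(image["stage"].value + 1, Stage.STABLE.value))
    return {**image, "stage": next_stage}


image = {"name": "windows-2022-golden:v18", "stage": Stage.CANDIDATE}
image = promote(image, boot_test_passed=False)   # a faulty update stays a CANDIDATE
print(image["stage"])                            # Stage.CANDIDATE
```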
The ill-fated CrowdStrike update would have been picked up in the next image bake. We have processes that prevent Experimental builds from being used on VMs labelled as pre-prod (QA/Staging) or prod; they can only be released on dev-labelled VMs, and our Golden Path for VMs enforces standardised labelling of infra resources. Once the new image had been baked with the fatal update, the automated testing chain would have booted a dev machine with the new image, and that boot would have failed. The image would never have been promoted to Experimental, Preview or Stable, and our automated provisioning process would not allow it to be deployed.
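One way to picture that deployment gate – again a hypothetical sketch, not our production code – is a mapping from environment label to the lowest image stage it will accept:

```python
# Illustrative deployment gate: each environment label only accepts images at or
# above a minimum stage, so an Experimental image can never land on pre-prod or prod.
MINIMUM_STAGE = {        # environment label -> lowest acceptable image stage
    "dev": "experimental",
    "pre-prod": "preview",
    "prod": "stable",
}
STAGE_ORDER = ["experimental", "preview", "stable"]


def can_deploy(image_stage: str, environment_label: str) -> bool:
    required = MINIMUM_STAGE[environment_label]
    return STAGE_ORDER.index(image_stage) >= STAGE_ORDER.index(required)


print(can_deploy("experimental", "dev"))    # True
print(can_deploy("experimental", "prod"))   # False - blocked by the Golden Path
```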
Where do we go now?
That process is great for new servers, but it leaves people wondering how it works for already-provisioned machines and how it would have prevented a server from being booted with an updated image. Automating image rotation is a Day 2 operation for the VM Golden Path. Our VM Golden Path also includes automated backups of the boot image, and the frequency of the backup is determined by the Business_Criticality label that is automatically applied to each machine when it is requested. We back up low-criticality machines weekly and mission-critical servers every 15 minutes.
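As a sketch of how that label could drive the schedule – the 'medium' tier and the function name are assumptions, while the weekly and 15-minute figures come from above:

```python
# Hypothetical mapping from the Business_Criticality label applied at request time
# to an automated boot-image backup interval.
from datetime import timedelta

BACKUP_INTERVAL = {
    "low": timedelta(weeks=1),
    "medium": timedelta(days=1),              # assumed middle tier, not stated above
    "mission_critical": timedelta(minutes=15),
}


def backup_interval_for(vm_labels: dict) -> timedelta:
    """Read the criticality label on a VM and return its backup interval."""
    return BACKUP_INTERVAL[vm_labels.get("Business_Criticality", "low")]


print(backup_interval_for({"Business_Criticality": "mission_critical"}))  # 0:15:00
```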
How long we retain the images is also automated. Even though the image is from a Golden Path and, therefore, exists in our centralised repo, we know that server configuration can still drift if an engineer changes something on that server; that’s why we take a belt-and-braces approach and back up the images as well. In addition, our Golden Path for a VM will always deploy a blue/green VM cluster for high-criticality servers. That way, an image can be applied to the green servers first and traffic moved to them gradually; if a problem is encountered, traffic is moved back to blue.
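A highly simplified sketch of that blue/green rollout – the step sizes and health check below are illustrative only:

```python
# A simplified blue/green rollout: the new image goes to green, traffic shifts in
# steps, and any failed health check sends all traffic back to blue.
from typing import Callable


def rollout(health_check: Callable[[int], bool], steps=(10, 25, 50, 100)) -> str:
    """Shift traffic to green in stages; roll back to blue on the first failure."""
    for green_share in steps:
        if not health_check(green_share):
            return "rolled back: 100% of traffic on blue"
        print(f"{green_share}% of traffic now on green")
    return "rollout complete: 100% of traffic on green"


# Example: the new image starts failing once it carries half the traffic.
print(rollout(health_check=lambda share: share < 50))
```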
Our image rotation process is centralised and automated. There are many different views on the policy to follow – for instance, should dev teams be forced to upgrade, or merely nudged? Should a team be allowed to keep running old images, or be given a grace period to upgrade?
The answers to these questions are unique to each organisation, but the Golden Path for a VM can be written to automate the chosen policy. For example, an image containing a critical patch could be rolled out immediately and automatically to all servers labelled as dev. After a few hours, and assuming no incidents have been raised against that release (our Golden Paths can update the configuration management database, so we know if alerts are raised), the image can be rolled out to pre-prod automatically.
For prod, the patch could be applied automatically to low-criticality servers, with critical servers handled on a server-by-server basis. Once you adopt Golden Paths, the possibilities for far higher control alongside far higher automation become apparent.
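One possible, purely illustrative encoding of such a staged policy – the four-hour quiet period and the function below are assumptions, not a recommendation:

```python
# Illustrative staged rollout policy: dev gets a critical patch immediately,
# pre-prod follows after a quiet period with no incidents, and prod is split by
# criticality.
def rollout_action(environment: str, criticality: str,
                   hours_since_dev: float, open_incidents: int) -> str:
    if environment == "dev":
        return "apply immediately"
    if environment == "pre-prod":
        if hours_since_dev >= 4 and open_incidents == 0:
            return "apply automatically"
        return "wait"
    # prod
    if criticality == "low":
        return "apply automatically"
    return "schedule server-by-server, with manual approval"


print(rollout_action("pre-prod", "low", hours_since_dev=6, open_incidents=0))
print(rollout_action("prod", "mission_critical", hours_since_dev=6, open_incidents=0))
```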
Software updates are usually necessary, run-of-the-mill upgrades, but the wrong change at the wrong time can leave whole systems vulnerable to failure. Need tips to better prepare should another cloud catastrophe occur?