Securing Kubernetes Clusters – Part 1

Kubernetes has gained popularity among developers in a relatively short period of time as an easy way to develop and deploy containerised applications quickly. It also simplifies change management: fixing a problem becomes a straightforward matter of retiring the compromised running copy of your application, and the Kubernetes engine will do the rest, spinning up a shiny new copy of a fixed instance.

The important question, which is the focus of this two-part blog series, is how this looks from a security specialist’s point of view: does Kubernetes give a security expert enough control of the environment and the ability to assess its state?

What is a container management platform?

To understand Kubernetes and the role it plays, we first need to look at the factors that led to the development of such a platform. The age of virtualisation and the rise of container-based application development gave birth to a need for an administration and operations group to manage the provisioning, configuration and deployment of those applications in a controlled, consistent and transparent manner. The development of continuous integration, continuous delivery and DevOps ways of working drove widespread automation of the deployment process, through the philosophy of ‘everything as code’.

Mainstream use of containers for application packaging really began with the first release of the Docker engine in 2013, and since then, a comprehensive set of tools has developed around it to manage the definition of a Docker application service and its configuration options. Docker has created its own set of tools and a deployment definition framework, for example Docker Compose, Docker Swarm and Docker Machine.

Of course, Google had been running containers long before this, using its in-house platform Borg, which was a Google-internal project tightly tied to its own proprietary technologies. Due to its experience with large-scale container deployment, Google added many features to Borg to make deploying and operating large containerised applications easier. However, its dependency on its in-house technology meant that it wasn’t something that other organisations could use.

This is why, in 2014, Google launched a new open-source project called Kubernetes to provide a Borg-like container management platform, but one independent of its internal technology, for everyone to use.

With the rise of Kubernetes, container management and orchestration is no longer a proprietary technology, and it is slowly moving into a space of commodity utilities. The expectation is that it will be seen as a typical technology to support managed services and will become the standard way of managing containerised platforms. We are already seeing that it has been adopted by leading cloud providers who have wrapped it into various services to help delivery teams easily deploy, manage and scale containerised applications.

Under the hood

So, now that we know where container management technologies came from, let’s move deeper into the mechanics and focus on a part that often gets less attention: securing the cluster. As a security specialist, I admit that the term ‘securing’ is pretty vague. For the purpose of this blog, I will define it as security controls that will help us protect the confidentiality, integrity and availability of the cluster and the resources it manages.

The purpose of Kubernetes is to abstract the underlying container management technology, be it containers or virtual machines. The cluster and its policies are defined using a relatively simple declarative language, YAML, that is interpreted by the engine to retrieve the necessary resources to deploy our application and monitor its health.

Kubernetes makes it easy for a developer to define a new configuration and have it up and running in seconds, without being fully aware of the ways in which each element communicates with the others in the cluster. To be fair, that was the concept of Kubernetes in the first place when Google developers made it easy for their applications to communicate. Although that solves one of the main business problems – ‘we have a service and it runs’ – there is a certain degree of scepticism from security specialists as the tendency is to build everything in privileged mode and let everything talk to everything within your cluster.

A common mistake that many people make is to assume that a containerised environment like a Kubernetes cluster is ‘secure by default’. This is true to some extent, but only in a limited way.

A Kubernetes deployment is secure in some ways because the virtualisation engine isolates the processes and abstracts the underlying infrastructure and network, which allows you to create virtual environments and control their access to resources. The container platforms have also implemented native built-in security and isolation of the container runtime.

This means that containerised platforms provide all the prerequisites and mechanisms to create a robust and secure environment to deploy your applications in. However, we must be aware that this does not come ‘for free’ – you must define your policies and configure their enforcement in your deployment scripts.

Let’s consider some of the specific security factors that are important in a Kubernetes environment.

Network

There are two specific risks that I would like to discuss in this section:

container isolation problems (misconfiguration of that isolation)
deployment and running of components with known vulnerabilities

Let’s consider container isolation problems first. Our focus here is the possible misconfiguration of the pod isolation in Kubernetes, rather than the more general possible problem of vulnerabilities in the container isolation mechanisms.

Container isolation problems

Firstly, what are isolated and non-isolated pods?

The concept of pod isolation is similar to the concept of network access control policies in software-defined networks. In the context of Kubernetes, the difference between an isolated pod and a non-isolated pod is that a pod becomes isolated when there is a network policy that selects it.

Based on this definition, “a network policy is a specification of how groups of pods are allowed to communicate with each other and other network endpoints. NetworkPolicy resources use labels to select pods and define rules which specify what traffic is allowed to the selected pods”. You can find a full spec of a network policy resource on the Kubernetes website, but here’s an example:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - ipBlock:
        cidr: 172.17.0.0/16
        except:
        - 172.17.1.0/24
    - namespaceSelector:
        matchLabels:
          project: myproject
    - podSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 6379
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/24
    ports:
    - protocol: TCP
      port: 5978

By default, pods in Kubernetes are non-isolated: the behaviour is ‘allow all’, which means that a pod will accept connections from any source. Network policies are objects that define the ingress and/or egress restrictions associated with a pod or a selection of pods; they are enforced by the cluster’s network plug-in rather than by Kubernetes itself. Once a policy selects a pod for a given direction (ingress, egress or both), the only traffic allowed in that direction is the set defined in the relevant policy objects; all other traffic is denied. I think that concept is called ‘denying by allowing’: you deny all traffic except what you have explicitly allowed.

Where multiple policies apply to a pod, they are combined using an OR condition, meaning the traffic will be allowed if it’s allowed by any of the policies. A good practice is to define a ‘deny all’ network policy with an empty podSelector attribute, which selects every pod in the namespace, and then add network policies that allow only the connections needed.
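As a minimal sketch of the ‘deny all’ baseline described above, the policy below selects every pod in one namespace and allows nothing in either direction. The policy name and the choice of the default namespace are illustrative assumptions, not taken from any particular deployment.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all   # hypothetical name
  namespace: default       # assumption: adjust to the namespace you want to lock down
spec:
  podSelector: {}          # empty selector: matches every pod in this namespace
  policyTypes:
  - Ingress
  - Egress
  # No ingress or egress rules are listed, so no traffic is allowed in either direction
```

With this in place, each additional policy opens up only the specific connections it names, because the policies are additive.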

You can define network policies using YAML and apply them to your cluster. Network policies live within the boundaries of a namespace: a policy applies only to pods in the namespace in which it is created, and creating it in the default namespace does not extend it to the whole cluster. If a restriction must cover the whole cluster, an equivalent policy has to be created in each namespace. It’s also important to know that network policies will only be enforced if a Container Network Interface (CNI) network plug-in is installed that supports the network policy definition.

To understand this, let’s consider a simple example of a Kubernetes two-node network topology as shown on the SUSE blog, for example. The nodes are logically segregated, and each contains pods. Each pod has its own IP address that is shared among the containers in that pod, although usually one pod runs one container. By default, every pod IP is reachable from every other pod, because pods in a default deployment are not isolated, which means they accept connections from any source.

This point is worth remembering as it’s a call to action to look for network plug-ins that will support network policies in your cluster. As discussed, you isolate your pods by specifying the ingress and egress rules for that specific pod or selection of pods.

A YAML ingress rule could look like this:

...
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          project: myproject
    - podSelector:
        matchLabels:
          role: frontend
...

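In the code example above, the rule contains two separate entries in the ‘from’ list, and these are combined with OR: ingress traffic is allowed from any pod in a namespace labelled ‘project=myproject’, and also from pods labelled ‘role=frontend’ in the policy’s own namespace. All other ingress traffic to the selected pods will be denied.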
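One subtlety is easy to miss: separate entries in the ‘from’ list are alternatives, whereas placing both selectors in a single entry requires a source to match both. The sketch below shows the stricter form, reusing the labels from the examples above; the policy name and namespace are illustrative assumptions.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-from-myproject   # hypothetical name
  namespace: default                    # assumption: namespace of the db pods
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Ingress
  ingress:
  - from:
    # A single list entry holding both selectors: the source pod must be
    # labelled role=frontend AND run in a namespace labelled project=myproject
    - namespaceSelector:
        matchLabels:
          project: myproject
      podSelector:
        matchLabels:
          role: frontend
```

The only difference from the looser form is the missing dash in front of podSelector, so it is worth reviewing the indentation of these rules carefully.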

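Network policies are supported in Kubernetes by a number of CNI networking plug-ins, such as Calico, Romana, Weave Net and Contiv, to list just a few. Flannel on its own has traditionally not enforced network policies, which is why it is often paired with Calico (a combination known as Canal).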

Running components with known vulnerabilities

Now that we’ve talked about the security implications of network topology, network policies and isolation, let’s see how the two risks mentioned above (known vulnerabilities and misconfiguration) are related.

You are probably aware of the risk of known vulnerabilities in software, given that the exploitation of known vulnerabilities has accounted for a number of high-profile security breaches. In this context, we are interested in exploitable vulnerabilities that would allow an attacker to compromise the confidentiality, integrity or availability of your service. A distinctive attribute of this sort of attack is that it is carried out via a legitimate protocol allowed by the access policies. For the sake of this article, it doesn’t matter how that vulnerability ended up in the code: programming error, intentional backdoor, supply chain or any other ways you can think of.

For example, a well-known vulnerability that made the headlines (CVE-2017-5638, a Struts 2 RCE vulnerability) enables remote code execution and is exploitable via a legitimate access protocol (HTTP). There are examples on the web of how easy it is to compromise an entire cluster once the attacker has access to one of the containers by mounting a reverse shell attack as a result of this Struts 2 exploit. In one of the examples, once inside, the attacker was able to elevate privileges, run services in privileged mode, move laterally in the network and ultimately take over the control plane and, therefore, the whole cluster.

Running containers in privileged mode

At some point, we’ve probably all made the mistake of running an administrative console in privileged mode. It just makes things so much easier by removing all the barriers so that everything talks to everything, and it all works like magic. If you take away only one thing from this article: stop that practice! If you start a container in privileged mode, open a bash shell in it and run the ‘ps aux’ command, you will notice that all the processes inside it run as root. Running ‘whoami’ will confirm that you are the root user, too! How cool is that? (If I were an attacker.)

Once an initial compromise is achieved, the attacker can run a package manager to install any tool they need and start any process inside the machine, such as creating a virtual jump box to launch an internal attack. I know what you are thinking: containers are ephemeral… they can vanish very quickly, so the risk must be low. That is true, but even a short time can be enough to move laterally to a process that has a longer lifetime or even gain privileged access to the entire cluster.

The solution is the same as it has always been: harden your base image, run your applications in least-privilege mode, patch your images and application code and watch the activity inside your containers.

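A Kubernetes-specific action to take is to define the run mode and privilege escalation in the securityContext section of your pod specs (runAsNonRoot: true, allowPrivilegeEscalation: false). With runAsNonRoot set, Kubernetes will refuse to start a container that would run as the root user, and allowPrivilegeEscalation: false prevents a process from gaining more privileges than its parent. Neither setting is compatible with a container that is explicitly run in privileged mode, so avoid ‘privileged: true’ unless it is genuinely required.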
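A minimal sketch of such a securityContext is shown below. The pod name, image and user ID are placeholders chosen for illustration, and the extra fields (readOnlyRootFilesystem and the dropped capabilities) are optional hardening that goes beyond the two settings named above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-example                  # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # hypothetical image
    securityContext:
      runAsNonRoot: true               # the container is not started if it would run as UID 0
      runAsUser: 10001                 # arbitrary non-root UID, an assumption for this sketch
      allowPrivilegeEscalation: false  # processes cannot gain more privileges than their parent
      privileged: false                # do not run in privileged mode
      readOnlyRootFilesystem: true     # optional: root filesystem mounted read-only
      capabilities:
        drop:
        - ALL                          # optional: drop all Linux capabilities
```

Whether the application tolerates a read-only root filesystem or a fixed UID depends on the image, so treat these values as a starting point to test rather than a drop-in configuration.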

Our next question is understanding and monitoring what’s going on in our environment. Unfortunately, knowing what is happening inside an abstract, highly dynamic environment can be tricky.

The attack I mentioned above included all the components of a kill chain, using the elevation of privileges and misconfigured container deployment to take control of the cluster. Yet, the attack leaves tracks that can be detected and analysed in real time or by an external intrusion detection tool. This leads us neatly to the next topic, which is how you can watch what is going on inside your cluster and stop certain activities or alert the operations teams to any suspicious behaviour.

Cluster monitoring

There are many solutions on the market that can help protect your cluster from the security threats described in this article. They differ in the details of their implementation and the level of insights they provide as well as the type of cluster that they are designed for, like managed or hosted clusters. However, none of the tools will be a ‘silver bullet’ that solves all your cluster security problems. Similar to any software purchase, you will need to work out what threats you might be exposed to and what you need to monitor and then select an appropriate technology to meet those needs.

There are standard types of monitoring that will likely be interesting to your development and operations groups:

Resource utilisation (CPU, RAM, storage, network)
DevOps statistics (number of resources created and destroyed during a period of time)
Host monitoring (CPU, RAM, swap, storage)
Container health: the number of running containers, their performance and network I/O
API access: application endpoints and control plane endpoints, utilisation patterns
Image health: integrity and vulnerability status of your base images
Anomaly detection: traffic exchange (host, cluster, pod), resource utilisation, internal services and commands

You probably don’t need to worry about a good number of the standard performance and resource management indicators, as most container service providers take care of these, either by allowing you to query the cluster control plane API or by providing an administrative visualisation tool for that purpose.

The key takeaway from this article is that it’s easy for a developer to spin up a Kubernetes cluster and deploy applications. The difficult part is to understand the foundational concepts that make all the moving parts inside the cluster work together and leverage that knowledge to configure your cluster with these security principles in mind. In this article, we introduced two of the concepts: traffic restrictions using network policies and the basics of cluster monitoring.

In part 2 of this article, we will address some of the concerns related to image health, activity monitoring and anomaly detection. We will also outline a list of attack vectors and good practices to address security early and design the cluster with security concerns in mind.
