Kubernetes NodePort and iptables rules
I have been working on kubernetes over the last few months and having fun learning about the underlying systems. When I started using
kubernetes services, I wanted to learn about the iptables rules that kube-proxy creates to enable them, however, I didn’t exactly know where to start. While there are some really good posts explaining
kubernetes networking and how the concept of
services works, I couldn’t easily find a starting point to explore these iptables except reading the code or trying to make sense of the output of
iptables-save command (which I can’t).
In this post, I share some of what I have learned by digging a little deeper into the iptables rules for NodePort type services and share answers to the following questions I came across while working on kube-proxy:
- What happens when a non-kubernetes process starts using a port that’s allocated as a NodePort to a service?
- Does the service endpoint continue to route traffic to pods if kube-proxy (configured in iptables mode) process dies on the node?
Kubernetes allows a few ways to expose applications to the world outside the kubernetes cluster through the concept of services. In our setup at work, we expose a few of our apps through NodePort type services. Also, kube-proxy is the kubernetes component that powers the services concept and we run it in iptables mode (default).
A little about NodePort
If you set the type field to NodePort, the Kubernetes control plane allocates a port from a range specified by
(default: 30000-32767). Each node proxies that port (the same port number on every Node) into your Service.
This means that if we have a
service with NodePort
30000, a request to
<kubernetes-node-ip>:30000 will get routed to our app. Under normal circumstances kube-proxy
binds and listens on all NodePorts to ensure these ports stay reserved and no other processes can use them.
err…a non-kubernetes is using the NodePort!!
A few weeks back at work, one of our kubernetes nodes was rebooted and as it was added back to the cluster, we saw errors in kube-proxy logs -
bind: address already in use:
I0707 20:57:38.648179 1 proxier.go:701] Syncing iptables rules E0707 20:57:38.679876 1 proxier.go:1072] can't open "nodePort for default/azure-vote-front:" (:30450/tcp), skipping this nodePort: listen tcp :30450: bind: address already in use
This error meant that a port allocated as a NodePort to a
service was already in use by another process. We found out that there was a non-kubernetes process using this NodePort as a client port to connect to a remote server. This happened due of a race condition post node reboot. This other process started using the NodePort before kube-proxy could bind and listen on the port to reserve it.
Will the NodePort on this node still route traffic to the pods?
This was the immediate question that popped up in our heads as the NodePort allocated to a
service was now being used by a non-kubernetes process. One of my colleagues,
Alex Dudko pointed out that kube-proxy creates iptables rules in the PREROUTING chain in nat table.
Because of these rules, the answer to the above question is - Yes! Traffic sent to the NodePort on this node will still correctly be routed to the backend pods it targets.
The NodePort would also continue to work if another process was listening on the port as a server and not just using it as a client. Though, would our
HTTP GET requests reach this server? More on this later.
Kube-proxy iptables rules
In iptables mode, kube-proxy creates iptables rules for kubernetes services which ensure that the request to the
service gets routed (and load balanced) to the appropriate pods.
These iptables rules also help answer the second question mentioned above. As long as these iptables rules exist, requests to
services will get routed to the appropriate pods even if kube-proxy process dies on the node. Endpoints for new
services won’t work from this node, however, since kube-proxy process won’t create the iptables rules for it.
Rules in the PREROUTING chain
As outlined in this flow chart, there are rules in the PREROUTING chain of the nat table for kubernetes services. As per iptables rules evaluation order, rules in the PREROUTING chain are the first ones to be consulted as a packet enters the linux kernel’s networking stack.
Rules created by kube-proxy in the PREROUTING chain help determine if the packet is meant for a local socket on the node or if it should be forwarded to a pod. It is these rules that ensure that the request to
<kubernetes-node-ip>:<NodePort> continue to get routed to pods even if the NodePort is in use by another process.
Below is an over-simplified diagram that demonstrates this:
In rest of this post, I walk through a setup that reproduces the above scenario and examine the iptables rules that make this work.
Setting up a Kubernetes cluster
I created a kubernetes cluster using Azure Kubernetes Service .
Creating pods and a NodePort Service
From the AKS docs, I also created a deployment for an app with a frontend and a backend. I updated the azure-vote-front to type: NodePort service.
- azure-vote-front pod in the cluster
$ kubectl get po -l app=azure-vote-front --show-labels -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS azure-vote-front-5bc759676c-x9bwg 1/1 Running 0 24d 10.244.0.11 aks-nodepool1-22391620-vmss000000 <none> <none> app=azure-vote-front,pod-template-hash=5bc759676c
- azure-vote-front service targeting the above pod
$ kubectl get svc azure-vote-front -o wide NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR azure-vote-front NodePort 10.0.237.3 <none> 80:30450/TCP 24d app=azure-vote-front
The above output shows that the port 30450 is assigned as the NodePort to our service. And we can send HTTP requests to the app from host network:
$ curl -IX GET localhost:30450 HTTP/1.1 200 OK Server: nginx/1.13.7 Date: Wed, 08 Jul 2020 00:02:08 GMT Content-Type: text/html; charset=utf-8 Content-Length: 950 Connection: keep-alive $ curl -sSL localhost:30450 | grep title <title>Azure Voting App</title>
Examining iptables rules for a NodePort service
Note: AKS doesn’t allow ssh’ing into an AKS worker node directly, instead, a pod needs to be spun up to ssh into the worker node. Instructions here.
Let’s look at the iptables rules for the azure-vote-front service.
After ssh’ing into the kubernetes worker node, let’s look at the PREROUTING chain in the nat table. (
chain exists in
mangle tables, however, kube-proxy only creates
PREROUTING chain rules in
$ sudo iptables -t nat -L PREROUTING | column -t Chain PREROUTING (policy ACCEPT) target prot opt source destination KUBE-SERVICES all -- anywhere anywhere /* kubernetes service portals */ DOCKER all -- anywhere anywhere ADDRTYPE match dst-type LOCAL
KUBE-SERVICES chain in the target that’s created by kube-proxy. Let’s list the rules in that chain.
$ sudo iptables -t nat -L KUBE-SERVICES -n | column -t Chain KUBE-SERVICES (2 references) target prot opt source destination KUBE-MARK-MASQ tcp -- !10.244.0.0/16 10.0.202.188 /* kube-system/healthmodel-replicaset-service: cluster IP */ tcp dpt:25227 KUBE-SVC-KA44REDDIFNWY4W2 tcp -- 0.0.0.0/0 10.0.202.188 /* kube-system/healthmodel-replicaset-service: cluster IP */ tcp dpt:25227 KUBE-MARK-MASQ udp -- !10.244.0.0/16 10.0.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:53 KUBE-SVC-TCOU7JCQXEZGVUNU udp -- 0.0.0.0/0 10.0.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:53 KUBE-MARK-MASQ tcp -- !10.244.0.0/16 10.0.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53 KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- 0.0.0.0/0 10.0.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53 KUBE-MARK-MASQ tcp -- !10.244.0.0/16 10.0.190.160 /* kube-system/kubernetes-dashboard: cluster IP */ tcp dpt:80 KUBE-SVC-XGLOHA7QRQ3V22RZ tcp -- 0.0.0.0/0 10.0.190.160 /* kube-system/kubernetes-dashboard: cluster IP */ tcp dpt:80 KUBE-MARK-MASQ tcp -- !10.244.0.0/16 10.0.195.6 /* kube-system/metrics-server: cluster IP */ tcp dpt:443 KUBE-SVC-LC5QY66VUV2HJ6WZ tcp -- 0.0.0.0/0 10.0.195.6 /* kube-system/metrics-server: cluster IP */ tcp dpt:443 KUBE-MARK-MASQ tcp -- !10.244.0.0/16 10.0.129.119 /* default/azure-vote-back: cluster IP */ tcp dpt:6379 KUBE-SVC-JQ4VBJ2YWO22DDZW tcp -- 0.0.0.0/0 10.0.129.119 /* default/azure-vote-back: cluster IP */ tcp dpt:6379 KUBE-MARK-MASQ tcp -- !10.244.0.0/16 10.0.237.3 /* default/azure-vote-front: cluster IP */ tcp dpt:80 KUBE-SVC-GHSLGKVXVBRM4GZX tcp -- 0.0.0.0/0 10.0.237.3 /* default/azure-vote-front: cluster IP */ tcp dpt:80 KUBE-MARK-MASQ tcp -- !10.244.0.0/16 10.0.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:443 KUBE-SVC-NPX46M4PTMTKRN6Y tcp -- 0.0.0.0/0 10.0.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:443 KUBE-NODEPORTS all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
The last terget in the
KUBE-SERVICES chain is the
KUBE-NODEPORTS chain. Since the service we created is of type
NodePort, let’s list the rules in
$ sudo iptables -t nat -L KUBE-NODEPORTS -n | column -t Chain KUBE-NODEPORTS (1 references) target prot opt source destination KUBE-MARK-MASQ tcp -- 0.0.0.0/0 0.0.0.0/0 /* default/azure-vote-front: */ tcp dpt:30450 KUBE-SVC-GHSLGKVXVBRM4GZX tcp -- 0.0.0.0/0 0.0.0.0/0 /* default/azure-vote-front: */ tcp dpt:30450
We can see that the above targets are for packets destined to our NodePort 30450. The comments also show the namespace/pod name -
default/azure-vote-front. For requests originating from outside the cluster and destined to our app running as a pod, the
KUBE-MARK-MASQ rule marks the packet to be altered later in the
POSTROUTING chain to use SNAT (source network address translation) to rewrite the source IP as the node IP (so that other hosts outside the pod network can reply back).
KUBE-MARK-MASQ target is to
MASQUERADE packets later, let’s follow the
KUBE-SVC-GHSLGKVXVBRM4GZX chain to further examine our service.
$ sudo iptables -t nat -L KUBE-SVC-GHSLGKVXVBRM4GZX -n | column -t Chain KUBE-SVC-GHSLGKVXVBRM4GZX (2 references) target prot opt source destination KUBE-SEP-QXDNOBCCLOXLV7LV all -- 0.0.0.0/0 0.0.0.0/0 $ sudo iptables -t nat -L KUBE-SEP-QXDNOBCCLOXLV7LV -n | column -t Chain KUBE-SEP-QXDNOBCCLOXLV7LV (1 references) target prot opt source destination KUBE-MARK-MASQ all -- 10.244.0.11 0.0.0.0/0 DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp to:10.244.0.11:80
Here we see the DNAT (Destination Network Address Translation) target is used to rewrite the destination of the packet destined to port 30450 to our pod 10.244.0.11:80. We can verify the podIP and containerPort of our pod as follows:
$ kubectl get po azure-vote-front-5bc759676c-x9bwg -o json | jq -r '[.status.podIP, .spec.containers.ports.containerPort | tostring] | join(":")' 10.244.0.11:80
Verify kube-proxy is listening on NodePort
Under normal circumstances kube-proxy binds and listens on all NodePorts to ensure these ports stay reserved and no other processes can use them. We can verify this on the above kubernetes node:
$ sudo lsof -i:30450 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME hyperkube 11558 root 9u IPv6 95841832 0t0 TCP *:30450 (LISTEN) $ ps -aef | grep -v grep | grep 11558 root 11558 11539 0 Jul02 ? 00:06:37 /hyperkube kube-proxy --kubeconfig=/var/lib/kubelet/kubeconfig --cluster-cidr=10.244.0.0/16 --feature-gates=ExperimentalCriticalPodAnnotation=true --v=3
kube-proxy is listening on NodePort 30450.
Create a non-kubernetes process use the NodePort
Now let’s kill kube-proxy process and start a server that listens on this NodePort instead.
$ kill -9 11558 $ python -m SimpleHTTPServer 30450 &  123578 $ sudo lsof -i:30450 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME python 123578 azureuser 3u IPv4 95854834 0t0 TCP *:30450 (LISTEN) $ ps -aef | grep -v grep | grep 123578 azureus+ 123578 85107 0 00:49 pts/5 00:00:00 python -m SimpleHTTPServer 30450
Where will the requests to NodePort be routed - pod or the non-kubernetes process?
While the python process is listening on port 30450, also allocated as a
NodePort to our kubernetes service, the iptables rules in the PREROUTING chain will route all requests to port 30450 to our pod. We can verify this as below:
$ curl -I -XGET localhost:30450 HTTP/1.1 200 OK Server: nginx/1.13.7 Date: Wed, 08 Jul 2020 00:53:09 GMT Content-Type: text/html; charset=utf-8 Content-Length: 950 Connection: keep-alive
As we send a
GET request to port 30450, we receive a response from an nginx server hosting Azure Voting App, like we saw when
kube-proxy was listening on port 30450.
$ curl -sSL localhost:30450 | grep title <title>Azure Voting App</title>
- When kube-proxy is used in iptables mode, routing requests to services continues to work for existing services even when the kube-proxy process dies on the node
- Kube-proxy binds and listens (on all k8s nodes) to all ports allocated as NodePorts to ensure these ports stay reserved and no other processes can use them
- Even if a process starts using NodePort, iptables rules (because they are in PRESOUTING chain) ensure that the traffic sent to the
NodePortgets routed to the pods
- @thockin’s tweet, describing kube-proxy iptables rules in a flow chart, is an excellent way to holistically follow a packet’s path through various iptables rules before it reaches the destined pod.
- Kubernetes Networking Demystified: A Brief Guide is a great post explaining how kubernetes services and kubernetes networking work.