Kubernetes NodePort and iptables rules

I have been working on kubernetes over the last few months and having fun learning about the underlying systems. When I started using kubernetes services, I wanted to learn about the iptables rules that kube-proxy creates to enable them, however, I didn’t exactly know where to start. While there are some really good posts explaining kubernetes networking and how the concept of services works, I couldn’t easily find a starting point to explore these iptables except reading the code or trying to make sense of the output of iptables-save command (which I can’t).

In this post, I share some of what I have learned by digging a little deeper into the iptables rules for NodePort type services and share answers to the following questions I came across while working on kube-proxy:

  • What happens when a non-kubernetes process starts using a port that’s allocated as a NodePort to a service?
  • Does the service endpoint continue to route traffic to pods if kube-proxy (configured in iptables mode) process dies on the node?

Some background

Kubernetes allows a few ways to expose applications to the world outside the kubernetes cluster through the concept of services. In our setup at work, we expose a few of our apps through NodePort type services. Also, kube-proxy is the kubernetes component that powers the services concept and we run it in iptables mode (default).

A little about NodePort

NodePort is one of the ways of exposing Kubernetes Services to the world outside the kubernetes cluster. As per Kubernetes docs,

If you set the type field to NodePort, the Kubernetes control plane allocates a port from a range specified by --service-node-port-range flag (default: 30000-32767). Each node proxies that port (the same port number on every Node) into your Service.

This means that if we have a service with NodePort 30000, a request to <kubernetes-node-ip>:30000 will get routed to our app. Under normal circumstances kube-proxy binds and listens on all NodePorts to ensure these ports stay reserved and no other processes can use them.

err…a non-kubernetes is using the NodePort!!

A few weeks back at work, one of our kubernetes nodes was rebooted and as it was added back to the cluster, we saw errors in kube-proxy logs - bind: address already in use:

I0707 20:57:38.648179       1 proxier.go:701] Syncing iptables rules
E0707 20:57:38.679876       1 proxier.go:1072] can't open "nodePort for default/azure-vote-front:" (:30450/tcp), skipping this nodePort: listen tcp :30450: bind: address already in use

This error meant that a port allocated as a NodePort to a service was already in use by another process. We found out that there was a non-kubernetes process using this NodePort as a client port to connect to a remote server. This happened due of a race condition post node reboot. This other process started using the NodePort before kube-proxy could bind and listen on the port to reserve it.

Will the NodePort on this node still route traffic to the pods?

This was the immediate question that popped up in our heads as the NodePort allocated to a service was now being used by a non-kubernetes process. One of my colleagues, Alex Dudko pointed out that kube-proxy creates iptables rules in the PREROUTING chain in nat table.

Because of these rules, the answer to the above question is - Yes! Traffic sent to the NodePort on this node will still correctly be routed to the backend pods it targets.

The NodePort would also continue to work if another process was listening on the port as a server and not just using it as a client. Though, would our HTTP GET requests reach this server? More on this later.

Kube-proxy iptables rules

In iptables mode, kube-proxy creates iptables rules for kubernetes services which ensure that the request to the service gets routed (and load balanced) to the appropriate pods.

These iptables rules also help answer the second question mentioned above. As long as these iptables rules exist, requests to services will get routed to the appropriate pods even if kube-proxy process dies on the node. Endpoints for new services won’t work from this node, however, since kube-proxy process won’t create the iptables rules for it.

Rules in the PREROUTING chain

As outlined in this flow chart, there are rules in the PREROUTING chain of the nat table for kubernetes services. As per iptables rules evaluation order, rules in the PREROUTING chain are the first ones to be consulted as a packet enters the linux kernel’s networking stack.

Rules created by kube-proxy in the PREROUTING chain help determine if the packet is meant for a local socket on the node or if it should be forwarded to a pod. It is these rules that ensure that the request to <kubernetes-node-ip>:<NodePort> continue to get routed to pods even if the NodePort is in use by another process.

Below is an over-simplified diagram that demonstrates this:

prerouting-rule effect

In rest of this post, I walk through a setup that reproduces the above scenario and examine the iptables rules that make this work.

Setting up a Kubernetes cluster

I created a kubernetes cluster using Azure Kubernetes Service .

Creating pods and a NodePort Service

From the AKS docs, I also created a deployment for an app with a frontend and a backend. I updated the azure-vote-front to type: NodePort service.

  • azure-vote-front pod in the cluster
$ kubectl get po -l app=azure-vote-front --show-labels -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP            NODE                                NOMINATED NODE   READINESS GATES   LABELS
azure-vote-front-5bc759676c-x9bwg   1/1     Running   0          24d   10.244.0.11   aks-nodepool1-22391620-vmss000000   <none>           <none>            app=azure-vote-front,pod-template-hash=5bc759676c
  • azure-vote-front service targeting the above pod
$ kubectl get svc azure-vote-front -o wide
NAME               TYPE       CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE   SELECTOR
azure-vote-front   NodePort   10.0.237.3   <none>        80:30450/TCP   24d   app=azure-vote-front

The above output shows that the port 30450 is assigned as the NodePort to our service. And we can send HTTP requests to the app from host network:

$ curl -IX GET localhost:30450
HTTP/1.1 200 OK
Server: nginx/1.13.7
Date: Wed, 08 Jul 2020 00:02:08 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 950
Connection: keep-alive

$ curl -sSL localhost:30450 | grep title
    <title>Azure Voting App</title>

Examining iptables rules for a NodePort service

Note: AKS doesn’t allow ssh’ing into an AKS worker node directly, instead, a pod needs to be spun up to ssh into the worker node. Instructions here.

Let’s look at the iptables rules for the azure-vote-front service.

After ssh’ing into the kubernetes worker node, let’s look at the PREROUTING chain in the nat table. (PREROUTING chain exists in raw, nat and mangle tables, however, kube-proxy only creates PREROUTING chain rules in nat table)

$ sudo iptables -t nat -L PREROUTING | column -t
Chain          PREROUTING  (policy  ACCEPT)
target         prot        opt      source    destination
KUBE-SERVICES  all         --       anywhere  anywhere     /*        kubernetes  service   portals  */
DOCKER         all         --       anywhere  anywhere     ADDRTYPE  match       dst-type  LOCAL

There’s a KUBE-SERVICES chain in the target that’s created by kube-proxy. Let’s list the rules in that chain.

$ sudo iptables -t nat -L KUBE-SERVICES -n  | column -t
Chain                      KUBE-SERVICES  (2   references)
target                     prot           opt  source          destination
KUBE-MARK-MASQ             tcp            --   !10.244.0.0/16  10.0.202.188  /*  kube-system/healthmodel-replicaset-service:  cluster  IP          */     tcp   dpt:25227
KUBE-SVC-KA44REDDIFNWY4W2  tcp            --   0.0.0.0/0       10.0.202.188  /*  kube-system/healthmodel-replicaset-service:  cluster  IP          */     tcp   dpt:25227
KUBE-MARK-MASQ             udp            --   !10.244.0.0/16  10.0.0.10     /*  kube-system/kube-dns:dns                     cluster  IP          */     udp   dpt:53
KUBE-SVC-TCOU7JCQXEZGVUNU  udp            --   0.0.0.0/0       10.0.0.10     /*  kube-system/kube-dns:dns                     cluster  IP          */     udp   dpt:53
KUBE-MARK-MASQ             tcp            --   !10.244.0.0/16  10.0.0.10     /*  kube-system/kube-dns:dns-tcp                 cluster  IP          */     tcp   dpt:53
KUBE-SVC-ERIFXISQEP7F7OF4  tcp            --   0.0.0.0/0       10.0.0.10     /*  kube-system/kube-dns:dns-tcp                 cluster  IP          */     tcp   dpt:53
KUBE-MARK-MASQ             tcp            --   !10.244.0.0/16  10.0.190.160  /*  kube-system/kubernetes-dashboard:            cluster  IP          */     tcp   dpt:80
KUBE-SVC-XGLOHA7QRQ3V22RZ  tcp            --   0.0.0.0/0       10.0.190.160  /*  kube-system/kubernetes-dashboard:            cluster  IP          */     tcp   dpt:80
KUBE-MARK-MASQ             tcp            --   !10.244.0.0/16  10.0.195.6    /*  kube-system/metrics-server:                  cluster  IP          */     tcp   dpt:443
KUBE-SVC-LC5QY66VUV2HJ6WZ  tcp            --   0.0.0.0/0       10.0.195.6    /*  kube-system/metrics-server:                  cluster  IP          */     tcp   dpt:443
KUBE-MARK-MASQ             tcp            --   !10.244.0.0/16  10.0.129.119  /*  default/azure-vote-back:                     cluster  IP          */     tcp   dpt:6379
KUBE-SVC-JQ4VBJ2YWO22DDZW  tcp            --   0.0.0.0/0       10.0.129.119  /*  default/azure-vote-back:                     cluster  IP          */     tcp   dpt:6379
KUBE-MARK-MASQ             tcp            --   !10.244.0.0/16  10.0.237.3    /*  default/azure-vote-front:                    cluster  IP          */     tcp   dpt:80
KUBE-SVC-GHSLGKVXVBRM4GZX  tcp            --   0.0.0.0/0       10.0.237.3    /*  default/azure-vote-front:                    cluster  IP          */     tcp   dpt:80
KUBE-MARK-MASQ             tcp            --   !10.244.0.0/16  10.0.0.1      /*  default/kubernetes:https                     cluster  IP          */     tcp   dpt:443
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp            --   0.0.0.0/0       10.0.0.1      /*  default/kubernetes:https                     cluster  IP          */     tcp   dpt:443
KUBE-NODEPORTS             all            --   0.0.0.0/0       0.0.0.0/0     /*  kubernetes                                   service  nodeports;  NOTE:  this  must       be  the  last  rule  in  this  chain  */  ADDRTYPE  match  dst-type  LOCAL

The last terget in the KUBE-SERVICES chain is the KUBE-NODEPORTS chain. Since the service we created is of type NodePort, let’s list the rules in KUBE-NODEPORTS chain.

$ sudo iptables -t nat -L KUBE-NODEPORTS -n  | column -t
Chain                      KUBE-NODEPORTS  (1   references)
target                     prot            opt  source       destination
KUBE-MARK-MASQ             tcp             --   0.0.0.0/0    0.0.0.0/0    /*  default/azure-vote-front:  */  tcp  dpt:30450
KUBE-SVC-GHSLGKVXVBRM4GZX  tcp             --   0.0.0.0/0    0.0.0.0/0    /*  default/azure-vote-front:  */  tcp  dpt:30450

We can see that the above targets are for packets destined to our NodePort 30450. The comments also show the namespace/pod name - default/azure-vote-front. For requests originating from outside the cluster and destined to our app running as a pod, the KUBE-MARK-MASQ rule marks the packet to be altered later in the POSTROUTING chain to use SNAT (source network address translation) to rewrite the source IP as the node IP (so that other hosts outside the pod network can reply back).

Since KUBE-MARK-MASQ target is to MASQUERADE packets later, let’s follow the KUBE-SVC-GHSLGKVXVBRM4GZX chain to further examine our service.

$ sudo iptables -t nat -L KUBE-SVC-GHSLGKVXVBRM4GZX  -n | column -t
Chain                      KUBE-SVC-GHSLGKVXVBRM4GZX  (2   references)
target                     prot                       opt  source       destination
KUBE-SEP-QXDNOBCCLOXLV7LV  all                        --   0.0.0.0/0    0.0.0.0/0

$ sudo iptables -t nat -L KUBE-SEP-QXDNOBCCLOXLV7LV  -n | column -t
Chain           KUBE-SEP-QXDNOBCCLOXLV7LV  (1   references)
target          prot                       opt  source       destination
KUBE-MARK-MASQ  all                        --   10.244.0.11  0.0.0.0/0
DNAT            tcp                        --   0.0.0.0/0    0.0.0.0/0    tcp  to:10.244.0.11:80

Here we see the DNAT (Destination Network Address Translation) target is used to rewrite the destination of the packet destined to port 30450 to our pod 10.244.0.11:80. We can verify the podIP and containerPort of our pod as follows:

$ kubectl get po azure-vote-front-5bc759676c-x9bwg -o json | jq -r '[.status.podIP, .spec.containers[0].ports[0].containerPort | tostring] | join(":")'
10.244.0.11:80

Verify kube-proxy is listening on NodePort

Under normal circumstances kube-proxy binds and listens on all NodePorts to ensure these ports stay reserved and no other processes can use them. We can verify this on the above kubernetes node:

$ sudo lsof -i:30450
COMMAND     PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
hyperkube 11558 root    9u  IPv6 95841832      0t0  TCP *:30450 (LISTEN)

$ ps -aef | grep -v grep | grep 11558
root      11558  11539  0 Jul02 ?        00:06:37 /hyperkube kube-proxy --kubeconfig=/var/lib/kubelet/kubeconfig --cluster-cidr=10.244.0.0/16 --feature-gates=ExperimentalCriticalPodAnnotation=true --v=3

kube-proxy is listening on NodePort 30450.

Create a non-kubernetes process use the NodePort

Now let’s kill kube-proxy process and start a server that listens on this NodePort instead.

$ kill -9 11558
$ python -m SimpleHTTPServer 30450 &
[1] 123578

$ sudo lsof -i:30450
COMMAND    PID      USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
python  123578 azureuser    3u  IPv4 95854834      0t0  TCP *:30450 (LISTEN)

$ ps -aef | grep -v grep | grep 123578
azureus+ 123578  85107  0 00:49 pts/5    00:00:00 python -m SimpleHTTPServer 30450

Where will the requests to NodePort be routed - pod or the non-kubernetes process?

While the python process is listening on port 30450, also allocated as a NodePort to our kubernetes service, the iptables rules in the PREROUTING chain will route all requests to port 30450 to our pod. We can verify this as below:

$ curl -I -XGET localhost:30450
HTTP/1.1 200 OK
Server: nginx/1.13.7
Date: Wed, 08 Jul 2020 00:53:09 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 950
Connection: keep-alive

As we send a GET request to port 30450, we receive a response from an nginx server hosting Azure Voting App, like we saw when kube-proxy was listening on port 30450.

$ curl -sSL localhost:30450 | grep title
    <title>Azure Voting App</title>

Summary

  • When kube-proxy is used in iptables mode, routing requests to services continues to work for existing services even when the kube-proxy process dies on the node
  • Kube-proxy binds and listens (on all k8s nodes) to all ports allocated as NodePorts to ensure these ports stay reserved and no other processes can use them
  • Even if a process starts using NodePort, iptables rules (because they are in PRESOUTING chain) ensure that the traffic sent to the NodePort gets routed to the pods

References

Ronak Nathani
Ronak Nathani
Site Reliability Engineer

Working on deployment and container infrastructure at LinkedIn