Jan 20, 2021

Drain Kubernetes Nodes… Wisely

What is Node Draining?

Anyone who has ever worked with containers knows how ephemeral they are. In Kubernetes, not only can containers and pods be replaced, but nodes as well. Nodes in Kubernetes are the VMs, servers, and other entities with computational power where pods and containers run.

Node draining is the mechanism that allows users to gracefully move all containers from one node to other nodes. There are multiple use cases:

  • Server maintenance
  • Autoscaling of the k8s cluster – nodes are added and removed dynamically
  • Preemptable or spot instances that can be terminated at any time

Why Drain?

Kubernetes can automatically detect node failure and reschedule the pods to other nodes. The only problem here is the time between the node going down and the pod being rescheduled. Here’s how it goes without draining:

  1. Node goes down – someone pressed the power button on the server.

  2. kube-controller-manager, the service which runs on the masters, can no longer get the NodeStatus from the kubelet on that node. By default it checks the status every 5 seconds; this is controlled by the --node-monitor-period parameter of the controller.

  3. Another important kube-controller-manager parameter is --node-monitor-grace-period, which defaults to 40s. It controls how fast the master marks the node as NotReady.

  4. So after ~40 seconds kubectl get nodes shows one of the nodes as NotReady, but the pods are still there and shown as running. This leads us to --pod-eviction-timeout, which is 5 minutes by default (!). It means that after the node is marked as NotReady, Kubernetes waits another 5 minutes before it starts to evict the pods.
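For reference, these timings are ordinary kube-controller-manager flags. On a kubeadm-style cluster they would be set in the controller-manager static pod, roughly as in this sketch (the manifest path and explicit values are assumptions; managed offerings like GKE do not expose this file):

# /etc/kubernetes/manifests/kube-controller-manager.yaml (kubeadm layout)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --node-monitor-period=5s          # how often NodeStatus is checked
    - --node-monitor-grace-period=40s   # how long until the node is marked NotReady
    - --pod-eviction-timeout=5m0s       # how long after NotReady until pods are evicted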


So if someone shuts down the server, only after almost six minutes (with default settings) does Kubernetes start to reschedule the pods to other nodes. This timing also applies to managed k8s clusters, like GKE.

These defaults might seem too high, but they are chosen to prevent frequent pod flapping, which might impact your application and infrastructure in a far more negative way.

Okay, Draining How?

As mentioned before, draining is the graceful method to move pods to another node. Let’s see how draining works and what pitfalls there are.

Basics

The kubectl drain {NODE_NAME} command most likely will not work as-is. There are at least two flags that need to be set explicitly:

  • --ignore-daemonsets – it is not possible to evict pods that run under a DaemonSet. This flag tells drain to skip them.

  • --delete-emptydir-data – acknowledges that data kept in EmptyDir ephemeral storage will be gone once the pods are evicted.
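Putting the flags together, a typical invocation looks like this (the node name is a placeholder; on older kubectl versions the second flag is called --delete-local-data):

$ kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data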

Once the drain command is executed the following happens:

  1. The node is cordoned, which means that no new pods can be placed on this node. In the Kubernetes world, this is a taint, node.kubernetes.io/unschedulable:NoSchedule, placed on the node; most pods do not tolerate it, so the scheduler keeps them away.

  2. Pods, except the ones that belong to DaemonSets, are evicted and hopefully scheduled on another node.
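Both effects can be verified with kubectl (a quick sketch; the node name and output values are illustrative):

$ kubectl get node node-1
NAME     STATUS                     ROLES    AGE   VERSION
node-1   Ready,SchedulingDisabled   <none>   30d   v1.20.1

$ kubectl describe node node-1 | grep Taints
Taints:             node.kubernetes.io/unschedulable:NoSchedule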

Pods are evicted and now the server can be powered off. Wrong.

DaemonSets

If for some reason your application or service uses a DaemonSet primitive, its pod was not drained from the node. That means it can still perform its function and even receive traffic from the load balancer or the service.

The best way to ensure this does not happen is to delete the node from Kubernetes itself:

  1. Stop the kubelet on the node.

  2. Delete the node from the cluster with kubectl delete node {NODE_NAME}.

If the kubelet is not stopped, the node will appear in the cluster again after the deletion.
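Assuming the kubelet runs as a systemd service (unit names can differ between distributions), the sequence looks like this:

# on the node being removed
$ sudo systemctl stop kubelet

# from a machine with cluster-admin access
$ kubectl delete node node-1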

Pods are evicted, node is deleted, and now the server can be powered off. Wrong again.

Load Balancer

Here is quite a standard setup:

[Diagram: an external load balancer in front of the Kubernetes nodes]

The external load balancer sends the traffic to all Kubernetes nodes. kube-proxy and the Container Network Interface (CNI) internals handle routing the traffic to the correct pod.

There are various ways to configure the load balancer, but as the diagram shows, it might still be sending traffic to the node being drained. Make sure that the node is removed from the load balancer before powering it off. For example, the AWS node termination handler does not remove the node from the load balancer, which causes a short period of packet loss when a node is terminated.
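As an illustration only (the exact procedure depends on your cloud and load balancer setup), removing an instance from an AWS target group before shutdown could look like this; the target group ARN and instance ID are placeholders:

$ aws elbv2 deregister-targets \
    --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/k8s-nodes/0123456789abcdef \
    --targets Id=i-0123456789abcdef0

Wait until the target finishes draining in the load balancer before powering the node off.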

Conclusion

Microservices and Kubernetes have shifted the paradigm of systems availability. SRE teams focus on resilience more than on stability. Nodes, containers, and load balancers can fail, but these failures are expected and planned for. Kubernetes is an orchestration and automation tool that helps here a lot, but there are still pitfalls that must be taken care of to meet your SLAs.

Nov 04, 2020

Running Percona Kubernetes Operator for Percona XtraDB Cluster with Kata Containers

Kata containers are containers that use hardware virtualization technologies for workload isolation almost without performance penalties. Top use cases are untrusted workloads and tenant isolation (for example, in a shared Kubernetes cluster). This blog post describes how to run Percona Kubernetes Operator for Percona XtraDB Cluster (PXC Operator) using Kata containers.

Prepare Your Kubernetes Cluster

Setting up Kata containers and Kubernetes is well documented in the official github repo (cri-o, containerd, Kubernetes DaemonSet). We will just cover the most important steps and pitfalls.

Virtualization Support

First of all, remember that Kata containers require hardware virtualization support from the CPU on the nodes. To check if your Linux system supports it, run the following on the node:

$ egrep '(vmx|svm)' /proc/cpuinfo

VMX (Virtual Machine Extensions) and SVM (Secure Virtual Machine) are the Intel and AMD features that add instructions to allow running a guest OS with full privileges while still keeping the host OS protected.

For example, on AWS only i3.metal and r5.metal instances provide VMX capability.
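Once kata-runtime is installed on a node (the DaemonSet described below takes care of that), it also ships its own capability check; the subcommand name depends on the Kata version:

$ kata-runtime kata-check   # Kata 1.x
$ kata-runtime check        # Kata 2.x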

Containerd

Kata containers are OCI (Open Container Initiative) compliant, which means that they work well with CRI (Container Runtime Interface) and hence are well supported by Kubernetes. To use Kata containers, please make sure your Kubernetes nodes run the CRI-O or containerd runtime.

The image below describes pretty well how Kubernetes works with Kata.

[Diagram: how Kubernetes works with Kata]

Hint: GKE or kops allows you to start your cluster with containerd out of the box and skip manual steps.

Setting Up Nodes

To run Kata containers, k8s nodes need to have kata-runtime installed and the runtime configured properly. The easiest way is to use a DaemonSet, which installs the required packages on every node and reconfigures containerd. As a first step, apply the following YAMLs to create the DaemonSet:

$ kubectl apply -f https://raw.githubusercontent.com/kata-containers/packaging/master/kata-deploy/kata-rbac/base/kata-rbac.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kata-containers/packaging/master/kata-deploy/kata-deploy/base/kata-deploy.yaml

The DaemonSet reconfigures containerd to support multiple runtimes by changing /etc/containerd/config.toml. Please note that some tools (e.g., kops) keep the containerd configuration in a separate file, config-kops.toml. In that case you need to copy the configuration created by the DaemonSet into that file and restart containerd.
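A minimal sketch of that workaround, assuming the kops layout described above (paths may differ, and you may prefer to merge only the runtime sections by hand):

# on each node, after kata-deploy has patched /etc/containerd/config.toml
$ diff /etc/containerd/config.toml /etc/containerd/config-kops.toml   # inspect what kata-deploy added
$ sudo cp /etc/containerd/config.toml /etc/containerd/config-kops.toml
$ sudo systemctl restart containerd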

Create RuntimeClasses for Kata. RuntimeClass is a feature that allows you to pick the runtime for a container during its creation. It has been available since Kubernetes 1.14 as Beta.

$ kubectl apply -f https://raw.githubusercontent.com/kata-containers/packaging/master/kata-deploy/k8s-1.14/kata-qemu-runtimeClass.yaml
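The applied manifest boils down to roughly the following (shown for illustration; the handler must match the runtime name configured in containerd):

apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata-qemu
handler: kata-qemu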

Everything is set. Deploy a test nginx pod and set the runtime:

$ cat nginx-kata.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-kata
spec:
  runtimeClassName: kata-qemu
  containers:
    - name: nginx
      image: nginx

$ kubectl apply -f nginx-kata.yaml
$ kubectl describe pod nginx-kata | grep "Container ID"
    Container ID:   containerd://3ba8d62be5ee8cd57a35081359a0c08059cf08d8a53bedef3384d18699d13111

On the node, verify that Kata is used for this container with the ctr tool:

# ctr --namespace k8s.io containers list | grep 3ba8d62be5ee8cd57a35081359a0c08059cf08d8a53bedef3384d18699d13111
3ba8d62be5ee8cd57a35081359a0c08059cf08d8a53bedef3384d18699d13111    sha256:f35646e83998b844c3f067e5a2cff84cdf0967627031aeda3042d78996b68d35 io.containerd.kata-qemu.v2

Runtime is showing kata-qemu.v2 as requested.

The current latest stable PXC Operator version (1.6) does not support runtimeClassName. It is still possible to run Kata containers by specifying the io.kubernetes.cri.untrusted-workload annotation. To ensure containerd supports this annotation, add the following to the configuration toml file on the node:

# cat <<EOF >> /etc/containerd/config.toml
[plugins.cri.containerd.untrusted_workload_runtime]
  runtime_type = "io.containerd.kata-qemu.v2"
EOF

# systemctl restart containerd

Install the Operator

We will install the operator with the regular runtime but will put the PXC cluster into Kata containers.

Create the namespace and switch the context:

$ kubectl create namespace pxc-operator
$ kubectl config set-context $(kubectl config current-context) --namespace=pxc-operator

Get the operator from github:

$ git clone -b v1.6.0 https://github.com/percona/percona-xtradb-cluster-operator

Deploy the operator into your Kubernetes cluster:

$ cd percona-xtradb-cluster-operator
$ kubectl apply -f deploy/bundle.yaml

Now let’s deploy the cluster, but before that we need to explicitly add an annotation to the PXC pods, marking them as untrusted to force Kubernetes to use the Kata containers runtime. Edit deploy/cr.yaml:

pxc:
  size: 3
  image: percona/percona-xtradb-cluster:8.0.20-11.1
  …
  annotations:
    io.kubernetes.cri.untrusted-workload: "true"

Now, let’s deploy the PXC cluster:

$ kubectl apply -f deploy/cr.yaml

The cluster is up and running (using one node for the sake of the experiment):

$ kubectl get pods
NAME                                               READY   STATUS    RESTARTS   AGE
pxc-kata-haproxy-0                                 2/2     Running   0          5m32s
pxc-kata-pxc-0                                     1/1     Running   0          8m16s
percona-xtradb-cluster-operator-749b86b678-zcnsp   1/1     Running   0          44m

In the ctr output you should see the percona-xtradb-cluster container running with the Kata runtime:

# ctr --namespace k8s.io containers list | grep percona-xtradb-cluster | grep kata
448a985c82ae45effd678515f6cf8e11a6dfca159c9abf05a906c7090d297cba    docker.io/percona/percona-xtradb-cluster:8.0.20-11.2 io.containerd.kata-qemu.v2

We are working on adding support for the runtimeClassName option to our operators. Supporting this feature will enable users to freely choose any container runtime.

Conclusions

Running databases in containers is an ongoing trend, and keeping data safe is always the top priority for a business. Kata containers provide security isolation through mature and extensively tested QEMU virtualization with little to no changes to the existing environment.

Deploy Percona XtraDB Cluster with ease in your Kubernetes cluster with our Operator and Kata containers for better isolation without performance penalties.

 
