Jan 24, 2022

Percona Distribution for MongoDB Operator with Local Storage and OpenEBS

Automating the deployment and management of MongoDB on Kubernetes is an easy journey with Percona Operator. By default, MongoDB is deployed using persistent volume claims (PVC). In the cases where you seek exceptional performance or you don’t have any external block storage, it is also possible to use local storage. Usually, it makes sense to use local NVMe SSD for better performance (for example Amazon’s i3 and i4i instance families come with local SSDs).

With PVCs, migrating the container from one Kubernetes node to another is straightforward and does not require any manual steps, whereas local storage comes with certain caveats. OpenEBS allows you to simplify local storage management on Kubernetes. In this blog post, we will show you how to deploy MongoDB with Percona Operator and leverage OpenEBS for local storage.


Set-Up

Install OpenEBS

We are going to deploy OpenEBS with a Helm chart. Refer to the OpenEBS documentation for more details.

helm repo add openebs https://openebs.github.io/charts
helm repo update
helm install openebs --namespace openebs openebs/openebs --create-namespace

This is going to install OpenEBS along with the openebs-hostpath storage class:

kubectl get sc
NAME                 PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
…
openebs-hostpath     openebs.io/local        Delete          WaitForFirstConsumer   false                  71s

Deploy MongoDB Cluster

We will use a Helm chart for it as well, following this document:

helm repo add percona https://percona.github.io/percona-helm-charts/
helm repo update

Install the Operator:

helm install my-op percona/psmdb-operator

Deploy the database using local storage. We will disable sharding for this demo for simplicity:

helm install mytest percona/psmdb-db --set sharding.enabled=false \
  --set "replsets[0].volumeSpec.pvc.storageClassName=openebs-hostpath" \
  --set "replsets[0].volumeSpec.pvc.resources.requests.storage=3Gi" \
  --set "replsets[0].name=rs0" --set "replsets[0].size=3"

As a result, we should have a replica set with three nodes using the openebs-hostpath storage class.

$ kubectl get pods
NAME                                    READY   STATUS    RESTARTS   AGE
my-op-psmdb-operator-58c74cbd44-stxqq   1/1     Running   0          5m56s
mytest-psmdb-db-rs0-0                   2/2     Running   0          3m58s
mytest-psmdb-db-rs0-1                   2/2     Running   0          3m32s
mytest-psmdb-db-rs0-2                   2/2     Running   0          3m1s

$ kubectl get pvc
NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
mongod-data-mytest-psmdb-db-rs0-0   Bound    pvc-63d3b722-4b31-42ab-b4c3-17d8734c92a3   3Gi        RWO            openebs-hostpath   4m2s
mongod-data-mytest-psmdb-db-rs0-1   Bound    pvc-2bf6d908-b3c0-424c-9ccd-c3be3295da3a   3Gi        RWO            openebs-hostpath   3m36s
mongod-data-mytest-psmdb-db-rs0-2   Bound    pvc-9fa3e21e-bfe2-48de-8bba-0dae83b6921f   3Gi        RWO            openebs-hostpath   3m5s

Local Storage Caveats

Local storage is the node's own storage. It means that if something happens to the node, it will also have an impact on the data. We will review various regular situations and how they impact Percona Server for MongoDB on Kubernetes with local storage.

Node Restart

Something happened with the Kubernetes node – server reboot, virtual machine crash, etc. So the node is not lost but just restarted. Let’s see what would happen in this case with our MongoDB cluster.

I will restart one of my Kubernetes nodes. As a result, the Pod will go into a Pending state:

$ kubectl get pods
NAME                                    READY   STATUS    RESTARTS   AGE
my-op-psmdb-operator-58c74cbd44-stxqq   1/1     Running   0          58m
mytest-psmdb-db-rs0-0                   2/2     Running   0          56m
mytest-psmdb-db-rs0-1                   0/2     Pending   0          67s
mytest-psmdb-db-rs0-2                   2/2     Running   2          55m

In normal circumstances, the Pod should be rescheduled to another node, but it is not happening now. The reason is local storage and affinity rules. If you run kubectl describe pod mytest-psmdb-db-rs0-1, you will see something like this:

Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   72s (x2 over 73s)  default-scheduler   0/3 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict.
  Normal   NotTriggerScaleUp  70s                cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict

As you see, the cluster is not scaled up, as this Pod needs the specific node that holds its storage. We can see this in the annotations of the PVC itself:

$ kubectl describe pvc mongod-data-mytest-psmdb-db-rs0-1
Name:          mongod-data-mytest-psmdb-db-rs0-1
Namespace:     default
StorageClass:  openebs-hostpath
…
Annotations:   pv.kubernetes.io/bind-completed: yes
…
               volume.kubernetes.io/selected-node: gke-sergey-235-default-pool-9f5f2e2b-4jv3
…
Used By:       mytest-psmdb-db-rs0-1

In other words, this Pod will wait for the node to come back. Until it comes back, your MongoDB cluster will be in a degraded state, running two nodes out of three. Keep this in mind when you perform maintenance or experience a Kubernetes node crash. With network-backed PVCs, this MongoDB Pod would be rescheduled to a new node right away.
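
If you want to double-check which node a volume is pinned to, you can also look at the PersistentVolume behind the claim. A minimal sketch, using the volume name from the kubectl get pvc output above; the OpenEBS Local PV provisioner is expected to set a nodeAffinity section on the PV:

kubectl get pv pvc-2bf6d908-b3c0-424c-9ccd-c3be3295da3a -o jsonpath='{.spec.nodeAffinity}'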

Graceful Migration to Another Node

Let’s see what is the best way to migrate one MongoDB replica set Pod from one node to another when local storage is used. There can be multiple reasons – node maintenance, migration to another rack, datacenter, or newer hardware. We want to perform such a migration with no downtime and minimal performance impact on the database.

Firstly, we will add more nodes to the replica set by scaling up the cluster. We will use helm again and change the size from three to five:

helm upgrade mytest percona/psmdb-db --set sharding.enabled=false \
  --set "replsets[0].volumeSpec.pvc.storageClassName=openebs-hostpath" \
  --set "replsets[0].volumeSpec.pvc.resources.requests.storage=3Gi" \
  --set "replsets[0].name=rs0" --set "replsets[0].size=5"

This will create two more Pods in the replica set. Both Pods will use the openebs-hostpath storage class as well. By default, our affinity rules require replica set members to run on different Kubernetes nodes, so either enable auto-scaling or ensure you have enough nodes in your cluster. We add the extra members first to avoid a performance impact during the migration.

Once all five replica set members are up, we will drain the Kubernetes node we want to take out. This will evict all the Pods from it gracefully.

kubectl drain gke-sergey-235-default-pool-9f5f2e2b-rtcz --ignore-daemonsets

As with the node restart described in the previous section, the replica set Pod will be stuck in Pending status, waiting for the local storage.

kubectl get pods
NAME                                    READY   STATUS    RESTARTS   AGE
…
mytest-psmdb-db-rs0-2                   0/2     Pending   0          65s

The storage will not come back this time, as the node is cordoned. To solve this, we need to remove the PVC and delete the Pod:

kubectl delete pvc mongod-data-mytest-psmdb-db-rs0-2
persistentvolumeclaim "mongod-data-mytest-psmdb-db-rs0-2" deleted


kubectl delete pod mytest-psmdb-db-rs0-2
pod "mytest-psmdb-db-rs0-2" deleted

This will trigger the creation of a new PVC and a Pod on another node:

NAME                                    READY   STATUS    RESTARTS   AGE
…
mytest-psmdb-db-rs0-2                   2/2     Running   2          1m

Again, all five replica set Pods are up and running. You can now perform the maintenance on your Kubernetes node.
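
Once the maintenance is finished, remember to make the node schedulable again (using the same node name as in the drain example above):

kubectl uncordon gke-sergey-235-default-pool-9f5f2e2b-rtcz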

What is left is to scale the replica set back down to three nodes:

helm upgrade mytest percona/psmdb-db --set sharding.enabled=false \
  --set "replsets[0].volumeSpec.pvc.storageClassName=openebs-hostpath" \
  --set "replsets[0].volumeSpec.pvc.resources.requests.storage=3Gi" \
  --set "replsets[0].name=rs0" --set "replsets[0].size=3"

Node Loss

When the Kubernetes node is dead and there is no chance for it to recover, we will face the same situation as with the graceful migration: the Pod will be stuck in Pending status, waiting for the node to come back. The recovery path is the same:

  1. Delete the Persistent Volume Claim (PVC)
  2. Delete the Pod
  3. The Pod will start on another node and sync the data to a new local PVC (see the sketch below)
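
As a sketch, assuming it is again the third replica set member that is affected, the recovery boils down to the same two commands used during the graceful migration:

kubectl delete pvc mongod-data-mytest-psmdb-db-rs0-2
kubectl delete pod mytest-psmdb-db-rs0-2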

 

Conclusion

Local storage can boost your database performance and remove the need for cloud storage completely. This can also lower your public cloud provider bill. In this blog post, we saw that these benefits come with a higher maintenance cost, which can also be automated.

We encourage you to try out Percona Distribution for MongoDB Operator with local storage and share your results on our community forum.

There is always room for improvement and a time to find a better way. Please let us know if you face any issues or want to contribute your ideas to Percona products. You can do that on the Community Forum or JIRA. Read more about contribution guidelines for Percona Distribution for MongoDB Operator in CONTRIBUTING.md.

Nov 13, 2020

Kubernetes Scaling Capabilities with Percona XtraDB Cluster

Our recent survey showed that many organizations saw unexpected growth around cloud and data. Unexpected bills can become a big problem, especially in such uncertain times. This blog post talks about how Kubernetes scaling capabilities work with Percona Kubernetes Operator for Percona XtraDB Cluster (PXC Operator) and can help you to control the bill.

Resources

Kubernetes is a container orchestrator, and on top of that, it has great scaling capabilities. Scaling can help you utilize your cluster better and not waste money on excessive capacity. But before scaling, we need to understand what capacity is and how Kubernetes manages CPU and memory resources.

There are two resource concepts that you should be aware of: requests and limits. A request is the amount of CPU or memory that a container is guaranteed to get on the node. Kubernetes uses requests during scheduling decisions, and it will not schedule a container onto a node that does not have enough capacity. A limit is the maximum amount of resources that a container can get on the node, but there is no guarantee attached. In the Linux world, limits are just cgroup maximums for processes.
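
As a quick illustration, here is how requests and limits look in a container spec (a minimal sketch with made-up values, unrelated to the operator defaults):

containers:
- name: app
  image: nginx
  resources:
    requests:
      cpu: 500m      # guaranteed share, used by the scheduler
      memory: 1Gi
    limits:
      cpu: "1"       # cgroup ceiling; CPU is throttled above this
      memory: 2Gi    # exceeding this gets the container OOM-killed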

Each node in a cluster has its own capacity. Part of this capacity is reserved for the operating system and kubelet, and what is left can be utilized by containers (allocatable).


Okay, now we know a thing or two about resource allocation in Kubernetes. Let’s dive into the problem space.

Problem #1: Requested Too Much

If you request resources for containers but do not utilize them well enough, you end up wasting resources. This is where Vertical Pod Autoscaler (VPA) comes in handy. It can automatically scale up or down container requests based on its historical real usage.


VPA has 3 modes:

  1. Recommender – it only provides recommendations for containers’ requests. We suggest starting with this mode.
  2. Initial – the webhook applies changes to the container during its creation.
  3. Auto/Recreate – the webhook applies changes to the container during its creation and can also dynamically change the requests of a running container.

Configure VPA

As a starting point, deploy Percona Kubernetes Operator for Percona XtraDB Cluster and the database by following the guide. VPA is deployed via a single command (see the guide here). VPA requires a metrics-server to get real usage for containers.

We need to create a VPA resource that will monitor our PXC cluster and provide recommendations for requests tuning. For the recommender mode, set updateMode to "Off":

$ cat vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: pxc-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind:       StatefulSet
    name:       <name of the STS>
    namespace:  <your namespace>
  updatePolicy:
    updateMode: "Off"

Run the following command to get the name of the StatefulSet:

$ kubectl get sts
NAME           READY   AGE
...
cluster1-pxc   3/3     3h47m

The one with the -pxc suffix is the PXC cluster. Apply the VPA object:

$ kubectl apply -f vpa.yaml

After a few minutes you should be able to fetch recommendations from the VPA object:

$ kubectl get vpa pxc-vpa -o yaml
...
  recommendation:
    containerRecommendations:
    - containerName: pxc
      lowerBound:
        cpu: 25m
        memory: "503457402"
      target:
        cpu: 25m
        memory: "548861636"
      uncappedTarget:
        cpu: 25m
        memory: "548861636"
      upperBound:
        cpu: 212m
        memory: "5063059194"

Resources in the target section are the ones that VPA recommends and applies if Auto or Initial modes are configured. Read more here to understand other recommendation sections.

VPA will apply the recommendations once it is running in Auto mode and will persist the recommended configuration even after the pod is restarted. To enable Auto mode, patch the VPA object:

$ kubectl patch vpa pxc-vpa --type='json' -p '[{"op": "replace", "path": "/spec/updatePolicy/updateMode", "value": "Auto"}]'

After a few minutes, VPA will restart PXC pods and apply recommended requests.

$ kubectl describe pod cluster1-pxc-0
...
Requests:
      cpu:     25m
      memory:  548861636

Delete the VPA object to stop autoscaling:

$ kubectl delete vpa pxc-vpa

Please remember a few things about VPA and Auto mode:

  1. It changes container requests, but it does not change the resources defined in the Deployment or StatefulSet objects.
  2. It is not application aware. For PXC, for example, it does not change innodb_buffer_pool_size, which the operator configures to take 75% of RAM. To change it, set the corresponding requests configuration in cr.yaml and apply it (see the sketch after this list).
  3. It respects podDisruptionBudget to protect your application. In our default cr.yaml, the PDB is configured to lose one pod at a time, which means VPA will change requests and restart one pod at a time:

    podDisruptionBudget:
      maxUnavailable: 1
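
For point 2 above, a hedged sketch of how requests can be set for the PXC containers in cr.yaml (values are illustrative, not recommendations):

spec:
  pxc:
    resources:
      requests:
        memory: 1G
        cpu: 600m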

Problem #2: Spiky Usage

The utilization of the application might change over time. It can happen gradually, but what if it is daily spikes of usage or completely unpredictable patterns? Constantly running additional containers is an option, but it leads to resource waste and increases infrastructure costs. This is where Horizontal Pod Autoscaler (HPA) can help. It monitors container resources or even application metrics to automatically increase or decrease the number of containers serving the application.


Looks nice, but unfortunately, the current version of the PXC Operator will not work with HPA. HPA tries to scale the StatefulSet, which in our case is strictly controlled by the operator; the operator will overwrite any scaling attempts from the horizontal scaler. We are researching opportunities to enable this support in the PXC Operator.
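
For context, this is roughly what an HPA object looks like when applied to a regular stateless Deployment. It is a generic sketch, not something to point at the PXC StatefulSet for the reasons above; depending on your Kubernetes version, the API group may be autoscaling/v2beta2 instead of autoscaling/v2:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70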

Problem #3: My Cluster is Too Big

You have tuned resource requests and they are close to real usage, but the cloud bill is still not going down. It might be that your Kubernetes cluster is overprovisioned and should be scaled with Cluster Autoscaler (CA). CA adds and removes nodes in your Kubernetes cluster based on their requests usage. When nodes are removed, pods are rescheduled to other nodes automatically.


Configure CA

On Google Kubernetes Engine, Cluster Autoscaler can be enabled through the gcloud utility. On AWS, you need to install the autoscaler manually and add the corresponding auto-scaling groups to its configuration.
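
For example, on GKE autoscaling can be enabled per node pool with something like this (cluster, pool, and zone names are placeholders):

gcloud container clusters update my-cluster \
  --enable-autoscaling --min-nodes=1 --max-nodes=10 \
  --node-pool=default-pool --zone=us-central1-a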

In general, CA monitors if there are any pods in Pending status (waiting to be scheduled, read more on pod statuses here) and adds more nodes to the cluster to meet the demand. It removes nodes if it sees the possibility to pack pods more densely on other nodes. To add and remove nodes, it relies on cloud primitives: node groups in GCP, auto-scaling groups in AWS, virtual machine scale sets on Azure, and so on. The installation of CA differs from cloud to cloud, but here are some interesting tricks.

Overprovision the Cluster

If your workloads are scaling up, CA needs to provision new nodes, which sometimes takes a few minutes. If there is a requirement to scale faster, it is possible to overprovision the cluster. Detailed instructions are here. The idea is to always run pause pods with low priority; real workloads with higher priority push them out of the nodes when more capacity is needed.
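
A hedged sketch of this pattern: a low-priority PriorityClass and a Deployment of pause containers whose only job is to reserve capacity. When real workloads need room, the scheduler evicts these placeholder pods, and CA adds nodes to reschedule them:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                  # lower than the default priority of 0
globalDefault: false
description: "Placeholder pods that reserve spare capacity"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 3
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 500m       # the headroom each placeholder reserves
            memory: 500Mi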

Expanders

Expanders control how the cluster is scaled up and which node groups to add nodes to. Configure expanders and multiple node groups to fine-tune the scaling. My preference is the priority expander, as it allows us to cherry-pick the nodes by customizable priorities; it is especially useful for a rapidly changing spot market.
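
The priority expander reads its configuration from a ConfigMap in the cluster-autoscaler namespace; a hedged sketch, with node group name patterns that are purely illustrative:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*on-demand.*     # fallback node groups
    50:
      - .*spot.*          # higher priority, tried first when scaling up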

Safety

Pay extremely close attention to scaling down. First of all, you can disable it completely by setting scale-down-enabled to false (not recommended). For clusters with big nodes that run lots of pods, be careful with scale-down-utilization-threshold: do not set it to more than 50%, as it might impact other nodes and overutilize them. For clusters with a dynamic workload and lots of nodes, do not set scale-down-delay-after-delete and scale-down-unneeded-time too low, as that will lead to non-stop cluster scaling with absolutely no value.
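
For reference, these knobs are command-line flags of the cluster-autoscaler binary; a hedged sketch of how they might appear in its container args (values are illustrative, not recommendations):

command:
- ./cluster-autoscaler
- --scale-down-enabled=true
- --scale-down-utilization-threshold=0.5
- --scale-down-delay-after-delete=10m
- --scale-down-unneeded-time=10m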

Cluster Autoscaler also respects podDisruptionBudget. When you run it along with the PXC Operator, please make sure PDBs are correctly configured, so that the PXC cluster does not crash in the event of scaling down the Kubernetes cluster.
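
Before relying on automatic scale-down, it is worth checking which disruption budgets exist in the namespace:

kubectl get pdb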

Conclusion

In cloud environments, day two operations must include cost management. Overprovisioning Kubernetes clusters is a common theme that can quickly become visible in the bills. When running Percona XtraDB Cluster on Kubernetes, you can leverage Vertical Pod Autoscaler to tune requests and apply Cluster Autoscaler to reduce the number of instances, minimizing your cloud spend. It will be possible to use Horizontal Pod Autoscaler in future releases as well, to dynamically adjust your cluster to demand.
