Percona Monitoring and Management in Kubernetes is now in Tech Preview


Over the years, we have seen growing interest in running databases and stateful workloads in Kubernetes. With Container Storage Interfaces (CSI) maturing and more and more Operators appearing, running stateful workloads on your favorite platform is not that scary anymore. Our Kubernetes story at Percona started with Operators for MySQL and MongoDB, with PostgreSQL added later on.

Percona Monitoring and Management (PMM) is an open source database monitoring, observability, and management tool. It can be deployed as a virtual appliance or a Docker container. For a long time, our customers have asked us for a way to deploy PMM in Kubernetes. We had an unofficial helm chart, created as a proof of concept by Percona teams and the community (GitHub).

We are introducing a Technical Preview of the Percona-supported helm chart that makes it easy to deploy PMM in Kubernetes. You can find it in our helm chart repository.

Use cases

Single platform

If Kubernetes is your platform of choice, you currently need a separate virtual machine to run Percona Monitoring and Management. That is no longer the case with the introduction of this helm chart.

As you know, all Percona Operators integrate with PMM, which enables monitoring for databases deployed on Kubernetes. The Operators configure and deploy a pmm-client sidecar container and register the nodes on a PMM server. Bringing PMM into Kubernetes simplifies this integration and the whole flow: the network traffic between pmm-client and the PMM server no longer leaves the Kubernetes cluster at all.

All you have to do is set the correct endpoint in the pmm section of the Custom Resource manifest. For example, for Percona Operator for MongoDB, the pmm section will look like this:

  pmm:
    enabled: true
    image: percona/pmm-client:2.28.0
    serverHost: monitoring-service

Here monitoring-service is the Service created by the helm chart to expose the PMM server.

High availability

Percona Monitoring and Management has lots of moving parts: VictoriaMetrics to store time-series data, ClickHouse for query analytics functionality, and PostgreSQL to keep PMM configuration. Right now, all these components run in a single container or virtual machine, with Grafana as a frontend. To provide zero-downtime deployment in any environment, we would need to decouple these components, which would substantially complicate the installation and management of PMM.

What we offer right now instead are ways to automatically recover PMM within minutes in case of a failure (for example, by leveraging the EC2 self-healing mechanism).

Kubernetes is a control plane for container orchestration that automates manual tasks for engineers. When you run software in Kubernetes, it is best to rely on its primitives to handle the heavy lifting. This is what PMM looks like in Kubernetes:

PMM in Kubernetes

  • StatefulSet controls the Pod creation
  • There is a single Pod with a single container with all the components in it
  • This Pod is exposed through a Service that is utilized by PMM Users and pmm-clients
  • All the data is stored on a persistent volume
  • ConfigMap has various environment variable settings that can help to fine-tune PMM

In case of a node or Pod failure, the StatefulSet recovers the PMM Pod automatically and remounts the Persistent Volume to it. The recovery time depends on cluster load and node availability, but in normal operating environments the PMM Pod is up and running again within a minute.
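If you want to see this self-healing in action, you can simulate a Pod failure by deleting the Pod and watching the StatefulSet bring it back. This is a sketch; the Pod name pmm-0 matches the default helm installation used in this post:

```shell
# Simulate a failure: delete the PMM Pod. The StatefulSet controller
# notices the missing replica, schedules a replacement, and reattaches
# the same Persistent Volume Claim.
kubectl delete pod pmm-0

# Watch the replacement Pod go from Pending to Running (Ctrl+C to stop).
kubectl get pods pmm-0 -w
```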


Let’s see how PMM can be deployed in Kubernetes.

Add the helm chart:

helm repo add percona https://percona.github.io/percona-helm-charts/
helm repo update

Install PMM:

helm install pmm --set service.type="LoadBalancer" percona/pmm

You can now log in to PMM using the LoadBalancer IP address and a randomly generated password stored in a Secret object (the default user is admin).

The Service object created for PMM is called monitoring-service:

$ kubectl get services monitoring-service
NAME                 TYPE           CLUSTER-IP     EXTERNAL-IP        PORT(S)         AGE
monitoring-service   LoadBalancer   <CLUSTER_IP>   <YOUR_PUBLIC_IP>   443:32591/TCP   3m34s

$ kubectl get secrets pmm-secret -o yaml
apiVersion: v1
...

$ echo 'LE5lSTx3IytrUWBmWEhFTQ==' | base64 --decode && echo
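As a one-liner, you can extract and decode the password directly from the Secret. This is a sketch; the data key name PMM_ADMIN_PASSWORD is an assumption, so check the actual key in your Secret first:

```shell
# Read the admin password from the Secret and decode it in one step.
# PMM_ADMIN_PASSWORD is an assumed key name; verify it with
# `kubectl get secrets pmm-secret -o yaml` if it differs.
kubectl get secret pmm-secret \
  -o jsonpath='{.data.PMM_ADMIN_PASSWORD}' | base64 --decode && echo
```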

Log in to PMM by connecting to https://<YOUR_PUBLIC_IP>.


A helm chart is a template engine for YAML manifests, and it allows users to customize the deployment. You can see the various parameters you can set to fine-tune your PMM installation in our README.

For example, to choose another storage class and set the desired storage size, set the following flags:

helm install pmm percona/pmm \
  --set storage.storageClassName="premium-rwo" \
  --set storage.size="20Gi"

You can also set these parameters in values.yaml and use the -f flag:

# values.yaml contents
storage:
  storageClassName: "premium-rwo"
  size: 20Gi

helm install pmm percona/pmm -f values.yaml


For most of the maintenance tasks, regular Kubernetes techniques would apply. Let’s review a couple of examples.

Compute scaling

It is possible to vertically scale PMM by adding or removing resources through the resources variable in values.yaml. For example, to scale up, change:

resources:
  requests:
    memory: "4Gi"
    cpu: "2"

to:

resources:
  requests:
    memory: "8Gi"
    cpu: "4"

Once done, upgrade the deployment:

helm upgrade -f values.yaml pmm percona/pmm

This will restart the PMM Pod, so plan it carefully to avoid disrupting your team's work.

Storage scaling

This depends a lot on your storage interface capabilities and the underlying implementation. In most clouds, Persistent Volumes can be expanded. You can check if your storage class supports it by describing it:

kubectl describe storageclass standard
AllowVolumeExpansion:  True
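To check all storage classes at once instead of describing each one, you can print just the expansion flag. This is a sketch using kubectl's custom-columns output:

```shell
# List every storage class and whether it allows volume expansion.
kubectl get storageclass \
  -o custom-columns='NAME:.metadata.name,EXPANDABLE:.allowVolumeExpansion'
```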

Unfortunately, just changing the size of the storage in values.yaml (storage.size) will not do the trick and you will see the following error:

helm upgrade -f values.yaml pmm percona/pmm
Error: UPGRADE FAILED: cannot patch "pmm" with kind StatefulSet: StatefulSet.apps "pmm" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden

We use a StatefulSet object to deploy PMM, and StatefulSets are mostly immutable: only a handful of fields can be changed on the fly. There is a trick, though.

First, delete the StatefulSet, but keep the Pods and PVCs:

kubectl delete sts pmm --cascade=orphan

Recreate it again with the new storage configuration:

helm upgrade -f values.yaml pmm percona/pmm

It will recreate the StatefulSet with the new storage size configuration. 

Next, edit the Persistent Volume Claim manually and change the storage size (the name of the PVC may be different for you). You need to change spec.resources.requests.storage:

kubectl edit pvc pmm-storage-pmm-0
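If you prefer a non-interactive command over kubectl edit, the same change can be made with kubectl patch. This is a sketch; the 20Gi value and the PVC name follow the example in this post and may differ in your cluster:

```shell
# Request the new size on the PVC; the storage driver then starts
# the volume expansion in the background.
kubectl patch pvc pmm-storage-pmm-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```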

The PVC is not resized yet and you can see the following message when you describe it:

kubectl describe pvc pmm-storage-pmm-0
  Type                      Status  LastProbeTime                     LastTransitionTime                Reason  Message
  ----                      ------  -----------------                 ------------------                ------  -------
  FileSystemResizePending   True    Mon, 01 Jan 0001 00:00:00 +0000   Thu, 16 Jun 2022 11:55:56 +0300           Waiting for user to (re-)start a pod to finish file system resize of volume on node.

The last step would be to restart the Pod:

kubectl delete pod pmm-0


Upgrade

Running helm upgrade is the recommended way to upgrade, either when a new helm chart is released or when you want to move to a newer version of PMM by replacing the image in the image section.

Backup and restore

PMM stores all its data on a Persistent Volume. As mentioned before, regular Kubernetes techniques can be applied here to back up and restore the data. There are numerous options:

  • Volume Snapshots – check if it is supported by your CSI and storage implementation
  • Third-party tools, like Velero, can handle the backups and restores of PVCs
  • Snapshots provided by your storage (e.g., AWS EBS Snapshots), with manual mapping to the PVC during restoration
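As an illustration of the Velero option, a namespace-level backup that snapshots the PMM volume could look like this. This is a sketch; it assumes Velero is already installed with a snapshot-capable provider and that PMM runs in the default namespace:

```shell
# Back up the namespace, snapshotting Persistent Volumes where the
# provider supports it.
velero backup create pmm-backup \
  --include-namespaces default \
  --snapshot-volumes

# A later restore would look like:
# velero restore create --from-backup pmm-backup
```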

What is coming

There are numerous things that we are working on or have planned to enable further Kubernetes integrations.

OpenShift support

We are working on building a rootless container so that OpenShift users can run Percona Monitoring and Management there without having to grant elevated privileges. 

Microservices architecture

This is something that we have been discussing internally for some time now. As mentioned earlier, there are lots of components in PMM. To enable proper horizontal scaling, we need to break down our monolith container and run these components as separate microservices. 

Managing all these separate containers and Pods (if we talk about Kubernetes) would require coming up with separate maintenance strategies. This raises the question of creating a separate Operator for PMM to manage all of this, but that is a significant effort. If you have an opinion here, please let us know on our community forum.

Automated k8s registration in DBaaS

As you know, Percona Monitoring and Management comes with a technical preview of Database as a Service (DBaaS). Right now, when you install PMM on a Kubernetes cluster, you still need to register the cluster to deploy databases. We want to automate this process so that, after the installation, you can start deploying and managing databases right away.


Percona Monitoring and Management enables database administrators and site reliability engineers to pinpoint issues in their open source database environments, whether it is a quick look through the dashboards or a detailed analysis with Query Analytics. Percona’s support for PMM on Kubernetes is a response to the needs of our customers who are transforming their infrastructure.

Some useful links that would help you to deploy PMM on Kubernetes:


Store and Manage Logs of Percona Operator Pods with PMM and Grafana Loki


While it is convenient to view the logs of MySQL or MongoDB pods with kubectl logs, the log is purged when the pod is deleted, which makes searching historical logs a bit difficult. Grafana Loki, an aggregation logging tool from Grafana, can be installed in the existing Kubernetes environment to help store historical logs and build a comprehensive logging stack.

What Is Grafana Loki

Grafana Loki is a set of tools that can be integrated to provide a comprehensive logging stack solution. Grafana-Loki-Promtail is the major component of this solution. The Promtail agent collects and ships logs to the Loki datastore, which are then visualized with Grafana. While collecting the logs, Promtail can label, convert, and filter them before sending. Next, Loki receives the logs and indexes their metadata. At this point the logs are ready to be visualized in Grafana, and the administrator can use Loki's query language, LogQL, to explore them.


     Grafana Loki

Installing Grafana Loki in Kubernetes Environment

We will use the official helm chart to install Loki. The loki-stack helm chart supports the installation of various components, such as Promtail, Fluent Bit, Grafana, and Loki itself.
$ helm repo add grafana https://grafana.github.io/helm-charts
$ helm repo update
$ helm search repo grafana
NAME                                CHART VERSION   APP VERSION   DESCRIPTION
grafana/grafana                     6.29.2          8.5.0         The leading tool for querying and visualizing t...
grafana/grafana-agent-operator      0.1.11          0.24.1        A Helm chart for Grafana Agent Operator
grafana/fluent-bit                  2.3.1           v2.1.0        Uses fluent-bit Loki go plugin for gathering lo...
grafana/loki                        2.11.1          v2.5.0        Loki: like Prometheus, but for logs.
grafana/loki-canary                 0.8.0           2.5.0         Helm chart for Grafana Loki Canary
grafana/loki-distributed            0.48.3          2.5.0         Helm chart for Grafana Loki in microservices mode
grafana/loki-simple-scalable        1.0.0           2.5.0         Helm chart for Grafana Loki in simple, scalable...
grafana/loki-stack                  2.6.4           v2.4.2        Loki: like Prometheus, but for logs.

For a quick introduction, we will install only Loki and Promtail. We will use the Grafana pod that is deployed by Percona Monitoring and Management to visualize the logs.

$ helm install loki-stack grafana/loki-stack --create-namespace --namespace loki-stack --set promtail.enabled=true,loki.persistence.enabled=true,loki.persistence.size=5Gi

Let’s see what has been installed:

$ kubectl get all -n loki-stack
NAME                            READY   STATUS    RESTARTS   AGE
pod/loki-stack-promtail-xqsnl   1/1     Running   0          85s
pod/loki-stack-promtail-lt7pd   1/1     Running   0          85s
pod/loki-stack-promtail-fch2x   1/1     Running   0          85s
pod/loki-stack-promtail-94rcp   1/1     Running   0          85s
pod/loki-stack-0                1/1     Running   0          85s

NAME                          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/loki-stack-headless   ClusterIP   None           <none>        3100/TCP   85s
service/loki-stack            ClusterIP   <CLUSTER_IP>   <none>        3100/TCP   85s

NAME                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/loki-stack-promtail    4         4         4       4            4           <none>          85s

NAME                          READY   AGE
statefulset.apps/loki-stack   1/1     85s




Loki and Promtail pods have been created, together with a service that Loki will use to publish the logs to Grafana for visualization.

You can see the Promtail pods are deployed by a DaemonSet, which spawns a Promtail pod on every node. This is to make sure the logs from every pod on all the nodes are collected and shipped to the loki-stack pod for centralized storing and management.

$ kubectl get pods -n loki-stack -o wide
NAME                        READY   STATUS    RESTARTS   AGE    IP           NODE                 NOMINATED NODE   READINESS GATES
loki-stack-promtail-xqsnl   1/1     Running   0          2m7s   phong-dinh-default   <none>           <none>
loki-stack-promtail-lt7pd   1/1     Running   0          2m7s   phong-dinh-node1     <none>           <none>
loki-stack-promtail-fch2x   1/1     Running   0          2m7s   phong-dinh-node2     <none>           <none>
loki-stack-promtail-94rcp   1/1     Running   0          2m7s   phong-dinh-node3     <none>           <none>
loki-stack-0                1/1     Running   0          2m7s   phong-dinh-default   <none>           <none>

Integrating Loki With PMM Grafana

Next, we will add Loki as a data source of PMM Grafana, so we can use PMM Grafana to visualize the logs. You can do it from the GUI or with kubectl CLI. 

Below are the steps to add the data source from the PMM GUI.

Navigate to Configuration → Data sources, then select Add data source:

Integrating Loki With PMM Grafana

Then select Loki.

Next, in the Settings, in the HTTP URL box, enter the DNS record of the loki-stack service: http://loki-stack.loki-stack.svc.cluster.local:3100

Loki settings

You can also use the command below to add the data source. Make sure to specify the name of the PMM pod; in this command, monitoring-0 is the PMM pod.

$ kubectl -n default exec -it monitoring-0 -- bash -c "curl 'http://admin:verysecretpassword@' -X POST -H 'Content-Type: application/json;charset=UTF-8' --data-binary '{ \"orgId\": 1, \"name\": \"Loki\", \"type\": \"loki\", \"typeLogoUrl\": \"\", \"access\": \"proxy\", \"url\": \"http://loki-stack.loki-stack.svc.cluster.local:3100\", \"password\": \"\", \"user\": \"\", \"database\": \"\", \"basicAuth\": false, \"basicAuthUser\": \"\", \"basicAuthPassword\": \"\", \"withCredentials\": false, \"isDefault\": false, \"jsonData\": {}, \"secureJsonFields\": {}, \"version\": 1, \"readOnly\": false }'"
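After running the command, you can verify that the data source was registered through Grafana's HTTP API. This is a sketch; it assumes the same credentials as above and that Grafana listens on localhost:3000 inside the PMM container:

```shell
# List configured data sources; "Loki" should appear in the JSON output.
kubectl -n default exec -it monitoring-0 -- \
  bash -c "curl -s 'http://admin:verysecretpassword@localhost:3000/api/datasources'"
```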

Exploring the Logs in PMM Grafana

Now, you can explore the pod logs in Grafana UI, navigate to Explore, and select Loki in the dropdown list as the source of metrics:

Exploring the Logs in PMM Grafana

In the Log Browser, you can select the appropriate labels to form your first LogQL query. For example, I select the following attributes:

{app="percona-xtradb-cluster", component="pxc", container="logs", pod="cluster1-pxc-1"}


Click Show logs and you can see all the logs of the cluster1-pxc-1 pod.

You can perform a simple filter with |= "message-content". For example, filtering all the messages related to State transfer:

{app="percona-xtradb-cluster", component="pxc", container="logs", pod="cluster1-pxc-1"} |= "State transfer"
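LogQL can also aggregate on top of filters. For example, counting matching lines over a five-minute window helps spot bursts of state transfers; this is a sketch using the same label set as above:

```
count_over_time({app="percona-xtradb-cluster", component="pxc", container="logs", pod="cluster1-pxc-1"} |= "State transfer" [5m])
```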


Deploying Grafana Loki in the Kubernetes environment is feasible and straightforward. Grafana Loki can be integrated easily with Percona Monitoring and Management to provide both centralized logging and comprehensive monitoring when running Percona XtraDB Cluster and Percona Server for MongoDB in Kubernetes.

If you have any thoughts while reading this, please share your comments and ideas.


Running Rocket.Chat with Percona Server for MongoDB on Kubernetes


Our goal is to have a Rocket.Chat deployment that uses a highly available Percona Server for MongoDB cluster as the backend database, all running on Kubernetes. To get there, we will do the following:

  • Start a Google Kubernetes Engine (GKE) cluster across multiple availability zones. It can be any other Kubernetes flavor or service, but I rely on multi-AZ capability in this blog post.
  • Deploy Percona Operator for MongoDB and database cluster with it
  • Deploy Rocket.Chat with specific affinity rules
    • Rocket.Chat will be exposed via a load balancer


Percona Operator for MongoDB, compared to other solutions, is not only the most feature-rich but also comes with various management capabilities for your MongoDB clusters – backups, scaling (including sharding), zero-downtime upgrades, and many more. There are no hidden costs and it is truly open source.

This blog post is a walkthrough of running a production-grade deployment of Rocket.Chat with Percona Operator for MongoDB.


All YAML manifests that I use in this blog post can be found in this repository.

Deploy Kubernetes Cluster

The following command deploys a GKE cluster named percona-rocket in three availability zones:

gcloud container clusters create --zone us-central1-a --node-locations us-central1-a,us-central1-b,us-central1-c percona-rocket --cluster-version 1.21 --machine-type n1-standard-4 --preemptible --num-nodes=3

Read more about this in the documentation.

Deploy MongoDB

I’m going to use helm to deploy the Operator and the cluster.

Add helm repository:

helm repo add percona https://percona.github.io/percona-helm-charts/

Install the Operator into the percona namespace:

helm install psmdb-operator percona/psmdb-operator --create-namespace --namespace percona

Deploy the cluster of Percona Server for MongoDB nodes:

helm install my-db percona/psmdb-db -f psmdb-values.yaml -n percona

Replica set nodes are going to be distributed across availability zones. To get there, I altered the affinity keys in the corresponding sections of psmdb-values.yaml:

antiAffinityTopologyKey: "topology.kubernetes.io/zone"
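You can verify that the replica set Pods really ended up in different zones by listing the Pods with their nodes and the nodes with their zone labels. This is a sketch; topology.kubernetes.io/zone is the standard well-known label:

```shell
# Which node is each MongoDB Pod running on?
kubectl get pods -n percona -o wide

# Which availability zone does each node belong to?
kubectl get nodes --label-columns topology.kubernetes.io/zone
```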

Prepare MongoDB

For Rocket.Chat to connect to our database cluster, we need to create the users. By default, clusters provisioned with our Operator have a userAdmin user, and its password is set in a Secret object.

For production-grade systems, do not forget to change this password or create dedicated secrets to provision those. Read more about user management in our documentation.

Spin up a client Pod to connect to the database:

kubectl run -i --rm --tty percona-client1 --image=percona/percona-server-mongodb:4.4.10-11 --restart=Never -- bash -il

Connect to the database with the mongo shell:
[mongodb@percona-client1 /]$ mongo "mongodb://userAdmin:userAdmin123456@my-db-psmdb-db-rs0-0.percona/admin?replicaSet=rs0"

We are going to create the following:

  • rocketchat database
  • rocketChat user to store data and connect to the database
  • oplogger user to provide access to the oplog for Rocket.Chat
    • Rocket.Chat uses Meteor Oplog tailing to improve performance. It is optional.

use rocketchat
db.createUser({
  user: "rocketChat",
  pwd: passwordPrompt(),
  roles: [
    { role: "readWrite", db: "rocketchat" }
  ]
})

use admin
db.createUser({
  user: "oplogger",
  pwd: passwordPrompt(),
  roles: [
    { role: "read", db: "local" }
  ]
})
Deploy Rocket.Chat

I will use helm here to maintain the same approach. 

helm install -f rocket-values.yaml my-rocketchat rocketchat/rocketchat --version 3.0.0

You can find rocket-values.yaml in the same repository. Please make sure you set the correct passwords in the corresponding YAML fields.
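For reference, the database connection in rocket-values.yaml is built from the two users created earlier. A minimal sketch of the relevant fields (the field names follow the Rocket.Chat chart's values, and the host and password values are placeholders, so check them against the actual file in the repository):

```yaml
# Hypothetical excerpt: connection strings for the rocketChat and
# oplogger users created in the previous step.
externalMongodbUrl: mongodb://rocketChat:<password>@my-db-psmdb-db-rs0.percona.svc.cluster.local/rocketchat?replicaSet=rs0
externalMongodbOplogUrl: mongodb://oplogger:<password>@my-db-psmdb-db-rs0.percona.svc.cluster.local/local?replicaSet=rs0&authSource=admin
```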

As you can see, I also do the following:

  • Line 11: expose Rocket.Chat through the LoadBalancer service type
  • Lines 13-14: set the number of replicas of Rocket.Chat Pods. We want three, one per availability zone.
  • Lines 16-23: set affinity to distribute Pods across availability zones

Load Balancer will be created with a public IP address:

$ kubectl get service my-rocketchat-rocketchat
NAME                       TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)        AGE
my-rocketchat-rocketchat   LoadBalancer   80:32548/TCP   12m

You should now be able to connect to the load balancer's external IP address and enjoy your highly available Rocket.Chat installation.

Rocket.Chat installation

Clean Up

Uninstall all helm charts to remove MongoDB cluster, the Operator, and Rocket.Chat:

helm uninstall my-rocketchat
helm uninstall my-db -n percona
helm uninstall psmdb-operator -n percona

Things to Consider


Instead of exposing Rocket.Chat through a load balancer, you may also try ingress. By doing so, you can integrate it with cert-manager and have a valid TLS certificate for your chat server.


It is also possible to run a sharded MongoDB cluster with Percona Operator. If you do so, Rocket.Chat will connect to mongos Service, instead of the replica set nodes. But you will still need to connect to the replica set directly to get oplogs.


We encourage you to try out Percona Operator for MongoDB with Rocket.Chat and let us know on our community forum your results.

There is always room for improvement and a time to find a better way. Please let us know if you face any issues with contributing your ideas to Percona products. You can do that on the Community Forum or JIRA. Read more about contribution guidelines for Percona Operator for MongoDB in CONTRIBUTING.md.

Percona Operator for MongoDB contains everything you need to quickly and consistently deploy and scale Percona Server for MongoDB instances into a Kubernetes cluster on-premises or in the cloud. The Operator enables you to improve time to market with the ability to quickly deploy standardized and repeatable database environments. Deploy your database with a consistent and idempotent result no matter where they are used.


Percona Operator for MongoDB and Kubernetes MCS: The Story of One Improvement

Percona Operator for MongoDB and Kubernetes MCS

Percona Operator for MongoDB has supported multi-cluster or cross-site replication deployments since version 1.10. This functionality is extremely useful if you want to have a disaster recovery deployment or perform a migration from or to a MongoDB cluster running in Kubernetes. In a nutshell, it allows you to use Operators deployed in different Kubernetes clusters to manage and expand replica sets.

For example, you have two Kubernetes clusters: one in Region A, another in Region B.

  • In Region A you deploy your MongoDB cluster with Percona Operator. 
  • In Region B you deploy unmanaged MongoDB nodes with another installation of Percona Operator.
  • You configure both Operators, so that nodes in Region B are added to the replica set in Region A.

In case of failure of Region A, you can switch your traffic to Region B.

Migrating MongoDB to Kubernetes describes the migration process using this functionality of the Operator.

This feature was released in tech preview, and we received lots of positive feedback from our users. But one of our customers raised an internal ticket pointing out that the cross-site replication functionality does not work with Multi-Cluster Services. This started the investigation and the creation of this ticket – K8SPSMDB-625.

This blog post will go into the deep roots of this story and how it is solved in the latest release of Percona Operator for MongoDB version 1.12.

The Problem

Multi-Cluster Services, or MCS, allows you to expand network boundaries for the Kubernetes cluster and share Service objects across these boundaries. Some call it another take on Kubernetes Federation. This feature is already available on some managed Kubernetes offerings, such as Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS). Submariner uses the same logic and primitives under the hood.

MCS Basics

To understand the problem, we need to understand how multi-cluster services work. Let’s take a look at the picture below:

multi-cluster services

  • We have two Pods in different Kubernetes clusters
  • We add these two clusters into our MCS domain
  • Each Pod has a service and IP-address which is unique to the Kubernetes cluster
  • MCS introduces new Custom Resources – ServiceExport and ServiceImport
    • Once you create a ServiceExport object in one cluster, a ServiceImport object appears in all clusters in your MCS domain.
    • This ServiceImport object is in the svc.clusterset.local domain and, with the network magic introduced by MCS, can be accessed from any cluster in the MCS domain.

The above means that if I have an application in the Kubernetes cluster in Region A, I can connect to the Pod in the Kubernetes cluster in Region B through a domain name in the svc.clusterset.local zone. And it works from the other cluster as well.
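A ServiceExport object itself is tiny. Exporting an existing Service is just a matter of creating a matching object in the same namespace (a sketch following the MCS KEP API; note that some managed offerings use their own API group, and the mongo namespace here is illustrative):

```yaml
# Exporting the Service "rs0-4" from namespace "mongo" makes it
# resolvable in every cluster of the MCS domain as
# rs0-4.mongo.svc.clusterset.local.
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: rs0-4
  namespace: mongo
```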

MCS and Replica Set

Here is how cross-site replication works with Percona Operator if you use load balancer:

MCS and Replica Set

All replica set nodes have a dedicated service and a load balancer. A replica set in the MongoDB cluster is formed using these public IP addresses. External nodes are added using public IP addresses as well:

  replsets:
  - name: rs0
    size: 3
    externalNodes:
    - host: <public IP address>

All nodes can reach each other, which is required to form a healthy replica set.

Here is how it looks when you have clusters connected through multi-cluster service:

Instead of load balancers, replica set nodes are exposed through Cluster IPs. We have ServiceExport and ServiceImport resources. Everything looks good at the networking level, and it should work, but it does not.

The problem is in the way the Operator builds the MongoDB replica set in Region A. To register an external node from Region B in the replica set, we use the MCS domain name in the corresponding section:

  replsets:
  - name: rs0
    size: 3
    externalNodes:
    - host: rs0-4.mongo.svc.clusterset.local

Now our rs.status() will look like this:

"name" : "my-cluster-rs0-0.mongo.svc.cluster.local:27017"
"role" : "PRIMARY"
"name" : "my-cluster-rs0-1.mongo.svc.cluster.local:27017"
"role" : "SECONDARY"
"name" : "my-cluster-rs0-2.mongo.svc.cluster.local:27017"
"role" : "SECONDARY"
"name" : "rs0-4.mongo.svc.clusterset.local:27017"
"role" : "UNKNOWN"

As you can see, the Operator formed a replica set out of three nodes using the svc.cluster.local domain, as that is how it should be done when you expose nodes with the ClusterIP Service type. In this case, a node in Region B cannot reach any node in Region A, as it tries to connect to a domain that is local to the cluster in Region A.

In the picture below, you can easily see where the problem is:

The Solution

Luckily, we have a Special Interest Group (SIG), a Kubernetes Enhancement Proposal (KEP), and multiple implementations for enabling Multi-Cluster Services. Having a KEP is great since we can be sure the implementations from different providers (e.g., GCP, AWS) will follow more or less the same standard.

There are two fields in the Custom Resource that control MCS in the Operator:

  multiCluster:
    enabled: true
    DNSSuffix: svc.clusterset.local

Let’s see what is happening in the background with these flags set.

ServiceImport and ServiceExport Objects

Once you enable MCS by patching the CR with spec.multiCluster.enabled: true, the Operator creates a ServiceExport object for each Service. These ServiceExports will be detected by the MCS controller in the cluster, and eventually a ServiceImport for each ServiceExport will be created in the same namespace in each cluster that has MCS enabled.

As you see, we made a decision to empower the Operator to create ServiceExport objects. There are two main reasons for doing that:

  • If any infrastructure-as-a-code tool is used, it would require additional logic and level of complexity to automate the creation of required MCS objects. If Operator takes care of it, no additional work is needed. 
  • Our Operators take care of the infrastructure for the database, including Service objects. It just felt logical to expand the reach of this functionality to MCS.

Replica Set and Transport Encryption

The root cause of the problem that we are trying to solve here lies in the networking field, where external replica set nodes try to connect to the wrong domain names. Now, when you enable multi-cluster and set DNSSuffix (it defaults to svc.clusterset.local), the Operator does the following:

  • The replica set is formed using the MCS domain set in DNSSuffix
  • The Operator generates TLS certificates as usual, but adds the clusterset domains into the picture

With this approach, the traffic between nodes flows as expected and is encrypted by default.

Replica Set and Transport Encryption

Things to Consider


Please note that the operator won’t install MCS APIs and controllers to your Kubernetes cluster. You need to install them by following your provider’s instructions prior to enabling MCS for your PSMDB clusters. See our docs for links to different providers.

Operator detects if MCS is installed in the cluster by API resources. The detection happens before controllers are started in the operator. If you installed MCS APIs while the operator is running, you need to restart the operator. Otherwise, you’ll see an error like this:

{
  "level": "error",
  "ts": 1652083068.5910048,
  "logger": "controller.psmdb-controller",
  "msg": "Reconciler error",
  "name": "cluster1",
  "namespace": "psmdb",
  "error": "wrong psmdb options: MCS is not available on this cluster",
  "errorVerbose": "...",
  "stacktrace": "..."
}
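You can check whether the MCS APIs are present before enabling the feature. This is a sketch; the API group follows the MCS KEP:

```shell
# If MCS is installed, ServiceExport and ServiceImport should be listed.
kubectl api-resources --api-group=multicluster.x-k8s.io
```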

ServiceImport Provisioning Time

It might take some time for ServiceImport objects to be created in the Kubernetes cluster. You can see the following messages in the logs while creation is in progress:

{
  "level": "info",
  "ts": 1652083323.483056,
  "logger": "controller_psmdb",
  "msg": "waiting for service import",
  "replset": "rs0",
  "serviceExport": "cluster1-rs0"
}

During testing, we saw wait times of up to 10-15 minutes. If your cluster is stuck in the initializing state waiting for service imports, it's a good idea to check the usage and quotas for your environment.


We also made a decision to automatically generate TLS certificates for the Percona Server for MongoDB cluster with the svc.clusterset.local domain, even if MCS is not enabled. This approach simplifies the process of enabling MCS for a running MongoDB cluster. It does not make much sense to change the DNSSuffix field unless you have hard requirements from your service provider, but we still allow such a change.

If you want to enable MCS for a cluster deployed with an Operator version below 1.12, you need to update your TLS certificates to include the svc.clusterset.local SANs. See the docs for instructions.
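To confirm that a certificate already carries the clusterset SANs, you can decode it from the TLS Secret with openssl. This is a sketch; the Secret name cluster1-ssl is an assumption based on the default <cluster-name>-ssl naming:

```shell
# Decode the certificate and list its Subject Alternative Names;
# look for *.svc.clusterset.local entries.
kubectl get secret cluster1-ssl -o jsonpath='{.data.tls\.crt}' \
  | base64 --decode \
  | openssl x509 -noout -text \
  | grep -A1 'Subject Alternative Name'
```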


Business relies on applications and infrastructure that serves them more than ever nowadays. Disaster Recovery protocols and various failsafe mechanisms are routine for reliability engineers, not an annoying task in the backlog. 

With multi-cluster deployment functionality in Percona Operator for MongoDB, we want to equip users to build highly available and secured database clusters with minimal effort.

Percona Operator for MongoDB is truly open source and provides users with a way to deploy and manage their enterprise-grade MongoDB clusters on Kubernetes. We encourage you to try this new Multi-Cluster Services integration and let us know your results on our community forum. You can find some scripts that would help you provision your first MCS clusters on GKE or EKS here.

There is always room for improvement and a time to find a better way. Please let us know if you face any issues or want to contribute your ideas to Percona products. You can do that on the Community Forum or JIRA. Read more about contribution guidelines for Percona Operator for MongoDB in CONTRIBUTING.md.


Zero Impact on Index Creation with Amazon Aurora 3


In the last quarter of 2021, AWS released Aurora version 3. This new version aligns Aurora with the latest MySQL 8 version, porting many of the advantages MySQL 8 has over previous versions.

While this brings a lot of new interesting features to Aurora, what we are going to cover here is how DDLs behave when using the ONLINE option, with a quick comparison against standard MySQL 8 and Group Replication.


All tests were run on an Aurora r6g.large instance with a secondary availability zone. The test used four connections:

    • #1 to perform the DDL
    • #2 to insert data into the table I am altering
    • #3 to insert data into a different table
    • #4 to check the operations on the other node

In the Aurora instance, a sysbench schema with 10 tables and five million rows was created, just to generate a bit of traffic. The test table, also with five million rows, was:

CREATE TABLE `windmills_test` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `uuid` char(36) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `millid` smallint NOT NULL,
  `kwatts_s` int NOT NULL,
  `date` date NOT NULL,
  `location` varchar(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `active` tinyint NOT NULL DEFAULT '1',
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `strrecordtype` char(3) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`),
  KEY `IDX_millid` (`millid`,`active`),
  KEY `IDX_active` (`id`,`active`),
  KEY `kuuid_x` (`uuid`),
  KEY `millid_x` (`millid`),
  KEY `active_x` (`active`)
) ENGINE=InnoDB;

The executed commands:

Connection 1:
    ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=NONE;
    ALTER TABLE windmills_test drop INDEX idx_1, ALGORITHM=INPLACE;
Connection 2:
 while [ 1 = 1 ];do da=$(date +'%s.%3N');mysql --defaults-file=./my.cnf -D windmills_large -e "insert into windmills_test  select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills4 limit 1;" -e "select count(*) from windmills_large.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done

Connection 3:
 while [ 1 = 1 ];do da=$(date +'%s.%3N');mysql --defaults-file=./my.cnf -D windmills_large -e "insert into windmills3  select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills4 limit 1;" -e "select count(*) from windmills_large.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done

Connection 4:
     while [ 1 = 1 ];do echo "$(date +'%T.%3N')";mysql --defaults-file=./my.cnf -h <secondary aurora instance> -D windmills_large -e "show full processlist;"|egrep -i -e "(windmills_test|windmills_large)"|grep -i -v localhost;sleep 1;done

1) start the inserts from connections 2 and 3
2) start the monitoring command in connection 4 on the other node
3) execute: DC1-1(root@localhost) [windmills_large]>ALTER TABLE windmills_test ADD INDEX idx_1 (uuid,active), ALGORITHM=INPLACE, LOCK=NONE;

With this, what I wanted to capture is the impact of a common operation such as creating an index. My expectation was to see no impact when running operations declared “ONLINE”, such as creating an index, as well as data consistency between nodes.

Let us see what happened…


While the inserts were running on the same table, I performed the alter:

mysql>  ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=NONE;
Query OK, 0 rows affected (16.51 sec)
Records: 0  Duplicates: 0  Warnings: 0

The ALTER is NOT blocking the operations on the same table or on any other table in the Aurora instance.

We can only identify a minimal performance impact:

[root@ip-10-0-0-11 tmp]# while [ 1 = 1 ];do da=$(date +'%s.%3N');mysql --defaults-file=./my.cnf -D windmills_large -e "insert into windmills_test  select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills4 limit 1;" -e "select count(*) from windmills_large.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done
.686  <-- start
.512  <-- end

The secondary node is not affected at all, because Aurora manages data replication at the storage level. There is no such thing as applying from the relay log, as we have in standard MySQL asynchronous replication or with Group Replication.  

The result is that in Aurora 3 we can have zero-impact index creation (or any other ONLINE/INSTANT operation), and this includes the data replicated to the other instances for high availability. 

If we compare this with Group replication (see blog):

                                               GR           Aurora 3
Time on hold for insert on the altered table   ~0.217 sec   ~0 sec
Time on hold for insert on another table       ~0.211 sec   ~0 sec

However, keep in mind that MySQL with Group Replication will still need to apply the data on the Secondaries. This means that if your alter was taking 10 hours to build the index, the Secondary nodes will be misaligned with the Source for approximately another 10 hours. 
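The arithmetic behind that statement can be sketched with a toy model (hypothetical numbers, just illustrating the reasoning, not a measurement):

```python
# Toy model: a Group Replication secondary must re-apply the ALTER after
# receiving it, while Aurora replicates at the storage layer with no
# re-apply step. Numbers are hypothetical.
alter_hours = 10                              # time for the ALTER on the Source

gr_secondary_aligned_after = alter_hours * 2  # build on Source + re-apply on Secondary
aurora_replica_aligned_after = alter_hours    # no extra apply step

extra_wait = gr_secondary_aligned_after - aurora_replica_aligned_after
print(extra_wait)  # hours of additional misalignment with GR
```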

With Aurora 3 or with Percona XtraDB Cluster (PXC), changes will be present on all nodes when the Source has completed the operation.    

What about PXC? Well, we have a different scenario:

                                               PXC (NBO)    Aurora 3
Time on hold for insert on the altered table   ~120 sec     ~0 sec
Time on hold for insert on another table       ~25 sec      ~0 sec

We will have a higher impact while doing the ALTER operation, but the data will be on all nodes at the same time, maintaining a high level of consistency in the cluster. 


Aurora is not for all uses, and not for all budgets. However, it has some very good aspects like the one we have just seen. The difference between standard MySQL and Aurora is not in the time of holding/locking (aka operation impact) but in the HA aspects. If I have my data/structure on all my Secondaries at the same time as on the Source, I will feel much more comfortable than having to wait an additional T time.

This is why PXC in that case is a better alternative if you can afford the locking time. If not, well, Aurora 3 is your solution, just do your math properly and be conservative with the instance resources.


Some Things to Consider Before Moving Your Database to the Cloud


Before you transition your database environment to the cloud, there are a few things you should consider first.  Some of the touted benefits of the cloud also carry with them risks or negative impacts.  Let’s take a look at just a few of these.

First, consider whether or not you will face vendor lock-in.  Many people choose Open Source databases precisely to avoid this.  The interesting fact, however, is that you can actually get locked in without realizing it.  Many cloud providers have their own versions of popular database platforms such as MySQL, PostgreSQL, MongoDB, etc.  These versions may be utilizing heavily engineered versions of these database systems.

While it is often easy to migrate your data into the environments, applications are often modified to adapt to unique aspects of these database platforms.  In some cases, this may be utilizing specialized features that were added while other times it can even be dealing with a lack of features with extra development.  This can occur because often the cloud versions may be based upon older codebases which may not contain all of the latest feature sets of the DBMS.  This makes migrating off the cloud more challenging as it may require changes to the code to go back.  In this sense, you may pragmatically be locked into a particular cloud database without even realizing it.

Also, consider the additional cost of time and resources if you need to re-engineer your application to work with the cloud platform.  It may not be as simple as simply migrating over.  Instead, you will need to do extensive testing and perhaps rewrite code to make it all work as it should.

A common reason many migrate to the cloud is to save cost.  Some cloud providers cite projected savings of 50% or more due to not needing so much infrastructure, staff, etc.  While this is certainly possible, it is also possible that your costs could rise.  With the ease of creating and configuring new servers, it is easy to spin up more instances very quickly.  Of course, each of these instances is increasing your costs.  Without proper oversight and someone managing the spend, that monthly bill could quickly cause some sticker shock!

Storage and networking are areas that can easily increase costs in addition to sheer server instance counts.  Although storage costs are relatively cheap nowadays, think about what happens when teams set up additional test servers and leave large backups and datasets lying around.  And, of course, you have to pay networking costs as these large data sets are transferred from server to server.  The inter-connected networking of servers that used to be essentially “free” in your on-prem Data Center, is now generating costs.  It can add up quickly.

Moreover, with the “cheap” storage costs, archiving becomes less of a concern.  This is a real double-edged sword as not only do your costs increase, but your performance decreases when data sets are not archived properly.  Databases typically lose performance querying these enormous data sets and the additional time to update indexes negatively impacts performance.

Also, consider the loss of control.  In your on-prem databases, you control the data entirely.  Security is completely your responsibility.  Without a doubt, cloud providers have built their systems around security controls and this can be a huge advantage.  The thing you have to consider is that you really don’t know who is managing the systems housing your data.  You lose insight into the human aspects and this cannot be discounted.

In addition, if you have components of your application that will be either in another cloud or will remain on-prem, you must think about the effects of network latency on your application.  No longer are the various components sitting in the same Data Center.  They may be geographically dispersed across the globe.  Again, this can be a benefit, but it also carries with it a cost in performance.

You also need to consider whether you will need to retrain your staff.  Certainly, today most people are familiar with the cloud, but it is almost certain you will have some holdouts who are not sold on the change in the way you manage your servers.  Without a doubt, the way they work day-to-day will change.  Done properly, this shift can prove beneficial by allowing your DBAs to focus more on database performance and less on setting up and configuring servers, managing backups, monitoring, etc.

Moving to the cloud is all the rage nowadays and this article is certainly not meant to dissuade you from doing so.  Instead, the idea is to consider some of the ramifications before making the move to determine if it is really in your best interest and help you avoid some of the pitfalls.


Expose Databases on Kubernetes with Ingress


Ingress is a resource that is commonly used to expose HTTP(s) services outside of Kubernetes. To have ingress support, you will need an Ingress Controller, which in a nutshell is a proxy. SREs and DevOps love ingress as it provides developers with a self-service way to expose their applications. Developers love it as it is simple to use, but at the same time quite flexible.

High-level ingress design looks like this: 

High-level ingress design

  1. Users connect through a single Load Balancer or other Kubernetes service
  2. Traffic is routed through Ingress Pod (or Pods for high availability)
    • There are multiple flavors of Ingress Controllers. Some use nginx, some envoy, or other proxies. See a curated list of Ingress Controllers here.
  3. Based on HTTP headers, traffic is routed to the corresponding Pods which run the websites. For HTTPS traffic, Server Name Indication (SNI) is used, which is a TLS extension supported by most browsers. Usually, the ingress controller integrates nicely with cert-manager, which provides you with full TLS lifecycle support (yeah, no need to worry about renewals anymore).

The beauty and value of such a design is that you have a single Load Balancer serving all your websites. In Public Clouds, it leads to cost savings, and in private clouds, it simplifies your networking or reduces the number of IPv4 addresses (if you are not on IPv6 yet). 
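As an illustration of the host-based routing described above, a minimal HTTP Ingress resource might look like this (hostname and Service name are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: website
spec:
  ingressClassName: nginx
  rules:
    - host: blog.example.com          # routed by Host header / SNI
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: blog-service    # placeholder backend Service
                port:
                  number: 80
```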

TCP and UDP with Ingress

Quite interestingly, some ingress controllers also support TCP and UDP proxying. I have been asked on our forum and in Kubernetes Slack multiple times if it is possible to use ingress with Percona Operators. Well, it is. Usually, you need a load balancer per database cluster: 

TCP and UDP with Ingress

The design with ingress is going to be a bit more complicated but still allows you to utilize the single load balancer for multiple databases. In cases where you run hundreds of clusters, it leads to significant cost savings. 

  1. Each TCP port represents a database cluster
  2. The Ingress Controller decides where to route the traffic based on the port

The obvious downside of this design is non-default TCP ports for databases. There might be weird cases where it can turn into a blocker, but usually, it should not.


My goal is to deploy three PXC clusters and expose all of them through a single load balancer with ingress.

All configuration files I used for this blog post can be found in this GitHub repository.

Deploy Percona XtraDB Clusters (PXC)

The following commands are going to deploy the Operator and three clusters:

kubectl apply -f https://raw.githubusercontent.com/spron-in/blog-data/master/operators-and-ingress/bundle.yaml

kubectl apply -f https://raw.githubusercontent.com/spron-in/blog-data/master/operators-and-ingress/cr-minimal.yaml
kubectl apply -f https://raw.githubusercontent.com/spron-in/blog-data/master/operators-and-ingress/cr-minimal2.yaml
kubectl apply -f https://raw.githubusercontent.com/spron-in/blog-data/master/operators-and-ingress/cr-minimal3.yaml

Deploy Ingress

helm upgrade --install ingress-nginx ingress-nginx   --repo https://kubernetes.github.io/ingress-nginx   --namespace ingress-nginx --create-namespace  \
--set controller.replicaCount=2 \
--set tcp.3306="default/minimal-cluster-haproxy:3306"  \
--set tcp.3307="default/minimal-cluster2-haproxy:3306" \
--set tcp.3308="default/minimal-cluster3-haproxy:3306"

This is going to deploy a highly available ingress-nginx controller. 

  • controller.replicaCount=2 defines that we want to have at least two Pods of the ingress controller, to provide a highly available setup.
  • The tcp flags do two things:
    • expose ports 3306-3308 on the ingress’s load balancer
    • instruct the ingress controller to forward traffic to the corresponding services which were created by Percona Operator for the PXC clusters. For example, port 3307 is the one to use to connect to minimal-cluster2. Read more about this configuration in the ingress documentation.

Here is the load balancer resource that was created:

$ kubectl -n ingress-nginx get service
NAME                                 TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                                                                   AGE
ingress-nginx-controller             LoadBalancer   80:30261/TCP,443:32112/TCP,3306:31583/TCP,3307:30786/TCP,3308:31827/TCP   4m13s

As you see, ports 3306-3308 are exposed.

Check the Connection

This is it. Database clusters should be exposed and reachable. Let’s check the connection. 

Get the root password for minimal-cluster2:

$ kubectl get secrets minimal-cluster2-secrets | grep root | awk '{print $2}' | base64 --decode && echo

Connect to the database:

$ mysql -u root -h <load balancer IP> --port 3307 -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 5722

It works! Notice how I use port 3307, which corresponds to minimal-cluster2.

Adding More Clusters

What if you add more clusters into the picture, how do you expose those?


If you use helm, the easiest way is to just add one more flag into the command:

helm upgrade --install ingress-nginx ingress-nginx   --repo https://kubernetes.github.io/ingress-nginx   --namespace ingress-nginx --create-namespace  \
--set controller.replicaCount=2 \
--set tcp.3306="default/minimal-cluster-haproxy:3306"  \
--set tcp.3307="default/minimal-cluster2-haproxy:3306" \
--set tcp.3308="default/minimal-cluster3-haproxy:3306" \
--set tcp.3309="default/minimal-cluster4-haproxy:3306"

No helm

Without Helm, it is a two-step process:

First, edit the ConfigMap which configures TCP services exposure. By default it is called ingress-nginx-tcp:
kubectl -n ingress-nginx edit cm ingress-nginx-tcp

apiVersion: v1
data:
  "3306": default/minimal-cluster-haproxy:3306
  "3307": default/minimal-cluster2-haproxy:3306
  "3308": default/minimal-cluster3-haproxy:3306
  "3309": default/minimal-cluster4-haproxy:3306

A change in this ConfigMap will trigger the reconfiguration of nginx in the ingress Pods. But as a second step, it is also necessary to expose this port on the load balancer. To do so, edit the corresponding service:

kubectl -n ingress-nginx edit services ingress-nginx-controller
  - name: 3309-tcp
    port: 3309
    protocol: TCP
    targetPort: 3309-tcp

The new cluster is exposed on port 3309 now.

Limitations and considerations

Ports per Load Balancer

Cloud providers usually limit the number of ports that you can expose through a single load balancer:

  • AWS has 50 listeners per NLB, GCP 100 ports per service.

If you hit the limit, just create another load balancer pointing to the ingress controller.
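A sketch of such an additional Service, assuming the standard ingress-nginx chart labels (verify the selector against your actual deployment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller-lb2   # hypothetical name for the second LB
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  selector:                            # assumed ingress-nginx chart labels
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/component: controller
  ports:
  - name: 3310-tcp                     # ports beyond the first LB's limit
    port: 3310
    protocol: TCP
    targetPort: 3310-tcp
```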


Cost-saving is a good thing, but with Kubernetes capabilities, users expect to avoid manual tasks, not add more. Integrating the change of the ingress ConfigMap and load balancer ports into CI/CD is not a hard task, but maintaining the logic of adding new load balancers for more ports is harder. I have not found any projects that implement the logic of reusing load balancer ports or automating it in any way. If you know of any, please leave a comment under this blog post.


Using PostgreSQL Tablespaces with Percona Operator for PostgreSQL


Tablespaces allow DBAs to store a database on multiple file systems within the same server and to control where (on which file systems) specific parts of the database are stored. You can think about it as if you were giving names to your disk mounts and then using those names as additional parameters when creating database objects.

PostgreSQL supports this feature, allowing you to store data outside of the primary data directory, and Percona Operator for PostgreSQL is a good option to bring this to your Kubernetes environment when needed.

When might it be needed?

The most obvious use case for tablespaces is performance optimization. You place appropriate parts of the database on fast but expensive storage and use slower but cheaper storage for lesser-used database objects. The classic example would be using an SSD for heavily-used indexes and using a large slow HDD for archive data.

Of course, Kubernetes already provides you with powerful approaches to achieve this on a per-Pod basis (with Node Selectors, Tolerations, etc.). But if you would like to go deeper and make such differentiation at the level of your database objects (tables and indexes), tablespaces are exactly what you would need for that.

Another well-known use case for tablespaces is quickly adding a new partition to the database cluster when you run out of space on the initially used one and cannot extend it (which may look less typical for cloud storage). Finally, you may need tablespaces when migrating your existing architecture to the cloud.

Each tablespace created by Percona Operator for PostgreSQL corresponds to a separate Persistent Volume, mounted in a container to the  /tablespaces directory.

PostgreSQL Tablespaces with Percona Operator for PostgreSQL

Providing a new tablespace for your database in Kubernetes involves two parts:

  1. Configure the new tablespace storage with the Operator,
  2. Create database objects in this tablespace with PostgreSQL.

The first part is done in the traditional way of Percona Operators, by modifying the Custom Resource via the deploy/cr.yaml configuration file. It has a special spec.tablespaceStorages section with subsection names equal to PostgreSQL tablespace names.

The example already present in deploy/cr.yaml shows how to create tablespace storage named lake, 1 GB in size, with dynamic provisioning (you can see the official Kubernetes documentation on Persistent Volumes for details):

    tablespaceStorages:
      lake:
        volumeSpec:
          size: 1G
          accessmode: ReadWriteOnce
          storagetype: dynamic
          storageclass: ""
          matchLabels: ""

After you apply this by running the kubectl apply -f deploy/cr.yaml command, the new lake tablespace will appear within your database. Please take into account that if you add your new tablespace to the already existing PostgreSQL cluster, it may take time for the Operator to create Persistent Volume Claims and get Persistent Volumes actually mounted.

Now you can append TABLESPACE <tablespace_name> to your CREATE SQL statements to implicitly create tables, indexes, or even entire databases in specific tablespaces (of course, your user should have appropriate CREATE privileges to make it possible).

Let’s create an example table in the already mentioned lake tablespace:

CREATE TABLE products (
    product_sku character(10),
    quantity int,
    manufactured_date timestamptz)
TABLESPACE lake;
It is also possible to set a default tablespace with the SET default_tablespace = <tablespace_name>; statement. It will affect all further CREATE TABLE and CREATE INDEX commands without an explicit tablespace specifier, until you unset it with an empty string.
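For example (assuming the lake tablespace from the example above already exists):

```sql
-- All objects created while the default is set land in "lake":
SET default_tablespace = lake;
CREATE TABLE archive_products (product_sku character(10));
CREATE INDEX ON archive_products (product_sku);

-- Unset with an empty string to return to the database default:
SET default_tablespace = '';
```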

As you can see, Percona Operator for PostgreSQL simplifies tablespace creation by carrying out all the necessary modifications to Persistent Volumes and Pods. The same is not true for the deletion of an already existing tablespace, which is not automated, neither by the Operator nor by PostgreSQL.

Deleting an existing tablespace from your database in Kubernetes also involves two parts:

  1. Delete related database objects and tablespace with PostgreSQL
  2. Delete tablespace storage in Kubernetes

To make tablespace deletion with PostgreSQL possible, you should make this tablespace empty (it is impossible to drop a tablespace until all objects in all databases using this tablespace have been removed). Tablespaces are listed in the pg_tablespace table, and you can use it to find out which objects are stored in a specific tablespace. The example command for the lake tablespace will look as follows:

SELECT relname FROM pg_class WHERE reltablespace=(SELECT oid FROM pg_tablespace WHERE spcname='lake');

When your tablespace is empty, you can log in to the PostgreSQL Primary instance as a superuser, and then execute the DROP TABLESPACE <tablespace_name>; command.

Now, when the PostgreSQL part is finished, you can remove the tablespace entry from the tablespaceStorages section (don’t forget to run the kubectl apply -f deploy/cr.yaml command to apply changes).

However, Persistent Volumes will still be mounted to the /tablespaces directory in PostgreSQL Pods. To remove these mounts, you should edit all Deployment objects for pgPrimary and pgReplica instances in your Kubernetes cluster and remove the Volume and VolumeMount entries related to your tablespace.

You can see the list of Deployment objects with the kubectl get deploy command. Running it for a default cluster named cluster1 results in the following output:

NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
cluster1                        1/1     1            1           156m
cluster1-backrest-shared-repo   1/1     1            1           156m
cluster1-pgbouncer              3/3     3            3           154m
cluster1-repl1                  1/1     1            1           154m
cluster1-repl2                  1/1     1            1           154m
postgres-operator               1/1     1            1           157m

Now run kubectl edit deploy <object_name> for the cluster1, cluster1-repl1, and cluster1-repl2 objects consequently. Each command will open a text editor, where you should remove the appropriate lines, which in the case of the lake tablespace will look as follows:

        - name: database
          ...
          volumeMounts:
            - name: tablespace-lake
              mountPath: /tablespaces/lake
      volumes:
        - name: tablespace-lake
          persistentVolumeClaim:
            claimName: cluster1-tablespace-lake

Finishing the edit causes Pods to be recreated without tablespace mounts.



Face to Face with Semi-Synchronous Replication


Last month I performed a review of the Percona Operator for MySQL Server which is still Alpha.  That operator is based on Percona Server for MySQL and uses standard asynchronous replication, with the option to activate semi-synchronous replication to gain higher levels of data consistency between nodes. 

The whole solution is composed as:

semi-synchronous replication

Additionally, Orchestrator (https://github.com/openark/orchestrator) is used to manage the topology and to enable the semi-synchronous flag on the replica nodes if required. While there is not much to say about standard asynchronous replication, I want to write a few words on the needs and expectations of the semi-synchronous (semi-sync) solution. 

A Look into Semi-Synchronous

Difference between Async and Semi-sync.



The above diagram represents standard asynchronous replication. This method is expected, by design, to have transactions committed on the Source that are not yet present on the Replicas. The Replica is supposed to catch up when possible.   

It is also important to understand that there are two steps in replication:

  • Data copy, which is normally very fast. The Data is copied from the binlog of the Source to the relay log on the Replica (IO_Thread).
  • Data apply, where the data is read from the relay log on the Replica node and written inside the database itself (SQL_Thread). This step is normally the bottleneck and while there are some parameters to tune, the efficiency to apply transactions depends on many factors including schema design. 
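The two steps above can be sketched as a tiny model (pure illustration, not MySQL internals): the IO thread only appends events to the relay log, while the SQL thread pays the apply cost, which is where lag accumulates.

```python
from collections import deque

class ToyReplica:
    """Toy model of the two replication steps on a Replica node."""
    def __init__(self):
        self.relay_log = deque()   # filled by the IO thread (fast copy)
        self.applied = []          # drained by the SQL thread (slow apply)

    def io_thread_receive(self, events):
        # Step 1: copy events from the Source binlog into the relay log.
        self.relay_log.extend(events)

    def sql_thread_apply(self, budget):
        # Step 2: apply at most `budget` events; this is the bottleneck.
        while budget > 0 and self.relay_log:
            self.applied.append(self.relay_log.popleft())
            budget -= 1

    def lag(self):
        # Events copied but not yet applied.
        return len(self.relay_log)

replica = ToyReplica()
replica.io_thread_receive(range(100))  # copy is fast: all 100 events arrive
replica.sql_thread_apply(budget=30)    # apply is slow: only 30 get written
print(replica.lag())                   # events still sitting in the relay log
```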

Production deployments that use the asynchronous solution are typically designed to manage the possible inconsistency, given that data on the Source is not guaranteed to be on the Replica at commit time. At the same time, the level of high availability of this solution is lower than what we normally obtain with (virtually) synchronous replication, given that we may need to wait for the Replica to catch up on the gap accumulated in the relay logs before performing the failover.



The above diagram represents the semi-sync replication method. The introduction of semi-sync adds a checking step on the Source before it returns the acknowledgment to the client. This step happens at the moment of the data copy, i.e., when the data is copied from the binary log on the Source to the relay log on the Replica. 

This is important: there is NO mechanism to ensure more resilient or efficient data replication. There is only an additional step that tells the Source to wait a given amount of time for an answer from N replicas, and then either return the acknowledgment or time out and return to the client no matter what. 

This mechanism is introducing a possible significant delay in the service, without giving a 100% guarantee of data consistency. 

In terms of service availability, under high load this method may lead the Source to stop serving requests while waiting for acknowledgments, significantly reducing the availability of the service itself.   

At the same time, the only acceptable setting for rpl_semi_sync_source_wait_point is AFTER_SYNC (the default), because: “In the event of source failure, all transactions committed on the source have been replicated to the replica (saved to its relay log). An unexpected exit of the source server and failover to the replica is lossless because the replica is up to date.”

All clear? No? Let me simplify the thing. 

  • In standard replication, you have two moments (I am simplifying)
    • Copy data from Source to Replica
    • Apply data in the Replica node
  • There is no certification on the data applied about its consistency with the Source
  • With asynchronous the Source task is to write data in the binlog and forget
  • With semi-sync, the Source writes the data on binlog and waits T seconds to receive an acknowledgment from N servers about them having received the data.
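The behavior described above can be sketched as a toy simulation (not MySQL code; names and timings are made up): the Source waits up to a timeout for N acknowledgments of the copy, and falls back to asynchronous mode when it times out.

```python
def commit(ack_delays, required_acks=1, timeout=10.0):
    """Toy semi-sync commit: returns (acked, fell_back).

    ack_delays: seconds each replica takes to acknowledge the *copy*
    into its relay log (not the apply, which is never waited on).
    """
    # Only acknowledgments arriving within the timeout window count.
    on_time = [d for d in ack_delays if d <= timeout]
    if len(on_time) >= required_acks:
        return True, False   # stayed semi-synchronous
    return False, True       # timed out: the server reverts to async

# Two replicas, one slow: with required_acks=1 the fast one saves us.
print(commit([0.2, 30.0], required_acks=1))   # (True, False)

# With required_acks=2 the slow replica forces the fallback, and as
# discussed above, the fallback affects the whole server.
print(commit([0.2, 30.0], required_acks=2))   # (False, True)
```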

To enable semi-sync you follow these steps:  https://dev.mysql.com/doc/refman/8.0/en/replication-semisync-installation.html 

In short:

  • Register the plugins
  • Enable Source rpl_semi_sync_source_enabled=1
  • Enable Replica rpl_semi_sync_replica_enabled = 1
    • If replication is already running STOP/START REPLICA IO_THREAD
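In SQL, the steps above look roughly like this (plugin and variable names per MySQL 8.0.26 and later; older versions use the master/slave spellings):

```sql
-- On the Source:
INSTALL PLUGIN rpl_semi_sync_source SONAME 'semisync_source.so';
SET GLOBAL rpl_semi_sync_source_enabled = 1;

-- On each Replica:
INSTALL PLUGIN rpl_semi_sync_replica SONAME 'semisync_replica.so';
SET GLOBAL rpl_semi_sync_replica_enabled = 1;

-- If replication is already running, restart the IO thread so the
-- Replica registers as semi-synchronous:
STOP REPLICA IO_THREAD;
START REPLICA IO_THREAD;
```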

And here starts the fun, be ready for many “wait whaaat?”

What are the T and N I have just mentioned above?

Well, the T is a timeout that you can set to avoid having the source wait forever for the Replica acknowledgment. The default is 10 seconds. What happens if the Source waits for more than the timeout? 
rpl_semi_sync_source_timeout controls how long the source waits on a commit for acknowledgment from a replica before timing out and reverting to asynchronous replication.

Careful with the wording here! The manual says SOURCE, so it is not that MySQL reverts to asynchronous per transaction or per connection; it is for the whole server.

Now analyzing the work-log (see https://dev.mysql.com/worklog/task/?id=1720 and more in the references) the Source should revert to semi-synchronous as soon as all involved replicas are aligned again. 

However, checking the code (see  https://github.com/mysql/mysql-server/blob/beb865a960b9a8a16cf999c323e46c5b0c67f21f/plugin/semisync/semisync_source.cc#L844 and following), we can see that we do not have a 100% guarantee that the Source will be able to switch back. 

Also in the code:
But, it is not that easy to detect that the replica has caught up.  This is caused by the fact that MySQL’s replication protocol is asynchronous, meaning that if the source does not use the semi-sync protocol, the replica would not send anything to the source.

In all the tests run, the Source was never able to switch back. In short, the Source moved out of semi-sync, and that was forever, with no rollback. Keep that in mind as we go ahead.

What is the N I mentioned above? It represents the number of Replicas that must provide the acknowledgment back. 

If you have a cluster of 10 nodes, you may need to have only two of them involved in semi-sync; there is no need to include them all. But if you have a cluster of three nodes where one is the Source, relying on one Replica only is not really safe. What I mean here is that if you choose semi-synchronous to ensure the data replicates, having it enabled for one single node is not enough; if that node crashes, you are doomed, so you need at least two nodes with semi-sync.
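N and T map to system variables; for example, to require two acknowledgments with the default 10-second timeout (variable names per MySQL 8.0.26 and later):

```sql
-- N: how many replicas must acknowledge before the commit returns:
SET GLOBAL rpl_semi_sync_source_wait_for_replica_count = 2;

-- T: the timeout, in milliseconds (default 10000 = 10 seconds):
SET GLOBAL rpl_semi_sync_source_timeout = 10000;
```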

Anyhow, the point is that if one of the Replicas takes more than T to reply, the whole mechanism stops working, probably forever. 

As we have seen above, to enable semi-sync on Source we manipulate the value of the GLOBAL variable rpl_semi_sync_source_enabled.

However, if I check the value of rpl_semi_sync_source_enabled after the Source has shifted to plain asynchronous replication because of a timeout:

select @@rpl_semi_sync_source_enabled;
+--------------------------------+
| @@rpl_semi_sync_source_enabled |
+--------------------------------+
|                              1 |
+--------------------------------+

As you can see, the global variable reports a value of 1, meaning that semi-sync appears active even when it is not.

The documentation reports that to monitor semi-sync activity we should check Rpl_semi_sync_source_status. This means you can have Rpl_semi_sync_source_status = 0 and rpl_semi_sync_source_enabled = 1 at the same time.

Is this a bug? Well according to documentation:

When the source switches between asynchronous or semisynchronous replication due to commit-blocking timeout or a replica catching up, it sets the value of the Rpl_semi_sync_source_status or Rpl_semi_sync_master_status status variable appropriately. Automatic fallback from semisynchronous to asynchronous replication on the source means that it is possible for the rpl_semi_sync_source_enabled or rpl_semi_sync_master_enabled system variable to have a value of 1 on the source side even when semisynchronous replication is in fact not operational at the moment. You can monitor the Rpl_semi_sync_source_status or Rpl_semi_sync_master_status status variable to determine whether the source currently is using asynchronous or semisynchronous replication.

It is not a bug. However, documenting a behavior does not change the fact that this is a weird/unfriendly/counterintuitive way of doing things, one that opens the door to many, many possible issues. Especially given that we know the Source may fail to switch semi-sync back on.

Just to close this part, we can summarize as follows:

  • You activate semi-sync by setting a global variable
  • The Server/Source can disable it (silently) without changing that variable
  • The Server will never restore semi-sync automatically
  • The way to check whether semi-sync is working is the status variable
  • When Rpl_semi_sync_source_status = 0 and rpl_semi_sync_source_enabled = 1, you had a timeout and the Source is now using asynchronous replication
  • The way to reactivate semi-sync is to set rpl_semi_sync_source_enabled to OFF first, then back to ON
  • Replicas can have semi-sync set ON/OFF, but unless you STOP/START the Replica_IO_THREAD, the state of the variable can be inconsistent with the state of the server
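Putting the last points together, a minimal sketch of the check-and-reactivate cycle (variable names as spelled in MySQL 8.0.26+; earlier versions use the master/slave spelling):

```sql
-- 1. Check the real state: the status variable, not the system variable.
--    OFF here means the server has fallen back to asynchronous replication.
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_source_status';

-- 2. If the status is OFF while rpl_semi_sync_source_enabled is still 1,
--    toggle the variable to re-arm semi-sync.
SET GLOBAL rpl_semi_sync_source_enabled = OFF;
SET GLOBAL rpl_semi_sync_source_enabled = ON;
```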

What can go wrong?

Semi-Synchronous is Not Seriously Affecting the Performance

Others have already discussed semi-sync performance in better detail. However, I want to add some color, given the recent experience with our Operator testing.

In the next graphs, I will show you the behavior of writes/reads using Asynchronous replication and the same load with Semi-synchronous.

For the record, the test was a simple Sysbench-tpcc test using 20 tables, 20 warehouses, and 256 threads for 600 seconds.   

(Graphs: reads/writes per second, no semi-sync, with write details)

The graphs above show a nice, consistent r/w load with minimal fluctuations. This is what we like to have.

The graphs below represent the exact same load in the exact same environment, but with semi-sync activated and no timeout.

(Graphs: reads per second and CPU usage, semi-sync enabled)

Aside from the performance loss (we went from ~10k transactions per second to ~3k), the constant stop/go imposed by the semi-sync mechanism has a very bad effect on application behavior when you have many concurrent threads and a high load. I challenge any serious production system to work this way.

Of course, the results are in line with this yoyo game:

(Graph: MySQL Operator replication throughput)

In the best case, when everything worked as expected and nothing crazy happened, I saw around a 60% loss. I am not inclined to call that a minor performance drop.

But at Least Your Data is Safe

As already stated at the beginning, the scope of semi-synchronous replication is to guarantee that the data in server A reaches server B before returning the OK to the application. 

In short, over a period of one second, we should have minimal transactions in flight and a limited apply queue, while with standard (asynchronous) replication we may have … thousands.

In the graphs below we can see two lines:

  • The yellow line represents the number of GTIDs “in flight” from Source to destination (Y2 axis). In case of a Source crash, those transactions are lost and we have data loss.
  • The blue line represents the number of GTIDs already copied over from Source to Replica but not yet applied (Y1 axis). In case of a Source crash, we must wait for the Replica to process these entries before making the node write-active, or we get data inconsistency.
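Both quantities can be approximated from the Replica side with standard commands; a sketch (statement and field names as in MySQL 8.0.22+):

```sql
-- On the Replica:
SHOW REPLICA STATUS\G
-- Retrieved_Gtid_Set : GTIDs received from the Source into the relay log
-- Executed_Gtid_Set  : GTIDs already applied
-- GTID_SUBTRACT(Retrieved_Gtid_Set, Executed_Gtid_Set) is the apply queue
-- (the blue line); transactions present in the Source's gtid_executed but
-- absent from Retrieved_Gtid_Set are the ones still in flight (the yellow line).
```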

Asynchronous replication:

(Graph: GTID queues with asynchronous replication)

As expected, we can see a huge queue of transactions to apply from the relay log, and some spikes of transactions in flight.

Using semi-synchronous replication:

(Graph: GTID queues with semi-synchronous replication)

Yes, apparently we have reduced the queue, and with no spikes there is no data loss.

But this happens when all goes as expected, and we know that in production this is not the norm. What if we need to enforce semi-sync but at the same time cannot set the timeout to a ridiculous value like one week?

Simple: we need a check that re-enables semi-sync as soon as it is silently disabled (for any reason).
However, doing this without waiting for the Replicas to close the replication gap causes the following interesting effects:

(Graph: tpcc transaction distance)

Thousands of transactions are queued and shipped, with the result of a significant increase in the possible data loss, and still a huge amount of data to apply from the relay log.

So the only possible alternative is to set the timeout to a crazy value. However, this can cause a full production stop if a Replica hangs or, for any reason, disables semi-sync locally.


Conclusions

First of all, I want to say that the tests on our Operator using asynchronous replication show behavior consistent with standard deployments in the cloud or on premises. It has the same benefits, like better performance, and the same possible issues, such as a longer failover time when it needs to wait for a Replica to apply the relay-log queue.

The semi-synchronous flag in the Operator is disabled, and the tests I have done lead me to say: keep it like that! At least unless you know very well what you are doing and are able to deal with a semi-sync timeout of days.

I was happy to have the chance to perform these tests, because they gave me the time and the reason to investigate the semi-synchronous feature more deeply. Personally, I was not convinced by semi-synchronous replication when it came out, and I am not now. I have never seen a less consistent and less trustworthy feature in MySQL than semi-sync.

If you need to have a higher level of synchronicity in your database, just go for Group Replication, or Percona XtraDB Cluster, and stay away from semi-sync. 

Otherwise, stay on asynchronous replication, which is not perfect, but is predictable.

Run PostgreSQL on Kubernetes with Percona Operator & Pulumi


Avoid vendor lock-in, provide a private Database-as-a-Service for internal teams, quickly deploy-test-destroy databases within CI/CD pipelines – these are some of the most common use cases for running databases on Kubernetes with operators. Percona Distribution for PostgreSQL Operator enables users to do exactly that, and more.

Pulumi is an infrastructure-as-code tool that enables developers to write code in their favorite language (Python, Golang, JavaScript, etc.) to easily deploy infrastructure and applications to public clouds and platforms such as Kubernetes.

This blog post is a step-by-step guide on how to deploy a highly-available PostgreSQL cluster on Kubernetes with our Percona Operator and Pulumi.

Desired State

We are going to provision the following resources with Pulumi:

  • Google Kubernetes Engine cluster with three nodes. It can be any Kubernetes flavor.
  • Percona Operator for PostgreSQL
  • Highly available PostgreSQL cluster with one primary and two hot standby nodes
  • Highly available pgBouncer deployment with the Load Balancer in front of it
  • pgBackRest for local backups

Pulumi code can be found in this git repository.


I will use an Ubuntu box to run Pulumi, but almost the same steps work on macOS.

Pre-install Packages

gcloud and kubectl

echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
sudo apt-get update
sudo apt-get install -y google-cloud-sdk docker.io kubectl jq unzip


Pulumi allows developers to use the language of their choice to describe infrastructure and applications. I am going to use Python. We will also need pip (the Python package manager) and venv (the virtual environment module).

sudo apt-get install python3 python3-pip python3-venv


Install Pulumi:

curl -sSL https://get.pulumi.com | sh

On macOS, Pulumi can be installed via Homebrew with:

brew install pulumi


You will need to add .pulumi/bin to the $PATH:

export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/percona/.pulumi/bin



You will need to provide access to Google Cloud to provision Google Kubernetes Engine.

gcloud config set project your-project
gcloud auth application-default login
gcloud auth login


Generate a Pulumi token at app.pulumi.com. You will need it later to init the Pulumi stack:


This repo has the following files:

  • Pulumi.yaml – identifies the folder as a Pulumi project
  • __main__.py – Python code used by Pulumi to provision everything we need
  • requirements.txt – lists the Python packages required to run the code

Clone the repo and go to the pg-k8s-pulumi folder:



git clone https://github.com/spron-in/blog-data
cd blog-data/pg-k8s-pulumi

Init the stack with:

pulumi stack init pg

Here you will need the token generated earlier on app.pulumi.com.


Python code that Pulumi is going to process is in __main__.py file. 

Lines 1-6: importing python packages

Lines 8-31: configuration parameters for this Pulumi stack. It consists of two parts:

  • Kubernetes cluster configuration. For example, the number of nodes.
  • Operator and PostgreSQL cluster configuration – namespace to be deployed to, service type to expose pgBouncer, etc.

Lines 33-80: deploy GKE cluster and export its configuration

Lines 82-88: create the namespace for Operator and PostgreSQL cluster

Lines 91-426: deploy the Operator. In reality, it just mirrors the operator.yaml from our Operator.

Lines 429-444: create the secret object that allows you to set the password for pguser to connect to the database

Lines 445-557: deploy PostgreSQL cluster. It is a JSON version of cr.yaml from our Operator repository

Line 560: exports Kubernetes configuration so that it can be reused later 


First, we will set the configuration for this stack. Execute the following commands:

pulumi config set gcp:project YOUR_PROJECT
pulumi config set gcp:zone us-central1-a
pulumi config set node_count 3
pulumi config set master_version 1.21

pulumi config set namespace percona-pg
pulumi config set pg_cluster_name pulumi-pg
pulumi config set service_type LoadBalancer
pulumi config set pg_user_password mySuperPass

These commands set the following:

  • GCP project where GKE is going to be deployed
  • GCP zone
  • Number of nodes in the GKE cluster
  • Kubernetes version
  • Namespace to run the PostgreSQL cluster in
  • The name of the cluster
  • Exposure of pgBouncer through a LoadBalancer object
  • The password for the pguser database user

Deploy with the following command:

$ pulumi up
Previewing update (pg)

View Live: https://app.pulumi.com/spron-in/percona-pg-k8s/pg/previews/d335d117-b2ce-463b-867d-ad34cf456cb3

     Type                                                           Name                                Plan       Info
 +   pulumi:pulumi:Stack                                            percona-pg-k8s-pg                   create     1 message
 +   ├─ random:index:RandomPassword                                 pguser_password                     create
 +   ├─ random:index:RandomPassword                                 password                            create
 +   ├─ gcp:container:Cluster                                       gke-cluster                         create
 +   ├─ pulumi:providers:kubernetes                                 gke_k8s                             create
 +   ├─ kubernetes:core/v1:ServiceAccount                           pgoPgo_deployer_saServiceAccount    create
 +   ├─ kubernetes:core/v1:Namespace                                pgNamespace                         create
 +   ├─ kubernetes:batch/v1:Job                                     pgoPgo_deployJob                    create
 +   ├─ kubernetes:core/v1:ConfigMap                                pgoPgo_deployer_cmConfigMap         create
 +   ├─ kubernetes:core/v1:Secret                                   percona_pguser_secretSecret         create
 +   ├─ kubernetes:rbac.authorization.k8s.io/v1:ClusterRoleBinding  pgo_deployer_crbClusterRoleBinding  create
 +   ├─ kubernetes:rbac.authorization.k8s.io/v1:ClusterRole         pgo_deployer_crClusterRole          create
 +   └─ kubernetes:pg.percona.com/v1:PerconaPGCluster               my_cluster_name                     create

  pulumi:pulumi:Stack (percona-pg-k8s-pg):
    E0225 14:19:49.739366105   53802 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies

Do you want to perform this update? yes

Updating (pg)
View Live: https://app.pulumi.com/spron-in/percona-pg-k8s/pg/updates/5
     Type                                                           Name                                Status      Info
 +   pulumi:pulumi:Stack                                            percona-pg-k8s-pg                   created     1 message
 +   ├─ random:index:RandomPassword                                 pguser_password                     created
 +   ├─ random:index:RandomPassword                                 password                            created
 +   ├─ gcp:container:Cluster                                       gke-cluster                         created
 +   ├─ pulumi:providers:kubernetes                                 gke_k8s                             created
 +   ├─ kubernetes:core/v1:ServiceAccount                           pgoPgo_deployer_saServiceAccount    created
 +   ├─ kubernetes:core/v1:Namespace                                pgNamespace                         created
 +   ├─ kubernetes:core/v1:ConfigMap                                pgoPgo_deployer_cmConfigMap         created
 +   ├─ kubernetes:batch/v1:Job                                     pgoPgo_deployJob                    created
 +   ├─ kubernetes:core/v1:Secret                                   percona_pguser_secretSecret         created
 +   ├─ kubernetes:rbac.authorization.k8s.io/v1:ClusterRole         pgo_deployer_crClusterRole          created
 +   ├─ kubernetes:rbac.authorization.k8s.io/v1:ClusterRoleBinding  pgo_deployer_crbClusterRoleBinding  created
 +   └─ kubernetes:pg.percona.com/v1:PerconaPGCluster               my_cluster_name                     created

  pulumi:pulumi:Stack (percona-pg-k8s-pg):
    E0225 14:20:00.211695433   53839 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies

    kubeconfig: "[secret]"

    + 13 created

Duration: 5m30s


Get kubeconfig first:

pulumi stack output kubeconfig --show-secrets > ~/.kube/config

Check if Pods of your PG cluster are up and running:

$ kubectl -n percona-pg get pods
NAME                                             READY   STATUS      RESTARTS   AGE
backrest-backup-pulumi-pg-dbgsp                  0/1     Completed   0          64s
pgo-deploy-8h86n                                 0/1     Completed   0          4m9s
postgres-operator-5966f884d4-zknbx               4/4     Running     1          3m27s
pulumi-pg-787fdbd8d9-d4nvv                       1/1     Running     0          2m12s
pulumi-pg-backrest-shared-repo-f58bc7657-2swvn   1/1     Running     0          2m38s
pulumi-pg-pgbouncer-6b6dc4564b-bh56z             1/1     Running     0          81s
pulumi-pg-pgbouncer-6b6dc4564b-vpppx             1/1     Running     0          81s
pulumi-pg-pgbouncer-6b6dc4564b-zkdwj             1/1     Running     0          81s
pulumi-pg-repl1-58d578cf49-czm54                 0/1     Running     0          46s
pulumi-pg-repl2-7888fbfd47-h98f4                 0/1     Running     0          46s
pulumi-pg-repl3-cdd958bd9-tf87k                  1/1     Running     0          46s

Get the IP-address of pgBouncer LoadBalancer:

$ kubectl -n percona-pg get services
NAME                             TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)                      AGE
pulumi-pg-pgbouncer              LoadBalancer   5432:32042/TCP               3m17s

You can connect to your PostgreSQL cluster through this IP address. Use the pguser password that was set earlier with pulumi config set pg_user_password:

psql -h <EXTERNAL-IP> -p 5432 -U pguser pgdb

Clean up

To delete everything it is enough to run the following commands:

pulumi destroy
pulumi stack rm

Tricks and Quirks

Pulumi Converter

kube2pulumi is a huge help if you already have YAML manifests. You don’t need to rewrite the whole code, but just convert YAMLs to Pulumi code. This is what I did for operator.yaml.


There are two ways for Custom Resource management in Pulumi:

crd2pulumi generates libraries/classes out of Custom Resource Definitions and allows you to create Custom Resources later using them. I found it a bit complicated, and it also lacks documentation.

apiextensions.CustomResource on the other hand allows you to create Custom Resources by specifying them as JSON. It is much easier and requires less manipulation. See lines 446-557 in my __main__.py.

True/False in JSON

I have the following in my Custom Resource definition in Pulumi code:

perconapg = kubernetes.apiextensions.CustomResource(
    spec={
        "disableAutofail": False,
        "tlsOnly": False,
        "standby": False,
        "pause": False,
        "keepData": True,
        # ...
    },
)

Be sure to use the booleans of the language of your choice and not the “true”/”false” strings. For me, using strings turned into a failure, as the Operator was expecting booleans, not strings.
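The difference is easy to see with a minimal illustration using only Python’s standard library json module:

```python
import json

# A Python boolean serializes to a JSON boolean...
print(json.dumps({"standby": False}))    # {"standby": false}

# ...while the string "false" stays a quoted string, which the Operator
# rejects because the field must be a boolean.
print(json.dumps({"standby": "false"}))  # {"standby": "false"}
```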

Depends On…

Pulumi makes its own decisions about the order in which resources are provisioned. You can enforce the order by specifying dependencies.

For example, I’m ensuring that Operator and Secret are created before the Custom Resource:
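A minimal sketch of how such a dependency is expressed in Pulumi’s Python SDK via pulumi.ResourceOptions; the resource variable names here are illustrative, the real ones are in __main__.py:

```python
# pgo_deploy_job and pguser_secret stand for the Job and Secret resources
# created earlier in the program (illustrative names).
perconapg = kubernetes.apiextensions.CustomResource(
    "my-cluster-name",
    api_version="pg.percona.com/v1",
    kind="PerconaPGCluster",
    spec=cluster_spec,  # the spec dict shown in the previous section
    # depends_on forces Pulumi to create these resources first:
    opts=pulumi.ResourceOptions(depends_on=[pgo_deploy_job, pguser_secret]),
)
```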

