May 28, 2026

Percona Operator for PostgreSQL 3.0.0: Hard Fork, OLM Scoping, Major Upgrades


The Percona Operator for PostgreSQL 3.0.0 is here. This is the release that completes the hard fork of the operator from the Crunchy Data PostgreSQL Operator into a fully independent project, with a dedicated upstream.pgv2.percona.com API group for the inherited CRDs, an automatic CRD-rename rollout for existing 2.x installs on upgrade, and a public roadmap that drives what comes next.

This release ships three headline changes that matter for production teams. First, the inherited CRDs move to a Percona-owned API group, which finally lets the Crunchy operator and the Percona operator coexist in the same Kubernetes cluster. Second, OLM namespace scoping now works properly for OpenShift installations. Third, major PostgreSQL version upgrades move to the official Percona Distribution image, aligning the upgrade path with the same binaries that run in your clusters.

 

All three land in service of the same goal: making 3.0.0 a clean, durable operational baseline for the operator’s next several years as an independent project. Future releases will be shaped by what the community asks for and contributes back. The public roadmap is the durable signal of that commitment.

In this post, you will learn about:

  • The hard fork and how the CRD rename unlocks coexistence with the Crunchy operator
  • OLM namespace-scoping improvements for OpenShift installations
  • The move to the official Percona Distribution image for major PostgreSQL version upgrades
  • Other improvements and the 2.7.0 deprecation
  • Supported PostgreSQL versions and platforms

 

Hard fork: CRDs renamed under upstream.pgv2.percona.com

The Percona Operator for PostgreSQL has, until now, been a soft fork. Custom Resources inherited from Crunchy PGO used the upstream postgres-operator.crunchydata.com API group. The two operators shared CRDs, which meant you could only run one of them in a given Kubernetes cluster. Installing both would lead to overlapping CRDs, conflicting webhooks, and finalizer collisions, so platform teams had to pick a side before they had finished evaluating.

Starting with 3.0.0, every inherited CRD is renamed into a new dedicated upstream.pgv2.percona.com API group (K8SPG-1007). Percona’s own native CRDs (such as PerconaPGCluster under pgv2.percona.com/v2) are unchanged. The change applies to the inherited resources: PostgresCluster, PGUpgrade, PGAdmin, and the rest.

 

Coexistence: running both operators in the same cluster

The practical effect is that the Crunchy Data PostgreSQL Operator and the Percona Operator for PostgreSQL can now run on the same Kubernetes cluster at the same time, even in the same namespaces, with no CRD or webhook conflict. That unlocks a few real workflows: evaluating both operators on the same staging cluster without spinning up a second cluster, running existing Crunchy-managed clusters in some namespaces while bringing up new Percona-managed clusters in others, or testing a new database version on the Percona side while production stays on Crunchy until you are confident. The choice between the two operators stops being all-or-nothing.

 

Upgrade behavior for existing 2.x installs

For an existing install, the upgrade to 3.0.0 is mechanically simple. The operator creates the new-API-group CRDs alongside the legacy ones, then runs a one-time migration that updates dependent objects (Secrets, certificates, finalizer references) to point at the new CRD instances. Existing custom resources keep working through the legacy CRDs during the transition, and once migration completes, all reconciliation moves to the new group.

Old PostgresCluster reference:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: cluster1


New (after upgrade to 3.0.0):

apiVersion: upstream.pgv2.percona.com/v1beta1
kind: PostgresCluster
metadata:
  name: cluster1

 

Day-to-day, your PerconaPGCluster Custom Resource (the one most teams interact with directly) is unchanged. The rename mostly matters in three situations: when a kubectl filter or a GitOps repository hard-codes the old API group, when a CI pipeline references the legacy CRD by name, and when you run the Percona and Crunchy operators side by side and need them not to collide.
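
One way to find hard-coded references before the upgrade is to scan your manifests for apiVersion lines that still use the legacy group. The sketch below is a minimal example under stated assumptions, not an official migration tool: the function name find_legacy_api_refs is made up for illustration, and it assumes your manifests are plain YAML files on disk. If you keep running the Crunchy operator alongside, manifests for Crunchy-managed clusters legitimately keep the old group.

```shell
# Legacy API group used by the inherited CRDs before 3.0.0 (dots escaped for grep).
LEGACY_GROUP_RE='postgres-operator\.crunchydata\.com'

# find_legacy_api_refs DIR: print "file:line:text" for every apiVersion line
# under DIR that still points at the legacy group. Label and annotation keys
# that merely contain the same domain are not reported, because only lines
# that start with "apiVersion:" are matched.
# (Hypothetical helper name, for illustration only.)
find_legacy_api_refs() {
  grep -rnE "^[[:space:]]*-?[[:space:]]*apiVersion:[[:space:]]*${LEGACY_GROUP_RE}/" "$1" || true
}
```

Calling find_legacy_api_refs on the root of a GitOps checkout lists each file and line that would need the new upstream.pgv2.percona.com group.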

Note: During the CRD migration on upgrade, the release notes report brief disruptions to pgBackRest operations (typically 1 to 2 minutes) while Kubernetes propagates certificate changes. Plan the upgrade during a maintenance window if backup continuity is critical, or pause scheduled backups during the upgrade.

Full details on the API-group change are in the Percona PostgreSQL operator documentation.

 

Improved OLM namespace scoping for OpenShift

OpenShift users install operators through the Operator Lifecycle Manager (OLM), and OLM uses an OperatorGroup to scope which namespaces an operator watches. In practice, 2.x had quirks: teams that selected “Single namespace” mode would sometimes see the operator reconciling CRs in other namespaces, and teams in “All namespaces” mode would sometimes see incomplete coverage when CRs were created in newly-added namespaces.

3.0.0 fixes this by aligning the operator’s namespace watch list with the OperatorGroup that OLM applies. All-namespaces installs watch all namespaces. Single-namespace installs respect the targetNamespaces set on the OperatorGroup.

 

Why it matters in shared infrastructure

For an OpenShift platform team running shared infrastructure, this distinction matters operationally. A typical setup has the database operator installed once in a platform namespace (such as openshift-operators) but expected to serve PerconaPGCluster resources owned by individual application teams in their own namespaces. If the operator over-reaches into namespaces it should not watch, RBAC noise multiplies. If it under-reaches, application teams file tickets about clusters that never reconcile. The 3.0.0 alignment with OperatorGroup semantics removes both failure modes.

 

OperatorGroup wiring

For users installing through OLM via the OpenShift web console, the install flow is unchanged. The fix is in how the operator’s reconciler interprets the OLM-supplied namespace scope after install. For users who manage OperatorGroups directly, a single-namespace install looks like this:

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: percona-pg-operator-group
  namespace: postgres-prod
spec:
  targetNamespaces:
    - postgres-prod

And an all-namespaces install:

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: percona-pg-operator-group
  namespace: openshift-operators
spec: {}

The empty spec: {} (or an OperatorGroup with no targetNamespaces) means “watch all namespaces” by OLM convention. The 3.0.0 operator now honors that.

 

Note: After you upgrade an existing 2.x install to 3.0.0, the operator may begin reconciling PerconaPGCluster resources in namespaces it had previously ignored due to the prior scoping bug. Audit existing CRs across your cluster before upgrading, especially if you have stale test clusters in unintended namespaces. The release notes call this out explicitly.
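
The audit can be sketched as a small filter over the list of PerconaPGCluster resources. This is an illustrative helper, not part of the operator: the function name unexpected_namespaces and the namespace postgres-prod used in the usage comment are assumptions, and the input format is simply "namespace name" pairs, one per line.

```shell
# unexpected_namespaces NS...: read "namespace name" pairs on stdin and print
# the pairs whose namespace is NOT among the namespaces given as arguments.
# (Hypothetical helper name, for illustration only.)
unexpected_namespaces() {
  allowed=" $* "
  while read -r ns name; do
    [ -n "$ns" ] || continue
    case "$allowed" in
      *" $ns "*) ;;                          # expected namespace, skip it
      *) printf '%s %s\n' "$ns" "$name" ;;   # cluster in a namespace you did not list
    esac
  done
}

# Usage against a live cluster (not run here):
#   kubectl get perconapgcluster --all-namespaces --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name \
#     | unexpected_namespaces postgres-prod
```

Anything the filter prints is a cluster the 3.0.0 operator may start reconciling after the upgrade, so it is worth deciding whether to keep or delete it first.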

Note for community vs certified bundle users: Community OLM bundles did not support cluster-wide (all-namespaces) mode in earlier versions; 3.0.0 adds it. Certified bundles already supported cluster-wide mode, but through a separate stable-cw channel. With 3.0.0 the channels are unified, so users upgrading from a certified stable-cw install need to switch their subscription channel to stable to receive the upgrade.

For the full install workflow on OpenShift, see the OpenShift installation documentation.


Major PostgreSQL version upgrades now use the official Percona Distribution image


Major-version upgrades (for example, PostgreSQL 17 to 18) require running pg_upgrade, which needs binaries for both the source and target versions in the same environment. The operator has supported major-version upgrades since 2.x, but it shipped its own dedicated upgrade image to do so. That worked, but it meant the upgrade path depended on an operator-specific image, built separately from the Percona Distribution for PostgreSQL images that run in your clusters.

 

Switching to the official Percona Distribution image

In 3.0.0, the operator switches to using the official Percona Distribution for PostgreSQL image for major-version upgrades: percona/percona-distribution-postgresql-upgrade (current tag: 18.4-17.10-16.14-15.18-14.23-1, which encodes the bundled major versions). The benefit is alignment: the binaries that run pg_upgrade are the same binaries that ship in the corresponding percona-distribution-postgresql image you already run in production, built from the same source, signed the same way, and patched on the same schedule. The operator orchestrates the upgrade through the PerconaPGUpgrade Custom Resource that names the source and target versions, the upgrade image, and the target component images (PostgreSQL, pgBouncer, pgBackRest).

 

Running an upgrade through the PerconaPGUpgrade CR

A PostgreSQL 17 to 18 upgrade looks like this:

apiVersion: pgv2.percona.com/v2
kind: PerconaPGUpgrade
metadata:
  name: cluster1-17-to-18
spec:
  postgresClusterName: cluster1
  image: docker.io/percona/percona-distribution-postgresql-upgrade:18.4-17.10-16.14-15.18-14.23-1
  fromPostgresVersion: 17
  toPostgresVersion: 18
  toPostgresImage: docker.io/percona/percona-distribution-postgresql:18.4-1
  toPgBouncerImage: docker.io/percona/percona-pgbouncer:1.25.2-1
  toPgBackRestImage: docker.io/percona/percona-pgbackrest:2.58.0-2

Apply it with kubectl apply -f upgrade.yaml -n <namespace>. The operator reconciles the upgrade as a controlled, observable process: it brings the cluster down for the upgrade window, runs pg_upgrade from the bundled image, brings the cluster back up on the target version, and updates pgBouncer and pgBackRest images in the same step.
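
Before applying the CR, it can help to confirm that the upgrade image tag actually bundles both the source and the target major version. The sketch below assumes the tag is a dash-separated list of MAJOR.MINOR pairs followed by a build number, as in the tag shown above. The helper names bundled_majors and tag_supports_upgrade are hypothetical.

```shell
# bundled_majors TAG: print the major versions encoded in an upgrade image tag,
# one per line. Assumes TAG looks like 18.4-17.10-16.14-15.18-14.23-1, i.e.
# MAJOR.MINOR pairs separated by dashes, followed by a build number.
bundled_majors() {
  printf '%s\n' "$1" | awk -F- '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[0-9]+\.[0-9]+$/) { split($i, part, "."); print part[1] }
  }'
}

# tag_supports_upgrade TAG FROM TO: succeed only if both majors are bundled
# in the tag and FROM is lower than TO.
tag_supports_upgrade() {
  majors=$(bundled_majors "$1")
  printf '%s\n' "$majors" | grep -qx "$2" &&
    printf '%s\n' "$majors" | grep -qx "$3" &&
    [ "$2" -lt "$3" ]
}
```

For the example CR, tag_supports_upgrade 18.4-17.10-16.14-15.18-14.23-1 17 18 succeeds, while a source version the tag does not list makes the check fail before anything is applied to the cluster.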

Operationally, this matters for teams running on PostgreSQL’s annual major-version cadence. A new major release typically arrives each autumn; staying on the latest version means executing one major upgrade per cluster per year. Pulling the upgrade image from the same percona-distribution-postgresql registry path as the runtime image means image-signature verification, mirror-to-private-registry rules, and CVE-scanning policies you already have in place apply to the upgrade flow without any per-image exception.

Note: The pgaudit extension is not upgraded automatically. After the operator completes the major version upgrade, run DROP EXTENSION pgaudit; and then CREATE EXTENSION pgaudit; manually in each database that uses it. The release notes call this out as a required step (K8SPG-1022). It is also worth scanning for collation-dependent indexes after the upgrade and refreshing collation metadata with ALTER DATABASE <name> REFRESH COLLATION VERSION; per the upstream PostgreSQL 18 release notes.
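
The drop-and-recreate step is easy to script once you know which databases use pgaudit. The following is a minimal sketch rather than the documented procedure: the function name recreate_pgaudit and the PSQL variable are assumptions, and in a real cluster you would likely run psql inside the database pod instead of locally.

```shell
# PSQL is the command used to reach the database. It defaults to plain psql;
# wrapping it in "kubectl exec" is left to the reader (assumption, not
# something the release notes prescribe).
PSQL=${PSQL:-psql}

# recreate_pgaudit DB...: drop and recreate pgaudit in each named database,
# stopping at the first database where a statement fails.
# (Hypothetical helper name, for illustration only.)
recreate_pgaudit() {
  for db in "$@"; do
    $PSQL -d "$db" -v ON_ERROR_STOP=1 \
      -c 'DROP EXTENSION pgaudit;' \
      -c 'CREATE EXTENSION pgaudit;' || return 1
  done
}
```

Passing the database names explicitly keeps the helper from touching databases that never had pgaudit installed.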

Full procedure, prerequisites, and rollback notes are in the major version upgrade documentation.

Other Improvements

Operational polish landed alongside the headline changes:

  • Go 1.26 update (K8SPG-1019): the operator binary is now built with Go 1.26, picking up performance optimizations, tooling improvements, and the security fixes that landed in the Go runtime since the previous release.
  • pgaudit upgrade documentation (K8SPG-1022): the major-version upgrade docs now include an explicit pgaudit drop-and-recreate procedure, surfacing the gotcha that previously caught users mid-upgrade.

The release also defaults the cluster-upgrade documentation to PostgreSQL 18 across all examples and tutorials.

 

Supported software and platforms

The Percona Operator for PostgreSQL 3.0.0 is developed and tested on:

  • PostgreSQL: 14.23-1, 15.18-1, 16.14-1, 17.10-1, 18.4-1 
  • pgBackRest: 2.58.0-2
  • pgBouncer: 1.25.2-1
  • Patroni: 4.1.3
  • PostGIS: 3.5.6
  • PMM Client: 2.44.1-1 and 3.7.1

 

Supported Kubernetes platforms:

  • Google Kubernetes Engine (GKE) 1.33 to 1.35
  • Amazon Elastic Kubernetes Service (EKS) 1.33 to 1.35
  • OpenShift 4.18 to 4.21
  • Azure Kubernetes Service (AKS) 1.33 to 1.35
  • Minikube 1.38.1 (Kubernetes v1.35.1) for local development

 

Deprecation: 2.7.0 support dropped

Support for Custom Resource Definitions from operator version 2.7.0 has been removed. If you are still on 2.7.0, upgrade to 2.8.x or 2.9.x first, then upgrade to 3.0.0. The CRD migration described above only handles 2.8.x and 2.9.x to 3.0.0 transitions cleanly.
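
The rule above can be encoded as a small pre-flight check in an upgrade script. This is a sketch with a hypothetical function name, upgrade_path. Versions this post does not mention are reported as unknown rather than guessed.

```shell
# upgrade_path VERSION: print how an operator at VERSION reaches 3.0.0,
# following the rule stated in this section.
# (Hypothetical helper name, for illustration only.)
upgrade_path() {
  case "$1" in
    2.8.*|2.9.*) echo "direct" ;;            # handled by the CRD migration
    2.7.0)       echo "via-2.8-or-2.9" ;;    # upgrade to 2.8.x or 2.9.x first
    *)           echo "unknown" ;;           # not covered here, check the docs
  esac
}
```

A pipeline could refuse to proceed unless upgrade_path prints direct for the currently installed operator version.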

 

Conclusion

3.0.0 is the release where the Percona Operator for PostgreSQL becomes a fully independent project. The CRD rename removes the last upstream coupling that mattered operationally. The OLM scoping fix removes a long-standing OpenShift quirk. The official major-version upgrade image removes one of the more painful operational gaps in earlier versions.

Beyond the technical work, 3.0.0 is also where Percona’s commitment to community-driven development moves from intent to mechanism. The public roadmap is open. The issue tracker is open. The images are freely redistributable. Future releases will be shaped by what the community asks for, files, and contributes back. If there is a feature you want to see in 3.1.0 or 3.2.0, open an issue or a PR. That is where the work happens now.

 


May 24, 2026

Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Standby Cluster Method

A Crunchy to Percona PostgreSQL migration is more straightforward than most cross-operator moves on Kubernetes, because the Percona PostgreSQL Operator is a hard fork of the Crunchy Data PostgreSQL Operator. Same Patroni HA, same pgBackRest backups, same overall CRD shape. This post walks through the safest of the three migration paths: a standby cluster method with near-zero downtime.

This is part 2 of a 3-part series on running PostgreSQL on Kubernetes with a fully open-source operator. Part 1 walked through the changing open-source landscape and announced the hard fork of the Crunchy Data PostgreSQL Operator into the fully independent Percona PostgreSQL Operator v3.0.0.

This post is the first practical playbook of the series. It covers the standby cluster method, the safest migration path when the downtime budget is tight. Part 3 will cover two simpler paths: backup-and-restore and persistent-volume reuse.

If you are landing here without context on why you might want to migrate at all, start with part 1. The rest of this post assumes you have already decided to move and want a tested playbook.

 

Migration approach in one paragraph

The Percona PostgreSQL Kubernetes Operator is a hard fork of the Crunchy Data PostgreSQL Kubernetes Operator, which simplifies the migration paths considerably: the same underlying tools (Patroni, pgBackRest, PgBouncer) and the same overall design are used in both operators. All three migration paths in this series are reversible: because Percona’s operator is fully open source and remains compatible with the same backup format, the move back to Crunchy is also possible if your team decides to walk it back.

 

A note on the storage layer

All examples in this guide use an in-cluster SeaweedFS instance as the pgBackRest S3 repository. SeaweedFS is Apache-2.0 licensed, actively maintained, and a clean drop-in replacement for the role MinIO used to fill in this stack. Any other S3-compatible storage works just as well: AWS S3, Google Cloud Storage (via HMAC keys), Ceph RadosGW, Cloudflare R2, and so on. For non-SeaweedFS endpoints, remove repo1-s3-uri-style: path and repo1-s3-verify-tls: "n" from the pgBackRest configuration and replace the endpoint with your provider’s URL.

 

What this series does NOT cover

To keep scope honest:

  • Application-side connection-string changes beyond updating to the new pgBouncer service. If your app uses connection-pool tuning, custom auth, or a service mesh, that work stays with you.
  • Schema-changing upgrades, major PostgreSQL version upgrades, or extension migrations. The PostgreSQL major version must match between the source and the target.
  • Crunchy enterprise-only features like TDE, Crunchy Postgres for Kubernetes-specific operators, or pgBackRest custom encryption. If your environment uses these, contact the Percona team for a tailored plan.
  • Running both operators in the same namespace with a Percona operator release that predates the hard fork. Use Percona PostgreSQL Operator v3.0.0 or higher.

 

Tested with

Component                                      Version
Crunchy Data PostgreSQL Kubernetes Operator    v5.8.x (tested on v5.8.7)
Percona PostgreSQL Kubernetes Operator         v3.x.x (tested on v3.0.0)
PostgreSQL                                     18 (must match between source and target)
Object storage                                 SeaweedFS (Apache-2.0), or any other S3-compatible service accessible from all cluster pods
Tools                                          kubectl, helm (v3), yq

Different versions may differ slightly in CR fields or behavior. Always consult the official documentation for the operator and PostgreSQL version you are running.

 

Migration using a standby cluster

This is the safest method when the downtime budget is tight. The Percona cluster is brought up as a standby of the Crunchy primary, catches up via pgBackRest plus streaming replication, and is promoted at cutover. The only downtime is the cutover step itself.

You can wire the standby in two ways, and combining both gives you maximum safety:

  • pgBackRest repo-based standby seeds the standby from the latest base backup and replays archived WAL
  • Streaming replication keeps the standby in sync with the live primary

 

 

Before you begin

Set the target namespace once. Every command in this guide reads from this variable, so you can change it in a single place:

export MIGRATION_NS=postgres-migration
kubectl create namespace $MIGRATION_NS

 

Deploy SeaweedFS

Skip this step if you already have an S3-compatible repository (AWS S3, GCS, Ceph). Update the endpoint and credentials in the YAML examples accordingly.

SeaweedFS provides an S3-compatible object store that runs inside Kubernetes. Both operators will use it as the shared pgBackRest WAL archive.

TLS is required. pgBackRest always connects to S3 endpoints over HTTPS, even when repo1-s3-verify-tls: "n" is set (that flag skips certificate verification; it does not fall back to HTTP). The steps below generate a self-signed certificate and pass it to SeaweedFS via Helm values.

# Generate a self-signed TLS certificate for SeaweedFS S3
openssl req -x509 -nodes -days 3650 -newkey rsa:2048 \
  -keyout /tmp/seaweedfs.key \
  -out /tmp/seaweedfs.crt \
  -subj "/CN=seaweedfs-all-in-one"

kubectl -n $MIGRATION_NS create secret tls seaweedfs-s3-tls \
  --cert=/tmp/seaweedfs.crt \
  --key=/tmp/seaweedfs.key

helm repo add seaweedfs https://seaweedfs.github.io/seaweedfs/helm
helm repo update

helm install seaweedfs seaweedfs/seaweedfs \
  --namespace $MIGRATION_NS \
  --version 4.23.0 \
  -f https://raw.githubusercontent.com/percona/percona-postgresql-operator/refs/heads/migration-from-crunchy-guide/e2e-tests/tests/migration-from-crunchy-standby/examples/seaweedfs-values.yaml \
  --wait

The Helm values file in the repo creates the pg-migration bucket on first start, so no separate aws s3 mb step is needed.

 

Step 0. Create pgBackRest secrets

Both operators need credentials to read and write the shared SeaweedFS bucket. Apply the secrets from examples/01-pgbackrest-secret.yaml after filling in your access key and secret key:

# The example file ships with the demo SeaweedFS credentials. To use your own,
# download the file, edit it, and apply your local copy instead of the URL.

kubectl apply -n $MIGRATION_NS \
  -f https://raw.githubusercontent.com/percona/percona-postgresql-operator/refs/heads/migration-from-crunchy-guide/e2e-tests/tests/migration-from-crunchy-standby/examples/01-pgbackrest-secret.yaml

Both secrets contain the same SeaweedFS credentials (pgmigration / pgmigration123). For AWS S3, replace those with your IAM access key ID and secret access key.

 

Step 1. Start with your existing Crunchy Data cluster

If you already have a running Crunchy cluster, ensure its pgBackRest repo1 points at the shared bucket and path. The repo1-path value must be identical in both cluster specs. Mismatched paths will prevent the Percona standby from finding the WAL archive.
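
Since a mismatched repo1-path is an easy mistake to make, a quick comparison of the two manifests can catch it before any cluster is created. The sketch below assumes both cluster specs are saved locally as YAML files and that repo1-path is written as an unquoted value. The helper names repo1_path and same_repo1_path are hypothetical.

```shell
# repo1_path FILE: print the first repo1-path value found in a manifest,
# ignoring any trailing comment. Assumes an unquoted value.
repo1_path() {
  sed -n 's/^[[:space:]]*repo1-path:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$1" | head -n 1
}

# same_repo1_path A B: succeed only if both manifests set the same,
# non-empty repo1-path.
same_repo1_path() {
  a=$(repo1_path "$1")
  b=$(repo1_path "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

Running same_repo1_path against the Crunchy and Percona manifests before applying them gives a non-zero exit status when the paths differ or when either file does not set one.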

The Helm install below is shown only as a quick way to reproduce this blog post’s example. The migration steps in the rest of this post do not depend on how you deployed the source operator.

Optional: deploy a Crunchy operator to test the migration end to end:

helm install pgo \
  oci://registry.developers.crunchydata.com/crunchydata/pgo \
  -n $MIGRATION_NS \
  --version 5.8.7 \
  --set singleNamespace=true \
  --wait


Apply examples/02-crunchy-source-cluster.yaml (or adapt your existing cluster’s pgBackRest config):

kubectl apply -n $MIGRATION_NS \
  -f https://raw.githubusercontent.com/percona/percona-postgresql-operator/refs/heads/migration-from-crunchy-guide/e2e-tests/tests/migration-from-crunchy-standby/examples/02-crunchy-source-cluster.yaml


The key pgBackRest settings in the example:

global:
  repo1-path: /crunchy-to-percona/repo1   # shared path, must match Percona side
  repo1-s3-uri-style: path                # required for path-style S3 endpoints (SeaweedFS, MinIO)
  repo1-s3-verify-tls: "n"                # skip TLS verification for self-signed cert; remove for AWS S3
repos:
  - name: repo1
    s3:
      bucket: pg-migration
      endpoint: seaweedfs-all-in-one.postgres-migration.svc.cluster.local:8443
      region: us-east-1


Wait for the cluster to be ready:

kubectl wait pod \
  --selector postgres-operator.crunchydata.com/cluster=crunchy-source,postgres-operator.crunchydata.com/data=postgres \
  --namespace $MIGRATION_NS \
  --for=condition=Ready \
  --timeout=300s

 


Step 2. Trigger a full backup on the Crunchy cluster

Wait for the pgBackRest stanza to be created:

kubectl wait postgrescluster/crunchy-source \
  -n $MIGRATION_NS \
  --for=jsonpath='{.status.pgbackrest.repos[0].stanzaCreated}'=true \
  --timeout=300s

Take a full backup before creating the Percona standby. This gives the standby a recent base to restore from, so it only needs to replay a small amount of WAL to catch up. This matches the realistic production migration pattern.

kubectl annotate postgrescluster crunchy-source \
  --namespace $MIGRATION_NS \
  postgres-operator.crunchydata.com/pgbackrest-backup="$(date +%s)"


Wait for the backup job to complete:

kubectl wait job \
  -l postgres-operator.crunchydata.com/pgbackrest-backup=manual,postgres-operator.crunchydata.com/cluster=crunchy-source \
  -n $MIGRATION_NS \
  --for=condition=Complete \
  --timeout=600s

 


Step 3. Copy TLS certificates (cross-namespace only)

If the Percona cluster is in a different namespace from the Crunchy cluster, copy the Crunchy TLS secrets to the Percona namespace. These allow mutual TLS authentication during streaming replication:

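# CRUNCHY_NS must name the namespace that holds the Crunchy cluster.
: "${CRUNCHY_NS:?set CRUNCHY_NS to the Crunchy cluster namespace}"

for secret in crunchy-source-cluster-cert crunchy-source-replication-cert; do
  # Keep only the fields needed to recreate the secret in the target namespace.
  kubectl get secret "${secret}" -n "${CRUNCHY_NS}" -o json | \
    yq '{"apiVersion": .apiVersion, "kind": .kind, "data": .data,
         "metadata": {"name": .metadata.name}, "type": .type}' -o yaml | \
    kubectl -n $MIGRATION_NS apply -f -
done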

If both clusters are in the same namespace, skip this step. The secrets are already accessible.

 

Step 4. Deploy the Percona PG Operator

The Crunchy operator can keep running while you install the Percona operator, in the same namespace or in a different one.

kubectl apply -n $MIGRATION_NS --server-side \
  -f https://raw.githubusercontent.com/percona/percona-postgresql-operator/refs/tags/v3.0.0/deploy/bundle.yaml

Wait until the operator deployment is ready:

kubectl wait deployment percona-postgresql-operator \
  -n $MIGRATION_NS \
  --for=condition=Available \
  --timeout=120s

 

Step 5. Create the Percona cluster in standby mode

Note: The kubectl apply below pulls the CR manifest from the migration-from-crunchy-guide branch of the operator repo, which is the source for this guide’s examples. For production deployments, follow the official Percona Operator for PostgreSQL installation documentation and pin to a released version tag rather than a feature branch.

Apply examples/03-percona-standby-cluster.yaml:

kubectl apply -n $MIGRATION_NS \
  -f https://raw.githubusercontent.com/percona/percona-postgresql-operator/refs/heads/migration-from-crunchy-guide/e2e-tests/tests/migration-from-crunchy-standby/examples/03-percona-standby-cluster.yaml

The key settings that wire the Percona cluster to the Crunchy source:

standby:
  enabled: true
  repoName: repo1                             # restore initial base backup from this repo
  host: crunchy-source-ha.postgres-migration.svc.cluster.local
  port: 5432

secrets:
  customTLSSecret:
    name: crunchy-source-cluster-cert         # Crunchy CA for mutual TLS
  customReplicationTLSSecret:
    name: crunchy-source-replication-cert     # cert for _crunchyreplication user

The Percona operator will:

  1. Restore the base backup from the SeaweedFS bucket.
  2. Replay WAL from SeaweedFS until it catches up with the live Crunchy cluster.
  3. Switch to streaming replication from crunchy-source-ha.

Wait for the cluster to reach the ready state:

kubectl wait perconapgcluster/percona-standby \
  -n $MIGRATION_NS \
  --for=jsonpath='{.status.state}'=ready \
  --timeout=600s

Verify that data is replicating to the standby:

STANDBY_POD=$(kubectl get pod -n $MIGRATION_NS \
  -l postgres-operator.crunchydata.com/cluster=percona-standby,postgres-operator.crunchydata.com/data=postgres \
  -o jsonpath='{.items[0].metadata.name}')

kubectl -n $MIGRATION_NS exec "${STANDBY_POD}" -c database -- \
  psql -t -c "SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn();"

Expected output: t (in recovery) and a non-null LSN.

 

Step 6. Verify replication lag before cutover

Query the Crunchy primary to confirm the Percona standby has caught up:

CRUNCHY_PRIMARY=$(kubectl get pod \
  -l postgres-operator.crunchydata.com/cluster=crunchy-source,postgres-operator.crunchydata.com/role=master \
  -n $MIGRATION_NS \
  -o jsonpath='{.items[0].metadata.name}')

kubectl -n $MIGRATION_NS exec "${CRUNCHY_PRIMARY}" -c database -- \
  psql -c "
    SELECT
        client_addr,
        state,
        pg_wal_lsn_diff(sent_lsn, replay_lsn) AS byte_lag,
        write_lag,
        flush_lag,
        replay_lag
    FROM pg_stat_replication;
  "

Proceed to the next step only when write_lag and replay_lag are NULL or under a few seconds.
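
If you want to turn that rule into a scripted gate, one option is to feed the replay lag in seconds into a small check. This is a sketch under assumptions: the helper name lag_within is hypothetical, the 5-second threshold in the usage comment is an arbitrary reading of "a few seconds", and the query maps a NULL lag to 0.

```shell
# lag_within MAX_SECONDS: read one replay-lag value in seconds per line on
# stdin. Succeed only if at least one replication connection is reported and
# none of them exceeds MAX_SECONDS.
# (Hypothetical helper name, for illustration only.)
lag_within() {
  awk -v max="$1" '
    NF  { rows++; if ($1 + 0 > max + 0) late = 1 }
    END { if (rows == 0 || late) exit 1 }
  '
}

# Usage against the Crunchy primary (not run here):
#   kubectl -n "$MIGRATION_NS" exec "$CRUNCHY_PRIMARY" -c database -- \
#     psql -At -c "SELECT COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) FROM pg_stat_replication;" \
#     | lag_within 5
```

Treating an empty result as a failure is deliberate: no rows in pg_stat_replication means no standby is streaming, which is not a safe state for cutover.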

 

Step 7. Cut over the Crunchy cluster

This is the only step that causes downtime. Stop accepting writes on the application side, then patch the Crunchy cluster into standby mode. Patroni steps down and archives the final WAL.

kubectl patch postgrescluster crunchy-source \
  -n $MIGRATION_NS \
  --type=merge \
  -p '{"spec": {"standby": {"enabled": true, "repoName": "repo1"}}}'

Verify demotion (poll until pg_is_in_recovery() returns t):

kubectl -n $MIGRATION_NS exec "${CRUNCHY_PRIMARY}" -c database -- \
  psql -t -c "SELECT pg_is_in_recovery();"

 

Step 8. (Optional) Shut down the Crunchy cluster

Once the Percona standby has replayed all WAL, shut down the Crunchy cluster to prevent split-brain:

kubectl patch postgrescluster crunchy-source \
  -n $MIGRATION_NS \
  --type=merge \
  -p '{"spec": {"shutdown": true}}'

kubectl wait pod \
  -l postgres-operator.crunchydata.com/cluster=crunchy-source,postgres-operator.crunchydata.com/data=postgres \
  -n $MIGRATION_NS \
  --for=delete \
  --timeout=120s || true

 

Step 9. Promote the Percona cluster

Confirm that the Percona standby has finished replaying all WAL (the LSN stops advancing):

kubectl -n $MIGRATION_NS exec "${STANDBY_POD}" -c database -- \
  psql -t -c "SELECT pg_last_wal_replay_lsn();"

Run this a few times. When the LSN is stable, replay is complete.
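
The repeated check can also be automated with a small polling loop. The sketch below is one possible approach, not part of either operator: the function name wait_for_stable_lsn and the POLL_SECONDS and MAX_TRIES variables are assumptions. It treats two identical consecutive readings as stable.

```shell
# wait_for_stable_lsn CMD...: run CMD repeatedly until two consecutive runs
# print the same non-empty value, then print that value. Gives up after
# MAX_TRIES attempts (default 30), sleeping POLL_SECONDS (default 2) between runs.
# (Hypothetical helper name, for illustration only.)
wait_for_stable_lsn() {
  prev=""
  tries=0
  while [ "$tries" -lt "${MAX_TRIES:-30}" ]; do
    cur=$("$@" | tr -d '[:space:]')      # psql -t pads its output with spaces
    if [ -n "$cur" ] && [ "$cur" = "$prev" ]; then
      printf '%s\n' "$cur"
      return 0
    fi
    prev=$cur
    tries=$((tries + 1))
    sleep "${POLL_SECONDS:-2}"
  done
  return 1
}

# Usage (not run here):
#   wait_for_stable_lsn kubectl -n "$MIGRATION_NS" exec "$STANDBY_POD" -c database -- \
#     psql -t -c "SELECT pg_last_wal_replay_lsn();"
```

Once the value is stable, the promotion itself is the patch that follows.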

kubectl patch perconapgcluster percona-standby \
  -n $MIGRATION_NS \
  --type=merge \
  -p '{"spec": {"standby": {"enabled": false}}}'

Wait for the cluster to become ready and confirm it is writable:

kubectl wait perconapgcluster/percona-standby \
  -n $MIGRATION_NS \
  --for=jsonpath='{.status.state}'=ready \
  --timeout=480s

PERCONA_PRIMARY=$(kubectl get pod -n $MIGRATION_NS \
  -l postgres-operator.crunchydata.com/cluster=percona-standby,postgres-operator.crunchydata.com/role=primary \
  -o jsonpath='{.items[0].metadata.name}')

kubectl -n $MIGRATION_NS exec "${PERCONA_PRIMARY}" -c database -- \
  psql -t -c "SELECT pg_is_in_recovery();"

Expected output: f (the cluster is now the primary and accepts writes).

 

Step 10. Verify stanza creation

kubectl wait perconapgcluster/percona-standby \
  -n $MIGRATION_NS \
  --for=jsonpath='{.status.pgbackrest.repos[0].stanzaCreated}'=true \
  --timeout=300s

 

Step 11. Take a post-migration backup

Apply examples/04-post-migration-backup.yaml:

kubectl apply -n $MIGRATION_NS \
  -f https://raw.githubusercontent.com/percona/percona-postgresql-operator/refs/heads/migration-from-crunchy-guide/e2e-tests/tests/migration-from-crunchy-standby/examples/04-post-migration-backup.yaml

kubectl wait perconapgbackup/post-migration-backup \
  -n $MIGRATION_NS \
  --for=jsonpath='{.status.state}'=Succeeded \
  --timeout=600s

This creates a clean recovery point on the new timeline. All future PITR restores will use this backup as their starting point, independent of the old Crunchy WAL archive.

 

Reconnecting your application

Update your application’s connection string to point at the Percona cluster’s pgBouncer service:

kubectl get service -n $MIGRATION_NS \
  -l postgres-operator.crunchydata.com/cluster=percona-standby,postgres-operator.crunchydata.com/role=pgbouncer

This migration path works almost entirely out of the box. For users coming from the Crunchy Data PostgreSQL Operator, this method feels familiar because it leverages the same standby/replica mechanisms used for HA and disaster recovery. The key difference is that you can now use this familiar mechanism to migrate safely to the Percona PostgreSQL Operator, a fully open-source alternative running on a fully open-source storage layer.

 

Rollback

The standby method is the most rollback-friendly of the three. Until you take the post-migration backup, the Crunchy cluster still holds the original timeline. To roll back:

  1. Stop writes on the Percona side and patch the Percona cluster back into standby mode (spec.standby.enabled: true).
  2. Patch the Crunchy cluster out of standby mode and let Patroni promote it.
  3. Verify with pg_is_in_recovery() on both sides.
  4. Switch the application connection string back to the Crunchy pgBouncer service.
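
The first two rollback steps mirror the patches used during cutover, so they can be sketched as follows. This is an illustration under assumptions rather than a tested procedure: the function name rollback_to_crunchy and the KUBECTL override are hypothetical, the cluster names are the ones used in this guide, and setting shutdown to false only matters if you ran the optional Step 8. Stop application writes on the Percona side before running it.

```shell
# KUBECTL can be overridden (for example with a dry-run wrapper).
KUBECTL=${KUBECTL:-kubectl}
MIGRATION_NS=${MIGRATION_NS:-postgres-migration}

# rollback_to_crunchy: put the Percona cluster back into standby mode, then
# take the Crunchy cluster out of standby so Patroni can promote it.
# (Hypothetical helper name, for illustration only.)
rollback_to_crunchy() {
  # 1. Percona cluster back to standby.
  $KUBECTL patch perconapgcluster percona-standby -n "$MIGRATION_NS" \
    --type=merge -p '{"spec": {"standby": {"enabled": true}}}' || return 1
  # 2. Crunchy cluster out of standby (and started again if it was shut down).
  $KUBECTL patch postgrescluster crunchy-source -n "$MIGRATION_NS" \
    --type=merge -p '{"spec": {"shutdown": false, "standby": {"enabled": false}}}' || return 1
}
```

Steps 3 and 4 stay manual here: verify pg_is_in_recovery() on both sides, then switch the application connection string.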

After Step 11 (post-migration backup), the timelines have diverged. From that point, the rollback story is the same as a fresh restore, and you should treat the Crunchy cluster as a historical reference, not a live target.

 

Troubleshooting

Percona standby not connecting to the Crunchy primary. Verify the crunchy-source-ha service resolves from within the Percona pod:

kubectl -n $MIGRATION_NS exec "${STANDBY_POD}" -c database -- \
  bash -c "getent hosts crunchy-source-ha.${MIGRATION_NS}.svc.cluster.local"

Replication authentication errors. The Percona standby authenticates as the _crunchyreplication PostgreSQL user using the certificate in crunchy-source-replication-cert. Verify the secret exists and matches what the Crunchy operator generated:

kubectl get secret crunchy-source-replication-cert -n $MIGRATION_NS

pgBackRest restore fails. Confirm both secrets contain identical credentials and that repo1-path is the same in both cluster specs (/crunchy-to-percona/repo1 in this guide). Mismatched paths cause an archive.info missing error. Verify the bucket is reachable:

kubectl run -i --rm s3-check \
  --image=perconalab/awscli \
  --restart=Never \
  -n $MIGRATION_NS \
  -- bash -c "
    AWS_ACCESS_KEY_ID=pgmigration \
    AWS_SECRET_ACCESS_KEY=pgmigration123 \
    AWS_DEFAULT_REGION=us-east-1 \
    aws --endpoint-url https://seaweedfs-all-in-one.${MIGRATION_NS}.svc.cluster.local:8443 \
        --no-verify-ssl \
        s3 ls s3://pg-migration
  "

Timeline history file (00000002.history) missing after promotion. This is a known issue with Crunchy PGO’s async archive mode. After promotion, push the history file synchronously:

kubectl -n $MIGRATION_NS exec "${PERCONA_PRIMARY}" -c database -- \
  bash -c "
    pgbackrest --stanza=db --no-archive-async \
      archive-push \"\${PGDATA}/pg_wal/00000002.history\" || true
  "

 

What’s next

This was the safest migration path. Part 3 will cover two simpler options:

  • Backup and restore. The simplest path. You take a Crunchy pgBackRest backup and the Percona cluster bootstraps from it. Cutover is the time between the final backup and pointing the application at the new cluster.
  • Persistent volume reuse. For when you want to skip the data copy entirely. The Percona cluster takes over the existing PGDATA volume, no restore step required.

Pick the method that fits your downtime budget, data size, and storage layout.

This post covers basic deployment patterns and simplified configuration examples. If your environment is more complex, uses custom images, includes Crunchy enterprise features like TDE, or otherwise needs tailored migration steps, contact the Percona team and we will help you plan and execute the move.

 


The post Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Standby Cluster Method appeared first on Percona.

Apr
30
2026
--

Run an ALTER TABLE for a huge table in Aurora

Recently, we received an alert for one of our Managed Services customers indicating that the auto_increment value for the table was 80% of its maximum capacity. The column was INT UNSIGNED, which has a limit of 4,294,967,295.

At 80%, we have enough time to change it to BIGINT... right? Let’s see.

So we used pt-online-schema-change to perform the alter.

It started running at a good pace but slowed over time.

 

Why?

Well, let’s look at the definition of the table:

mysql> show create table myschema.mytable\G
*************************** 1. row ***************************
       Table: mytable
Create Table: CREATE TABLE `mytable` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `long_column` varchar(1000) NOT NULL,
  `state` tinyint unsigned NOT NULL,
  `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `short_column` varchar(30) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_long_column` (`long_column`,`state`),
  KEY `idx_short_column` (`short_column`,`state`),
  KEY `idx_short_col2` (`short_column`)
) ENGINE=InnoDB AUTO_INCREMENT=4009973818 DEFAULT CHARSET=utf8mb3

NOTE1: The index on long_column is for a varchar column with a length of 1000; it may not be required, and an index prefix may be more helpful here.

NOTE2: The index idx_short_col2 is redundant, as short_column is already the leftmost column of the index idx_short_column.

Those changes require testing and are out of scope for this emergency, but they are worth mentioning.

 

Table size:

+---------------+------------+------------+---------+----------+---------+----------+--------+
| TABLE_SCHEMA  | TABLE_NAME | TABLE_ROWS | DATA_GB | INDEX_GB | FREE_GB | TOTAL_GB | ENGINE |
+---------------+------------+------------+---------+----------+---------+----------+--------+
| myschema      | mytable    | 3906921584 |    1118 |     1790 |       0 |     2907 | InnoDB |
+---------------+------------+------------+---------+----------+---------+----------+--------+

Note that the indexes are considerably larger than the data itself.

mysql> SELECT database_name, table_name, index_name, ROUND(stat_value * @@innodb_page_size / 1024 / 1024, 2) AS size_in_mb FROM mysql.innodb_index_stats WHERE stat_name = 'size' AND index_name != 'PRIMARY' and database_name='myschema' and table_name='mytable' ORDER BY size_in_mb DESC;
+---------------+------------+-------------------+------------+
| database_name | table_name | index_name        | size_in_mb |
+---------------+------------+-------------------+------------+
| myschema      | mytable    | idx_long_column   | 1583538.95 |
| myschema      | mytable    | idx_short_column  |  126432.98 |
| myschema      | mytable    | idx_short_col2    |  122699.95 |
+---------------+------------+-------------------+------------+
3 rows in set (0.01 sec)

While the pt-online-schema-change runs, it copies the data to a new table. As the data is being copied, the secondary indexes must be maintained.

Note the huge index on the varchar(1000) column, which is ~1.5 TB in size. Maintaining such an index becomes increasingly expensive as the data size increases.

The pt-online-schema-change had been running for ~8 days, and its latest estimate was 53 more days. We could not afford that, since the maximum value would be exceeded in ~15 days.

Copying `myschema`.`mytable`:  12% 53+16:48:01 remain
Copying `myschema`.`mytable`:  12% 53+16:48:30 remain
Copying `myschema`.`mytable`:  12% 53+16:48:59 remain
Copying `myschema`.`mytable`:  12% 53+16:49:26 remain
Copying `myschema`.`mytable`:  12% 53+16:49:53 remain
Copying `myschema`.`mytable`:  12% 53+16:50:19 remain
Copying `myschema`.`mytable`:  12% 53+16:50:49 remain
Copying `myschema`.`mytable`:  12% 53+16:51:17 remain
Copying `myschema`.`mytable`:  12% 53+16:51:45 remain

 

So what do we do now?

We suggested canceling the pt-online-schema-change and creating an Aurora blue/green deployment.

The plan was to run the direct ALTER on the green cluster and, when ready, switch over.

 

Sounds good, doesn’t it?

 

First, we need to ensure that the new (green) cluster has the replica_type_conversions parameter in its cluster parameter group set to “ALL_NON_LOSSY,ALL_UNSIGNED”, so that it can replicate from an int unsigned column to a bigint unsigned column.
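
As an illustration, the parameter could be set with the AWS CLI along these lines. The function name and the parameter group name in the usage comment are placeholders. The ApplyMethod value is an assumption, chosen because pending-reboot is accepted for both static and dynamic parameters, so the change only takes effect after a reboot.

```shell
# Sketch only: sets replica_type_conversions in a DB cluster parameter group.
# The group name passed in is a placeholder for the green cluster's group.
set_replica_type_conversions() {
  # $1 = DB cluster parameter group name
  aws rds modify-db-cluster-parameter-group \
    --db-cluster-parameter-group-name "$1" \
    --parameters '[{"ParameterName": "replica_type_conversions", "ParameterValue": "ALL_NON_LOSSY,ALL_UNSIGNED", "ApplyMethod": "pending-reboot"}]'
}

# Usage:
#   set_replica_type_conversions green-cluster-params
```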

So we tried that. It started fast, at ~0.036% per minute, which works out to about 2 days. That’s great!

We left the process running over the weekend, but we noticed it started to slow down again. By Monday, it was advancing at ~0.01% every 5 minutes, which gives an ETA of about 34 days.
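
The two estimates can be reproduced with a naive calculation that assumes the observed rate stays constant for the whole copy. The helper name eta_days is hypothetical.

```shell
# Sketch only: naive ETA for a full copy, assuming the rate stays constant.
eta_days() {
  # $1 = percent completed per interval, $2 = interval length in minutes
  awk -v pct="$1" -v mins="$2" 'BEGIN { printf "%.1f\n", (100 / pct) * mins / 1440 }'
}

eta_days 0.036 1   # initial rate on the green cluster: prints 1.9
eta_days 0.01 5    # rate observed on Monday: prints 34.7
```

Because the rate kept dropping as the table grew, even the second figure was optimistic.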

Why? 

Again, with a direct ALTER, MySQL copies the data to a temporary table, and the larger the table grows, the more expensive it becomes to maintain the indexes.

Again, unacceptable.

Note that with the above 2 approaches, we lost ~12 days of precious time, and the deadline for auto_increment exhaustion was approaching.

Then we thought: What if we drop the secondary indexes, do the alter, and then add the indexes back?

In theory, it should be faster, as:

  • Dropping the indexes is a metadata-only operation with ONLINE DDL.
  • Altering the column datatype from INT to BIGINT is not an ONLINE operation, but the fact that it doesn’t have to update secondary indexes during row copying to a new temporary table prevents the slowdown.
  • Adding back the secondary indexes is an ONLINE DDL operation:

 

“Online DDL support for adding secondary indexes means that you can generally speed the overall process of creating and loading a table and associated indexes by creating the table without secondary indexes, then adding secondary indexes after the data is loaded.”

https://dev.mysql.com/doc/refman/8.4/en/innodb-online-ddl-operations.html

So let’s do this:

The deletion of the indexes was really quick, as expected (metadata-only operation):

mysql> ALTER TABLE myschema.mytable DROP INDEX idx_long_column, DROP INDEX idx_short_column, DROP INDEX idx_short_col2;
Query OK, 0 rows affected (49.40 sec)
Records: 0  Duplicates: 0  Warnings: 0

 

Then the change of the datatype:

mysql> ALTER TABLE myschema.mytable CHANGE COLUMN id id bigint unsigned NOT NULL AUTO_INCREMENT;
Query OK, 4058047205 rows affected (13 hours 9 min 10.62 sec)
Records: 4058047205  Duplicates: 0  Warnings: 0

 

Looks very promising!!!

 

The final step, add back the indexes:

mysql> ALTER TABLE myschema.mytable ADD INDEX `idx_long_column` (`long_column`,`state`), ADD INDEX `idx_short_column` (`short_column`,`state`), ADD INDEX `short_col2` (`short_column`);
ERROR 1878 (HY000): Temporary file write failure.

 

Why?

Well, the INPLACE operation uses the temporary directory to write sort files. In Aurora, the temporary space is limited by the instance type.

In a regular MySQL instance, we can point innodb_tmpdir to another location with enough disk space. In Aurora, however, the parameter is not modifiable; being able to change it would have made the whole process easier.

Even with a larger instance type, it’s hard to create the 1.5T index without breaking open the piggy bank.

 

Last resort, add the indexes back with the COPY algorithm:

mysql> ALTER TABLE myschema.mytable ALGORITHM=COPY, ADD INDEX `idx_long_column` (`long_column`,`state`), ADD INDEX `idx_short_column` (`short_column`,`state`), ADD INDEX `idx_short_col2` (`short_column`);
Query OK, 4147498819 rows affected (6 days 1 hour 55 min 57.00 sec)
Records: 4147498819  Duplicates: 0  Warnings: 0

 

Why does it work? Because ALTER TABLE with the COPY algorithm creates its temporary table in the datadir and copies the rows there, so it is not subject to the temporary directory limit mentioned above.

We made it in time, about 4 days before the auto_increment exhaustion, and prevented downtime.

 

In retrospect, we could have used the following approach and avoided the blue/green deployment:

  1. Run pt-online-schema-change on the main table, dropping the indexes and changing the column type to bigint (with --no-swap-tables --no-drop-old-table --no-drop-new-table --no-drop-triggers).
  2. Add the secondary indexes to the _new table using a direct ALTER with the COPY algorithm.
  3. Once the alter finishes, swap the tables and drop the triggers.
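
A minimal sketch of step 1 might look like the following. It reuses the index names and flags mentioned in this post, omits connection options, and was not part of the actual incident. The PT_OSC variable and the function name copy_without_indexes are hypothetical; setting PT_OSC=echo gives a quick dry run of the command line.

```shell
# Sketch only: step 1 of the alternative approach. Builds the _new table with
# the BIGINT column and no secondary indexes, and leaves tables and triggers
# in place for a later manual swap. Connection options are omitted.
PT_OSC="${PT_OSC:-pt-online-schema-change}"

copy_without_indexes() {
  "$PT_OSC" \
    --alter "DROP INDEX idx_long_column, DROP INDEX idx_short_column, DROP INDEX idx_short_col2, MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT" \
    --no-swap-tables \
    --no-drop-old-table \
    --no-drop-new-table \
    --no-drop-triggers \
    --execute \
    D=myschema,t=mytable
}
```

Keep in mind that ALGORITHM=COPY does not permit concurrent writes to the table being altered, so trigger-driven writes to the _new table would wait while step 2 rebuilds the indexes. Test the full sequence on a non-production copy first.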

 

Conclusion:

What initially looked like an easy task with pt-online-schema-change ended up being more complex.

You need to check the table definition, the index sizes, the Aurora limits, and how the different algorithms work before deciding how to proceed. This matters especially in situations like this one, where auto_increment exhaustion is approaching and there is a risk of downtime if the change is not finished in time.

And of course, monitor auto_increment exhaustion for your tables, and use a reasonable threshold that gives you enough time to plan and change the table definition. You can use Percona Monitoring and Management for this, specifically on the MySQL > MySQL Table Details dashboard.
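
For a quick manual check, the headroom can be computed from the AUTO_INCREMENT value shown in the table definition above. This is only a sketch: the helper name auto_inc_usage is hypothetical, and it relies on the shell supporting 64-bit integer arithmetic.

```shell
# Sketch only: reports how much of an AUTO_INCREMENT range is already used.
# Relies on 64-bit shell arithmetic (bash or dash on a 64-bit system).
auto_inc_usage() {
  # $1 = current AUTO_INCREMENT value, $2 = maximum value of the column type
  echo "used_pct=$(( $1 * 100 / $2 )) remaining=$(( $2 - $1 ))"
}

# Values from the SHOW CREATE TABLE output and the INT UNSIGNED limit.
auto_inc_usage 4009973818 4294967295
# prints: used_pct=93 remaining=284993477
```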

The post Run an ALTER TABLE for a huge table in Aurora appeared first on Percona.

Apr
30
2026
--

Run an ALTER TABLE for a huge table in Aurora

Recently, we received an alert for one of our Managed Services customers indicating that the auto_increment value for the table was 80% of its maximum capacity. The column was INT UNSIGNED, which has a limit of 4,294,967,295.

At 80%, we have enough time to change it to BIGINT.…. Right? Let’s see.

So we used pt-online-schema-change to perform the alter.

It started running at a good pace but slowed over time.

 

Why?

Well, let’s look at the definition of the table:

mysql> show create table myschema.mytableG
*************************** 1. row ***************************
       Table: mytable
Create Table: CREATE TABLE `mytable` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `long_column` varchar(1000) NOT NULL,
  `state` tinyint unsigned NOT NULL,
  `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `short_column` varchar(30) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_long_column` (`long_column`,`state`),
  KEY `idx_short_column` (`short_column`,`state`),
  KEY `idx_short_col2` (`short_column`)
) ENGINE=InnoDB AUTO_INCREMENT=4009973818 DEFAULT CHARSET=utf8mb3

NOTE1: The index on long_column is for a varchar column with a length of 1000; it may not be required, and an index prefix may be more helpful here.

NOTE2: The index idx_short_col2 is redundant, as its only column is the leftmost column of the index idx_short_column.

Those changes require testing and are out of scope for this emergency, but they are worth mentioning.

 

Table size:

+---------------+------------+------------+---------+----------+---------+----------+--------+
| TABLE_SCHEMA  | TABLE_NAME | TABLE_ROWS | DATA_GB | INDEX_GB | FREE_GB | TOTAL_GB | ENGINE |
+---------------+------------+------------+---------+----------+---------+----------+--------+
| myschema      | mytable    | 3906921584 |    1118 |     1790 |       0 |     2907 | InnoDB |
+---------------+------------+------------+---------+----------+---------+----------+--------+

Note that the indexes are far bigger than the data: 1790 GB of indexes against 1118 GB of rows.

mysql> SELECT database_name, table_name, index_name, ROUND(stat_value * @@innodb_page_size / 1024 / 1024, 2) AS size_in_mb FROM mysql.innodb_index_stats WHERE stat_name = 'size' AND index_name != 'PRIMARY' and database_name='myschema' and table_name='mytable' ORDER BY size_in_mb DESC;
+---------------+------------+-------------------+------------+
| database_name | table_name | index_name        | size_in_mb |
+---------------+------------+-------------------+------------+
| myschema      | mytable    | idx_long_column   | 1583538.95 |
| myschema      | mytable    | idx_short_column  |  126432.98 |
| myschema      | mytable    | idx_short_col2    |  122699.95 |
+---------------+------------+-------------------+------------+
3 rows in set (0.01 sec)

While pt-online-schema-change runs, it copies the data to a new table, and the secondary indexes must be maintained as the rows are copied.

Note the huge index on the varchar(1000) column, which is ~1.5 TB in size. Maintaining such an index becomes increasingly expensive as the data size increases.

The pt-online-schema-change had been running for ~8 days, and its latest estimate was 53 more days, which we could not afford, since the maximum value would be reached in ~15 days.

Copying `myschema`.`mytable`:  12% 53+16:48:01 remain
Copying `myschema`.`mytable`:  12% 53+16:48:30 remain
Copying `myschema`.`mytable`:  12% 53+16:48:59 remain
Copying `myschema`.`mytable`:  12% 53+16:49:26 remain
Copying `myschema`.`mytable`:  12% 53+16:49:53 remain
Copying `myschema`.`mytable`:  12% 53+16:50:19 remain
Copying `myschema`.`mytable`:  12% 53+16:50:49 remain
Copying `myschema`.`mytable`:  12% 53+16:51:17 remain
Copying `myschema`.`mytable`:  12% 53+16:51:45 remain

 

So what do we do now?

We suggested canceling the pt-online-schema-change and creating an Aurora blue-green deployment.

Then perform the direct ALTER on the green cluster. And finally, when ready, do the failover.

 

Sounds good, doesn’t it?

 

First, we need to ensure that the new (green) cluster has the replica_type_conversions parameter set to “ALL_NON_LOSSY,ALL_UNSIGNED” in its cluster parameter group, so that it can replicate from an int unsigned column to a bigint unsigned column.

So we tried that. It started fast, at ~0.036% per minute, which works out to about 2 days. That’s great!

We left the process running over the weekend, but we noticed it started to slow down again… By Monday, it was advancing at ~0.01% every 5 mins, which gives an ETA of 34 days. 
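
The two ETAs quoted above follow from simple arithmetic on the observed progress rate. A minimal sketch, assuming the rate stays constant and the whole table (100%) is still to be copied; eta_days is a hypothetical helper name, not part of any tool:

```shell
#!/bin/sh
# eta_days PCT_PER_STEP MINUTES_PER_STEP
# Days needed to cover 100% of the table at a constant rate.
eta_days() {
    LC_ALL=C awk -v pct="$1" -v mins="$2" \
        'BEGIN { printf "%.1f\n", (100 / pct) * mins / 1440 }'
}

eta_days 0.036 1   # initial pace of the direct ALTER
eta_days 0.01 5    # pace observed by Monday
```

Comparing that number against the days left before exhaustion is a quick way to decide whether to keep waiting or change strategy.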

Why? 

Again, with the direct ALTER, MySQL copies the data to a temporary table, and the bigger the data, the harder it is to maintain the indexes.

Again, unacceptable.

Note that with the above 2 approaches, we lost ~12 days of precious time, and the deadline for auto_increment exhaustion was approaching.

Then we thought: What if we drop the secondary indexes, do the alter, and then add the indexes back?

In theory, it should be faster, as:

  • Dropping the indexes is a metadata-only operation with ONLINE DDL.
  • Altering the column datatype from INT to BIGINT is not an ONLINE operation, but the fact that it doesn’t have to update secondary indexes during row copying to a new temporary table prevents the slowdown.
  • Adding back the secondary indexes is an ONLINE DDL operation:

 

“Online DDL support for adding secondary indexes means that you can generally speed the overall process of creating and loading a table and associated indexes by creating the table without secondary indexes, then adding secondary indexes after the data is loaded.”

https://dev.mysql.com/doc/refman/8.4/en/innodb-online-ddl-operations.html

So let’s do this:

The deletion of the indexes was really quick, as expected (metadata-only operation):

mysql> ALTER TABLE myschema.mytable DROP INDEX idx_long_column, DROP INDEX idx_short_column, DROP INDEX idx_short_col2;
Query OK, 0 rows affected (49.40 sec)
Records: 0  Duplicates: 0  Warnings: 0

 

Then the change of the datatype:

mysql> ALTER TABLE myschema.mytable CHANGE COLUMN id id bigint unsigned NOT NULL AUTO_INCREMENT;
Query OK, 4058047205 rows affected (13 hours 9 min 10.62 sec)
Records: 4058047205  Duplicates: 0  Warnings: 0

 

Looks very promising!!!

 

The final step, add back the indexes:

mysql> ALTER TABLE myschema.mytable ADD INDEX `idx_long_column` (`long_column`,`state`), ADD INDEX `idx_short_column` (`short_column`,`state`), ADD INDEX `short_col2` (`short_column`);
ERROR 1878 (HY000): Temporary file write failure.

 

Why?

Well, the INPLACE operation uses the temporary directory to write sort files. In Aurora, the temporary space is limited according to the instance type.

In a regular MySQL instance, we could point innodb_tmpdir to another location with enough disk space, which would have made the whole process easier; in Aurora, however, the parameter is not modifiable.

Even with a larger instance type, it’s hard to create the 1.5T index without breaking open the piggy bank.

 

As a last resort, we added the indexes back with the COPY algorithm:

mysql> ALTER TABLE myschema.mytable ALGORITHM=COPY, ADD INDEX `idx_long_column` (`long_column`,`state`), ADD INDEX `idx_short_column` (`short_column`,`state`), ADD INDEX `idx_short_col2` (`short_column`);
Query OK, 4147498819 rows affected (6 days 1 hour 55 min 57.00 sec)
Records: 4147498819  Duplicates: 0  Warnings: 0

 

Why does it work? Because ALTER TABLE with the COPY algorithm creates its temporary table in the datadir and copies the rows there, so it is not subject to the temporary-directory limit mentioned above.

We made it in time, about 4 days before the auto_increment would have been exhausted, preventing downtime.

 

In retrospect, we could have used the following approach to avoid the blue/green deployment:

  1. Perform a pt-online-schema-change on the main table, dropping the indexes and changing the column type to bigint (with --no-swap-tables --no-drop-old-table --no-drop-new-table --no-drop-triggers).
  2. Add the secondary indexes using the direct alter with the COPY algorithm in the _new table.
  3. Once the alter finishes, swap the tables and drop the triggers.

 

Conclusion:

What initially looked like an easy task with pt-online-schema-change ended up being more complex.

You need to check the data definition, the index sizes, the Aurora limits, and how the different algorithms work before deciding how to proceed, especially in situations like this one, where the auto_increment is close to exhaustion and there is a risk of downtime if the change is not done in time.

And of course, monitor auto_increment exhaustion for your tables, and use a reasonable threshold that gives you enough time to plan and change the table definition. You can use Percona Monitoring and Management for this, specifically on the MySQL > MySQL Table Details dashboard.
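
For a quick manual check outside a dashboard, the same threshold logic can be sketched in a few lines of shell. This is only an illustration: autoinc_pct and autoinc_status are hypothetical helper names, the current value is the AUTO_INCREMENT figure from the SHOW CREATE TABLE output earlier in this post, and the integer arithmetic assumes a shell with 64-bit integers, so it suits INT-sized columns rather than BIGINT UNSIGNED.

```shell
#!/bin/sh
# autoinc_pct CURRENT_VALUE MAX_VALUE -> integer percentage used (rounded down)
autoinc_pct() {
    echo $(( $1 * 100 / $2 ))
}

# autoinc_status CURRENT_VALUE MAX_VALUE THRESHOLD_PCT -> prints OK or ALERT
autoinc_status() {
    pct=$(autoinc_pct "$1" "$2")
    if [ "$pct" -ge "$3" ]; then
        echo "ALERT ${pct}%"
    else
        echo "OK ${pct}%"
    fi
}

INT_UNSIGNED_MAX=4294967295
# AUTO_INCREMENT value taken from the table definition shown above
autoinc_status 4009973818 "$INT_UNSIGNED_MAX" 80
```

A lower threshold buys more planning time on tables this large, where the change itself can take days.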

The post Run an ALTER TABLE for a huge table in Aurora appeared first on Percona.

Apr
30
2026
--

Continued Commitment to Percona XtraDB Cluster

At Percona, our priority has always been to provide the open source database solutions that our users can count on for the long term. Percona XtraDB Cluster (PXC) is a core part of that promise, delivering the high availability, scalability, and data integrity that mission-critical MySQL deployments depend on.

MariaDB has announced that September 30, 2026 will be the end-of-life date for continued maintenance and regular binary releases of MySQL Galera Cluster. We want to be clear about what this means for the organizations that rely on PXC: nothing is changing. Our commitment to PXC and the community that runs it is as strong as ever.

What is ending upstream is precisely what we already have in place. For anyone looking for an alternative path forward, PXC is the natural place to land.

What PXC users can count on

  • Our open Galera fork: Percona maintains its own Galera repository, open today and staying that way. We track upstream Galera releases, carry the fixes our customers need, and keep the codebase fully available for the community. PXC is built on this work, on terms we control.
  • Regular releases at the current cadence: Binary releases, bug fixes, and security patches continue to ship on the same terms and schedule our users have come to expect. You can review our full release history and release notes on the Percona documentation site.
  • Long-term support: PXC remains fully supported under our existing long-term support terms. If your organization is planning three to five years ahead, PXC is a safe foundation for those plans.
  • Compatibility and ecosystem integration: Strong binary compatibility with MySQL and Percona Server for MySQL, tight integration with Percona XtraBackup and Percona Monitoring and Management, and continued support across Kubernetes and traditional deployment environments.

What we’re continuing to invest in

Our engineering teams remain committed to making PXC better, focused on the things that make it a trusted choice: performance, stability, security, and a smooth operator experience. That work continues at pace. The PXC you depend on today will keep getting better, and the PXC you are evaluating for tomorrow will be ready when you need it.

Talk to us

If you have specific questions about your PXC deployment, your upgrade path, or your long-term high availability strategy, we’d love to hear from you. Reach out to your Percona contact, post a question in the Percona community forums, or connect with our team directly. High availability is too important to leave to uncertainty, and we are here to make sure you have the clarity and the support you need.

The post Continued Commitment to Percona XtraDB Cluster appeared first on Percona.

Apr
29
2026
--

Orchestrator’s Next Chapter: What It Means for Percona Customers

Last week, ProxySQL announced that they are taking over the maintenance and development of Orchestrator, the MySQL high-availability and topology management tool originally authored by Shlomi Noach. You can read their announcement here: Announcing the future of Orchestrator.

We want to briefly share Percona’s position on the news.

We welcome this

Orchestrator became the de facto standard for MySQL topology management and automated failover, and it has been a foundational tool in the ecosystem for over a decade. When the upstream project was archived, many operators were left running internal forks. A revived project under active development, with a stated roadmap and continued Apache 2.0 licensing, is good news for the MySQL community, and we’re glad to see ProxySQL step up to take it on. Thanks are due to Shlomi Noach for creating Orchestrator in the first place, and to everyone who contributed to it over the years.

A small clarification on Percona’s role

The ProxySQL announcement kindly credited Percona alongside GitHub for “stewardship over the years.” To be accurate: Percona has never been a maintainer of the upstream Orchestrator project. What we have done, and will continue to do, is support our customers who rely on it. That includes operational guidance, troubleshooting, and carrying internal patches where a customer situation requires it. The upstream project itself has always lived with Shlomi and later with the team at GitHub.

Nothing changes for Percona customers

If you are a Percona customer running Orchestrator today, your support experience is unchanged. We will continue helping you operate it in production, diagnose issues, and plan around its role in your high-availability stack. That commitment is steady regardless of where the upstream project lives.

Orchestrator’s maintenance also matters to us beyond support engagements. Percona Operator for MySQL uses Orchestrator to manage asynchronous topologies, so our own product depends on the project staying healthy. That’s part of why we plan to coordinate closely with the ProxySQL team as the next chapter unfolds.

Coordinating with the ProxySQL team

We plan to open coordination conversations with the ProxySQL team to make sure that operators running Orchestrator today, including our customers, have a smooth path as the project evolves. We wish the ProxySQL team well in this next chapter and look forward to supporting the community alongside them.

If you’re a Percona customer, reach out to your account team with any questions about your Orchestrator deployment. If you’re running Orchestrator outside of a Percona engagement and want to talk through support options, get in touch with our MySQL team.

 

The post Orchestrator’s Next Chapter: What It Means for Percona Customers appeared first on Percona.

Apr
21
2026
--

Percona Operator for MySQL 1.1.0: PITR, Incremental Backups, and Compression

The latest release of the Percona Operator for MySQL, 1.1.0, is here. It brings point-in-time recovery, incremental backups, zstd backup compression, configurable asynchronous replication retries, and a set of stability fixes. This post walks through the highlights and how they help your MySQL deployments on Kubernetes.

 

Percona Operator for MySQL 1.1.0

Running stateful databases on Kubernetes means your backup and recovery story has to be airtight. A full nightly backup is fine, until the DBA drops a table at 2 PM and you’re looking at 14 hours of lost work. Or until your storage bill grows faster than your actual data because every backup is a full copy.

Percona Operator for MySQL 1.1.0 addresses exactly these pain points. This release lands point-in-time recovery, incremental backups, and backup compression: three features that together give you finer recovery control, faster backup jobs, and meaningfully smaller storage footprints. It also brings configurable asynchronous replication retries and a set of stability fixes that harden everyday operations.

This is a community-driven release. Nearly every headline feature in 1.1.0 traces back to user feedback: issues raised on forums.percona.com, JIRA tickets filed by operators in production, and recurring questions from teams running MySQL on Kubernetes at scale. The operator is fully open source, runs on any CNCF-conformant Kubernetes distribution (GKE, EKS, OpenShift, or bare metal), and costs nothing to run. Let’s walk through what’s new.

In this post, you’ll learn about:

  • Point-in-Time Recovery (Tech Preview)
  • Incremental Backups (Tech Preview)
  • Backup Compression with zstd
  • Asynchronous replication retry configuration
  • Other improvements


Point-in-Time Recovery (Tech Preview)

A backup restores your cluster to the moment the backup was taken, but incidents rarely respect your backup schedule. With point-in-time recovery now available in Tech Preview, you can restore your MySQL cluster to any specific timestamp or GTID position, not just to a backup snapshot.

The operator continuously collects binary logs and stores them alongside your full and incremental backups. When a restore is needed, it starts from the nearest full backup, applies incremental backups, and then replays binary logs forward to the exact point in time you specify. PITR works identically across asynchronous and group replication topologies, so you don’t need to restructure your setup to take advantage of it.

A timestamp-based restore targets the exact moment before an incident:

apiVersion: ps.percona.com/v1
kind: PerconaServerMySQLRestore
metadata:
  name: restore-pitr-example
spec:
  clusterName: cluster1
  backupName: backup-20260418
  pitr:
    type: date
    date: "2026-04-18 13:45:00"
    # Restore with GTID
    # type: gtid
    # gtid: a3e5ff70-83e2-11ef-8e57-7a62caf7e1e3:1-36

When you need finer precision than timestamp-based recovery (for example, replaying right up to the transaction immediately before a bad UPDATE), use pitr.type: gtid and specify the exact GTID position.

This is especially useful after an accidental DROP TABLE or a bad application deploy mid-day: you recover to the moment just before the event, not to last night’s snapshot.

See the documentation for the full configuration reference.

Note: PITR is marked Tech Preview in 1.1.0 and is not recommended for production workloads yet. Try it in staging and share your feedback on the community forum.

 

Incremental Backups (Tech Preview)

Full backups work, but they come with a cost: every job copies your entire dataset, consuming time, I/O, and storage whether or not much has changed since the last run. Incremental backups solve this by capturing only the changes since the previous backup.

The Operator integrates incremental backup support, powered by Percona XtraBackup, across all supported backup storage backends (S3-compatible, GCS, Azure Blob Storage). Both scheduled and on-demand backup jobs can run incrementally. When you trigger a restore, the Operator reconstructs the full state by chaining the base backup with the subsequent incremental sets, so you don’t manage that complexity manually.

This helps when you need:

    • Faster daily backup jobs on large datasets that change slowly
    • Lower storage and egress costs per backup cycle
    • Tighter recovery windows without sacrificing backup frequency
    • Less I/O pressure on the primary during backup jobs

The backup manifest lives in deploy/backup/backup.yaml. Note the commented type and incrementalBaseBackupName fields: they are exactly how you switch a backup to incremental mode and point it at a previous backup as its base.

apiVersion: ps.percona.com/v1
kind: PerconaServerMySQLBackup
metadata:
  finalizers:
    - percona.com/delete-backup
  name: backup1
spec:
  clusterName: ps-cluster1
  storageName: minio
  type: incremental

Set type: full to take a base backup, then for each subsequent incremental set type: incremental.

Note: Incremental backups are also marked Tech Preview in 1.1.0. You can learn more about this feature in a separate blog post: Incremental backups in Percona Kubernetes Operator for MySQL

 

Backup Compression with zstd

Even without incremental backups, you can now shrink your full backup size significantly. The operator adds support for zstd compression, which compresses backup data with Percona XtraBackup before it streams to object storage.

Smaller transfers mean faster uploads, lower egress costs, and less object storage consumption, especially relevant when your cluster is in a different region from your storage bucket. The operator handles decompression transparently during restore, so your recovery workflow stays the same.

You can enable compression globally by configuring XtraBackup in mysql.configuration on the Custom Resource:

spec:
  mysql:
    configuration: |
      [xtrabackup]
      compress=zstd

Or enable it per on-demand backup via containerOptions:

apiVersion: ps.percona.com/v1
kind: PerconaServerMySQLBackup
metadata:
  name: backup1-compressed
  finalizers:
    - percona.com/delete-backup
spec:
  clusterName: ps-cluster1
  storageName: s3-us-west
  containerOptions:
    args:
      xtrabackup:
        - "--compress"

Full details are in the compressed backups documentation. Percona XtraBackup’s zstd compression reference covers the algorithm-level tradeoffs if you want to tune further. One known limitation in 1.1.0: lz4 compression is not yet supported pending an upstream resolution.

 

Asynchronous Replication Retry Configuration

In asynchronous replication topologies, transient network issues can stall replication threads on a MySQL Pod. Previously, reconnection behavior was fixed. Now you can tune it via the Custom Resource using two environment variables:

  • ASYNC_SOURCE_RETRY_COUNT: the number of reconnection attempts before the replica gives up
  • ASYNC_SOURCE_CONNECT_RETRY: the delay in seconds between reconnection attempts

spec:
  mysql:
    env:
      - name: ASYNC_SOURCE_RETRY_COUNT
        value: "10"
      - name: ASYNC_SOURCE_CONNECT_RETRY
        value: "30"

This is useful in environments with higher network latency or less reliable connectivity between zones. You can give the replica more time to recover without manual intervention.
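
As a rough rule of thumb, the two values multiply into the time a replica keeps trying before it gives up. The sketch below assumes the replica waits the full delay between attempts and ignores the time each connection attempt itself takes; retry_window_seconds is a hypothetical helper name.

```shell
#!/bin/sh
# retry_window_seconds RETRY_COUNT CONNECT_RETRY_SECONDS
# Approximate time a replica keeps retrying before giving up.
retry_window_seconds() {
    echo $(( $1 * $2 ))
}

# Values from the example above: 10 attempts, 30 seconds apart
retry_window_seconds 10 30
```

Sizing this window to cover the longest network interruption you expect between zones is one way to pick the two values.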

A related improvement (K8SPS-69): the readiness probe now fails if replication threads stop on a MySQL Pod. This prevents Kubernetes from routing traffic to a replica that has quietly fallen behind, a common source of stale reads that were difficult to detect without custom monitoring.

 

Other Improvements

Operational polish shipped alongside the headline features:

    • Readiness probe catches stopped replication (K8SPS-69): the readiness probe now fails when replication threads stop, so Kubernetes stops routing traffic to replicas that have quietly fallen behind.
    • Automatic PVC removal on async replication restore (K8SPS-215): old PVCs are cleaned up automatically when restoring in async replication mode, one less manual step after a restore.
    • Scheduled backups paused on unhealthy clusters (K8SPS-435): backups no longer kick off against a degraded cluster, preventing partial or corrupted backup sets.
    • Structured error handling (K8SPS-595): invalid storage configurations now surface as structured error messages instead of Operator panics.
    • Status events reclassified (K8SPS-601): normal status transitions emit as Normal event types instead of warnings, cutting noise in kubectl describe output and alerting pipelines.
    • HAProxy file descriptor handling (K8SPS-666): file descriptor management in the HAProxy container is optimized so connection counts are no longer silently capped on busy clusters.

The release also ships improved documentation: OpenShift installation instructions now include the full OLM procedure, an Operator upgrade tutorial for OpenShift has been added, and Helm documentation covers customized parameters and custom release naming.

 

Conclusion

Percona Operator for MySQL 1.1.0 delivers meaningful improvements to every phase of the database lifecycle on Kubernetes. PITR and incremental backups in Tech Preview give you a path toward granular recovery without full-backup overhead. Compression with zstd reduces your storage and egress costs immediately. Configurable async replication retries and a batch of stability fixes harden the Operator for production workloads at scale. These features are in this release because the community asked for them.

We encourage you to read the full release notes and try the new features. Feedback is welcome on the GitHub repository, the Community Forum, or JIRA.

 


The post Percona Operator for MySQL 1.1.0: PITR, Incremental Backups, and Compression appeared first on Percona.

Apr
20
2026
--

Deploying Cross-Site Replication in Percona Operator for MySQL (PXC)

Having a separate DR cluster for production databases is a necessity for businesses that rely heavily on their database systems. Setting up such a [DC -> DR] topology for Percona XtraDB Cluster (PXC), which is a virtually synchronous cluster, can be challenging in a complex Kubernetes environment.

Here, Percona Operator for MySQL comes in handy: it takes only a few steps to configure such a topology, which gives you a remote backup site or a disaster recovery solution.

So without taking much time, let’s see how the overall setup and configurations look from a practical standpoint.

 

PXC Cross-Site/Disaster Recovery

 

DC Configuration

1) Here we have a three-node PXC cluster running on the DC side.

shell> kubectl get pods -n pxc
NAME                                               READY   STATUS      RESTARTS   AGE
cluster1-haproxy-0                                 2/2     Running     0          23h
cluster1-haproxy-1                                 2/2     Running     0          23h
cluster1-haproxy-2                                 2/2     Running     0          23h
cluster1-pxc-0                                     3/3     Running     0          23h
cluster1-pxc-1                                     3/3     Running     0          7h37m
cluster1-pxc-2                                     3/3     Running     0          7h18m
percona-xtradb-cluster-operator-6756dbf588-vxjxt   1/1     Running     0          24h
xb-backup1-hlz2p                                   0/1     Completed   0          21h
xb-cron-cluster1-fs-pvc-2026480026-372f8-2gfhr     0/1     Completed   0          13h

2) Some configuration options have to be enabled in the custom resource file [cr.yaml] to allow cross-site replication.

  • Expose all source PXC nodes so they can be reached from outside the cluster, including from the DR cluster.
expose:
  enabled: true
  type: LoadBalancer

  • Define a dedicated replication channel and enable the source option.
replicationChannels:
    - name: pxc1_to_pxc2
      isSource: true

  • Finally, apply the custom resource changes.
shell> kubectl apply -f cr.yaml

3) Each PXC node now shows an “EXTERNAL-IP”. These are the endpoints that the DR node [cluster1-pxc-0] will use to connect to the DC.

shell> kubectl get svc
NAME                              TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                                          AGE
cluster1-haproxy                  ClusterIP      34.118.227.249   <none>          3306/TCP,3309/TCP,33062/TCP,33060/TCP,8404/TCP   4h1m
cluster1-haproxy-replicas         ClusterIP      34.118.225.41    <none>          3306/TCP                                         4h1m
cluster1-pxc                      ClusterIP      None             <none>          3306/TCP,33062/TCP,33060/TCP                     4h1m
cluster1-pxc-0                    LoadBalancer   34.118.234.140   34.29.145.138   3306:30425/TCP                                   4h1m
cluster1-pxc-1                    LoadBalancer   34.118.239.132   34.30.233.0     3306:31340/TCP                                   4h1m
cluster1-pxc-2                    LoadBalancer   34.118.236.64    35.225.0.19     3306:30642/TCP                                   4h1m
cluster1-pxc-unready              ClusterIP      None             <none>          3306/TCP,33062/TCP,33060/TCP                     4h1m
percona-xtradb-cluster-operator   ClusterIP      34.118.235.168   <none>          443/TCP                                          4h11m

At this point, we are done with the DC setup. Next, we will take a backup from the source, which we will later use to build the DR.

 

Backup

  • Define the access key and secret used to connect to the GCP/S3 bucket.
shell> cat backup-secret-s3.yaml

apiVersion: v1
kind: Secret
metadata:
  name: my-cluster-name-backup-s3
type: Opaque
data:
  AWS_ACCESS_KEY_ID: <KEY>
  AWS_SECRET_ACCESS_KEY: <SECRET>
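
Values under a Secret's data field must be base64-encoded, so <KEY> and <SECRET> above stand for encoded strings. A minimal sketch of producing them follows; b64 is a hypothetical helper name and the credentials are placeholders, not real values.

```shell
#!/bin/sh
# b64 VALUE -> base64 of VALUE, with no trailing newline added to the input
# and any line wrapping removed from the output
b64() {
    printf '%s' "$1" | base64 | tr -d '\n'
}

# Placeholder credentials for illustration only; substitute your own.
echo "AWS_ACCESS_KEY_ID: $(b64 'abc')"
echo "AWS_SECRET_ACCESS_KEY: $(b64 'example-secret')"
```

Using printf '%s' instead of echo avoids encoding a trailing newline into the credential, which is a common cause of authentication failures.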

  • In the custom resource file [cr.yaml], we also need to define the bucket, the credentials secret, and the endpoint/region details.
backup:
  storages:
    s3-us-west:
      type: s3
      verifyTLS: true
      s3:
        bucket: <bucket>
        credentialsSecret: my-cluster-name-backup-s3
        region: us-west-2
        endpointUrl: https://storage.googleapis.com

shell> kubectl apply -f cr.yaml

  • Finally, we can take the backup by creating a [backup.yaml] file with the details below.
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterBackup
metadata:
#  finalizers:
#    - percona.com/delete-backup
  name: backup1
spec:
  pxcCluster: cluster1
  storageName:  s3-us-west

shell> kubectl apply -f backup.yaml

  • We can verify the successful backup as follows.
shell> kubectl get pxc-backup
NAME      CLUSTER    STORAGE      DESTINATION                                     STATUS      COMPLETED   AGE
backup1   cluster1   s3-us-west   s3://<bucket>/cluster1-2026-04-07-15:55:46-full   Succeeded   125m        127m

As the backup is also ready, we can now move to the DR setup part.

 

DR Configuration

Below, we have a PXC setup similar to the DC one, running in a separate node/K8s cluster.

shell> kubectl get pods -n pxc-dr
NAME                                               READY   STATUS      RESTARTS   AGE
cluster1-haproxy-0                                 2/2     Running     0          35h
cluster1-haproxy-1                                 2/2     Running     0          35h
cluster1-haproxy-2                                 2/2     Running     0          35h
cluster1-pxc-0                                     3/3     Running     0          35h
cluster1-pxc-1                                     3/3     Running     0          35h
cluster1-pxc-2                                     3/3     Running     0          35h
percona-xtradb-cluster-operator-6756dbf588-2wc5m   1/1     Running     0          38h
prepare-job-restore1-cluster1-8h4vn                0/1     Completed   0          35h
restore-job-restore1-cluster1-trfg6                0/1     Completed   0          35h
xb-cron-cluster1-fs-pvc-2026480025-372f8-wv6bt     0/1     Completed   0          28h
xb-cron-cluster1-fs-pvc-2026490025-372f8-gxd59     0/1     Completed   0          4h48m

First, we need to restore the backup on the DR cluster.

Data Restoration

  • Here we will create the [backup-secret-s3.yaml] file which contains the GCP/S3 credentials.
apiVersion: v1
kind: Secret
metadata:
  name: my-cluster-name-backup-s3
type: Opaque
data:
  AWS_ACCESS_KEY_ID: <KEY>
  AWS_SECRET_ACCESS_KEY: <SECRET>

shell> kubectl apply -f backup-secret-s3.yaml

  • Next, we will create a [restore.yaml] file that specifies the backup source and other required information.
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterRestore
metadata:
  name: restore1
#  annotations:
#    percona.com/headless-service: "true"
spec:
  pxcCluster: cluster1
  backupSource:
#    verifyTLS: true
    destination: s3://<bucket>/cluster1-2026-04-07-15:55:46-full
    s3:
      bucket: <bucket>
      credentialsSecret: my-cluster-name-backup-s3
      endpointUrl: https://storage.googleapis.com/

shell> kubectl apply -f restore.yaml

  • Once the restoration is finished successfully, we will see the status below.
shell> kubectl get pxc-restore
NAME       CLUSTER    STATUS      COMPLETED   AGE
restore1   cluster1   Succeeded               27m

Now we can make the remaining DR changes in the custom resource file [cr.yaml]. Basically, we need to add the replication channel and all the source EXTERNAL-IPs. This cross-DC replication supports the Automatic Asynchronous Replication Connection Failover feature, so if any DC node goes down, the replica can connect to another available DC node and resume from there.

replicationChannels:
    - name: pxc1_to_pxc2
      isSource: false
      sourcesList:
      - host: 34.29.145.138  
        port: 3306
        weight: 100

      - host: 34.30.233.0
        port: 3306
        weight: 100

      - host: 35.225.0.19
        port: 3306
        weight: 100

shell> kubectl apply -f cr.yaml
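
The sourcesList entries above can be typed by hand, but they can also be derived from the kubectl get svc output on the DC side. A minimal sketch, assuming the default column layout shown earlier (TYPE in the second column, EXTERNAL-IP in the fourth); render_sources is a hypothetical helper name, and the port and weight simply mirror the values used in this post.

```shell
#!/bin/sh
# render_sources: reads `kubectl get svc` output on stdin and prints one
# sourcesList entry per LoadBalancer service that already has an external IP.
render_sources() {
    awk '$2 == "LoadBalancer" && $4 != "<none>" && $4 != "<pending>" {
        printf "      - host: %s\n        port: 3306\n        weight: 100\n", $4
    }'
}

# Usage against the DC cluster (the pxc namespace is the one used above):
# kubectl get svc -n pxc | render_sources
```

The output is indented to paste directly under sourcesList in the DR cr.yaml.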

For more details on backup and restoration with the PXC operator, refer to the operator documentation.

 

Replication

Initially, when we check the replication status, we notice the following error. This is because [caching_sha2_password] authentication requires either a secure SSL/TLS connection or the SOURCE_PUBLIC_KEY_PATH/GET_SOURCE_PUBLIC_KEY options, which enable RSA key pair-based password exchange by requesting the public key from the source.

shell> kubectl exec -it cluster1-pxc-0  -- sh
shell> mysql -uroot -p

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Connecting to source
                  Source_Host: 35.225.0.19
                  Source_User: replication
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: 
          Read_Source_Log_Pos: 4
               Relay_Log_File: cluster1-pxc-0-relay-bin-pxc1_to_pxc2.000001
                Relay_Log_Pos: 4
        Relay_Source_Log_File: 
           Replica_IO_Running: Connecting
          Replica_SQL_Running: Yes
...

Error:

Last_IO_Error: Error connecting to source 'replication@35.225.0.19:3306'. This was attempt 2/3, with a delay of 60 seconds between attempts. Message: Access denied for user 'replication'@'35.225.0.19.' (using password: YES)

Once we pass “GET_SOURCE_PUBLIC_KEY=1” in the “CHANGE REPLICATION SOURCE TO” command, the error is resolved and the DR can successfully communicate with the DC.

mysql> STOP REPLICA;
mysql> STOP REPLICA IO_THREAD FOR CHANNEL 'pxc1_to_pxc2';
mysql> CHANGE REPLICATION SOURCE TO SOURCE_USER='replication', SOURCE_PASSWORD='password', GET_SOURCE_PUBLIC_KEY=1 FOR CHANNEL 'pxc1_to_pxc2';
mysql> START REPLICA;

Note – The replication user is created automatically on the DC node. The command below retrieves the decoded password for the “replication” user.

shell> kubectl get secret cluster1-secrets -o jsonpath="{.data.replication}" | base64 --decode
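
Kubernetes stores Secret values base64-encoded, which is why the command pipes the jsonpath output through base64 --decode. The same step can be sketched in Python; decode_secret_value is a hypothetical helper, and the encoded string here is a made-up example rather than a real credential.

```python
import base64


def decode_secret_value(encoded: str) -> str:
    """Decode one base64-encoded value taken from a Secret's .data map."""
    return base64.b64decode(encoded).decode("utf-8")


# Made-up value standing in for the output of the jsonpath query.
encoded = base64.b64encode(b"example-password").decode("ascii")

print(decode_secret_value(encoded))
```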

mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 35.225.0.19
                  Source_User: replication
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: binlog.000006
          Read_Source_Log_Pos: 3047027
               Relay_Log_File: cluster1-pxc-0-relay-bin-pxc1_to_pxc2.000001
                Relay_Log_Pos: 150132
        Relay_Source_Log_File: binlog.000006
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
...

The other PXC DR nodes will sync as usual with the Galera Synchronous replication process. 

Source Failover

Asynchronous connection failover is already enabled on the DR because we defined the source list in the custom resource file. The external IPs shown here differ from the earlier ones because they changed during this test.

mysql> select * from performance_schema.replication_asynchronous_connection_failover;
+--------------+---------------+------+-------------------+--------+--------------+
| CHANNEL_NAME | HOST          | PORT | NETWORK_NAMESPACE | WEIGHT | MANAGED_NAME |
+--------------+---------------+------+-------------------+--------+--------------+
| pxc1_to_pxc2 | 34.29.145.138 | 3306 |                   |    100 |              |
| pxc1_to_pxc2 | 34.45.151.96  | 3306 |                   |    100 |              |
| pxc1_to_pxc2 | 34.71.57.38   | 3306 |                   |    100 |              |
+--------------+---------------+------+-------------------+--------+--------------+
3 rows in set (0.00 sec)

Now, if the current Source DC [cluster1-pxc-2] goes down, the DR will connect to one of the other available DC nodes based on the “Weight” and chronological order [pxc-2, pxc-1, pxc-0, etc.].
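
The weight-based part of that choice can be illustrated with a small model. This is a sketch, not MySQL's implementation: the Source type and top_weight_candidates function are hypothetical, and the sketch deliberately returns every remaining source that shares the highest weight rather than guessing how ties are broken.

```python
from typing import List, NamedTuple


class Source(NamedTuple):
    host: str
    port: int
    weight: int


def top_weight_candidates(sources: List[Source], failed_host: str) -> List[Source]:
    """Return the remaining sources that share the highest weight."""
    remaining = [s for s in sources if s.host != failed_host]
    if not remaining:
        return []
    best = max(s.weight for s in remaining)
    return [s for s in remaining if s.weight == best]


# Rows from the failover table above; every source has weight 100.
sources = [
    Source("34.29.145.138", 3306, 100),
    Source("34.45.151.96", 3306, 100),
    Source("34.71.57.38", 3306, 100),
]

for candidate in top_weight_candidates(sources, failed_host="34.71.57.38"):
    print(candidate.host)
```

With equal weights, every surviving source is an equally valid candidate, so giving the preferred source a higher weight is the way to make the choice predictable.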

  • Here, we temporarily take down the Source DC[cluster1-pxc-2] node.
shell> kubectl get pods -n pxc
NAME                                               READY   STATUS      RESTARTS     AGE
cluster1-haproxy-0                                 2/2     Running     0            2d3h
cluster1-haproxy-1                                 2/2     Running     0            2d3h
cluster1-haproxy-2                                 2/2     Running     0            2d3h
cluster1-pxc-0                                     3/3     Running     0            2d3h
cluster1-pxc-1                                     3/3     Running     0            35h
cluster1-pxc-2                                     2/3     Running     1 (6s ago)   34h
percona-xtradb-cluster-operator-6756dbf588-vxjxt   1/1     Running     0            2d3h
xb-backup1-hlz2p                                   0/1     Completed   0            2d1h
xb-cron-cluster1-fs-pvc-2026480026-372f8-2gfhr     0/1     Completed   0            41h
xb-cron-cluster1-fs-pvc-2026490026-372f8-mgfpv     0/1     Completed   0            17h

  • The DR replication breaks as it can’t reach the DC [cluster1-pxc-2].
mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Reconnecting after a failed source event read
                  Source_Host: 34.71.57.38
                  Source_User: replication
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: binlog.000012
          Read_Source_Log_Pos: 198
               Relay_Log_File: cluster1-pxc-0-relay-bin-pxc1_to_pxc2.000002
                Relay_Log_Pos: 369
        Relay_Source_Log_File: binlog.000012
           Replica_IO_Running: Connecting
          Replica_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Source_Log_Pos: 198
              Relay_Log_Space: 602
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Source_SSL_Allowed: No
           Source_SSL_CA_File: 
           Source_SSL_CA_Path: 
              Source_SSL_Cert: 
            Source_SSL_Cipher: 
               Source_SSL_Key: 
        Seconds_Behind_Source: NULL
Source_SSL_Verify_Server_Cert: Yes
                Last_IO_Errno: 2003
                Last_IO_Error: Error reconnecting to source 'replication@34.71.57.38:3306'. This was attempt 2/3, with a delay of 60 seconds between attempts. Message: Can't connect to MySQL server on '34.71.57.38:3306' (111)

  • Once the retries allowed by “source_retry_count” and “source_connect_retry” are exhausted, the Replica connects to another Source DC [cluster1-pxc-1].
mysql> show replica status\G;
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 34.45.151.96
                  Source_User: replication
                  Source_Port: 3306
                Connect_Retry: 60
              Source_Log_File: binlog.000007
          Read_Source_Log_Pos: 198
               Relay_Log_File: cluster1-pxc-0-relay-bin-pxc1_to_pxc2.000003
                Relay_Log_Pos: 369
        Relay_Source_Log_File: binlog.000007
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
...
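
The error message earlier ("attempt 2/3, with a delay of 60 seconds") suggests a retry count of 3 and a connect retry interval of 60 seconds in this setup. As a rough estimate only, the time before the replica gives up on a dead source is the product of the two; the function name below is hypothetical, and real timing also depends on connection timeouts.

```python
def approx_failover_delay(retry_count: int, connect_retry_s: int) -> int:
    """Rough time in seconds spent retrying the failed source before failing over."""
    return retry_count * connect_retry_s


# Values read from the Last_IO_Error message: 3 attempts, 60 seconds apart.
print(approx_failover_delay(3, 60))  # 180
```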

Quick Summary

In this blog post, we walked through the steps to configure cross-site replication with the Percona PXC operator. Although we used the operator-native XtraBackup to seed the DR via the restore process, logical backup tools such as mysqldump or mydumper can accomplish the same goal.

Using asynchronous replication to sync the DR can introduce replication lag, especially across data centres, where network latency is a big factor. However, adding a DR (PXC) cluster to the DC (PXC) directly via synchronous replication could have a greater impact and trigger flow control if any DR node struggles or becomes saturated. So, it’s important to weigh these trade-offs before deploying in production.

The post Deploying Cross-Site Replication in Percona Operator for MySQL (PXC) appeared first on Percona.

Apr
01
2026
--

Percona Operator for PostgreSQL 2.9.0: PostgreSQL 18 Default, PVC Snapshot Backups, LDAP Support, and More!

We are excited to announce Percona Operator for PostgreSQL 2.9.0! In this release, we bring significant improvements across database lifecycle management, security, backup/restore, and operational observability, making it easier than ever to run production PostgreSQL on Kubernetes.

Here’s a deep dive into what’s new.

 


PostgreSQL 18 Is Now the Default

Starting with this release, PostgreSQL 18 is the default version for new cluster deployments. PostgreSQL 18 delivers improved query planning, better parallelism, enhanced logical replication, and a number of security hardening improvements.

If you’re still running PostgreSQL 13, please note that it has reached end-of-life and is no longer supported. We strongly recommend upgrading to a supported version (14 through 18) as soon as possible.

The following PostgreSQL versions are supported in Percona Operator for PostgreSQL 2.9.0: 14.22-1, 15.17-1, 16.13-1, 17.9-1, and 18.3-1. Other versions may also work, but they have not been officially tested or validated.

 

Major Version Upgrades Are Now Generally Available (GA)

The major upgrade workflow for PostgreSQL clusters has graduated to General Availability (GA). After extensive testing across upgrade paths and Kubernetes environments, this feature is now production-ready. Major upgrades let you move your cluster to a newer PostgreSQL major version in place, for example from PostgreSQL 16 to 17 or 18, with minimal disruption and fully managed by the Operator. This removes the need for manual pg_upgrade procedures and complex migration scripts, letting you keep your clusters up to date with confidence.

 

PVC Snapshot Backups: Faster Backups and Restores (Tech Preview)

One of the most exciting additions in 2.9.0 is PVC (Persistent Volume Claim) snapshot support for backups and restores. Rather than streaming data through pgBackRest, PVC snapshots leverage your storage layer’s native snapshot capability, dramatically reducing backup time, especially for large databases.

Key benefits of PVC snapshot backups:

  • Significantly faster backup and restore operations for large volumes
  • Point-in-time recovery is supported when combined with pgBackRest WAL archiving
  • Currently supports cold (offline) backups only. Hot (online) snapshots will be added in a future release.
  • Requires enabling the BackupSnapshots feature gate

Combined with pgBackRest’s full, differential, and incremental backup types, PVC snapshots give you the flexibility to build a layered backup strategy: routine incrementals for day-to-day protection, and near-instant snapshots before high-risk operations like major upgrades or schema changes, all managed by the same Operator.

This feature is in Tech Preview in 2.9.0. We encourage users to test it in non-production environments and share feedback via the Community Forum or GitHub issues.

 

WAL Lag Detection for Standby Clusters

Managing standby clusters just got smarter. The Percona Operator for PostgreSQL now supports WAL lag detection: when a standby cluster falls behind the primary by more than a configurable threshold, the Operator automatically marks its pods as unready, and the cluster enters an “Initializing” state with a StandbyLagging condition.

This prevents stale standbys from silently serving traffic, improves the reliability of high-availability setups, and gives operators clear visibility into replication health. Your team no longer has to guess why something looks wrong, because the Operator detects lagging replicas automatically.
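
The idea of a lag threshold can be sketched with PostgreSQL LSNs, which are written as two hexadecimal numbers separated by a slash. This is an illustration, not the Operator's code: the function names are hypothetical, and measuring the threshold in bytes is an assumption, since the post does not state how the Operator measures lag.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def is_lagging(primary_lsn: str, standby_lsn: str, max_lag_bytes: int) -> bool:
    """Return True when the standby is more than max_lag_bytes behind the primary."""
    return lsn_to_int(primary_lsn) - lsn_to_int(standby_lsn) > max_lag_bytes


# Hypothetical positions: the standby is 32 MiB behind, the threshold is 16 MiB.
print(is_lagging("0/3000000", "0/1000000", 16 * 1024 * 1024))
```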

 

LDAP Authentication Support

Security teams can now enforce centralized user authentication for PostgreSQL through their corporate LDAP directory. The Operator’s new LDAP support allows you to configure PostgreSQL to authenticate users against an LDAP server directly, without requiring manual pg_hba.conf management.

Two authentication methods are supported:

  • Simple bind – the user’s distinguished name (DN) is constructed from a template and used directly to bind
  • Bind-and-search – the server first binds with a service account, searches for the user’s DN, then re-binds with the user’s credentials

LDAP integration enables teams to enforce the same identity governance policies for their Kubernetes-hosted PostgreSQL clusters as they do for the rest of their infrastructure. This means no more manually editing pg_hba.conf, no per-cluster credential management, and no separate offboarding: when a user is removed from LDAP, their access to PostgreSQL is revoked across every cluster.
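
The difference between the two methods comes down to how the user's DN is obtained. The sketch below mirrors PostgreSQL's own LDAP options, where simple bind concatenates a prefix, the user name, and a suffix, and search+bind looks the user up by an attribute that defaults to uid. The function names and the example directory layout are hypothetical, and the Operator's custom resource fields for these settings are not shown here.

```python
def simple_bind_dn(username: str, prefix: str, suffix: str) -> str:
    """Simple bind: the DN is prefix + user name + suffix."""
    return f"{prefix}{username}{suffix}"


def search_filter(username: str, attribute: str = "uid") -> str:
    """Search+bind: the filter used to look up the user's entry.

    Real code should escape LDAP filter metacharacters in the user name.
    """
    return f"({attribute}={username})"


# Hypothetical directory layout used only for illustration.
print(simple_bind_dn("alice", "uid=", ",ou=people,dc=example,dc=com"))
print(search_filter("alice"))
```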

 

Automated TLS Certificate Management via cert-manager

Managing TLS certificates for PostgreSQL clusters is now fully automated with cert-manager integration. With this integration, the Operator can automatically request, renew, and rotate TLS certificates for cluster communication, eliminating the need to manually manage certificate expiry or rotation scripts.

Highlights:

  • cert-manager automatically renews certificates 30 days before expiration
  • Configurable validity periods let teams align certificate lifecycle with their security policies

Overall, this makes it straightforward to enforce encrypted-in-transit policies across all PostgreSQL clusters without operational overhead.

 

Official PostGIS Docker Image

Geospatial workloads on Kubernetes just became easier to manage. This release introduces an official PostGIS Docker image maintained by Percona, providing a supported and regularly patched path for running PostGIS 3.5.5 alongside PostgreSQL. With this addition, geospatial PostgreSQL deployments are now a first-class citizen in the Percona Operator ecosystem.

 

Operational Improvements:

pprof Profiling for Troubleshooting

When diagnosing performance issues in the Operator itself, you can now enable pprof profiling by setting the PPROF_BIND_ADDRESS environment variable on the Operator pod. Doing so exposes Go’s built-in profiling endpoints, enabling CPU and memory analysis without an Operator restart.

Custom DNS Suffix Configuration

Running PostgreSQL clusters inside vcluster or environments with custom DNS configurations? The new clusterServiceDNSSuffix option lets you specify the DNS suffix used for service discovery, ensuring correct name resolution in non-standard Kubernetes networking setups.

Volume Mounting for Sidecar Containers

Sidecar containers can now mount PersistentVolumeClaims, Secrets, and ConfigMaps directly. The result is a wider range of sidecar use cases, from log exporters needing persistent storage to monitoring agents requiring configuration secrets.

Configurable Operator Leader Election

If you’ve ever seen the Operator crash-loop during a Kubernetes API server blip, for example, when your cluster autoscales and the API server briefly throttles requests, this one is for you.

In previous versions, the Operator’s leader election lease settings were hardcoded. When the API server became slow or temporarily unreachable, the lease wasn’t renewed in time, the Operator lost leadership, and immediately restarted, often in a loop. There was no way to tune the timeouts to match your infrastructure.

v2.9.0 fixes this with five new environment variables:

  • PGO_CONTROLLER_LEASE_DURATION, PGO_CONTROLLER_RENEW_DEADLINE, and PGO_CONTROLLER_RETRY_PERIOD adjust the timing for lease acquisition and renewal.
  • PGO_CONTROLLER_LEADER_ELECTION_ENABLED turns leader election on or off, which is useful for single-replica deployments.
  • PGO_CONTROLLER_LEASE_NAME sets a custom Lease resource to use as the leader lock.
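
When tuning the three timing values, they need to stay consistent with each other. The sketch below checks the ordering rules enforced by the Kubernetes client-go leader election library, assuming the Operator passes these settings through to it; that assumption, the function name, and the use of plain seconds are not confirmed by the post, which does not state the value format of the environment variables.

```python
JITTER_FACTOR = 1.2  # multiplier client-go applies to the retry period


def lease_setting_problems(lease_duration_s, renew_deadline_s, retry_period_s):
    """Return a list of ordering problems; an empty list means the values are consistent."""
    problems = []
    if lease_duration_s <= renew_deadline_s:
        problems.append("lease duration must be greater than renew deadline")
    if renew_deadline_s <= retry_period_s * JITTER_FACTOR:
        problems.append("renew deadline must be greater than retry period * 1.2")
    return problems


# 15s / 10s / 2s are the common controller-runtime defaults.
print(lease_setting_problems(15, 10, 2))
print(lease_setting_problems(10, 10, 2))
```

Raising the lease duration and renew deadline together gives the Operator more slack during short API server slowdowns, at the cost of slower takeover when the leader really is gone.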

Configurable wal_level

You can now configure the wal_level PostgreSQL parameter via the Operator. This is particularly useful for clusters without logical replication: setting wal_level = replica reduces unnecessary WAL overhead and improves write performance.

 

Deprecations and Breaking Changes

PMM2 Support Deprecated

Support for PMM2 is now deprecated as it approaches end-of-life. PMM2 will be removed in a future release (two releases from now). Therefore, we strongly encourage all users to migrate to PMM3 to continue receiving monitoring support and new features.

PostgreSQL 13 Removed

PostgreSQL 13 has reached upstream end-of-life and this release drops it. If you are still running PostgreSQL 13 clusters, please upgrade to a supported version before updating the Operator.

pg_stat_monitor Disabled by Default

pg_stat_monitor is now disabled by default to prevent potential memory issues in production clusters. If you rely on this extension, you can re-enable it explicitly via the custom resource.

 

Conclusion

Percona Operator for PostgreSQL v2.9.0 is an enterprise-grade feature release that continues to push the boundaries of what’s possible when running PostgreSQL on Kubernetes. From making PostgreSQL 18 the new default to introducing PVC snapshot-based backups that can complete in seconds regardless of database size, this release squarely targets speed, security, and operational confidence.

We encourage you to try Percona Operator for PostgreSQL v2.9.0, explore the full release notes, and share your feedback with the community. As always, you can reach us through the Community Forum or file issues on GitHub.

 

The post Percona Operator for PostgreSQL 2.9.0: PostgreSQL 18 Default, PVC Snapshot Backups, LDAP Support, and More! appeared first on Percona.
