May
12
2022
--

Percona Operator for MongoDB and Kubernetes MCS: The Story of One Improvement

Percona Operator for MongoDB and Kubernetes MCS

Percona Operator for MongoDB has supported multi-cluster or cross-site replication deployments since version 1.10. This functionality is extremely useful if you want to have a disaster recovery deployment or perform a migration from or to a MongoDB cluster running in Kubernetes. In a nutshell, it allows you to use Operators deployed in different Kubernetes clusters to manage and expand replica sets.

For example, you have two Kubernetes clusters: one in Region A, another in Region B.

  • In Region A you deploy your MongoDB cluster with Percona Operator. 
  • In Region B you deploy unmanaged MongoDB nodes with another installation of Percona Operator.
  • You configure both Operators, so that nodes in Region B are added to the replica set in Region A.

In case of failure of Region A, you can switch your traffic to Region B.

Migrating MongoDB to Kubernetes describes the migration process using this functionality of the Operator.

This feature was released in tech preview, and we received lots of positive feedback from our users. But one of our customers raised an internal ticket pointing out that the cross-site replication functionality does not work with Multi-Cluster Services. This started the investigation and the creation of this ticket – K8SPSMDB-625.

This blog post will go into the deep roots of this story and how it is solved in the latest release of Percona Operator for MongoDB version 1.12.

The Problem

Multi-Cluster Services, or MCS, allows you to expand network boundaries for the Kubernetes cluster and share Service objects across these boundaries. Some call it another take on Kubernetes Federation. This feature is already available on some managed Kubernetes offerings, such as Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS). Submariner uses the same logic and primitives under the hood.

MCS Basics

To understand the problem, we need to understand how multi-cluster services work. Let’s take a look at the picture below:

multi-cluster services

  • We have two Pods in different Kubernetes clusters
  • We add these two clusters into our MCS domain
  • Each Pod has a service and IP-address which is unique to the Kubernetes cluster
  • MCS introduces new Custom Resources – ServiceImport and ServiceExport.
    • Once you create a ServiceExport object in one cluster, a ServiceImport object appears in all clusters in your MCS domain.
    • This ServiceImport object is in the svc.clusterset.local domain and, with the network magic introduced by MCS, can be accessed from any cluster in the MCS domain.

The above means that if I have an application in the Kubernetes cluster in Region A, I can connect to the Pod in the Kubernetes cluster in Region B through a domain name like my-pod.<namespace>.svc.clusterset.local. And it works from the other cluster as well.
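
For illustration, here is a minimal ServiceExport sketch following the upstream KEP-1645 API. The exact apiVersion can differ between providers (GKE, for example, serves these objects under its own API group), and the Service name and namespace below simply mirror the example above:

apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: my-pod        # must match the name of an existing Service in this namespace
  namespace: mongo

Once this object exists, the MCS controller creates a ServiceImport with the same name in every cluster of the MCS domain, and the Service becomes resolvable as my-pod.mongo.svc.clusterset.local.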

MCS and Replica Set

Here is how cross-site replication works with Percona Operator if you use load balancer:

MCS and Replica Set

All replica set nodes have a dedicated service and a load balancer. A replica set in the MongoDB cluster is formed using these public IP addresses. The external node is added using its public IP address as well:

  replsets:
  - name: rs0
    size: 3
    externalNodes:
    - host: 123.45.67.89

All nodes can reach each other, which is required to form a healthy replica set.

Here is how it looks when you have clusters connected through multi-cluster service:

Instead of load balancers, replica set nodes are exposed through ClusterIPs. We have ServiceExport and ServiceImport resources. Everything looks good on the networking level and it should work, but it does not.

The problem is in the way the Operator builds the MongoDB replica set in Region A. To register an external node from Region B in the replica set, we use the MCS domain name in the corresponding section:

  replsets:
  - name: rs0
    size: 3
    externalNodes:
    - host: rs0-4.mongo.svc.clusterset.local

Now our rs.status() will look like this:

"name" : "my-cluster-rs0-0.mongo.svc.cluster.local:27017"
"role" : "PRIMARY"
...
"name" : "my-cluster-rs0-1.mongo.svc.cluster.local:27017"
"role" : "SECONDARY"
...
"name" : "my-cluster-rs0-2.mongo.svc.cluster.local:27017"
"role" : "SECONDARY"
...
"name" : "rs0-4.mongo.svc.clusterset.local:27017"
"role" : "UNKNOWN"

As you can see, the Operator formed a replica set out of three nodes using the svc.cluster.local domain, as this is how it should be done when you expose nodes with the ClusterIP Service type. In this case, a node in Region B cannot reach any node in Region A, as it tries to connect to a domain that is local to the cluster in Region A.

In the picture below, you can easily see where the problem is:

The Solution

Luckily, we have a Special Interest Group (SIG), a Kubernetes Enhancement Proposal (KEP), and multiple implementations for enabling Multi-Cluster Services. Having a KEP is great since we can be sure the implementations from different providers (e.g., GCP, AWS) will follow the same standard, more or less.

There are two fields in the Custom Resource that control MCS in the Operator:

spec:
  multiCluster:
    enabled: true
    DNSSuffix: svc.clusterset.local

Let’s see what is happening in the background with these flags set.

ServiceImport and ServiceExport Objects

Once you enable MCS by patching the CR with spec.multiCluster.enabled: true, the Operator creates a ServiceExport object for each Service. These ServiceExports will be detected by the MCS controller in the cluster, and eventually a ServiceImport for each ServiceExport will be created in the same namespace in each cluster that has MCS enabled.
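
A quick way to see this in action is to list the objects on both sides. This assumes the upstream multicluster.x-k8s.io CRDs; if the short resource names clash in your cluster, use the fully qualified names (e.g. serviceexports.multicluster.x-k8s.io):

# in the cluster where the Operator runs: ServiceExports created by the Operator
kubectl get serviceexports -n <namespace>

# in any member cluster: ServiceImports propagated by the MCS controller
kubectl get serviceimports -n <namespace>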

As you see, we made a decision and empowered the Operator to create ServiceExport objects. There are two main reasons for doing that:

  • If an infrastructure-as-code tool is used, it would require additional logic and complexity to automate the creation of the required MCS objects. If the Operator takes care of it, no additional work is needed.
  • Our Operators take care of the infrastructure for the database, including Service objects. It just felt logical to expand the reach of this functionality to MCS.

Replica Set and Transport Encryption

The root cause of the problem that we are trying to solve here lies in the networking field, where external replica set nodes try to connect to the wrong domain names. Now, when you enable multi-cluster and set DNSSuffix (it defaults to svc.clusterset.local), the Operator does the following:

  • The replica set is formed using the MCS domain set in DNSSuffix
  • The Operator generates TLS certificates as usual, but adds the DNSSuffix domains into the picture

With this approach, the traffic between nodes flows as expected and is encrypted by default.

Replica Set and Transport Encryption
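
If you want to verify that the certificates really carry the MCS names, you can inspect the SANs of the Operator-generated certificate. A sketch, assuming the secret follows the usual <clusterName>-ssl naming and the standard kubernetes.io/tls layout with a tls.crt key:

# decode the certificate and print its Subject Alternative Names
kubectl get secret my-cluster-ssl -o jsonpath='{.data.tls\.crt}' \
  | base64 -d \
  | openssl x509 -noout -text \
  | grep -A1 'Subject Alternative Name'

The output should list both svc.cluster.local and svc.clusterset.local DNS entries.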

Things to Consider

MCS APIs

Please note that the Operator won’t install the MCS APIs and controllers into your Kubernetes cluster. You need to install them by following your provider’s instructions prior to enabling MCS for your PSMDB clusters. See our docs for links to different providers.

The Operator detects whether MCS is installed in the cluster by checking the available API resources. The detection happens before the controllers are started in the Operator. If you installed the MCS APIs while the Operator is running, you need to restart the Operator. Otherwise, you’ll see an error like this:

{
  "level": "error",
  "ts": 1652083068.5910048,
  "logger": "controller.psmdb-controller",
  "msg": "Reconciler error",
  "name": "cluster1",
  "namespace": "psmdb",
  "error": "wrong psmdb options: MCS is not available on this cluster",
  "errorVerbose": "...",
  "stacktrace": "..."
}
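
To check whether the MCS APIs are registered and to restart the Operator after installing them, something like the following works (the Deployment name and namespace below are assumptions, check your own installation):

# are the MCS API resources registered?
kubectl api-resources | grep -i serviceexport

# restart the Operator so it re-runs the detection
kubectl rollout restart deployment percona-server-mongodb-operator -n psmdb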

ServiceImport Provisioning Time

It might take some time for ServiceImport objects to be created in the Kubernetes cluster. You can see the following messages in the logs while creation is in progress:

{
  "level": "info",
  "ts": 1652083323.483056,
  "logger": "controller_psmdb",
  "msg": "waiting for service import",
  "replset": "rs0",
  "serviceExport": "cluster1-rs0"
}

During testing, we saw wait times of up to 10-15 minutes. If you see your cluster stuck in the initializing state waiting for service imports, it’s a good idea to check the usage and quotas for your environment.

DNSSuffix

We also made a decision to automatically generate TLS certificates for the Percona Server for MongoDB cluster with the *.clusterset.local domain, even if MCS is not enabled. This approach simplifies the process of enabling MCS for a running MongoDB cluster. It does not make much sense to change the DNSSuffix field unless you have hard requirements from your service provider, but we still allow such a change.

If you want to enable MCS with a cluster deployed with an Operator version below 1.12, you need to update your TLS certificates to include *.clusterset.local SANs. See the docs for instructions.

Conclusion

Businesses rely on applications, and on the infrastructure that serves them, more than ever these days. Disaster Recovery protocols and various failsafe mechanisms are routine for reliability engineers, not an annoying task in the backlog.

With multi-cluster deployment functionality in Percona Operator for MongoDB, we want to equip users to build highly available and secured database clusters with minimal effort.

Percona Operator for MongoDB is truly open source and provides users with a way to deploy and manage their enterprise-grade MongoDB clusters on Kubernetes. We encourage you to try this new Multi-Cluster Services integration and let us know your results on our community forum. You can find some scripts that would help you provision your first MCS clusters on GKE or EKS here.

There is always room for improvement and a time to find a better way. Please let us know if you face any issues, and contribute your ideas to Percona products. You can do that on the Community Forum or in JIRA. Read more about contribution guidelines for Percona Operator for MongoDB in CONTRIBUTING.md.

Dec
14
2021
--

High Availability and Disaster Recovery Recipes for PostgreSQL on Kubernetes

High Availability and Disaster Recovery PostgreSQL on Kubernetes

Percona Distribution for PostgreSQL Operator allows you to deploy and manage highly available and production-grade PostgreSQL clusters on Kubernetes with minimal manual effort. In this blog post, we are going to look deeper into High Availability, Disaster Recovery, and Scaling of PostgreSQL clusters.

High Availability

Our default custom resource manifest deploys a highly available (HA) PostgreSQL cluster. The key components of the HA setup are:

  • Kubernetes Services that point to pgBouncer and replica nodes
  • pgBouncer – a lightweight connection pooler for PostgreSQL
  • Patroni – HA orchestrator for PostgreSQL
  • PostgreSQL nodes – we have one primary and two replica nodes in hot standby by default

high availability postgresql

Kubernetes Service is the way to expose your PostgreSQL cluster to applications or users. We have two services:

  • clusterName-pgbouncer – exposes your PostgreSQL cluster through the pgBouncer connection pooler. Both reads and writes are sent to the Primary node.
  • clusterName-replica – exposes replica nodes directly. It should be used for reads only. Also, keep in mind that connections to this service are not pooled. We are working on a better solution, where the user would be able to leverage both connection pooling and read-scaling through a single service.

By default we use the ClusterIP Service type, but you can change it in pgBouncer.expose.serviceType or pgReplicas.hotStandby.expose.serviceType, respectively.
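
For example, an application would connect through the two services roughly like this (the cluster name, namespace, user, and database below are placeholders; the port is assumed to be the default 5432):

# pooled reads and writes go through pgBouncer
$ psql "host=cluster1-pgbouncer.pgo.svc port=5432 user=pguser dbname=pgdb"

# unpooled, read-only connections go straight to the replicas
$ psql "host=cluster1-replica.pgo.svc port=5432 user=pguser dbname=pgdb"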

Every PostgreSQL container has Patroni running. Patroni monitors the state of the cluster and, in case of Primary node failure, switches the role of the Primary to one of the Replica nodes. pgBouncer always knows where the Primary is.

As you see, we distribute PostgreSQL cluster components across different Kubernetes nodes. This is done with Affinity rules, which are applied by default to ensure that a single node failure does not cause database downtime.

Multi-Datacenter with Multi-AZ

Good architecture design is to run your Kubernetes cluster across multiple datacenters. Public clouds have a concept of availability zones (AZ) which are data centers within one region with a low-latency network connection between them. Usually, these data centers are at least 100 kilometers away from each other to minimize the probability of regional outage. You can leverage multi-AZ Kubernetes deployment to run cluster components in different data centers for better availability.

Multi-Datacenter with Multi-AZ

To ensure that PostgreSQL components are distributed across availability zones, you need to tweak the affinity rules. Right now this is only possible by editing the Deployment resources directly:

$ kubectl edit deploy cluster1-repl2
…
-            topologyKey: kubernetes.io/hostname
+            topologyKey: topology.kubernetes.io/zone

Scaling

Scaling PostgreSQL to meet the demand at peak hours is crucial for high availability. Our Operator provides you with tools to scale PostgreSQL components both horizontally and vertically.

Vertical Scaling

Scaling vertically is all about adding more power to a PostgreSQL node. The recommended way is to change resources in the Custom Resource (instead of changing them in Deployment objects directly). For example, change the following in the cr.yaml to get 256 MiB of RAM for all PostgreSQL Replica nodes:

  pgReplicas:
    hotStandby:
      resources:
        requests:
-         memory: "128Mi"
+         memory: "256Mi"

Apply cr.yaml:

$ kubectl apply -f cr.yaml

Use the same approach to tune other components in their corresponding sections.

You can also leverage the Vertical Pod Autoscaler (VPA) to react to load spikes automatically. We create a Deployment resource for the Primary and each Replica node, and VPA objects should target these Deployments. The following example will track one of the replica Deployment resources of cluster1 and scale it automatically:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: pxc-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind:       Deployment
    name:     cluster1-repl1  
    namespace:  pgo
  updatePolicy:
    updateMode: "Auto"

Please read more about VPA and its capabilities in its documentation.

Horizontal Scaling

Adding more replica nodes or pgBouncers can be done by changing the size parameters in the Custom Resource. Make the following change in the default cr.yaml:

  pgReplicas:
    hotStandby:
-      size: 2
+      size: 3

Apply the change to get one more PostgreSQL Replica node:

$ kubectl apply -f cr.yaml

Starting from release 1.1.0, it is also possible to scale the cluster using the kubectl scale command. Execute the following to have two PostgreSQL replica nodes in cluster1:

$ kubectl scale --replicas=2 perconapgcluster/cluster1
perconapgcluster.pg.percona.com/cluster1 scaled

In the latest release, it is not yet possible to use the Horizontal Pod Autoscaler (HPA); we will support it in the next one. Stay tuned.

Disaster Recovery

It is important to understand that Disaster Recovery (DR) is not High Availability. DR’s goal is to ensure business continuity in the case of a massive disaster, such as a full region outage. Recovery in such cases can of course be automated, but not necessarily – it strictly depends on the business requirements.

Disaster Recovery postgresql

Backup and Restore

I think it is the most common Disaster Recovery protocol – take a backup, store it on third-party premises, and restore it to another data center if needed.

This approach is simple but comes with a long recovery time, especially if the database is big. Use this method only if it meets your Recovery Time Objectives (RTO).

Recovery Time Objectives

Our Operator handles backup and restore for PostgreSQL clusters. The disaster recovery is built around pgBackrest and looks like the following:

  1. Configure pgBackrest to upload backups to S3 or GCS (see our documentation for details).
  2. Create the backup manually (through pgTask) or ensure that a scheduled backup was created. 
  3. Once the Main cluster fails, create the new cluster in the Disaster Recovery data center. The cluster must be running in standby mode and pgBackrest must be pointing to the same repository as the main cluster:
spec:
  standby: true
  backup:
  # same config as on original cluster

Once data is recovered, the user can turn off standby mode and switch the application to the DR cluster.

Continuous Restoration

This approach is quite similar to the above: pgBackrest instances continuously synchronize data between two clusters through object storage. This approach minimizes RTO and allows you to switch the application traffic to the DR site almost immediately. 

Continuous Restoration postgresql

The configuration here is similar to the previous case, but we always run a second PostgreSQL cluster in the Disaster Recovery data center. In case of a main site failure, just turn off standby mode:

spec:
  standby: false
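
The same change can also be applied without editing cr.yaml, for example with a patch against the custom resource (a sketch, assuming the cluster is named cluster1):

$ kubectl patch perconapgcluster/cluster1 --type merge -p '{"spec":{"standby":false}}'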

You can use a similar setup to migrate the data to and from Kubernetes. Read more about it in the Migrating PostgreSQL to Kubernetes blog post.

Conclusion

Kubernetes Operators provide a ready-to-use service, and in the case of Percona Distribution for PostgreSQL Operator, the user gets a production-grade, highly available database cluster. In addition, the Operator provides day-2 operation capabilities and automates the day-to-day routine.

We encourage you to try out our operator. See our GitHub repository and check out the documentation.

Found a bug or have a feature idea? Feel free to submit it in JIRA.

For general questions, please raise the topic in the community forum.

Are you a developer looking to contribute? Please read our CONTRIBUTING.md and send us a Pull Request.

Oct
08
2021
--

Disaster Recovery for MongoDB on Kubernetes

Disaster Recovery for MongoDB on Kubernetes

As per the glossary, Disaster Recovery (DR) protocols are an organization’s method of regaining access and functionality to its IT infrastructure in events like a natural disaster, cyber attack, or even business disruptions related to the COVID-19 pandemic. When we talk about data, storing backups on remote servers is enough to pass DR compliance checks for some companies. But for others, Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) are extremely tight and require more than just a backup/restore procedure.

In this blog post, we are going to show you how to set up MongoDB on two distant Kubernetes clusters with Percona Distribution for MongoDB Operator to meet the toughest DR requirements.

What to Expect

Here is what we are going to do:

  1. Set up two Kubernetes clusters
  2. Deploy Percona Distribution for MongoDB Operator on both of them. The Disaster Recovery site will run a MongoDB cluster in unmanaged mode.
  3. Simulate a failure and perform a failover to the DR site

In version 1.10.0 of the Operator, we added a Technology Preview of a new feature that enables users to deploy unmanaged MongoDB nodes and connect them to existing Replica Sets.

MongoDB Kubernetes

Set it All Up

We are not going to cover the configuration of the Kubernetes clusters, but in our tests, we relied on two Google Kubernetes Engine (GKE) clusters deployed in different regions. Read more about GKE here.

Prepare Main Site

We have shared all the resources for this blog post in this GitHub repo. As a first step we are going to deploy the operator on the Main site:

$ kubectl apply -f bundle.yaml

Deploy the MongoDB managed cluster with cr-main.yaml:

$ kubectl apply -f cr-main.yaml

It is important to understand that we will need to expose the ReplicaSet nodes through dedicated services. This includes the Config Servers. This is required to ensure that ReplicaSet nodes on Main and DR can reach each other, so it is like a full mesh:

ReplicaSet nodes

To get there, cr-main.yaml has the following changes:

spec:
  replsets:
  - name: rs0
    expose:
      enabled: true
      exposeType: LoadBalancer
  sharding:
    configsvrReplSet:
      expose:
        enabled: true
        exposeType: LoadBalancer

We are using the LoadBalancer Kubernetes Service object as it is just simpler for us, but there are other options – ClusterIP, NodePort. It is also possible to utilize third-party tools like Submariner to implement a private connection.

If you have an already running MongoDB cluster in Kubernetes, you can expose the ReplicaSets without downtime by changing these variables.

Prepare Disaster Recovery Site

The configuration of the Disaster Recovery site could be broken down into the following steps:

  1. Copy the Secrets from the Main cluster.
    1. system users secrets
    2. SSL keys – both used for external connections and internal replication traffic
  2. Tune Custom Resource:
    1. run nodes in unmanaged mode – the Operator does not control replica set configuration and secrets generation
    2. expose ReplicaSets (the same way we do it on the Main cluster)
    3. disable backups – backups can only be taken on the cluster managed by the Operator

Copy the Secrets

System users’ credentials are stored by default in the my-cluster-name-secrets Secret object and defined in spec.secrets.users. Apply this Secret in the DR cluster with kubectl apply -f yaml-with-secrets. If you don’t have it in your source code repository, or if you rely on the Operator to generate it, you can get the Secret from Kubernetes itself, remove the unnecessary metadata, and apply it.

On the Main site, execute:

$ kubectl get secret my-cluster-name-secrets -o yaml > my-cluster-secrets.yaml

Now remove the following lines from metadata:

annotations
creationTimestamp
resourceVersion
selfLink
uid

Save the file and apply it to the DR cluster.
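
If you prefer to script the cleanup, here is a sketch using yq v4 (the secret name is the default one mentioned above):

$ kubectl get secret my-cluster-name-secrets -o yaml \
    | yq eval 'del(.metadata.annotations) | del(.metadata.creationTimestamp) | del(.metadata.resourceVersion) | del(.metadata.selfLink) | del(.metadata.uid)' \
    > my-cluster-secrets.yaml

Then, with kubectl pointed at the DR cluster:

$ kubectl apply -f my-cluster-secrets.yaml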

The procedure to copy SSL keys is almost the same as for users. The difference is the names of the Secret objects – they are usually called <CLUSTER_NAME>-ssl and <CLUSTER_NAME>-ssl-internal. It is also possible to specify them in secrets.ssl and secrets.sslInternal in the Custom Resource. Copy these two keys from Main to DR and reference them in the CR.

Tune Custom Resource

cr-replica.yaml will have the following changes:

  secrets:
    users: my-cluster-name-secrets
    ssl: replica-cluster-ssl
    sslInternal: replica-cluster-ssl-internal

  replsets:
  - name: rs0
    size: 3
    expose:
      enabled: true
      exposeType: LoadBalancer

  sharding:
    enabled: true
    configsvrReplSet:
      size: 3
      expose:
        enabled: true
        exposeType: LoadBalancer

  backup:
    enabled: false

Once the Custom Resource is applied, the services are going to be created. We will need the IP addresses of each ReplicaSet node to add them as external nodes on the Main site.

$ kubectl get services
NAME                  TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)           AGE
replica-cluster-cfg-0    LoadBalancer   10.111.241.213   34.78.119.1       27017:31083/TCP   5m28s
replica-cluster-cfg-1    LoadBalancer   10.111.243.70    35.195.138.253    27017:31957/TCP   4m52s
replica-cluster-cfg-2    LoadBalancer   10.111.246.94    146.148.113.165   27017:30196/TCP   4m6s
...
replica-cluster-rs0-0    LoadBalancer   10.111.241.41    34.79.64.213      27017:31993/TCP   5m28s
replica-cluster-rs0-1    LoadBalancer   10.111.242.158   34.76.238.149     27017:32012/TCP   4m47s
replica-cluster-rs0-2    LoadBalancer   10.111.242.191   35.195.253.107    27017:31209/TCP   4m22s

Add External Nodes to Main

At this step, we are going to add unmanaged nodes to the Replica Set on the Main site. In cr-main.yaml we should add externalNodes under replsets.[] and sharding.configsvrReplSet:

  replsets:
  - name: rs0
    externalNodes:
    - host: 34.79.64.213
      priority: 1
      votes: 1
    - host: 34.76.238.149
      priority: 1
      votes: 1
    - host: 35.195.253.107
      priority: 0
      votes: 0

  sharding:
    configsvrReplSet:
      externalNodes:
      - host: 34.78.119.1
        priority: 1
        votes: 1
      - host: 35.195.138.253
        priority: 1
        votes: 1
      - host: 146.148.113.165
        priority: 0
        votes: 0

Please note that we add three nodes, but only two are voters. We do this to avoid split-brain situations and to avoid starting a primary election if the DR site is down or there is a network disruption between the Main and DR sites.

Failover

Once all the configuration above is applied, the situation will look like this:

Failover

We have three voters in the main cluster and two voters in the replica cluster. That means the replica nodes won’t have a majority in case of main cluster failure, and they won’t be able to elect a new primary. Therefore, we need to step in and perform a manual failover.

Let’s kill the main cluster:

gcloud compute instances list | 
grep my-main-gke-demo | 
awk '{print $1}' | 
xargs gcloud compute instances delete --zone europe-west3-b

gcloud container node-pools delete \
--zone europe-west3-b \
--cluster my-main-gke-demo \
default-pool

I deleted the nodes and the node pool of the main Kubernetes cluster, so now the cluster is in an unhealthy state. Let’s see what mongos on the DR site says when we try to read or write through it (psmdb-tester can be found in the git repo as well):

% ./psmdb-tester
2021/09/03 18:19:19 Successfully connected and pinged 34.141.3.189:27017
2021/09/03 18:19:40 read failed: (FailedToSatisfyReadPreference) Encountered non-retryable error during query :: caused by :: Could not find host matching read preference { mode: "primary" } for set cfg
2021/09/03 18:19:49 write failed: (FailedToSatisfyReadPreference) Could not find host matching read preference { mode: "primary" } for set cfg

Normally, we can only alter the replica set configuration from the primary node, but in this kind of situation, where you don’t have a primary and only have a few surviving members, MongoDB allows us to force the reconfiguration from any member that is still alive.

Let’s connect to one of the secondary nodes in the replica cluster and perform the failover:

kubectl exec -it psmdb-client-7b9f978649-pjb2k -- mongo 'mongodb://clusterAdmin:<pass>@replica-cluster-rs0-0.replica.svc.cluster.local/admin?ssl=false'
...
rs0:SECONDARY> cfg = rs.config()
rs0:SECONDARY> cfg.members = [cfg.members[3], cfg.members[4], cfg.members[5]]
rs0:SECONDARY> rs.reconfig(cfg, {force: true})

Note that the indexes of surviving members may differ in your environment. You should check rs.status() and rs.config() outputs first. The main idea is to repopulate config members with only surviving members.

After the reconfiguration, the replica set will have just three members, and two of them will have votes and form a majority. So they’ll be able to elect a new primary. After performing the same process on the cfg replica set, we will be able to read and write through mongos again:

% ./psmdb-tester
2021/09/03 18:41:48 Successfully connected and pinged 34.141.3.189:27017
2021/09/03 18:41:49 read succeed
2021/09/03 18:41:50 read succeed
2021/09/03 18:41:51 read succeed
2021/09/03 18:41:52 read succeed
2021/09/03 18:41:53 read succeed
2021/09/03 18:41:54 read succeed
2021/09/03 18:41:55 read succeed
2021/09/03 18:41:56 read succeed
2021/09/03 18:41:57 read succeed
2021/09/03 18:41:58 read succeed
2021/09/03 18:41:58 write succeed

Once the replica cluster has become the primary, you should reconfigure all clients that connect to the old main cluster and point them to the DR site.
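
For example, a client connection string would change from the old Main mongos endpoint to the DR mongos address seen in the test output above (credentials and the old address are placeholders; ssl=false only mirrors the test setup, a production deployment would keep TLS on):

# before the failover: Main site mongos
mongodb://clusterAdmin:<pass>@<main-mongos-address>:27017/admin?ssl=false

# after the failover: DR site mongos
mongodb://clusterAdmin:<pass>@34.141.3.189:27017/admin?ssl=false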

Conclusion

Disaster Recovery is important for business continuity. The goal of administrators and SREs is to have a plan in place. With the new release of Percona Distribution for MongoDB Operator, setting up DR is fast, automated, and enables IT teams to meet RTO and RPO requirements.

We encourage you to try out our operator. See our GitHub repository and check out the documentation.

Found a bug or have a feature idea? Feel free to submit it in JIRA.

For general questions please raise the topic in the community forum.

Are you a developer looking to contribute? Please read our CONTRIBUTING.md and send us a Pull Request.

Percona Distribution for MongoDB Operator

The Percona Distribution for MongoDB Operator simplifies running Percona Server for MongoDB on Kubernetes and provides automation for day-1 and day-2 operations. It’s based on the Kubernetes API and enables highly available environments. Regardless of where it is used, the Operator creates a member that is identical to other members created with the same Operator. This provides an assured level of stability to easily build test environments or deploy a repeatable, consistent database environment that meets Percona expert-recommended best practices.


Jan
19
2021
--

StackPulse announces $28M investment to help developers manage outages

When a system outage happens, chaos can ensue as the team tries to figure out what’s happening and how to fix it. StackPulse, a new startup that wants to help developers manage these crisis situations more efficiently, emerged from stealth today with a $28 million investment.

The round actually breaks down to a previously unannounced $8 million seed investment and a new $20 million Series A. GGV led the A round, while Bessemer Venture Partners led the seed and also participated in the A. Glenn Solomon at GGV and Amit Karp at Bessemer will join the StackPulse board.

Nobody is immune to these outages. We’ve seen incidents from companies as varied as Amazon and Slack in recent months. The biggest companies like Google, Facebook and Amazon employ site reliability engineers and build customized platforms to help remediate these kinds of situations. StackPulse hopes to put this kind of capability within reach of companies whose only defense is their on-call developers.

Company co-founder and CEO Ofer Smadari says that in the midst of a crisis with signals coming at you from Slack and PagerDuty and other sources, it’s hard to figure out what’s happening. StackPulse is designed to help sort out the details to get you back to equilibrium as quickly as possible.

First off, it helps identify the severity of the incident. Is it a false alarm or something that requires your team’s immediate attention or something that can be put off for a later maintenance cycle? If there is something going wrong that needs to be fixed right now, StackPulse can not only identify the source of the problem, but also help fix it automatically, Smadari explained.

After the incident has been resolved, it can also help with a post-mortem to figure out what exactly went wrong by pulling in all of the alert communications and incident data into the platform.

As the company emerges from stealth, it has some early customers, and 35 employees based in Portland, Oregon and Tel Aviv. Smadari says that he hopes to have 100 employees by the end of this year. As he builds the organization, he is thinking about how to build a diverse team for a diverse customer base. He believes that people with diverse backgrounds build a better product. He adds that diversity is a top level goal for the company, which already has an HR leader in place to help.

Glenn Solomon from GGV, who will be joining the company board, saw a strong founding team solving a big problem for companies and wanted to invest. “When they described the vision for the product they wanted to build, it made sense to us,” he said.

Customers are impatient with down time and Solomon sees developers on the front line trying to solve these issues. “Performance is more important than ever. When there is downtime, it’s damaging to companies,” he said. He believes StackPulse can help.

Sep
02
2020
--

Transposit scores $35M to build data-driven runbooks for faster disaster recovery

Transposit is a company built by engineers to help engineers, and one big way to help them is to get systems up and running faster when things go wrong — as they always will at some point. Transposit has come up with a way to build runbooks for faster disaster recovery, while using data to update them in an automated fashion.

Today, the company announced a $35 million Series B investment led by Altimeter Capital, with participation from existing investors Sutter Hill Ventures, SignalFire and Unusual Ventures. Today’s investment brings the total raised to $50.4 million, according to the company.

Company CEO Divanny Lamas and CTO and founder Tina Huang see technology issues as less an engineering problem and more as a human problem, because it’s humans who have to clean up the messes when things go wrong. Huang says forgetting the human side of things is where she thinks technology has gone astray.

“We know that the real superpower of the product is that we focus on the human and the user side of things. And as a result, we’re building an engineering culture that I think is somewhat differentiated,” Huang told TechCrunch.

Transposit is a platform that at its core helps manage APIs, connections to other programs, so it starts with a basic understanding of how various underlying technologies work together inside a company. This is essential for a tool that is trying to help engineers in a moment of panic figure out how to get back to a working state.

When it comes to disaster recovery, there are essentially two pieces: getting the systems working again, then figuring out what happened. For the first piece, the company is building data-driven runbooks. By being data-driven, they aren’t static documents. Instead, the underlying machine learning algorithms can look at how the engineers recovered and adjust accordingly.

Transposit disaster recovery dashboard

Image Credits: Transposit

“We realized that no one was focusing on what we realize is the root problem here, which is how do I have access to the right set of data to make it easier to reconstruct that timeline, and understand what happened? We took those two pieces together, this notion that runbooks are a critical piece of how you spread knowledge and spread process, and this other piece, which is the data, is critical,” Huang said.

Today the company has 26 employees, including Huang and Lamas, who Huang brought on board from Splunk last year to be CEO. The company is somewhat unique having two women running the organization, and they are trying to build a diverse workforce as they build their company to 50 people in the next 12 months.

The current make-up is 47% female engineers, and the goal is to remain diverse as they build the company, something that Lamas admits is challenging to do. “I wish I had a magic answer, or that Tina had a magic answer. The reality is that we’re just very demanding on recruiters. And we are very insistent that we have a diverse pipeline of candidates, and are constantly looking at our numbers and looking at how we’re doing,” Lamas said.

She says being diverse actually makes it easier to recruit good candidates. “People want to work at diverse companies. And so it gives us a real edge from a kind of culture perspective, and we find that we get really amazing candidates that are just tired of the status quo. They’re tired of the old way of doing things and they want to work in a company that reflects the world that they want to live in,” she said.

The company, which launched in 2016, took a few years to build the first piece, the underlying API platform. This year it added the disaster recovery piece on top of that platform, and has been running its beta since the beginning of the summer. They hope to add additional beta customers before making it generally available later this year.

Jul
07
2020
--

OwnBackup lands $50M as backup for Salesforce ecosystem thrives

OwnBackup has made a name for itself primarily as a backup and disaster recovery system for the Salesforce ecosystem, and today the company announced a $50 million investment.

Insight Partners led the round, with participation from Salesforce Ventures and Vertex Ventures. This chunk of money comes on top of a $23 million round from a year ago, and brings the total raised to more than $100 million, according to the company.

It shouldn’t come as a surprise that Salesforce Ventures chipped in when the majority of the company’s backup and recovery business involves the Salesforce ecosystem, although the company will be looking to expand beyond that with the new money.

“We’ve seen such growth over the last two and a half years around the Salesforce ecosystem, and the other ISV partners like Veeva and nCino that we’ve remained focused within the Salesforce space. But with this funding, we will expand over the next 12 months into a few new ecosystems,” company CEO Sam Gutmann told TechCrunch.

In spite of the pandemic, the company continues to grow, adding 250 new customers last quarter, bringing it to over 2,000 customers and 250 employees, according to Gutmann.

He says that raising the round, which closed at the beginning of May, had some hairy moments as the pandemic began to take hold across the world and worsen in the U.S. For a time, he began talking to new investors in case his existing ones got cold feet. As it turned out, when the quarterly numbers came in strong, the existing ones came back and the round was oversubscribed, Gutmann said.

“Q2 frankly was a record quarter for us, adding over 250 new accounts, and we’re seeing companies start to really understand how critical this is,” he said.

The company plans to continue hiring through the pandemic, although he says it might not be quite as aggressively as they once thought. Like many companies, even though they plan to hire, they are continually assessing the market. At this point, he foresees growing the workforce by about another 50 people this year, but that’s about as far as he can look ahead right now.

Gutmann says he is working with his management team to make sure he has a diverse workforce right up to the executive level, but he says it’s challenging. “I think our lower ranks are actually quite diverse, but as you get up into the leadership team, you can see on the website unfortunately we’re not there yet,” he said.

They are instructing their recruiting teams to look for diverse candidates whether by gender or ethnicity, and employees have formed a diversity and inclusion task force with internal training, particularly for managers around interviewing techniques.

He says going remote has been difficult, and he misses seeing his employees in the office. He hopes to have at least some come back before the end of the summer and slowly add more as we get into the fall, but that will depend on how things go.

May
20
2020
--

FireHydrant lands $8M Series A for disaster management tool

When I spoke to Robert Ross, CEO and co-founder at FireHydrant, we had a technology adventure. First the audio wasn’t working correctly on Zoom, then Google Meet. Finally we used cell phones to complete the interview. It was like a case study in what FireHydrant is designed to do — help companies manage incidents and recover more quickly when things go wrong with their services.

Today the company announced an $8 million Series A from Menlo Ventures and Work-Bench. That brings the total raised to $9.5 million, including the $1.5 million seed round we reported on last April.

In the middle of a pandemic with certain services under unheard of pressure, understanding what to do when your systems crash has become increasingly important. FireHydrant has literally developed a playbook to help companies recover faster.

These run books are digital documents that are unique to each company and include what to do to help manage the recovery process. Some of that is administrative. For example, certain people have to be notified by email, a Jira ticket has to be generated and a Slack channel opened to provide a communications conduit for the team.

While Ross says you can’t define the exact recovery process itself because each incident tends to be unique, you can set up an organized response to an incident and that can help you get to work on the recovery much more quickly. That ability to manage an incident can be a difference maker when it comes to getting your system back to a steady state.

Ross is a former site reliability engineer (SRE) himself. He has experienced the kinds of problems his company is trying to solve, and that background was something that attracted investor Matt Murphy from Menlo Ventures.

“I love his authentic perspective, as a former SRE, on the problem and how to create something that would make the SRE function and processes better for all. That value prop really resonated with us in a time when the shift to online is accelerating and remote coordination between people tasked with identifying and fixing problems is at all time high in terms of its importance. Ultimately we’re headed toward more and more automation in problem resolution and FH helps pave the way,” Murphy told TechCrunch.

It’s not easy being an early-stage company in the current climate, but Ross believes his company has created something that will resonate, perhaps even more right now. As he says, every company has incidents, and how you react can define you as a company. Having tooling to help you manage that process helps give you structure at a time you need it most.

Oct
31
2019
--

How you react when your systems fail may define your business

Just around 9:45 a.m. Pacific Time on February 28, 2017, websites like Slack, Business Insider, Quora and other well-known destinations became inaccessible. For millions of people, the internet itself seemed broken.

It turned out that Amazon Web Services was having a massive outage involving S3 storage in its Northern Virginia datacenter, a problem that created a cascading impact and culminated in an outage that lasted four agonizing hours.

Amazon eventually figured it out, but you can only imagine how stressful it might have been for the technical teams who spent hours tracking down the cause of the outage so they could restore service. A few days later, the company issued a public post-mortem explaining what went wrong and which steps they had taken to make sure that particular problem didn’t happen again. Most companies try to anticipate these types of situations and take steps to keep them from ever happening. In fact, Netflix came up with the notion of chaos engineering, where systems are tested for weaknesses before they turn into outages.

Unfortunately, no tool can anticipate every outcome.

It’s highly likely that your company will encounter a problem of immense proportions like the one that Amazon faced in 2017. It’s what every startup founder and Fortune 500 CEO worries about — or at least they should. What will define you as an organization, and how your customers will perceive you moving forward, will be how you handle it and what you learn.

We spoke to a group of highly-trained disaster experts to learn more about preventing these types of moments from having a profoundly negative impact on your business.

It’s always about your customers

Reliability and uptime are so essential to today’s digital businesses that enterprise companies developed a new role, the Site Reliability Engineer (SRE), to keep their IT assets up and running.

Tammy Butow, principal SRE at Gremlin, a startup that makes chaos engineering tools, says the primary role of the SRE is keeping customers happy. If the site is up and running, that’s generally the key to happiness. “SRE is generally more focused on the customer impact, especially in terms of availability, uptime and data loss,” she says.

Companies measure uptime according to the so-called “five nines,” or 99.999 percent availability, but software engineer Nora Jones, who most recently led Chaos Engineering and Human Factors at Slack, says there is often too much of an emphasis on this number. According to Jones, the focus should be on the customer and the impact that availability has on their perception of you as a company and your business’s bottom line.

Someone needs to be calm and just keep asking the right questions.

“It’s money at the end of the day, but also over time, user sentiment can change [if your site is having issues],” she says. “How are they thinking about you, the way they talk about your product when they’re talking to their friends, when they’re talking to their family members. The nines don’t capture any of that.”

Robert Ross, founder and CEO at FireHydrant, an SRE as a Service platform, says it may be time to rethink the idea of the nines. “Maybe we need to change that term. Maybe we can popularize something like ‘happiness level objectives’ or ‘happiness level agreements.’ That way, the focus is on our products.”

When things go wrong

Companies go to great lengths to prevent disasters to avoid disappointing their customers and usually have contingencies for their contingencies, but sometimes, no matter how well they plan, crises can spin out of control. When that happens, SREs need to execute, which takes planning, too: knowing what to do when the going gets tough.

Apr
02
2019
--

FireHydrant lands $1.5M seed investment to bring order to IT disaster recovery

FireHydrant, an NYC startup, wants to help companies recover from IT disasters more quickly, and understand why they happened — with the goal of preventing similar future scenarios from happening again. Today, the fledgling startup announced a $1.5 million seed investment from Work-Bench, a New York City venture capital firm that invests in early-stage enterprise startups.

In addition to the funding, the company announced it was opening registration for its FireHydrant incident management platform. The product has been designed with Google’s Site Reliability Engineering (SRE) methodology in mind, but company co-founder and CEO Bobby Ross says the tool is designed to help anyone understand the cause of a disaster, regardless of what happened, and whether they practice SRE or not.

“I had been involved in several fire fighting scenarios — from production databases being dropped to Kubernetes upgrades gone wrong — and every incident had a common theme: absolute chaos,” Ross wrote in a blog post announcing the new product.

The product has two main purposes, according to Ross. It helps you figure out what’s happening as you attempt to recover from an ongoing disaster scenario, and once you’ve put out the fire, it lets you do a post-mortem to figure out exactly what happened with the hope of making sure that particular disaster doesn’t happen again.

As Ross describes it, a tool like PagerDuty can alert you that there’s a problem, but FireHydrant lets you figure out what specifically is going wrong and how to solve it. He says that the tool works by analyzing change logs, as a change is often the primary culprit of IT incidents. When you have an incident, FireHydrant will surface that suspected change, so you can check it first.

“We’ll say, hey, you had something change recently in this vicinity where you have an alert going off. There is a high likelihood that this change was actually causing your incident. And we actually bubble that up and mark it as a suspect,” Ross explained.

Screenshot: FireHydrant

Like so many startups, the company developed from a pain point the founders were feeling. The three founders were responsible for solving major outages at companies like Namely, DigitalOcean, CoreOS and Paperless Post.

But the actual idea for the company came about almost accidentally. In 2017, Ross was working on a series of videos and needed a way to explain what he was teaching. “I began writing every line of code with live commentary, and soon FireHydrant started to take the shape of what I envisioned as an SRE while at Namely, and I started to want it more than the video series. 40 hours of screencasts recorded later, I decided to stop recording and focus on the product…,” Ross wrote in the blog post.

Today it integrates with PagerDuty, GitHub and Slack, but the company is just getting started with the three founders, all engineers, working on the product and a handful of beta customers. It is planning to hire more engineers to keep building out the product. It’s early days, but if this tool works as described, it could go a long way toward solving the fire-fighting issues that every company faces at some point.

Jan
08
2019
--

Amazon reportedly acquired Israeli disaster recovery service CloudEndure for around $200M

Amazon has reportedly acquired Israeli disaster recovery startup CloudEndure. Neither company has responded to our request for confirmation, but we have heard from multiple sources that the deal has happened. While some outlets have been reporting the deal was worth $250 million, we are hearing it’s closer to $200 million.

The company provides disaster recovery for cloud customers. You may be thinking that disaster recovery is precisely why we put our trust in cloud vendors. If something goes wrong, it’s the vendor’s problem — and you would be right to make this assumption, but nothing is simple. If you have a hybrid or multi-cloud scenario, you need to have ways to recover your data in the event of a disaster like weather, a cyberattack or political issue.

That’s where a company like CloudEndure comes into play. It can help you recover and get back and running in another place, no matter where your data lives, by providing a continuous backup and migration between clouds and private data centers. While CloudEndure currently works with AWS, Azure and Google Cloud Platform, it’s not clear if Amazon would continue to support these other vendors.

The company was backed by Dell Technologies Capital, Infosys and Magma Venture Partners, among others. Ray Wang, founder and principal analyst at Constellation Research, says Infosys recently divested its part of the deal and that might have precipitated the sale. “So much information is sitting in the cloud that you need backups and regions to make sure you have seamless recovery in the event of a disaster,” Wang told TechCrunch.

While he isn’t clear what Amazon will do with the company, he says it will test just how open it is. “If you have multi-cloud and want your on-prem data backed up, or if you have backup on one cloud like AWS and want it on Google or Azure, you could do this today with CloudEndure,” he said. “That’s why I’m curious if they’ll keep supporting Azure or GCP,” he added.

CloudEndure was founded in 2012 and has raised just over $18 million. Its most recent investment came in 2016 when it raised $6 million, led by Infosys and Magma.
