Google announces EPYC-based Tau virtual machines for Cloud

Google this morning announced the launch of Tau, a new family of virtual machines built on AMD’s third-generation EPYC processors. According to the company, the new x86-compatible VMs offer a 42% price-performance improvement over standard VMs. Google notably first started using AMD EPYC processors for Cloud back in 2017, while Amazon Web Services’ EPYC offerings date back to 2018.

Google claims the Tau family “leapfrogs” existing cloud VMs. The VMs come in a variety of configurations, with up to 60 vCPUs per VM and 4 GB of memory per vCPU. Networking bandwidth goes up to 32 Gbps, and the VMs can be paired with a variety of network-attached storage options.

“Customers across every industry are dealing with more demanding and data-intensive workloads and looking for strategic ways to speed up performance and reduce costs,” Google Cloud CEO Thomas Kurian said in a press release. “Our work with key strategic partners like AMD has allowed us to broaden our offerings and deliver customers the best price performance for compute-heavy, business-critical applications, all on the cleanest cloud in the industry.”


Google has already signed up some high-profile customers for an early trial, including Twitter, Snap and DoIT.

“High performance at the right price point is a critical consideration as we work to serve the global public conversation,” Twitter Platform Lead Nick Tornow said in a blog post. “We are excited by initial tests that show potential for double digit performance improvement. We are collaborating with Google Cloud to more deeply evaluate benefits on price and performance for specific compute workloads that we can realize through use of the new Tau VM family.”


The Tau VMs will arrive on Google Cloud in Q3 of this year, and the company has already opened pre-registration for customers. Pricing depends on the configuration. For example, a 32-vCPU VM with 128 GB of RAM will run around $1.35 an hour.
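The quoted example can be turned into a few derived figures with some rough arithmetic. The sketch below makes three assumptions. It treats the $1.35 hourly price as exact. It applies the stated ratio of 4 GB of memory per vCPU. It approximates a month as 730 hours. Actual bills will differ with discounts, region and usage.

```python
# Back-of-the-envelope figures for the quoted Tau example configuration.
# Assumptions: the $1.35/hour price is taken as exact, memory scales at
# 4 GB per vCPU, and a month is approximated as 730 hours.
GB_PER_VCPU = 4
HOURS_PER_MONTH = 730


def tau_estimate(vcpus: int, hourly_price: float) -> dict:
    """Return the memory size and simple cost breakdowns for one VM."""
    return {
        "memory_gb": vcpus * GB_PER_VCPU,
        "price_per_vcpu_hour": hourly_price / vcpus,
        "approx_monthly_price": hourly_price * HOURS_PER_MONTH,
    }


example = tau_estimate(vcpus=32, hourly_price=1.35)
print(example["memory_gb"])                       # 128
print(round(example["price_per_vcpu_hour"], 4))   # 0.0422
print(round(example["approx_monthly_price"], 2))  # 985.5
```

The memory figure matches the 128 GB quoted for the 32-vCPU example. The largest 60-vCPU shape would have 240 GB of memory at the same ratio.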


How to Generate an Ansible Inventory with Terraform


Creating and maintaining an inventory file is one of the common tasks we have to deal with when working with Ansible. When dealing with a large number of hosts, handling this task manually can get complex.

There are some plugins available to automatically generate inventory files by interacting with most cloud providers’ APIs. In this post, however, I am going to show you how to generate and maintain an inventory file directly from within the Terraform code.

To illustrate this, we will generate an Ansible inventory file to deploy a MongoDB sharded cluster on the Google Cloud Platform. If you want to know more about Ansible provisioning, you might want to check out my talk at Percona Live Online 2021.

The Ansible Inventory

The format of the inventory file is closely related to the actual Ansible code that is used to deploy the software. The structure we need is defined as follows:

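[cfg]
tst-mongo-cfg00 mongodb_primary=True

[shard0]
tst-mongo-shard00svr0 mongodb_primary=True
tst-mongo-shard00svr1
tst-mongo-shard00svr2

[shard1]
tst-mongo-shard01svr0 mongodb_primary=True
tst-mongo-shard01svr1
tst-mongo-shard01svr2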


We have to specify a group for each shard’s replica set (shardN), as well as for the config servers (cfg) and the mongos routers (mongos).

The mongodb_primary host variable designates a primary server by giving it a higher priority in the replica set configuration.

Now that we know the structure we need, let’s start by provisioning our hardware resources with Terraform.

Creating Instances in GCP

Here is an example of a Terraform script that creates the instances for the shards. The Terraform count meta-argument creates multiple copies of a given resource, so we compute it dynamically from the number of shards and the number of replicas per shard.

resource "google_compute_instance" "shard" {
  # One instance per replica of every shard
  count        = var.shard_count * var.shardsvr_replicas
  name         = "tst-mongo-shard0${floor(count.index / var.shardsvr_replicas)}svr${count.index % var.shardsvr_replicas}"
  machine_type = var.shardsvr_type
  # Spread the members of each shard across the region's zones
  zone         = data.google_compute_zones.available.names[count.index % var.shardsvr_replicas]

  labels = {
    ansible-group = floor(count.index / var.shardsvr_replicas)
    ansible-index = count.index % var.shardsvr_replicas
  }

  boot_disk {
    initialize_params {
      image = lookup(var.centos_amis, var.region)
    }
  }

  network_interface {
    network    = google_compute_network.vpc-network.id
    subnetwork = google_compute_subnetwork.vpc-subnet.id
  }
}

The above code distributes the instances across the available zones within the region for high availability. As you can see, some expressions are used to generate the name and the labels; I will get to these shortly. We can use similar scripts to create the instances for the config servers and the mongos nodes.
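The Python sketch below mimics the count.index arithmetic from the shard resource, using the default of 2 shards with 3 replicas each, so you can see what the expressions produce. It reproduces only the arithmetic, not Terraform itself. The zone names are hypothetical placeholders for data.google_compute_zones.available.names.

```python
import math

# Mimics the count.index expressions of the shard resource.
# Assumption: ZONES is a hypothetical stand-in for
# data.google_compute_zones.available.names in the chosen region.
SHARD_COUNT = 2
SHARDSVR_REPLICAS = 3
ZONES = ["zone-a", "zone-b", "zone-c"]  # hypothetical placeholder names


def shard_instances(shard_count: int, replicas: int, zones: list) -> list:
    instances = []
    for index in range(shard_count * replicas):  # count = shards * replicas
        group = math.floor(index / replicas)     # ansible-group label
        member = index % replicas                # ansible-index label
        instances.append({
            "name": f"tst-mongo-shard0{group}svr{member}",
            "zone": zones[index % replicas],     # same expression as the resource
            # GCP label values are strings, so the numbers end up as text
            "labels": {"ansible-group": str(group), "ansible-index": str(member)},
        })
    return instances


for inst in shard_instances(SHARD_COUNT, SHARDSVR_REPLICAS, ZONES):
    print(inst["name"], inst["zone"], inst["labels"])
```

Two consequences of the expressions become visible here. First, the zone index is the replica number, so the region needs at least as many zones as shardsvr_replicas. Second, the literal "0" in the name template means a tenth shard would be named shard010.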

This is an example of the variables file:

variable "centos_amis" {
  description = "CentOS AMIs by region"
  default = {
    northamerica-northeast1 = "centos-7-v20210420"
  }
}

variable "shardsvr_type" {
  default     = "e2-small"
  description = "instance type of the shard server"
}

# lookup(var.centos_amis, var.region) fails unless the region is also
# a key in centos_amis, so keep these two variables in sync.
variable "region" {
  type    = string
  default = "us-east1"
}

variable "shard_count" {
  type        = number
  default     = 2
  description = "Number of shards"
}

variable "shardsvr_replicas" {
  type        = number
  default     = 3
  description = "How many replicas per shard"
}

Identifying the Resources

We need an easy way to identify our resources in order to work with them effectively. One way is to define labels in the Terraform code that creates the instances; we could also use instance tags instead.

For the shards, the key part is that we need to generate a group for each shard dynamically, depending on how many shards we are provisioning. We need some extra information:

  • the group names (e.g. shard0, shard1) for each host
  • which member within a group we are working with

We define two labels for the purpose of iterating through the resources:

labels = {
    ansible-group = floor(count.index / var.shardsvr_replicas)
    ansible-index = count.index % var.shardsvr_replicas
}

For the ansible-group label, the variable shardsvr_replicas defines how many members each shard’s replica set has (3 by default). So for 2 shards, count.index runs from 0 to 5, and the expression above gives us the following output:

count.index               0  1  2  3  4  5
floor(count.index / 3)    0  0  0  1  1  1

These values will be useful for matching each host with the corresponding group.

Now let’s see how to generate the list of servers per group. For the ansible-index, the expression gives us the following output:

count.index               0  1  2  3  4  5
count.index % 3           0  1  2  0  1  2

For the mongos routers and config servers, it is easier, since a fixed group name is enough:

labels = {
    ansible-group = "mongos"
}

labels = {
    ansible-group = "cfg"
}

So with the above, we should have every piece of information we need.

Creating an Output File

The local_file resource, from Terraform’s hashicorp/local provider, generates a file. We will use it to render our inventory file based on the contents of the inventory.tmpl template.

We have to pass the values we need as arguments to the template, as it is not possible to reference outside variables from within it. In addition to the counters we had defined, we need the actual hostnames and the total number of shards. This is what it looks like:

resource "local_file" "ansible_inventory" {
  # templatefile() only sees the variables passed in this map
  content = templatefile("inventory.tmpl", {
    ansible_group_shards = google_compute_instance.shard.*.labels.ansible-group
    ansible_group_index  = google_compute_instance.shard.*.labels.ansible-index
    hostname_shards      = google_compute_instance.shard.*.name
    ansible_group_cfg    = google_compute_instance.cfg.*.labels.ansible-group
    hostname_cfg         = google_compute_instance.cfg.*.name
    ansible_group_mongos = google_compute_instance.mongos.*.labels.ansible-group
    hostname_mongos      = google_compute_instance.mongos.*.name
    number_of_shards     = range(var.shard_count)
  })
  filename = "inventory"
}

The output will be a file called inventory, stored on the machine where we are running Terraform.

Template Generation

On the template, we need to loop through the different groups and print the information we have in the proper format.

To decide whether to print the mongodb_primary attribute, we check whether the loop index is the first one (0) and print an empty string otherwise. For the actual shards, we can easily generate the group name from our previously defined variable and check whether a host should be included in the group.

Anything between %{} is a directive, while ${} is used to substitute the arguments we passed to the template. The ~ strip marker removes the adjacent whitespace, which gets rid of the extra newlines.

[cfg]
%{ for index, group in ansible_group_cfg ~}
${ hostname_cfg[index] } ${ index == 0 ? "mongodb_primary=True" : "" }
%{ endfor ~}

%{ for shard_index in number_of_shards ~}
[shard${ shard_index }]
%{ for index, group in ansible_group_shards ~}
${ group == tostring(shard_index) && ansible_group_index[index] == "0" ? join(" ", [ hostname_shards[index], "mongodb_primary=True\n" ]) : "" ~}
${ group == tostring(shard_index) && ansible_group_index[index] != "0" ? join("", [ hostname_shards[index], "\n" ]) : "" ~}
%{ endfor ~}
%{ endfor ~}

[mongos]
%{ for index, group in ansible_group_mongos ~}
${ hostname_mongos[index] }
%{ endfor ~}
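The Python sketch below re-implements the same grouping logic outside Terraform, which makes it easier to check the expected inventory. It does not use Terraform's template engine and ignores the exact whitespace the strip markers produce. The inputs follow the shard naming scheme. The cfg and mongos host names other than tst-mongo-cfg00 are hypothetical.

```python
# Re-implementation of the inventory.tmpl grouping logic for checking output.
# Assumptions: label values arrive as strings (as GCP labels do); the cfg and
# mongos host names in the example call are hypothetical.


def render_inventory(hostname_cfg, ansible_group_shards, ansible_group_index,
                     hostname_shards, hostname_mongos, number_of_shards):
    lines = ["[cfg]"]
    for index, host in enumerate(hostname_cfg):
        # The first config server gets the mongodb_primary flag
        lines.append(f"{host} mongodb_primary=True" if index == 0 else host)
    lines.append("")
    for shard_index in number_of_shards:
        lines.append(f"[shard{shard_index}]")
        for index, group in enumerate(ansible_group_shards):
            if group != str(shard_index):
                continue  # host belongs to a different shard
            host = hostname_shards[index]
            primary = ansible_group_index[index] == "0"
            lines.append(f"{host} mongodb_primary=True" if primary else host)
        lines.append("")
    lines.append("[mongos]")
    lines.extend(hostname_mongos)
    return "\n".join(lines) + "\n"


# Example inputs for 2 shards x 3 replicas
inventory = render_inventory(
    hostname_cfg=["tst-mongo-cfg00", "tst-mongo-cfg01", "tst-mongo-cfg02"],
    ansible_group_shards=[str(i // 3) for i in range(6)],
    ansible_group_index=[str(i % 3) for i in range(6)],
    hostname_shards=[f"tst-mongo-shard0{i // 3}svr{i % 3}" for i in range(6)],
    hostname_mongos=["tst-mongo-router00"],
    number_of_shards=range(2),
)
print(inventory)
```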

Final Words

Together, Terraform and Ansible are a powerful combination for infrastructure provisioning and management: we can automate everything from hardware deployment to software installation.

The ability to generate a file from Terraform is quite handy for the purpose of creating our Ansible inventory. Since it is responsible for the actual provisioning, Terraform has all the information we need, although it can be a bit tricky to get the templating right.


Google’s Anthos multicloud platform gets improved logging, Windows container support and more

Google today announced a sizable update to its Anthos multicloud platform that lets you build, deploy and manage containerized applications anywhere, including on Amazon’s AWS and (in preview) on Microsoft Azure.

Version 1.7 includes new features like improved metrics and logging for Anthos on AWS, a new Connect gateway to interact with any cluster right from Google Cloud and a preview of Google’s managed control plane for Anthos Service Mesh. Other new features include Windows container support for environments that use VMware’s vSphere platform and new tools for developers to make it easier for them to deploy their applications to any Anthos cluster.

Today’s update comes almost exactly two years after Google CEO Sundar Pichai originally announced Anthos at its Cloud Next event in 2019 (before that, Google called the project the “Google Cloud Services Platform,” which launched three years ago). Hybrid and multicloud, it’s fair to say, play a key role in the Google Cloud roadmap, perhaps more so for Google than for any of its competitors. Recently, Google brought on industry veteran Jeff Reed as VP of Product Management in charge of Anthos.

Reed told me that he believes that there are a lot of factors right now that are putting Anthos in a good position. “The wind is at our back. We bet on Kubernetes, bet on containers — those were good decisions,” he said. Increasingly, customers are also now scaling out their use of Kubernetes and have to figure out how to best scale out their clusters and deploy them in different environments — and to do so, they need a consistent platform across these environments. He also noted that when it comes to bringing on new Anthos customers, it’s really those factors that determine whether a company will look into Anthos or not.

He acknowledged that there are other players in this market, but he argues that Google Cloud’s take on this is quite different. “I think we’re pretty unique in the sense that we’re from the cloud, cloud-native is our core approach,” he said. “A lot of what we talk about in [Anthos] 1.7 is about how we leverage the power of the cloud and use what we call ‘an anchor in the cloud’ to make your life much easier. We’re more like a cloud vendor there, but because we support on-prem, we see some of those other folks.” Those other folks include IBM/Red Hat’s OpenShift and VMware’s Tanzu, for example.

The addition of support for Windows containers in vSphere environments also points to the fact that a lot of Anthos customers are classical enterprises that are trying to modernize their infrastructure, yet still rely on a lot of legacy applications that they are now trying to bring to the cloud.

Looking ahead, one thing we’ll likely see is a wider range of Google Cloud products integrated into Anthos. And indeed, as Reed noted, more teams inside Google Cloud are now building their products on top of Anthos themselves. In turn, that makes it easier to bring those services to an Anthos-managed environment anywhere. One of the first internal services to run on top of Anthos is Apigee. “Your Apigee deployment essentially has Anthos underneath the covers. So Apigee gets all the benefits of a container environment, scalability and all those pieces, and we’ve made it really simple for that whole environment to run kind of as a stack,” he said.

I guess we can expect to hear more about this in the near future — or at Google Cloud Next 2021.


Deploying a MongoDB Proof of Concept on Google Cloud Platform

Recently, I needed to set up a Proof of Concept (POC) and wanted to do it on Google Cloud Platform (GCP). After documenting the process, it seemed it might be helpful for others looking for the most basic guide possible to get a MongoDB server up and running on GCP. The process below sets up the latest release of Percona Server for MongoDB 4.4 on a Virtual Machine (VM) in GCP. This is a minimal install, meant as a starting point for further work. I will also be using the free account on GCP to do this.

The first step is to set up SSH access to the node. On my Mac, I ran the following command, which should work equally well on Linux:

ssh-keygen -t rsa -f ~/.ssh/gcp -C [USERNAME]

I named my key “gcp” in the example above, but you can use an existing key or generate a new one with whatever name you want.
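The -C flag sets the key comment to your username, so the public half of the key (~/.ssh/gcp.pub for a key named gcp) is a single line of the form `ssh-rsa <base64 key> [USERNAME]`. Compute Engine stores SSH keys in metadata as `USERNAME:KEY_VALUE` entries. As far as I know, the console derives the username from the key's trailing comment. The hypothetical helper below sanity-checks a public key line and builds that metadata-style entry. It is a convenience sketch, not part of any Google tooling.

```python
import base64


def parse_public_key(line: str) -> tuple:
    """Split an OpenSSH public key line into (key_type, key_body, comment)."""
    parts = line.strip().split(None, 2)
    if len(parts) != 3:
        raise ValueError("expected '<type> <base64-key> <comment>'")
    key_type, key_body, comment = parts
    # The base64 body begins with the key type as a length-prefixed string,
    # so decoding it catches copy/paste damage cheaply.
    decoded = base64.b64decode(key_body, validate=True)
    name_len = int.from_bytes(decoded[:4], "big")
    if decoded[4:4 + name_len].decode() != key_type:
        raise ValueError("key body does not match key type")
    return key_type, key_body, comment


def gce_metadata_entry(line: str) -> str:
    """Build a USERNAME:KEY_VALUE entry, using the key comment as the username."""
    key_type, key_body, comment = parse_public_key(line)
    return f"{comment}:{key_type} {key_body} {comment}"
```

For example, passing the contents of ~/.ssh/gcp.pub to gce_metadata_entry() shows the username GCP will associate with the key. That username should be the one you use in the ssh command later.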

From there, log in to the GCP console in a browser and do some simple configuration. First, create a project and then add an instance, choosing a Region and Zone. As the final piece of the VM’s basic configuration, choose the machine type you want; for my testing, an e2-medium is sufficient. I also accepted the default disk size and type.


Next, edit the instance details, go to the SSH Keys section, and add the contents of your public key file (~/.ssh/gcp.pub in my case). Your key will be a lot longer, but it will look something like this:

ssh-rsa AAAAB3NzaC1yc2E... [USERNAME]

Save the details and note the public IP of the node. Then test logging in with your key to make sure you can get into the server. I tested my access with the command below; replace the key name (gcp in my case), username, and public IP with your own:

ssh -i ~/.ssh/gcp [USERNAME]@[PUBLIC IP]

Our next step will be to install Percona Server for MongoDB.  We will do this as painlessly as possible using Percona’s RPMs.  We will start by setting up the repo:

sudo yum install https://repo.percona.com/yum/percona-release-latest.noarch.rpm
sudo percona-release enable psmdb-44 release

With the repo configured, we will install MongoDB with the following command:

sudo yum install percona-server-mongodb

You will likely want to enable the service so that it starts at boot, and start it right away:

sudo systemctl enable --now mongod

By default, MongoDB does not require authentication. If you want to enable it, you can use the following command to set up access:

sudo /usr/bin/percona-server-mongodb-enable-auth.sh

Here’s more information on enabling authentication on Percona Server for MongoDB.

Again, this is the most basic installation of Percona Server for MongoDB on Google Cloud Platform. This guide is meant for those looking for a basic introduction to both platforms who just want to get their proverbial hands dirty with a simple POC.



Google Cloud joins the FinOps Foundation

Google Cloud today announced that it is joining the FinOps Foundation as a Premier Member.

The FinOps Foundation is a relatively new open-source foundation, hosted by the Linux Foundation, that launched last year. It aims to bring together companies in the “cloud financial management” space to establish best practices and standards. As the term implies, “cloud financial management” is about the tools and practices that help businesses manage and budget their cloud spend. There’s a reason, after all, that there are a number of successful startups that do nothing else but help businesses optimize their cloud spend (and ideally lower it).

Maybe it’s no surprise that the FinOps Foundation was born out of Cloudability’s quarterly Customer Advisory Board meetings. Until now, CloudHealth by VMware was the Foundation’s only Premier Member among its vendor members. Other members include Cloudability, Densify, Kubecost and SoftwareOne. With Google Cloud, the Foundation has now signed up its first major cloud provider.

“FinOps best practices are essential for companies to monitor, analyze and optimize cloud spend across tens to hundreds of projects that are critical to their business success,” said Yanbing Li, vice president of Engineering and Product at Google Cloud. “More visibility, efficiency and tools will enable our customers to improve their cloud deployments and drive greater business value. We are excited to join FinOps Foundation, and together with like-minded organizations, we will shepherd behavioral change throughout the industry.”

Google Cloud has already committed to sending members to some of the Foundation’s various Special Interest Groups (SIGs) and Working Groups to “help drive open-source standards for cloud financial management.”

“The practitioners in the FinOps Foundation greatly benefit when market leaders like Google Cloud invest resources and align their product offerings to FinOps principles and standards,” said J.R. Storment, executive director of the FinOps Foundation. “We are thrilled to see Google Cloud increase its commitment to the FinOps Foundation, joining VMware as the second of three dedicated Premier Member Technical Advisory Council seats.”


Google Cloud launches a new support option for mission critical workloads

Google Cloud today announced the launch of a new support option for its Premium Support customers that run mission-critical services on its platform. The new service, imaginatively dubbed Mission Critical Services (MCS), brings Google’s own experience with Site Reliability Engineering to its customers. This is not Google completely taking over the management of these services, though. Instead, the company describes it as a “consultative offering in which we partner with you on a journey toward readiness.”

Initially, Google will work with its customers to improve — or develop — the architecture of their apps and help them instrument the right monitoring systems and controls, as well as help them set and raise their service-level objectives (a key feature in the Site Reliability Engineering philosophy).

Later, Google will also provide ongoing check-ins with its engineers and walk customers through tune-ups and architecture reviews. “Our highest tier of engineers will have deep familiarity with your workloads, allowing us to monitor, prevent, and mitigate impacts quickly, delivering the fastest response in the industry. For example, if you have any issues, 24 hours a day, seven days a week, we’ll spin up a live war room with our experts within five minutes,” Google Cloud’s VP for Customer Experience, John Jester, explains in today’s announcement.

This new offering is another example of how Google Cloud is trying to differentiate itself from the rest of the large cloud providers. Its emphasis today is on providing the high-touch service experiences that were long missing from its platform, with a clear focus on the needs of large enterprise customers. That’s what Thomas Kurian promised to do when he became the organization’s CEO, and he’s clearly following through.




Google Cloud launches Apigee X, the next generation of its API management platform

Google today announced the launch of Apigee X, the next major release of the Apigee API management platform it acquired back in 2016.

“If you look at what’s happening — especially after the pandemic started in March last year — the volume of digital activities has gone up in every kind of industry, all kinds of use cases are coming up. And one of the things we see is the need for a really high-performance, reliable, global digital transformation platform,” Amit Zavery, Google Cloud’s head of platform, told me.

He noted that the number of API calls has gone up 47 percent from last year and that the platform now handles about 2.2 trillion API calls per year.

At the core of the updates are deeper integrations with Google Cloud’s AI, security and networking tools. In practice, this means Apigee users can now deploy their APIs across 24 Google Cloud regions, for example, and use Google’s caching services in more than 100 edge locations.


In addition, Apigee X now integrates with Google’s Cloud Armor firewall and its Cloud Identity and Access Management (IAM) platform, which means Apigee users won’t have to use third-party tools for their firewall and identity management needs.

“We do a lot of AI/ML-based anomaly detection and operations management,” Zavery explained. “We can predict any kind of malicious intent or any other things which might happen to those API calls or your traffic by embedding a lot of those insights into our API platform. I think [that] is a big improvement, as well as new features, especially in operations management, security management, vulnerability management and making those a core capability so that as a business, you don’t have to worry about all these things. It comes with the core capabilities and that is really where the front doors of digital front-ends can shine and customers can focus on that.”

The platform now also makes better use of Google’s AI capabilities to help users identify anomalies or predict traffic for peak seasons. The idea here is to help customers automate many standard tasks and, of course, improve security at the same time.

As Zavery stressed, API management is now about more than just managing traffic between applications. But more than just helping customers manage their digital transformation projects, the Apigee team is now thinking about what it calls ‘digital excellence.’ “That’s how we’re thinking of the journey for customers moving from not just ‘hey, I can have a front end,’ but what about all the excellent things you want to do and how we can do that,” Zavery said.

“During these uncertain times, organizations worldwide are doubling-down on their API strategies to operate anywhere, automate processes, and deliver new digital experiences quickly and securely,” said James Fairweather, Chief Innovation Officer at Pitney Bowes. “By powering APIs with new capabilities like reCAPTCHA Enterprise, Cloud Armor (WAF), and Cloud CDN, Apigee X makes it easy for enterprises like us to scale digital initiatives, and deliver innovative experiences to our customers, employees and partners.”


Google Cloud Platform: MySQL at Scale with Reliable HA Webinar Q&A


Earlier in November, we presented the webinar “Google Cloud Platform: MySQL at Scale with Reliable HA,” in which we discussed different approaches to hosting MySQL on Google Cloud Platform, along with the pros and cons of each option. The webinar was recorded and can be viewed here at any time. We received several great questions, which we would like to address here, elaborating on the answers given during the webinar.


Q: What is your view on Cloud SQL High Availability in Google Cloud?

A: Google Cloud SQL provides High Availability through regional instances. If your Cloud SQL database is regional, there is a standby instance in another zone within the same region. Both instances (primary and standby) are kept in sync through synchronous replication at the persistent disk level. Thanks to this approach, no data is lost in the event of an unexpected failover. The biggest disadvantage is that you pay for standby resources even though you can’t use the standby instance for any traffic, which doubles your costs with no performance benefit. Failover typically takes more than 30 seconds.

To sum up, High Availability in Google Cloud SQL is reliable but can be expensive, and failover may not be fast enough for critical applications.


Q: How would one migrate from Google Cloud SQL to AWS RDS?

A: If you can afford downtime, the easiest way to migrate is to stop the write workload to the Cloud SQL instance, take a logical backup (with mysqldump or mydumper), restore it on AWS RDS, and then move the entire workload to AWS RDS. In most cases, however, that is not enough; the situation is more complex when you need to migrate with no (or minimal) downtime.

To avoid downtime, you need to establish replication between your Cloud SQL instance (source) and your RDS instance (replica). Cloud SQL can be used as a source instance for external replicas, as described in this documentation. You can take a logical backup from a running Cloud SQL instance (e.g., using mydumper), restore it to RDS, and establish replication between Cloud SQL and RDS. Using an external source for RDS is described here. It is typically a good idea to use a VPN connection between the two cloud regions so that the connection is secure and the database is not exposed to the public internet. Once replication is established, the steps are as follows:

  • Stop write traffic on Google Cloud SQL instance
  • Wait for the replication to catch up (sync all binlogs)
  • Make RDS instance writable and stop Cloud SQL -> RDS replication
  • Move write traffic to the RDS instance
  • Decommission your Cloud SQL instance
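The "wait for the replication to catch up" step can be made explicit. Once writes on Cloud SQL have stopped, the replica is caught up when both replication threads are running and the source binlog coordinates the replica has executed equal the source's current coordinates. The sketch below is a pure function over the relevant fields. It assumes the dicts hold a SHOW MASTER STATUS row from the source and a SHOW REPLICA STATUS row from the replica, using MySQL 8.0.22+ column names. Older versions use SHOW SLAVE STATUS with Slave_* and *_Master_* names instead. Fetching the rows is left to whatever client you use.

```python
# Cutover check for a write-idle source. Assumptions: source_status is a
# SHOW MASTER STATUS row and replica_status a SHOW REPLICA STATUS row
# (MySQL 8.0.22+ column names), both given as dicts of strings.


def replica_caught_up(source_status: dict, replica_status: dict) -> bool:
    threads_ok = (replica_status["Replica_IO_Running"] == "Yes"
                  and replica_status["Replica_SQL_Running"] == "Yes")
    # Relay_Source_Log_File / Exec_Source_Log_Pos give the source binlog
    # coordinates of the last event the replica has executed.
    same_file = replica_status["Relay_Source_Log_File"] == source_status["File"]
    same_pos = int(replica_status["Exec_Source_Log_Pos"]) == int(source_status["Position"])
    return threads_ok and same_file and same_pos
```

Comparing coordinates is stricter than checking Seconds_Behind_Source alone. That value can read 0 while the I/O thread has not yet fetched the latest events.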

AWS Database Migration Service (DMS) can also be used as an intermediary in this operation.


Q: Is replication possible cross-cloud, e.g., Google Cloud SQL to AWS RDS, AWS RDS to Google Cloud SQL? If GCP is down, will RDS act as a primary and vice versa?

A: In general, replication between clouds is possible (see the previous question). Both Google Cloud SQL and AWS RDS can act as a source and as a replica, including with external instances as part of your replication topology. High-availability solutions, however, are specific to each cloud provider’s implementation, and they can’t cooperate. So it’s not possible to fail over automatically from RDS to GCP and vice versa. For such setups, we recommend a custom installation on Google Compute Engine instances and AWS EC2, with Percona Managed Database Services if you don’t want to manage such a complex setup on your own.

Q: How did you calculate IOPS and throughput for the storage options?

A: We did not calculate the presented values in any way. Those are taken directly from Google Cloud Platform Documentation.

Q: How does GCP achieve synchronous replication?

A: Synchronous replication is possible only between the primary and its standby instance; you can’t have synchronous replication between the primary and your read replicas. Each instance has its own persistent disk, and those disks are kept in sync, so replication happens at the storage layer, not the database layer. Implementation details of how this works are not publicly available.

Q: Could you explain how to keep the primary instance available and writable during the maintenance window?

A: It’s not possible to guarantee the primary instance’s availability. Even if you choose a maintenance window during which you can accept downtime, it may or may not be followed (it’s just a preference). Critical maintenance events can happen at any time and may not finish within the assigned window. If your application can’t accept that, we recommend designing a highly available solution instead, e.g., with Percona XtraDB Cluster on Google Compute Engine instances. Such a solution doesn’t have these maintenance-window problems.


Google acquires Actifio to step into the area of data management and business continuity

In the same week that Amazon is holding its big AWS confab, Google is also announcing a move to raise its own enterprise game with Google Cloud. Today the company announced that it is acquiring Actifio, a data management company that helps companies with data continuity to be better prepared in the event of a security breach or other need for disaster recovery. The deal squares Google up as a competitor against the likes of Rubrik, another big player in data continuity.

The terms of the deal were not disclosed in the announcement; we’re looking and will update as we learn more. Notably, when the company was valued at over $1 billion in a funding round back in 2014, it had said it was preparing for an IPO (which never happened). PitchBook data estimated its value at $1.3 billion in 2018, but earlier this year it appeared to be raising money at about a 60% discount to its recent valuation, according to data provided to us by Prime Unicorn Index.

The company was also involved in a patent infringement suit against Rubrik, which it filed earlier this year.

It had raised around $461 million, with investors including Andreessen Horowitz, TCV, Tiger, 83 North, and more.

With Actifio, Google is moving into what is one of the key investment areas for enterprises in recent years. The growth of increasingly sophisticated security breaches, coupled with stronger data protection regulation, has given a new priority to the task of holding and using business data more responsibly, and business continuity is a cornerstone of that.

Google describes the startup as a “leader in backup and disaster recovery” that provides virtual copies of data that can be managed and updated for storage, testing and more. Because it covers data in a number of environments (including SAP HANA, Oracle, Microsoft SQL Server, PostgreSQL and MySQL databases; virtual machines (VMs) in VMware and Hyper-V; physical servers; and, of course, Google Compute Engine), it also gives Google a strong play to work with companies in hybrid and multi-vendor environments rather than just all-Google shops.

“We know that customers have many options when it comes to cloud solutions, including backup and DR, and the acquisition of Actifio will help us to better serve enterprises as they deploy and manage business-critical workloads, including in hybrid scenarios,” writes Brad Calder, VP, engineering, in the blog post. “In addition, we are committed to supporting our backup and DR technology and channel partner ecosystem, providing customers with a variety of options so they can choose the solution that best fits their needs.”

The company will join Google Cloud.

“We’re excited to join Google Cloud and build on the success we’ve had as partners over the past four years,” said Ash Ashutosh, CEO at Actifio, in a statement. “Backup and recovery is essential to enterprise cloud adoption and, together with Google Cloud, we are well-positioned to serve the needs of data-driven customers across industries.”


Using Volume Snapshot/Clone in Kubernetes

One of the most exciting storage-related features in Kubernetes is volume snapshot and clone. It allows you to take a snapshot of a data volume and later clone it into a new volume, which opens up a variety of possibilities such as instant backups or testing upgrades. This feature also brings Kubernetes deployments closer to cloud providers, which let you take volume snapshots with one click.

Word of caution: for a database, it might still be necessary to apply fsfreeze and FLUSH TABLES WITH READ LOCK or …

This is much easier now in MySQL 8: as with atomic DDL, MySQL 8 should provide crash-safe, consistent snapshots without additional locking.

Let’s review how we can use this feature with Google Kubernetes Engine (GKE) and the Percona Kubernetes Operator for XtraDB Cluster.

First, the snapshot feature is still in beta, so it is not available by default. You need GKE version 1.14 or later, and you need to enable the “Compute Engine persistent disk CSI Driver” in your cluster, as described here: https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/gce-pd-csi-driver#enabling_on_a_new_cluster.

Now we need to create a cluster that uses storageClassName: standard-rwo for its PersistentVolumeClaims. The relevant part of the resource definition looks like this:

    volumeSpec:
      persistentVolumeClaim:
        storageClassName: standard-rwo
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 11Gi

Let’s assume we have cluster1 running:

NAME                                               READY   STATUS    RESTARTS   AGE
cluster1-haproxy-0                                 2/2     Running   0          49m
cluster1-haproxy-1                                 2/2     Running   0          48m
cluster1-haproxy-2                                 2/2     Running   0          48m
cluster1-pxc-0                                     1/1     Running   0          50m
cluster1-pxc-1                                     1/1     Running   0          48m
cluster1-pxc-2                                     1/1     Running   0          47m
percona-xtradb-cluster-operator-79d786dcfb-btkw2   1/1     Running   0          5h34m

Now we want to clone it into a new cluster provisioned with the same dataset. Of course, this can be done by restoring a backup into a new volume, but snapshot and clone make it much easier. A few additional steps are still required; I will list them as a cheat sheet.

1. Create VolumeSnapshotClass (I am not sure why this one is not present by default)

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotClass
metadata:
  name: onesc
driver: pd.csi.storage.gke.io
deletionPolicy: Delete
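If you prefer to generate these manifests from a script instead of writing YAML by hand, the class above can be expressed as a plain Python dictionary and printed as JSON, which `kubectl apply -f -` also accepts. This is a minimal sketch: the helper name `volume_snapshot_class` is hypothetical, and its defaults simply mirror the YAML above.

```python
import json


def volume_snapshot_class(name: str,
                          driver: str = "pd.csi.storage.gke.io",
                          deletion_policy: str = "Delete") -> dict:
    """Build a VolumeSnapshotClass manifest (hypothetical helper)."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1beta1",
        "kind": "VolumeSnapshotClass",
        "metadata": {"name": name},
        # CSI driver that takes the snapshots: the GKE persistent disk driver.
        "driver": driver,
        # With "Delete", the underlying disk snapshot is removed along with
        # the Kubernetes snapshot objects.
        "deletionPolicy": deletion_policy,
    }


if __name__ == "__main__":
    # Pipe this script's output into `kubectl apply -f -`.
    print(json.dumps(volume_snapshot_class("onesc"), indent=2))
```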

2. Create snapshot

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: snapshot-for-newcluster
spec:
  volumeSnapshotClassName: onesc
  source:
    persistentVolumeClaimName: datadir-cluster1-pxc-0
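The same approach works for the snapshot itself. The sketch below (helper names hypothetical) builds the manifest and adds a small check on the object returned by `kubectl get volumesnapshot snapshot-for-newcluster -o json`: the v1beta1 API reports `status.readyToUse`, and it seems sensible to wait until it is true before cloning from the snapshot.

```python
import json


def volume_snapshot(name: str, source_pvc: str, snapshot_class: str) -> dict:
    """Build a VolumeSnapshot manifest for an existing PVC (hypothetical helper)."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1beta1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name},
        "spec": {
            # Selects the VolumeSnapshotClass, and therefore the CSI driver.
            "volumeSnapshotClassName": snapshot_class,
            # The source PVC must live in the same namespace as the snapshot.
            "source": {"persistentVolumeClaimName": source_pvc},
        },
    }


def snapshot_ready(obj: dict) -> bool:
    """True once the snapshot controller has set status.readyToUse."""
    return bool(obj.get("status", {}).get("readyToUse", False))


if __name__ == "__main__":
    snap = volume_snapshot("snapshot-for-newcluster",
                           "datadir-cluster1-pxc-0", "onesc")
    print(json.dumps(snap, indent=2))
```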

3. Clone into a new volume

Here I should note that the new volume must follow the volume naming convention used by the Percona XtraDB Cluster Operator, which is:

datadir-CLUSTERNAME-pxc-0

where CLUSTERNAME is the name used when the cluster is created. So now we can clone the snapshot into a volume named:

datadir-newcluster-pxc-0

where newcluster is the name of the new cluster.
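Because the operator runs Percona XtraDB Cluster as a StatefulSet, each claim is named after the claim template and the pod (`<template>-<pod name>`), which is where `datadir-CLUSTERNAME-pxc-0` comes from. A small sketch of that mapping, with hypothetical helper names, assuming the `datadir` template name and the `-pxc-<ordinal>` pod suffix seen in this post:

```python
import re

# Matches operator data claims such as "datadir-cluster1-pxc-0".
_PVC_RE = re.compile(r"^datadir-(?P<cluster>.+)-pxc-(?P<ordinal>\d+)$")


def datadir_pvc_name(cluster: str, ordinal: int = 0) -> str:
    """Name of the data volume claim for pod <cluster>-pxc-<ordinal>."""
    return f"datadir-{cluster}-pxc-{ordinal}"


def clone_target_name(source_pvc: str, new_cluster: str) -> str:
    """Map a source claim name to the matching claim name for new_cluster."""
    match = _PVC_RE.match(source_pvc)
    if match is None:
        raise ValueError(f"not an operator data volume claim: {source_pvc!r}")
    return datadir_pvc_name(new_cluster, int(match.group("ordinal")))


if __name__ == "__main__":
    print(clone_target_name("datadir-cluster1-pxc-0", "newcluster"))
```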

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: datadir-newcluster-pxc-0
spec:
  dataSource:
    name: snapshot-for-newcluster
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  storageClassName: standard-rwo
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 11Gi
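The claim can be generated the same way. This is a minimal sketch with a hypothetical helper `pvc_from_snapshot`; the `dataSource` block is what asks the CSI driver to provision the new disk from the snapshot instead of creating an empty one.

```python
import json


def pvc_from_snapshot(pvc_name: str, snapshot_name: str, storage_class: str,
                      access_modes: list, storage: str) -> dict:
    """Build a PVC pre-populated from a VolumeSnapshot (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": pvc_name},
        "spec": {
            # Restore the contents of this snapshot into the new volume.
            "dataSource": {
                "name": snapshot_name,
                "kind": "VolumeSnapshot",
                "apiGroup": "snapshot.storage.k8s.io",
            },
            "storageClassName": storage_class,
            "accessModes": list(access_modes),
            "resources": {"requests": {"storage": storage}},
        },
    }


if __name__ == "__main__":
    pvc = pvc_from_snapshot("datadir-newcluster-pxc-0", "snapshot-for-newcluster",
                            "standard-rwo", ["ReadWriteOnce"], "11Gi")
    print(json.dumps(pvc, indent=2))
```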

Important: the storageClassName, accessModes, and storage size in the volume spec should match the original volume.
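Since a mismatch here is easy to miss, one option is to compare the two claims before applying the clone, for example against the output of `kubectl get pvc datadir-cluster1-pxc-0 -o json`. The sketch below (hypothetical helper name) compares the three fields mentioned above. It compares storage sizes as plain strings, so an equivalent size written in a different unit would be reported as a mismatch.

```python
def _requested_storage(spec: dict):
    """Return spec.resources.requests.storage, or None if absent."""
    return spec.get("resources", {}).get("requests", {}).get("storage")


def clone_spec_mismatches(original: dict, clone: dict) -> list:
    """Return the fields whose values differ between two PVC manifests."""
    getters = {
        "storageClassName": lambda spec: spec.get("storageClassName"),
        "accessModes": lambda spec: sorted(spec.get("accessModes", [])),
        "storage": _requested_storage,
    }
    orig_spec = original.get("spec", {})
    clone_spec = clone.get("spec", {})
    return [field for field, get in getters.items()
            if get(orig_spec) != get(clone_spec)]


if __name__ == "__main__":
    original = {"spec": {"storageClassName": "standard-rwo",
                         "accessModes": ["ReadWriteOnce"],
                         "resources": {"requests": {"storage": "11Gi"}}}}
    clone = {"spec": {"storageClassName": "standard-rwo",
                      "accessModes": ["ReadWriteOnce"],
                      "resources": {"requests": {"storage": "10Gi"}}}}
    print(clone_spec_mismatches(original, clone))  # ['storage']
```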

Once the volume claim is created, we can start newcluster. However, there is still a caveat: we need to set:

forceUnsafeBootstrap: true

Otherwise, Percona XtraDB Cluster will detect that the data in the snapshot was not left by a clean shutdown (which is true) and will refuse to start.
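If the cluster definition is generated rather than edited by hand, the flag can be switched on programmatically. This is a sketch under assumptions: the helper name is hypothetical, and it assumes the option lives at `spec.pxc.forceUnsafeBootstrap` in the PerconaXtraDBCluster custom resource, as in operator releases from around the time of this post; newer operator versions may handle this differently, so check the documentation for yours. The resource's `metadata.name` also needs to be `newcluster`, so that the StatefulSet reuses the pre-created `datadir-newcluster-pxc-0` claim instead of provisioning an empty one.

```python
import copy


def with_unsafe_bootstrap(cr: dict) -> dict:
    """Return a copy of a PerconaXtraDBCluster resource with unsafe bootstrap on.

    ASSUMPTION: the flag is read from spec.pxc.forceUnsafeBootstrap; verify
    this against the documentation for your operator version.
    """
    new_cr = copy.deepcopy(cr)  # leave the caller's dict untouched
    pxc = new_cr.setdefault("spec", {}).setdefault("pxc", {})
    pxc["forceUnsafeBootstrap"] = True
    return new_cr


if __name__ == "__main__":
    cr = {"kind": "PerconaXtraDBCluster",
          "metadata": {"name": "newcluster"},
          "spec": {"pxc": {"size": 3}}}
    print(with_unsafe_bootstrap(cr)["spec"]["pxc"])
```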

This approach still has a limitation that you may find inconvenient: a volume can be cloned only within the same namespace, so it can’t easily be transferred from the PRODUCTION namespace into the QA namespace.

It can still be done, but it requires some extra steps and Kubernetes admin privileges; I will show how in a following blog post.
