May
15
2023
--

Proof of Concept: Horizontal Write Scaling for MySQL With Kubernetes Operator

horizontal write scaling kubernetes

Historically, MySQL has been great at horizontal READ scaling. That scaling is provided by adding Replica nodes, no matter whether you use standard asynchronous replication or synchronous replication.

However, those solutions do not offer the same level of scaling for write operations.

Why? Because the solutions still rely on writing in one single node that works as Primary. Also, in the case of multi-Primary, the writes will be distributed by transaction. In both cases, when using virtually-synchronous replication, the process will require certification from each node and local (by node) write; as such, the number of writes is NOT distributed across multiple nodes but duplicated.

The main reason behind this is that MySQL is a relational database management system (RDBMS), and any data that is going to be written in it must respect the RDBMS rules. In short, any data that is written must be consistent with the data already present. To achieve that, the data needs to be checked against the existing data through the defined relations and constraints. This check can affect very large datasets and be very expensive. Think about updating a table with millions of rows that refers to another table with another million rows.

An image may help:

data model for ecommerce

Every time I insert an order, I must be sure that all the related elements are in place and consistent.

This operation is quite expensive, but our database can run it in a few milliseconds or less, thanks to several optimizations that allow the node to execute most of the checks in memory with little or no access to mass storage.

The key factor is that the whole data structure resides in the same location (node), facilitating the operations.

Once we have understood that, it also becomes clear why we cannot split relational data across multiple nodes and distribute writes by table. If I have a node that manages only the items, another one the orders, and another one the payments, my solution will need to deal with distributed transactions, each of which needs to certify and verify data on other nodes.

This level of distribution seriously affects the efficiency of the operation and increases the response time significantly. That is it. Nothing is impossible; however, performance would be so impacted that each operation may take seconds instead of milliseconds, or fractions of a millisecond, unless we lift some of the rules and break the relational model.

MySQL, like any other RDBMS, is designed to respect that model and cannot scale by fragmenting and distributing a schema, so what can be done to scale?

The alternative is to split a consistent set of data into fragments. What is a consistent set of data? It all depends on the kind of information we are dealing with. Keeping in mind the example above, where we have an online shop serving multiple customers, we need to identify the most effective way to split the data.

For instance, if we try to split the data by Product Type (Books, CD/DVD, etc.), we will have a huge duplication of data related to customers/orders/shipments and so on, and all this data is also quite dynamic given I will have customers constantly ordering things.

Why duplicate the data? Because if I do not duplicate it, I will not know whether a customer has already bought that specific item, or I will have to ask again for the shipment address, and so on. This also means that whenever a customer buys something or puts something on the wish list, I have to reconcile the data across all my nodes/clusters.

On the other hand, if I choose to split my data by country of customer’s residence, the only data I will have to duplicate and keep in sync is the one related to the products, of which the most dynamic one will be the number of items in stock. This, of course, is unless I can organize my products by country as well, which is a bit unusual nowadays but not impossible.

Another possible case is if I am a health organization and I manage several hospitals. As for the example above, it will be easier to split my data by hospital, given most of the data related to patients is bound to the hospital itself, as well as treatments and any other element related to hospital management. In contrast, it will make no sense to split by patient’s country of residence.

This technique of splitting the data into smaller pieces is called sharding and is currently the only way we have to scale RDBMS horizontally. 

In the MySQL open source ecosystem, we have only two consolidated ways to perform sharding — Vitess and ProxySQL. The first one is a complete solution that takes ownership of your database and manages almost any aspect of its operations in a sharded environment and includes a lot of specific features for DBAs to deal with daily operations like table modifications, backup, and more.

While this may look great, it also has some strings attached, including its complexity and its somewhat proprietary environment. That makes Vitess a good fit for “complex” sharding scenarios where other solutions may not be enough.

ProxySQL does not have a sharding mechanism “per se,” but given the way it works and the features it has, it allows us to build simple sharding solutions.

It is important to note that most of the DBA operations will still fall on the DBA to execute, with increased complexity given the sharded environment.

There is a third option which is application-aware sharding.

This solution sees the application aware of the need to split the data into smaller fragments and internally point the data to different “connectors” that are connected to multiple data sources.

In this case, the application is aware of a customer’s country and will redirect all the operations related to that customer to the data source responsible for the specific fragment.

Normally this solution requires a full code redesign and could be quite difficult to achieve when it is injected after the initial code architecture definition.

On the other hand, if done at design, it is probably the best solution because it will allow the application to define the sharding rules and can also optimize the different data sources using different technologies for different uses.

One example could be using an RDBMS for most of the Online Transaction Processing (OLTP) data, sharded by country, and having the products in a distributed memory cache using a different technology. At the same time, all the data related to orders, payments, and customer history can be consolidated in a data warehouse used to generate reporting.

As said, the last one is probably the most powerful and scalable, and the most difficult to design; unfortunately, it probably represents less than 5% of the solutions currently deployed.

Likewise, very few cases really need a full system/solution to provide scalability with sharding.

In my experience, most of the needs for horizontal scaling fall into the simple scenario, where there is the need to achieve sharding and data separation, very often with a shared-nothing architecture. In shared-nothing, each shard can live in a totally separate logical schema instance / physical database server / data center / continent. There is no ongoing need to retain shared access (between shards) to the other unpartitioned tables in other shards.

The POC

Why this POC?

Over the years, I have faced a lot of customers talking about scaling their database solution and looking at very complex sharding solutions such as Vitess as the first and only way to go.

This, without even considering whether their needs were really driving them there.

In my experience, and in talking with several colleagues, I am not alone in this analysis of the real needs. After discussing with all the parties impacted, only a very small percentage of customers were in real need of complex solutions. Most of the others were just trying to avoid the project work required to implement a simple shared-nothing solution. Why? Because, apparently, it is simpler to migrate data to a platform that does everything for you than to accept a bit of additional work and challenge at the beginning while keeping a simple approach. Also, going for the latest shiny thing always has its magic.

On top of that, with the rise of Kubernetes and MySQL Operators, a lot of confusion started to circulate, most of which was generated by the total lack of understanding that a database and a relational database are two separate things. That lack of understanding of the difference, and of the real problems attached to an RDBMS, has brought some to talk about horizontal scaling for databases with a concerning superficiality and without clarifying whether they were talking about an RDBMS or not. As such, some clarification is long overdue, as well as putting the KISS principle back as the main focus.

Given that, I thought that refreshing how ProxySQL could help in building a simple sharding solution may help to clarify the issues, reset the expectations and show how we can do things in a simpler way.  (See my old post, MySQL Sharding with ProxySQL.)

To do so, I built a simple POC that illustrates how you can use Percona Operator for MySQL (POM) and ProxySQL to build a sharded environment with a good level of automation for some standard operations like backup/restore, software upgrades, and resource scaling.

Why ProxySQL?

In the following example, we mimic a case where we need a simple sharding solution, which means we just need to redirect the data to different data containers, keeping the database maintenance operations on us. In this common case, we do not need to implement a full sharding system such as Vitess.  

As illustrated above, ProxySQL allows us to set up a common entry point for the application and then redirect the traffic on the basis of the identified sharding keys. It will also allow us to redirect read/write traffic to the primary and read-only traffic to all secondaries.

The other interesting thing is that we can have ProxySQL as part of the application pod, or as an independent service. Best practices indicate that having ProxySQL closer to the application will be more efficient, especially if we decide to activate the caching feature.  

Why POM?

Percona Operator for MySQL has three main solutions: Percona Operator for Percona XtraDB Cluster, Percona Operator for MySQL Group Replication, and Percona Operator for Percona Server for MySQL. The first two are based on virtually-synchronous replication and allow the cluster to keep the data state consistent across all pods, guaranteeing that the service will always offer consistent data. In the K8s context, we can see POM as a single service with native horizontal scalability for reads, while for writes, we will adopt the mentioned sharding approach. 

The other important aspect of using a POM-based solution is the automation it comes with. Deploying POM, you will be able to set automation for backups, software updates, monitoring (using Percona Monitoring and Management (PMM)), and last but not least, the possibility to scale UP or DOWN just by changing the needed resources. 

The elements used

kubernetes sharding

In our POC, I will use a modified version of sysbench (https://github.com/Tusamarco/sysbench) that has an additional field, continent, which I will use as the sharding key. At the moment, and for the purpose of this simple POC, I will only have two shards.

As the diagram above illustrates, we have a simple deployment, but it is good enough to illustrate the sharding approach.

We have:

  • The application(s) node(s) — it is really up to you whether you want to test with one application node or more; nothing will change. The same goes for the ProxySQL nodes, but keep in mind that if you use more ProxySQL nodes, it is better to activate the internal cluster support or use Consul to synchronize them.
  • Shard one is based on POM with PXC; it has the following:
    • Load balancer for service entry point
      • Entry point for r/w
      • Entry point for read-only
    • Three Pods for HAProxy
      • HAProxy container
      • PMM agent container
    • Three Pods with data nodes (PXC)
      • PXC cluster node container
      • Log streaming
      • PMM container
    • Backup/restore service
  • Shard two is based on POM for Percona Server for MySQL and Group Replication (technical preview)
    • Load balancer for service entry point
      • Entry point for r/w
      • Entry point for read-only
    • Three Pods for MySQL Router (testing)
      • MySQL Router container
    • Three Pods with data nodes (PS with GR)
      • PS-GR cluster node container
      • Log streaming
      • PMM container
    • Backup/restore on scheduler

Now you may have noticed that the representation of the nodes differs in size; this is not a drawing mistake. It indicates that I have allocated more resources (CPU and memory) to shard1 than to shard2. Why? Because I can, and I am simulating a situation where shard2 gets less traffic, at least temporarily; as such, I do not want to give it the same resources as shard1. I will eventually increase them if I see the need.

The settings

Data layer

Let us start with the easy one, the data layer configuration. Configuring the environment correctly is the key, and to do so, I am using a tool that I wrote specifically to calculate the needed configuration in a K8s POM environment. You can find it here (https://github.com/Tusamarco/mysqloperatorcalculator). 

Once you have compiled it and run it, you can simply ask what “dimensions” are supported, or you can define a custom level of resources, but you will still need to indicate the expected load level. In any case, please refer to the README in the repository with all the instructions.

The full cr.yaml for PXC shard1 is here, while the one for PS-GR is here.

For Shard1: I asked for resources to cover traffic of type 2 (Light OLTP), configuration type 5 (2XLarge), and 1,000 connections.

For Shard2: I asked for resources to cover traffic of type 2 (Light OLTP), configuration type 2 (Small), and 100 connections.

Once you have the CRs defined, you can follow the official guidelines to set the environment up:
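In practice, the setup boils down to applying the operator bundle and then the customized cr.yaml with kubectl; the file names below are indicative, so use the ones from the official guidelines:

# Deploy the operator (CRDs, RBAC, operator pod); the path is indicative
kubectl apply -f deploy/bundle.yaml

# Deploy the cluster described by the customized Custom Resource
kubectl apply -f deploy/cr.yaml

# Check that the pods come up
kubectl get pods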

It is time now to see the ProxySQL settings.

ProxySQL and sharding rules

As mentioned before, we will test the load sharding by continent, and we know that ProxySQL will not provide additional help to automatically manage the sharded environment. 

Given that, one way to do it is to create a DBA account per shard, or to inject the shard information in the commands while executing them. I will use the less comfortable one, separate DBA accounts, just to prove that it works.

We will have two shards: the sharding key is the continent field, and the continents will be grouped as follows:

  • Shard one:
    • Asia
    • Africa
    • Antarctica
    • Europe
    • North America
  • Shard two:
    • Oceania
    • South America

The DBA users:

  • dba_g1
  • dba_g2

The application user:

  • app_test

The host groups will be:

  • Shard one
    • 100 Read and Write
    • 101 Read only
  • Shard two
    • 200 Read and Write
    • 201 Read only
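As a reference, here is a minimal sketch of how these users and their default hostgroups could be registered in the ProxySQL admin interface; passwords and exact attributes are illustrative, and the full command list is linked further down:

INSERT INTO mysql_users (username,password,active,default_hostgroup,transaction_persistent) VALUES ('dba_g1','xxx',1,100,1);
INSERT INTO mysql_users (username,password,active,default_hostgroup,transaction_persistent) VALUES ('dba_g2','xxx',1,200,1);
INSERT INTO mysql_users (username,password,active,default_hostgroup,transaction_persistent) VALUES ('app_test','xxx',1,100,1);
LOAD MYSQL USERS TO RUNTIME; SAVE MYSQL USERS TO DISK;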

Once that is defined, we need to identify which query rules will serve us and how. What we want is to redirect all the incoming queries for:

  • Asia, Africa, Antarctica, Europe, and North America to shard1.
  • Oceania and South America to shard2
  • Split the queries in R/W and Read only
  • Prevent the execution of any query that does not have a shard key
  • Backup data at regular intervals and store it in a safe place

ProxySQL and sharding rules

Given the above, we first define the rules for the DBAs accounts.

We set the hostgroup for each DBA account, and then, if the query matches a sharding rule, we redirect it to the proper shard. Otherwise, the HG will remain as set.

This allows us to execute queries like CREATE/DROP TABLE on our shard without a problem, while still sending data where needed.
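To give an idea of the mechanics, below is a minimal, illustrative subset of the query rules; the rule IDs, flags, and regex mirror the stats output shown later, but take them as a sketch rather than the exact POC definition (the doubled single quotes are only SQL escaping of the literal quote inside the pattern):

-- DBA accounts: send to the shard default HG and hand over to the sharding rules via flagOUT
INSERT INTO mysql_query_rules (rule_id,active,username,destination_hostgroup,flagOUT,apply)
VALUES (20,1,'dba_g1',100,500,0),(21,1,'dba_g2',200,600,0);

-- Sharding rules: redirect by the continent found in the SQL (WHERE clause or comment)
INSERT INTO mysql_query_rules (rule_id,active,flagIN,match_pattern,destination_hostgroup,apply)
VALUES (31,1,500,'\scontinent\s*(=|like)\s*''*(Asia|Africa|Antarctica|Europe|North America)''*',100,1),
       (32,1,500,'\scontinent\s*(=|like)\s*''*(Oceania|South America)''*',200,1);

-- Read/write split per shard, driven by the flagIN set by the application-facing rules
INSERT INTO mysql_query_rules (rule_id,active,flagIN,match_digest,destination_hostgroup,apply)
VALUES (2040,1,1001,'^SELECT.*FOR UPDATE',100,1),(2041,1,1001,'^SELECT.*$',101,1),
       (2050,1,1002,'^SELECT.*FOR UPDATE',200,1),(2051,1,1002,'^SELECT.*$',201,1);

LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;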

For instance, below are the queries that sysbench will run.

Prepare:

INSERT INTO windmills_test1 /*  continent=Asia */ (uuid,millid,kwatts_s,date,location,continent,active,strrecordtype) VALUES(UUID(), 79, 3949999,NOW(),'mr18n2L9K88eMlGn7CcctT9RwKSB1FebW397','Asia',0,'quq')

In this case, I have the application simply injecting a comment in the INSERT SQL declaring the shard key. Given that I am using the account dba_g1 to create/prepare the schemas, rules 32/32 will be used, and given that I have set apply=1, ProxySQL will exit the query rules parsing and send the command to the relevant hostgroup.

Run:

SELECT id, millid, date,continent,active,kwatts_s FROM windmills_test1 WHERE id BETWEEN ? AND ? AND continent='South America'

SELECT SUM(kwatts_s) FROM windmills_test1 WHERE id BETWEEN ? AND ?  and active=1  AND continent='Asia'
SELECT id, millid, date,continent,active,kwatts_s  FROM windmills_test1 WHERE id BETWEEN ? AND ?  AND continent='Oceania' ORDER BY millid

SELECT DISTINCT millid,continent,active,kwatts_s   FROM windmills_test1 WHERE id BETWEEN ? AND ? AND active =1  AND continent='Oceania' ORDER BY millid

UPDATE windmills_test1 SET active=? WHERE id=?  AND continent='Asia'
UPDATE windmills_test1 SET strrecordtype=? WHERE id=?  AND continent='North America'

DELETE FROM windmills_test1 WHERE id=?  AND continent='Antarctica'

INSERT INTO windmills_test1 /* continent=Antarctica */ (id,uuid,millid,kwatts_s,date,location,continent,active,strrecordtype) VALUES (?, UUID(), ?, ?, NOW(), ?, ?, ?,?) ON DUPLICATE KEY UPDATE kwatts_s=kwatts_s+1

The above are executed during the tests.  In all of them, the sharding key is present, either in the WHERE clause OR as a comment. 

Of course, if I execute one of them without the sharding key, the firewall rule will stop the query execution, i.e.:

mysql> SELECT id, millid, date,continent,active,kwatts_s FROM windmills_test1 WHERE id BETWEEN ? AND ?;
ERROR 1148 (42000): It is impossible to redirect this command to a defined shard. Please be sure you Have the Continent definition in your query, or that you use a defined DBA account (dba_g{1/2})


Check here for the full command list.
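For completeness, the “firewall” behavior shown above can be obtained with a catch-all rule carrying an error_msg, which fires only when no sharding rule has matched; again, this is an illustrative sketch rather than the exact rule used (the message is the one quoted above):

INSERT INTO mysql_query_rules (rule_id,active,match_digest,error_msg,apply)
VALUES (2000,1,'.','It is impossible to redirect this command to a defined shard. Please be sure you Have the Continent definition in your query, or that you use a defined DBA account (dba_g{1/2})',1);
LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;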

Setting up the dataset

Once the rules are set, it is time to set up the schemas and the data using sysbench (https://github.com/Tusamarco/sysbench). Remember to use windmills_sharding tests.  

The first operation is to build the schema on SHARD2 without filling it with data. This is a DBA action; as such, we will execute it using the dba_g2 account:

sysbench ./src/lua/windmills_sharding/oltp_read.lua  --mysql-host=10.0.1.96  --mysql-port=6033 --mysql-user=dba_g2 --mysql-password=xxx --mysql-db=windmills_large --mysql_storage_engine=innodb --db-driver=mysql --tables=4 --table_size=0 --table_name=windmills --mysql-ignore-errors=all --threads=1  prepare

Setting table_size to 0 and pointing to the ProxySQL IP/port will do, and I will have the following:

mysql> select current_user(), @@hostname;
+----------------+-------------------+
| current_user() | @@hostname        |
+----------------+-------------------+
| dba_g2@%       | ps-mysql1-mysql-0 |
+----------------+-------------------+
1 row in set (0.01 sec)

mysql> use windmills_large;
Database changed
mysql> show tables;
+---------------------------+
| Tables_in_windmills_large |
+---------------------------+
| windmills1                |
| windmills2                |
| windmills3                |
| windmills4                |
+---------------------------+
4 rows in set (0.01 sec)

mysql> select count(*) from windmills1;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.09 sec)

All set but empty.

Now let us do the same but with the other DBA user:

sysbench ./src/lua/windmills_sharding/oltp_read.lua  --mysql-host=10.0.1.96  --mysql-port=6033 --mysql-user=dba_g1 --mysql-password=xxx --mysql-db=windmills_large --mysql_storage_engine=innodb --db-driver=mysql --tables=4 --table_size=400 --table_name=windmills --mysql-ignore-errors=all --threads=1  prepare

If I now run the select above with user dba_g2:

mysql> select current_user(), @@hostname;select count(*) from windmills1;
+----------------+-------------------+
| current_user() | @@hostname        |
+----------------+-------------------+
| dba_g2@%       | ps-mysql1-mysql-0 |
+----------------+-------------------+
1 row in set (0.00 sec)

+----------+
| count(*) |
+----------+
|      113 |
+----------+
1 row in set (0.00 sec)

While if I reconnect and use dba_g1:

mysql> select current_user(), @@hostname;select count(*) from windmills1;
+----------------+--------------------+
| current_user() | @@hostname         |
+----------------+--------------------+
| dba_g1@%       | mt-cluster-1-pxc-0 |
+----------------+--------------------+
1 row in set (0.00 sec)

+----------+
| count(*) |
+----------+
|      287 |
+----------+
1 row in set (0.01 sec)

I can also check on ProxySQL to see which rules were utilized:

select active,hits,destination_hostgroup, mysql_query_rules.rule_id, match_digest, match_pattern, replace_pattern, cache_ttl, apply,flagIn,flagOUT FROM mysql_query_rules NATURAL JOIN stats.stats_mysql_query_rules ORDER BY mysql_query_rules.rule_id;

+------+-----------------------+---------+---------------------+----------------------------------------------------------------------------+-------+--------+---------+
| hits | destination_hostgroup | rule_id | match_digest        | match_pattern                                                              | apply | flagIN | flagOUT |
+------+-----------------------+---------+---------------------+----------------------------------------------------------------------------+-------+--------+---------+
| 3261 | 100                   | 20      | NULL                | NULL                                                                       | 0     | 0      | 500     |
| 51   | 200                   | 21      | NULL                | NULL                                                                       | 0     | 0      | 600     |
| 2320 | 100                   | 31      | NULL                | \scontinent\s*(=|like)\s*'*(Asia|Africa|Antarctica|Europe|North America)'* | 1     | 500    | 0       |
| 880  | 200                   | 32      | NULL                | \scontinent\s*(=|like)\s*'*(Oceania|South America)'*                       | 1     | 500    | 0       |
| 0    | 100                   | 34      | NULL                | \scontinent\s*(=|like)\s*'*(Asia|Africa|Antarctica|Europe|North America)'* | 1     | 600    | 0       |
| 0    | 200                   | 35      | NULL                | \scontinent\s*(=|like)\s*'*(Oceania|South America)'*                       | 1     | 600    | 0       |
| 2    | 100                   | 51      | NULL                | \scontinent\s*(=|like)\s*'*(Asia|Africa|Antarctica|Europe|North America)'* | 0     | 0      | 1001    |
| 0    | 200                   | 54      | NULL                | \scontinent\s*(=|like)\s*'*(Oceania|South America)'*                       | 0     | 0      | 1002    |
| 0    | 100                   | 60      | NULL                | NULL                                                                       | 0     | 50     | 1001    |
| 0    | 200                   | 62      | NULL                | NULL                                                                       | 0     | 60     | 1002    |
| 7    | NULL                  | 2000    | .                   | NULL                                                                       | 1     | 0      | NULL    |
| 0    | 100                   | 2040    | ^SELECT.*FOR UPDATE | NULL                                                                       | 1     | 1001   | NULL    |
| 2    | 101                   | 2041    | ^SELECT.*$          | NULL                                                                       | 1     | 1001   | NULL    |
| 0    | 200                   | 2050    | ^SELECT.*FOR UPDATE | NULL                                                                       | 1     | 1002   | NULL    |
| 0    | 201                   | 2051    | ^SELECT.*$          | NULL                                                                       | 1     | 1002   | NULL    |
+------+-----------------------+---------+---------------------+----------------------------------------------------------------------------+-------+--------+---------+

Running the application

Now that the data load test was successful, let us do the real load following the indications above, but using 80 tables and just a few more records, like 20,000; nothing huge.

Once the data is loaded, we will have the two shards with different numbers of records. If all went well, shard2 should have roughly ¼ of the total and shard1 roughly ¾.

When the load is over, I have, as expected:

mysql> select current_user(), @@hostname;select count(*) as shard1 from windmills_large.windmills80;select /* continent=shard2 */ count(*) as shard2 from windmills_large.windmills80;
+----------------+--------------------+
| current_user() | @@hostname         |
+----------------+--------------------+
| dba_g1@%       | mt-cluster-1-pxc-0 |
+----------------+--------------------+
1 row in set (0.00 sec)

+--------+
| shard1 |
+--------+
|  14272 | ← Table windmills80 in SHARD1
+--------+
+--------+
| shard2 |
+--------+
|   5728 | ← Table windmills80 in SHARD2
+--------+

As you may have already noticed, I used a trick to query the other shard with the dba_g1 user: I just passed the shard2 definition in the query as a comment. That is all we need.

Let us execute the run command for writes in sysbench and see what happens.
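For reference, the run phase uses the application user against the same ProxySQL entry point; the command below mirrors the prepare command shown earlier, while the script name, thread count, and duration are indicative:

sysbench ./src/lua/windmills_sharding/oltp_write.lua --mysql-host=10.0.1.96 --mysql-port=6033 --mysql-user=app_test --mysql-password=xxx --mysql-db=windmills_large --db-driver=mysql --tables=80 --table_size=20000 --table_name=windmills --mysql-ignore-errors=all --threads=24 --time=300 --report-interval=10 run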

The first thing we can notice while doing writes is the query distribution:

+--------+-----------+----------------------------------------------------------------------------+----------+--------+----------+----------+--------+---------+-------------+---------+
| weight | hostgroup | srv_host                                                                   | srv_port | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries |
+--------+-----------+----------------------------------------------------------------------------+----------+--------+----------+----------+--------+---------+-------------+---------+
| 10000  | 100       | ac966f7d46c04400fb92a3603f0e2634-193113472.eu-central-1.elb.amazonaws.com  | 3306     | ONLINE | 24	     | 0        | 138    | 66      | 25          | 1309353 |
| 100    | 101       | a5c8836b7c05b41928ca84f2beb48aee-1936458168.eu-central-1.elb.amazonaws.com | 3306     | ONLINE | 0	     | 0        | 0      | 0       | 0           |       0 |
| 10000  | 200       | a039ab70e9f564f5e879d5e1374d9ffa-1769267689.eu-central-1.elb.amazonaws.com | 3306     | ONLINE | 24	     | 1        | 129    | 66      | 25          |  516407 |
| 10000  | 201       | a039ab70e9f564f5e879d5e1374d9ffa-1769267689.eu-central-1.elb.amazonaws.com | 6447     | ONLINE | 0	     | 0        | 0      | 0       | 0           |       0 |
+--------+-----------+----------------------------------------------------------------------------+----------+--------+----------+----------+--------+---------+-------------+---------+

Here we can notice that the connection load is evenly distributed, while the query load mainly goes to shard1, as we expected, given we have unbalanced sharding by design.

At the MySQL level, we had:

MySQL Questions and Com Type charts

The final point is, what is the gain of using this sharding approach?

Well, we still need to consider the fact we are testing on a very small set of data. However, if we can already identify some benefits here, that will be an interesting result. 

Let’s see the write operations with 24 and 64 threads:

MySQL writes

MySQL latency

We get a gain of ~33% just by using sharding, while latency does not suffer; on the contrary, even with a small load increase, we can see how the sharded solution performs better. Of course, we are still talking about a low number of rows and running threads, but the gain is there.

Backup

The backup and restore operation when using POM is completely managed by the operator (see instructions in the POM documentation https://docs.percona.com/percona-operator-for-mysql/pxc/backups.html and https://docs.percona.com/percona-operator-for-mysql/ps/backups.html). 

The interesting part is that we can have multiple kinds of backup solutions, like:

  • On-demand
  • Scheduled 
  • Full Point in time recovery with log streaming

Automation will allow us to set a schedule as simple as this:

schedule:
  - name: "sat-night-backup"
    schedule: "0 0 * * 6"
    keep: 3
    storageName: s3-eu-west
  - name: "daily-backup"
    schedule: "0 3 * * *"
    keep: 7
    storageName: s3-eu-west

Or, if you want to run the on-demand:

kubectl apply -f backup.yaml

Where the backup.yaml file has very simple information:

apiVersion: ps.percona.com/v1alpha1
kind: PerconaServerMySQLBackup
metadata:
  name: ps-gr-sharding-test-2nd-of-may
#  finalizers:
#    - delete-backup
spec:
  clusterName: ps-mysql1
  storageName: s3-ondemand

Using both methods, we will soon have a good set of backups, like:

POM (PXC)

cron-mt-cluster-1-s3-eu-west-20234293010-3vsve   mt-cluster-1   s3-eu-west    s3://mt-bucket-backup-tl/scheduled/mt-cluster-1-2023-04-29-03:00:10-full   Succeeded   3d9h        3d9h
cron-mt-cluster-1-s3-eu-west-20234303010-3vsve   mt-cluster-1   s3-eu-west    s3://mt-bucket-backup-tl/scheduled/mt-cluster-1-2023-04-30-03:00:10-full   Succeeded   2d9h        2d9h
cron-mt-cluster-1-s3-eu-west-2023513010-3vsve    mt-cluster-1   s3-eu-west    s3://mt-bucket-backup-tl/scheduled/mt-cluster-1-2023-05-01-03:00:10-full   Succeeded   33h         33h
cron-mt-cluster-1-s3-eu-west-2023523010-3vsve    mt-cluster-1   s3-eu-west    s3://mt-bucket-backup-tl/scheduled/mt-cluster-1-2023-05-02-03:00:10-full   Succeeded   9h          9h

POM (PS) *

NAME                             STORAGE       DESTINATION                                                                     STATE       COMPLETED   AGE
ps-gr-sharding-test              s3-ondemand   s3://mt-bucket-backup-tl/ondemand/ondemand/ps-mysql1-2023-05-01-15:10:04-full   Succeeded   21h         21h
ps-gr-sharding-test-2nd-of-may   s3-ondemand   s3://mt-bucket-backup-tl/ondemand/ondemand/ps-mysql1-2023-05-02-12:22:24-full   Succeeded   27m         27m

Note that, as DBAs, we still need to validate the backups with a restore procedure. That part is not automated (yet).

*Note that Backup for POM PS is available only on demand, given the solution is still in technical preview.

When will this solution fit in?

As mentioned multiple times, this solution can cover simple cases of sharding; it is better if you have a shared-nothing architecture.

It also requires work from the DBA side in case of DDL operations or resharding. 

You also need to be able to change some SQL code to be sure the sharding key/information is present in any SQL executed.

When will this solution not fit in?

Several things could prevent you from using this solution. The most common ones are:

  • You need to query multiple shards at the same time. This is not possible with ProxySQL.
  • You do not have a DBA to perform administrative work and need to rely on an automated system.
  • You need cross-shard distributed transactions.
  • You have no access to the SQL code.

Conclusions

We do not have a Hamletic dilemma here about whether to shard or not to shard.

When using an RDBMS like MySQL, if you need horizontal scalability, you need to shard. 

The point is there is no magic wand or solution; moving to sharding is an expensive and impacting operation. If you choose it at the beginning, before doing any application development, the effort can be significantly less. 

Doing it sooner will also allow you to test proper solutions, where proper means a KISS solution. Always go for the less complex option, because in two years you will be super happy about your decision.

If, instead, you must convert a current solution, then prepare for bloodshed, or at least for a long journey. 

In any case, we need to keep in mind a few key points:

  • Do not believe most of the articles on the internet that promise you infinite scalability for your database. If there is no distinction in the article between a simple database and an RDBMS, run away. 
  • Do not go for the latest shiny thing just because it shines. Test it and evaluate IF it makes sense for you. Better to spend a quarter now testing a few solutions than to fight for years with something that you do not fully comprehend.
  • Using containers/operators/Kubernetes does not provide scaling per se; you must find a mechanism to make the solution scale. There is absolutely NO difference from on-premises in that respect. What you may get is a good level of automation. However, that will come with a good level of complexity, and it is up to you to evaluate whether it makes sense or not.

As said at the beginning, for MySQL the choice is limited. Vitess is the fully complete solution, with a lot of coding behind it, providing a complete platform to deal with your scaling needs.

However, do not be so fast to exclude ProxySQL as a possible solution. Many out there are already using it for sharding as well.

This small POC used a synthetic case, but it also shows that with just four rules, you can achieve a decent solution. A real scenario could be a bit more complex … or not. 

References

Vitess (https://vitess.io/docs/)

ProxySQL (https://proxysql.com/documentation/)

Firewalling with ProxySQL (https://www.tusacentral.com/joomla/index.php/mysql-blogs/197-proxysql-firewalling)


The Percona Kubernetes Operators automate the creation, alteration, or deletion of members in your Percona Distribution for MySQL, MongoDB, or PostgreSQL environment.

 

Learn More About Percona Kubernetes Operators

May
09
2023
--

Restoring a Snapshot of a Sharded MongoDB Cluster to a Kubernetes-Based MongoDB Environment

Kubernetes-Based MongoDB Environment

Many MongoDB clusters use storage-level snapshots to provide fast and reliable backups. In this blog post, you’ll learn how to restore such a snapshot from a traditional VM-based sharded MongoDB cluster to a freshly deployed Percona Operator for MongoDB cluster on Kubernetes.

Background story

I recently worked with a company running a large four-shard MongoDB Enterprise Server database cluster on VMs on premises that decided to migrate to the Google Cloud Platform. After careful consideration and pros-and-con evaluation, the customer chose to go with Percona Distribution for MongoDB run on Kubernetes – specifically using Percona Operator for MongoDB. There are four main factors that contributed to this decision:

  1. The total cost of ownership: K8s resources are significantly cheaper than running a popular DBaaS on the same cloud
  2. Ease of deployments: Percona Operator for MongoDB makes day-1 and day-2 operations a breeze
  3. Preference for open-source tools and cloud-native solutions: applications are already migrated to Kubernetes – running the entire solution that way simplifies operations. 
  4. This approach frees the deployment from cloud (and any other vendor) lock-in.

To start realistic compatibility and performance testing, we needed to restore the filesystem snapshot backup that was stored on NetApp volume in GCP into a Google Kubernetes Engine (GKE) cluster running Percona Operator for MongoDB. It’s not a trivial task due to the requirements for restoring a sharded cluster backup into a new cluster. I will show you why and how we did it.

Requirements

Let’s look at the overall requirements for the cluster where I want to restore the snapshot to:

  1. The same major (ideally minor, too) MongoDB version
  2. The same number of shards
  3. The same name of all shards
  4. The same name of the Config RS
  5. The same hostnames of all nodes (this is how Config RS connects to the specific shard)
  6. The same MongoDB configuration with regards to how files are stored on the disk

This seems to be straightforward; however, there are a couple of special considerations to make when it comes to the environment controlled by Percona Operator for MongoDB, specifically:

  • You can’t control the name of Config RS
  • You can’t change the hostnames and those will certainly be different within Kubernetes
  • Percona Operator needs specific users to be present in your cluster in order to control it – and those won’t be present in your backup.

All of the above makes it impossible to simply adjust the Operator configuration and copy all the files from your snapshot backup in the specific volumes.

Plan

The high-level plan consists of the following steps:

  1. Deploy the cluster on K8s
  2. Restore snapshot files
    1. Pause the cluster on K8s
    2. Mount storage volumes to a separate VM
    3. Copy snapshot files to respective volumes
  3. Prepare each replica set in the standalone mode (hostnames, sharding configuration, users)
  4. Start the cluster on K8s and initialize each replica set

The following approach is applicable regardless of what “flavor” of MongoDB your source environment uses. This can be MongoDB Enterprise Server, MongoDB Community Edition, or Percona Server for MongoDB.

Step one: Deploy the cluster on K8s

In order to deploy Percona Server for MongoDB (PSMDB) on the K8s cluster, follow the documentation. Before you execute the last step (deploying cr.yaml), however, make sure to adjust the following elements of the configuration. This will make the cluster “fit” the one that we took the backup from.

  1. Set spec.image to a specific image version. It needs to be the version that matches the major version of the source cluster, for example:

    spec:
      image: percona/percona-server-mongodb:5.0.11-10
  2. Create as many shard replica sets as in the source cluster. Copy the default replica set definition (entire section) in spec.replsets[]. For example, if your source cluster has two shards with the names “shard1” and “shard2” (names must match the ones from the source cluster):

    replsets:
     - name: shard1
       size: 3
       [...]
     - name: shard2
       size: 3
       [...]

    Unfortunately, you can’t set the name of Config RS. We’ll deal with that later.

  3. If your source cluster’s WiredTiger configuration is different from the default, adjust the mongod configuration of each replica set to match it. Specifically, two MongoDB configuration items are critical: storage.directoryPerDB and storage.wiredTiger.engineConfig.directoryForIndexes. You can do it in the following way:

    replsets:
      - name: shard1
        size: 3
        configuration: |
          storage:
            directoryPerDB: true
            wiredTiger:
              engineConfig:
                directoryForIndexes: true
  4. Save changes and start the cluster using the modified cr.yaml file

Step two: Restore snapshot files

Your cluster should be started at this point, and it is important that all the PersistentVolumes required for it have been created. You can check your cluster state with kubectl get psmdb, see all deployed pods with kubectl get pods, or check PVs with kubectl get pv. In this step, you need to mount the volumes of all your database nodes to an independent VM, as we’ll make changes in MongoDB standalone mode. You need those VMs temporarily to perform the required operations.

  1. Pause the PSMDB cluster on K8s by setting spec.pause: true in your cr.yaml file. That will cause the entire cluster to stop and make it possible to mount volumes elsewhere.
  2. Check the zones where your PersistentVolumes were created. You can use kubectl describe pv pv_name or find it in the cloud console.
  3. Create a VM in each zone, then mount volumes corresponding to PersistentVolumes to a VM in its zone. It’s critical that you can easily identify the volume’s purpose (which replica set in your MongoDB cluster it refers to and which node – primary or secondary). As an example, this is how you can mount a volume to your VM using gcloud API:

    # attach the disk to an instance
    gcloud compute instances attach-disk vm_name \
      --disk disk_name \
      --zone=us-central1-a
    
    # check the volume on the vm
    sudo lsblk
    
    # create a mountpoint
    sudo mkdir -p /dir_name   # e.g., rs0-primary
    
    # mount the volume
    sudo mount -o discard,defaults /dev/sdb /dir_name

     

  4. Delete all files from each volume mounted to your VMs (primaries and secondaries).
  5. Copy files from the snapshot to the respective volume, directly to the main directory of the mounted volume. Do it just for volumes related to primary nodes (leave secondaries empty).
  6. Install Percona Server for MongoDB on each of your VMs. Check installation instructions and install the same version as your Mongo cluster on K8s. Don’t start the server yet!

Step three: Prepare each replica set in the standalone mode

Now, you need to start PSMDB for each volume with data (each primary of each replica set, including Config RS) separately. We will then log in to mongo shell and edit the cluster configuration manually so that when we bring volumes with data back to Kubernetes, it can start successfully.

Execute the steps below for each replica set, including Config RS:

  1. On the VM where your Config RS primary volume is mounted, edit /etc/mongod.conf. Specifically, adjust storage.dbPath (to the directory where you mounted the volume), storage.directoryPerDB, and storage.wiredTiger.engineConfig.directoryForIndexes.
  2. Start PSMDB with sudo mongod --config /etc/mongod.conf
  3. Connect to PSMDB with the mongo command (authentication is not required). Once you successfully log in, we can start making changes to the cluster configuration.
  4. Delete the local replica set configuration with the following commands:

    use local;
    db.dropDatabase();
  5. [This step applies only to Config RS]
    Replace the shards configuration. Execute the following command for each shard separately (you can list them all with db.shards.find()). In the following string, replace shard_name, cluster_name, and namespace_name with values specific to your cluster:

    use config;
    
    db.shards.updateOne(
      { "_id" : "shard_name" },
      { $set :
        { "host" : "shard_name/cluster_name-shard_name-0.cluster_name-shard_name.namespace_name.svc.cluster.local:27017,cluster_name-shard_name-1.cluster_name-shard_name.namespace_name.svc.cluster.local:27017,cluster_name-shard_name-2.cluster_name-shard_name.namespace_name.svc.cluster.local:27017"}
    });
  6. [This step applies to shard RS only] Clear shard metadata

    use admin
    db.system.version.deleteOne( { _id: "minOpTimeRecovery" } )
  7. [This step applies to shard RS only] Replace Config RS connection string in shardIdentity with the following command. Replace cluster_name, namespace_name with your values.

    use admin
    
    db.system.version.updateOne(
     { "_id" : "shardIdentity" },
     { $set :
       { "configsvrConnectionString" : "cfg/cluster_name-cfg-0.cluster_name-cfg.namespace_name.svc.cluster.local:27017,cluster_name-cfg-1.cluster_name-cfg.namespace_name.svc.cluster.local:27017,cluster_name-cfg-2.cluster_name-cfg.namespace_name.svc.cluster.local:27017"}
     });
  8. Percona MongoDB Operator requires system-level MongoDB users to be present in a database. We must create those users for the operator, as our backup doesn’t have them. If you haven’t changed default secrets.yaml during the deployment of the cluster, you can find default passwords either in the file or here in the documentation. To create required users and roles, use the Mongo commands below:

    use admin;
    //drop user in case they already exist in your backup
    db.dropUser("userAdmin");
    db.dropUser("clusterAdmin");
    db.dropUser("clusterMonitor");
    db.dropUser("backup");
    db.dropUser("databaseAdmin");
    
    //create missing role
    db.createRole({"role" : "explainRole",
                   "privileges" : [{"resource": {"db" : "","collection" : "system.profile"},
                    "actions" :   ["collStats","dbHash","dbStats","find","listCollections","listIndexes"]}],
                    roles: []});
    
    //create system users
    db.createUser({ user: "userAdmin",
                    pwd: "userAdmin123456",
                    roles: [ { role: "userAdminAnyDatabase", db: "admin" }]
                  });
    
    db.createUser({ user: "clusterAdmin",
                   pwd: "clusterAdmin123456",
                   roles: [ { role: "clusterAdmin", db: "admin" }]
                  });
    
    db.createUser({ user: "clusterMonitor",
                    pwd: "clusterMonitor123456",
                    roles: [{ role: "explainRole", db: "admin" },
                            { role: "read", db: "local" },
                            { role: "clusterMonitor", db: "admin" }]
                  });
    
    db.createUser({ user: "databaseAdmin",
                    pwd: "databaseAdmin123456",
                    roles: [{ role: "readWriteAnyDatabase", db: "admin" },
                            { role: "readAnyDatabase", db: "admin" },
                            { role: "dbAdminAnyDatabase", db: "admin" },
                            { role: "backup", db: "admin" },
                            { role: "restore", db: "admin" },
                            { role: "clusterMonitor", db: "admin" },]
                  });
    
    db.createUser({ user: "backup",
                    pwd: "backup123456",
                    roles: [{ role: "backup", db: "admin" },
                            { role: "restore", db: "admin" },
                            { role: "clusterMonitor", db: "admin" }]
                  });
  9. Shut down the server now using db.shutdownServer();
  10. The operator runs the mongod process as the “mongodb” user (not as the standard “mongod” one!). Therefore, we need to fix permissions before we unmount the volume. Add the following line to your /etc/passwd file:

    mongodb:x:1001:0:Default Application User:/home/mongodb:/sbin/nologin
  11. Set permissions:

    cd /your_mountpoint
    sudo chown -R mongodb:1001 ./
  12. Unmount the volume and detach it from the VM. As an example, this is how you can unmount a volume from your VM using gcloud API:

    sudo umount /dir_name
    
    gcloud compute instances detach-disk vm_name \
      --disk disk_name \
      --zone=us-central1-a

     

Step four: Start the cluster on K8s and initialize each replica set

You’re ready to get back to K8s. Start the cluster (it will start with previously used volumes). It will be in a pending state because we broke (intentionally) replica sets. Only one pod per Replica Set will start. You must initialize replica sets one by one.

To unpause the PSMDB cluster, set spec.pause: false in your cr.yaml file and apply it with kubectl. Then, repeat the steps below for all replica sets, starting with Config RS.
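If you prefer not to edit cr.yaml by hand, the same toggle can be applied with a one-line patch (the cluster name here is illustrative):

# pause the cluster (the same mechanism used in step two)
kubectl patch psmdb my-cluster-name --type merge -p '{"spec":{"pause":true}}'

# unpause it before re-initializing the replica sets
kubectl patch psmdb my-cluster-name --type merge -p '{"spec":{"pause":false}}'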

  1. Log in to the shell of “pod 0” of the replica set with kubectl exec --stdin --tty cluster_name-cfg-0 -- /bin/bash (for Config RS)
  2. Log in to PSMDB with the mongo command
  3. Authenticate as clusterAdmin using the following command (assuming you used default passwords):

    use admin;
    db.auth("clusterAdmin", "clusterAdmin123456");
  4. Initialize the replica set as below. Replace cluster_name, namespace_name, and shard_name with your own values.

    // version for Config RS
    
    rs.initiate(
       {
         _id: "cfg",
          members: [
             { _id: 0, host : "cluster_name-cfg-0.cluster_name-cfg.namespace_name.svc.cluster.local:27017" },
             { _id: 1, host : "cluster_name-cfg-1.cluster_name-cfg.namespace_name.svc.cluster.local:27017" },
             { _id: 2, host : "cluster_name-cfg-2.cluster_name-cfg.namespace_name.svc.cluster.local:27017" },
          ]
       }
    );
    
    // version for other RS (shards)
    rs.initiate(
       {
         _id: "shard_name",
          members: [
             { _id: 0, host : "cluster_name-shard_name-0.cluster_name-shard_name.namespace_name.svc.cluster.local:27017" },
             { _id: 1, host : "cluster_name-shard_name-1.cluster_name-shard_name.namespace_name.svc.cluster.local:27017" },
             { _id: 2, host : "cluster_name-shard_name-2.cluster_name-shard_name.namespace_name.svc.cluster.local:27017" },
          ]
       }
    );
  5. After a few seconds, your node will become PRIMARY. You can check the health of the replica set using the rs.status() command. Remember that if your dataset is large, the initial synchronization process may take a long time (as with any MongoDB deployment).

That’s it! You now successfully restored a snapshot backup into Percona Server for MongoDB deployed on K8s with Percona Operator. To verify that you have successfully done that run kubectl get pods or kubectl get psmdb – the output should be similar to the one below.

$ kubectl get psmdb
NAME              ENDPOINT                                        STATUS   AGE
my-cluster-name   my-cluster-name-mongos.ns-0.svc.cluster.local   ready    156m

 

The Percona Kubernetes Operators automate the creation, alteration, or deletion of members in your Percona Distribution for MySQL, MongoDB, or PostgreSQL environment.

Learn More About Percona Kubernetes Operators

May
01
2023
--

Save Money in AWS RDS: Don’t Trust the Defaults

aws rds

Default settings can help you get started quickly – but they can also cost you performance and a higher cloud bill at the end of the month. Want to save money on your AWS RDS bill? I’ll show you some MySQL settings to tune to get better performance, and cost savings, with AWS RDS.

Recently I was engaged in a MySQL Performance Audit for a customer to help troubleshoot performance issues that led to downtime during periods of high traffic on their AWS RDS MySQL instances. During heavy loads, they would see messages about their InnoDB settings in the error logs:

[Note] InnoDB: page_cleaner: 1000ms intended loop took 4460ms. The settings might not be optimal. (flushed=140, during the time.)

This message is normally a side effect of a storage subsystem that is not capable of keeping up with the number of writes (IOPS) required by MySQL. It is basically the storage saying, “Hey MySQL, try to write less. I can’t keep up,” which is a common situation when innodb_io_capacity_max is set too high.

After receiving these messages for some time, they would eventually hit performance issues to the point that the server became unresponsive for a few minutes. After that, things went back to normal.

Let’s look at the problem and try to gather some context information.

Investigating AWS RDS performance issues

We had a db.m5.8xlarge instance type (32vCPU – 128GB of RAM) with a gp2 storage of 5TB, which should provide up to 10000 IOPS (this is the maximum capacity allowed by gp2), running MySQL 5.7. This is a pretty decent setup, and I don’t see many customers needing to write this many sustained IOPS.

The innodb_io_capacity_max parameter was set to 2000, so the hardware should be able to deliver that many IOPS without major issues. However, gp2 suffers from a tricky way of calculating credits and usage that may lead to erroneous conclusions about the real capacity of the storage. Reviewing the CloudWatch graphs, we saw only roughly 8-9k IOPS (reads and writes) used during spikes.

AWS RDS MySQL

writeops

While the IO utilization was quite high, there should have been some room to get more IOPS, yet we were still seeing errors. What caught my attention was the self-healing behavior shown by MySQL after a few minutes.

The common solution, which was actually discussed during our kick-off call, was, “Well, there is always the chance to move to Provisioned IOPS, but that is quite expensive.” Yes, this is true; io2 volumes are expensive, and honestly, I think they should be used only where really high IO capacity at expected latencies is required, and this didn’t seem to be the case.

Otherwise, most of the environments can adapt to gp2/gp3 volumes; for that matter, you need to provision a big enough volume and get enough IOPS.

Finding the “smoking gun” with pt-mysql-summary

Not too long ago, my colleague Yves Trudeau and I worked on a series of posts debating how to configure an instance for write-intensive workloads. A quick look at the pt-mysql-summary output shows something really interesting when approaching the issue outside of the busy load period:

# InnoDB #####################################################
                  Version | 5.7.38
         Buffer Pool Size | 93.0G
         Buffer Pool Fill | 100%
        Buffer Pool Dirty | 1%
           File Per Table | ON
                Page Size | 16k
            Log File Size | 2 * 128.0M = 256.0M
          Log Buffer Size | 8M
             Flush Method | O_DIRECT
      Flush Log At Commit | 1
               XA Support | ON
                Checksums | ON
              Doublewrite | ON
          R/W I/O Threads | 4 4
             I/O Capacity | 200
       Thread Concurrency | 0
      Concurrency Tickets | 5000
       Commit Concurrency | 0
      Txn Isolation Level | REPEATABLE-READ
        Adaptive Flushing | ON
      Adaptive Checkpoint | 
           Checkpoint Age | 78M
             InnoDB Queue | 0 queries inside InnoDB, 0 queries in queue

 

Wait, what? 256M of redo logs and a Checkpoint Age of only 78M? That is quite conservative, considering a 93GB buffer pool size. We should expect much bigger redo logs for such a big buffer pool. Bingo! We have a smoking gun here.

Additionally, full ACID durability features were enabled, that is, innodb_flush_log_at_trx_commit=1 and sync_binlog=1, which add a lot of write overhead to every operation because, during the commit stage, everything is flushed to disk (or to gp2 in this case).

Considering a spike of load running a lot of writing queries, hitting the max checkpoint age in this setup is a very likely situation.

Basically, MySQL will perform flushing operations at a certain rate depending on several factors. This rate is normally close to innodb_io_capacity (200 by default); if the amount of writes pushes the checkpoint age toward the max checkpoint age, then the adaptive flushing algorithm will start to push the flushing rate up toward innodb_io_capacity_max (2000 by default) to try to keep the free space in the redo logs far from the max checkpoint age limit.

If we keep pushing, we can eventually reach the max checkpoint age, which will drive the system into the synchronous state, meaning that a furious burst of flushing operations will happen, beyond innodb_io_capacity_max, and all write operations will be paused (freezing writes) until there is free room in the redo logs to keep writing.
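A quick way to see how much headroom an instance has is to compare the total redo log size with the current checkpoint age (the same numbers pt-mysql-summary condenses above); a simple sketch for a 5.7 server:

-- Total redo log space available
SELECT @@innodb_log_file_size * @@innodb_log_files_in_group / 1024 / 1024 AS redo_log_mb;

-- Current flushing limits
SELECT @@innodb_io_capacity, @@innodb_io_capacity_max;

-- Checkpoint age = "Log sequence number" - "Last checkpoint at" in the LOG section
SHOW ENGINE INNODB STATUS\G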

This was exactly what was happening on this server. We calculated roughly how many writes were being performed per hour, and then we recommended increasing the redo log size to two files of 2GB each (4GB total). In practical terms, it ended up being 3.7G due to some rounding that RDS does, so we got:

# InnoDB #####################################################
                  Version | 5.7.38
         Buffer Pool Size | 92.0G
         Buffer Pool Fill | 100%
        Buffer Pool Dirty | 2%
           File Per Table | ON
                Page Size | 16k
            Log File Size | 2 * 1.9G = 3.7G
          Log Buffer Size | 8M
             Flush Method | O_DIRECT

 

Then we also increased innodb_io_capacity_max to 4000, giving the adaptive flushing algorithm more room to increase writes. The results in CloudWatch show we were right:

 

AWS RDS Cloud MySQL

The reduction during the last couple of weeks is more than 50% of IOPS, which is pretty decent now, and we haven’t changed the hardware at all. Actually, it was possible to reduce the storage size to 3TB and avoid moving to expensive io2 (provisioned IOPS) storage.

Conclusions

RDS normally works very well out of the box; most of the configurations are properly set for the type of instance provisioned. Still, I find it silly that the RDS default redo log size is this small, as people using a fully managed solution would expect not to have to worry about such common tuning.

MySQL 8.0 implemented innodb_dedicated_server, which auto-sizes innodb_log_file_size and innodb_log_files_in_group (now replaced by innodb_redo_log_capacity) as a function of the InnoDB Buffer Pool size using a pretty simple, but effective, algorithm, and I guess it shouldn’t be hard for the AWS team to implement it. We’ve done some research, and it seems RDS is not pushing this logic into the 8.0 versions, which makes it strange to keep such a default for innodb_redo_log_capacity.
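For comparison, on MySQL 8.0.30 and later the whole redo sizing is a single dynamic variable; a hedged example with an arbitrary 8GB value (on RDS you would change it through the parameter group rather than SET GLOBAL):

-- Check the current redo capacity
SELECT @@innodb_redo_log_capacity / 1024 / 1024 AS redo_capacity_mb;

-- Resize online; 8GB is purely illustrative, size it from your measured hourly write volume
SET GLOBAL innodb_redo_log_capacity = 8 * 1024 * 1024 * 1024;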

In the meantime, reviewing how RDS MySQL is configured with default parameters is something we should all do to avoid the typical “throw more hardware at it” solution and, by extension, spending more money.

Percona Consultants have decades of experience solving complex database performance issues and design challenges. They’ll work with you to understand your goals and objectives and provide the best, unbiased solutions for your database environment.

 

Learn more about Percona Consulting

 

A personalized Percona Database Performance Audit will help uncover potential performance killers in your current configuration.

 

Get your personalized audit

Apr
28
2023
--

Add More Security to Your Percona Server for MongoDB With AWS IAM integration!

MongoDB With AWS IAM Integration

Did you notice that Percona Server for MongoDB 6.0.5-4 was released just a few days ago? This time around, it introduced improvements to the way we handle master key rotation for data at rest encryption as well as AWS IAM integration.

One key to rule them all — improvements to master key rotation

With the improvements introduced in Percona Server for MongoDB 6.0.5-4, one key path can be used for all servers in a clustered environment. This allows us to use one vault key namespace for all nodes in a deployment while preserving key versions and allowing each node to perform key rotation without impacting the other nodes.

Changes introduced with Percona Server for MongoDB 6.0.5-4 now allow using the same key for all the members of a replica set if the user chooses so, without impact on functionality.

Why should you care about AWS IAM integration?

With all the systems users need to access daily, password management becomes a more pressing issue. The introduction of IAM systems to an enterprise has become somewhat of a security standard in large enterprises.

Our users approached us about integration with AWS IAM, commonly used in their organizations. It’s an integration missing from MongoDB Community Edition (CE) that is important for compliance with enterprise security policies of many companies. Integration with AWS IAM allows:

MongoDB AWS IAM integration

To set up this integration, follow the steps outlined in our documentation, and configure either the user or the role authentication. This will allow AWS Security Token Service (STS) to play the part of Identity Provider (IDP) in a SAML 2.0-based federation.
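
As an illustration of what this enables on the client side, a connection can authenticate with IAM credentials instead of a database password. A hedged sketch; the host is a placeholder, and the exact connection options may differ depending on how you configured the integration:

# Authenticate using AWS IAM credentials via the MONGODB-AWS mechanism
# (host is a placeholder; the credentials are read from the environment).
export AWS_ACCESS_KEY_ID="<access key id>"
export AWS_SECRET_ACCESS_KEY="<secret access key>"
mongosh "mongodb://psmdb.example.com:27017/?authSource=%24external&authMechanism=MONGODB-AWS"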

Your feedback matters!

We take pride in being open to feedback at Percona. Please do not hesitate to contact us via the community forums or this contact form.

What’s next

We are looking into the problems affecting large datastores that are a pain point for our users. Please let us know if there are any particular issues you are struggling with in MongoDB; we are always open to suggestions!

Learn more about Percona Server for MongoDB

Apr
20
2023
--

Using Encryption-at-Rest for PostgreSQL in Kubernetes

encryption-at-rest in PostgreSQL

Data-at-rest encryption is essential for compliance with regulations that require the protection of sensitive data. Encryption can help organizations comply with regulations and avoid legal consequences and fines. It is also critical for securing sensitive data and avoiding data breaches.

PostgreSQL does not natively support Transparent Data Encryption (TDE). TDE is a database encryption technique that encrypts data at the column or table level, as opposed to full-disk encryption (FDE), which encrypts the entire storage volume.

As for FDE, there are multiple options available for PostgreSQL. In this blog post, you will learn:

  • how to leverage FDE on Kubernetes with Percona Operator for PostgreSQL
  • how to start using encrypted storage for already running cluster

Prepare

In most public clouds, block storage is not encrypted by default. To enable the encryption of the storage in Kubernetes, you need to modify the StorageClass resource. This will instruct Container Storage Interface (CSI) to provision encrypted storage volume on your block storage (AWS EBS, GCP Persistent Disk, Ceph, etc.).

The configuration of the storage class depends on your storage plugin. For example, in Google Kubernetes Engine (GKE), you need to create the key in Cloud Key Management Service (KMS) and set it in the StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: my-enc-sc
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-standard
  disk-encryption-kms-key: KMS_KEY_ID

Get KMS_KEY_ID by following the instructions in this document.

For AWS EBS, you just need to add an encrypted field; the key in AWS KMS will be generated automatically.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: my-enc-sc
provisioner: kubernetes.io/aws-ebs
parameters:
  encrypted: 'true'
  fsType: ext4
  type: gp2
volumeBindingMode: WaitForFirstConsumer

Read more about storage encryption in the documentation of your cloud provider or storage project of your choice.

Try it out

Once you have the StorageClass created, it is time to use it. I will use Percona Operator for PostgreSQL v2 (currently in tech preview) in my tests, but such an approach can be used with any Percona Operator.

Deploy the operator by following our installation instructions. I will use the regular kubectl way:

kubectl apply -f deploy/bundle.yaml --server-side

Create a new cluster with encrypted storage

To create the cluster with encrypted storage, you must set the correct storage class in the Custom Resource.

spec:
  ...
  instances:
  - name: instance1
  ...
    dataVolumeClaimSpec:
      storageClassName: my-enc-sc
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi

Apply the custom resource:

kubectl apply -f deploy/cr.yaml

The cluster should be up and running, backed by encrypted storage.
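
To double-check that the cluster really landed on encrypted volumes, you can look at the StorageClass attached to its persistent volume claims. A minimal sketch; the namespace is an assumption, so adjust it for your deployment:

# List PVCs and the StorageClass backing them; the data volumes
# should show the encrypted class (my-enc-sc in this example).
kubectl get pvc -n postgres-operator \
  -o custom-columns=NAME:.metadata.name,STORAGECLASS:.spec.storageClassName,STATUS:.status.phase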

Encrypt storage for existing cluster

This task boils down to switching from one StorageClass to another. With version two of the Operator, we have a notion of instance groups. They are absolutely fantastic for testing new configurations, including compute and storage.

  1. Start with a regular cluster with two nodes – Primary and Replica. Storage is not encrypted. (0-fde-pg.yaml)
  2. Add another instance group with two nodes, but this time with encrypted storage (1-fde-pg.yaml). To do that, we change the spec.instances section:

       - name: instance2
         replicas: 2
         dataVolumeClaimSpec:
           storageClassName: my-enc-sc
           accessModes:
           - ReadWriteOnce
           resources:
             requests:
               storage: 1Gi

     Wait for replication to complete and see the traffic hitting the new nodes.

  3. Terminate the nodes with unencrypted storage by removing the old instance group from the Custom Resource (2-fde-pg.yaml).

Now your cluster runs using encrypted storage.

Conclusion

It is quite interesting that PostgreSQL does not have built-in data-at-rest encryption. Peter Zaitsev wrote a blog post in the past about why it is needed – Why PostgreSQL Needs Transparent Database Encryption (TDE).

Storage-level encryption allows you to keep your data safe, but it has its limitations. The top limitations are:

  1. You can’t encrypt database objects granularly, only the whole storage.
  2. Also, (1) does not allow you to encrypt different data with different keys, which might be a blocker for compliance and regulations.
  3. Physical backups, where files are copied directly from disk, are not encrypted.

Even with these limitations, encrypting the data is highly recommended. Try out our operator and let us know what you think.

  • Please use this forum for general discussions.
  • Submit a JIRA issue for bugs, improvements, or feature requests.
  • For commercial support, please use our contact page.

The Percona Kubernetes Operators automate the creation, alteration, or deletion of members in your Percona Distribution for MySQL, MongoDB, or PostgreSQL environment.

 

Learn More About Percona Kubernetes Operators

Mar
22
2023
--

Deploy Percona Monitoring and Management on Amazon EKS With eksctl and Helm

Deploy Percona Monitoring and Management on Amazon EKS

One of the installation methods we support for our database software is through Helm. We have a collection of Helm charts, in this repository, for the following Percona software:

Through this blog post, you will learn how to install Percona Monitoring and Management in a Kubernetes cluster on Amazon EKS using Helm and eksctl.

Requirements

For the installation of PMM with Helm, you will need:

  • Kubernetes 1.22+
  • Helm 3.2.0+
  • PV (Persistent Volume) provisioner support in the underlying infrastructure

If you want to install PMM in a Kubernetes cluster on Amazon EKS, you can use eksctl to create the required cluster. I published this blog post Creating a Kubernetes cluster on Amazon EKS with eksctl, in the Percona Community blog, where I explain how to use this tool.

For an easy way to deploy the Kubernetes cluster, check Percona My Database as a Service (MyDBaaS), a Percona Labs project. I also recommend checking this article Percona Labs Presents: Infrastructure Generator for Percona Database as a Service (DBaaS) on our blog, where the process of creating the cluster is described.

MyDBaaS will help you with cluster creation. It will generate the configuration file needed for eksctl, or it can deploy the cluster to AWS.

To use eksctl you must:

  • Install kubectl
  • Create a user with minimum IAM policies
  • Create an access key ID and secret access key for the previously created user
  • Install the AWS CLI and configure authentication (aws configure, from the command line)
  • Install eksctl

Create a Kubernetes cluster

To create the cluster, you need to generate the configuration file for eksctl. Go to https://mydbaas.labs.percona.com/ and fill out the details of your cluster.

MyDBaaS - Kubernetes Cluster Configuration

Figure 1: MyDBaaS – Kubernetes Cluster Configuration

  1. Give your cluster a name. MyDbaas is the default value.
  2. Select the number of nodes. The default value is three, but you can create a cluster of up to five nodes.
  3. Enter the instance type.
  4. Select the region. The default is us-west-2.

If you don’t know what instance type to use, go to the Instance Type Selector and select:

  • Number of CPUs
  • Memory size
  • Region
Instance Type Selector

Figure 2: MyDBaaS – Instance Type Selector

As stated on the website, this tool will only return the configuration file needed for eksctl. You can also provide your AWS credentials for the tool to deploy the EKS cluster.

After filling out the details, click on Submit and you will get the configuration file that will look like this:

addons:
- name: aws-ebs-csi-driver
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: MyDbaas
  region: us-east-1
nodeGroups:
- desiredCapacity: 2
  iam:
    withAddonPolicies:
      ebs: true
      efs: true
      fsx: true
  instanceType: m5.large
  maxSize: 5
  minSize: 1
  name: ng-1
  preBootstrapCommands:
  - echo 'OPTIONS="--default-ulimit nofile=1048576:1048576"' >> /etc/sysconfig/docker
  - systemctl restart docker
  volumeSize: 100
  volumeType: gp2

Then, create the cluster by running the following command:

eksctl create cluster -f cluster.yaml

While running, eksctl will create the cluster and all the necessary resources. It will take a few minutes to complete.

eksctl Running

Figure 3: eksctl Running

Cluster credentials can be found in ~/.kube/config. Try kubectl get nodes to verify that this file is valid, as suggested by eksctl.

Install PMM with Helm

Once the cluster has been created and configured, you can install PMM using Helm.

  1. Add the repository:
helm repo add percona https://percona.github.io/percona-helm-charts/

  2. Install PMM:
helm install pmm --set service.type="LoadBalancer" percona/pmm
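
If you want to see what else can be customized (storage size, admin password, image tags, and so on), Helm can print the chart defaults; this is a generic Helm capability, and the available keys depend on the chart version:

# Inspect the configurable defaults of the PMM chart.
helm show values percona/pmm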

Once the PMM server is running, you must get its external address. Run the following command to get it.

kubectl get services monitoring-service

You will get an output similar to the following.

NAME                 TYPE           CLUSTER-IP      EXTERNAL-IP                                                              PORT(S)                      AGE
monitoring-service   LoadBalancer   10.100.17.121   a57d50410ca2f4c9d9b029da8f44f73f-254463297.us-east-1.elb.amazonaws.com   443:30252/TCP,80:31201/TCP   100s

a57d50410ca2f4c9d9b029da8f44f73f-254463297.us-east-1.elb.amazonaws.com is the external address (the EXTERNAL-IP column) and the one you need to access PMM from the browser.

Before accessing the dashboard of PMM, get the password.

export ADMIN_PASS=$(kubectl get secret pmm-secret --namespace default -o jsonpath='{.data.PMM_ADMIN_PASSWORD}' | base64 --decode)
echo $ADMIN_PASS

The value of $ADMIN_PASS is the password you need to log into the dashboard. The default user is admin.

Go to the browser and paste the external address in the address bar.

PMM running on Kubernetes

Figure 4: PMM running on Kubernetes

Now you have PMM running in the cloud on Amazon EKS.

I recommend you check this article Percona Monitoring and Management in Kubernetes is now in Tech Preview on our blog for more information about PMM in Kubernetes using Helm.

Conclusion

Through this blog post, you learned how to create a Kubernetes cluster with eksctl and deploy PMM using Helm with the help of Percona MyDBaaS. The process would be the same for any of the Percona software in the collection of Helm charts.

Learn more about Percona MyDBaaS!

Mar
20
2023
--

Comparisons of Proxies for MySQL

mysql proxy

With a special focus on Percona Operator for MySQL

Overview

HAProxy, ProxySQL, MySQL Router (AKA MySQL Proxy); in the last few years, I have had to answer many times the question of which proxy to use and in which scenario. When designing an architecture, many components need to be considered before deciding on the best solution.

When deciding what to pick, there are many things to consider, like where the proxy needs to be, whether it “just” needs to redirect the connections, whether more features are needed, like caching and filtering, or whether it needs to be integrated with some MySQL embedded automation.

Given that, there has never been a single straight answer. Instead, an analysis needs to be done. Only after a better understanding of the environment, the needs, and the evolution the platform needs to achieve is it possible to decide what the better choice will be.

However, recently we have seen an increase in the usage of MySQL on Kubernetes, especially with the adoption of Percona Operator for MySQL. In this case, we have a quite well-defined scenario that can resemble the image below:

MySQL on Kubernetes

In this scenario, the proxies must sit inside Pods, balancing the incoming traffic from the Service LoadBalancer connecting with the active data nodes.

Their role is merely to be sure that any incoming connection is redirected to nodes that can serve them, which includes having a separation between Read/Write and Read Only traffic, a separation that can be achieved, at the service level, with automatic recognition or with two separate entry points.
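
In practice, this read/write split usually surfaces as separate Kubernetes Services in front of the proxy Pods. A hedged illustration; the namespace is a placeholder, and the Service names vary by Operator and cluster name:

# List the proxy-facing Services of an Operator-managed MySQL cluster
# (namespace is a placeholder; names depend on the Operator and cluster name).
kubectl get svc -n mysql-cluster | grep -Ei 'haproxy|proxysql|router'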

In this scenario, it is also crucial to be efficient in resource utilization and to scale with frugality. In this context, features like filtering, firewalling, or caching are redundant and may consume resources that could be allocated to scaling. Those features also work better outside the K8s/Operator cluster: the closer they are located to the application, the better they serve.

About that, we must always remember the concept that each K8s/Operator cluster needs to be seen as a single service, not as a real cluster. In short, each cluster is, in reality, a single database with high availability and other functionalities built in.

Anyhow, we are here to talk about Proxies. Once we have defined that we have one clear mandate in mind, we need to identify which product allows our K8s/Operator solution to:

  • Scale to the maximum number of incoming connections
  • Serve the requests with the highest efficiency
  • Consume as few resources as possible

The environment

To identify the above points, I have simulated a possible K8s/Operator environment, creating:

  • One powerful application node, where I run sysbench read-only tests, scaling from two to 4096 threads (Type c5.4xlarge)
  • Three mid-size data nodes with several gigabytes of data, running MySQL and Group Replication (Type m5.xlarge)
  • One proxy node running on a resource-limited box (Type t2.micro)

The tests

We will have very simple test cases. The first one aims to define the baseline, identifying the moment when we hit the first level of saturation due to the number of connections. In this case, we will increase the number of connections and keep a low number of operations.

The second test will define how well the increasing load is served inside the previously identified range.

For documentation, the sysbench commands are:

Test1

sysbench ./src/lua/windmills/oltp_read.lua  --db-driver=mysql --tables=200 --table_size=1000000 
 --rand-type=zipfian --rand-zipfian-exp=0 --skip_trx=true  --report-interval=1 --mysql-ignore-errors=all 
--mysql_storage_engine=innodb --auto_inc=off --histogram  --stats_format=csv --db-ps-mode=disable --point-selects=50 
--reconnect=10 --range-selects=true --rate=100 --threads=<#Threads from 2 to 4096> --time=1200 run

Test2

sysbench ./src/lua/windmills/oltp_read.lua  --mysql-host=<host> --mysql-port=<port> --mysql-user=<user> 
--mysql-password=<pw> --mysql-db=<schema> --db-driver=mysql --tables=200 --table_size=1000000  --rand-type=zipfian 
--rand-zipfian-exp=0 --skip_trx=true  --report-interval=1 --mysql-ignore-errors=all --mysql_storage_engine=innodb 
--auto_inc=off --histogram --table_name=<tablename>  --stats_format=csv --db-ps-mode=disable --point-selects=50 
--reconnect=10 --range-selects=true --threads=<#Threads from 2 to 4096> --time=1200 run

Results

Test 1

As indicated here, I was looking to identify when the first proxy would reach a point where the load became unmanageable. The load is all in creating and serving the connections, while the number of operations is capped at 100. 

As you can see, and as I was expecting, the three proxies behaved more or less the same, serving the same number of operations (they were capped, so why not), until they didn’t.

MySQL Router, after 2048 connections, could not serve anything more.

NOTE: MySQL Router actually stopped working at 1024 threads, but using version 8.0.32, I enabled the feature: connection_sharing. That allows it to go a bit further.  

Let us also take a look at the latency:

latency threads

Here the situation starts to be a little bit more complicated. MySQL Router is the one that has the higher latency no matter what. However, HAProxy and ProxySQL have interesting behavior. HAProxy performs better with a low number of connections, while ProxySQL performs better when a high number of connections is in place.  

This is due to the multiplexing and the very efficient way ProxySQL uses to deal with high load.

Everything has a cost:

HAProxy is definitely using fewer user CPU resources than ProxySQL or MySQL Router …

HAProxy

.. we can also notice that HAProxy barely reaches, on average, a 1.5 CPU load, while ProxySQL is at 2.50 and MySQL Router is around 2. 

To be honest, I was expecting something like this, given ProxySQL’s need to handle the connections and the other basic routing. What was a surprise instead was MySQL Router: why does it have a higher load?

Brief summary

This test highlights that HAProxy and ProxySQL can reach a number of connections higher than the slowest runner in the game (MySQL Router). It is also clear that traffic is better served under a high number of connections by ProxySQL, but it requires more resources. 

Test 2

When the going gets tough, the tough get going

Let’s remove the –rate limitation and see what will happen. 

mysql events

The scenario with load changes drastically. We can see how HAProxy can serve the connection and allow the execution of more operations for the whole test. ProxySQL is immediately after it and behaves quite well, up to 128 threads, then it just collapses. 

MySQL Router never takes off; it always stays below the 1k reads/second, while HAProxy served 8.2k and ProxySQL 6.6k.

mysql latency

Looking at the latency, we can see that HAProxy’s latency gradually increased, as expected, while ProxySQL and MySQL Router just shot up from 256 threads on. 

Note that both ProxySQL and MySQL Router could not complete the tests with 4096 threads.

ProxySQL and MySQL Router

Why? HAProxy always stays below 50% CPU, no matter the increasing number of threads/connections, scaling the load very efficiently. MySQL Router almost immediately reached the saturation point, being affected by both the number of threads/connections and the number of operations. That was unexpected, given we do not have level 7 capability in MySQL Router.

Finally, ProxySQL, which was working fine up to a certain limit, reached the saturation point and could not serve the load. I am saying load because ProxySQL is a level 7 proxy and is aware of the content of the load. Given that, on top of multiplexing, additional resource consumption was expected.   

proxysql usage

Here we just have a clear confirmation of what was already said above, with 100% CPU utilization reached by MySQL Router with just 16 threads, and ProxySQL way after at 256 threads.

Brief summary

HAProxy comes up as the champion in this test; there is no doubt that it could scale the increasing load in connection without being affected significantly by the load generated by the requests. The lower consumption in resources also indicates the possible space for even more scaling.

ProxySQL was penalized by the limited resources, but that was the game: we had to get the most out of the few resources available. This test indicates that it is not optimal to use ProxySQL inside the Operator; it is the wrong choice if low resource consumption and scalability are a must.    

MySQL Router was never in the game. Unless it undergoes a serious refactoring, MySQL Router is designed for very limited scalability; as such, the only way to adopt it is to run many of them at the application node level. Utilizing it close to the data nodes in a centralized position is a mistake.  

Conclusions

I started showing an image of how the MySQL service is organized and want to close by showing the variation that, for me, is the one to be considered the default approach:

MySQL service is organized

This highlights that we must always choose the right tool for the job. 

The Proxy in architectures involving MySQL/Percona Server for MySQL/Percona XtraDB Cluster is a crucial element for the scalability of the cluster, no matter if using K8s or not. Choosing the one that serves us better is important, which can sometimes be ProxySQL over HAProxy. 

However, when talking about K8s and Operators, we must recognize the need to optimize resource usage for the specific service. In that context, there is no discussion about it: HAProxy is the best solution and the one we should go with. 

My final observation is about MySQL Router (aka MySQL Proxy). 

Unless there is a significant refactoring of the product, at the moment, it is not even close to what the other two can do. From the tests done so far, it requires a complete reshaping, starting with identifying why it is so affected by the load coming from the queries more than by the load coming from the connections.   

Great MySQL to everyone. 


Mar
16
2023
--

Percona Labs Presents: Infrastructure Generator for Percona Database as a Service (DBaaS)

Percona MyDBaaS

Let’s look at how you can run Percona databases on Kubernetes, the easy way.

Chances are that if you are using the latest Percona Monitoring and Management (PMM) version, you have seen the availability of the new Percona Database as a Service (DBaaS). If not, go and get a glimpse of the fantastic feature with our docs on DBaaS – Percona Monitoring and Management.

Now, if you like it and wanna give it a try (yay!), but you don’t wanna deal with Kubernetes (nay), no worries; we have a tool for you. Introducing the Percona DBaaS Infrastructure Creator, or Percona My Database as a Service (MyDBaaS).

My Database as a Service

First, a clarification: this tool is focused on running your DBaaS on Amazon Web Services (AWS).

percona My Database as a Service

At the time of the writing of this article, the PMM DBaaS supported a limited number of cloud vendors, AWS being one of them. This tool creates the infrastructure in an AWS account.

The usage is pretty simple. You have three features:

  • An instance selector: In case you don’t know what instance type to use, but you do know how many CPUs and how much memory your DBs require.
  • A Cluster Creator: This is where you decide on some minimal and basic properties of the cluster and deploy it.
  • Deploy a Percona Operator: You can choose from our three Operators for Kubernetes to be deployed on the same infrastructure you are creating.

Instance selector

Currently, AWS has 270 different instance types available. Which one to use? The instance selector will help you with that. Just pass the number of vCPUs, amount of memory, and the region you intend to use, and it will show a list of EKS-suitable EC2 instances.

percona my database Instance selector

Why ask for the region? Costs! Instance types have different costs depending on the region they are in, and costs are something the tool will show you.

Cluster creator

A very basic interface: you only need to pass the name of the cluster, the number of desired nodes, the instance type, and the region where you would like to run the cluster.

percona my database cluster

If you pass your AWS key/secret, the tool will take care of deploying the cluster, and once it is done, it will return the contents of the Kubeconfig file for you to use in the DBaaS of PMM.

A note on how we handle your AWS key/secret

Percona will never store this information. In the particular case of this tool, we do set the following environment variables:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_DEFAULT_REGION

After the creation of the cluster, the environment variables are unset.

Now, why is there an option to not pass the AWS credentials? Well, for security, of course. 

Under the hood, the cluster is created using EKSCTL, a “CloudFormation stack creator.” 

If you are proficient enough with eksctl, you can just copy/paste the created YAML and run the eksctl command on your own server without ever sharing your credentials. Now, for DBaaS, you still have to provide them, but that’s another topic.

using EKSCTL

If you choose to pass the AWS credentials, an EKS cluster will be deployed. The outcome will be the contents of the kubeconfig file.

kubeconfig file Percona

With that in place, one can now go to PMM and register the Kubernetes cluster as described in https://docs.percona.com/percona-monitoring-and-management/dbaas/get-started.html#add-a-kubernetes-cluster to deploy the DBaaS in there.

But there’s more:

Percona Operators

Now, there’s an additional option: deploying a Percona Operator. Currently, PMM DBaaS focuses on the Percona Operator for MySQL based on Percona XtraDB Cluster. With the MyDBaaS tool, you can choose from three operators available.

To have the full deployment, you need to check both AWS credentials and deploy a Percona Operator.

deploy a Percona Operator

At the end of the installation, you will have the contents of the Kubeconfig file, plus:

  • The password of the root user 
  • Instructions on how to connect to it

installing percona operator for MySQL

After this, you can use a Percona database running on Kubernetes. Have fun!

About Percona Labs

Percona Labs is a place for the community to have a voice on how we approach solutions, curated by Percona’s CTO, Vadim Tkachenko. Feedback is always welcome:

Community Forum

Mar
15
2023
--

Automating Physical Backups of MongoDB on Kubernetes

MongoDB

We at Percona talk a lot about how Kubernetes Operators automate the deployment and management of databases. Operators seamlessly handle lots of Kubernetes primitives and database configuration bits and pieces, all to remove toil from operation teams and provide a self-service experience for developers.

Today we want to take you backstage and show what is really happening under the hood. We will review technical decisions and obstacles we faced when implementing physical backup in Percona Operator for MongoDB version 1.14.0. The feature is now in technical preview.

The why

Percona Server for MongoDB can handle petabytes of data. The users of our Operators want to host huge datasets in Kubernetes too. However, using a logical restore recovery method takes a lot of time and can lead to SLA breaches.

Our Operator uses Percona Backup for MongoDB (PBM) as a tool to backup and restore databases. We have been leveraging logical backups for quite some time. Physical backups in PBM were introduced a few months ago, and, with those, we saw significant improvement in recovery time (read more in this blog post, Physical Backup Support in Percona Backup for MongoDB):

physical backup mongodb

So if we want to reduce the Recovery Time Objective (RTO) for big data sets, we must provide physical backups and restores in the Operator.

The how

Basics

When you enable backups in your cluster, the Operator adds a sidecar container to each replset pod (including the Config Server pods if sharding is enabled) to run pbm-agent. These agents listen for PBM commands and perform backups and restores on the pods. The Operator opens connections to the database to send PBM commands to its collections or to track the status of PBM operations.
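
One quick way to see this wiring in a running cluster is to list the containers of a replset Pod. A hedged sketch; the Pod name below follows the default naming from our example Custom Resource and is an assumption:

# List the containers of a replica set Pod; with backups enabled you should
# see the PBM agent sidecar next to mongod (Pod name is an assumption).
kubectl get pod my-cluster-name-rs0-0 \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{end}'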

kubernetes backup mongodb

Backups

Enabling physical backups was fairly straightforward. The Operator already knew how to start a logical backup, so we just passed a single field to the PBM command if the requested backup was physical.

Unlike logical backups, PBM needs direct access to mongod’s data directory, and we mounted the persistent volume claim (PVC) on PBM sidecar containers. With these two simple changes, the Operator could perform physical backups. But whether you know it from theory or from experience (ouch!), a backup is useless if you cannot restore it.

Restores

Enabling physical restores wasn’t easy. PBM had two major limitations when it came to physical restores:

1. PBM stops the mongod process during the restore.

2. It runs a temporary mongod process after the backup files are copied to the data directory.

What makes physical restores a complex feature is dealing with the implications of these two constraints.

PBM kills the mongod process

PBM kills the mongod process in the container, which means the operator can’t query PBM collections from the database during the restore to track their status. This forced us to use the PBM CLI tool to control the PBM operations during the physical restore instead of opening a connection to the database.

Note: We are now considering using the same approach for every PBM operation in the Operator. Let us know what you think.

Another side effect of PBM killing mongod was restarting the mongod container. The mongod process runs as PID 1 in the container, and when you kill it, the container restarts. But the PBM agent needs to keep working after mongod is killed to complete the physical restore. For this, we have created a special entrypoint to be used during the restore. This special entrypoint starts mongod and just sleeps after mongod has finished.

Temporary mongod started by PBM

When you start a physical restore, PBM does approximately the following:

1. Kills the mongod process and wipes the data directory.

2. Copies the backup files to the data directory.

3. Starts a temporary mongod process listening on a random port to perform operations on oplog.

4. Kills the mongod and PBM agent processes.

For a complete walkthrough, see Physical Backup Support in Percona Backup for MongoDB.

From the Operator’s point of view, the problem with this temporary mongod process is its version. We support MongoDB 4.4, MongoDB 5.0, and now MongoDB 6.0 in the Operator, and the temporary mongod needs to be the same version as the MongoDB running in the cluster. This means we either include mongod binary in PBM docker images and create a separate image for each version combination or find another solution.

Well… we found it: We copy the PBM binaries into the mongod container before the restore. This way, the PBM agent can kill and start the mongod process as often as it wants, and it’s guaranteed to be the same version as the cluster.

Tying it all together

By now, it may seem like I have thrown random solutions to each problem. PBM, CLI, special entrypoint, copying PBM binaries… How does it all work?

When you start a physical restore, the Operator patches the StatefulSet of each replset to:

1. Switch the mongod container entrypoint to our new special entrypoint: It’ll start pbm-agent in the background and then start the mongod process. The script itself will run as PID 1, and after mongod is killed, it’ll sleep forever.

2. Inject a new init container: This init container will copy the PBM and pbm-agent binaries into the mongod container.

3. Remove the PBM sidecar: We must ensure that only one pbm-agent is running in each pod.

Until all StatefulSets (except mongos) are updated, the PerconaServerMongoDBRestore object is in a “waiting” state. Once they’re all updated, the operator starts the restore, and the restore object goes into the requested state, then the running state, and finally, the ready or error state.
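
You can follow this state machine from the command line while the restore runs. A hedged sketch; the restore object name is a placeholder, and the printed columns depend on the Operator version:

# Watch the restore object transition through waiting, requested, running,
# and finally ready or error (restore name is a placeholder).
kubectl get perconaservermongodbrestore my-restore -w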

Conclusion

Operators simplify the deployment and management of applications on Kubernetes. Physical backups and restores are quite a popular feature for MongoDB clusters. Restoring from physical backup with Percona Backup for MongoDB has multiple manual steps. This article shows what kind of complexity is hidden behind this seemingly simple step.

 

Try Percona Operator for MongoDB

Mar
01
2023
--

Reduce Your Cloud Costs With Percona Kubernetes Operators

cut costs percona operators

Public cloud spending is slowing down. Quarter-over-quarter growth is no longer hitting 30% gains for AWS, Google, and Microsoft. This is businesses’ response to tough and uncertain macroeconomic conditions, where organizations scrutinize their public cloud spending to optimize and adjust.

In this blog post, we will see how running databases on Kubernetes with Percona Operators can reduce your cloud bill when compared to using AWS RDS.

Inputs

These are the following instances that we will start with:

  • AWS
  • RDS for MySQL in us-east-1
  • 10 x db.r5.4xlarge
  • 200 GB storage each

The cost of RDS consists mostly of two things – compute and storage. We will not consider data transfer or backup costs in this article.

  • db.r5.4xlarge – $1.92/hour or $46.08/day or $16,819.20/year
  • For ten instances, it will be $168,192 per year
  • Default gp2 storage is $0.115 per GB-month, for 200 GB, it is $22.50/month or $270 per year
  • For ten instances, it will be $2,700 per year

Our total estimated annual cost is $168,192 + $2,700 = $170,892.

Let’s see how much we can save.

Cutting costs

Move to Kubernetes

If we take AWS EKS, the following components are included in the cost:

  1. Control plane
  2. Compute – EC2, where we will have a dedicated Kubernetes node per database node
  3. Storage – EBS

The EKS control plane costs $0.10/hour per cluster, or $876 per year.

For EC2, we will use the same r5.4xlarge instance, which is $1.008/hour or $8,830.08 per year.

EBS gp2 storage is $0.10 per GB-month, for 200GB, it is $240 per year.

For ten instances, our annual cost is:

  • Compute – $88,300.80
  • Storage – $2,880
  • EKS control plane – $876

Total: $92,056.80

By just moving from RDS to Kubernetes and Operators, you will save $78,835.20.

Why Operators?

It might seem that this and the following techniques can easily be executed without Kubernetes and Operators by just leveraging EC2 instances. “Why do I need Operators?” would be the right question to ask.

The usual blocker on the way to cost savings is management complexity. Managed database services like RDS remove a lot of toil from operation teams and provide self-service for developers. Percona Operators do exactly the same thing on Kubernetes, enabling your teams to innovate faster but without an extremely high price tag.

Kubernetes itself brings new challenges and complexity. Managed Kubernetes services like EKS or GKE do a good job of removing part of these, but not all. Keep this in mind when you make the decision to cut costs with Operators.

Spot instances

Spot instances (or Preemptible VMs at GCP) use the spare compute capacity of the cloud. They come with no availability guarantee but are available at a much lower price – up to an 80% discount.

Discount is great, but what about availability? You can successfully leverage spot instances with Kubernetes and databases, but you must have a good strategy behind it. For example:

  1. Your production database should be running in a clustered mode. For the Percona XtraDB Cluster, we recommend having at least three nodes.
  2. Run in multiple Availability Zones to minimize the risk of massive spot instance interruptions.
  3. In AWS, you can keep some percentage of your Kubernetes nodes on-demand to guarantee the availability of your database clusters and applications.

You can move the levers between your availability requirements and cost savings. Services like spot.io and tools like aws-node-termination-handler can enhance your availability greatly with spot instances.
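
As a concrete starting point, spot capacity can be added as a separate node group next to your on-demand nodes. A hedged sketch with eksctl; the cluster name, instance types, and sizes are placeholders:

# Add a spot-backed managed node group alongside on-demand nodes
# (cluster name, instance types, and sizes are placeholders).
eksctl create nodegroup --cluster MyDbaas --name ng-spot \
  --spot --instance-types m5.xlarge,m5a.xlarge,m4.xlarge \
  --nodes 3 --nodes-min 1 --nodes-max 5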

Leverage container density

Some of your databases are always utilized, and some are not. With Kubernetes, you can leverage various automation scaling techniques to pack containers densely and reduce the number of instances.

  1. Vertical Pod Autoscaler (VPA) or Kubernetes Event-driven Autoscaling (KEDA) can help to reduce the number of requested resources for your containers based on real utilization.
  2. Cluster Autoscaler will remove the nodes that are not utilized. 

Figure 1 shows an example of this tandem, where Cluster Autoscaler terminates one EC2 instance after rightsizing the container resources.

Figure 1: Animated example showing the Kubernetes autoscaler terminating an unused EC2 instance

Local storage

Databases tend to consume storage, and usually, it grows over time. It gets even trickier when your performance requirements demand expensive storage tiers like io2, where significant spending lands on provisioned IOPs. Some instances at cloud providers come with local NVMe SSDs. It is possible to store data on these local disks instead of EBS volumes, thus reducing your storage costs.

For example, r5d.4xlarge has 2 x 300 GB of built-in NVMe SSDs. It can be enough to host a database.

The use of local storage comes with caveats:

  1. Instances with local SSDs are approximately 10% more expensive
  2. Like spot nodes, you have two levers – cost and data safety. Clustering and proper backups are essential.

Solutions like OpenEBS or Ondat leverage local storage and, at the same time, simplify operations. We have written many posts about Percona Operators running with OpenEBS:

Conclusion

Cloud spending grows with your business. Simple actions like rightsizing or moving to new instance generations bring short-lived savings. The strategies described in this article can significantly cut your cloud bill, while Operators and Kubernetes keep the complexity away.

Try out Percona Operators today!
