Dec 29, 2022

Missing Library: A pg_upgrade History


As DBAs, we perform many regular tasks, and one of them is upgrading our database systems. There are several techniques to perform a PostgreSQL database upgrade, such as data dump and import, logical replication, or an in-place upgrade using pg_upgrade.

The last option is a good choice when you can afford some downtime and want to reuse the same server where your current instance is running.

The process using pg_upgrade is well documented, and you can easily find the instructions with a little googling.

However, there are a few occasions when you face some unusual conditions that require additional research. In this blog post, I will show you a particular example I faced when working with an upgrade exercise using pg_upgrade.

As a regular procedure, the following steps were followed:

  • Install, for the new version, the same PostgreSQL packages you already have installed for the current version.
  • Run pg_upgrade with the --check flag to verify the compatibility between the current and the new version clusters.
  • If all is good, perform the upgrade by removing the --check flag.

You might consider some extra steps, such as verifying the existence of GIN/GiST indexes or unknown data types, or planning the upgrade of any standby servers using the rsync approach. To keep this example simple, these are out of this blog's scope.

Ideally, all three of the above steps execute with no issues, and you get your new PostgreSQL version up and running.

There are some situations where the second step, the --check one, fails, and you need to investigate and fix the cause before moving forward. A common case is that you missed installing a required package for the new version. You will end up with something like the following:

Performing Consistency Checks on Old Live Server
------------------------------------------------
Checking cluster versions                                   ok
Checking database user is the install user                  ok
Checking database connection settings                       ok
Checking for prepared transactions                          ok
Checking for system-defined composite types in user tables  ok
Checking for reg* data types in user tables                 ok
Checking for contrib/isn with bigint-passing mismatch       ok
Checking for tables WITH OIDS                               ok
Checking for invalid "sql_identifier" user columns          ok
Checking for presence of required libraries                 fatal

In this case, the pg_upgrade --check command will provide a hint:

Your installation references loadable libraries that are missing from the
new installation.  You can add these libraries to the new installation,
or remove the functions using them from the old installation.  A list of
problem libraries is in the file:
    loadable_libraries.txt

For example, when checking the content of the loadable_libraries.txt file, you might see something like the following:

$ cat loadable_libraries.txt
could not load library "$libdir/pg_repack": ERROR: could not access file "$libdir/pg_repack": No such file or directory

You can immediately identify the missing package as pg_repack, so you just need to install it for the new version.

Well, sometimes this information is not quite as "evident." Let's review such a case.

Example case

When working with the upgrade exercise, the goal was to move from PostgreSQL 11 to PostgreSQL 12. This was a specific requirement, but the situation could also happen when upgrading to a different version.

Following the main process, I installed the same packages for version 12 as were already present for version 11 and verified them:

ii  postgresql-11                   11.18-1.pgdg20.04+1               amd64        The World's Most Advanced Open Source Relational Database
ii  postgresql-12                   12.13-1.pgdg20.04+1               amd64        The World's Most Advanced Open Source Relational Database
ii  postgresql-common               246.pgdg20.04+1                   all          PostgreSQL database-cluster manager
ii  postgresql-client-11            11.18-1.pgdg20.04+1               amd64        front-end programs for PostgreSQL 11
ii  postgresql-client-12            12.13-1.pgdg20.04+1               amd64        front-end programs for PostgreSQL 12
ii  postgresql-client-common        246.pgdg20.04+1                   all          manager for multiple PostgreSQL client versions

Looking good. Now we can execute the second step and verify the clusters' compatibility.

[Screenshot: pg_upgrade --check output failing on the required-libraries check]

Hmm, something is missing. Let's check what it could be.

OK, a "randomness" library doesn't sound like a usual library, but we can review some details to track it down.

I could have missed some packages for the new version; let's check again.

[Screenshot: installed PostgreSQL packages for versions 11 and 12]

The same packages are installed for both versions.

The next idea is to verify the extensions we have installed in the source database.

[Screenshot: extensions installed in the source database]
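In case it is useful, the list in that screenshot can be reproduced with a simple catalog query like the following (a sketch; the original post showed this step only as an image):

-- List the extensions installed in the source database
SELECT extname, extversion FROM pg_extension ORDER BY extname;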

Here, nothing looks “related” to the randomness library.

Just to double-check, we can even try to find the related libraries in the PostgreSQL $libdir directory, for example, with the following bash snippet.

for lib in $(psql -qtA pgbench -c"select extname||'.so' from pg_extension") ; do find $(<pg version bin path>/pg_config --pkglibdir) -name $lib -type f ; done

I checked this for version 11 and ran the same check for version 12; the expected extension libraries were present in both.

Everything seems OK, so where is the randomness library coming from?

There is something I am missing. I tried pg_dump with the --binary-upgrade flag to review some other details regarding the extensions:

pg_dump -C --schema-only --quote-all-identifiers --binary-upgrade dbname=<DATABASE_NAME> -f dump_binary_upgrade.sql

From the PostgreSQL documentation we can see the following statement about this flag:

--binary-upgrade

This option is for use by in-place upgrade utilities. Its use for other purposes is not recommended or supported. The behavior of the option may change in future releases without notice.

But the intention here is not to restore; it's just to take a look at the extra information we can get.

The --binary-upgrade option produces a "verbose" output related to the extensions. The CREATE EXTENSION IF NOT EXISTS commands are replaced with DROP EXTENSION IF EXISTS ... and SELECT pg_catalog.binary_upgrade_create_empty_extension() calls.

Also, you will see a section with the explicit creation of the C language functions for the extensions. Here is an example for the pg_stat_statements extension:

--
-- Name: pg_stat_statements_reset(); Type: FUNCTION; Schema: public; Owner: postgres
--

CREATE FUNCTION "public"."pg_stat_statements_reset"() RETURNS "void"
    LANGUAGE "c" PARALLEL SAFE
    AS '$libdir/pg_stat_statements', 'pg_stat_statements_reset';

-- For binary upgrade, handle extension membership the hard way
ALTER EXTENSION "pg_stat_statements" ADD FUNCTION "public"."pg_stat_statements_reset"();


ALTER FUNCTION "public"."pg_stat_statements_reset"() OWNER TO "postgres";

Looking at the following line:

    AS '$libdir/pg_stat_statements', 'pg_stat_statements_reset';

You can identify the physical library ($libdir/pg_stat_statements) used to build the Postgres function and the C function (‘pg_stat_statements_reset‘) within that library.

For some extra good material about writing extensions in PostgreSQL, you can take a look at the blog post Writing PostgreSQL Extensions is Fun – C Language; it is really worth it.

Well, while reviewing the extension definitions, I saw the following one:

--
-- Name: bytea_size("bytea"); Type: FUNCTION; Schema: public; Owner: postgres
--

CREATE FUNCTION "public"."bytea_size"("bytea") RETURNS integer
    LANGUAGE "c" STRICT
    AS 'randomness', 'bytea_size';


ALTER FUNCTION "public"."bytea_size"("bytea") OWNER TO "postgres";

Aha! There is the randomness library call! A custom C function is using it.

We should remember that PostgreSQL includes support for some procedural languages in its base distribution, such as PL/pgSQL, PL/Tcl, PL/Perl, and PL/Python. Having C functions is also perfectly supported; however, it is not what we might call usual.
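As a side note, if you want to spot custom C functions and the libraries they point to without generating a full binary-upgrade dump, a catalog query along these lines can help (a sketch using the standard pg_proc, pg_namespace, and pg_language catalogs; it is not part of the original procedure):

-- List C-language functions and the shared libraries referenced in their definitions
-- (probin stores the library given in CREATE FUNCTION ... AS 'library', 'symbol')
SELECT n.nspname AS schema,
       p.proname AS function,
       p.probin  AS library
FROM pg_proc p
JOIN pg_namespace n ON n.oid = p.pronamespace
JOIN pg_language l ON l.oid = p.prolang
WHERE l.lanname = 'c'
ORDER BY 1, 2;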

As described in the PostgreSQL documentation, there are a few paths where the libraries can be placed:

The following algorithm is used to locate the shared object file based on the name given in the CREATE FUNCTION command:

1. If the name is an absolute path, the given file is loaded.
2. If the name starts with the string $libdir, that part is replaced by the PostgreSQL package library directory name, which is determined at build time.
3. If the name does not contain a directory part, the file is searched for in the path specified by the configuration variable dynamic_library_path.
4. Otherwise (the file was not found in the path, or it contains a non-absolute directory part), the dynamic loader will try to take the name as given, which will most likely fail. (It is unreliable to depend on the current working directory.)

In this case, I searched all these paths and did not find the randomness library anywhere. It seems somebody was playing around building a C function and, at some point, removed the library but left the Postgres function behind (this was a DEV environment).

After confirming this "public"."bytea_size" function was not used, the solution was to DROP it and then retry the pg_upgrade --check command, as sketched below.
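A minimal sketch of that cleanup, with the function signature taken from the dump output above:

-- Drop the orphaned C function that references the missing library
DROP FUNCTION IF EXISTS "public"."bytea_size"("bytea");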


After this point, the pg_upgrade was successful.

Conclusion

As we all already know, PostgreSQL is really powerful, and its core capabilities can be extended with custom functions and extensions. Writing your own C language functions requires some extra knowledge, but you can definitely get very interesting results.

Remember, when a custom C function is created, PostgreSQL itself doesn't compile the C code (that must be done beforehand); the CREATE FUNCTION command is recorded literally in the system catalogs. So the referenced library will be looked up again on any later load, including during pg_upgrade.

Dec 28, 2022

A Useful GTID Feature for Migrating to MySQL GTID Replication – ASSIGN_GTIDS_TO_ANONYMOUS_TRANSACTIONS

In managed services, we get requests to migrate from traditional to GTID-based replication. However, the customer does not want to first enable GTID on the source node (production). Before MySQL 8.0.23, replication from a GTID-disabled source to a GTID-enabled replica was impossible.

In this blog, I will talk about a new MySQL feature introduced in 8.0.23, which allows MySQL to replicate from a GTID-disabled source to a GTID-enabled replica. You can enable GTID assignment on a replication channel using the ASSIGN_GTIDS_TO_ANONYMOUS_TRANSACTIONS option.

You can read the Percona doc if you are unaware of how GTID replication works:

https://www.percona.com/doc/percona-server/5.6/flexibility/online_gtid_deployment.html

Note: The replica must have gtid_mode=ON set, which cannot be changed afterward unless you remove the ASSIGN_GTIDS_TO_ANONYMOUS_TRANSACTIONS setting. 
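If you later need to change gtid_mode back, the option has to be removed first; a minimal sketch (the replication channel must be stopped while changing it; this snippet is not from the original post):

STOP REPLICA;
CHANGE REPLICATION SOURCE TO ASSIGN_GTIDS_TO_ANONYMOUS_TRANSACTIONS = OFF;
START REPLICA;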

Acceptable inputs

OFF: As the name suggests, it keeps this feature turned off.

LOCAL: Uses the local server's UUID (the same value as the server_uuid global variable) when generating GTIDs.

UUID: A valid UUID of your choice, which will be used when generating GTIDs for the replicated anonymous transactions.

Syntax 

CHANGE REPLICATION SOURCE TO ASSIGN_GTIDS_TO_ANONYMOUS_TRANSACTIONS=OFF|LOCAL|<UUID>;

Let’s demonstrate the feature  ASSIGN_GTIDS_TO_ANONYMOUS_TRANSACTIONS.

Initially, we keep the GTID replication disabled on the source and replica node.

Source: GTID replication is disabled 

mysql> select @@enforce_gtid_consistency,@@gtid_mode;
+----------------------------+-------------+
| @@enforce_gtid_consistency | @@gtid_mode |
+----------------------------+-------------+
| OFF                        | OFF         |
+----------------------------+-------------+
1 row in set (0.00 sec)

Replica: GTID replication disabled

mysql> select @@enforce_gtid_consistency,@@gtid_mode;
+----------------------------+-------------+
| @@enforce_gtid_consistency | @@gtid_mode |
+----------------------------+-------------+
| OFF                        | OFF         |
+----------------------------+-------------+
1 row in set (0.00 sec)

Let’s create a table on the source and check binlog events.  

mysql> create table gtid_test1 (id int );
Query OK, 1 row affected (0.01 sec)

mysql> show binlog events in 'binarylogs.000008';
+-------------------+-----+----------------+-----------+-------------+------------------------------------------------------------------+
| Log_name          | Pos | Event_type     | Server_id | End_log_pos | Info                                                             |
+-------------------+-----+----------------+-----------+-------------+------------------------------------------------------------------+
| binarylogs.000008 |   4 | Format_desc    |         3 |         126 | Server ver: 8.0.30-22, Binlog ver: 4                             |
| binarylogs.000008 | 126 | Previous_gtids |         3 |         157 |                                                                  |
| binarylogs.000008 | 157 | Anonymous_Gtid |         3 |         234 | SET @@SESSION.GTID_NEXT= 'ANONYMOUS'                             |
| binarylogs.000008 | 234 | Query          |         3 |         365 | use `test_gtid`; create table gtid_test1 (id int ) /* xid=196 */ |
| binarylogs.000008 | 365 | Anonymous_Gtid |         3 |         444 | SET @@SESSION.GTID_NEXT= 'ANONYMOUS'                             |
| binarylogs.000008 | 444 | Query          |         3 |         524 | BEGIN                                                            |
| binarylogs.000008 | 524 | Table_map      |         3 |         585 | table_id: 101 (test_gtid.gtid_test1)                             |
| binarylogs.000008 | 585 | Write_rows     |         3 |         625 | table_id: 101 flags: STMT_END_F                                  |
| binarylogs.000008 | 625 | Xid            |         3 |         656 | COMMIT /* xid=198 */                                             |
+-------------------+-----+----------------+-----------+-------------+------------------------------------------------------------------+
9 rows in set (0.00 sec)

Replica: binlog events on replica node

| binarylogs.000014 | 1176 | Query          |         3 |        1251 | BEGIN                                                             |
| binarylogs.000014 | 1350 | Xid            |         3 |        1381 | COMMIT /* xid=149 */                                              |
| binarylogs.000014 | 1381 | Anonymous_Gtid |         3 |        1465 | SET @@SESSION.GTID_NEXT= 'ANONYMOUS'                              |
| binarylogs.000014 | 1465 | Query          |         3 |        1596 | use `test_gtid`; create table gtid_test1 (id int ) /* xid=162 */  |
| binarylogs.000014 | 1596 | Anonymous_Gtid |         3 |        1682 | SET @@SESSION.GTID_NEXT= 'ANONYMOUS'                              |
| binarylogs.000014 | 1682 | Query          |         3 |        1757 | BEGIN                                                             |
| binarylogs.000014 | 1757 | Table_map      |         3 |        1818 | table_id: 99 (test_gtid.gtid_test1)                               |
| binarylogs.000014 | 1818 | Write_rows     |         3 |        1858 | table_id: 99 flags: STMT_END_F                                    |
| binarylogs.000014 | 1858 | Xid            |         3 |        1889 | COMMIT /* xid=164 */                                              |
+-------------------+------+----------------+-----------+-------------+-------------------------------------------------------------------+
22 rows in set (0.00 sec)

Let’s enable the ASSIGN_GTIDS_TO_ANONYMOUS_TRANSACTIONS=local on the replica and insert a record.

Note: Make sure that gtid_mode=on before enabling the ASSIGN_GTIDS_TO_ANONYMOUS_TRANSACTIONS=local
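If gtid_mode is still OFF on the replica, it can be walked up one step at a time; a minimal sketch (it assumes the replica's workload is GTID-compatible; this snippet is not from the original post):

SET GLOBAL enforce_gtid_consistency = ON;
SET GLOBAL gtid_mode = OFF_PERMISSIVE;
SET GLOBAL gtid_mode = ON_PERMISSIVE;
SET GLOBAL gtid_mode = ON;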

Replica Node:

mysql> select @@gtid_mode;
+-------------+
| @@gtid_mode |
+-------------+
| ON          |
+-------------+
1 row in set (0.00 sec)

mysql> CHANGE REPLICATION SOURCE TO ASSIGN_GTIDS_TO_ANONYMOUS_TRANSACTIONS=LOCAL;
Query OK, 0 rows affected, 2 warnings (0.03 sec)

mysql> start slave;
Query OK, 0 rows affected, 1 warning (0.01 sec)

mysql> show binlog events in 'binarylogs.000017';
+-------------------+-----+----------------+-----------+-------------+--------------------------------------------------------------------+
| Log_name          | Pos | Event_type     | Server_id | End_log_pos | Info                                                               |
+-------------------+-----+----------------+-----------+-------------+--------------------------------------------------------------------+
| binarylogs.000017 |   4 | Format_desc    |         1 |         126 | Server ver: 8.0.30-22, Binlog ver: 4                               |
| binarylogs.000017 | 126 | Previous_gtids |         1 |         197 | 867e5079-8420-11ed-a0bf-1260d715ed11:1-12                          |
| binarylogs.000017 | 197 | Gtid           |         3 |         283 | SET @@SESSION.GTID_NEXT= '867e5079-8420-11ed-a0bf-1260d715ed11:13' |
| binarylogs.000017 | 283 | Query          |         3 |         358 | BEGIN                                                              |
| binarylogs.000017 | 358 | Table_map      |         3 |         419 | table_id: 99 (test_gtid.gtid_test1)                                |
| binarylogs.000017 | 419 | Write_rows     |         3 |         459 | table_id: 99 flags: STMT_END_F                                     |
| binarylogs.000017 | 459 | Xid            |         3 |         490 | COMMIT /* xid=183 */                                               |
+-------------------+-----+----------------+-----------+-------------+--------------------------------------------------------------------+
7 rows in set (0.00 sec)
mysql> select @@server_uuid;
+--------------------------------------+
| @@server_uuid                        |
+--------------------------------------+
| 867e5079-8420-11ed-a0bf-1260d715ed11 |
+--------------------------------------+
1 row in set (0.00 sec)

You can see that the replica's binary logs now contain GTID events generated with the replica's server UUID. I hope this blog post helps make your migration to GTID replication even easier.

Reference:

https://dev.mysql.com/doc/refman/8.0/en/replication-gtids-assign-anon.html

Dec 28, 2022

Impact of DDL Operations on Aurora MySQL Readers

Recently, I came across an interesting investigation about long-running transactions getting killed on an Aurora reader instance. In this article, I will explain why it is advisable to avoid long-running transactions on Aurora readers when executing frequent DDL operations on the writer, or at least to be aware of how a DDL can impact your Aurora readers.

Aurora uses a shared volume, often called a cluster volume, that manages the data for all the DB instances that are part of the cluster. These DB instances could be a single Aurora instance or multiple instances (the writer and Aurora read replicas) within a cluster.

Aurora replicas connect to the same storage volume as the primary DB instance and support only read operations. So if you add a new Aurora replica, it does not make a new copy of the table data; instead, it connects to the shared cluster volume that contains all the data.

This can lead to an issue on replica instances when handling DDL operations.

Below is one such example.

mysql> SELECT AURORA_VERSION();
+------------------+
| AURORA_VERSION() |
+------------------+
| 3.02.2           |
+------------------+
1 row in set (0.22 sec)

 

Start a transaction on reader:

mysql> SELECT connection_id();
+-----------------+
| connection_id() |
+-----------------+
|               21|
+-----------------+
1 row in set (0.27 sec)

mysql> SELECT * FROM t WHERE old_column not like '%42909700340-70078987867%';

 

While the transaction is ongoing on the reader, execute any DDL against the same table on the writer:

mysql> ALTER TABLE t ADD COLUMN new_column VARCHAR(32);

 

Check the status on the reader; the transaction will have been terminated forcefully:

mysql> SELECT * FROM t WHERE old_column not like '%42909700340-70078987867%';
ERROR 2013 (HY000): Lost connection to MySQL server during query

mysql> SELECT connection_id();
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    22
Current database: db
+-----------------+
| connection_id() |
+-----------------+
|              22 |
+-----------------+
1 row in set (3.19 sec)

 

Now, let’s see what happens when there is a backup happening from a reader node and the writer receives a DDL for that particular table that is being backed up.

Take a logical backup of a table using mydumper:

mydumper --success-on-1146 --outputdir=/backups/ --verbose=3 --host=aurora-reader --ask-password --tables-list=db.t

While the backup is ongoing on the reader, execute any DDL against the same table on the writer.

mysql> ALTER TABLE t ADD COLUMN new_column VARCHAR(32);

Check the status of the backup:

** Message: 16:04:51.108: Thread 1 dumping data for `db`.`t`          into /backups/db.t.00000.sql| Remaining jobs: 6
..
..
** Message: 16:04:51.941: Waiting threads to complete
** Message: 16:04:51.941: Thread 2 shutting down
** (mydumper:44955): CRITICAL **: 16:04:55.268: Could not read data from db.t: Lost connection to MySQL server during query

So what is the issue?

As stated above, Aurora does not use binary log-based replication to replicate data to the readers. The underlying storage is the same for all the instances (writer + readers) within a cluster, and Aurora handles it with, let's say, "magic."

Now, because of this "magic" in Aurora, when you perform any DDL operation on the writer instance, the reader instances are forced to terminate any long-running transactions so that the metadata lock can be acquired and the DDL operation can continue on the writer instance.

Hence, if you are using Aurora replicas for logical backups (mysqldump/mydumper) or if you are running some long-running jobs on the reader instance you may encounter the issue mentioned above.
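Before running a DDL on the writer, it may be worth checking the readers for transactions that would be terminated. Here is a sketch using information_schema (the 60-second threshold is arbitrary, and this query is not from the original investigation):

-- Run on each Aurora reader: transactions open longer than 60 seconds
SELECT trx.trx_id,
       trx.trx_started,
       TIMESTAMPDIFF(SECOND, trx.trx_started, NOW()) AS duration_seconds,
       p.user,
       p.info AS current_query
FROM information_schema.innodb_trx trx
JOIN information_schema.processlist p ON p.id = trx.trx_mysql_thread_id
WHERE trx.trx_started < NOW() - INTERVAL 60 SECOND
ORDER BY trx.trx_started;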

To understand this better let’s see what happens when we perform any DDL operation in a binary log-based replication environment and in the Aurora replication environment. Following are the high-level steps when any DDL gets executed.

Binary log-based replication:

  • On the primary, ALTER TABLE will try to acquire the metadata lock
  • Once the lock is acquired the ALTER TABLE progresses
  • Once the ALTER TABLE operation completes, the DDL statement will be written to the binary log
  • On the replicas, the IO thread will copy this event to the local relay log
  • The SQL thread will apply the query from the relay log
  • On the replica, it will also acquire the global metadata lock
  • Once the lock is acquired, the ALTER TABLE starts execution on the replica

Aurora replication:

  • On the writer, the ALTER TABLE will try to acquire the metadata lock
  • At the same time, it checks whether there is any open transaction on any of the reader nodes; if so, it kills those transactions forcefully
  • Once the metadata lock is acquired, the ALTER TABLE progresses
  • After the ALTER TABLE completes, the modified structure will be visible to the replicas because of the same underlying storage

What are the issues?

  1. If you are performing frequent DDL operations in your database, it is not recommended to take logical backups from an Aurora reader.
  2. Transactions that run for a long time on the readers may get killed.

What is the solution?

Create an external replica of the Aurora cluster using binary log-based replication. This replica can be used to take logical backups or to execute some long-running queries that will not be interrupted by the DDL operation on the Aurora writer instance.

You may follow the Percona blog to create an external replica from Aurora using MyDumper or review the AWS documentation page.
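For orientation, such an external replica is, at its core, standard binary log-based replication; a heavily simplified sketch follows (the host, user, and coordinates are placeholders, and Aurora additionally requires binary logging to be enabled through the cluster parameter group):

CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'aurora-cluster-endpoint',
  SOURCE_PORT = 3306,
  SOURCE_USER = 'repl_user',
  SOURCE_PASSWORD = '********',
  SOURCE_LOG_FILE = 'mysql-bin-changelog.000001',
  SOURCE_LOG_POS = 4;
START REPLICA;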

Dec 27, 2022

Transparent Data Encryption (TDE)


Encrypting data at rest in a database management system (DBMS) refers to securing data by encrypting it when it is not being used or accessed. This is often done to protect sensitive data from unauthorized access or theft. With transparent data encryption (TDE), the encryption process is transparent to the user, allowing them to access and manipulate the data as usual without worrying about the encryption and decryption process. TDE can be applied at various levels, such as at the database, column, or cell level, depending on the organization’s specific security needs. Implementing TDE can help ensure that sensitive data is protected and only accessible to authorized users with the proper decryption keys.

Disk-level encryption

Disk-level encryption is a security measure that encrypts all data stored on a disk or storage device. At this level of granularity, the full disk or a partition is encrypted. This is independent of the database, and external tools can be used to achieve it.

  • It ensures that the data is protected against unauthorized access and can only be accessed by authorized users with the proper decryption keys.
  • Disk-level encryption is often used to protect data at rest and can be implemented using full or partial disk encryption.
  • Several tools are available for implementing disk-level encryption, including BitLocker for Windows, dm-crypt for Linux, and FileVault for MacOS.

Cluster-level encryption

Cluster-level encryption is a security measure that encrypts data stored in a cluster of servers or storage devices.

  • It ensures that the data is protected against unauthorized access and can only be accessed by authorized users with the proper decryption keys.
  • Cluster-level encryption can help protect data at rest and in motion when transmitted between servers in the cluster.
  • Several database management systems support cluster-level encryption, including Oracle, MySQL, Microsoft SQL Server, MongoDB, and Cassandra.

Database-level encryption

Database-level encryption is the encryption of the entire database or all data stored within the database. 

  • It can provide a high level of security for sensitive data and is often used in conjunction with other security measures, such as TDE and cluster-level encryption.
  • Several database management systems support database-level encryption, including Oracle, MySQL, and Microsoft SQL Server.

Table-level encryption

Table-level encryption is a security measure that allows users to encrypt specific tables or columns within a database. It can provide a high level of security for sensitive data and is often used in conjunction with other security measures, such as TDE and cluster-level encryption. Here are some critical points about table-level encryption in a relational database management system.

  • Table-level encryption allows users to selectively encrypt specific tables or columns within a database rather than encrypting the entire database or all data stored within it.
  • It can provide a high level of security for sensitive data while still allowing users to access and manipulate it as usual.
  • Table-level encryption is often used with other security measures, such as TDE and cluster-level encryption, to provide a more comprehensive security solution.
  • Several RDBMSs, including Oracle, MySQL, and Microsoft SQL Server, support it.
  • Implementing table-level encryption may require additional resources and impact performance, so carefully evaluating the trade-offs between security and performance is essential.

Encryption available in databases

Here is a comparison chart of the different levels of encryption available in various relational database management systems:

Encryption Level | Oracle                            | MySQL              | Microsoft SQL Server              | PostgreSQL
Cluster-Level    | Transparent Data Encryption (TDE) | InnoDB Plugin      | Transparent Data Encryption (TDE) | N/A
Database-Level   | Transparent Data Encryption (TDE) | Encryption at Rest | Transparent Data Encryption (TDE) | N/A
Table-Level      | Transparent Data Encryption (TDE) | Encryption at Rest | Transparent Data Encryption (TDE) | N/A
Column-Level     | Transparent Data Encryption (TDE) | Encryption at Rest | Transparent Data Encryption (TDE) | pgcrypto

 

Encryption Level | Amazon RDS (Oracle)               | Amazon RDS (MySQL) | Azure SQL Database                | Google Cloud SQL (MySQL)
Cluster-Level    | Transparent Data Encryption (TDE) | InnoDB Plugin      | Transparent Data Encryption (TDE) | N/A
Database-Level   | Transparent Data Encryption (TDE) | Encryption at Rest | Transparent Data Encryption (TDE) | Full Disk Encryption, Transparent Data Encryption (TDE), File-level Encryption
Table-Level      | Transparent Data Encryption (TDE) | Encryption at Rest | Transparent Data Encryption (TDE) | Transparent Data Encryption (TDE)
Column-Level     | Transparent Data Encryption (TDE) | Encryption at Rest | Transparent Data Encryption (TDE) | Transparent Data Encryption (TDE)

 

Key management

Key management is vital to data at rest encryption in a relational database management system. This process involves creating, distributing, and managing keys to encrypt and decrypt data.

Several key management strategies can be used in an RDBMS, including:

  • Master Key Management: In this approach, a single master key is used to encrypt and decrypt all data in the RDBMS. This key is usually stored in a secure location, such as a hardware security module (HSM).
  • Column-Level Key Management: In this approach, each column of data is encrypted using a separate key. This allows for more granular control over data access and improves performance because only the necessary columns are decrypted when querying data.
  • Table-Level Key Management: Similar to column-level key management, this approach uses a separate key for each table in the RDBMS. This allows for even greater data access control and improved performance.

Regardless of which key management strategy is used, it is essential to have a robust system in place to ensure the security of encrypted data. This includes securely storing keys and quickly and easily rotating keys if there is a breach.

In addition to key management, it is also essential to consider the overall security of the RDBMS. This includes using secure protocols for communication, secure authentication methods, and implementing robust access controls to prevent unauthorized access to data.

Overall, key management is critical to data-at-rest encryption in an RDBMS. Organizations can protect their sensitive data from unauthorized access by implementing a strong key management system and taking steps to secure the RDBMS.

Transparent data encryption in PostgreSQL

The scope of TDE in PostgreSQL is limited to column-based encryption only. It does not cover other levels, such as table-level or database-level encryption, so users may need to implement additional security measures to protect those components of the database.

PostgreSQL column-level encryption

We have discussed that PostgreSQL only has column-level encryption, which can be achieved with an extension called pgcrypto. This is the bare minimum implementation of encryption: it can only be applied to columns of a table, not to log files, whole rows, or entire databases.

The pgcrypto module in PostgreSQL is a powerful tool for securely storing and managing encrypted data within your database. With pgcrypto, you can easily encrypt data using a variety of algorithms and keys and then decrypt the data when needed. One of the key features of pgcrypto is its support for multiple encryption algorithms, including AES, Blowfish, and 3DES. This allows you to choose the best algorithm for your security needs and requirements.

To use pgcrypto, you will first need to install the extension in your PostgreSQL database. This can be done using the CREATE EXTENSION command:

CREATE EXTENSION pgcrypto;

Once the extension is installed, you can use the various functions of pgcrypto to encrypt and decrypt data. For example, you can use the pgp_sym_encrypt function to encrypt data using a symmetric key:

SELECT pgp_sym_encrypt('my secret data', 'my secret key');

This will return the encrypted data as a bytea data type. The pgp_sym_encrypt function takes the data to be encrypted, the symmetric key (a passphrase), and, optionally, a third argument with encryption options such as the cipher algorithm. The function salts the message internally, so the same data encrypted with the same key will result in different encrypted outputs each time.

To decrypt the data, you can use the pgp_sym_decrypt function:

SELECT pgp_sym_decrypt(encrypted_data, 'my secret key');

This will return the decrypted data as a text data type. The pgp_sym_decrypt function takes the encrypted data and the symmetric key that was used to encrypt it, plus, optionally, the same kind of options string.

In addition to symmetric key encryption, pgcrypto supports public-key encryption using the pgp_pub_encrypt and pgp_pub_decrypt functions. Public-key encryption is a type of encryption that uses a pair of keys: a public key and a private key. The public key encrypts the data, and the private key decrypts the data.

To use public-key encryption with pgcrypto, you first need a key pair. pgcrypto itself does not generate PGP keys; you create them with an external tool such as GnuPG (for example, gpg --gen-key), export them in ASCII-armored format, and pass them to the pgcrypto functions through the dearmor() helper.

Once you have a pair of keys, you can use pgp_pub_encrypt with the public key to encrypt data and pgp_pub_decrypt with the private key to decrypt it.
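A minimal usage sketch (the armored key blocks are placeholders for keys exported from GnuPG, and encrypted_data stands for the bytea produced by the first call):

-- Encrypt with the recipient's public key
SELECT pgp_pub_encrypt(
         'my secret data',
         dearmor('-----BEGIN PGP PUBLIC KEY BLOCK----- ... -----END PGP PUBLIC KEY BLOCK-----')
       );

-- Decrypt with the matching private key
SELECT pgp_pub_decrypt(
         encrypted_data,
         dearmor('-----BEGIN PGP PRIVATE KEY BLOCK----- ... -----END PGP PRIVATE KEY BLOCK-----')
       );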

Here is a complete example of encrypting and decrypting credit card data.

First, create a sample table with a serial number, the credit card number, and the time when that card record was created. Only the credit card number needs to be hidden from unauthorized users, therefore we will encrypt only that column.

CREATE TABLE tbl_credit_card (
  id serial PRIMARY KEY,
  credit_card bytea NOT NULL,
  created_at timestamptz NOT NULL DEFAULT NOW()
);

Now create a function generate_iv; its purpose is to generate a random initialization vector (IV) using gen_random_bytes. An IV is a random value used so that the same data encrypted with the same key results in different encrypted outputs. Note that the pgp_sym_* functions used below already handle this kind of salting internally, so we will not pass the IV to them explicitly; the function simply illustrates how random bytes can be generated with pgcrypto.

CREATE OR REPLACE FUNCTION generate_iv()
RETURNS bytea AS
$$
BEGIN
  RETURN gen_random_bytes(16);
END;
$$
LANGUAGE plpgsql;

The next step is to create encrypt_data. This function takes two input parameters, data of type text and key of type text, and returns a bytea. Inside the function, pgp_sym_encrypt encrypts the data parameter using the key; because PGP symmetric encryption salts the message internally, the same plaintext encrypted with the same key still yields a different ciphertext each time, which is exactly the property the random IV was meant to provide. The purpose of this function is to offer a convenient way to encrypt data with a symmetric key: you call encrypt_data with the data you want to encrypt and the key, and the function performs the encryption for you.

CREATE OR REPLACE FUNCTION encrypt_data(data text, key text)
RETURNS bytea AS
$$
BEGIN
  -- pgp_sym_encrypt salts the message internally, so the same input
  -- encrypted with the same key produces a different ciphertext each time
  RETURN pgp_sym_encrypt(data, key);
END;
$$
LANGUAGE plpgsql;
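The read-back query below also uses a decrypt_data helper that the original example does not define; here is a minimal sketch consistent with encrypt_data above (an assumption, not from the original post):

CREATE OR REPLACE FUNCTION decrypt_data(data bytea, key text)
RETURNS text AS
$$
BEGIN
  -- Reverse of encrypt_data: decrypt with the same symmetric key
  RETURN pgp_sym_decrypt(data, key);
END;
$$
LANGUAGE plpgsql;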

Insert a row into the table and read it back:

INSERT INTO tbl_credit_card (credit_card)
VALUES (encrypt_data('6371-3263-3461-2735', 'MyKey'));

SELECT cc.id, decrypt_data(cc.credit_card, 'MyKey') AS card_number, cc.created_at
FROM tbl_credit_card cc
WHERE cc.id = 1;
 id |     card_number     |     created_at
----+---------------------+----------------------------
  1 | 6371-3263-3461-2735 | 2022-12-20 12:34:56

The pgcrypto extension in PostgreSQL provides several cryptographic functions for encryption, hashing, and message authentication. Some of the supported algorithms include:

  • AES (Advanced Encryption Standard)
  • Blowfish
  • DES (Data Encryption Standard)
  • MD5 (Message Digest Algorithm 5)
  • SHA-1 (Secure Hash Algorithm 1)
  • SHA-2 (Secure Hash Algorithm 2)
  • SHA-3 (Secure Hash Algorithm 3)
  • HMAC (Keyed-Hash Message Authentication Code)

PostgreSQL transparent data encryption future

Currently, there is no transparent data encryption available in vanilla PostgreSQL. There have been many discussions in the PostgreSQL community about implementing it, but nothing has been committed yet. There are, however, some company-owned PostgreSQL forks that provide it. Here is an example architecture for transparent data encryption in PostgreSQL:

[Diagram: example architecture for transparent data encryption in PostgreSQL]

Conclusion

Overall, TDE is a powerful tool for protecting sensitive data in a database. By encrypting data at rest, TDE helps to ensure that the data is secure throughout its lifecycle. Using key management techniques, such as a master encryption key or an external key management service (KMS), users can further secure their encrypted data and maintain control over it.

Dec 27, 2022

How MySQL Uses character_set Configurations


There are eight configuration options related to the character_set in MySQL, as shown below. Without reading the MySQL Character Set documentation carefully, it could be hard to know what these configuration options are used for. In addition, for some of the options, unless there is further testing, it could be hard to know how MySQL uses them.

mysql> show variables like 'character_set%'; 
+--------------------------+-------------------------------------+ 
| Variable_name            | Value                               | 
+--------------------------+-------------------------------------+ 
| character_set_client     | utf8mb4                             | 
| character_set_connection | utf8mb4                             | 
| character_set_database   | utf8mb4                             | 
| character_set_filesystem | binary                              | 
| character_set_results    | utf8mb4                             | 
| character_set_server     | utf8mb4                             | 
| character_set_system     | utf8mb3                             | 
| character_sets_dir       | /usr/share/percona-server/charsets/ | 
+--------------------------+-------------------------------------+

Don't be intimidated: I will simplify the purpose of these settings by grouping and explaining them. I will also use examples to illustrate how MySQL uses and processes these character_set options so that users can mitigate the occurrence of data loss and/or errors. Lastly, I will explain how the command "set names" can be used to set the relevant options to identical values.

Grouping the options

To begin, we can organize these eight options into three groups: 

  • G1. Miscellaneous
  • G2. Options that define the character_set of data (a column in a table)
  • G3. Options used to transfer/interpret the character_set while processing statements

G1. Miscellaneous

G1.1 character_sets_dir: It is straightforward: the directory where dynamically loaded character sets are installed. 

G1.2 character_set_filesystem: when MySQL needs to deal with a file, it needs the file name, and that file name will be converted from character_set_client to character_set_filesystem before the file is opened. The default value is binary, which means that no conversion occurs. In most cases, please keep it as the default value binary unless you are really confident that you would require the change.

G2. Define the character set of data (column in a table)

The smallest unit of data in an RDBMS (for example, MySQL) is a column of a table. When the column is used to store a collection of characters (namely a string, for example, VARCHAR, CHAR, TEXT, etc.), MySQL needs to know which character set these characters belong to, so that it can store and interpret them correctly.

The sole purpose of the three options in G2 is to help eventually define the character set of a column in a hierarchical way. The three options are character_set_system, character_set_server, and character_set_database.

The system character set (character_set_system) is the character set MySQL itself uses for identifiers and other metadata; it is always utf8mb3 and is not something you normally configure.

The server character set and collation (character_set_server) are used as the defaults if the database character set and collation are not specified in CREATE DATABASE statements.

The character set and collation of a database (exposed as character_set_database for the current database) are used as the defaults for a table if the table character set and collation are not specified in CREATE TABLE statements.

The character set and collation of a table are used as the default values for string columns of the table if the character set and collation of those string columns are not specified in CREATE TABLE statements.

In other words, if we define the character_set and collation of a column in CREATE TABLE statements, all of the above settings will be ignored. For example:

mysql> CREATE TABLE t1 ( c1 INT NOT NULL AUTO_INCREMENT PRIMARY KEY,     
                         c2 VARCHAR(100) COLLATE utf8mb4_general_ci,     
                         c3 VARCHAR(100) )  DEFAULT CHARACTER SET=latin1;
mysql> select TABLE_SCHEMA,TABLE_NAME,COLUMN_NAME, CHARACTER_SET_NAME from information_schema.columns where table_schema='test' and table_name='t1';
+--------------+------------+-------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME | CHARACTER_SET_NAME |
+--------------+------------+-------------+--------------------+
| test         | t1         | c1          | NULL               |
| test         | t1         | c2          | utf8mb4            |
| test         | t1         | c3          | latin1             |
+--------------+------------+-------------+--------------------+

In the example above, we did not specify the character set for column c3. Therefore, by default, it uses latin1 (which was defined in CREATE TABLE) as its CHARACTER_SET. However, we specified column c2 as COLLATE utf8mb4_general_ci, so it uses utf8mb4 as its CHARACTER_SET, thus ignoring the character set latin1 (which, again, was defined in CREATE TABLE for the table).
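The same inheritance applies one level up: a database created without an explicit character set inherits character_set_server, and a table created without one inherits the database default. A quick sketch (the database and table names are made up for illustration, and character_set_server is assumed to be utf8mb4 as shown earlier):

mysql> CREATE DATABASE charset_demo;                                  -- inherits utf8mb4 from the server
mysql> CREATE DATABASE charset_demo_latin1 CHARACTER SET latin1;
mysql> CREATE TABLE charset_demo_latin1.t_default (c VARCHAR(10));    -- column ends up latin1, inherited from the database
mysql> select TABLE_SCHEMA,TABLE_NAME,COLUMN_NAME,CHARACTER_SET_NAME
       from information_schema.columns where table_schema='charset_demo_latin1';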

For more details, please read https://dev.mysql.com/doc/refman/8.0/en/charset-syntax.html.

10.3.2 Server Character Set and Collation

10.3.3 Database Character Set and Collation

10.3.4 Table Character Set and Collation

10.3.5 Column Character Set and Collation

G3. Transfer/interpret during the processing of the character_set

There are three options in this group: character_set_client, character_set_connection, and character_set_results.

The first two options are used by MySQL for writing to tables: The server takes the character_set_client system variable to be the character set in which statements are sent by the client. The server converts statements sent by the client from character_set_client to character_set_connection.

The last one is used for reading from the table: The character_set_results system variable indicates the character set in which the server returns query results to the client. To tell the server to perform no conversion of result sets or error messages, set character_set_results to NULL or binary.

Illustration via examples

Let’s take a look at a few examples to help us understand further:

We will reuse the above table t1, which has a column c2 of CHARACTER SET utf8mb4 and a column c3 of CHARACTER SET latin1.

Just so you know, in the examples I will use two Chinese characters: "王", which is my Chinese last name, and "李", which is my wife's Chinese last name.

Example one

There is no conversion when character_set_client is the same as character_set_connection. And we can see that the encoded binary of the Chinese character "王" is 0xE78E8B and the encoded binary of the Chinese character "李" is 0xE69D8E, regardless of whether the character set is latin1 or utf8mb4.

mysql> set names latin1;
mysql> select @@character_set_client,@@character_set_connection;
+------------------------+----------------------------+
| @@character_set_client | @@character_set_connection |
+------------------------+----------------------------+
| latin1                 | latin1                     |
+------------------------+----------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE78E8B                     |
+------------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE69D8E                     |
+------------------------------+
mysql> set names utf8mb4;
mysql> select @@character_set_client,@@character_set_connection;
+------------------------+----------------------------+
| @@character_set_client | @@character_set_connection |
+------------------------+----------------------------+
| utf8mb4                | utf8mb4                    |
+------------------------+----------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE78E8B                     |
+------------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE69D8E                     |
+------------------------------+

Example two

MySQL will convert from character_set_client to character_set_connection when their settings are different.

Example 2.1 convert from smaller character_set to larger character_set

mysql> set character_set_client=latin1; set character_set_connection=utf8mb4;
mysql> select @@character_set_client,@@character_set_connection;
+------------------------+----------------------------+
| @@character_set_client | @@character_set_connection |
+------------------------+----------------------------+
| latin1                 | utf8mb4                    |
+------------------------+----------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xC3A7C5BDE280B9             |
+------------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xC3A6C29DC5BD               |
+------------------------------+

Why are the binary(“?”) and binary(“?”) shown here different from example 1? The reason is that MySQL did the conversion from character_set_client(latin1) to character_set_connection(utf8mb4), meaning that MySQL will interpret the string using character_set_client(latin1), then interpret that output using character_set_connection(utf8mb4). 

Let’s verify:

mysql> select binary(convert(convert('王' using latin1) using utf8mb4));
+-------------------------------------------------------------------------------+
| binary(convert(convert('王'  using latin1) using utf8mb4))                    |
+-------------------------------------------------------------------------------+
| 0xC3A7C5BDE280B9                                                              |
+-------------------------------------------------------------------------------+
mysql> select binary(convert(convert('李' using latin1) using utf8mb4));
+-------------------------------------------------------------------------------+
| binary(convert(convert('李'  using latin1) using utf8mb4))                    |
+-------------------------------------------------------------------------------+
| 0xC3A6C29DC5BD                                                                |
+-------------------------------------------------------------------------------+

Example 2.2 convert from larger character_set to smaller character_set

mysql> set character_set_client=utf8mb4; set character_set_connection=latin1;
mysql> select @@character_set_client,@@character_set_connection;
+------------------------+----------------------------+
| @@character_set_client | @@character_set_connection |
+------------------------+----------------------------+
| utf8mb4                | latin1                     |
+------------------------+----------------------------+
mysql> select binary("?");
+--------------------------+
| binary("?")              |
+--------------------------+
| 0x3F                     |
+--------------------------+
mysql> select binary("?");
+--------------------------+
| binary("?")              |
+--------------------------+
| 0x3F                     |
+--------------------------+

Again, MySQL does the conversion from character_set_client (utf8mb4) to character_set_connection (latin1), and because this is a conversion from a larger character set to a smaller one, it makes sense that it can cause data loss in some situations, such as in this example: the different Chinese characters "王" and "李" returned the same binary 0x3F, which obviously is wrong (data was lost).

Below is how MySQL does the conversion with data loss:

mysql> select binary(convert(convert('王' using utf8mb4) using latin1));
+-------------------------------------------------------------------------------+
| binary(convert(convert('王'  using utf8mb4) using latin1))                    |
+-------------------------------------------------------------------------------+
| 0x3F                                                                          |
+-------------------------------------------------------------------------------+
mysql> select binary(convert(convert('李' using utf8mb4) using latin1));
+-------------------------------------------------------------------------------+
| binary(convert(convert('李'  using utf8mb4) using latin1))                    |
+-------------------------------------------------------------------------------+
| 0x3F                                                                          |
+-------------------------------------------------------------------------------+

Example three

When the character_set of a column is different from the character_set_connection, the conversion will occur as well.

Example 3.1 converts from a smaller character_set to a larger character_set.

As we can see from the example below, column c2 expects character_set utf8mb4, which is different from character_set_connection latin1, so the actual data inserted is the result of the conversion (interpret the string as latin1, then re-encode that output into the destination character set utf8mb4): convert(convert('王' using latin1) using utf8mb4), which gives the binary of '王' for utf8mb4: 0xC3A7C5BDE280B9. At the same time, column c3 expects the same character_set as character_set_connection latin1, so no conversion is involved and the data inserted is the original (the string interpreted as latin1), hence the binary of '李' for latin1: 0xE69D8E.

mysql> set names latin1;
mysql> select @@character_set_client,@@character_set_connection;
+------------------------+----------------------------+
| @@character_set_client | @@character_set_connection |
+------------------------+----------------------------+
| latin1                 | latin1                     |
+------------------------+----------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE78E8B                     |
+------------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE69D8E                     |
+------------------------------+
mysql> insert into t1(c1,c2,c3) values(1,"?","?");
mysql> select * from t1 where c1=1;
+----+------+------+
| c1 | c2   | c3   |
+----+------+------+
|  1 | ?   | ?   |
+----+------+------+
mysql> select c1,binary(c2),binary(c3) from t1 where c1=1;
+----+------------------------+------------------------+
| c1 | binary(c2)             | binary(c3)             |
+----+------------------------+------------------------+
|  1 | 0xC3A7C5BDE280B9       | 0xE69D8E               |
+----+------------------------+------------------------+

Please also note that here, column c2 is utf8mb4, which is different from character_set_results latin1. So when reading the data, MySQL will convert it for the results as well. However, column c3 is the same as character_set_results latin1, so no conversion is required; the output is simply interpreted as latin1.

mysql> select @@character_set_results;
+-------------------------+
| @@character_set_results |
+-------------------------+
| latin1                  |
+-------------------------+
mysql> select * from t1 where c1=1;
+----+------+------+
| c1 | c2   | c3   |
+----+------+------+
|  1 | 王   | 李   |
+----+------+------+

For column c2, MySQL will interpret the result as utf8mb4 (the character set definition of the column) and then convert it into latin1 (the setting of character_set_results), as below:

mysql> select convert(convert(0xC3A7C5BDE280B9 using utf8mb4) using latin1);
+---------------------------------------------------------------+
| convert(convert(0xC3A7C5BDE280B9 using utf8mb4) using latin1) |
+---------------------------------------------------------------+
| 王                                                             |
+---------------------------------------------------------------+

For column c3, MySQL interprets the result directly as latin1:

mysql> select convert(0xE69D8E using latin1);
+--------------------------------+
| convert(0xE69D8E using latin1) |
+--------------------------------+
| 李                              |
+--------------------------------+

Example 3.2 converts from a larger character_set to a smaller character_set.

character_set_connection is set to utf8mb4, c2 expects character_set utf8mb4, and c3 expects character_set latin1. c2 works fine, but for c3, MySQL raises the error 1366 (HY000): Incorrect string value, since it determined that the conversion (convert(convert('李' using utf8mb4) using latin1)) causes data loss, and it stops inserting the wrong data into the column.

mysql> set names utf8mb4 ;
mysql> select @@character_set_client,@@character_set_connection;
+------------------------+----------------------------+
| @@character_set_client | @@character_set_connection |
+------------------------+----------------------------+
| utf8mb4                | utf8mb4                    |
+------------------------+----------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE78E8B                     |
+------------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE69D8E                     |
+------------------------------+
mysql> insert into t1(c1,c2,c3) values(2,"?","?");
ERROR 1366 (HY000): Incorrect string value: '\xE6\x9D\x8E' for column 'c3' at row 1
mysql> select binary(convert(convert('?' using utf8mb4) using latin1));
+------------------------------------------------------------------------------+
| binary(convert(convert('?' using utf8mb4) using latin1))                    |
+------------------------------------------------------------------------------+
| 0x3F                                                                         |
+------------------------------------------------------------------------------+

Example four

When you directly convert a column from a larger character set to a smaller one inside the InnoDB engine, where the session variable settings are not involved, the same error may happen as well:

mysql> select c1, binary(c2), binary(c3) from t1;
+----+------------------------+------------------------+
| c1 | binary(c2)             | binary(c3)             |
+----+------------------------+------------------------+
|  1 | 0xC3A7C5BDE280B9       | 0xE69D8E               |
|  3 | 0xE78E8B               | 0x4131                 |
+----+------------------------+------------------------+
mysql> ALTER TABLE t1 MODIFY c2 varchar(100) CHARACTER SET latin1 COLLATE latin1_swedish_ci DEFAULT NULL;
ERROR 1366 (HY000): Incorrect string value: '\xE7\x8E\x8B' for column 'c2' at row 2

Example five

One more thing: as you may have noticed in the examples above, I used the command set names a few times. This is the convenient way to make all three options in the G3 group identical so that we can avoid unnecessary conversions.

mysql> set names latin1;
mysql> show variables like '%character_set%';
+--------------------------+-------------------------------------+
| Variable_name            | Value                               |
+--------------------------+-------------------------------------+
| character_set_client     | latin1                              |
| character_set_connection | latin1                              |
| character_set_database   | utf8mb4                             |
| character_set_filesystem | binary                              |
| character_set_results    | latin1                              |
| character_set_server     | utf8mb4                             |
| character_set_system     | utf8mb3                             |
| character_sets_dir       | /usr/share/percona-server/charsets/ |
+--------------------------+-------------------------------------+
8 rows in set (0.00 sec)
mysql> set names utf8mb4;
mysql> show variables like '%character_set%';
+--------------------------+-------------------------------------+
| Variable_name            | Value                               |
+--------------------------+-------------------------------------+
| character_set_client     | utf8mb4                             |
| character_set_connection | utf8mb4                             |
| character_set_database   | utf8mb4                             |
| character_set_filesystem | binary                              |
| character_set_results    | utf8mb4                             |
| character_set_server     | utf8mb4                             |
| character_set_system     | utf8mb3                             |
| character_sets_dir       | /usr/share/percona-server/charsets/ |
+--------------------------+-------------------------------------+
8 rows in set (0.00 sec)

Conclusion

MySQL performs the necessary conversions to keep character_set_client, character_set_connection, and the destination character set of the column aligned.

We need to be careful when performing conversions: when MySQL converts from a character set that permits more values (larger) to one that permits fewer values (smaller), such as utf8mb4 -> latin1, it may cause data loss or raise “ERROR 1366 (HY000): Incorrect string value”.

We can make all three G3 settings (character_set_client, character_set_connection, character_set_results) identical via the command “set names”.

As for the character set of a column, which can be defined as needed, think twice before defining a column with a character set different from character_set_connection.
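If you do need to convert an existing column to a smaller character set, a round-trip comparison can flag the rows that would be damaged before you run the ALTER. The following is a minimal sketch, assuming the table t1 from the examples above lives in a database named test; adjust names and credentials to your environment, and treat it as a quick sanity check rather than a complete guarantee.

#!/bin/bash
# Hedged sketch: list rows whose c2 value would not survive a utf8mb4 -> latin1
# conversion. A value that changes after the round trip would be lost or
# replaced by '?' during the actual conversion.
mysql test -e "
  SELECT c1, c2
  FROM t1
  WHERE CONVERT(CONVERT(c2 USING latin1) USING utf8mb4) <> c2;
"

If the query returns no rows, converting that column to latin1 should not run into ERROR 1366 for the existing data.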

Dec
27
2022
--

Percona Monitoring and Management 2.33, Percona Distribution for MongoDB 4.4.18: Release Roundup December 27, 2022

Percona Releases Dec 27 2022

It’s time for the release roundup!

Percona is a leading provider of unbiased open source database solutions that allow organizations to easily, securely, and affordably maintain business agility, minimize risks, and stay competitive.

Our Release Roundups showcase the latest Percona software updates, tools, and features to help you manage and deploy our software. It offers highlights and critical information, as well as links to the full release notes and direct links to the software or service itself to download.

Today’s post includes those releases and updates that have come out since December 12, 2022. Take a look!

Percona Monitoring and Management 2.33

Percona Monitoring and Management 2.33 (PMM) was released on December 13, 2022. It is an open source database monitoring, management, and observability solution for MySQL, PostgreSQL, and MongoDB. Along with new features and improvements, release highlights include data loss prevention for PMM server outages, a guided tour of Percona Alerting, and a simplified process for restoring physical MongoDB backups.

Download Percona Monitoring and Management 2.33

 

Percona Distribution for MongoDB 4.4.18

Percona Distribution for MongoDB 4.4.18 was released on December 19, 2022. It is a freely available MongoDB database alternative, giving you a single solution that combines enterprise components from the open source community, designed and tested to work together. The aim of Percona Distribution for MongoDB is to enable you to run and operate your MongoDB efficiently with the data being consistently backed up. This release of Percona Distribution for MongoDB includes improvements and bug fixes, provided by MongoDB and included in Percona Server for MongoDB.

Download Percona Distribution for MongoDB 4.4.18

Percona Server for MongoDB 4.4.18-18

On December 19, 2022, we released Percona Server for MongoDB 4.4.18-18, a source available, highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB 4.4.18 Community Edition enhanced with enterprise-grade features. It supports MongoDB 4.4.18 protocols and drivers.

Download Percona Server for MongoDB 4.4.18-18

 

That’s it for this roundup, and be sure to follow us on Twitter to stay up-to-date on the most recent releases! Percona is a leader in providing best-of-breed enterprise-class support, consulting, managed services, training, and software for MySQL, MongoDB, PostgreSQL, MariaDB, and other open source databases in on-premises and cloud environments.

Dec
26
2022
--

Talking Drupal #379 – The EOWG

Today we are talking about The Drupal Event Organizers Working Group with Avi Schwab.

For show notes visit: www.talkingDrupal.com/379

Topics

  • What is the EOWG
  • How was it formed
  • What are some of the initiatives
  • How does the EOWG support Drupal Events
  • How is the EOWG different from the Contrib Events Community Initiative
  • What is planned for 2023
  • How do you get involved
  • Becoming a board member
  • Drupalcon
  • Midcamp

Resources

Guests

Avi Schwab – froboy.org @froboy@mastodon.online

Hosts

Nic Laflin – www.nLighteneddevelopment.com @nicxvan John Picozzi – www.epam.com @johnpicozzi Leslie Glynn – redfinsolutions.com @leslieglynn

MOTW Correspondent

Martin Anderson-Clutz – @mandclu. Previous/Next API: provides a performant way to define a sequence of content.

Dec
23
2022
--

Diffing PostgreSQL Schema Changes

Diffing PostgreSQL Schema Changes

One of the routine operations when administering PostgreSQL is the periodic update of the database system’s architecture. PostgreSQL does a good job of allowing one to update a schema, add types, functions, and triggers, or alter a table by adding and removing columns and updating column data types, etc., in a reliable manner. However, there is no built-in mechanism to identify the differences between two schemas, let alone generate the SQL needed to carry updates from the development environment to production.

So let’s talk about possible approaches to schema changes.

Using logical dump manifests

The easiest way to identify changes between schemas from one database to another is to compare their schema dump manifests.

The following example demonstrates an approach one can take when looking for differences between schemas on different databases:

EXAMPLE:

-- create the databases
create database db01;
create database db02;


-- db01: version 1
create table t1 (
    c1 int,
    c2 text,
    c4 date
);

create table t2(
    c1 int,
    c2 varchar(3),
    c3 timestamp,
    c4 date
);
-- db02: version 2
create table t1 (
    c1 serial primary key,
    c2 varchar(256),
    c3 date default now()
);

create table t2(
    c1 serial primary key,
    c2 varchar(3),
    c3 varchar(50),
    c4 timestamp with time zone default now(),
    c5 int references t1(c1)
);

create index on t2 (c5);

 

# generate schema dumps
pg_dump -s db01 -Fc > db01.db
pg_dump -s db02 -Fc > db02.db
# generate manifests
pg_restore -l db01.db > db01_manifest.ini
pg_restore -l db02.db > db02_manifest.ini

This snippet demonstrates looking for differences by comparing the md5 checksums:

# EX 1: generate checksums
md5sum \
<(tail +16 db01_manifest.ini | cut -d ' ' -f 4-) \
<(tail +16 db02_manifest.ini | cut -d ' ' -f 4-)

# output
$ 9d76c028259f2d8bed966308c256943e  /dev/fd/63
$ ba124f9410ea623085c237dc4398388a  /dev/fd/62

This next snippet diffs the two manifests, identifying only those objects and attributes that have changed. Notice that the redundant header lines at the top of each manifest are skipped:

# EX 2: perform diff
diff \
> <(tail +16 db01_manifest.ini | cut -d ' ' -f 4-) \
> <(tail +16 db02_manifest.ini | cut -d ' ' -f 4-)

 

This resultant diff shows the changes made between the two schemas:

1a2,3
> SEQUENCE public t1_c1_seq postgres
> SEQUENCE OWNED BY public t1_c1_seq postgres
2a5,12
> SEQUENCE public t2_c1_seq postgres
> SEQUENCE OWNED BY public t2_c1_seq postgres
> DEFAULT public t1 c1 postgres
> DEFAULT public t2 c1 postgres
> CONSTRAINT public t1 t1_pkey postgres
> CONSTRAINT public t2 t2_pkey postgres
> INDEX public t2_c5_idx postgres
> FK CONSTRAINT public t2 t2_c5_fkey postgres

The good news is that there are a number of existing tools that can reconcile differences between a proposed schema design and the target schema:

  1. Commercial offerings can diff schemas between databases in an elegant and efficient manner; a quick web search yields the most popular options.
  2. As for open source solutions, there are a number of projects capable of diffing Postgres database schemas.

Working with the apgdiff tool

The following is an example implementation using the open source tool apgdiff.

Apgdiff can be found in the Postgres community repository. It compares two schema dump files and creates an SQL output file that is, for the most part, suitable for upgrades of old schemata:

Package: apgdiff

Version: 2.7.0-1.pgdg18.04+1
Architecture: all
Maintainer: Debian PostgreSQL Maintainers <team+postgresql@tracker.debian.org>
Installed-Size: 173
Depends: default-jre-headless | java2-runtime-headless
Homepage: https://www.apgdiff.com/
Priority: optional
Section: database
Filename: pool/main/a/apgdiff/apgdiff_2.7.0-1.pgdg18.04+1_all.deb
Size: 154800
SHA256: 9a83fcf54aed00e1a28c3d00eabe1c166977af1e26e91035e15f88b5215b181b
SHA1: ea713acb55898f07374dadd1bebb09ec2fa4b589
MD5sum: e70a97903cb23b8df8a887da4c54e945

The following example demonstrates how one can update differences between the development environment and the production database schema using apgdiff.

EXAMPLE:

apt install -y apgdiff
# EX 1: dump as SQL statements
pg_dump -s db01 -Fp > db01.sql
pg_dump -s db02 -Fp > db02.sql
createdb db03 --template=db01

apgdiff --ignore-start-with db01.sql db02.sql > db01-db02.sql

# “psql -1” encapsulates statements within a transaction
psql -1 -f db01-db02.sql db03

 

# EX 2: uses logical dumps
# notice the dumps are standard logical dumps and includes data
pg_dump db01 -Fc > db01.db
pg_dump db02 -Fc > db02.db
createdb db03 --template=db01

# this invocation assumes the resultant diff doesn’t require editing
apgdiff --ignore-start-with \
    <(pg_restore -s -f - db01.db) \
    <(pg_restore -s -f - db02.db) \
    | psql -1 db03

There’s more you can accomplish with these simple approaches. By incorporating variations of them, one can create fairly sophisticated shell scripts with little code and, with a little luck, not that much effort; a minimal sketch follows.
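Here is one way such a script could look. The database names dev_db and prod_db and the output file name are assumptions on my part; the script expects apgdiff to be installed and deliberately stops short of applying the diff automatically so it can be reviewed first.

#!/bin/bash
# Hedged sketch: generate a schema delta between a development and a
# production database with apgdiff, for review before applying.
set -euo pipefail

DEV=dev_db          # assumed name of the database holding the new schema
PROD=prod_db        # assumed name of the database to be upgraded
OUT=schema-delta.sql

# schema-only, plain-format dumps
pg_dump -s "$PROD" -Fp > prod_schema.sql
pg_dump -s "$DEV"  -Fp > dev_schema.sql

# old schema first, new schema second
apgdiff --ignore-start-with prod_schema.sql dev_schema.sql > "$OUT"

echo "Generated $OUT; review it, then apply it in a single transaction with:"
echo "  psql -1 -f $OUT $PROD"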

Dec
23
2022
--

Rebuild Patroni Replica Using pgBackRest

Rebuild Patroni Replica Using pgBackRest

Patroni is one of the most used high availability (HA) solutions for PostgreSQL. It uses a Distributed Configuration Store (DCS) to keep the configuration in a centralized location available to all nodes, making it one of the easiest-to-use and most reliable HA solutions on the market today.

pgBackRest, on the other hand, is a backup solution that can take not only full backups but also incremental and differential backups. It is one of the most widely used backup tools for PostgreSQL databases.

In previous blogs, we discussed how to set up PostgreSQL HA with Patroni and how to configure pgBackRest. Both of these open source tools have been saviors for DBAs, providing high availability and flexible restore options. In this blog, we will integrate the two and see how they can work together to reduce the load on the primary server.

Scenarios:

  • Reinitializing the Patroni cluster using pgBackRest (instead of pg_basebackup).
  • Creating the Patroni replica using pgBackRest.

Pre-configured setup:

  • Patroni configuration setup for two or more databases.
  • pgBackRest configured on a dedicated backup host.

For the purpose of testing these scenarios, the below configurations will be used throughout this blog:

Patroni Nodes:

+ Cluster: prod (7171021941707843784) ----+-----------+
| Member | Host  | Role    | State   | TL | Lag in MB |
+--------+-------+---------+---------+----+-----------+
| node1  | node1 | Leader  | running | 1 |           |
| node2  | node2 | Replica | running | 1 |         0 |
+--------+-------+---------+---------+----+-----------+

Patronictl edit-config shows below:

loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  create_replica_methods:
  - pgbackrest
  - basebackup
  parameters:
    archive_command: pgbackrest --stanza=patroni archive-push %p
    archive_mode: 'on'
    archive_timeout: 120s
    hot_standby: 'on'
    listen_addresses: '*'
    logging_collector: 'on'
    max_replication_slots: 10
    max_wal_senders: 10
    pg_hba:
    - host all all 0.0.0.0/0 md5
    - host replication all 0.0.0.0/0 md5
    - local all  postgres    peer
    wal_level: replica
    wal_log_hints: 'on'
  pgbackrest:
    command: /usr/bin/pgbackrest --stanza=patroni --log-level-file=detail --delta restore
    keep_data: true
    no_params: true
  recovery_conf:
    restore_command: pgbackrest --stanza=patroni archive-get %f %p
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
ttl: 30

Note: Pay attention to the create_replica_methods, pgbackrest, and restore_command entries above, which are specific to rebuilding nodes from a pgBackRest backup.

Apart from Patroni, we will need a backup repo host where pgBackRest has been configured. It can be a dedicated server or one of the DB hosts. However, a dedicated server is recommended so that, if a database host goes down, the backups are still available on a separate server, which makes the life of the DBAs easier.

Let’s test the scenarios one by one:

Reinitializing the Patroni cluster using pgBackRest (instead of pg_basebackup)

The main advantage of using pgBackRest instead of pg_basebackup when reinitializing a node is that it reduces the load on the leader node. This makes little difference if the database is small. However, it is very useful when the database is huge and it takes hours or days to rebuild the node, because the resource utilization is diverted to the dedicated backup host instead of the primary server, which is already busy serving the majority of the requests coming to the database.

Let us try to understand how we can rebuild the node using backup.

Many times, we are unable to start a secondary node after a failover or switchover. To handle this situation, Patroni allows us to reinitialize the database cluster, which rebuilds the node by wiping its data directory. By default, it copies the entire contents of the data directory from the primary server to re-create the desired node; with the pgbackrest create_replica_method, the data comes from the backup repository instead.

Please make the changes in the Patroni configuration/yml file and reload the configuration, as shown previously. To reinitialize the Patroni replica node, the reinit command is used as below:

ubuntu@192.168.0.1:~$ patronictl -c /etc/patroni/node1.yml reinit prod
+ Cluster: prod (7171021941707843784) ----+-----------+
| Member | Host  | Role    | State   | TL | Lag in MB |
+--------+-------+---------+---------+----+-----------+
| node1  | node1 | Leader  | running | 1 |           |
| node2  | node2 | Replica | running | 1 |         0 |
+--------+-------+---------+---------+----+-----------+
Which member do you want to reinitialize [node1, node2]? []: node2
Are you sure you want to reinitialize members node2? [y/N]: y
Success: reinitialize for member node2

On the replica node, we can see in the top output that it is rebuilding the node using the pgBackRest backup and not pg_basebackup. By default, Patroni uses pg_basebackup when create_replica_methods is not configured, which increases the load on the leader node.

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  12939 postgres  20   0  306440 268140      4 S  34.9  27.1   8871:14 GkwP468a
 791850 postgres  20   0  218692  29544  26820 S   8.9   3.0   0:00.28 /usr/lib/postgresql/14/bin/postgres -D /var/lib/postgresql/14/main --config-file=/var/lib/postgresql/14/main/postgresql.conf --listen_ad+
 791881 postgres  20   0   60980  12696  10628 S   3.2   1.3   0:00.10 pgbackrest --config=/etc/pgbackrest/pgbackrest.conf --stanza=patroni archive-get 00000013.history pg_wal/RECOVERYHISTORY
 791874 postgres  20   0  218692   8032   5260 S   2.2   0.8   0:00.07 postgres: prod: startup
 791827 postgres  20   0    7760   3516   3212 R   1.3   0.4   0:00.04 bash
 784973 postgres  20   0   23316    212      0 S   0.3   0.0   0:18.42 tracepath

When pgBackRest is used, the pgBackRest log will show the restore command and its progress, as below:

2022-12-15 15:41:29.070 P00   INFO: restore command begin 2.41: --config=/etc/pgbackrest/pgbackrest.conf --delta --exec-id=791815-82f4ea68 --log-level-console=info --log-level-file=detail --pg1-path=/var/lib/postgresql/14/main --process-max=2 --repo1-host=192.168.0.5 --repo1-host-user=postgres --repo1-path=/pgrdbackups --stanza=patroni
2022-12-15 15:41:30.800 P00   INFO: repo1: restore backup set 20221213-154604F, recovery will start at 2022-12-13 15:46:04
2022-12-15 15:41:30.877 P00 DETAIL: check '/var/lib/postgresql/14/main' exists
2022-12-15 15:41:30.877 P00 DETAIL: remove 'global/pg_control' so cluster will not start if restore does not complete
2022-12-15 15:41:30.918 P00   INFO: remove invalid files/links/paths from '/var/lib/postgresql/14/main'
2022-12-15 15:41:30.919 P00 DETAIL: remove invalid file '/var/lib/postgresql/14/main/backup_label.old'
2022-12-15 15:41:31.841 P00 DETAIL: remove invalid file '/var/lib/postgresql/14/main/base/13761/pg_internal.init'
2022-12-15 15:41:31.920 P00 DETAIL: remove invalid file '/var/lib/postgresql/14/main/global/pg_internal.init'

 

Creating the Patroni replica using pgBackRest:

If the bootstrap section contains the configuration to rebuild the node using pgBackRest, then when a node is added to an existing Patroni cluster, the initial build of the new node will use the pgBackRest backup instead of pg_basebackup. Point-in-time recovery can also be done through the bootstrap section. This not only reduces the load on the leader node, but the backup host also restores the data with comparatively less resource utilization.

To configure the same, please use the below in the Patroni configuration file:

bootstrap:
    method: <custom_bootstrap_method_name>
    <custom_bootstrap_method_name>:
        command: <path_to_custom_bootstrap_script> [param1 [, ...]]
        keep_existing_recovery_conf: True/False
        no_params: True/False
        recovery_conf:
            recovery_target_action: promote
            recovery_target_timeline: <PITR_Time>
            restore_command: <method_specific_restore_command>

In this example, the section below has been added to the Patroni config; it will build the node by performing point-in-time recovery to the timestamp specified.

bootstrap:
  method: pitr_restore_by_pgbackrest
  pitr_restore_by_pgbackrest:
    command: 'pgbackrest --config=/etc/pgbackrest/pgbackrest.conf --stanza=patroni --log-level-file=detail --link-all --type=time
 --target="2022-12-13 15:46:04" restore'
    keep_existing_recovery_conf: True
    no_params: True
    recovery_conf:
      recovery_target_action: "promote"
      recovery_target_time: "2022-12-13 15:46:04"
      restore_command: 'pgbackrest --config=/etc/rdba/pgbackrest/pgbackrest.conf --stanza=patroni --log-level-file=detail archive-get %f "%p"'
      recovery_target_inclusive: true

While the node is being built, one can see in the top processes and in the Patroni service status that pgBackRest is being used instead of pg_basebackup:

● patroni.service - PostgreSQL high-availability manager
     Loaded: loaded (/lib/systemd/system/patroni.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/patroni.service.d
             └─override.conf
     Active: active (running) since Mon 2022-12-19 19:06:16 UTC; 3s ago
   Main PID: 2094 (patroni)
      Tasks: 11 (limit: 1143)
     Memory: 114.6M
        CPU: 941ms
     CGroup: /system.slice/patroni.service
             ├─2094 /usr/bin/python3 /usr/bin/patroni /etc/patroni/db2.yml
             ├─2100 /usr/bin/pgbackrest --config=/etc/pgbackrest/pgbackrest.conf --stanza=patroni --log-level-file=detail --delta restore
             ├─2102 /usr/bin/pgbackrest --config=/etc/pgbackrest/pgbackrest.conf --exec-id=2100-9618fafd --log-level-console=off --log-level-file=off --log-level-stderr=error --process=1 --remote-type=repo --stanza=p>
             ├─2103 /usr/bin/pgbackrest --config=/etc/pgbackrest/pgbackrest.conf --exec-id=2100-9618fafd --log-level-console=off --log-level-file=off --log-level-stderr=error --process=2 --remote-type=repo --stanza=p>
             ├─2104 ssh -o LogLevel=error -o Compression=no -o PasswordAuthentication=no postgres@192.168.0.3 "/usr/bin/pgbackrest --exec-id=2100-9618fafd --log-level-console=off --log-level-file=off --log-level-st>
             └─2105 ssh -o LogLevel=error -o Compression=no -o PasswordAuthentication=no postgres@192.168.0.3 "/usr/bin/pgbackrest --exec-id=2100-9618fafd --log-level-console=off --log-level-file=off --log-level-st>
Dec 19 19:06:16 ip-192-168-0-2 systemd[1]: Started PostgreSQL high-availability manager.
Dec 19 19:06:16 ip-192-168-0-2 patroni[2094]: 2022-12-19 19:06:16,415 INFO: Selected new etcd server http://192.168.0.1:2379
Dec 19 19:06:16 ip-192-168-0-2 patroni[2094]: 2022-12-19 19:06:16,434 INFO: No PostgreSQL configuration items changed, nothing to reload.
Dec 19 19:06:16 ip-192-168-0-2 patroni[2094]: 2022-12-19 19:06:16,551 INFO: Lock owner: node3; I am db2
Dec 19 19:06:16 ip-192-168-0-2 patroni[2094]: 2022-12-19 19:06:16,633 INFO: trying to bootstrap from leader 'node3'
Dec 19 19:06:16 ip-192-168-0-2 patroni[2100]: 2022-12-19 19:06:16.645 P00   INFO: restore command begin 2.41: --config=/etc/pgbackrest/pgbackrest.conf --delta --exec-id=2100-9618fafd --log-level-console=info --log-l>
Dec 19 19:06:16 ip-192-168-0-2 patroni[2100]: WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/var/lib/postgresql/14/main' to confirm that this is a valid $PGDATA director
Dec 19 19:06:17 ip-192-168-0-2 patroni[2100]: 2022-12-19 19:06:17.361 P00   INFO: repo1: restore backup set 20221213-154604F, recovery will start at 2022-12-13 15:46:04

Notice that it is using a delta restore, which means pgBackRest automatically identifies which files need to be restored and restores only those, making the whole process faster.

Also, the log file (by default – /var/log/pgbackrest) will contain the restore date and time as below:

-------------------PROCESS START-------------------
2022-12-19 19:06:16.645 P00   INFO: restore command begin 2.41: --config=/etc/pgbackrest/pgbackrest.conf --delta --exec-id=2100-9618fafd --log-level
-console=info --log-level-file=detail --pg1-path=/var/lib/postgresql/14/main --process-max=2 --repo1-host=192.168.0.3 --repo1-host-user=postgres -
-repo1-path=/pgrdbackups --stanza=patroni
2022-12-19 19:06:16.646 P00   WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/var/lib/postgresql/14/mai
n' to confirm that this is a valid $PGDATA directory.  --delta and --force have been disabled and if any files exist in the destination directories
the restore will be aborted.
2022-12-19 19:06:17.361 P00   INFO: repo1: restore backup set 20221213-154604F, recovery will start at 2022-12-13 15:46:04
2022-12-19 19:06:17.361 P00 DETAIL: check '/var/lib/postgresql/14/main' exists
2022-12-19 19:06:17.362 P00 DETAIL: create path '/var/lib/postgresql/14/main/base'
2022-12-19 19:06:17.362 P00 DETAIL: create path '/var/lib/postgresql/14/main/base/1'
2022-12-19 19:06:17.362 P00 DETAIL: create path '/var/lib/postgresql/14/main/base/13760'
2022-12-19 19:06:17.362 P00 DETAIL: create path '/var/lib/postgresql/14/main/base/13761'

This blog mainly focuses on integrating pgBackRest and Patroni; however, one can use other backup tools, such as WAL-E or Barman, to rebuild the nodes. More information on such configurations can be found in the Patroni documentation.

Conclusion

Patroni and pgBackRest work best when integrated, which helps reduce the load on the primary database server. This integration, where a node is reinitialized using pgBackRest, makes optimal use of the dedicated backup server. If Patroni is already configured and a dedicated backup host is available, a few configuration changes in patroni.yml can work wonders.

Dec
23
2022
--

A MyRocks Use Case

A MyRocks Use Case

I wrote this post on MyRocks because I believe it is the most interesting new MySQL storage engine to have appeared over the last few years. Although MyRocks is very efficient for writes, I chose a more generic workload that will provide a different MyRocks use case.

The use case is the TPC-C benchmark, but executed not on a high-end server but on a lower-spec virtual machine that is I/O limited, as with AWS EBS volumes, for example. I decided to use a virtual machine with two CPU cores, four GB of memory, and storage limited to a maximum of 1000 IOPs of 16KB. The storage device has performance characteristics pretty similar to an AWS gp2 EBS volume of about 330 GB in size. I emulated these limits using the KVM iotune settings in my lab.

<iotune>
     <total_iops_sec>1000</total_iops_sec>
     <total_bytes_sec>16384000</total_bytes_sec>
</iotune>

MyRocks and RocksDB

If you wonder what is the difference between MyRocks and RocksDB, consider MyRocks as the piece of code, or the glue, that allows MySQL to store data in RocksDB. RocksDB is a very efficient key-value store based on LSM trees. MyRocks stores table rows in RocksDB using an index id value concatenated with the primary key as the key and then the internal MySQL binary row representation as the value. MyRocks handles indexes in a similar fashion. There are obviously tons of details but that is the main principle behind MyRocks. Inside MyRocks, there is an embedded instance of RocksDB running.
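To see this mapping from the MySQL side, you can create a small MyRocks table and query information_schema.ROCKSDB_DDL, which lists the RocksDB index id behind each MySQL index. A minimal sketch, assuming a database named test and that the RocksDB storage engine is enabled; the table name rocks_demo is just an illustration.

# Hedged sketch: inspect the MySQL index -> RocksDB index id mapping
mysql test -e "
  CREATE TABLE rocks_demo (
    id INT PRIMARY KEY,
    payload VARCHAR(64) COLLATE latin1_bin
  ) ENGINE=ROCKSDB;
  SELECT * FROM information_schema.ROCKSDB_DDL WHERE table_name = 'rocks_demo';
"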

 

Dataset

I used a TPC-C dataset with a scale of 200. As seen in the figure below, the dataset sizes are very different between InnoDB and MyRocks. While the size is 20GB with InnoDB, it is only 4.3GB with MyRocks. This is a tribute to the efficient compression capabilities of MyRocks.

InnoDB and MyRocks dataset sizes

A keen observer will quickly realize the compressed dataset size with MyRocks is roughly the same as the amount of memory of the virtual machine. This is not an accident, it is on purpose. I want to illustrate, maybe using an obvious use case, that you can’t use general rules like “InnoDB is faster for reads” or “MyRocks is only good for writes”. A careful answer would be: “it depends…”

 

TPC-C on MyRocks

In order to be able to run the sysbench TPC-C script, you need to use a binary collation and the read-committed isolation level. You must also avoid foreign key constraints. A typical sysbench invocation would look like this:

./tpcc.lua --mysql-host=10.0.4.112 --mysql-user=sysbench --mysql-password=sysbench --mysql-ssl=off \
   --mysql-db=sysbench --threads=4 --scale=200 --use_fk=0 --mysql_storage_engine=rocksdb \
   --mysql_table_options="COLLATE latin1_bin" --trx_level=RC --report-interval=10 --time=3600 run

I used a rocksdb_block_cache_size of 512MB. I wanted most of the memory to be available for the file cache, where the compressed SST files will be cached. The block cache just needs to be large enough to keep the index and filter blocks in memory. In terms of compression, the relevant settings in the column family options are:

compression_per_level=kLZ4Compression;bottommost_compression=kZSTD;compression_opts=-14:1:0

MyRocks uses ZStd:1 compression for the bottom level and LZ4 for the upper levels. The bottom-level compression is really critical as it contains most of the data.

Being an LSM-type storage engine, RocksDB must frequently perform level compactions. Level compactions consume IOPs, and in environments where IOPs are scarce, they impact performance. Fortunately, MyRocks exposes the variable rocksdb_rate_limiter_bytes_per_sec to limit the impact of compaction: the IO bandwidth used by the background compaction threads is capped by this parameter. The following figure illustrates the impacts.

Impact of rocksdb_rate_limiter_bytes_per_sec on the TPC-C transaction rate

As the filesystem cache and the block cache warm up, the TPC-C transaction rate rises from 50 to around 175/s. After roughly 500s, the need for compaction arises and the performance drops. With no rate limit (0), the background threads consume too many IOPs and the compaction adversely affects the workload. With lower values of rocksdb_rate_limiter_bytes_per_sec, the impacts are reduced and the compactions are spread over longer periods of time.

For this environment, a rate limit of 4 MB/s achieves the smallest performance drops. Once warmed, the performance level never fell under 100 Trx/s. If you set rocksdb_rate_limiter_bytes_per_sec too low, like 1MB/s, compaction cannot keep up and processing has to stall for some time. You should allocate enough bandwidth for compaction to avoid these stalls.
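For reference, the MyRocks settings discussed so far can be collected into a single configuration fragment. Treat this as a sketch rather than a drop-in file: the include path is an assumption, rocksdb_default_cf_options is where I would place the column family compression string quoted above, and a server restart is needed for these options to take effect.

# Hedged sketch: persist the block cache, compression, and rate limiter settings
cat >> /etc/mysql/mysql.conf.d/myrocks-tuning.cnf <<'EOF'
[mysqld]
rocksdb_block_cache_size = 512M
rocksdb_rate_limiter_bytes_per_sec = 4194304   # 4 MB/s
rocksdb_default_cf_options = "compression_per_level=kLZ4Compression;bottommost_compression=kZSTD;compression_opts=-14:1:0"
EOF
systemctl restart mysql   # or the service name used by your distribution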

Long term stability

Over time, as data accumulates in the RocksDB LSM tree, performance can degrade. Using the 2 MB/s rate limiter, I pushed the runtime to 10 hours and observed very little degradation as shown in the following figure.

MyRocks performance stability

There are of course many compaction events but the performance baseline remains stable.

 

MyRocks Vs InnoDB

Now, how does this workload perform on InnoDB? InnoDB is more IO bound than MyRocks; essentially, the 20GB dataset is large for the 3GB buffer pool.

MyRocks Vs InnoDB

The compaction event diminishes MyRocks performance but even then, the transaction rate stays well above the InnoDB one. Over the course of one hour, InnoDB executed 125k transactions while MyRocks achieved in excess of 575k transactions. Even if InnoDB uses compression (CMP8k), the performance level is still much lower.

Conclusion

I hope this post has raised your interest in the MyRocks storage engine. If you are paying too much for cloud-based storage and IOPs, make sure you evaluate MyRocks as it has super compression capabilities and is IO efficient.

Note: all the raw results, scripts, and configuration files used for this post can be found on Github.
