Aug 30, 2023

PostgreSQL Parameters: Scope and Priority Users Should Know


PostgreSQL allows users to set parameters at different scopes, and the same parameter can be specified in different places using different methods, which can lead to conflicts. Someone might wonder why certain changes are not taking effect, so it is important to understand the scope and priority of settings.

In this blog, I will walk through the options available to users, in increasing order of priority. The purpose is to give users a high-level view.

1. Compile time parameter settings

These are the parameters that are set at compile time, and they act as the default values for PostgreSQL. We can check these values in the boot_val field of pg_settings.

select name,boot_val from pg_settings;

These compile-time settings have the least priority and can be overridden at any other level. However, some of these parameters cannot be modified by any other means. Changing these values at compile time is not intended for common use. If a PostgreSQL user wants to change them, they need to recompile PostgreSQL from the source code. Some are exposed through configure command-line options, for example:

--with-blocksize=BLOCKSIZE sets the table block size in kB. The default is 8 kB.
--with-segsize=SEGSIZE sets the table segment size in GB. The default is 1 GB, which means PostgreSQL creates a new file in the data directory as the table size exceeds 1 GB.
--with-wal-blocksize=BLOCKSIZE sets the WAL block size in kB. The default is 8 kB.

Most parameters have compile-time defaults, which is why we can start running PostgreSQL after specifying only a very minimal number of parameter values.

2. Data directory/initialization-specific parameter settings

Parameters can also be specified at data directory initialization time. Some of these parameters cannot be changed by other means, or are difficult to change.

For example, wal_segment_size, which determines the WAL segment file size, is such a parameter. PostgreSQL generates WAL segment files of 16MB by default, and this can be specified only at initialization time. This is also the level at which the decision on whether to use data_checksums needs to be taken. It can be changed later using the pg_checksums utility, but that would be a painful exercise on a big database.
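
Both of these initialization-time choices can be inspected on a running cluster. Here is a minimal check through pg_settings (both are read-only, preset parameters):

select name,setting from pg_settings where name in ('wal_segment_size','data_checksums');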

The default character encoding and locale settings can also be specified at this level, but they can be specified at subsequent levels as well. You may refer to the initdb options for more information: https://www.postgresql.org/docs/current/app-initdb.html.

The parameters taken from the specific data directory initialization, which override the built-in defaults, can be checked like this:

select name,setting from pg_settings where source='override';

This override includes some of the calculated auto-tune values for that environment.

3. PostgreSQL parameters set by environment variables

PostgreSQL executables, including the postmaster, honor many environment variables, but these are generally used by client tools. The most common one used by the PostgreSQL server (postmaster) is PGDATA, which sets the parameter data_directory. These variables can be specified by service managers like systemd.

$ cat /usr/lib/systemd/system/postgresql-14.service
...
Environment=PGDATA=/var/lib/pgsql/14/data/
…
Environment=PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
Environment=PG_OOM_ADJUST_VALUE=0

For automation and scripting, this can be handy. Here is an example:

$ export PGDATA=/home/postgres/data
$ export PGPORT=5434
$ pg_ctl start
waiting for server to start....2023-08-04 06:53:09.637 UTC [5787] LOG:  starting PostgreSQL 15.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2023-08-04 06:53:09.637 UTC [5787] LOG:  listening on IPv6 address "::1", port 5434
2023-08-04 06:53:09.637 UTC [5787] LOG:  listening on IPv4 address "127.0.0.1", port 5434
2023-08-04 06:53:09.639 UTC [5787] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5434"

As we can see, PostgreSQL picked up port 5434 from the environment.

4. Configuration files

Probably, this is the method every novice user will be aware of. The fundamental configuration file is postgresql.conf, and it is the most common place for global settings. PostgreSQL looks for a configuration file in PGDATA by default, but an alternate location can be specified using the postmaster command-line parameter config_file. The parameter specifications can be split into multiple files and directories because PostgreSQL supports the include and include_dir directives in its configuration files. So, there can be nested/cascaded configurations.

PostgreSQL rereads all its configuration files if it receives a SIGHUP signal. If the same parameter is set in multiple locations, the last one read wins. Among all configuration files, postgresql.auto.conf gets the highest priority because it is read last. That is where all “ALTER SYSTEM SET/RESET” commands keep their information.
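
For example, a global change can be persisted into postgresql.auto.conf and applied with a reload; a minimal sketch (the parameter chosen here is just an illustration):

ALTER SYSTEM SET work_mem = '32MB';
SELECT pg_reload_conf();
select name,setting,source,sourcefile from pg_settings where name='work_mem';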

5. Command line argument to postmaster

The postmaster, aka postgres, can take parameter settings as command-line arguments (it also has options to report the values). This is one of the most reliable methods used by many external tools to manage the PostgreSQL service. For example, the high availability solution Patroni passes some of the most critical parameters as command-line arguments. Here is how the postgres process with command-line options looks in a Patroni environment:

/usr/pgsql-14/bin/postgres -D /var/lib/pgsql/14/data --config-file=/var/lib/pgsql/14/data/postgresql.conf --listen_addresses=0.0.0.0 --port=5432 
 --cluster_name=kc_primary_cluster --wal_level=replica --hot_standby=on --max_connections=100 --max_wal_senders=5 --max_prepared_transactions=0 
 --max_locks_per_transaction=64 --track_commit_timestamp=off --max_replication_slots=10 --max_worker_processes=8 --wal_log_hints=on

So Patroni can ensure that no local configuration mistake can adversely affect the availability and stability of the cluster. But changing these is possible only at server startup. Obviously, they take precedence over the values from the configuration files. The scope is the instance level. This answers many Patroni users’ questions about why they cannot change some of the parameters directly in the parameter file. PostgreSQL users can check which parameters came as command-line arguments like this:

postgres=# select name,setting from pg_settings where source='command line';
            name            |      setting
----------------------------+-------------------
 cluster_name               | perconapg_cluster
 hot_standby                | on
 listen_addresses           | 0.0.0.0
 max_connections            | 100
 max_locks_per_transaction  | 64
 max_prepared_transactions  | 0
 max_replication_slots      | 10
 max_wal_senders            | 5
 max_worker_processes       | 8
 port                       | 5432
 track_commit_timestamp     | off
 wal_level                  | replica
 wal_log_hints              | on
(13 rows)

Parameter specifications up to this level can have the “postmaster” context. The concept of “context” is discussed in the next section.

6. Database level setting

All options discussed so far have a global scope, meaning they are applicable to the entire instance. But there could be reasons why a PostgreSQL user wants to change a setting for a specific database. For example, one of the databases might handle an OLTP workload where query parallelism may not really be needed and may even have an adverse impact, while another database might be an OLAP system.

postgres=# ALTER DATABASE newdb SET max_parallel_workers_per_gather = 4;
ALTER DATABASE
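
Per-database settings like this one are stored in the pg_db_role_setting catalog; a quick way to review them is a query such as:

select d.datname, s.setconfig
  from pg_db_role_setting s
  join pg_database d on d.oid = s.setdatabase
 where s.setrole = 0;  -- 0 means the setting is not tied to a particular role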

The concept of context

At this stage, we should recollect another concept called the “context” of a parameter. For example, the network port on which PostgreSQL listens cannot be changed at the individual database level; a change to such a parameter requires a PostgreSQL restart. So we say that the context of the parameter port is postmaster, meaning a change requires the postmaster, the main process of PostgreSQL, to restart. Please refer to the documentation (https://www.postgresql.org/docs/current/view-pg-settings.html) to understand the different contexts of PostgreSQL parameters. From this level onwards, we are not allowed to change certain sets of parameters; any attempt will be prevented.

postgres=# ALTER DATABASE db1 SET max_connections=100;
ERROR:  parameter "max_connections" cannot be changed without restarting the server

The max_connections parameter must be specified at the global (instance) level by the postmaster, and changing it requires a restart.

There is another set of parameters that can only be handled through the postmaster, even though changing them does not require restarting the server. That context is called sighup: we can signal the postmaster, and it will re-read such parameters and propagate them to all its child processes. Changing them at the database level is also prevented.

postgres=# ALTER DATABASE db1 SET log_filename='postgresql-DB1.log';
ERROR:  parameter "log_filename" cannot be changed now
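
The context of any parameter can be checked from pg_settings; for instance, comparing a few of the parameters discussed so far:

select name,context from pg_settings where name in ('port','max_connections','log_filename','work_mem');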

You may even consider looking at the PostgreSQL source code:

https://github.com/postgres/postgres/blob/6fde2d9a005a5bc04aa059d3faeb865c8dd322ce/src/backend/utils/misc/guc.c#L3376

for a much deeper understanding of the logic of “context” and what is allowed in which level.

7. User-level settings

Each user can have their preferred parameter settings so that all sessions created by that user will have those settings in place. Please remember that user-level settings take precedence over database-level settings. Users can check their own user-level settings like this:

select name,setting,source,context from pg_settings where  source='user';
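
Setting a parameter at this level is done with ALTER USER / ALTER ROLE; a minimal sketch (the role name is hypothetical):

ALTER ROLE report_user SET work_mem = '64MB';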

8. Database – user combination

PostgreSQL allows us to have parameter settings that will be applicable when a particular user/role connects to a particular database.

For example:

ALTER USER admin IN DATABASE db1 SET max_parallel_workers_per_gather=6;

A setting at this level has an even higher priority than everything mentioned before.

select name,setting,source,context from pg_settings where  name='max_parallel_workers_per_gather';
               name               | setting |    source     | context
----------------------------------+---------+---------------+---------
 max_parallel_workers_per_gather  | 6       | database user | user
(1 row)

9. Parameters by the client connection request

There is an option to specify parameters while making a new connection. It can be passed to PostgreSQL as part of the connection string.

For example, say I want to connect to the database to perform some bulk data loading and manipulation (ETL), and I don’t want to wait for any WAL writes. If there is a crash in between, I am OK with performing the ETL again. So, I am going to request a connection with synchronous_commit off.

$ psql "host=localhost user=postgres options='-c synchronous_commit=off'"
psql (14.8)
Type "help" for help.

postgres=# select name,setting,source,context from pg_settings where  name='synchronous_commit';
        name        | setting | source | context
--------------------+---------+--------+---------
 synchronous_commit | off     | client | user
(1 row)

10. Session-level setting

Each session can decide on its own settings at any point during execution and is allowed to modify them as and when required.

postgres=# set jit=off;
SET
postgres=# select name,setting,source,context from pg_settings where  name='jit';
 name | setting | source  | context
------+---------+---------+---------
 jit  | off     | session | user
(1 row)

A good use case: suppose we are going to rebuild a big index, and we know it is going to use a considerable amount of maintenance_work_mem. Setting this at the session level simplifies our life without affecting other sessions.

set maintenance_work_mem = '4GB';
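
The rebuild itself and the cleanup could then follow in the same session; a sketch (the index name is hypothetical):

REINDEX INDEX CONCURRENTLY orders_created_at_idx;
RESET maintenance_work_mem;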

11. Transaction-level settings

PostgreSQL allows us to specify parameters at a very small scope, like transaction level.

Here is an example of discouraging sequential scans in a particular transaction block.

postgres=# BEGIN;
BEGIN
postgres=*# SET LOCAL enable_seqscan=off;
SET

I prefer transaction-level settings because the changes are local to the transaction and are reverted once the transaction completes. This is the most preferred place to set work_mem to minimize its impact.
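
A minimal sketch of that work_mem pattern; the setting lives only until the transaction ends:

BEGIN;
SET LOCAL work_mem = '256MB';
-- run the memory-hungry query here
COMMIT;
SHOW work_mem;   -- back to the previous value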

12. Object-level settings

PostgreSQL allows us to specify parameters specific to a program object, like a PL/pgSQL function, so the setting becomes part of the function definition.

Here is an example function definition to test function-level settings.

CREATE OR REPLACE FUNCTION checkParams()
 RETURNS BOOLEAN
as $$
DECLARE
  nm TEXT;
  setng TEXT;
  cntxt TEXT;
  src TEXT;
BEGIN
SELECT name,setting,context,source INTO nm,setng,cntxt,src from pg_settings where  name='enable_partitionwise_join';
RAISE NOTICE 'Parameter Name: % value:%  Context:%  Source:%',nm,setng,cntxt,src;
RETURN true;
END;
$$ LANGUAGE plpgsql
SET enable_partitionwise_join = 'on'
SET enable_seqscan = 'off';
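
Calling the function shows, through the RAISE NOTICE line, the values that are in effect inside the function body:

SELECT checkParams();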

PostgreSQL parameters summary

PostgreSQL parameter specification is very flexible and powerful, so understanding the scope, context, and priority is important for every user. A rule of thumb could be: the broader the scope, the lower the priority.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together.

 

Download Percona Distribution for PostgreSQL Today!

Aug 30, 2023

Backup and Recovery for Databases: What You Should Know

People used to say, “Coal is king,” and for decades, it was. Coal powered just about everything, but mismanagement and a lack of stewardship left some messes. 

These days, “Data is king,” and a lot more. Data powers everything, and unlike coal and coal combustion, data and databases aren’t going away. So staying with our analogy, and learning from historical lessons, organizations must be responsible stewards of data — for the sake of customers, stakeholders, and the business itself. 

Any organization that uses personal and other sensitive data must have a firm, proven plan for business continuity in the event of a disaster or cyberattack. Losing access to or control of data for an extended period of time will disrupt operations, lead to financial losses, and damage an organization’s reputation. Recovering from a tarnished reputation can be costly and time-consuming.

Data stewardship also means protecting people’s privacy, safeguarding against breaches, and adhering to regulations and standards such as the European Union’s General Data Protection Regulation (GDPR), the United States’ Sarbanes-Oxley Act (SOX), and the Payment Card Industry Data Security Standard (PCI DSS).  

In this blog, we’ll focus on the elements of database backup and disaster recovery, and we’ll introduce proven solutions for maintaining business continuity, even amid otherwise dire circumstances.

Why backup and recovery preparedness is so important

Such a royal introduction demands background, so let’s get after it: Any data loss or unexpected downtime hurts an organization. Some losses can be crippling, even a business death knell. So it’s important to know (and not overlook) the trouble spots. Some are hidden, but some are in plain sight. In general terms, here are potential trouble spots:

  • Hardware failure: Manufacturing defects, wear and tear, physical damage, and other factors can cause hardware to fail. Power surges, outages, and harsh conditions (i.e., heat) can damage hardware components and prompt data loss. 
  • Software failure: Software applications can become vulnerable, or they can crash altogether. Without data backup mechanisms, there can be data loss or system downtime. Even worse, entire operating systems can crash, also resulting in data loss.
  • Human mistakes: Incorrect configuration is an all-too-common cause of hardware and software failure. Similarly, accidental deletion is a culprit.

In all three instances, failure to regularly back up data can result in significant data loss in the event of a disaster. Solid backup procedures must be in place.

Backup types and strategies

It all begins with choosing a strategy, and that depends on factors such as the use and importance of data to your business, your recovery time objectives (RTO), and your budget. Depending on what is needed, here are some common database backup types, strategies, and considerations:

Full backup vs. incremental backup: Best suited for smaller databases or those that don’t incur a lot of changes, a full backup includes a complete restore point. Though it can put safety concerns at ease, a full backup can be time-consuming and expensive. An incremental backup, which is faster and requires less storage than a full backup, captures changes made since the previous backup. It’s suitable for databases with moderate change rates. (For a more detailed description, read Full vs. Incremental vs. Differential Backups: Comparing Backup Types.)

Hot backups vs. cold backups: A hot backup — used to minimize downtime and ensure data availability for critical applications — allows you to create a copy of a database while the system is still actively serving user requests and processing transactions. In a cold backup, the database is taken offline. This is advisable only in certain scenarios, such as maintenance during low-use hours.

Choosing the right backup frequency: The appropriate frequency, of course, will vary from one organization to another. If you can’t afford to lose much data, you’ll need more frequent backups, possibly even continuous data protection solutions. The rate at which data changes within your database is a significant factor as well. To incorporate the latest changes, databases with high write activity might require more frequent backups.

Offsite backup: This involves physically storing backups in a secure location somewhere other than the primary data center. It’s more expensive, but offsite backup protects against site-wide disasters.

With the right backup strategy, you’ll be better able to achieve the aforementioned business continuity. Still, disasters might happen, so let’s also examine recovery.

Disaster recovery strategies and testing

One might think disaster begets loss, but with the right planning, that loss can be so minimal it’s more “blip” than “bad.” With that planning in place, you can look at it this way instead: Backup begets disaster recovery.

Disaster recovery (DR) strategies are essential to ensuring the integrity, availability, and reliability of data, particularly in the event of unexpected failures or errors. Such strategies help restore a database to a consistent and usable state. A disaster recovery plan can be as simple as the use of a backup and recovery procedure, or it can be complex, depending on the RTO and the recovery point objective (RPO). Key elements of comprehensive database recovery plans include point-in-time recovery, high availability and failover, replication, and others (we’ll break them down in a bit). 

Whatever plan you arrive at, you must test it. There should be no shortcuts in the testing, which should include:

  • Backup testing: Test to ensure that data can indeed be recovered from backups. This includes both data and log backups.
  • Failover testing: If you have a high availability setup or a secondary data center, test to ensure the failover process switches over to the backup database server.
  • Application testing: Test to make sure applications function correctly after recovery.
  • Data consistency: Verify that data consistency between the primary and secondary systems will exist after recovery.

Backup and recovery tools

Now that we’ve covered some strategies, it’s a good time to look at some tools for putting plans in motion. There’s too much out there to provide a comprehensive list, so we’ll mention just a few high-profile options. These obviously are not detailed descriptions; they’re just introductions:

MySQL Enterprise Backup: This software supports hot backups and incremental backups.

Oracle Recovery Manager (RMAN): This enables users of Oracle databases to perform full, incremental, and differential backups. This tool also provides point-in-time recovery.

SQL Server Management Studio: This software includes backup and recovery tools for use with Microsoft SQL Server.

MongoDB Atlas backup: This hosted cloud service offers continuous backups or snapshots for point-in-time recovery, as well as incremental backups.

MongoDB Cloud Manager: This hosted service, which uses a graphical user interface, supports backup and restoration of replica sets and sharded clusters.

MongoDB Ops Manager: Available with Enterprise Advanced subscriptions, this is an on-premise tool that provides backup software and features much like those of Cloud Manager.

IBM Data Studio: This provides graphics-based and command-line tools for backup and recovery within IBM Db2 databases.

Commvault: This data management and protection software includes features for backup and recovery. It’s used to help ensure the functionality of hardware, software, and applications. 

All of the options above have enterprise-grade attributes, but not one of them is truly open source. So let’s discuss some options that are open source.

Open source Percona solutions for backup and disaster recovery

When you read reviews of what’s out there for backup and disaster recovery solutions, you tend to see glowing words about software with big-name recognition. But you don’t see a warning that the software is proprietary (most often, it is); you just see hints, like a “starting price.” And you certainly don’t see a warning that vendor lock-in might be lurking.

Amid the options, you see Percona at or near the top of lists and reviews. But with Percona, there are differences. There’s no hint of proprietary lock-in because it doesn’t exist. The software is truly open source. Percona solutions, fully supported and enterprise-grade, include:

Percona Backup for MongoDB: Percona Backup for MongoDB is a distributed and low-impact solution for consistent backups of MongoDB clusters, including sharding support. It enables you to make logical, physical, incremental, and selective backups and restores. Plus, point-in-time recovery functionality allows you to recover your database to a specific timestamp.

Percona XtraBackup: This is a free, complete online backup solution for all versions of Percona Server for MySQL, MySQL, and MariaDB. Percona XtraBackup performs online non-blocking, tightly compressed, highly secure backups on transactional systems. Percona XtraBackup is the world’s only open source, free MySQL hot backup software that performs non-blocking backups for InnoDB and XtraDB databases. 

Percona Distribution for PostgreSQL: This production-ready PostgreSQL distribution includes pgBackRest, an open source backup and restore solution that enables full backup, incremental backup, and differential backup in PostgreSQL databases. The toolset also supports high availability and disaster recovery via Patroni, pgBouncer, and HAProxy.

Essential elements of database backup and recovery

Now, let’s introduce a couple of elements of an ideal backup and recovery solution. (There are more coming in the next section, but these two are big, huge in fact.)

Database replication

Data redundancy is a cornerstone of disaster recovery strategy, and to achieve it, you must have replication mechanisms in place. Depending on the use case, that can mean synchronous replication or asynchronous replication. 

In synchronous replication, data is written to the primary database. The data is then copied to one or more replica databases. The primary database waits for acknowledgment from the replica(s) before confirming the transaction to the application. This means that the data in the primary and replica databases is always in sync. (View a Percona whitepaper that shows synchronous replication in action.)

In asynchronous replication, data is written to the primary database, but the acknowledgment to the application occurs before the data is replicated to secondary databases. This results in a delay between the time data is written to the primary and when it appears in the replica(s). Real-time data consistency is not guaranteed.
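
As a concrete illustration in PostgreSQL terms (a sketch only; the standby name is hypothetical), whether commits wait for a replica is driven mainly by two settings:

-- synchronous: commits wait for the named standby to confirm
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (replica1)';
ALTER SYSTEM SET synchronous_commit = 'on';
SELECT pg_reload_conf();
-- an empty synchronous_standby_names leaves replication asynchronous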

Point-in-time recovery

With point-in-time recovery, a database is restored to a specific moment in time rather than the time of the most recent backup. PITR is especially essential in situations when data integrity and consistency cannot be compromised. Financial systems and critical business applications are especially dependent on PITR. In addition to protecting data accuracy and limiting data loss, PITR can help with auditing and compliance requirements by providing a record of changes to the database.

Common components of DR and HA architectures

By definition, there are differences between high availability (HA) and disaster recovery (DR). High availability is focused on preventing downtime and ensuring that the database remains available; disaster recovery is focused on recovering from a catastrophic event and minimizing negative effects on the business. High availability typically involves redundant hardware, software, applications, and network components that can quickly take over if the primary component fails; disaster recovery typically involves regular backups, replication to a secondary site, and a clear recovery plan with steps to be taken in the event of a disaster. 

Although the emphasis and configurations may vary depending on whether the focus is on high availability or disaster recovery, there are shared components. Some of them include:

  • Redundant hardware: Both HA and DR use redundant hardware components, including servers and storage devices. Redundancy ensures that if one component fails, there is another to take its place. This helps minimize downtime.
  • Clustering: In HA, clustering helps ensure that there are redundant database servers; if one fails, another can take over. This minimizes downtime during disasters or hardware failures. For disaster recovery, clustering can be used to maintain a synchronized copy of the database in a different location. Database replication, log shipping, or synchronous data mirroring can be used for DR purposes.
  • Load balancing: Load balancers distribute traffic evenly across multiple servers or data centers. In HA, load balancers help ensure that no single server is overwhelmed. In DR, load balancers route traffic to the secondary data center when a failover occurs.
  • Backup systems: Both HA and DR architectures have backup systems in place. HA setups tend to use backup servers within the same data center. DR setups have backup data centers in different locations.
  • Monitoring and alerting: Continuous monitoring of system health and performance is essential for both HA and DR. Automated alerts are set up to notify administrators of any issues that require attention.

What’s at stake and what to do about it

Now that we’ve taken a deeper dive into the components of a backup and disaster recovery solution, as well as a look at high availability, let’s expand on what backup and disaster recovery solutions are used for. We’ll also talk about what can happen if you don’t have a solid plan in place, if you take shortcuts, or if you turn it all over to the wrong vendor.

Data archiving and retention matter

We’ve discussed the essential nature of backup and recovery in business continuity. Relatedly, by separating historical data from operational data, archiving helps you manage data growth, maintain compliance, and optimize backup processes. Likewise, establishing clear retention policies for both backups and archives is crucial to balancing data recovery needs with data management efficiency and compliance requirements.

And the essential nature of compliance can’t be emphasized enough. Failure to adhere to legal requirements can result in monetary and even criminal penalties, reputational damage, and loss of data integrity. Here are some of those regulations:

General Data Protection Regulation (GDPR): In addition to stipulating that organizations must have a lawful basis for processing personal data, this regulation includes guidelines for data retention and erasure policies.

Sarbanes-Oxley Act: SOX, a U.S. federal law, requires companies to have database disaster recovery and business continuity plans in place. The purpose is to ensure the availability and integrity of financial data amid unexpected events.

Payment Card Industry Data Security Standard: The PCI DSS mandates that organizations must regularly back up data, including critical payment card data, so that data availability and integrity are maintained. The PCI DSS also prescribes steps for responding to data breaches and other security incidents, including how to restore services and data after a disaster. 

California Consumer Privacy Act: Similar to GDPR, CCPA also includes mandates concerning data retention and erasure policies.

Regional regulations: Depending on your location and who your customers are, your organization might have to adhere to privacy, archiving, and retention mandates as spelled out in HIPAA (healthcare), FERPA (education), PIPEDA (Canada), and other regulations.

Examining cloud-based backup solutions

Benefits related to accessibility, automation, scalability, and security might inspire you to go with a cloud service for database backup and disaster recovery. If this is the direction you choose, carefully consider the long-term costs, data security concerns, and potential vendor lock-in. Find a partner that will answer any questions about such concerns — and be as certain as possible that vendor lock-in is not on the horizon.

While cloud services can be cost-effective initially, long-term costs can escalate if your needs for data storage grow significantly. Relatedly, transferring large amounts of data into or out of the cloud can be expensive, especially for companies with limited bandwidth. Further, there are data compliance and security concerns that could matter a lot more to you than to the vendor. Whereas the vendor might make promises, the responsibility and potential repercussions are all on you. 

Safe and reliable cloud services exist, but read the fine print and ask those questions.

Never scrimp on security measures

Here are some of the best practices for helping ensure that data remains secure during backup and recovery:

  • Encryption (which also could have appeared in the Essential Elements section) is a must-have component of database backup and recovery strategies. Using algorithms and keys to make data unreadable, encryption helps safeguard information during backup and recovery. Even if backup files fall into the wrong hands, you’re safe. Encryption also helps you adhere to strict data-protection regulations (GDPR, Sarbanes-Oxley Act, PCI DSS, CCPA, HIPAA, etc.).
  • Access control is the process of managing who can access a database and what actions they can perform. In backup and recovery, access control helps prevent unauthorized access to sensitive backup files and database recovery mechanisms.
  • Backup authorization and recovery authorization are the processes of determining whether a user or process has the permissions to perform a specific action. Authorization helps ensure that only authorized entities can initiate backup and recovery operations.

Be proactive — set up monitoring and alerting

The coal miners of yesteryear carried caged canaries deep into mine tunnels. If carbon monoxide or other dangerous gasses were present, the gasses would kill the canary, signaling the miners to leave the tunnels immediately.

Thankfully, with database backup and disaster recovery, watching for warning signs can be a lot more scientific — and foolproof. Instead of relying on a wing and a prayer, an effective monitoring and alerting solution can rely on:

Thresholds: Thresholds for various metrics, such as backup completion time, replication lag, and resource availability, are defined. When those thresholds are reached, alerts are triggered.

Notification channels for real-time alerts: With the proper configuration, the appropriate personnel are contacted promptly via automated email, text, chat, and other channels when the previously mentioned thresholds are hit. Such notification should include an escalation process in which a different support avenue or person is contacted if the alert is not acknowledged or resolved within a predetermined amount of time.

Automated storage increase: There should be automation — tied into the alerts — in which storage space is increased when it reaches a predefined threshold. This will help prevent backup failures.

Detailed logs: It’s important to maintain logs of all monitoring activities and alerts. Then, you have the information to generate reports for identifying trends and areas of improvement. 

With the right monitoring system in place, you can avoid losses. You also can spot critical performance issues faster, understand the root cause of incidents better, and troubleshoot them more efficiently moving forward.

Recovering from different scenarios

In database backup and disaster recovery planning, clear steps for dealing with hardware failure must be in place. Those steps should include:

  • Identification: The first step in recovering from a hardware failure is identifying the affected hardware component. This could be a hard drive, a server, or even an entire data center. Monitoring tools and systems can help detect such failures and trigger automated alerts described in the previous section.
  • Isolation and remediation: Once the failure is identified, IT staff or automated systems should work to isolate the affected hardware and restore it to a functional state as soon as possible. This may involve taking a server offline, rerouting network traffic, or replacing a failed disk drive.
  • Restoration: With the hardware issue resolved, the next step is to restore the database services. This involves restarting database servers, restoring network connections, and ensuring that the database management system is operational. The previously mentioned automated failover mechanisms and load balancers can help minimize downtime during this phase.
  • Recovery of data from backups: This might be a full backup or a combination of full and incremental backups. Use backup software to restore the database to its clean state at the time of the last good backup. This might involve copying data from backup storage to the production database server.

Perhaps we should call this the All Things Ominous Section because now we’ll look at restoring data after a cyberattack. (Of course, the dead canary was pretty dark.) 

But really, this section is about bouncing back. With the right preparedness, a cyberattack doesn’t have to be that death knell to your database system and business. In fact, you can beat down an incursion and come back stronger. So let’s look at planning, preparedness, and a systematic approach to minimizing downtime and data loss while ensuring the security and integrity of your systems and information. Here are key steps in overcoming a cyberattack:

Preparation: Conduct regular backups that ensure you have recent, clean copies of your data to restore. Keep backup copies in offsite or isolated locations, perhaps in the cloud. This safeguards data from physical damage or compromise in the event of an attack.

Documentation: Maintain thorough documentation of your database configurations, schemas, and data structures. This documentation will be invaluable during the restoration process.

Response plan: Develop a clear incident response plan that outlines roles, responsibilities, and steps to take in the event of a cyberattack.

Detection and isolation: As soon as you detect an attack, quickly identify the scope and nature. Determine which databases or systems are affected. Then, quarantine or disconnect affected systems from the network to prevent the spread of malware and further data corruption.

Damage assessment: Evaluate the extent of data loss or corruption. This assessment will help determine the appropriate restoration strategy.

Culprit identification: So that you can patch vulnerabilities and prevent future attacks, determine how the attack happened.

Data restoration: Use your latest clean backup to restore the affected database. Ensure the backup is from a time before the attack occurred. In some cases in which the attack had compromised data, you might have to perform incremental restoration. This involves applying incremental backups to bring the data up to standards.

Security updates and auditing: Immediately patch and update the database system and associated software to address vulnerabilities that were exploited. To prevent future attacks, implement intrusion detection systems (IDS) and access controls.

Data consistency and integrity

You must maintain data accuracy before, during, and after a disaster or attack. By doing so, your organization can recover quickly and reliably. In addition to replication, monitoring and alerts, encryption, auditing, and other activities already mentioned, here are some other best practices for maintaining data accuracy and addressing data corruption:

  • Perform regular backup testing, full and incremental, to verify data integrity.
  • Perform automated backup verification to check the integrity of backup files and ensure they are not corrupted.
  • Implement version control for database backups, which will give you a history of changes and let you choose a specific point-in-time if necessary.
  • Always apply the latest security patches and updates to your database management system and backup software to prevent vulnerabilities.
  • Use IDS and security information and event management (SIEM) tools to monitor network activity and detect suspicious activity.
  • Develop and regularly update a comprehensive disaster recovery plan that outlines roles, responsibilities, and procedures for data recovery in various scenarios.
  • Consider having outside experts assess your data protection and recovery plans and provide recommendations for improvement.

Scaling backup and recovery processes

Massive amounts of data can reside in large and enterprise databases. Though it might seem obvious, it’s important that you aren’t caught off-guard, and that means having significant storage capacity and efficient data transfer mechanisms.

Enterprises also typically require frequent backups, ranging from daily to hourly, depending on their RPOs. Automated backup processes are essential in ensuring data consistency and minimizing downtime. Techniques like online backups and snapshot-based backups can help ensure databases remain accessible during the backup process.

Relatedly, transferring large database backups over the network can strain available bandwidth. So, enterprises might need dedicated high-speed connections or WAN optimization solutions to mitigate network congestion during backup operations. To reduce storage and bandwidth requirements, compression and deduplication techniques are often applied to the backup data. This involves identifying redundant data and storing only the unique blocks.

Planning for business continuity

In real estate, they say “location, location, location.” In database management, we say: Proactivity. Proactivity. Proactivity. (OK, not so catchy and maybe we don’t say it in those words, but we think it.) And here, we’ll say it a fourth time in relation to protecting data and avoiding downtime: Proactivity. 

Any business or organization that relies on data (meaning just about all of them) must be proactive if they’re to maintain business continuity amid a disaster, power outage, cyberattack, or other event that could threaten data processes.

We covered a lot about components, tools, and best technical practices, so here we’ll key in on the actual planning parts that should be included in a business continuity plan. Since replication and redundancy are such big elements of database management, let’s stick with the redundancy theme and call these proactive activities:

Risk assessment: Identify potential risks and threats that could negatively affect your database systems. They might include natural disasters (hurricanes, tornadoes, earthquakes, flooding, blizzards, etc.), cyberattacks (data breaches, malicious bugs, ransomware, etc.), human error, and hardware failure. 

Impact assessment: Evaluate how each identified risk or threat could negatively affect your database systems.

Recovery objectives: To determine how quickly your organization must recover its database systems after a disruption (maximum allowable downtime), you should establish an RTO. To determine the maximum amount of data loss that your organization can tolerate, you should set an RPO. This will determine how frequently you should back up your data.

Disaster recovery plan (DRP): All the components in this section are part of the DRP, which outlines the steps to be taken in the event of a disaster or cyberattack. The DRP should include roles and responsibilities, communication procedures, and recovery procedures. You should test the DRP regularly through simulations to ensure it works effectively.

Communication plan: Develop a communication plan for keeping employees, customers, and stakeholders informed during a disaster or cyberattack.

Financial plan: Allocate budget resources for disaster recovery and business continuity initiatives to ensure they are adequately funded.

Additional tools and outside expertise: A business continuity plan isn’t something you just throw together. You might not have the tools and expertise on-staff to get it done. Consider ready-to-go backup software and whether or not you can design, implement, and maintain the business continuity plan on your own. If not, consider outside help, but beware of proprietary licensing and the pitfalls of vendor lock-in.  

Keep learning

Now that you’ve got the basics down, level up your knowledge with our on-demand presentation: The Many Ways To Copy Your Database. In it, Nicolai Plum of Booking.com discusses the best ways to copy your database – from logical data dump and file copying through native cloning and backup tools to advanced scale-out techniques.

 

Watch The Many Ways To Copy Your Database

Aug 29, 2023

Data Redundancy With the PostgreSQL Citus Extension


Over the years, I’ve had the opportunity to architect all sorts of configurations using Postgres as a backend. I’ve always found it very cool and satisfying to implement sophisticated business rules, often in more ways than one has fingers and toes. So, it’s not an understatement when I say that Citus is one of the more interesting technologies that I’ve come across when scaling PostgreSQL.

Citus is an extension that horizontally scales PostgreSQL across multiple machines by sharding tables and creating redundant copies of them. Its query engine not only parallelizes incoming SQL queries, which suits time series workloads, but also has the ability to create columnar tables, making it ideal for OLAP analysis duties.

Among its many capabilities, a Citus cluster can:

  • Create distributed tables that are sharded across a cluster of PostgreSQL nodes to combine their CPU, memory, storage, and I/O capacity.
  • Replicate reference tables to all nodes for joins and foreign keys from distributed tables and for maximum read performance.
  • Route and parallelize SELECT, DML, and other operations on distributed tables across the cluster with the distributed query engine.
  • Compress data with columnar storage of tables, speeding up scans and supporting fast projections, on both regular and distributed tables.

Data redundancy, a database version of a RAID

Pondering the case of high availability and redundancy, one replicates data by creating a replica via streaming replication.

Now let’s stretch our imagination and consider a second method of high availability, à la Citus.

The best way to describe the Citus way of doing things is to reflect how data is managed by a disk RAID array. Depending on the configuration, one can tune a hardware RAID for either performance or redundancy. The same can be said for Citus data sharding.

Here is an example of a table named replica2x, which has 2X redundancy across a cluster of four (4) nodes. The colors indicate duplicated shards of the table. For example, if node citus1 goes offline, the sharded table it holds still has copies on nodes citus2 and citus4. Likewise, it can be said that if node citus2 goes offline, the same data is still available on nodes 1, 3, and 4.

Citus example
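
The degree of redundancy illustrated above is governed by the citus.shard_replication_factor setting, which appears in the GUC listing later in this post. A minimal sketch of requesting 2x copies before distributing a table (the distribution column name is hypothetical):

SET citus.shard_replication_factor = 2;
SELECT create_distributed_table('replica2x', 'device_id');  -- distribution column is an assumption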

About this POC

I’ll be upfront: I love working with Linux Containers, LXD. Much of what I will show you makes heavy use of them. You won’t need LXD to replicate this POC, of course, but I can’t say enough how flexible and versatile such an environment can be when prototyping a cluster of Postgres nodes, let alone an entire multi-data center infrastructure on a single workstation.

There are two parts to this POC:

  • Part A: Setup
  • Part B: Redundancy demonstration

Part A: POC setup

Step one: Getting and installing Citus

Referring to this earlier blog, you’ll see how to get and install Citus into your environment.

Step two: Creating the Citus nodes

The Citus cluster consists of five (5) nodes:

  • citus-coord-01: coordinator
  • citus1: worker
  • citus2: worker
  • citus3: worker
  • citus4: worker

Using LXD, I created a single templated container with Citus on Ubuntu 20.04, from which the various nodes were copied:

for u in citus-coord-01 citus1 citus2 citus3 citus4
do
    echo "==== $u ===="
    lxc rm --force $u 2>/dev/null
    lxc cp template-ubuntu-2004-citusdb $u
    lxc start $u
done

And here’s the resultant cluster:

lxc ls -c ns4 citus

+----------------+---------+----------------------+
|      NAME      |  STATE  |         IPV4         |
+----------------+---------+----------------------+
| citus1         | RUNNING | 10.231.38.140 (eth0) |
+----------------+---------+----------------------+
| citus2         | RUNNING | 10.231.38.151 (eth0) |
+----------------+---------+----------------------+
| citus3         | RUNNING | 10.231.38.171 (eth0) |
+----------------+---------+----------------------+
| citus4         | RUNNING | 10.231.38.204 (eth0) |
+----------------+---------+----------------------+
| citus-coord-01 | RUNNING | 10.231.38.34 (eth0)  |
+----------------+---------+----------------------+

It’s understood that on each of the five nodes:

  1. Database db01 has already been created.
  2. The postgresql.conf configuration file has been appropriately edited for remote access.
  3. The .pgpass file has been configured to supply the superuser password for all nodes.
  4. Extension citus has been created in database db01.

Step three: Check packages

Inspecting the nodes confirms Citus has been correctly installed:

for u in citus1 citus2 citus3 citus4 citus-coord-01
do

echo "==== NODE: $u ===="

lxc exec $u -- su - postgres -c 'psql db01'<<_eof_
select extname, extversion from pg_extension;
_eof_

done | less -S

==== NODE: citus1 ====
    extname     | extversion 
----------------+------------
 plpgsql        | 1.0
 citus_columnar | 11.1-1
 citus          | 11.1-1

==== NODE: citus2 ====
    extname     | extversion 
----------------+------------
 plpgsql        | 1.0
 citus_columnar | 11.1-1
 citus          | 11.1-1

==== NODE: citus3 ====
    extname     | extversion 
----------------+------------
 plpgsql        | 1.0
 citus_columnar | 11.1-1
 citus          | 11.1-1

==== NODE: citus4 ====
    extname     | extversion 
----------------+------------
 plpgsql        | 1.0
 citus_columnar | 11.1-1
 citus          | 11.1-1

==== NODE: citus-coord-01 ====
    extname     | extversion 
----------------+------------
 plpgsql        | 1.0
 citus_columnar | 11.1-1
 citus          | 11.1-1

With Citus properly installed, its runtime variables are now available:

lxc exec citus1 -- su postgres -c psql db01<<_eof_ | less -S
    select name,setting,unit from pg_settings where name ~ 'citus' order by 1;
_eof_

                        name                        |     setting     | unit 
----------------------------------------------------+-----------------+------
 citus.all_modifications_commutative                | off             | 
 citus.background_task_queue_interval               | 5000            | ms
 citus.cluster_name                                 | default         | 
 citus.coordinator_aggregation_strategy             | row-gather      | 
 citus.count_distinct_error_rate                    | 0               | 
 citus.cpu_priority                                 | 0               | 
 citus.cpu_priority_for_logical_replication_senders | inherit         | 
 citus.defer_drop_after_shard_move                  | on              | 
 citus.defer_drop_after_shard_split                 | on              | 
 citus.defer_shard_delete_interval                  | 15000           | ms
 citus.desired_percent_disk_available_after_move    | 10              | 
 citus.distributed_deadlock_detection_factor        | 2               | 
 citus.enable_binary_protocol                       | on              | 
 citus.enable_create_role_propagation               | on              | 
 citus.enable_deadlock_prevention                   | on              | 
 citus.enable_local_execution                       | on              | 
 citus.enable_local_reference_table_foreign_keys    | on              | 
 citus.enable_repartition_joins                     | off             | 
 citus.enable_statistics_collection                 | off             | 
 citus.explain_all_tasks                            | off             | 
 citus.explain_analyze_sort_method                  | execution-time  | 
 citus.limit_clause_row_fetch_count                 | -1              | 
 citus.local_hostname                               | localhost       | 
 citus.local_shared_pool_size                       | 50              | 
 citus.local_table_join_policy                      | auto            | 
 citus.log_remote_commands                          | off             | 
 citus.max_adaptive_executor_pool_size              | 16              | 
 citus.max_cached_connection_lifetime               | 600000          | ms
 citus.max_cached_conns_per_worker                  | 1               | 
 citus.max_client_connections                       | -1              | 
 citus.max_high_priority_background_processes       | 2               | 
 citus.max_intermediate_result_size                 | 1048576         | kB
 citus.max_matview_size_to_auto_recreate            | 1024            | MB
 citus.max_shared_pool_size                         | 100             | 
 citus.max_worker_nodes_tracked                     | 2048            | 
 citus.multi_shard_modify_mode                      | parallel        | 
 citus.multi_task_query_log_level                   | off             | 
 citus.node_connection_timeout                      | 30000           | ms
 citus.node_conninfo                                | sslmode=require | 
 citus.propagate_set_commands                       | none            | 
 citus.recover_2pc_interval                         | 60000           | ms
 citus.remote_task_check_interval                   | 10              | ms
 citus.shard_count                                  | 32              | 
 citus.shard_replication_factor                     | 1               | 
 citus.show_shards_for_app_name_prefixes            |                 | 
 citus.skip_constraint_validation                   | off             | 
 citus.skip_jsonb_validation_in_copy                | on              | 
 citus.stat_statements_track                        | none            | 
 citus.task_assignment_policy                       | greedy          | 
 citus.task_executor_type                           | adaptive        | 
 citus.use_citus_managed_tables                     | off             | 
 citus.use_secondary_nodes                          | never           | 
 citus.values_materialization_threshold             | 100             | 
 citus.version                                      | 11.1.4          | 
 citus.worker_min_messages                          | notice          | 
 citus.writable_standby_coordinator                 | off             |

Step four: Define/configure the cluster

Log into the coordinator in order to declare and configure the cluster:

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_
    select citus_set_coordinator_host('citus-coord-01', 5432);
    insert into pg_dist_node(nodename)
        values ('citus1')
              ,('citus2')
              ,('citus3')
              ,('citus4');
_eof_

And here’s the cluster’s organization:

db01=# select nodeid,nodename,groupid,isactive from pg_dist_node order by 1;
 nodeid |    nodename    | groupid | isactive 
--------+----------------+---------+----------
      1 | citus-coord-01 |       0 | t
      2 | citus1         |       1 | t
      3 | citus2         |       2 | t
      4 | citus3         |       3 | t
      5 | citus4         |       4 | t

Step five: Create, distribute, and populate a single table

Table myevents is created, and the newly inserted records are evenly distributed across the cluster of nodes.

Log in to the coordinator and execute the following commands. Notice that all DDL and DML statements are executed on the coordinator node:

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_

-- create the table
    create table myevents (
        device_id   bigint,
        event_id    bigserial,
        event_time  timestamptz default now(),
        data        jsonb not null,
        primary key (device_id, event_id)
    );

-- distribute the events among the nodes
    select create_distributed_table('myevents', 'device_id');

-- populate the table
    insert into myevents (device_id, data)
        select s % 100, ('{"measurement":'||random()||'}')::jsonb
        from generate_series(1,1000000) s;
_eof_

Querying the coordinator:

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
select *
    from myevents
    where device_id = 1
    order by event_time desc, event_id desc
    limit 10;
_eof_

device_id | event_id |          event_time           |                data                  
----------+----------+-------------------------------+-------------------------------------
        1 |   999901 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.2868659956537316}
        1 |   999801 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.7931493079697731}
        1 |   999701 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.4875322951757288}
        1 |   999601 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.6491362745752653}
        1 |   999501 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.9629266554851366}
        1 |   999401 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.1185674800281864}
        1 |   999301 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.5133762596297742}
        1 |   999201 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.7307634886202119}
        1 |   999101 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.2997471209159892}
        1 |   999001 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.9692520484104021}

Step six: A review of table distributions across the cluster

This demonstrates clearly that table myevents is hash-sharded across every node member:

  • Notice how well-balanced the sharded tables are in size.
  • Notice the table name numbering order: each shard whose number is incremented by one is found on the next node.

for u in citus1 citus2 citus3 citus4
do
echo "==== NODE: $u ===="
lxc exec $u -- su - postgres -c 'psql db01'<<_eof_
    \dt+
_eof_
done | less -S

==== NODE: citus1 ====
                                        List of relations
Schema |      Name       | Type  |  Owner   | Persistence | Access method |  Size   | Description  
-------+-----------------+-------+----------+-------------+---------------+---------+-------------
public | myevents_102008 | table | postgres | permanent   | heap          | 2832 kB |  
public | myevents_102012 | table | postgres | permanent   | heap          | 2840 kB |  
public | myevents_102016 | table | postgres | permanent   | heap          | 5624 kB |  
public | myevents_102020 | table | postgres | permanent   | heap          | 2840 kB |  
public | myevents_102024 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102028 | table | postgres | permanent   | heap          | 5632 kB |  
public | myevents_102032 | table | postgres | permanent   | heap          | 2840 kB |  
public | myevents_102036 | table | postgres | permanent   | heap          | 6560 kB |  

==== NODE: citus2 ====
                                        List of relations
Schema |      Name       | Type  |  Owner   | Persistence | Access method |  Size   | Description  
-------+-----------------+-------+----------+-------------+---------------+---------+-------------
public | myevents_102009 | table | postgres | permanent   | heap          | 4696 kB |  
public | myevents_102013 | table | postgres | permanent   | heap          | 976 kB  |  
public | myevents_102017 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102021 | table | postgres | permanent   | heap          | 3768 kB |  
public | myevents_102025 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102029 | table | postgres | permanent   | heap          | 2840 kB |  
public | myevents_102033 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102037 | table | postgres | permanent   | heap          | 2840 kB |  

==== NODE: citus3 ====
                                         List of relations
Schema |      Name       | Type  |  Owner   | Persistence | Access method |    Size    | Description  
-------+-----------------+-------+----------+-------------+---------------+------------+-------------
public | myevents_102010 | table | postgres | permanent   | heap          | 1904 kB    |  
public | myevents_102014 | table | postgres | permanent   | heap          | 1904 kB    |  
public | myevents_102018 | table | postgres | permanent   | heap          | 1904 kB    |  
public | myevents_102022 | table | postgres | permanent   | heap          | 4696 kB    |  
public | myevents_102026 | table | postgres | permanent   | heap          | 8192 bytes |  
public | myevents_102030 | table | postgres | permanent   | heap          | 2832 kB    |  
public | myevents_102034 | table | postgres | permanent   | heap          | 976 kB     |  
public | myevents_102038 | table | postgres | permanent   | heap          | 4696 kB    |  

==== NODE: citus4 ====
                                        List of relations
Schema |      Name       | Type  |  Owner   | Persistence | Access method |  Size   | Description  
-------+-----------------+-------+----------+-------------+---------------+---------+-------------
public | myevents_102011 | table | postgres | permanent   | heap          | 5632 kB |  
public | myevents_102015 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102019 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102023 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102027 | table | postgres | permanent   | heap          | 2840 kB |  
public | myevents_102031 | table | postgres | permanent   | heap          | 2832 kB |  
public | myevents_102035 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102039 | table | postgres | permanent   | heap          | 4696 kB |
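
If you prefer not to log in to each worker, the citus_shards view (available in Citus 10 and later) offers a coordinator-side view of the same placements; a minimal sketch:

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
    -- one row per shard placement, with the node that hosts it and its size
    select shard_name, nodename, nodeport, pg_size_pretty(shard_size)
        from citus_shards
        where table_name::text = 'myevents'
        order by nodename, shard_name;
_eof_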

Part B: Redundancy demonstration

Method

  • Step 1: Update shard replication factor from 1X to 2X
  • Step 2: Create table myevents2x with 2X redundancy
  • Step 3: Identify shard myevents2x_102040 across citus1 and citus2
  • Step 4: Identify some records to query from shards known to be on nodes citus1 and citus2
  • Step 5: Test
    • Shut down citus1; run the previously identified query
    • Start citus1 back up, shut down citus2; run the previously identified query
    • Restart citus2; run the previously identified query

ATTENTION: Please note that you may have to adjust the queries below, as shard IDs and values may differ in your setup.

Step one:

Update shard replication factor from 1X to 2X:

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_
    show citus.shard_replication_factor;
    alter system set citus.shard_replication_factor=2;
    select pg_reload_conf();
_eof_

# validate
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_
    show citus.shard_replication_factor;
_eof_

citus.shard_replication_factor  
--------------------------------
2

Step two:

Table myevents2x is created and populated with a redundancy of 2X across the cluster:

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
-- create a new table with 2X redundancy
    create table myevents2x (
        device_id   bigint,
        event_id    bigserial,
        event_time  timestamptz default now(),
        data        jsonb not null,
        primary key (device_id, event_id)
    );

-- distribute the events among the nodes
    select create_distributed_table('myevents2x', 'device_id');


-- confirm table has been added across the cluster
    select * from master_get_active_worker_nodes() order by 1;

-- populate the table
    insert into myevents2x (device_id, data)
        select s % 100, ('{"measurement":'||random()||'}')::jsonb
        from generate_series(1,1000000) s;
_eof_

Here’s the output:

CREATE TABLE
 create_distributed_table 
--------------------------
 
(1 row)

 node_name | node_port 
-----------+-----------
 citus1    |      5432
 citus2    |      5432
 citus3    |      5432
 citus4    |      5432
(4 rows)

INSERT 0 1000000

Step three:

Locate shard myevents2x_102040, which should be on nodes citus1 and citus2:

for u in citus1 citus2 citus3 citus4
do
echo "==== NODE: $u ===="
lxc exec $u -- su - postgres -c 'psql db01'<<_eof_
    select tablename from pg_tables where tablename~'myevents2x' order by 1;
_eof_
done | less -S

Here’s the output:

==== NODE: citus1 ====
     tablename     
-------------------
 myevents2x_102040
 myevents2x_102043
 myevents2x_102044
 myevents2x_102047
 myevents2x_102048
 myevents2x_102051
 myevents2x_102052
 myevents2x_102055
 myevents2x_102056
 myevents2x_102059
 myevents2x_102060
 myevents2x_102063
 myevents2x_102064
 myevents2x_102067
 myevents2x_102068
 myevents2x_102071
(16 rows)

==== NODE: citus2 ====
     tablename     
-------------------
 myevents2x_102040
 myevents2x_102041
 myevents2x_102044
 myevents2x_102045
 myevents2x_102048
 myevents2x_102049
 myevents2x_102052
 myevents2x_102053
 myevents2x_102056
 myevents2x_102057
 myevents2x_102060
 myevents2x_102061
 myevents2x_102064
 myevents2x_102065
 myevents2x_102068
 myevents2x_102069

==== NODE: citus3 ====
     tablename     
-------------------
 myevents2x_102041
 myevents2x_102042
 myevents2x_102045
 myevents2x_102046
 myevents2x_102049
 myevents2x_102050
 myevents2x_102053
 myevents2x_102054
 myevents2x_102057
 myevents2x_102058
 myevents2x_102061
 myevents2x_102062
 myevents2x_102065
 myevents2x_102066
 myevents2x_102069
 myevents2x_102070

==== NODE: citus4 ====
     tablename     
-------------------
 myevents2x_102042
 myevents2x_102043
 myevents2x_102046
 myevents2x_102047
 myevents2x_102050
 myevents2x_102051
 myevents2x_102054
 myevents2x_102055
 myevents2x_102058
 myevents2x_102059
 myevents2x_102062
 myevents2x_102063
 myevents2x_102066
 myevents2x_102067
 myevents2x_102070
 myevents2x_102071

Step four:

Locate and return the first three records of shard myevents2x_102040 on nodes citus1 and citus2:

lxc exec citus1 -- su - postgres -c 'psql db01'<<_eof_ | less -S
    \qecho ==== citus1 ====
    \qecho select * from myevents2x_102040 order by 1,2 limit 3
    select * from myevents2x_102040 order by 1,2 limit 3;
    \c 'host=citus2 dbname=db01 user=postgres'
    \qecho ==== citus2 ====
    \qecho select * from myevents2x_102040 order by 1,2 limit 3
    select * from myevents2x_102040 order by 1,2 limit 3;
    \c 'host=citus-coord-01 dbname=db01 user=postgres'
    \qecho ==== coordinator ====
    \qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
    select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_

Here’s the output:

==== citus1 ====
select * from myevents2x_102040 order by 1,2 limit 3
 device_id | event_id |          event_time           |                data                 
-----------+----------+-------------------------------+-------------------------------------
         8 |        8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
         8 |      108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
         8 |      208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}


You are now connected to database "db01" as user "postgres" on host "citus2" (address "fd42:cb6a:5384:9a60:216:3eff:fe38>
==== citus2 ====
select * from myevents2x_102040 order by 1,2 limit 3
 device_id | event_id |          event_time           |                data                 
-----------+----------+-------------------------------+-------------------------------------
         8 |        8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
         8 |      108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
         8 |      208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}


You are now connected to database "db01" as user "postgres" on host "citus-coord-01" (address "fd42:cb6a:5384:9a60:216:3>
==== coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
 device_id | event_id |          event_time           |                data                 
-----------+----------+-------------------------------+-------------------------------------
         8 |        8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
         8 |      108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
         8 |      208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}

Step five:

The next few steps demonstrate our ability to continuously query and return those records found in shard myevents2x_102040.

Step 5a: Test, shut down node citus1

lxc stop citus1

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
    qecho "==== citus1 is shutdown ===="
    qecho ==== querying coordinator ====
    qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
    select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_

Here’s the output:

"==== citus1 is shutdown ===="
==== querying coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
 device_id | event_id |          event_time           |                data                 
-----------+----------+-------------------------------+-------------------------------------
         8 |        8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
         8 |      108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
         8 |      208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}

Step 5b: Test, restart citus1, and shut down citus2

lxc start citus1
lxc stop citus2

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
    qecho "==== citus2 is shutdown ===="
    qecho ==== querying coordinator ====
    qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
    select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_

Here’s the output; note that it’s exactly the same as the previous test:

"==== citus2 is shutdown ===="
==== querying coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
 device_id | event_id |          event_time           |                data                 
-----------+----------+-------------------------------+-------------------------------------
         8 |        8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
         8 |      108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
         8 |      208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}

Step 5c: Test, restart node citus2

lxc start citus2

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
    qecho "==== cluster restored ===="
    qecho ==== querying coordinator ====
    qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
    select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_

Here’s the output, of course!

"==== cluster restored ===="
==== querying coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
device_id | event_id |          event_time           |                data                  
----------+----------+-------------------------------+-------------------------------------
        8 |        8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
        8 |      108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
        8 |      208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}

Conclusion

Data redundancy is the hallmark of high availability. But with Citus, we’ve raised the bar. Can you think of a better system that can stay up without losing time initiating failovers when a node fails?

Have fun!

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together.

Download Percona Distribution for PostgreSQL Today!

Aug
29
2023
--

Bootstrap PostgreSQL on Kubernetes


PostgreSQL has become increasingly popular in modern cloud-native environments. However, managing PostgreSQL clusters on Kubernetes can be a complex task. This is where the Percona Operator comes into play, offering a powerful solution to deploy and manage PostgreSQL clusters effortlessly. Developers often seek an easy way to bootstrap the clusters with data so that applications can start running immediately. It is especially important for CI/CD pipelines, where automation plays a crucial role.

In this blog post, we will explore the immense value of provisioning PostgreSQL clusters with Percona Operator by using bootstrap capabilities:

  1. Start the cluster with init SQL script
  2. Bootstrap the cluster from the existing cluster or backup


Getting started

You need to have the Percona Operator for PostgreSQL deployed. Please follow our installation instructions and use your favorite way.

You can find all examples from this blog post in this GitHub repository. A single command to deploy the operator would be:

kubectl apply -f https://raw.githubusercontent.com/spron-in/blog-data/master/bootstrap-postgresql-k8s/00-bundle.yaml --server-side

Init SQL script

Init SQL allows the creation of the database cluster with some initial data in it. Everything is created with the postgres admin user. The way it works is the following:

  1. Create the ConfigMap resource with the SQL script
  2. Reference it in the
    PerconaPGCluster Custom Resource

The operator will apply the SQL during cluster creation. It is quite common to combine this feature with user creation.

Create the ConfigMap from 01-demo-init.yaml manifest:
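
The manifest boils down to a ConfigMap whose init.sql key holds the SQL to run. Here is a minimal sketch of what it might look like, assuming the demo-cluster-init name and init.sql key referenced below; the exact column definitions are illustrative:

apiVersion: v1
kind: ConfigMap
metadata:
  name: demo-cluster-init
data:
  init.sql: |
    \c demo-db
    CREATE SCHEMA media AUTHORIZATION myuser;
    -- columns other than ID are illustrative
    CREATE TABLE media.BLOG (
      ID INT PRIMARY KEY NOT NULL,
      TITLE TEXT
    );
    CREATE TABLE media.AUTHORS (
      ID INT PRIMARY KEY NOT NULL,
      NAME TEXT
    );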

The init.sql does the following:

  1. Connects to
    demo-db database
  2. Creates schema
    media for user
    myuser
  3. Creates 2 tables –
    BLOG and
    AUTHORS in the schema

I’m combining bootstrapping with the user and database creation functionality that the Operator also provides. In my 02-deploy-cr.yaml manifest, I created the user
myuser and database
demo-db:

  users:
    - name: myuser
      databases:
        - demo-db

Reference the
ConfigMap in the custom resource:

  databaseInitSQL:
    key: init.sql
    name: demo-cluster-init

Applying the manifest would do the trick:

kubectl apply -f https://raw.githubusercontent.com/spron-in/blog-data/master/bootstrap-postgresql-k8s/02-deploy-cr.yaml

Troubleshooting

To verify whether the init SQL was executed successfully, or to check if something went wrong, see the Operator’s log and search for init SQL. For example, the following tells me that I had a syntax error in my SQL script for
demo-cluster:

$ kubectl logs --tail=2000 percona-postgresql-operator-6f96ffd8d4-ddzth  | grep 'init SQL'
time="2023-08-14T09:37:37Z" level=debug msg="applied init SQL" PostgresCluster=default/demo-cluster controller=postgrescluster controllerKind=PostgresCluster key=init.sql name=demo-cluster-init namespace=default reconcileID=1d0cfdcc-0464-459a-be6e-b25eb46ed2c9 stderr="psql:<stdin>:11: ERROR:  syntax error at or near "KEYS"nLINE 2:    ID INT PRIMARY KEYS     NOT NULL,n                          ^n" stdout="You are now connected to database "demo-db" as user "postgres".nCREATE SCHEMAnCREATE TABLEn" version=

Bootstrap from cluster or backup

ConfigMaps cannot store more than 1 MB of data, which means the init SQL approach is suitable only for small data bootstraps. If you have a big dataset that you want to roll out along with cluster creation, then there are two ways to do that:

  1. From an existing cluster in Kubernetes
  2. From the backup

From the cluster

To use this, you must have a running cluster with a pgBackRest repository configured for it. You can then create the second cluster.

03-deploy-cr2.yaml manifest will provision
demo-cluster-2. I have removed the
spec.databaseInitSQL section while keeping
spec.users. To instruct the Operator to restore from
demo-cluster and its
repo1, I added the
dataSource section:

  dataSource:
    postgresCluster:
      clusterName: demo-cluster
      repoName: repo1

The new cluster will be created once the manifest is applied:

$ kubectl apply -f https://raw.githubusercontent.com/spron-in/blog-data/master/bootstrap-postgresql-k8s/03-deploy-cr2.yaml
$ kubectl get pg
NAME             ENDPOINT                               STATUS   POSTGRES   PGBOUNCER   AGE
demo-cluster     demo-cluster-pgbouncer.default.svc     ready    1          1           14m
demo-cluster-2   demo-cluster-2-pgbouncer.default.svc   ready    1          1           13m

demo-cluster-2 will have the same data as
demo-cluster. Keep in mind that even if the data is the same, the user passwords will be different by default. You can change this; please see the users documentation.

From the backup

Another common case is bootstrapping from an existing backup in case the database cluster is not running anymore, or it is isolated in another Kubernetes cluster. In this case, the backups should be stored on some object storage. Please use our documentation to configure backups.

For example,
my demo-cluster configuration in 04-deploy-cr.yaml looks like this if I want to send backups to Google Cloud Storage (GCS):

pgbackrest:
      global:
        - secret:
            name: demo-cluster-gcs
...
      repos:
      - name: repo1
        schedules:
          full: "0 0 * * 6"
        gcs:
          bucket: "my-demo-bucket"

Once you have backups stored in the object storage, you can delete the cluster and reference it in the manifest anytime for bootstrapping. For example, in 05-deploy-cr3.yaml,
dataSource section looks like this:

  dataSource:
    pgbackrest:
      stanza: db
      configuration:
      - secret:
          name: demo-cluster-gcs
      global:
        repo1-path: /pgbackrest/demo/repo1
      repo:
        name: repo1
        gcs:
          bucket: "my-demo-bucket"

The fields have the same structure and reference the same Secret resource where GCS configuration is stored.

Troubleshooting

When you bootstrap the cluster from a pgBackRest backup, the Operator creates a
pgbackrest-restore pod. If it crashes and ends up in the Error state, something went wrong.

$ kubectl get pods
NAME                                           READY   STATUS     RESTARTS   AGE
demo-cluster-3-pgbackrest-restore-74dg5        0/1     Error      0          27s
$ kubectl logs demo-cluster-3-pgbackrest-restore-74dg5
Defaulted container "pgbackrest-restore" out of: pgbackrest-restore, nss-wrapper-init (init)
+ pgbackrest restore --stanza=db --pg1-path=/pgdata/pg15 --repo=1 --delta --link-map=pg_wal=/pgdata/pg15_wal
WARN: unable to open log file '/pgdata/pgbackrest/log/db-restore.log': No such file or directory
      NOTE: process will continue without log file.
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg15' to confirm that this is a valid $PGDATA directory.  --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
WARN: repo1: [FileMissingError] unable to load info file '/pgbackrest/demo/repo1/backup/db/backup.info' or '/pgbackrest/demo/repo1/backup/db/backup.info.copy':
      FileMissingError: unable to open missing file '/pgbackrest/demo/repo1/backup/db/backup.info' for read
      FileMissingError: unable to open missing file '/pgbackrest/demo/repo1/backup/db/backup.info.copy' for read
      HINT: backup.info cannot be opened and is required to perform a backup.
      HINT: has a stanza-create been performed?
ERROR: [075]: no backup set found to restore

Conclusion

One of the key advantages of running PostgreSQL with Percona Operator is the speed of innovation it brings to the table. With the ability to automate database bootstrapping and management tasks, developers and administrators can focus on more important aspects of their applications. This leads to increased productivity and faster time-to-market for new features and enhancements.

Furthermore, the integration of bootstrapping PostgreSQL clusters on Kubernetes with CI/CD pipelines is vital. With Percona Operator, organizations can seamlessly incorporate their database deployments into their CI/CD processes. This not only ensures a rapid and efficient release cycle but also enables developers to automate database provisioning, updates, and rollbacks, thereby reducing the risk of errors and downtime.

Try out the Operator by following the quickstart guide here.

You can get early access to new product features, invite-only ”ask me anything” sessions with Percona Kubernetes experts, and monthly swag raffles. Interested? Fill in the form at percona.com/k8s.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.

 

Download Percona Distribution for PostgreSQL Today!

Aug
28
2023
--

Talking Drupal #413 – Drupal Coffee Exchange

On today’s show we are talking about The Drupal Coffee Exchange, How it got started, and How it keeps going with guest AmyJune Hineline.

For show notes visit: www.talkingDrupal.com/413

Topics

  • What’s new with AmyJune
  • What is the Drupal Coffee Exchange
  • How did it start
  • Who started it
  • Where was the first Drupal Coffee Exchange
  • How do you participate
    • Live
    • Mail
  • How is it live
  • Does it have to be ground or whole bean
  • Listener question: Stephen – How can it be added to a camp
  • What was your favorite coffee received
  • What was your favorite coffee given
  • John’s first experience
  • Brick of Shame
  • When is the next Coffee Exchange
  • Where is the schedule
  • Favorite type of coffee

Resources

Guests

AmyJune Hineline – volkswagenchick

Hosts

Nic Laflin – nLighteneddevelopment.com nicxvan John Picozzi – epam.com johnpicozzi Andy Blum – andy-blum.com andy_blum

MOTW Correspondent

Martin Anderson-Clutz – @mandclu Coffee

  • Brief description:
    • Have you ever wanted to quickly search your Drupal admin UI for the specific page you need to get to? There’s a module for that!
  • Brief history
    • How old: created in Nov 2011
    • Covered as MOTW back in episode #45
    • Versions available: 8.x-1.3 (D9 & 10), 7.x-2.3
  • Maintainership
    • Number of open issues: 53 open, 13 of which are bugs against the 8.x branch
    • Does have test coverage
  • Usage stats:
    • Almost 31,000 sites
  • Maintainer(s):
    • Michaelmol, who appears to have created the module in his first year on drupal.org, now almost 13 years ago
  • Module features and usage
    • With a simple keystroke, (option-d or alt-d) you can start your search, and then see suggestions as you type
    • Can use the arrow keys to move between suggestions
    • Listeners who have used Mac apps like Alfred or Spotlight will be familiar with the experience
    • A very fast way to move between different places in your Drupal admin
    • Also includes :add as a quick way to add content of a specific type
    • Provides a hook you can use to define your own commands
    • There is a Coffee Extras module that adds more commands, but it doesn’t work with Drupal 10 and is marked as no longer developed
    • Also worth noting that the Gin admin theme comes with built-in optimization and formatting for Coffee, so you can enjoy Gin and Coffee together
Aug
25
2023
--

Failover and Recovery Scenarios in InnoDB Cluster and ClusterSet

Failover and Recovery Scenarios in InnoDB Cluster

This blog post will focus on failover and recovery scenarios inside the InnoDB Cluster and ClusterSet environment. To learn more about deploying these topologies, you can refer to the manuals – InnoDB Cluster and InnoDB ClusterSet setup.

In the snippet below, we have two clusters (cluster1 and cluster2) connected via an asynchronous replication channel; together, they are known as a ClusterSet topology. We are going to use this topology in all of our scenarios.

MySQL  127.0.0.1:3308 ssl  JS > myclusterset.status({extended: 1})
{
    "clusters": {
        "Cluster1": {
            "clusterRole": "PRIMARY", 
            "globalStatus": "OK", 
            "primary": "127.0.0.1:3308", 
            "status": "OK", 
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
            "topology": {
                "127.0.0.1:3308": {
                    "address": "127.0.0.1:3308", 
                    "memberRole": "PRIMARY", 
                    "mode": "R/W", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }, 
                "127.0.0.1:3309": {
                    "address": "127.0.0.1:3309", 
                    "memberRole": "SECONDARY", 
                    "mode": "R/O", 
                    "replicationLagFromImmediateSource": "", 
                    "replicationLagFromOriginalSource": "", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }, 
                "127.0.0.1:3310": {
                    "address": "127.0.0.1:3310", 
                    "memberRole": "SECONDARY", 
                    "mode": "R/O", 
                    "replicationLagFromImmediateSource": "", 
                    "replicationLagFromOriginalSource": "", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }
            }, 
            "transactionSet": "39ec1328-3909-11ee-a0f8-5254004d77d3:1-4,b3a79c40-3909-11ee-805d-5254004d77d3:1-124,b3a7a659-3909-11ee-805d-5254004d77d3:1-5"
        }, 
        "cluster2": {
            "clusterRole": "REPLICA", 
            "clusterSetReplication": {
                "applierStatus": "APPLIED_ALL", 
                "applierThreadState": "Waiting for an event from Coordinator", 
                "applierWorkerThreads": 4, 
                "receiver": "127.0.0.1:3311", 
                "receiverStatus": "ON", 
                "receiverThreadState": "Waiting for source to send event", 
                "source": "127.0.0.1:3308"
            }, 
            "clusterSetReplicationStatus": "OK", 
            "globalStatus": "OK", 
            "status": "OK", 
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
            "topology": {
                "127.0.0.1:3311": {
                    "address": "127.0.0.1:3311", 
                    "memberRole": "PRIMARY", 
                    "mode": "R/O", 
                    "replicationLagFromImmediateSource": "", 
                    "replicationLagFromOriginalSource": "", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }, 
                "127.0.0.1:3312": {
                    "address": "127.0.0.1:3312", 
                    "memberRole": "SECONDARY", 
                    "mode": "R/O", 
                    "replicationLagFromImmediateSource": "", 
                    "replicationLagFromOriginalSource": "", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }, 
                "127.0.0.1:3313": {
                    "address": "127.0.0.1:3313", 
                    "memberRole": "SECONDARY", 
                    "mode": "R/O", 
                    "replicationLagFromImmediateSource": "", 
                    "replicationLagFromOriginalSource": "", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }
            }, 
            "transactionSet": "39ec1328-3909-11ee-a0f8-5254004d77d3:1-4,3db995e4-390a-11ee-b678-5254004d77d3:1-5,b3a79c40-3909-11ee-805d-5254004d77d3:1-124,b3a7a659-3909-11ee-805d-5254004d77d3:1-5", 
            "transactionSetConsistencyStatus": "OK", 
            "transactionSetErrantGtidSet": "", 
            "transactionSetMissingGtidSet": ""
        }
    }, 
    "domainName": "firstclusterset", 
    "globalPrimaryInstance": "127.0.0.1:3308", 
    "metadataServer": "127.0.0.1:3308", 
    "primaryCluster": "Cluster1", 
    "status": "HEALTHY", 
    "statusText": "All Clusters available."
}

How failover happens inside a single InnoDB Cluster

  • Connect to any node of the first cluster (“cluster1”) via MySQL Shell and fetch the details.
    MySQL  127.0.0.1:3308 ssl  JS > \c root@localhost:3308
    MySQL  localhost:3308 ssl  JS > cluster1=dba.getCluster()

    MySQL  localhost:3308 ssl  JS > cluster1.status()
    {
        "clusterName": "Cluster1", 
        "clusterRole": "PRIMARY", 
        "defaultReplicaSet": {
            "name": "default", 
            "primary": "127.0.0.1:3308", 
            "ssl": "REQUIRED", 
            "status": "OK", 
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
            "topology": {
                "127.0.0.1:3308": {
                    "address": "127.0.0.1:3308", 
                    "memberRole": "PRIMARY", 
                    "mode": "R/W", 
                    "readReplicas": {}, 
                    "role": "HA", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }, 
                "127.0.0.1:3309": {
                    "address": "127.0.0.1:3309", 
                    "memberRole": "SECONDARY", 
                    "mode": "R/O", 
                    "readReplicas": {}, 
                    "replicationLagFromImmediateSource": "", 
                    "replicationLagFromOriginalSource": "", 
                    "role": "HA", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }, 
                "127.0.0.1:3310": {
                    "address": "127.0.0.1:3310", 
                    "memberRole": "SECONDARY", 
                    "mode": "R/O", 
                    "readReplicas": {}, 
                    "replicationLagFromImmediateSource": "", 
                    "replicationLagFromOriginalSource": "", 
                    "role": "HA", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }
            }, 
            "topologyMode": "Single-Primary"
        }, 
        "domainName": "firstclusterset", 
        "groupInformationSourceMember": "127.0.0.1:3308"
    }

     

  • Now perform a primary switchover from the instance (“127.0.0.1:3308”) to (“127.0.0.1:3309”).
MySQL  localhost:3308 ssl  JS > cluster1.setPrimaryInstance("root@127.0.0.1:3309")

Output:

Setting instance '127.0.0.1:3309' as the primary instance of cluster 'Cluster1'...

Instance '127.0.0.1:3308' was switched from PRIMARY to SECONDARY.
Instance '127.0.0.1:3309' was switched from SECONDARY to PRIMARY.
Instance '127.0.0.1:3310' remains SECONDARY.

The instance '127.0.0.1:3309' was successfully elected as primary.

  • Finally, the instance (“127.0.0.1:3309”) will show the status as primary.
MySQL localhost:3308 ssl JS > cluster1.status()
...
"127.0.0.1:3309": {
"address": "127.0.0.1:3309",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
...

How to rejoin the lost instance again

If, for some reason, an instance leaves the cluster or loses connection and can’t automatically rejoin the cluster, we might need to rejoin an instance to a cluster by issuing the “Cluster.rejoinInstance(instance)” command. Here, we will try to create a small example that can demonstrate the usage of this command.

  • Create a blocker by stopping Group Replication on the instance (“127.0.0.1:3310”).
MySQL localhost:3310 ssl SQL > stop group_replication;

  • Looking over the information below, we can see the instance (“127.0.0.1:3310”) is showing “MISSING” status.
MySQL  localhost:3310 ssl  JS > cluster1.status()
...
"127.0.0.1:3310": {
                "address": "127.0.0.1:3310", 
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ], 
                "memberRole": "SECONDARY", 
                "memberState": "OFFLINE", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)", 
                "version": "8.0.31"
            }
...

  • Now, add the instance (“127.0.0.1:3310”) again with the rejoinInstance() command.
MySQL  localhost:3310 ssl  JS > cluster1.rejoinInstance('127.0.0.1:3310')

...
Validating instance configuration at 127.0.0.1:3310...
NOTE: Instance detected as a sandbox.
Please note that sandbox instances are only suitable for deploying test clusters for use within the same host.

This instance reports its own address as 127.0.0.1:3310

Instance configuration is suitable.
Rejoining instance '127.0.0.1:3310' to cluster 'Cluster1'...

Re-creating recovery account...
NOTE: User 'mysql_innodb_cluster_2558498413'@'%' already existed at instance '127.0.0.1:3309'. It will be deleted and created again with a new password.

* Waiting for the Cluster to synchronize with the PRIMARY Cluster...
** Transactions replicated  ############################################################  100% 

The instance '127.0.0.1:3310' was successfully rejoined to the cluster.
...

After the above operation, the instance is “ONLINE” again.

"127.0.0.1:3310": {
"address": "127.0.0.1:3310",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLagFromImmediateSource": "",
"replicationLagFromOriginalSource": "",
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
}

How to recover a cluster from a quorum or vote loss

There are situations when running instances fail and a cluster loses its quorum (the ability to vote) due to insufficient members. This happens when a majority of the instances that make up the cluster fail and can no longer vote on Group Replication operations. Once a cluster loses quorum, it can no longer process write transactions or change the cluster’s topology.

However, if any instance containing the latest InnoDB Cluster metadata is still online, it is possible to restore the cluster’s quorum.

Let’s see how we can use the feature “forceQuorumUsingPartitionOf” to recover the cluster again with minimal member votes.

  • First, we will try to fail the majority of nodes with a simple “KILL” operation.
root     30281     1  1 07:10 ?        00:02:03 /root/mysql-sandboxes/3308/bin/mysqld --defaults-file=/root/mysql-sandboxes/3308/my.cnf --user=root
root     30788     1  1 07:14 ?        00:01:53 /root/mysql-sandboxes/3309/bin/mysqld --defaults-file=/root/mysql-sandboxes/3309/my.cnf --user=root
root     30912     1  2 07:14 ?        00:02:48 /root/mysql-sandboxes/3310/bin/mysqld --defaults-file=/root/mysql-sandboxes/3310/my.cnf --user=root

[root@localhost ~]# kill -9 30281 30912

  • Now check the cluster1 status again. The cluster lost the quorum, and no write activity was allowed.
MySQL  localhost:3309 ssl  JS > cluster1.status()
{
    "clusterName": "Cluster1", 
    "clusterRole": "PRIMARY", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "127.0.0.1:3309", 
        "ssl": "REQUIRED", 
        "status": "NO_QUORUM", 
        "statusText": "Cluster has no quorum as visible from '127.0.0.1:3309' and cannot process write transactions. 2 members are not active.", 
        "topology": {
            "127.0.0.1:3308": {
                "address": "127.0.0.1:3308", 
                "memberRole": "SECONDARY", 
                "memberState": "(MISSING)", 
                "mode": "n/a", 
                "readReplicas": {}, 
                "role": "HA", 
                "shellConnectError": "MySQL Error 2003: Could not open connection to '127.0.0.1:3308': Can't connect to MySQL server on '127.0.0.1:3308' (111)", 
                "status": "UNREACHABLE", 
                "version": "8.0.31"
            }, 
            "127.0.0.1:3309": {
                "address": "127.0.0.1:3309", 
                "memberRole": "PRIMARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.31"
            }, 
            "127.0.0.1:3310": {
                "address": "127.0.0.1:3310", 
                "memberRole": "SECONDARY", 
                "memberState": "(MISSING)", 
                "mode": "n/a", 
                "readReplicas": {}, 
                "role": "HA", 
                "shellConnectError": "MySQL Error 2003: Could not open connection to '127.0.0.1:3310': Can't connect to MySQL server on '127.0.0.1:3310' (111)", 
                "status": "UNREACHABLE", 
                "version": "8.0.31"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "domainName": "firstclusterset", 
    "groupInformationSourceMember": "127.0.0.1:3309"
}

  • To fix this situation, we can connect to the available instance (“127.0.0.1:3309”) and reset the quorum again with the below command.
MySQL  localhost:3309 ssl  JS > cluster1.forceQuorumUsingPartitionOf("root@localhost:3309")

...

Restoring cluster 'Cluster1' from loss of quorum, by using the partition composed of [127.0.0.1:3309]

Restoring the InnoDB cluster ...

The InnoDB cluster was successfully restored using the partition from the instance 'root@localhost:3309'.

WARNING: To avoid a split-brain scenario, ensure that all other members of the cluster are removed or joined back to the group that was restored.

...

  • The instances on ports 3308 and 3310 are still down. However, we recovered the cluster quorum with a single primary member (“127.0.0.1:3309”), so we can now perform write activity on the cluster without any issues.
MySQL  localhost:3309 ssl  JS > cluster1.status()
{
    "clusterName": "Cluster1", 
    "clusterRole": "PRIMARY", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "127.0.0.1:3309", 
        "ssl": "REQUIRED", 
        "status": "OK_NO_TOLERANCE_PARTIAL", 
        "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active.", 
        "topology": {
            "127.0.0.1:3308": {
                "address": "127.0.0.1:3308", 
                "memberRole": "SECONDARY", 
                "mode": "n/a", 
                "readReplicas": {}, 
                "role": "HA", 
                "shellConnectError": "MySQL Error 2003: Could not open connection to '127.0.0.1:3308': Can't connect to MySQL server on '127.0.0.1:3308' (111)", 
                "status": "(MISSING)"
            }, 
            "127.0.0.1:3309": {
                "address": "127.0.0.1:3309", 
                "memberRole": "PRIMARY", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.31"
            }, 
            "127.0.0.1:3310": {
                "address": "127.0.0.1:3310", 
                "memberRole": "SECONDARY", 
                "mode": "n/a", 
                "readReplicas": {}, 
                "role": "HA", 
                "shellConnectError": "MySQL Error 2003: Could not open connection to '127.0.0.1:3310': Can't connect to MySQL server on '127.0.0.1:3310' (111)", 
                "status": "(MISSING)"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "domainName": "firstclusterset", 
    "groupInformationSourceMember": "127.0.0.1:3309"
}


MySQL  localhost:3309 ssl  SQL > create database sbtest2;
Query OK, 1 row affected (0.0055 sec)

Later on, we can fix the failed instances so they join cluster1 again. Sometimes, we might need to perform the “rejoinInstance()” operation to add the nodes back, as sketched below.
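
Since this walkthrough uses sandbox instances, bringing the failed members back could look like the following sketch (assuming the killed sandboxes on ports 3308 and 3310; on a production deployment you would instead restart mysqld through its service manager, and rejoinInstance() is only needed if an instance does not rejoin on its own):

MySQL  localhost:3309 ssl  JS > dba.startSandboxInstance(3308)
MySQL  localhost:3309 ssl  JS > dba.startSandboxInstance(3310)
MySQL  localhost:3309 ssl  JS > cluster1.rejoinInstance('127.0.0.1:3308')
MySQL  localhost:3309 ssl  JS > cluster1.rejoinInstance('127.0.0.1:3310')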

How to recover a complete cluster from a major outage

Sometimes, even if all the nodes are up, internal issues such as Group Replication getting stuck or a networking problem can cause a complete outage, leaving you unable to perform any writes or other activity on the cluster.

In such circumstances, you can use any one node and its metadata to recover the cluster. You need to connect to the most up-to-date instance; otherwise, you may lose data or end up with inconsistent cluster nodes.

Let’s see how we can introduce such a situation manually and then try to fix that.

  • First, stop the group replication on all three instances.
MySQL localhost:3308 ssl SQL > stop group_replication;
MySQL localhost:3310 ssl SQL > stop group_replication;
MySQL localhost:3309 ssl SQL > stop group_replication;

127.0.0.1:3308": {
                "address": "127.0.0.1:3308", 
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ], 
                "memberRole": "SECONDARY", 
                "memberState": "OFFLINE", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)", 
                "version": "8.0.31"
            }, 
            "127.0.0.1:3309": {
                "address": "127.0.0.1:3309", 
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ], 
                "memberRole": "SECONDARY", 
                "memberState": "OFFLINE", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)", 
                "version": "8.0.31"
            }, 
            "127.0.0.1:3310": {
                "address": "127.0.0.1:3310", 
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ], 
                "memberRole": "SECONDARY", 
                "memberState": "OFFLINE", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)", 
                "version": "8.0.31"
            }

Now we are completely stuck; even if we try to perform writes on the last primary member, we see the error below, since “--super-read-only” is enabled to protect the transactions on the database.

MySQL  localhost:3309 ssl  SQL > create database sbtest3;
ERROR: 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement

  • Now, we will fix the issue with the below set of steps.

1. Connect to the instance with the most executed GTIDs. We can confirm this by connecting to each instance and comparing the GTIDs with the commands below.

mysql> SELECT @@GLOBAL.GTID_EXECUTED;
mysql> SELECT received_transaction_set FROM performance_schema.replication_connection_status WHERE channel_name="group_replication_applier";

2. In our case, it is the same in all three instances, so we can choose any of the nodes. Let’s choose the last primary (“127.0.0.1:3309”).

MySQL  localhost:3309 ssl  SQL > SELECT @@GLOBAL.GTID_EXECUTED;
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| @@GLOBAL.GTID_EXECUTED                                                                                                                          |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| 39ec1328-3909-11ee-a0f8-5254004d77d3:1-4,
b3a79c40-3909-11ee-805d-5254004d77d3:1-152:1000071-1000079,
b3a7a659-3909-11ee-805d-5254004d77d3:1-12 |
+-------------------------------------------------------------------------------------------------------------------------------------------------+

MySQL  localhost:3309 ssl  SQL > SELECT received_transaction_set FROM performance_schema.replication_connection_status WHERE channel_name="group_replication_applier";
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| received_transaction_set                                                                                                                        |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| 39ec1328-3909-11ee-a0f8-5254004d77d3:1-4,
b3a79c40-3909-11ee-805d-5254004d77d3:1-152:1000071-1000079,
b3a7a659-3909-11ee-805d-5254004d77d3:1-12 |
+-------------------------------------------------------------------------------------------------------------------------------------------------+

3. Finally, connect to the instance (“127.0.0.1:3309”) and execute the below command to recover from the outage/failure.

MySQL  localhost:3309 ssl  JS > \c root@localhost:3309
MySQL  localhost:3309 ssl  JS > mycluster1 = dba.rebootClusterFromCompleteOutage("cluster1",{force: true})

...

Restoring the Cluster 'Cluster1' from complete outage...

Cluster instances: '127.0.0.1:3308' (OFFLINE), '127.0.0.1:3309' (OFFLINE), '127.0.0.1:3310' (OFFLINE)
Validating instance configuration at localhost:3309...

This instance reports its own address as 127.0.0.1:3309

Instance configuration is suitable.
* Waiting for seed instance to become ONLINE...
127.0.0.1:3309 was restored.
Validating instance configuration at 127.0.0.1:3308...

This instance reports its own address as 127.0.0.1:3308

Instance configuration is suitable.
Rejoining instance '127.0.0.1:3308' to cluster 'Cluster1'...

Re-creating recovery account...
NOTE: User 'mysql_innodb_cluster_3676500949'@'%' already existed at instance '127.0.0.1:3309'. It will be deleted and created again with a new password.

* Waiting for the Cluster to synchronize with the PRIMARY Cluster...
** Transactions replicated  ############################################################  100% 

The instance '127.0.0.1:3308' was successfully rejoined to the cluster.

Validating instance configuration at 127.0.0.1:3310...

This instance reports its own address as 127.0.0.1:3310

Instance configuration is suitable.
Rejoining instance '127.0.0.1:3310' to cluster 'Cluster1'...

Re-creating recovery account...
NOTE: User 'mysql_innodb_cluster_2558498413'@'%' already existed at instance '127.0.0.1:3309'. It will be deleted and created again with a new password.

* Waiting for the Cluster to synchronize with the PRIMARY Cluster...
** Transactions replicated  ############################################################  100% 

The instance '127.0.0.1:3310' was successfully rejoined to the cluster.

The Cluster was successfully rebooted.

<Cluster:Cluster1>

...

Note: Be careful with the force option as it can bypass other important checks like GTID_SET or instance reachability.

Now, if we check the status again, we see all three nodes are up. The recovery command started Group Replication again and rejoined the instances.

MySQL  localhost:3309 ssl  JS > cluster1=dba.getCluster()
MySQL  localhost:3309 ssl  JS > cluster1.status()  
{
    "clusterName": "Cluster1", 
    "clusterRole": "PRIMARY", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "127.0.0.1:3309", 
        "ssl": "REQUIRED", 
        "status": "OK", 
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
        "topology": {
            "127.0.0.1:3308": {
                "address": "127.0.0.1:3308", 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLagFromImmediateSource": "", 
                "replicationLagFromOriginalSource": "", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.31"
            }, 
            "127.0.0.1:3309": {
                "address": "127.0.0.1:3309", 
                "memberRole": "PRIMARY", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.31"
            }, 
            "127.0.0.1:3310": {
                "address": "127.0.0.1:3310", 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLagFromImmediateSource": "", 
                "replicationLagFromOriginalSource": "", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.31"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "domainName": "firstclusterset", 
    "groupInformationSourceMember": "127.0.0.1:3309"
}

How to perform switchover/failover from one cluster to another in a ClusterSet

Sometimes, we need to perform maintenance and update activities on the database instances. To avoid long downtime or a major impact on production, we perform a controlled switchover so that the affected node can be taken offline without impacting the running workload.

Inside ClusterSet, we can perform such activity by changing the clusterRole from “REPLICA” to “PRIMARY” among the running clusters.

Let’s see the exact scenario below.

  • Fetch the ClusterSet information.
MySQL  127.0.0.1:3308 ssl  JS > myclusterset=dba.getClusterSet()
<ClusterSet:firstclusterset>

MySQL  127.0.0.1:3308 ssl  JS > myclusterset.status({extended: 1})

...

    "clusters": {
        "Cluster1": {
            "clusterRole": "PRIMARY", 
            "globalStatus": "OK", 
            "primary": "127.0.0.1:3309", 
            "status": "OK", 
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
            "topology": {

      "cluster2": {
            "clusterRole": "REPLICA", 
            "clusterSetReplication": {
                "applierStatus": "APPLIED_ALL", 
                "applierThreadState": "Waiting for an event from Coordinator", 
                "applierWorkerThreads": 4, 
                "receiver": "127.0.0.1:3311", 
                "receiverStatus": "ON", 
                "receiverThreadState": "Waiting for source to send event", 
                "source": "127.0.0.1:3309"
...

  • Performing a switchover from “cluster1” to “cluster2”.
MySQL  127.0.0.1:3308 ssl  JS > myclusterset.setPrimaryCluster('cluster2')

...

Switching the primary cluster of the clusterset to 'cluster2'
* Verifying clusterset status
** Checking cluster Cluster1
  Cluster 'Cluster1' is available
** Checking cluster cluster2
  Cluster 'cluster2' is available

* Reconciling 5 internally generated GTIDs

* Refreshing replication account of demoted cluster
* Synchronizing transaction backlog at 127.0.0.1:3311
** Transactions replicated  ############################################################  100% 


* Updating metadata

* Updating topology
** Changing replication source of 127.0.0.1:3308 to 127.0.0.1:3311
** Changing replication source of 127.0.0.1:3310 to 127.0.0.1:3311
** Changing replication source of 127.0.0.1:3309 to 127.0.0.1:3311
* Acquiring locks in replicaset instances
** Pre-synchronizing SECONDARIES
** Acquiring global lock at PRIMARY
** Acquiring global lock at SECONDARIES

* Synchronizing remaining transactions at promoted primary
** Transactions replicated  ############################################################  100% 


* Updating replica clusters
Cluster 'cluster2' was promoted to PRIMARY of the clusterset. The PRIMARY instance is '127.0.0.1:3311'
...

  • Now, if we see the status again, we can observe the change in the ClusterRole.
"clusters": {
        "Cluster1": {
            "clusterRole": "REPLICA", 
            "clusterSetReplication": {
                "applierStatus": "APPLIED_ALL", 
                "applierThreadState": "Waiting for an event from Coordinator", 
                "applierWorkerThreads": 4, 
                "receiver": "127.0.0.1:3309", 
                "receiverStatus": "ON", 
                "receiverThreadState": "Waiting for source to send event", 
                "source": "127.0.0.1:3311"
            }, 

"cluster2": {
            "clusterRole": "PRIMARY", 
            "globalStatus": "OK", 
            "primary": "127.0.0.1:3311", 
            "status": "OK", 
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
            "topology": {
                "127.0.0.1:3311": {
                    "address": "127.0.0.1:3311", 
                    "memberRole": "PRIMARY", 
                    "mode": "R/W", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                },

In some unlucky situations, when one of the clusters (“cluster1”) is completely down and we have no other choice, we might need to do an emergency failover. The command below could be handy in those cases.

mysql> myclusterset.forcePrimaryCluster("cluster2")

If you are using the MySQL Router interface to route traffic in the ClusterSet, you also need to change the router option with (“setRoutingOption”) so that traffic is diverted to the new primary cluster (“cluster2”); otherwise, the application will fail and not be able to communicate.

In the next scenario, we will see the usage of “setRoutingOption” in more detail.

How to change traffic routes with MySQL Router

In the case of Cluster:

Within a single cluster such as “cluster1,” writes are routed to the primary, while reads are balanced across the secondaries. The router automatically detects failovers and follows the new primary.
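
For example, using the router ports reported later by listRouters() (6446 for read-write, 6447 for read-only), the behavior is easy to observe with a plain mysql client; the host and credentials below are assumptions for this sandbox setup:

# connections to 6446 always land on the current primary
mysql -h 127.0.0.1 -P 6446 -u root -p -e "SELECT @@port;"

# connections to 6447 are balanced across the secondary instances
mysql -h 127.0.0.1 -P 6447 -u root -p -e "SELECT @@port;"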

In the case of ClusterSet:

We can switch the Primary stack for routing traffic from “cluster1” to “cluster2” with the help of the below steps.

  • Verifying the current configurations.
MySQL localhost:3309 ssl JS > myclusterset=dba.getClusterSet() 
MySQL  localhost:3309 ssl  JS > myclusterset.listRouters()
 
{
    "domainName": "firstclusterset", 
    "routers": {
        "localhost.localdomain::Router1": {
            "hostname": "localhost.localdomain", 
            "lastCheckIn": "2023-08-12 10:13:22", 
            "roPort": "6447", 
            "roXPort": "6449", 
            "rwPort": "6446", 
            "rwXPort": "6448", 
            "targetCluster": null, 
            "version": "8.0.31"
        }
    }
}

MySQL  localhost:3309 ssl  JS > myclusterset.routingOptions()

Output:

    
    {
    "domainName": "firstclusterset", 
    "global": {
        "invalidated_cluster_policy": "drop_all", 
        "stats_updates_frequency": 0, 
        "target_cluster": "primary"
    }, 
    "routers": {
        "localhost.localdomain::Router1": {}
    }
}

  • Now, switching the target cluster from “cluster1” to “cluster2”.
MySQL  localhost:3309 ssl  JS > myclusterset.setRoutingOption('localhost.localdomain::Router1', 'target_cluster', 'cluster2')
Routing option 'target_cluster' successfully updated in router 'localhost.localdomain::Router1'.

  • Checking the status again, we can see the target cluster is now changed to “cluster2”.
MySQL  localhost:3309 ssl  JS > myclusterset.routingOptions()
 
{
    "domainName": "firstclusterset", 
    "global": {
        "invalidated_cluster_policy": "drop_all", 
        "stats_updates_frequency": 0, 
        "target_cluster": "primary"
    }, 
    "routers": {
        "localhost.localdomain::Router1": {
            "target_cluster": "cluster2"
        }
    }
}
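
Once “cluster1” is healthy again, or after a controlled switchover, the router can be pointed back at whichever cluster currently holds the primary role by resetting target_cluster to its default value of primary:

MySQL  localhost:3309 ssl  JS > myclusterset.setRoutingOption('localhost.localdomain::Router1', 'target_cluster', 'primary')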

How to rejoin a lost cluster in the ClusterSet

There are situations when a cluster has been marked as invalidated, or there might be some issue with its replication channel. To fix that, we might need to perform the “rejoinCluster()” process in order to add the cluster back to the ClusterSet.

We can simulate the same by using the below steps.

  • Stopping the asynchronous replication channel on the primary instance (“127.0.0.1:3311”) of cluster2, which is replicating from the instance (“127.0.0.1:3309”) of cluster1.
MySQL  localhost:3311 ssl  SQL > stop slave;

...
               Slave_IO_State: 
                  Master_Host: 127.0.0.1
                  Master_User: mysql_innodb_cs_63c324c0
                  Master_Port: 3309
                Connect_Retry: 3
              Master_Log_File: localhost-bin.000005
          Read_Master_Log_Pos: 2248
               Relay_Log_File: localhost-relay-bin-clusterset_replication.000002
                Relay_Log_Pos: 432
        Relay_Master_Log_File: localhost-bin.000005
             Slave_IO_Running: No
            Slave_SQL_Running: No
...

  • If we check the status again, we see that replication is now stopped on the cluster2 instance.
MySQL  localhost:3311 ssl  JS > myclusterset.status({extended:1})

"clusterRole": "REPLICA", 
            "clusterSetReplication": {
                "applierStatus": "OFF", 
                "applierThreadState": "", 
                "applierWorkerThreads": 4, 
                "receiver": "127.0.0.1:3311", 
                "receiverStatus": "OFF", 
                "receiverThreadState": "", 
                "source": "127.0.0.1:3309"
            }, 
            "clusterSetReplicationStatus": "STOPPED", 
            "globalStatus": "OK_NOT_REPLICATING", 
            "status": "OK", 
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",

  • Now, we will fix this by running the rejoinCluster() command below.
MySQL  localhost:3311 ssl  JS > myclusterset.rejoinCluster('cluster2')

...

Rejoining cluster 'cluster2' to the clusterset
NOTE: Cluster 'cluster2' is not invalidated
* Refreshing replication settings
** Changing replication source of 127.0.0.1:3312 to 127.0.0.1:3309
** Changing replication source of 127.0.0.1:3313 to 127.0.0.1:3309
** Changing replication source of 127.0.0.1:3311 to 127.0.0.1:3309

Cluster 'cluster2' was rejoined to the clusterset

...

So, the stopped replication channel was restarted automatically and the cluster synced with the primary cluster again.

cluster2": 
            "clusterRole": "REPLICA", 
            "clusterSetReplication": {
                "applierStatus": "APPLIED_ALL", 
                "applierThreadState": "Waiting for an event from Coordinator", 
                "applierWorkerThreads": 4, 
                "receiver": "127.0.0.1:3311", 
                "receiverStatus": "ON", 
                "receiverThreadState": "Waiting for source to send event", 
                "source": "127.0.0.1:3309"
            }, 
            "clusterSetReplicationStatus": "OK", 
            "globalStatus": "OK", 
            "status": "OK", 
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",

Summary

Here, we have discussed a few scenarios and methodologies that could be very useful for recovering cluster nodes and performing manual failovers in the topology. The options discussed above apply to both InnoDB Cluster and InnoDB ClusterSet environments.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

 

Try Percona Distribution for MySQL today!

Aug
24
2023
--

Automated Percona Monitoring and Management Upgrades


Welcome to our guide on keeping your Percona Monitoring and Management (PMM) Server and Client up-to-date. In this blog post, we’ll walk you through a method to ensure your PMM solution runs the latest and most secure versions. It’s important to note that this is just one option among many, and we’re eager to hear your feedback and alternative approaches.

Maintaining an up-to-date PMM environment is crucial for accessing the latest features and ensuring maximum security. However, the update process can be complex and time-consuming. To simplify it, we’ll introduce you to a method involving Watchtower and cron jobs. Watchtower automates PMM Server updates, while cron jobs handle PMM Client updates.

We value your input and want to know if this approach works for you. Share your thoughts, suggestions, and alternative methods on our forum. Your feedback will help us improve the update process and provide better solutions for the PMM community.

Let’s explore the steps to install, configure, and automate the update process for your PMM Server and Client. Together, we can keep your monitoring solution up-to-date effortlessly and enhance its effectiveness.

Please remember this is not official documentation but a cookbook about easier PMM management. This method may not work in later versions.

PMM Server installation

Installing PMM is a straightforward process. You can refer to the instructions provided for detailed guidance.

To install PMM, you can use the following command:

curl -fsSL https://www.percona.com/get/pmm | /bin/bash

This command will install Docker and the PMM server for you. However, when updating PMM, there are a few considerations to remember.

The recommended approach for updating PMM is to replace the Docker image. This method offers greater stability and ensures a clean installation of the latest version. To update, you must stop the current PMM server, obtain the new Docker image, and start a new instance. PMM will handle any necessary migrations during the startup process. While this method provides a more stable update, it does require additional steps, such as accessing the terminal and executing specific commands. This can be cumbersome, especially if you rely on IT support for such tasks.
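
As a rough sketch, the manual image replacement could look like the commands below; the container name matches the quick-install default, and the old container is kept around as a fallback until the new instance is verified:

docker stop pmm-server
docker rename pmm-server pmm-server-old       # keep the old container (and its volumes) as a fallback
docker pull percona/pmm-server:2
docker run -d -p 443:443 --volumes-from pmm-server-old --name pmm-server --restart always percona/pmm-server:2
docker rm pmm-server-old                      # only after verifying the new instance works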

Alternatively, you can use the Update button within the PMM server itself, which offers a more convenient solution for updating. This method simplifies the process by automating the update within the PMM user interface. You can find more details on this approach in our documentation.

Considering the trade-off between stability and convenience, the Docker image replacement method is recommended for ensuring a reliable update. However, choosing the approach that best suits your requirements and operational considerations is important.

Now, let’s explore automation options for simplifying the update process of our PMM Server.

PMM Server update automation

To automate the update process for your PMM Server, we’ll utilize a tool called Watchtower. Watchtower is designed to monitor your Docker container and update it automatically whenever a new image becomes available.

To get started with Watchtower, you can follow these steps:

1. Run the following command to start Watchtower:

docker run -d --name watchtower -v /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower pmm-server

2. That’s it! Watchtower is now actively monitoring your PMM Server container. It will check for new versions of the pmm-server:2 image (used by the quick install script by default) every 24 hours. When a new image is detected, Watchtower automatically updates your PMM instance, ensuring you have the latest version.

Please note that you should wait for the next official PMM release to be published before Watchtower can perform an update.

Watchtower simplifies the update process by eliminating the need for manual intervention and monitoring. It provides a convenient way to keep your PMM Server up-to-date with the latest releases, ensuring you have access to new features and security enhancements.
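
To confirm that Watchtower is running and to see when it last checked for a new image, you can tail its container logs (using the container name from the command above):

docker logs -f watchtower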

Next, let’s explore the process of updating the PMM Client and establishing a connection with the PMM Server.

PMM Client Installation and Configuration

In addition to the PMM Server, the PMM Client is another critical component of the distributed PMM system. The PMM Client is required unless you are solely adding remote instances via the PMM Server, in which case the Client embedded in the PMM server itself is used and is always in sync with the PMM Server version.

To install the PMM Client, you can follow the instructions provided. This method uses the package manager on your system, offering a straightforward installation process that supports seamless upgrades.
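
For reference, on a Debian or Ubuntu host the package-manager route is roughly the following sketch; consult the official instructions for other platforms and for the current repository setup:

wget https://repo.percona.com/apt/percona-release_latest.generic_all.deb
sudo dpkg -i percona-release_latest.generic_all.deb   # enables the Percona package repositories
sudo apt update
sudo apt install -y pmm2-client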

Once the PMM Client is installed, it needs to be connected to the PMM Server. Use the following command to configure the connection:

pmm-admin config --server-insecure-tls --server-url=https://admin:admin@X.X.X.X:443

Replace X.X.X.X with the IP address or hostname of your PMM Server. This command establishes the necessary settings for communication between the Client and Server.

After the PMM Client is configured, you will probably add some databases to PMM for monitoring.
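
For example, adding a local MySQL instance might look like the command below; the monitoring user pmm, its password, and the service name are assumptions that need to match your environment:

pmm-admin add mysql --username=pmm --password=secret --query-source=perfschema mysql-prod-1 127.0.0.1:3306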

PMM Client update automation

Automating the PMM Client upgrade process can be a bit complex due to version compatibility requirements with the PMM Server. A typical PMM Client upgrade process involves executing the following commands when a new version is released:

sudo apt update
sudo apt install -y pmm2-client

However, it’s crucial to note that the PMM Server should be the same version as, or newer than, the PMM Client. Therefore, you must upgrade the PMM Server first before updating the clients. This condition adds complexity to automating upgrades because it requires checking the server version to ensure compatibility.

To address this challenge, we can leverage the pmm-admin status command, which provides information about the Server and Client versions. By running pmm-admin status, you can obtain details such as the Server URL and version, as well as the Client’s connection status, version, and other relevant data.

To simplify automation, the pmm-admin command can expose data in JSON format using the --json flag. This allows for easier integration with automation scripts.

To assist you further, I have created a simple script that performs the PMM Client update process. You can find the script at this URL: https://gist.github.com/rnovikovP/e6d7d1dcda7f5004ad186f6bc4de565a . I recommend placing this script in the /usr/local/percona/pmm2/cron/ directory, which serves as a centralized location for all PMM-related files.
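
As a rough illustration of the idea (not the actual gist contents), such an updater can compare the two versions and upgrade the client only while it is behind the server. The JSON field names used below are assumptions, so check the real keys returned by pmm-admin status --json on your system:

#!/bin/bash
# Sketch of an automated PMM Client updater; not the actual gist script.
set -euo pipefail

STATUS=$(pmm-admin status --json)

# Assumed JSON keys -- adjust to the actual output of `pmm-admin status --json`.
SERVER_VERSION=$(echo "$STATUS" | jq -r '.server_version')
CLIENT_VERSION=$(echo "$STATUS" | jq -r '.agent_version')

# Upgrade only while the client is behind the server, so the client never
# ends up newer than the server it reports to.
NEWEST=$(printf '%s\n%s\n' "$SERVER_VERSION" "$CLIENT_VERSION" | sort -V | tail -n1)
if [ "$NEWEST" = "$SERVER_VERSION" ] && [ "$SERVER_VERSION" != "$CLIENT_VERSION" ]; then
    sudo apt-get update -qq
    sudo apt-get install -y pmm2-client
else
    echo "PMM Client $CLIENT_VERSION already matches (or is newer than) server $SERVER_VERSION; nothing to do."
fi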

Once you have downloaded the script, the next step is configuring it to run automatically using cron. Here’s a simple one-liner that creates the necessary directory, downloads the script, sets the appropriate permissions, and adds a cron job to execute it daily at 8 AM:

(mkdir -p /usr/local/percona/pmm2/cron/ && curl -sSL https://gist.githubusercontent.com/rnovikovP/e6d7d1dcda7f5004ad186f6bc4de565a/raw/ -o /usr/local/percona/pmm2/cron/pmm2-client-updater.sh && chmod +x /usr/local/percona/pmm2/cron/pmm2-client-updater.sh && (crontab -l 2>/dev/null; echo "0 8 * * * /usr/local/percona/pmm2/cron/pmm2-client-updater.sh") | crontab -)

This one-liner creates the necessary directory structure, downloads the script using curl, sets the script’s executable permission, and adds a cron job to execute the script at 8 AM daily.

By implementing this automation, the script will handle the PMM Client updates, checking for the latest version and upgrading the Client as needed. This saves you from manually executing the upgrade commands and ensures that your PMM Client is always up to date.

Next, we’ll review what we need to confirm for the automated updates process.

Update configuration confirmation

To ensure that your PMM update automation is functioning correctly, there are a few steps you can take. Since the configuration relies on officially released images and binaries, you will need to wait until the new version of PMM is officially released. Rest assured, the release is just around the corner.

Here’s what you can check in the meantime:

On the PMM Server, verify the running containers by executing the following command:
docker ps

You should see output similar to this:

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e139815b8317 containrrr/watchtower "/watchtower pmm-ser…" 4 seconds ago Up 3 seconds 8080/tcp watchtower
8f9573bb063f percona/pmm-server:2 "/opt/entrypoint.sh" 21 hours ago Up 21 hours 80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp pmm-server

This confirms that the PMM Server container is running successfully.

On the PMM Client, verify the cron job configuration by executing the following command:

crontab -l

You should see the following line in the output:

0 8 * * * /usr/local/percona/pmm2/cron/pmm2-client-updater.sh

This confirms that the cron job for the PMM Client updater script is set to run daily at 8 AM.

Additionally, you can manually execute the updater script to confirm that it is functioning correctly:
/usr/local/percona/pmm2/cron/pmm2-client-updater.sh

Running this command should display the PMM Server and Client versions, which should match the following:

PMM Server version: [2.39.0]
PMM Client version: [2.39.0]

Ensure that both versions are the same, indicating a successful update.

If all the checks above align with the presented information, it means your PMM update automation is on track. Now, all you need to do is wait for the new PMM release and observe your PMM system’s updated state.

Closing

In conclusion, this blog post has provided a comprehensive guide on keeping your PMM Server and Client updated. While the methods discussed here are effective, we acknowledge that alternative approaches or better options may be available. We value your input and encourage you to share any additional techniques or ideas that could improve the update process for PMM.

One topic worth exploring is automating the update process for PMM and its components. Are you interested in auto-updates for PMM? We would love to hear your thoughts on this matter. Share your experiences, suggestions, or concerns on the forum.

Remember, staying up-to-date with the latest versions of PMM is crucial for maintaining maximum security and accessing the latest features. Your feedback and contributions are invaluable as we strive to provide the best solutions for the PMM community.

Thank you for reading, and we appreciate your engagement in improving the PMM update process. Together, we can make monitoring with PMM even more efficient and user-friendly.

If you want to share your experience using PMM, please don’t hesitate to schedule a call with me using this link; I’m always open to hearing about users’ pain points to drive the product in the right direction.

Percona Monitoring and Management is a best-of-breed open source database monitoring solution. It helps you reduce complexity, optimize performance, and improve the security of your business-critical database environments, no matter where they are located or deployed.

 

Download Percona Monitoring and Management Today

Aug
23
2023
--

The Deprecation of qpress/QuickLZ Compression Algorithm


To reduce the backup size, save storage space, and speed up the backup and restore process, you can compress a backup with Percona XtraBackup. The XtraBackup --compress option makes XtraBackup compress all output data, including the transaction log file and metadata files, with one of the supported compression algorithms. To decompress all files in a backup made with the --compress option, use the --decompress option.

Version changes

  • With Percona XtraBackup 8.0.34-29, qpress/QuickLZ is no longer supported for compress operations. The Zstandard (ZSTD) compression algorithm is moved to General Availability. With this version, ZSTD becomes the default compression algorithm for the --compress option. The alternative compression algorithm is LZ4.

    To compress files using the ZSTD compression algorithm, use the --compress option:

    xtrabackup --backup --compress --target-dir=/data/backup

    To compress files using the LZ4 compression algorithm, set the --compress option to LZ4:

    xtrabackup --backup --compress=lz4 --target-dir=/data/backup

    To decompress all files in a backup, use the --decompress option:

    xtrabackup --decompress --target-dir=/data/compressed/

    To decompress backups taken by older versions of Percona XtraBackup that used a QuickLZ compression algorithm, the --decompress option still supports qpress for backward compatibility.

  • Up to Percona XtraBackup 8.0.33-28, the --compress option uses the QuickLZ compression algorithm by default. When using --compress, the resulting files have the qpress (*.qp) archive format. 

    To compress files using the QuickLZ compression algorithm, use the --compress option:

    xtrabackup --backup --compress --target-dir=/data/backup

    Every *.qp file produced by XtraBackup is a one-file qpress archive. You can extract the contents of these files with the --decompress option that supports the qpress file archiver.

  • Starting with Percona XtraBackup 8.0.31-24, the use of qpress/QuickLZ to compress backups is deprecated. Percona recommends using either LZ4 or ZSTD compression algorithms. 
  • Percona XtraBackup 8.0.30-23 adds ZSTD compression algorithm in tech preview. ZSTD is a fast lossless compression algorithm that targets real-time compression scenarios and better compression ratios. 

    To compress files using the ZSTD compression algorithm, set the --compress option to zstd.

    xtrabackup --backup --compress=zstd --target-dir=/data/backup

    The --compress=zstd option produces *.zst files. You can extract the contents of these files with the --decompress option.

    Also, you can specify the ZSTD compression level with the --compress-zstd-level(=#) option as follows:

    xtrabackup --backup --compress --compress-zstd-level=1 --target-dir=/data/backup
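
Putting these options together, a typical compressed backup-and-restore cycle on a recent XtraBackup release might look like the sketch below; the paths are examples, and the server must be stopped with an empty data directory before running --copy-back:

xtrabackup --backup --compress=zstd --target-dir=/data/backup
xtrabackup --decompress --remove-original --target-dir=/data/backup   # --remove-original deletes the *.zst files after decompression
xtrabackup --prepare --target-dir=/data/backup
xtrabackup --copy-back --target-dir=/data/backup --datadir=/var/lib/mysql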

 

Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server for MySQL and MySQL. It performs online non-blocking, tightly compressed, highly secure backups on transactional systems so that applications remain fully available during planned maintenance windows.

 

Download Percona XtraBackup

Aug
22
2023
--

High Availability vs. Fault Tolerance: Is FT’s 00.001% Edge in Uptime Worth the Headache?


Estimates vary, but most reports put the average cost of unplanned database downtime at approximately $300,000 to $500,000 per hour, or $5,000 to $8,000 per minute. With so much at stake, database high availability and fault tolerance have become must-have items, but many companies just aren’t certain which one they must have.

This blog article will examine shared attributes of high availability (HA) and fault tolerance (FT). We’ll also look at the differences, as it’s important to know what architecture(s) will help you best meet your unique requirements for maximizing data assets and achieving continuous uptime.

We’ll wrap it up by suggesting high availability open source solutions, and we’ll introduce you to support options for ensuring continuous high performance from your systems.

What does high availability mean?

High availability refers to the continuous operation of a database system with little to no interruption to end users in the event of system failures, power outages, or other disruptions. A basic high availability database system provides failover (preferably automatic) from a primary database node to redundant nodes within a cluster.

High availability does not guarantee 100% uptime, but an HA system enables you to minimize downtime to the point you’re almost there. Within IT, the gold standard for high availability is 99.999%, or “five-nines” of availability, but the level of HA needed really depends on how much system downtime you can bear. Streaming services, for example, run mission-critical systems in which excessive downtime could result in significant financial and reputational losses for the business. But many organizations can tolerate a few minutes of downtime without negatively affecting their end users.

The following table shows the amount of downtime for each level of availability.

Availability level           Approximate downtime per year
99% (“two nines”)            3.65 days
99.9% (“three nines”)        8.76 hours
99.99% (“four nines”)        52.56 minutes
99.999% (“five nines”)       5.26 minutes

Implementing high availability systems: How does it work?

High availability works through a combination of key elements. Some of the most important elements include:

  • No single point of failure (SPOF): You must eliminate any SPOF in the database environment, including any potential for an SPOF in physical or virtual hardware.
  • Redundancy: Critical components of a database are duplicated so that if one component fails, the functionality continues by using a redundant component. For example, in a server cluster, multiple servers are used to host the same application so that if one server fails, the application can continue to run on the other servers.
  • Load balancing: Traffic is distributed across multiple servers to prevent any one component from becoming overloaded. Load balancers can detect when a component is not responding and put traffic redirection in motion.
  • Failure detection: Monitoring mechanisms detect failures or issues that could lead to failures. Alerts report failures or issues so that they are addressed immediately.
  • Failover: This involves automatically switching to a redundant component when the primary component fails. If a primary server fails, a backup server can take over.

High availability architecture

Two previously mentioned absolutes — no SPOF and foolproof failover — must apply across the following areas if an HA architecture is to be achieved:

  • Infrastructure — This is the hardware that database systems rely on. Without enough infrastructure (physical or virtualized servers, networking, etc.), there cannot be high availability.
  • Topology management — This is the software management related specifically to the database and managing its ability to stay consistent in the event of a failure.
  • Connection management — This is the software management related specifically to the networking and connectivity aspect of the database. Clustering solutions typically bundle with a connection manager. However, in asynchronous clusters, deploying a connection manager is mandatory for high availability.
  • Backup and continuous archiving — This is of extreme importance if any replication delay happens and the replica node isn’t able to work at the primary’s pace. The backed-up and archived files can also be used for point-in-time recovery.

What is fault tolerance?

Fault tolerance refers to the ability of a database system to continue functioning in full, with no downtime, amid hardware or software failures. When a failure event occurs — such as a server failure, power outage, or network disruption — a fault-tolerant system will ensure that data integrity is preserved and that the system remains operational.

Implementing fault-tolerant systems: How does it work?

Here are some of the key components and characteristics of a fault-tolerant database environment:

  • Replication: Data is replicated across multiple nodes or servers, so if one node fails, the data remains accessible from replicas. Replication can be synchronous (ensuring immediate consistency) or asynchronous (allowing some delay).
  • Redundancy: The system stores multiple copies of data on separate devices. Redundancy provides backups and safeguards against data loss in case of hardware failures.
  • Error detection and correction: Techniques such as checksums, parity bits, and error-correcting codes are used to detect and correct errors that might occur during data transmission or storage.
  • Data integrity checks: Various mechanisms, such as checksums or hashing algorithms, are used to verify the integrity of stored data. This ensures that data remains consistent and accurate even in the presence of errors.
  • Specialized hardware: Specialized hardware is used to detect a hardware fault and initiate an immediate switch to a redundant hardware component.

Architecture for fault-tolerant systems

Fault-tolerant information systems are designed to offer 100% availability. Some of the key elements of such designs include:

  • Backup hardware systems — The hardware system is backed up by systems that are the same or perform the same functions but are at separate data centers and are not dependent on any of the same physical sources of power and functionality.
  • Backup software systems — As with hardware, there is no loss of software functionality because there are mirrored copies performing the same functions but residing elsewhere and using a different source of power.
  • Containerization platforms — Today, companies are increasingly using containerization platforms such as Kubernetes to run multiple instances of the same software. By doing so, if an error or other problem forces the software to go offline, traffic can be routed to other instances, and application functionality continues. 
  • Alternate power sources — To withstand power outages, businesses have alternate sources on standby. Some businesses, for example, have powerful generators ready to roll if electricity is lost.
  • Alternate environments — Fault-tolerant information systems can include separate cloud and on-premises databases that perform the same functions. For companies without their own physical servers, replicated cloud databases are maintained in different regions to provide contingencies for power outages.

High availability vs. Fault tolerance: Understanding the differences

There’s obviously a lot in common in terms of setup, functionality, and purpose. So with all the similarities (replication, load balancing, redundant components, and more), what are the differences? 

In general terms, the purpose of a high availability solution is to minimize downtime and provide continuous access to the database, while the purpose of fault tolerance is to maintain system functionality and data integrity at all times, including during failure events or faults. 

Still, generally speaking, but in financial terms, achieving fault tolerance is more costly, and the payoff often does not justify the expense. For example, it takes a lot of time, money, and expertise to completely mirror a system so that if one fails, the other takes over without any downtime. It can be considerably cheaper to establish a high availability database system in which there’s not total redundancy, but there is load balancing that results in minimal downtime (mere minutes a year).   

Basically, it comes down to a company’s needs, what’s at stake, and what fits its budget. Here are key questions to consider when deciding between high availability and fault tolerance:

How much downtime can your company endure? 

With high availability, you can achieve the gold standard previously mentioned — and your database system will be available 99.999% of the time. 

With a fault-tolerant system, you can spend a lot more and perhaps do better than the 5.26 minutes of downtime (in an entire year) that come with the “five nines” described immediately above. But is it mission-critical to do better than 99.999% availability?

How much complexity and how many redundant components are you willing to take on? 

High availability systems typically have redundant servers or clusters to ensure that if one component fails, another can take over seamlessly. 

Fault tolerance incorporates multiple versions of hardware and software, and it also includes power supply backups. The hardware and software can detect failures and instantly switch to redundant components, but they also constitute more complexity, parts, and cost.

Which option fits your budget?

A high availability database system requires redundancy, load balancing, and failover. It also must ensure that there is no single point of failure. But depending on your needs, there are varying levels of high availability, and an architecture can be fairly simple. 

Fault-tolerant systems are designed with more complex architectures, requiring sophisticated hardware and software components, along with specialized expertise to design, configure, and maintain. The additional complexity adds to the cost of the system.

Percona HA expertise and architectures

At Percona, we advise that, in most cases, the attributes and cost-effectiveness of high availability are the way to go. It’s plenty available and plenty scalable. We’re also big advocates of using open source software that’s free of vendor lock-in and is backed by the innovation and expertise of the global open source community to achieve high availability. 

At the same time, we’re aware that achieving high availability using open source software takes extra effort. HA doesn’t come with the complexity and price tag of a fault-tolerant system, but it still requires considerable time and expertise. 

So instead of you having to select, configure, and test architecture for building an HA database environment, why not use ours? You can use Percona architectures on your own, call on us as needed, or have us do it all for you.

Check out these helpful whitepapers, which include ready-to-use architectures:

Percona Distribution for MySQL: High Availability With Group Replication

Percona Distribution for PostgreSQL: High Availability With Streaming Replication

High availability support you can count on

Percona designs, configures, and deploys high availability databases so that applications always have access to essential data. We’ll help you ensure that applications stay up, databases stay bug-free, and your entire system runs at optimal performance levels.

 

Percona High Availability Database Support

FAQs

What is the difference between DR vs. HA and FT?

Disaster recovery focuses on recovering from major disruptive events and restoring operations. Fault tolerance and high availability focus on building systems that can withstand failures and continue operating with minimal interruption (HA) and without interruption (FT).

What are the goals of high availability and fault tolerance?

The goal of employing fault tolerance is to maintain system functionality and data integrity at all times, including during failures or faults. The goal of employing a high availability solution is to minimize downtime and provide continuous access to the database. 

What is the difference between HA and DR?

High availability is used to prevent downtime due to individual component failures within a live environment, while disaster recovery focuses on recovering from major events that make the database inaccessible or non-functional.

Aug
21
2023
--

Talking Drupal #412 – Lando Episode II

Today we are talking about Lando, the release cycle, and Lando 4.0 with our guest Aaron Feledy.

For show notes visit: www.talkingDrupal.com/412

Topics

  • About Aaron
  • What is Lando
  • Best type of development for Lando
  • Listener Q: Stephen – I recently switched to DDEV for Mutagen; will Lando support that in the future?
  • Release schedule
  • Development process
  • Favorite features
  • When is 4.0 and what is in it
  • Best reasons to choose Lando
  • Adoption rate of Lando
  • Listener Q: Stephen – Nic switched from Lando to DDEV; why, and what features would he like to see in Lando to switch back?
  • Listener Q: Matthieu – I use Lando and I am trying to convince coworkers to use it as well; any advice?

Resources

Guests

Aaron Feledy – lando.dev

Hosts

Nic Laflin – nLighteneddevelopment.com nicxvan John Picozzi – epam.com johnpicozzi Andy Blum – andy-blum.com andy_blum

MOTW Correspondent

Martin Anderson-Clutz – @mandclu Workflow Participants

  • Brief description:
    • Have you ever wanted to assign moderated content to specific users to edit or approve it before publishing? There’s a module for that!
  • Brief history
    • How old: created in Feb 2017
    • Versions available: 8.x-2.6 (D8 & D9), 3.0.0-alpha1 (D9 & D10)
  • Maintainership
    • Seeking a new maintainer
  • Number of open issues:
    • 17 open, 3 of which are bugs
  • Does have test coverage
  • Usage stats:
    • 9 sites
  • Maintainer(s):
    • Jheadstrom, who maintains his own stable of modules, including a number in the Message stack
  • Module features and usage
    • Creates a new tab on nodes, on which a content creator (or anyone with the necessary permission) can assign one or more editors or reviewers
    • Can specifically designate which roles can be editors or reviewers
    • Editors have the ability to make changes and transition content between states, while reviewers can only move between states
    • For each transition in your workflow, can specify if editors and/or approvers are able to perform the transition
    • Can optionally send a notification email to workflow participants as they are added to a piece of content
    • I used to have a boss who liked to say “The surest way to starve a dog is to put two people in charge of feeding it”, so the idea of specifically assigning individuals to individual pieces of content is potentially very powerful
