Aug 22, 2023

High Availability vs. Fault Tolerance: Is FT’s 0.001% Edge in Uptime Worth the Headache?


Estimates vary, but most reports put the average cost of unplanned database downtime at approximately $300,000 to $500,000 per hour, or $5,000 to $8,000 per minute. With so much at stake, database high availability and fault tolerance have become must-have items, but many companies just aren’t certain which one they must have.

This blog article will examine shared attributes of high availability (HA) and fault tolerance (FT). We’ll also look at the differences, as it’s important to know what architecture(s) will help you best meet your unique requirements for maximizing data assets and achieving continuous uptime.

We’ll wrap it up by suggesting high availability open source solutions, and we’ll introduce you to support options for ensuring continuous high performance from your systems.

What does high availability mean?

High availability refers to the continuous operation of a database system with little to no interruption to end users in the event of system failures, power outages, or other disruptions. A basic high availability database system provides failover (preferably automatic) from a primary database node to redundant nodes within a cluster.

High availability does not guarantee 100% uptime, but an HA system enables you to minimize downtime to the point you’re almost there. Within IT, the gold standard for high availability is 99.999%, or “five-nines” of availability, but the level of HA needed really depends on how much system downtime you can bear. Streaming services, for example, run mission-critical systems in which excessive downtime could result in significant financial and reputational losses for the business. But many organizations can tolerate a few minutes of downtime without negatively affecting their end users.

The following table shows the amount of downtime for each level of availability.

| Availability level | Downtime per year |
| --- | --- |
| 99% (“two nines”) | 3.65 days |
| 99.9% (“three nines”) | 8.76 hours |
| 99.99% (“four nines”) | 52.56 minutes |
| 99.999% (“five nines”) | 5.26 minutes |

Implementing high availability systems: How does it work?

High availability works through a combination of key elements. Some of the most important elements include:

  • No single point of failure (SPOF): You must eliminate any SPOF in the database environment, including any potential for an SPOF in physical or virtual hardware.
  • Redundancy: Critical components of a database are duplicated so that if one component fails, the functionality continues by using a redundant component. For example, in a server cluster, multiple servers are used to host the same application so that if one server fails, the application can continue to run on the other servers.
  • Load balancing: Traffic is distributed across multiple servers to prevent any one component from becoming overloaded. Load balancers can detect when a component is not responding and put traffic redirection in motion.
  • Failure detection: Monitoring mechanisms detect failures or issues that could lead to failures. Alerts report failures or issues so that they are addressed immediately.
  • Failover: This involves automatically switching to a redundant component when the primary component fails. If a primary server fails, a backup server can take over (a configuration sketch follows this list).
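To make these elements concrete, here is a minimal sketch of how load balancing, failure detection, and failover can be wired together with HAProxy. The addresses, ports, and health-check intervals are assumptions for illustration, not a production configuration:

```
# haproxy.cfg: hypothetical two-node database cluster
listen db_cluster
    bind *:5000
    mode tcp
    option tcp-check                       # failure detection: probe each backend
    default-server inter 3s fall 3 rise 2  # mark a node down after 3 failed checks
    server db1 10.0.0.1:5432 check         # normally receives all traffic
    server db2 10.0.0.2:5432 check backup  # failover: used only if db1 is down
```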

High availability architecture

Two previously mentioned absolutes — no SPOF and foolproof failover — must apply across the following areas if an HA architecture is to be achieved:

  • Infrastructure — This is the hardware that database systems rely on. Without enough infrastructure (physical or virtualized servers, networking, etc.), there cannot be high availability.
  • Topology management — This is the software management related specifically to the database and its ability to stay consistent in the event of a failure.
  • Connection management — This is the software management related specifically to the networking and connectivity aspect of the database. Clustering solutions typically ship with a connection manager. However, in asynchronous clusters, deploying a connection manager is mandatory for high availability.
  • Backup and continuous archiving — This is of extreme importance if any replication delay happens and the replica node isn’t able to work at the primary’s pace. The backed-up and archived files can also be used for point-in-time recovery.

What is fault tolerance?

Fault tolerance refers to the ability of a database system to continue functioning in full, with no downtime, amid hardware or software failures. When a failure event occurs — such as a server failure, power outage, or network disruption — a fault-tolerant system will ensure that data integrity is preserved and that the system remains operational.

Implementing fault-tolerant systems: How does it work?

Here are some of the key components and characteristics of a fault-tolerant database environment:

  • Replication: Data is replicated across multiple nodes or servers, so if one node fails, the data remains accessible from replicas. Replication can be synchronous (ensuring immediate consistency) or asynchronous (allowing some delay).
  • Redundancy: The system stores multiple copies of data on separate devices. Redundancy provides backups and safeguards against data loss in case of hardware failures.
  • Error detection and correction: Techniques such as checksums, parity bits, and error-correcting codes are used to detect and correct errors that might occur during data transmission or storage (see the sketch after this list).
  • Data integrity checks: Various mechanisms, such as checksums or hashing algorithms, are used to verify the integrity of stored data. This ensures that data remains consistent and accurate even in the presence of errors.
  • Specialized hardware: Specialized hardware is used to detect a hardware fault and initiate an immediate switch to a redundant hardware component.
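As one example of error detection in practice, PostgreSQL can compute a checksum for every data block so that storage-level corruption is caught when a page is read. This is a minimal sketch, assuming a PostgreSQL 16 data directory; other databases offer comparable mechanisms:

```bash
# enable block-level checksums when initializing a new cluster
initdb --data-checksums -D /var/lib/postgresql/16/main

# verify checksums later on an existing (stopped) cluster
pg_checksums --check -D /var/lib/postgresql/16/main
```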

Architecture for fault-tolerant systems

Fault-tolerant information systems are designed to offer 100% availability. Some of the key elements of such designs include:

  • Backup hardware systems — The hardware system is backed up by systems that are the same or perform the same functions but are at separate data centers and are not dependent on any of the same physical sources of power and functionality.
  • Backup software systems — As with hardware, there is no loss of software functionality because there are mirrored copies performing the same functions but residing elsewhere and using a different source of power.
  • Containerization platforms — Today, companies are increasingly using containerization platforms such as Kubernetes to run multiple instances of the same software. By doing so, if an error or other problem forces the software to go offline, traffic can be routed to other instances, and application functionality continues. 
  • Alternate power sources — To withstand power outages, businesses have alternate sources on standby. Some businesses, for example, have powerful generators ready to roll if electricity is lost.
  • Alternate environments — Fault-tolerant information systems can include separate cloud and on-premises databases that perform the same functions. For companies without their own physical servers, replicated cloud databases are maintained in different regions to provide contingencies for power outages.

High availability vs. Fault tolerance: Understanding the differences

There’s obviously a lot in common in terms of setup, functionality, and purpose. So with all the similarities (replication, load balancing, redundant components, and more), what are the differences? 

In general terms, the purpose of a high availability solution is to minimize downtime and provide continuous access to the database, while the purpose of fault tolerance is to maintain system functionality and data integrity at all times, including during failure events or faults. 

Still, in financial terms, achieving fault tolerance is more costly, and the payoff often does not justify the expense. For example, it takes a lot of time, money, and expertise to completely mirror a system so that if one fails, the other takes over without any downtime. It can be considerably cheaper to establish a high availability database system in which there’s not total redundancy, but there is load balancing and failover that keep downtime minimal (mere minutes a year).

Basically, it comes down to a company’s needs, what’s at stake, and what fits its budget. Here are key questions to consider when deciding between high availability and fault tolerance:

How much downtime can your company endure? 

With high availability, you can achieve the gold standard previously mentioned — and your database system will be available 99.999% of the time. 

With a fault-tolerant system, you can spend a lot more and perhaps do better than the 5.26 minutes of downtime (in an entire year) that come with the “five nines” described immediately above. But is it mission-critical to do better than 99.999% availability?

How much complexity and how many redundant components are you willing to take on? 

High availability systems typically have redundant servers or clusters to ensure that if one component fails, another can take over seamlessly. 

Fault tolerance incorporates multiple versions of hardware and software, and it also includes power supply backups. The hardware and software can detect failures and instantly switch to redundant components, but all of that adds complexity, parts, and cost.

Which option fits your budget?

A high availability database system requires redundancy, load balancing, and failover. It also must ensure that there is no single point of failure. But depending on your needs, there are varying levels of high availability, and an architecture can be fairly simple. 

Fault-tolerant systems are designed with more complex architectures, requiring sophisticated hardware and software components, along with specialized expertise to design, configure, and maintain. The additional complexity adds to the cost of the system.

Percona HA expertise and architectures

At Percona, we advise that, in most cases, the attributes and cost-effectiveness of high availability are the way to go. It’s plenty available and plenty scalable. We’re also big advocates of using open source software that’s free of vendor lock-in and is backed by the innovation and expertise of the global open source community to achieve high availability. 

At the same time, we’re aware that achieving high availability using open source software takes extra effort. HA doesn’t come with the complexity and price tag of a fault-tolerant system, but it still requires considerable time and expertise. 

So instead of you having to select, configure, and test architecture for building an HA database environment, why not use ours? You can use Percona architectures on your own, call on us as needed, or have us do it all for you.

Check out these helpful whitepapers, which include ready-to-use architectures:

Percona Distribution for MySQL: High Availability With Group Replication

Percona Distribution for PostgreSQL: High Availability With Streaming Replication

High availability support you can count on

Percona designs, configures, and deploys high availability databases so that applications always have access to essential data. We’ll help you ensure that applications stay up, databases stay bug-free, and your entire system runs at optimal performance levels.

 

Percona High Availability Database Support

FAQs

What is the difference between DR vs. HA and FT?

Disaster recovery focuses on recovering from major disruptive events and restoring operations. Fault tolerance and high availability focus on building systems that can withstand failures and continue operating with minimal interruption (HA) and without interruption (FT).

What are the goals of high availability and fault tolerance?

The goal of employing fault tolerance is to maintain system functionality and data integrity at all times, including during failures or faults. The goal of employing a high availability solution is to minimize downtime and provide continuous access to the database. 

What is the difference between HA and DR?

High availability is used to prevent downtime due to individual component failures within a live environment, while disaster recovery focuses on recovering from major events that make the database inaccessible or non-functional.

Jul 07, 2023

Setting Up and Deploying PostgreSQL for High Availability


With the average cost of unplanned downtime running from $300,000 to $500,000 per hour, businesses are increasingly using high availability (HA) technologies to maximize application uptime. Unfortunately, using certain open source database software as part of an HA architecture can present significant challenges. Despite all its upside, PostgreSQL software presents such challenges. 

PostgreSQL provides the foundation for high availability, including features and configurations for physical and logical replication and consistent physical backups with point-in-time recovery (PITR) capabilities. But PostgreSQL does not offer a complete HA solution. Therefore, DBAs must take a piecemeal approach to achieve HA with PostgreSQL, bringing together open source extensions and tools from multiple sources.

This mish-mash of technologies creates the potential for a single point of failure (SPOF) and an interruption of services to end users — the very thing high availability is supposed to prevent. Because most SPOFs are a consequence of scale, the risk of downtime only gets worse as the environment grows. A high availability design that once worked well may no longer keep up with more complex requirements.

This blog highlights considerations for keeping your own PostgreSQL databases highly available and healthy. We’ll examine the intricacies of establishing high availability with PostgreSQL software and provide links to Percona HA PostgreSQL reference architectures for businesses of every size (startup/small, medium, large, and enterprise).

Defining PostgreSQL high availability

High availability is essential for any business that relies on digital interactions — and today, that means just about everyone. The tolerable downtime depends on business size, operations, and objectives, but that downtime must be minimal to minuscule (see the “Measuring high availability” chart below). HA in PostgreSQL databases delivers virtually continuous availability, fault tolerance, and disaster recovery. It enables businesses and organizations to meet uptime and service level objectives, protect business-critical data, and maintain performant databases.

In general terms, to achieve HA in PostgreSQL, there must be:

  • Redundancy: To ensure data redundancy and provide continuous performance when a primary server fails, multiple copies of the database reside in replica servers. The same changes made in the primary database are made in the replicas. This ensures continuous business operations. PostgreSQL software supports synchronous streaming replication, asynchronous streaming replication, and logical replication (see the sketch after this list).
  • No single point of failure (SPOF): If the failure of a database infrastructure component could cause downtime, that component is considered an SPOF. The existence of even one SPOF means high availability has not been achieved. The majority of SPOFs are a consequence of scale, where designs that were once adequate (excellent, even) no longer meet evolving, more complex requirements. Downtime due to SPOFs can also be attributed to bottlenecks from architectures designed for applications instead of databases.
  • Failover: This functionality is all about switching — in both unplanned and planned situations — from a failed primary server to a standby server. When the primary server fails, a standby server takes over as the new primary. Failover is essential to minimizing downtime.
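In PostgreSQL terms, redundancy starts with seeding a standby from the primary, typically with pg_basebackup. A sketch, in which the hostname, replication user, and data directory are assumptions:

```bash
# clone the primary and generate the standby's connection settings automatically
pg_basebackup -h primary.db.internal -U replicator \
    -D /var/lib/postgresql/16/main \
    --wal-method=stream --write-recovery-conf --checkpoint=fast
```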

Also, in general terms, a high availability PostgreSQL solution must cover four key areas:

  • Infrastructure: This is the physical or virtual hardware database systems rely on to run. Without enough infrastructure (physical or virtualized servers, networking, etc.), there cannot be high availability.
  • Topology management: This is the software management related specifically to the database and its ability to stay consistent in the event of a failure.
  • Connection management: This is the software management related specifically to the networking and connectivity aspect of the database. Clustering solutions typically ship with a connection manager. However, in asynchronous clusters, deploying a connection manager is mandatory for high availability.
  • Backup and continuous archiving: PostgreSQL asynchronous replication relies on the write-ahead log (WAL) stream, so it’s vital to have centralized storage that all nodes can access. The primary node archives WAL segments to this location, and the replicas restore any segment that is no longer present on the primary. This keeps replication working when replication delay occurs and a replica can’t keep up with the primary’s pace. The archived files can also be used for point-in-time recovery if any disaster occurs (see the configuration sketch after this list).
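Here is a minimal sketch of that archiving setup, assuming a shared mount at /mnt/wal_archive that every node can reach:

```
# postgresql.conf on the primary: ship each WAL segment to shared storage
archive_mode = on
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'

# postgresql.conf on a replica: fetch segments the primary has already recycled
restore_command = 'cp /mnt/wal_archive/%f %p'
```

In production, the plain cp commands are usually replaced by a purpose-built tool such as pgBackRest or WAL-G, but the division of labor is the same.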

Key considerations for determining high availability needs

Before going into more detail, it’s important to determine whether your workloads and services require an HA architecture and, if so, how extensive that architecture should be. Answers to these questions will help you evaluate the need for a high availability PostgreSQL solution based on your business requirements, potential risks, and operational limitations:

  • How much downtime can your business tolerate before taking a financial hit, losing customers, or both?
  • Does your business’s industry have high availability requirements? If so, what are the legal liabilities?
  • How long does it take to recover from a database failure in your current setup? Can you afford that recovery time?
  • Can you afford the necessary hardware, software, and operational costs of maintaining a PostgreSQL HA solution?
  • Do you have the expertise and resources to design, implement, and manage a high availability solution for PostgreSQL?

Measuring high availability

Depending on the answers to the previous questions, achieving high availability can mean different things for different businesses. It depends on how much downtime an organization can withstand. With streaming services, for example, excessive downtime could result in significant financial and reputational losses for the business. Other organizations can tolerate a few minutes of downtime without negatively affecting end users.

The following table shows the amount of downtime for each level of availability, from “two nines” to “five nines” (the gold standard of HA). You’ll see that high availability doesn’t deliver 100% uptime, but it’s close.

Measuring high availability

| Availability level | Downtime per year |
| --- | --- |
| 99% (“two nines”) | 3.65 days |
| 99.9% (“three nines”) | 8.76 hours |
| 99.99% (“four nines”) | 52.56 minutes |
| 99.999% (“five nines”) | 5.26 minutes |

Popularity of PostgreSQL

Before delving further into the inner workings of PostgreSQL high availability, let’s briefly examine the burgeoning popularity of this open source relational database software.

PostgreSQL has rapidly gained favor among professional developers in recent years. Stack Overflow surveys show that 26% of developers preferred it in 2017, 34% in 2019, and 40% in 2021. Most recently, in Stack Overflow’s 2022 Developer Survey, PostgreSQL took a slight lead over MySQL (46.48% to 45.68%) as the most popular database platform among professional developers.

PostgreSQL is favored among relational database options for its complex data analysis, data science, graphing, and AI-related capabilities. PostgreSQL is known for powerful and advanced features, including synchronous and asynchronous replication, full-text searches of the database, and native support for JSON-style storage, key-value storage, and XML. 

PostgreSQL is highly extensible, enabling users to add custom functionality through plug-ins and extensions. It also features tools such as repmgr and Patroni for automatic failover and cluster management.

Being a more advanced database management system, PostgreSQL is well-suited for performing complex queries in a large environment quickly. Because it readily supports failover and full redundancy, it’s often preferred by financial institutions and manufacturers. It’s also preferred for use with geographic information systems (GIS) and geospatial data. PostgreSQL ranks as the fourth most popular database management system (DB-Engines, March 2023).

And it’s open source

Because PostgreSQL software is open source, it’s free of proprietary restrictions that can come with vendor lock-in. Developers can customize the source code and try new applications without a big budget hit. Companies can more easily scale infrastructure — up or down — to meet economic conditions and changing business objectives. With open source software, a business is not trapped into using one provider’s software, support, or services. Instead, the business may design and redesign systems as customer expectations change and business objectives evolve.

Also, with open source, there’s a global community of dedicated volunteers driving the development of PostgreSQL database technology. Open source standards and community support enable developers and DBAs to focus on accelerating PostgreSQL feature creation and enhancing availability, performance, scalability, and security. 

How does PostgreSQL high availability work?

Now, let’s go into more detail about high availability with PostgreSQL. Several of the aforementioned replication methods form the backbone of PostgreSQL high availability. They allow for data redundancy, fault tolerance, and failover amid disasters, power outages, human error — just about any scenario. Those methods include:

Streaming replication

With streaming replication, the entire database cluster is replicated from one server, known as the primary, to one or more standby servers. The primary server continuously streams the write-ahead logs (WAL) to the standby servers, which apply the changes to their own database copies. Streaming replication comes in two modes:

  • Synchronous streaming replication: In this mode, the primary server waits for confirmation from at least one standby server before committing a transaction. That confirmation ensures the data is replicated with a guaranteed level of redundancy and minimal data loss (see the sketch after this list).
  • Asynchronous streaming replication: This is the default way of replicating data in PostgreSQL setups. In this mode, the primary server does not wait for confirmation before committing transactions. The upside is enhanced performance, but there can be data loss in the event of a failure.
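For example, switching a cluster to synchronous mode is a configuration change on the primary. This is a sketch in which the standby name is an assumption; it must match the standby’s application_name:

```sql
-- commits now wait for confirmation from one named standby
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';
ALTER SYSTEM SET synchronous_commit = 'on';
SELECT pg_reload_conf();
```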

Logical replication

This data replication method occurs at the logical level of database objects, such as tables, rather than at the physical level of the underlying storage. Unlike streaming replication, which replicates the entire cluster, logical replication allows for more control over which data is replicated.

The two main components of logical replication are publishers and subscribers. A publisher shares data and a subscriber receives and applies the replicated data. This approach provides more flexibility for replication configuration.
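A minimal publisher/subscriber pair looks like the following; the table, database, and connection details are assumptions for illustration:

```sql
-- on the publisher
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- on the subscriber
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=primary.db.internal dbname=shop user=replicator'
    PUBLICATION orders_pub;
```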

Beyond — and connected to — the replication methods, these components also are essential in high availability: 

Failover

As stated, failover is when a standby server becomes the new primary server, whether for planned reasons or sudden, unexpected ones. In PostgreSQL, failover can be triggered manually, or it can be automated — with monitoring systems initiating the action.

Failover includes these steps:

  1. The monitoring system sends an alert that the primary server is unavailable, triggering the failover process.
  2. Another server is chosen and promoted as the new primary server (see the promotion sketch after these steps).
  3. Applications are redirected to the new primary server, and business-as-usual continues using the new primary server.
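In the manual case, step 2 boils down to promoting the chosen standby. A minimal sketch, assuming PostgreSQL 12 or later:

```sql
-- run on the standby being promoted; returns once promotion completes
SELECT pg_promote(wait => true);
```

Tools such as Patroni or repmgr perform the equivalent promotion automatically once their health checks agree that the primary is gone.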

Automatic switchover

This method is used to minimize downtime and remove the need for manual intervention. The database system triggers automatic failover the instant that predetermined conditions occur.

With automatic switchover, cluster management tools or outside integrated tools monitor the primary server 24/7, checking that it’s connected and doing 100% of its job. The automatic switchover process kicks in if any failure of the primary server is detected, and a standby server becomes the new primary server.

Cluster management tools

Several cluster management tools, such as Patroni and Replication Manager, simplify the configuration, monitoring, and management of PostgreSQL high availability setups. As introduced above, those tools automate tasks like failover, monitoring, and maintenance, making it easier to maintain a highly available PostgreSQL environment.

Load balancing

High availability often involves distributing the database workload across multiple servers to ensure optimal performance and avoid bottlenecks. PostgreSQL supports various load balancing techniques — including connection pooling and connection routing — to distribute incoming queries and transactions among multiple database servers.

Connection pooling maintains a set of established database connections that can be reused by multiple applications. PgBouncer and Pgpool-II are connection managers that accept client connections, manage the connection pool, and direct requests to the appropriate backend server.
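A minimal PgBouncer configuration might look like the following; the database name, backend address, and pool sizes are assumptions:

```ini
; pgbouncer.ini: hypothetical single-database pool
[databases]
shop = host=10.0.0.10 port=5432 dbname=shop

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction   ; a server connection is released when a transaction ends
max_client_conn = 500
default_pool_size = 20
```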

Proxy load balancers like Pgpool-II, PgBouncer, and HAProxy can route connections to different database servers. Such a proxy acts as the traffic cop between the applications and the database servers. Steered by load balancing configurations, the proxy forwards requests from applications to the appropriate backend server. Pgpool-II has the extra capability of splitting reads and writes seamlessly; with the other two, reads and writes must be sent to separate ports, and the application has to manage the split.

DNS load balancing distributes application connections across multiple PostgreSQL database servers using the Domain Name System (DNS). With DNS load balancing, DNS records with the same hostname but different IP addresses are configured. When applications connect to the PostgreSQL database, the DNS server hands out the available IP addresses in a prioritized or round-robin manner. This method is fairly easy to set up, but because DNS responses are cached, it does not adapt quickly to changes in the server configuration.

Monitoring and alerting

To maintain high availability in PostgreSQL databases, it is crucial to have monitoring and alerting mechanisms in place. There are multiple tools available for monitoring the health and performance of PostgreSQL instances. Those tools ensure that notifications are sent as issues and potential issues are detected. 

Bringing it all together in PostgreSQL architectures

When implemented successfully, the components we’ve examined can make for successful high availability in PostgreSQL databases. But, as stated, the PostgreSQL community version does not come with ready-to-go HA. It takes work, the right architecture, and often some outside help.  

Let’s start with the architectures. 

There’s an unfortunate misconception that high availability solutions are too expensive for startups and small businesses. But when using open source tools, coupled with an architecture such as the one offered by Percona, high availability on PostgreSQL can be achieved without a big price tag or the need for an overly complex environment. It can be done by building the HA infrastructure within a single data center. 

As your business grows, so should your high availability architecture. For medium and large businesses, the consequences of downtime, both in terms of lost revenue and erosion of customer sentiment, can be significant. High availability requires more fault-tolerant, redundant systems and probably larger investments in IT staff. Still, when using open source tools, high availability can be achieved cost-effectively and without the threat of vendor lock-in that can come from paid enterprise SQLs. For medium and large businesses, Percona provides an architecture that spreads availability across data centers to add more layers of availability to the cluster. 

For enterprises, the challenges and potential consequences increase exponentially. An architecture must address a lot more. The Percona architecture, for example, features two disaster recovery sites and adds more layers to the infrastructure in order to stay highly available and keep the applications up and running. This architecture, based on tightly coupled database clusters spread across data centers and availability zones, can offer an HA level up to 99.999% when using synchronous streaming replication, the same hardware configuration in all nodes, and fast internode connection.

You can get more details — and view actual architectures — at the Percona Highly Available PostgreSQL web page or by downloading our white paper, Percona Distribution for PostgreSQL: High Availability With Streaming Replication. But for starters, here are elements of a minimalist high availability architecture for PostgreSQL. Such architecture will include at least these four components in the design:

  • Database nodes: In PostgreSQL high availability architectures — such as those built on streaming replication, logical replication, or shared-disk clustering — multiple data nodes distribute the workload and maintain data consistency. Typically, each data node contains a copy or subset of the data.
  • HA controller: This mechanism detects a failure of the primary node and makes sure that there’s no network partition (split brain). The same component performs the failover when the primary fails, electing one of the replicas for promotion as the new primary node (see the sketch after this list).
  • Connection router: This mechanism changes the query routing so that application requests reach the new primary node.
  • Backup component: This software or mechanism provides fast, consistent, and reliable backup and restores for PostgreSQL and also guarantees that all WAL logs are shipped to a universally accessible storage area.
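As a sketch of how the HA controller role is often filled, here is a skeletal Patroni configuration for one node. The cluster name, addresses, and the choice of etcd as the consensus store are assumptions, and a real file also needs bootstrap and authentication sections:

```yaml
# patroni.yml: skeletal, for illustration only
scope: pgcluster
name: node1
restapi:
  listen: 0.0.0.0:8008          # health-check endpoint load balancers can probe
  connect_address: 10.0.0.1:8008
etcd3:
  hosts: 10.0.0.5:2379          # distributed consensus store for leader election
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.1:5432
  data_dir: /var/lib/postgresql/16/main
```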

Again, the descriptions immediately above are for a minimalist HA architecture. It can, and usually does, get a lot more complicated. Percona provides proven architectures to get you going, and we offer cost-effective options for help. 

Steps to deploying and maintaining PostgreSQL high availability

Once the method of replication is determined and the architecture is designed, it’s time to deploy it. As with the architecture itself, deployment can be easier and more cost-effective when enlisting high availability support for PostgreSQL from outside experts. It depends on what expertise you have on staff. 

Every database environment is different, so deployment procedures can vary, but here are some general steps:

  1. Configure the primary server. This server, the primary read-write node, will orchestrate data replication to standby servers.
  2. Create a primary server backup on the standby server or servers. 
  3. Configure the standby server to run in hot standby mode. The hot standby allows the standby server to be used for reads. It prevents complications and losses in the event of a failure or scheduled maintenance of the primary server.
  4. Configure load balancing. Using pgpool-II, HAProxy, or other PostgreSQL load balancing tools, your database will be ready for high traffic and to distribute read queries across multiple servers.
  5. Implement backup and disaster recovery. Since HA alone does not guarantee data protection, you should implement mechanisms for data durability and disaster recovery.
  6. Regularly schedule backups and test the restore process to ensure data integrity.
  7. Test the setup. Here are a couple of options; ideally, both would be used (a quick verification sketch follows these steps).
  • End-to-end (E2E) testing, though time-consuming, will show whether all components and applications of your HA setup work as intended.
  • Chaos engineering is another option. This is when engineers deliberately introduce problems throughout the database infrastructure so they can identify failure points and create solutions, avoiding costly downtime and the loss of frustrated customers.
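Whichever testing approach you choose, two quick queries verify that the replication layer itself is healthy; this assumes the streaming replication setup described earlier:

```sql
-- on the primary: each standby should appear with state 'streaming'
SELECT client_addr, state, sync_state FROM pg_stat_replication;

-- on a standby: true confirms it is running as a read-only hot standby
SELECT pg_is_in_recovery();
```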

Security considerations for high availability PostgreSQL

Keeping databases secure demands attention, of course. Yet again, it should be emphasized that every environment is unique, and specific security requirements will vary accordingly. It’s important to check out PostgreSQL documentation. In some cases, if there is no on-staff expertise to maintain adequate security levels of your high availability cluster, it’s wise to consider support.

But whether you can keep your environment secure on your own or need that outside help, there are some general best practices for securing a cluster in a high availability environment:

Configure SSL/TLS encryption for communication with the database cluster. Disable unencrypted connections where possible, and allow connections only from trusted networks.
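A minimal sketch of that setup, assuming the certificate files already exist in the data directory and that applications live on the 10.0.0.0/24 subnet:

```
# postgresql.conf
ssl = on
ssl_cert_file = 'server.crt'
ssl_key_file = 'server.key'

# pg_hba.conf: hostssl entries accept only TLS connections
hostssl all all 10.0.0.0/24 scram-sha-256
```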

Control access. Implementing role-based access control (RBAC) for PostgreSQL can be a helpful way of managing permissions and restricting access to sensitive data.
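For example, a read-only role can be created once and granted to application accounts; the role, database, and schema names here are assumptions:

```sql
-- reusable read-only role
CREATE ROLE readonly NOLOGIN;
GRANT CONNECT ON DATABASE shop TO readonly;
GRANT USAGE ON SCHEMA public TO readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;

-- application account that inherits only read access
CREATE ROLE app_reader LOGIN IN ROLE readonly;
```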

Use strong authentication. That means strong user passwords that are regularly changed. In more sensitive situations, it’s advisable to implement more secure methods like certificate-based authentication or LDAP integration. PostgreSQL supports a variety of external authentication methods, including GSSAPI, LDAP, and RADIUS. 

Audit, log, and regularly review. You can configure PostgreSQL to log queries, failed logins, administrative activity, and other events. Then, you can examine the logs at any time to spot potential security breaches or suspicious activity.

Conduct regular updates. Apply the latest PostgreSQL security patches and bug fixes ASAP when available. Stay active, or at least observant, as a participant in the global open source community. Stay informed about the latest PostgreSQL security information, including news about vulnerabilities and recommended patches.

Percona high availability architectures and support

There’s no getting around it; achieving high availability with PostgreSQL takes some work — whether you do it or have someone else do it. But if you’d like to avoid or ditch the piecemeal approach to high availability described earlier, Percona HA PostgreSQL architectures pull together open source tools and components to enhance and harden interoperability, ensuring your applications always have access to the data they need. You can implement an architecture yourself, call on Percona as needed, or have us manage your PostgreSQL HA database completely. 

Regardless of the level of support you choose (and you can change your mind at any time), you’ll have a proven architecture, and you’ll avoid vendor lock-in, potentially exorbitant costs, and other pitfalls of going with proprietary arrangements. 

A good place to start is our Highly Available PostgreSQL web page. There, you’ll find ready-to-use PostgreSQL high availability architectures designed specifically and separately to address the needs of small businesses and startups, medium and large businesses, and enterprise and globally distributed businesses.

Also, check out our white paper — Percona Distribution for PostgreSQL: High Availability With Streaming Replication. It will show you how to pull together open source tools and components for achieving the highest levels of HA on PostgreSQL, even as you scale. The white paper also provides the ready-to-use architectures.

 

High Availability PostgreSQL from Percona

 

Download white paper: Percona Distribution for PostgreSQL: High Availability With Streaming Replication

FAQs

The following are commonly asked questions and short answers about high availability in PostgreSQL databases. More detailed answers are presented in the sections above.

What is the difference between high availability and disaster recovery?

High availability provides continuous access to a PostgreSQL database through replication, failover mechanisms, and the absence of single points of failure. Disaster recovery is about restoring data and services after a catastrophic event by using backups, replication, and recovery procedures. 

Can high availability architecture eliminate all downtime?

High availability doesn’t deliver 100% uptime, but it’s close. The gold standard of HA is “five nines,” which means a database is available 99.999% of the time.

What are common challenges when implementing high availability architecture?

The PostgreSQL community version doesn’t come with high availability features and functionality out of the box. That often requires a piecemeal approach to achieving HA. DBAs must bring together a few open source extensions and tools from multiple sources.

What technologies can be used for high availability architecture?

A multitude of technologies must be used together to achieve high availability in a PostgreSQL database. Essential technologies include replication, failover, load balancing, and cluster management.

How often should high availability architecture be tested and validated?

The frequency of testing and validation depends on several factors, including what’s at stake for the business and the complexity of the HA architecture. Certainly, a quarterly in-depth check is advisable, and for more critical situations, monthly might be appropriate.

Jun 22, 2023

The Ultimate Guide to Database High Availability


To make data count and to keep services running without interruption, companies and organizations must have highly available databases. That doesn’t mean downtime is nonexistent, but it must be limited so that significant or irreparable damage (as defined by the business or industry) does not occur.

This guide provides an overview of what high availability means, the components involved, how to measure high availability, and how to achieve it.

Defining high availability

In general terms, high availability refers to the continuous operation of a system with little to no interruption to end users in the event of hardware or software failures, power outages, or other disruptions. A basic high availability database system provides failover (preferably automatic) from a primary database node to redundant nodes within a cluster.

HA is sometimes confused with “fault tolerance.” Although the two are related, the key difference is that an HA system provides quick recovery of all system components to minimize downtime. Some disruption might occur, but it will be minimal. Fault tolerance aims for zero downtime and zero data loss. As such, fault tolerance is more expensive to implement because it requires dedicated infrastructure that completely mirrors the primary system.

In general terms, again, when it comes to database reliability, the stakes are high. Database downtime can hurt or doom any company with anything to do with the internet. More specifically, a 2022 ITIC survey found that downtime costs more than $300,000 per hour for 91% of small, mid-size, and large enterprises. Among just the mid-size and large respondents, 44% said a single hour of downtime could potentially cost them more than $1 million.

Introducing key parts of high availability

  • Data backup and recovery: A database backup is created and stored in another location to help keep the system operational after a natural disaster, a cyber attack, or other problem.
  • Data replication: Data is continually copied from one database to another to ensure that the system remains operational even if one database fails.
  • Clustering: Multiple nodes work together (and separately if necessary) to provide access to data, ensuring that the system remains operational even if one node fails. (More in the following sub-section.)
  • Load balancing: Requests are evenly distributed across multiple database servers, ensuring the system remains operational even if one server fails.
  • Automated failover: To keep the database operational and minimize downtime, it automatically switches to a backup server if the primary server fails.

More about high availability clusters

HA clusters — also known as failover clusters — are groups of interconnected servers that work together to ensure that an application or service remains available to users. Redundancy and failover protection ensure that if one server fails or goes offline, another server will automatically take over its workload. Typically, the servers are configured in a primary/replica configuration, with one server designated as the primary server that handles all incoming requests and the others designated as replica servers that monitor the primary and take over its workload if it fails.

Additionally, most database setups these days include a read-only mode or other capabilities that help distribute the load. With PostgreSQL, when using physical replication, all replicas are in read-only mode. Logical replication can allow writes on the subscriber as well, but writing to both nodes at the same time is not recommended without a strong conflict-resolution strategy. You can also make a database default to read-only transactions or grant users read-only privileges on specific tables. With MySQL, you can set the whole instance to read-only mode using the read_only system variable, or you can restrict write operations by limiting users to read-only privileges on specific databases or tables. With MongoDB, you can assign read-only roles in which users can query data but cannot perform write operations.
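For illustration, here is how a read-only posture can be applied in the first two systems; the database name is an assumption:

```sql
-- PostgreSQL: new sessions in this database default to read-only transactions
ALTER DATABASE reporting SET default_transaction_read_only = on;

-- MySQL: put the whole instance into read-only mode (typically on replicas)
SET GLOBAL read_only = ON;
```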

HA clusters are used in web hosting, email hosting, database hosting, and other services that require high availability.

How does high availability work?

High availability works through a combination of the following:

  • No single point of failure (SPOF): You must eliminate any single point of failure in the database environment, including any physical or virtual hardware whose failure would bring the database system down.
  • Redundancy: Critical components of a database are duplicated so that if one component fails, the functionality continues by using a redundant component. For example, in a server cluster, multiple servers are used to host the same application so that if one server fails, the application can continue to run on the other servers.
  • Load balancing: Traffic is distributed across multiple servers to prevent any one component from becoming overloaded. Load balancers can detect when a component is not responding and put traffic redirection in motion.
  • Failure detection: Monitoring mechanisms detect failures or issues that could lead to failures. Alert mechanisms report those failures or issues so that they are addressed immediately.
  • Failover: This involves automatically switching to a redundant component when the primary component fails. If a primary server fails, a backup server can take over and continue to serve requests.

Components of high availability infrastructure

Multiple copies of data

Data redundancy helps prevent data loss due to hardware or software failures. By storing multiple copies of data, database administrators can recover lost or corrupted data and maintain data integrity. Redundancy is also critical for disaster recovery. In the event of a natural disaster, cyber attack, or other catastrophic event, redundant systems can help ensure that the database can be quickly restored to a functioning state.

Multiple servers or databases

By copying and maintaining data on multiple servers or databases, database administrators can ensure that there is always a backup copy available. They can use that replicated data to quickly restore the database to a functioning state in case of a system failure, attack, or other potentially damaging event.

Load balancers

Load balancers help distribute incoming traffic across multiple database servers, ensuring that no single server is overloaded with requests. This helps prevent the possibility of a single point of failure, which could cause downtime or slow performance. It also supports the flexibility and scalability of the database infrastructure.

Failover mechanisms

The primary purpose of failover — the automatic process of switching from a failed or offline database server to a standby or backup server — is to ensure user access and functionality of databases without interruption, even if there is a hardware or software failure. By implementing failover mechanisms, organizations can ensure that their critical applications and services remain available even in the event of a failure, minimizing the impact on their business operations.

How to achieve high availability

Not all companies are the same, nor are their requirements for HA. When planning your database HA architecture, the size of your company is a great place to start to assess your needs. For example, if you’re a small business, paying for a disaster recovery site outside your local data center is probably unnecessary and may cause you to spend more money than the data loss is worth. All companies, regardless of size, should consider how to strike a balance between availability goals and cost.

  • Startups and small businesses: Most startups and small businesses can achieve an effective HA infrastructure within a single data center on a local node. This base architecture keeps the database available for your applications in case the primary node goes down, whether that involves automatic failover in case of a disaster or planned switchover during a maintenance window.
  • Medium to large businesses: If you have enough additional budget, consider adding a disaster recovery site outside your local data center. This architecture spans data centers to add more layers of availability to the database cluster. It keeps your infrastructure available and your data safe and consistent even if a problem occurs in the primary data center. In addition to the disaster recovery site, this design includes an external layer of nodes. If communication between the sites is lost, the external node layer acts as a “source of truth” and decides which replica to promote as a primary. In doing so, it keeps the cluster healthy by preventing a split-brain scenario and keeps the infrastructure highly available.
  • Enterprises: An enterprise HA architecture adds another layer of insurance for companies that offer globally distributed services and could face devastating revenue and reputational losses in the event of substantial downtime. This architecture features two disaster recovery sites, adding more layers for the infrastructure to stay highly available and keep applications up and running. Based on tightly coupled database clusters spread across data centers and geographic availability zones, this architecture can offer 99.999% uptime when used with synchronous streaming replication, the same hardware configuration in all nodes, and fast internode connections.

How to measure high availability

High availability does not guarantee 100% uptime, but it allows you to get pretty close. Within IT, the gold standard for high availability is 99.999%, or “five-nines” of availability, but the level of HA needed really depends on how much downtime you can bear. Streaming services, for example, run mission-critical systems in which excessive downtime could result in significant financial and reputational losses for the business. But many organizations can tolerate a few minutes of downtime without negatively affecting their end users.

The following table shows the amount of downtime for each level of availability, from “two nines” to “five nines.”

| Availability level | Downtime per year |
| --- | --- |
| 99% (“two nines”) | 3.65 days |
| 99.9% (“three nines”) | 8.76 hours |
| 99.99% (“four nines”) | 52.56 minutes |
| 99.999% (“five nines”) | 5.26 minutes |

Here are additional metrics used to determine the reliability of a database, make adjustments that minimize downtime, and set benchmarks for meeting business continuity requirements. The desired value or range for each of these depends on the specific needs and risk tolerance of the company or organization that uses it:

  • Mean time between failures (MTBF) — a measure of the average time between two failures of a database or component. The MTBF is calculated by dividing the total operating time of the system by the number of failures that occurred during that time (a worked example follows this list).
  • Mean downtime (MDT) — the average amount of time that a database system is unavailable due to planned or unplanned downtime. The MDT is calculated by adding up the total amount of time that the system was down over a given period and dividing it by the number of downtimes.
  • Recovery time objective (RTO) — the maximum time that a system can be down before it starts to have a significant effect on business operations. Organizations use the RTO to determine how quickly they must recover their systems to meet business continuity requirements.
  • Recovery point objective (RPO) — the point in time by which an organization must recover its data in case of a disaster or system failure. For example, an RPO of 1 hour means that an organization can tolerate a maximum data loss of 1 hour’s worth of data. To ensure that the RPO is met, organizations typically implement backup and recovery procedures that enable them to restore data to a specific point in time.
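To make the first two metrics concrete: a database that ran for 8,760 hours (one year) and failed four times has an MTBF of 8,760 ÷ 4 = 2,190 hours. If those four failures caused 44 minutes of total downtime, the MDT is 44 ÷ 4 = 11 minutes, and availability for the year was about 99.99%, just inside “four nines.”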

High availability architecture for open source databases

No matter where a database infrastructure is located (on-premises, public or private cloud environment, or hybrid environment) or what technology is used — be it MySQL, PostgreSQL, or MongoDB — it’s vital to eliminate single points of failure. To be prepared for disaster recovery and attain high availability, server redundancy must be built into the database architecture.

And for that redundancy to be beneficial, the design of the database infrastructure must incorporate failover mechanisms that duplicate essential database functions and components.

The considerations described above (and in other sections) must be addressed across the following areas if a high availability architecture is to be achieved:

  • Infrastructure: This is the physical or virtual hardware that database systems rely on to run. Without enough infrastructure (physical or virtualized servers, networking, etc.), there cannot be high availability.
  • Topology management: This is the software management related specifically to the database and its ability to stay consistent in the event of a failure.
  • Connection management: This is the software management related specifically to the networking and connectivity aspect of the database. Clustering solutions typically ship with a connection manager. However, in asynchronous clusters, deploying a connection manager is mandatory for high availability.
  • Backup and continuous archiving: This is of extreme importance if any replication delay happens and the replica node isn’t able to work at the primary’s pace. The backed-up and archived files can also be used for point-in-time recovery if any disaster occurs.

Which database is best for high availability?

When choosing a database for high availability, some proprietary options offer attractive specialized features, including built-in replication, automatic failover, and advanced clustering. Those ready-to-go capabilities can simplify the implementation of high availability. Dedicated vendor support can also be a lure. Proprietary vendors can provide expertise, guidance, and timely resolutions that ensure the stability and availability of a database.

But with the right expertise — whether it’s from within your organization or through outside assistance — open source databases can provide the same benefits, as well as other advantages, including:

  • Cost-effectiveness: Open source software is typically free to use, which can significantly reduce the overall cost of establishing high availability. Conversely, proprietary databases often require expensive licenses and ongoing maintenance fees.
  • Flexibility: With open source databases, you have access to the source code, allowing you to modify and optimize the database to meet your specific high availability requirements. This flexibility can be crucial in designing a scalable architecture.
  • Community support: Large and active communities of users, developers, and DBAs can be invaluable when it comes to troubleshooting issues, finding solutions, and staying up-to-date with the latest developments in high availability techniques. The transparency of open code enhances security as potential vulnerabilities can be identified and patched more quickly by the community.
  • Freedom from vendor lock-in: With open source software, you are not tied to a specific vendor or locked into a particular technology stack. You have the freedom to choose and integrate various tools and technologies that best suit your high availability strategy.

HA is an important goal, but switching to a proprietary database solely for HA limits the other benefits of open source database software. In addition to enabling a strong database HA architecture, open source avoids costly licensing fees, offers data portability, gives you the freedom to deploy anywhere, anytime you want, and delivers great software designed by a community of contributors who prioritize innovation and quality.

Getting started with open source database high availability

There are numerous open source options available, but covering all of them here would not be practical. (For a more detailed examination of open source options, read The Ultimate Guide to Open Source Databases.)

Here, for this examination of high availability, we are focusing on three of the most popular open source options:

  • MySQL: This relational (table-based) database software supports asynchronous replication, which provides redundancy and allows for failover in case the primary database becomes unavailable (see the replication sketch after this list). MySQL supports the creation of read replicas, which can distribute the read workload, improve performance, and provide high availability for read-intensive applications. Another feature (among many) of note is MySQL Cluster, a distributed, shared-nothing database clustering solution. It combines synchronous replication, automatic data partitioning, and node-level failover to provide high availability and scalability.
  • PostgreSQL: This relational database software supports numerous high availability options, including asynchronous streaming replication, logical replication, and tools such as repmgr and Patroni for automatic failover and cluster management. PostgreSQL also supports synchronous replication in which the primary server waits for confirmation from at least one standby server before acknowledging a transaction. PostgreSQL point-in-time recovery uses continuous archiving to create a backup that allows you to recover the database to a specific state.
  • MongoDB: While not open source (see our blog Is MongoDB Open Source? Is Planet Earth Flat? for more), we’ll categorize it here for easy reading. This non-relational (document-based) database uses replica sets to provide high availability. MongoDB’s replica sets maintain multiple copies of data across different nodes. The primary node receives write operations and replicates them to the secondary nodes asynchronously or synchronously, depending on the chosen replication mode. As a result, data is redundantly stored, reducing the risk of data loss.
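As one concrete example from the list above, pointing a MySQL replica at its primary takes a couple of statements; the host and account are assumptions, and GTID-based replication is assumed to be enabled on both servers:

```sql
-- configure and start asynchronous replication (MySQL 8.0.23+ syntax)
CHANGE REPLICATION SOURCE TO
    SOURCE_HOST = 'primary.internal',
    SOURCE_USER = 'repl',
    SOURCE_PASSWORD = 'secret',
    SOURCE_SSL = 1,
    SOURCE_AUTO_POSITION = 1;
START REPLICA;
```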

Some open source databases include great options for HA but generally don’t include a built-in HA solution. That means you’ll need to review the various extensions and tools available carefully. Those extensions and tools can be excellent for enriching their respective databases, but as your environment scales, the pieced-together solution might not keep pace with evolving, more complex requirements.

Greater complexity means that you and your team will need certain skills and knowledge. For example, it might be necessary to write application connectors, integrate multiple systems, and align business requirements with your HA solution. You should also understand open source databases, applications, and infrastructure. Consider the possibility of needing outside expertise to help you manage your HA architecture.

Comparing high availability to similar systems

High Availability vs. Disaster Recovery

High availability is focused on preventing downtime and ensuring that the database remains available, while disaster recovery is focused on recovering from a catastrophic event and minimizing negative effects on the business. High availability typically involves redundant hardware, software, application, and network components that can quickly take over if the primary component fails, while disaster recovery typically involves regular backups, replication to a secondary site, and a well-defined recovery plan that outlines the steps to be taken in the event of a disaster. Companies certainly can and should incorporate high availability and disaster recovery into database architecture.

High Availability vs. Fault Tolerance

High availability ensures that the database remains up and running almost all the time, while fault tolerance is focused on preventing data loss or corruption in the event of a fault or error. Redundancy and failover mechanisms figure prominently in high availability architecture, a design that gives users continued access even if there is a hardware or software failure. A fault-tolerant system is designed to detect and correct errors as they occur so that normal operation is not interrupted. This can mean that instead of failing completely, a system continues to operate, but possibly at a reduced level of efficiency.

High Availability vs. Redundancy

Redundancy is a key element of high availability, so it’s not an either-or situation. As stated, high availability refers to a database that’s functional almost all the time (i.e., “five nines” or some other standard depending on company size and requirements). In other words, there’s little or no downtime. Redundancy, a specific part of high availability, refers to the duplication of critical components or data to provide a backup in case the primary component or data is lost or corrupted. To establish redundancy, the same data must be accessible from at least two places. That means having a primary system and a secondary system. The secondary system includes failover monitoring, so when a failure occurs, data traffic is redirected to that secondary system. In addition to redundancy, high availability includes load balancing, failure detection, and failover, as well as ensuring there is no single point of failure (SPOF).

Best practices for achieving high availability

For medium, large, and enterprise-size companies (as well as startups and small companies, depending on resources), the following general practices can help when planning, implementing, maintaining, and testing high availability architectures. Please note that these are presented as general best practices. (For more detailed information, check out the Percona architecture and solution links in subsequent sections.) Best practices include:

Conduct research and identify objectives

Early on, determine what you need and what you can afford. Here are the early steps:

  • Perform analysis: A good place to start might be to conduct a Failure Mode Analysis (FMA). An FMA or similar analysis can identify possible failure points in the database infrastructure. This will help in determining the redundancy requirements of the different components.
  • Determine requirements: The level of HA that is needed depends on how much downtime a company can bear. For example, some companies that offer streaming services will suffer significant financial and reputational losses in just a couple of minutes of downtime. Other organizations can tolerate a few minutes of downtime without negatively affecting their end users.
  • Consider costs: Multiple factors will, of course, affect how much you should budget. Depending on your setup, costs can include:
    • Hardware devices (servers, storage devices, network switches, etc.)
    • Networking equipment (switches, routers, etc.)
    • Software for clustering, replication, load balancing, failover, etc.; with proprietary licensing, costs can increase significantly as you scale
    • Cloud egress fees that might be charged when moving data out of a vendor’s network
    • Staffing and training (an HA system requires specialized expertise)
    • Testing to ensure the system will function during likely and unlikely conditions
    • Scalability to allow for maximum growth

Focus on redundancy and failover

Whether a company’s database infrastructure is on-premises, in a public or private cloud environment, or is run in a hybrid environment, it is mission-critical that there are no single points of failure. To avoid them, achieve high availability, and be poised for disaster recovery, there must be server redundancy. And for redundancy to matter, the database infrastructure design must include failover. In other words, there must be mechanisms for putting the duplication of essential database functions and components into action.

To establish redundancy, the same data must be accessible from at least two places. That means having a primary system and a secondary system. The secondary system includes failover monitoring, so when a failure occurs, data traffic is redirected to that secondary system.
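As a minimal sketch of how that failover looks in PostgreSQL (assuming a streaming-replication standby and PostgreSQL 12 or later for pg_promote):

-- On the standby: confirm it is receiving WAL from the primary
SELECT status, latest_end_lsn, latest_end_time FROM pg_stat_wal_receiver;

-- If the primary is lost, promote the standby so it begins accepting writes
SELECT pg_promote();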

Use shared-nothing (or shared-none) clusters

High availability can be achieved using the concepts of “shared-all” and “shared-nothing” (also called “shared-none”) clusters. Open source technologies often prefer shared-nothing clusters because they are not only simpler to implement and deploy but also easier to scale. With that setup, the developer or DBA ensures that each node within a cluster stores, manages, and processes its own data. That minimizes the problems that can occur when nodes share access to, and the ability to modify, the same memory or storage.

The most common shared-nothing architecture has a primary node responsible for receiving all data updates and propagating the updates to the replica nodes with little to no interference or conflict with other nodes. (Each node has its own cache buffer.) If one instance fails, another instance in another node can fill the role entirely, and high availability remains.

In addition to preventing single points of failure and system-wide shutdowns, a big advantage of a shared-nothing architecture is scalability. You can easily scale a shared-nothing system by adding nodes. Since those nodes don’t share data, there won’t be dueling processes that could slow or halt operations.
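For example, MySQL Group Replication, one shared-nothing implementation mentioned earlier, lets you inspect membership and roles from any node, since each node stores and applies data independently:

SELECT member_host, member_state, member_role
FROM performance_schema.replication_group_members;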

Apply load balancing

In addition to incursions from the outside and errors generated in-house, unusually high customer use can overwhelm a company’s database operations. That’s why it’s essential to include load balancing in your design. In times of high use, load balancing redirects workloads to components of the infrastructure best suited to pick up the extra work.

To keep services — and the databases that deliver them — running smoothly, a tiered architecture of server clusters helps. In the event that a server fails, a replica server in a different cluster can assume the role of the failed server. The load balancer is the key; it serves as a traffic cop of sorts, directing and redirecting flow from end users/applications to servers where data will flow as intended. If a server fails, the load balancer redirects traffic to a replicated server in another cluster.
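As one hedged illustration, ProxySQL, a popular open source load balancer for MySQL, is configured through a SQL-like admin interface; the hostnames and hostgroup ID below are illustrative assumptions, not defaults:

-- Register two replicas in a reader hostgroup; ProxySQL balances reads
-- across them and automatically shuns a backend that stops responding
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (20, 'replica1.example.com', 3306);
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (20, 'replica2.example.com', 3306);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;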

Use a container-orchestration platform

Containers can provide multi-cloud compatibility and cost-efficiency, and they are a great tool for isolating applications. But containers are not inherently faster than VMs or bare metal given the same resources, and they can be more complex to troubleshoot than virtual machine (VM) and bare metal environments. Still, companies find themselves adding more and more containers across multiple hosts.

To get real benefits from those containers, it’s advisable to use a system for orchestrating and managing containerized applications. Kubernetes is a platform that provides that system. It enables developers and DBAs to manage large numbers of containers across multiple hosts. Using Kubernetes can yield many benefits, including greater automation than traditional VMs, enhanced fault tolerance, and increased flexibility and scalability.

Deploy in multiple regions

Sometimes there’s a misperception that “cloud” means safe. Just because you’ve entrusted your database with a cloud provider instead of hosting on-site doesn’t mean you won’t lose service. Cloud provider data centers go down, and so do entire regions of data centers. So whether you host your own servers, you use a cloud provider, or you use a combination of both, it’s wise to replicate databases across regions, not in the same region.

With replication across different geographic regions, you can fail over quickly in the event of an entire region going down.
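Cross-region replicas do lag, so it’s worth monitoring how far behind a standby is before relying on it for failover. A minimal PostgreSQL sketch, run on the standby:

SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;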

Maintain deployment consistency

Using proven and secure code and processes is important in maintaining data integrity and reliability. Deployment consistency is essential in preventing the introduction of bugs and other intrusions that could lead to data corruption or data loss and could slow or interrupt applications and even entire systems. By maintaining database deployment consistency, organizations can ensure that their data is always up-to-date and accurate and that their applications perform reliably. Data consistency is especially important in financial and healthcare systems, where incorrect or outdated data can have serious and even fatal consequences.

Conduct end-to-end (E2E) testing

This practice tests the entire system to ensure that all database components and software applications (and any subsystems) work as intended — individually and in unison. E2E testing verifies that components meet performance, security, and usability requirements. It is often used to identify issues that arise from the interaction between different components of the system, such as when data is passed from one component to another. This type of testing requires a solid understanding of user behaviors and goals, which can be varied. E2E testing, therefore, can be time-consuming and potentially more expensive than some limited budgets allow.

Incorporate chaos engineering

Chaos engineering is testing that involves intentionally introducing failures into a software system to identify potential issues and weaknesses. By intentionally introducing problems throughout the database infrastructure, engineers can identify failure points and create solutions — avoiding costly downtime and potential loss of frustrated customers.

In well-orchestrated chaos engineering, there’s an experimental version of the system and a control version. A barrage of problems and outright attacks are injected into the system. The technical team then reviews the logging data to determine how the incursions affected the system.

Chaos engineering is designed to uncover unexpected, even shocking, surprises. Unlike targeted testing of traditional DevOps procedures, chaos engineering is widespread throughout the entire system. Since it’s performed on live systems, there is a chance that users can be briefly affected, but chaos engineering is a great motivator for following resilient design and maintenance practices. Most of all, chaos engineering is a way to fend off disaster.

Consider outside help

Consider the possibility that you might need outside expertise to help you plan, create, and manage an HA architecture. If an open source solution is right for you, but you need help implementing it, Percona recommends that you research to ensure the outside company is committed to the global open source community. Also, make sure you’ll have the freedom to innovate and to scale — up or down — as your company sees fit. Avoid unnecessary vendor lock-in. If you don’t use Percona, find another company that gives you those freedoms while helping you build the high availability infrastructure you need.

Percona architectures for highly available MySQL and PostgreSQL

Percona uses open source software to design architectures for high availability databases. Businesses of all sizes — small, medium, large, and enterprise — can use our designs to create their own architectures and ensure that applications and data stay available and in sync. If you need help, call on us for on-demand support as you implement an architecture. Or we can do it for you completely.

Here are resources (with viewable architectures) for help getting started:

High availability MySQL architectures

Check out Percona architecture and deployment recommendations, along with a technical overview, for a MySQL solution that provides a high level of availability and assumes high read/write application workloads.

Percona Distribution for MySQL: High Availability with Group Replication

 

If you need even more high availability in your MySQL database, check out Percona XtraDB Cluster (PXC).

High availability PostgreSQL architectures

Access a PostgreSQL architecture description and deployment recommendations, along with a technical overview, for a solution that provides high availability for mixed-workload applications.

 

Percona Distribution for PostgreSQL: High Availability with Streaming Replication

 

High availability support you can count on

Percona designs, configures, and deploys high availability databases so that applications always have access to essential data. Then, we apply the same level of expertise and commitment to ensuring your databases continue to run smoothly and continuously. We’ll help you ensure that applications stay up, databases stay bug-free, and your entire system runs at optimal performance levels.

 

Percona High Availability Database Support

 

FAQ

The following are commonly asked questions and short answers about high availability in databases. More detailed answers are presented in the sections above.

What does high availability mean?

High availability refers to a database’s ability to remain operational and accessible to users without significant downtime or interruptions. To achieve high availability, a database system must be designed with redundancy and failover mechanisms and have no single points of failure. High availability is particularly important in the e-commerce, financial, and healthcare industries.

What is an example of high availability?

Database replication is an example of high availability. Multiple copies of the same database are maintained on different servers so that if one server fails, the database can still be accessed from another server. Different techniques are used to achieve database high availability, including primary-replica replication or multi-primary replication.
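As a quick sketch, on a recent MySQL replica (8.0.22 or later; older versions use SHOW SLAVE STATUS) you can verify that replication is healthy before counting it toward high availability:

-- Replica_IO_Running and Replica_SQL_Running should both report Yes
SHOW REPLICA STATUS\G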

How does high availability work?

High availability of a database infrastructure works through a combination of redundancy, load balancing, failure detection, failover, and having no single point of failure (SPOF). Those elements combine to ensure that data is always accessible and that there is minimal downtime or interruption in service, even in the face of hardware or software failures.

What does 99.99% availability mean?

99.99% availability (or “four nines”) means that a database is down no more than 52.60 minutes per year, 4.38 minutes per month, 1.01 minutes per week, and 8.64 seconds per day.

What does 99.999% availability mean?

99.999% availability (or “five nines”) means that a database is down no more than 5.26 minutes per year, 26.30 seconds per month, 6.05 seconds per week, and 864.00 milliseconds per day.
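These figures follow directly from the arithmetic: a year has 365 × 24 × 60 = 525,600 minutes, so five nines allows 525,600 × (1 − 0.99999) ≈ 5.26 minutes of downtime per year.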

Jun
12
2023
--

Ready-to-Use High Availability Architectures for MySQL and PostgreSQL

high availability architectures for MySQL and PostgreSQL

When it comes to access to their applications, users demand instant, reliable, and secure interactions — and that means databases must be highly available.

With database high availability (HA), services are largely uninterrupted, and end users are largely satisfied. Without high availability, there’s more-than-negligible downtime, and end users can become non-users (as in, former customers). A business can also incur reputational damage and face penalties for not meeting Service Level Agreements (SLAs).

Open source databases provide great foundations for high availability — without the pitfalls of vendor lock-in that can come with proprietary software. However, open source software doesn’t typically include built-in HA solutions. Sure, you can get there with the right extensions and tools, but it can be a long, burdensome, and potentially expensive process. So why not use a proven architecture instead of starting from scratch on your own?

This blog provides links to such architectures — for MySQL and PostgreSQL software. They’re proven and ready-to-go. You can use these Percona architectures to build highly available PostgreSQL or MySQL environments or have our experts do the heavy lifting for you. Either way, the architectures provide outlines for building databases that keep operations running optimally, even during peak usage or amid technical challenges caused by anything from brief outages to disasters.

First, let’s quickly examine what’s at stake and discuss standards for protecting those assets.

Importance of high availability architecture

As indicated, an HA architecture provides the blueprint for building a database that assures the continuity of critical business operations, even amid crashes, incursions, outages, and other threats. Conversely, choosing a piecemeal approach — one in which you attempt to build a database through the trial and error of various tools and extensions  — can leave your system vulnerable.

That vulnerability can be costly: A 2022 ITIC survey found that the cost of downtime is greater than $300,000 per hour for 91% of small-, mid-size, and large enterprises. Among just the mid-size and large respondents, 44% said a single hour of downtime could potentially cost them more than $1 million.

The ultimate goal of HA architecture

So what’s the ultimate goal of using an HA architecture? The obvious answer is this: To achieve high availability. That can mean different things for different businesses, but within IT, 99.999% (“five-nines”) of availability is the gold standard of database availability.

It really depends on how much downtime you can bear. With streaming services, for example, excessive downtime could result in significant financial and reputational losses for the business. Elsewhere, millions can be at stake for financial institutions, and lives can be at stake in the healthcare industry. Other organizations can tolerate a few minutes of downtime without negatively affecting or irking their end users. (Now, there’s a golden rule to go with the golden standard: Don’t irk the end user!)

The following table shows the amount of downtime for each level of availability, from “two nines” to “five nines.” You’ll see that even “five nines” doesn’t deliver 100% uptime, but it’s close.

high availability

The immediate (working) goal and requirements of HA architecture

The more immediate, working goal of an HA architecture is to bring together a combination of extensions, tools, hardware, software, etc., and package them in a design (blueprint) for a database infrastructure fit to perform optimally amid demanding conditions. That design will depict an infrastructure of high availability nodes/clusters that work together (or separately, if necessary) so that if one goes down, another one takes over.

Proven architectures — including those we share in this blog — have met several high availability requirements. When those requirements are met, databases will include:

  • Redundancy: Critical components of a database are duplicated so that if one component fails, the functionality continues by using a redundant component. For example, in a server cluster, multiple servers are used to host the same application so that if one server fails, the application can continue to run on the other servers.
  • Load balancing: Traffic is distributed across multiple servers to prevent any one component from becoming overloaded. Load balancers can detect when a component is not responding and put traffic redirection in motion.
  • No single point of failure (SPOF): This is both an exclusion and an inclusion for the architecture. There cannot be any single point of failure in the database environment, including any physical or virtual hardware the database system relies on whose failure would bring the database down. So there must be redundant components whose function, in part, is ensuring there’s no SPOF.
  • Failure detection: Monitoring mechanisms detect failures or issues that could lead to failures. Alert mechanisms report those failures or issues so that they are addressed immediately.
  • Failover: This involves automatically switching to a redundant component when the primary component fails. If a primary server fails, a backup server can take over and continue to serve requests.
  • Cluster and connection management: This includes software for automated provisioning, configuration, and scaling of database nodes. Clustering solutions typically bundle with a connection manager. However, in asynchronous clusters, deploying a connection manager is mandatory for high availability.
  • Automated backup, continuous archiving, and recovery: This is extremely important if replication lags and a replica node can’t keep pace with the primary. The backed-up and archived files can also be used for point-in-time recovery if a disaster occurs (a minimal archiving sketch follows this list).
  • Scalability: HA architecture should support scalability that enables automated management of increased workloads and data volume. This can be achieved through techniques like sharding, where the data is partitioned and distributed across multiple nodes, or by adding more nodes to the cluster as needed.
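As a minimal sketch of the continuous-archiving piece in PostgreSQL (the archive destination path is an assumption for illustration):

ALTER SYSTEM SET archive_mode = 'on';  -- requires a server restart
ALTER SYSTEM SET archive_command = 'cp %p /mnt/wal_archive/%f';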

Get going (or get some help) using proven Percona architectures

Designing and implementing a highly available database environment requires considerable time and expertise. So instead of you having to select, configure, and test those configurations to build a highly available database environment, why not use ours? You can use Percona architectures on your own, call on us as needed, or have us do it all for you.

High availability MySQL architectures

Check out Percona architecture and deployment recommendations, along with a technical overview, for a MySQL solution that provides a high level of availability and assumes high read/write application workloads.

Percona Distribution for MySQL: High Availability with Group Replication

 

If you need even more high availability in your MySQL database, check out Percona XtraDB Cluster (PXC).

High availability PostgreSQL architectures

Access a PostgreSQL architecture description and deployment recommendations, along with a technical overview, for a solution that provides high availability for mixed-workload applications.

 

Percona Distribution for PostgreSQL: High Availability with Streaming Replication

 

View a disaster recovery architecture for PostgreSQL, with deployment recommendations based on Percona best practices.

PostgreSQL: Disaster Recovery

 

Here are additional links to Percona architectures for high availability PostgreSQL databases:

Highly Available PostgreSQL From Percona

Achieving High Availability on PostgreSQL With Open Source Tools

High Availability in PostgreSQL with Patroni

High Availability MongoDB From Percona

Percona offers support for MongoDB clusters in any environment. Our experienced support team is available 24x7x365 to ensure continual high performance from your MongoDB systems.

 

Percona Operator for MongoDB Design Overview

Dec
14
2021
--

High Availability and Disaster Recovery Recipes for PostgreSQL on Kubernetes

High Availability and Disaster Recovery PostgreSQL on Kubernetes

Percona Distribution for PostgreSQL Operator allows you to deploy and manage highly available and production-grade PostgreSQL clusters on Kubernetes with minimal manual effort. In this blog post, we are going to look deeper into High Availability, Disaster Recovery, and Scaling of PostgreSQL clusters.

High Availability

Our default custom resource manifest deploys a highly available (HA) PostgreSQL cluster. Key components of HA setup are:

  • Kubernetes Services that point to pgBouncer and replica nodes
  • pgBouncer – a lightweight connection pooler for PostgreSQL
  • Patroni – HA orchestrator for PostgreSQL
  • PostgreSQL nodes – we have one primary and two replica nodes in hot standby by default

high availability postgresql

Kubernetes Service is the way to expose your PostgreSQL cluster to applications or users. We have two services:

  • clusterName-pgbouncer

    – Exposes your PostgreSQL cluster through the pgBouncer connection pooler. Both reads and writes are sent to the Primary node.

  • clusterName-replica

    – Exposes replica nodes directly. It should be used for reads only. Also, keep in mind that connections to this service are not pooled. We are working on a better solution, where the user would be able to leverage both connection pooling and read-scaling through a single service.

By default, we use the ClusterIP service type, but you can change it in pgBouncer.expose.serviceType or pgReplicas.hotStandby.expose.serviceType, respectively.

Every PostgreSQL container runs Patroni. Patroni monitors the state of the cluster and, in case of Primary node failure, switches the Primary role to one of the replica nodes. pgBouncer always knows where the Primary is.

As you can see, we distribute PostgreSQL cluster components across different Kubernetes nodes. This is done with affinity rules, which are applied by default to ensure that a single node failure does not cause database downtime.

Multi-Datacenter with Multi-AZ

Good architecture design is to run your Kubernetes cluster across multiple data centers. Public clouds have the concept of availability zones (AZs), which are data centers within one region with a low-latency network connection between them. These data centers are physically separated from each other to minimize the probability of a shared outage. You can leverage multi-AZ Kubernetes deployment to run cluster components in different data centers for better availability.

Multi-Datacenter with Multi-AZ

To ensure that PostgreSQL components are distributed across availability zones, you need to tweak affinity rules. Currently, this is only possible by editing Deployment resources directly:

$ kubectl edit deploy cluster1-repl2
…
-            topologyKey: kubernetes.io/hostname
+            topologyKey: topology.kubernetes.io/zone

Scaling

Scaling PostgreSQL to meet the demand at peak hours is crucial for high availability. Our Operator provides you with tools to scale PostgreSQL components both horizontally and vertically.

Vertical Scaling

Scaling vertically is all about adding more power to a PostgreSQL node. The recommended way is to change resources in the Custom Resource (instead of changing them in Deployment objects directly). For example, change the following in cr.yaml to get 256 MBytes of RAM for all PostgreSQL Replica nodes:

  pgReplicas:
    hotStandby:
      resources:
        requests:
-         memory: "128Mi"
+         memory: "256Mi"

Apply cr.yaml:

$ kubectl apply -f cr.yaml

Use the same approach to tune other components in their corresponding sections.

You can also leverage the Vertical Pod Autoscaler (VPA) to react to load spikes automatically. We create a Deployment resource for the Primary and each Replica node, and VPA objects should target these Deployments. The following example will track one of the replica Deployment resources of cluster1 and scale it automatically:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: pxc-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: cluster1-repl1
    namespace: pgo
  updatePolicy:
    updateMode: "Auto"

Please read more about VPA and its capabilities in its documentation.

Horizontal Scaling

Adding more replica nodes or pgBouncers can be done by changing size parameters in the Custom Resource. Make the following change in the default cr.yaml:

  pgReplicas:
    hotStandby:
-      size: 2
+      size: 3

Apply the change to get one more PostgreSQL Replica node:

$ kubectl apply -f cr.yaml

Starting from release 1.1.0, it is also possible to scale the cluster using the kubectl scale command. Execute the following to have two PostgreSQL replica nodes in cluster1:

$ kubectl scale --replicas=2 perconapgcluster/cluster1
perconapgcluster.pg.percona.com/cluster1 scaled

In the latest release, it is not yet possible to use the Horizontal Pod Autoscaler (HPA); we will support it in the next one. Stay tuned.

Disaster Recovery

It is important to understand that Disaster Recovery (DR) is not High Availability. DR’s goal is to ensure business continuity in the case of a massive disaster, such as a full region outage. Recovery in such cases can, of course, be automated, but it doesn’t have to be; it strictly depends on the business requirements.

Disaster Recovery postgresql

Backup and Restore

This is probably the most common disaster recovery protocol: take a backup, store it on third-party premises, and restore it to another data center if needed.

This approach is simple, but comes with a long recovery time, especially if the database is big. Use this method only if it passes your Recovery Time Objectives (RTO).

Recovery Time Objectives

Our Operator handles backup and restore for PostgreSQL clusters. The disaster recovery is built around pgBackrest and looks like the following:

  1. Configure pgBackrest to upload backups to S3 or GCS (see our documentation for details).
  2. Create the backup manually (through pgTask) or ensure that a scheduled backup was created. 
  3. Once the Main cluster fails, create the new cluster in the Disaster Recovery data center. The cluster must be running in standby mode and pgBackrest must be pointing to the same repository as the main cluster:
spec:
  standby: true
  backup:
  # same config as on original cluster

Once data is recovered, the user can turn off standby mode and switch the application to the DR cluster.

Continuous Restoration

This approach is quite similar to the above: pgBackrest instances continuously synchronize data between two clusters through object storage. This approach minimizes RTO and allows you to switch the application traffic to the DR site almost immediately. 

Continuous Restoration postgresql

Configuration here is similar to the previous case, but we always run a second PostgreSQL cluster in the Disaster Recovery data center. In case of main site failure just turn off the standby mode:

spec:
  standby: false

You can use a similar setup to migrate the data to and from Kubernetes. Read more about it in the Migrating PostgreSQL to Kubernetes blog post.

Conclusion

Kubernetes Operators provide a ready-to-use service, and in the case of Percona Distribution for PostgreSQL Operator, the user gets a production-grade, highly available database cluster. In addition, the Operator provides day-2 operation capabilities and automates the day-to-day routine.

We encourage you to try out our operator. See our GitHub repository and check out the documentation.

Found a bug or have a feature idea? Feel free to submit it in JIRA.

For general questions, please raise the topic in the community forum.

Are you a developer looking to contribute? Please read our CONTRIBUTING.md and send us a pull request.

Dec
09
2021
--

A Look Into Percona XtraDB Cluster Non-Blocking Operation for Online Schema Upgrade

Percona XtraDB Cluster Non-Blocking Operation

Percona XtraDB Cluster 8.0.25 (PXC) has introduced a new option to perform online schema modifications: NBO (Non-Blocking Operation).

When using PXC, the cluster relies on the wsrep_OSU_method parameter to define the Online Schema Upgrade (OSU) method the node uses to replicate DDL statements. 

Until now, we normally have three options:

  • Use Total Order Isolation (TOI, the default)
  • Use Rolling Schema Upgrade (RSU)
  • Use Percona’s online schema change tool (TOI + PTOSC)

Each method has some positive and negative aspects. TOI will lock the whole cluster from being able to accept data modifications for the entire time it takes to perform the DDL operation. RSU will misalign the schema definition between the nodes, and in any case, the node performing the DDL operation is still locked. Finally, TOI+PTOSC will rely on creating triggers and copying data, so in some cases, this can be very impactful. 

The new Non-Blocking Operation (NBO) method helps reduce the impact on the cluster and makes certain DDL operations easier to perform.

At the moment, NBO supports only a limited set of operations, such as:

  • ALTER INDEX
  • CREATE INDEX
  • DROP INDEX

Any other command will result in an error message ER_NOT_SUPPORTED_YET.

But let us see how it works and what the impact is, comparing it with the default method, TOI.

What we will do is work with five connections:

1 – to perform the DDL
2 – to insert data into the table being altered
3 – to insert data into a different table
4-5 – to check operations on the other two nodes

PXC must be at least Version 8.0.25-15.1.

The table we will modify is:

DC1-1(root@localhost) [windmills_s]>show create table windmills_test\G
*************************** 1. row ***************************
       Table: windmills_test
Create Table: CREATE TABLE `windmills_test` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `uuid` char(36) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `millid` smallint NOT NULL,
  `kwatts_s` int NOT NULL,
  `date` date NOT NULL,
  `location` varchar(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `active` tinyint NOT NULL DEFAULT '1',
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `strrecordtype` char(3) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`),
  KEY `IDX_millid` (`millid`,`active`),
  KEY `IDX_active` (`id`,`active`),
  KEY `kuuid_x` (`uuid`),
  KEY `millid_x` (`millid`),
  KEY `active_x` (`active`)
) ENGINE=InnoDB AUTO_INCREMENT=8199260 DEFAULT CHARSET=utf8mb3 COLLATE=utf8_bin ROW_FORMAT=DYNAMIC
1 row in set (0.00 sec)

The table contains ~five million rows.

DC1-1(root@localhost) [windmills_s]>select count(*) from windmills_test;
+----------+
| count(*) |
+----------+
|  5002909 |
+----------+
1 row in set (0.44 sec)

The Commands

Connection 1:

ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE;
ALTER TABLE windmills_test DROP INDEX idx_1, ALGORITHM=INPLACE;

 

Connection 2:

while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/PXC8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "insert into windmills_test  select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills7 limit 1;" -e "select count(*) from windmills_s.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done

 

Connection 3:

while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/PXC8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "insert into windmills8  select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills7 limit 1;" -e "select count(*) from windmills_s.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done

 

Connections 4-5:

while [ 1 = 1 ];do echo "$(date +'%T.%3N')";/opt/mysql_templates/PXC8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "show full processlist;"|egrep -i -e "(windmills_test|windmills_s)"|grep -i -v localhost;sleep 1;done

Operations

  • Start inserts from connections 2 and 3
  • Start the monitoring commands in connections 4-5 on the other nodes
  • Execute: 
    • For TOI 
      • DC1-1(root@localhost) [windmills_s]>SET SESSION wsrep_OSU_method=TOI;
    • For NBO
      • DC1-1(root@localhost) [windmills_s]>SET SESSION wsrep_OSU_method=NBO;
    • For both
      • DC1-1(root@localhost) [windmills_s]>ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=shared;

Let’s Run It

Altering a Table with TOI

DC1-1(root@localhost) [windmills_s]>ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE;
Query OK, 0 rows affected (1 min 4.74 sec)
Records: 0  Duplicates: 0  Warnings: 0

 

Inserts in the altering table (connection 2):

.450
.492
64.993 <--- Alter blocks all inserts on the table we are altering
.788
.609

 

Inserts on the other table (connection 3):

.455
.461
64.161 <--- Alter blocks all inserts on all the other tables as well
.641
.483

 

On the other nodes at the same time of the ALTER we can see:

Id  User             db         Command Time  State             Info                                                                            Time_ms Rows_sent Rows_examined 
15	system user		windmills_s	Query	102	  altering table	ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE	102238	0	       0    <--- time from start

In short, we have the whole cluster locked for ~64 seconds. During this period of time, all the operations to modify data or structure were on hold. 

Let’s Now Try With NBO

Inserts in the altering table:

.437
.487
120.758 <---- Execution time increase
.617
.510

 

Inserts on the other table:

.468
.485
25.061 <---- still a metalock, but not locking the other tables for the whole duration 
.494
.471

 

On the other nodes at the same time of the ALTER we can see:

Id      User         db             Command Time  State             Info                                                                            Time_ms Rows_sent Rows_examined 
110068	system user	 windmills_s	Connect	86	  altering table	ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE	 120420	 0	          0

 

In this case, what is also interesting to note is that:

  1. We have a moment of metalock:
    1. 110174 pmm 127.0.0.1:42728 NULL Query 2 Waiting for table metadata lock  SELECT x FROM information_schema.tables  WHERE TABLE_SCHEMA = 'windmills_s' 1486 10    0
    2. 110068  system user connecting host windmills_s Connect 111 closing tables ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE 111000 0 0
  2. The execution time is longer

Summarizing:

                                                TOI           NBO
Time on hold for inserts on the altered table   ~64 sec       ~120 sec
Time on hold for inserts on another table       ~64 sec       ~25 sec
Metadata lock                                   whole time    only briefly at start/end

What is Happening? What are the Differences and Why Does it Take Longer with NBO?

Let’s see at a very high level how the two work:

  • TOI: when you issue a DDL like ADD INDEX a metadata lock is taken on the table and it will be released only at the end of the operation. During this time, you cannot:
    • Perform DMLs on any cluster node
    • Alter another table in the cluster
  • NBO: the metadata lock is taken at the start and at the end for a very brief period of time. The ADD INDEX operation will then work on each node independently. The lock taken at the end is to have all the nodes agree on the operation and commit or rollback (using cluster error voting). This final phase costs a bit more in time and is what adds a few seconds to the operation execution. But during the operation:
    • You can alter another table (using NBO)
    • You can continue to insert data, except in the table(s) you are altering.
    • On node crash, the operation will continue on the other nodes, and if successful it will persist.  

In short, the cluster’s behavior changes significantly when using NBO, offering far more flexibility than TOI. The time cost should not increase linearly with the size of the table; it depends more on how efficiently a single node performs the ALTER operation.

Conclusion

NBO can significantly reduce the impact of DDL on the cluster; for now, it is limited to the widely used creation/modification/drop of an index. But in the future … we may expand it.

The feature is still a technology preview, so do not trust it in production, but test it and let us know what you think. 

Final comment: another distribution has introduced NBO, but only if you buy the enterprise version.

Percona, which is truly open source with facts, not just words, has implemented NBO in standard PXC, and the code is fully open source. This is not the first such feature, but just another of the many Percona offers for free while others ask you to buy an enterprise version.

Enjoy the product and let us have your feedback! Great MySQL to all! 

May
04
2021
--

Inconsistent Voting in Percona XtraDB Cluster

Cluster Error Voting

AKA Cluster Error Voting…

What is Cluster Error Voting (CEV)?

“Cluster Error Voting is a new feature implemented by Alexey Yurchenko, and it is a protocol for nodes to decide how the cluster will react to problems in replication. When one or several nodes have an issue applying an incoming transaction(s) (e.g., suspected inconsistency), this new feature helps. In a 5-node cluster, if 2-nodes fail to apply the transaction, they get removed, and a DBA can go in to fix what went wrong so that the nodes can rejoin the cluster. (Seppo Jaakola)”

This feature was ported to Percona XtraDB Cluster (PXC) in version 8.0.21. As indicated above, it is about increasing the resilience of the cluster, especially when TWO nodes fail to operate and may drop from the cluster abruptly. The protocol is activated in a cluster with any number of nodes.

Before CEV, if a node had a problem/error during a transaction, the node having the issue would report the error in its own log and exit the cluster:

2021-04-23T15:18:38.568903Z 11 [ERROR] [MY-010584] [Repl] Slave SQL: Could not execute Write_rows event on table test.test_voting; Duplicate entry '21' for key 'test_voting.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 0, Error_code: MY-001062
2021-04-23T15:18:38.568976Z 11 [Warning] [MY-000000] [WSREP] Event 3 Write_rows apply failed: 121, seqno 16
2021-04-23T15:18:38.569717Z 11 [Note] [MY-000000] [Galera] Failed to apply write set: gtid: 224fddf7-a43b-11eb-84d5-2ebf2df70610:16 server_id: d7ae67e4-a43c-11eb-861f-8fbcf4f1cbb8 client_id: 40 trx_id: 115 flags: 3
2021-04-23T15:18:38.575439Z 11 [Note] [MY-000000] [Galera] Closing send monitor...
2021-04-23T15:18:38.575578Z 11 [Note] [MY-000000] [Galera] Closed send monitor.
2021-04-23T15:18:38.575647Z 11 [Note] [MY-000000] [Galera] gcomm: terminating thread
2021-04-23T15:18:38.575737Z 11 [Note] [MY-000000] [Galera] gcomm: joining thread
2021-04-23T15:18:38.576132Z 11 [Note] [MY-000000] [Galera] gcomm: closing backend
2021-04-23T15:18:38.577954Z 11 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,3206d174,5)
memb {
	727c277a,1
	}
joined {
	}
left {
	}
partitioned {
	3206d174,1
	d7ae67e4,1
	}
)
2021-04-23T15:18:38.578109Z 11 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0
2021-04-23T15:18:38.578158Z 11 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view ((empty))
2021-04-23T15:18:38.578640Z 11 [Note] [MY-000000] [Galera] gcomm: closed
2021-04-23T15:18:38.578747Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1

While the other nodes will “just” report the node as out of the view:

2021-04-23T15:18:38.561402Z 0 [Note] [MY-000000] [Galera] forgetting 727c277a (tcp://10.0.0.23:4567)
2021-04-23T15:18:38.562751Z 0 [Note] [MY-000000] [Galera] Node 3206d174 state primary
2021-04-23T15:18:38.570411Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,3206d174,6)
memb {
	3206d174,1
	d7ae67e4,1
	}
joined {
	}
left {
	}
partitioned {
	727c277a,1
	}
)
2021-04-23T15:18:38.570679Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2021-04-23T15:18:38.574592Z 0 [Note] [MY-000000] [Galera] forgetting 727c277a (tcp://10.0.0.23:4567)
2021-04-23T15:18:38.574716Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2

With CEV, we have a different process. Let us review it with images first.

Let us start with a cluster…

3 Nodes, where only one works as Primary.

Primary writes and, as expected, writesets are distributed on all nodes.

insert into test_voting values(null,REVERSE(UUID()), NOW()); <-- Few times

DC1-1(root@localhost) [test]>select * from test_voting;
+----+--------------------------------------+---------------------+
| id | what                                 | when                |
+----+--------------------------------------+---------------------+
|  3 | 05de43720080-938a-be11-305a-6d135601 | 2021-04-24 14:43:34 |
|  6 | 05de43720080-938a-be11-305a-7eb60711 | 2021-04-24 14:43:36 |
|  9 | 05de43720080-938a-be11-305a-6861c221 | 2021-04-24 14:43:37 |
| 12 | 05de43720080-938a-be11-305a-d43f0031 | 2021-04-24 14:43:38 |
| 15 | 05de43720080-938a-be11-305a-53891c31 | 2021-04-24 14:43:39 |
+----+--------------------------------------+---------------------+
5 rows in set (0.00 sec)

Some inexperienced DBA does a manual operation on a secondary using the very unsafe feature wsrep_on…
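A hedged sketch of what that typically means in practice: with this session setting, subsequent writes are applied locally only and are not replicated to the rest of the cluster, which is exactly why it is so unsafe.

SET SESSION wsrep_on = OFF;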

And then, by mistake or because he did not understand what he was doing…

insert into test_voting values(17,REVERSE(UUID()), NOW()); <-- with a few different ids

At the end of the operation, the Secondary node will have:

DC1-2(root@localhost) [test]>select * from test_voting;
+----+--------------------------------------+---------------------+
| id | what                                 | when                |
+----+--------------------------------------+---------------------+
|  3 | 05de43720080-938a-be11-305a-6d135601 | 2021-04-24 14:43:34 |
|  6 | 05de43720080-938a-be11-305a-7eb60711 | 2021-04-24 14:43:36 |
|  9 | 05de43720080-938a-be11-305a-6861c221 | 2021-04-24 14:43:37 |
| 12 | 05de43720080-938a-be11-305a-d43f0031 | 2021-04-24 14:43:38 |
| 15 | 05de43720080-938a-be11-305a-53891c31 | 2021-04-24 14:43:39 |
| 16 | 05de43720080-a39a-be11-405a-82715600 | 2021-04-24 14:50:17 |
| 17 | 05de43720080-a39a-be11-405a-f9d62e22 | 2021-04-24 14:51:14 |
| 18 | 05de43720080-a39a-be11-405a-f5624662 | 2021-04-24 14:51:20 |
| 19 | 05de43720080-a39a-be11-405a-cd8cd640 | 2021-04-24 14:50:23 |
+----+--------------------------------------+---------------------+

This is not in line with the rest of the cluster, which still has the previous data. Then our guy puts the node back:

At this point, the Primary does another insert in that table and:

Houston, we have a problem! 

The secondary node already has the entry with that ID and cannot perform the insert:

2021-04-24T13:52:51.930184Z 12 [ERROR] [MY-010584] [Repl] Slave SQL: Could not execute Write_rows event on table test.test_voting; Duplicate entry '18' for key 'test_voting.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 0, Error_code: MY-001062
2021-04-24T13:52:51.930295Z 12 [Warning] [MY-000000] [WSREP] Event 3 Write_rows apply failed: 121, seqno 4928120

But instead of exiting the cluster, it will raise a verification through voting:

2021-04-24T13:52:51.932774Z 0 [Note] [MY-000000] [Galera] Member 0(node2) initiates vote on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120,878ded7898c83a72:  Duplicate entry '18' for key 'test_voting.PRIMARY', Error_code: 1062;
2021-04-24T13:52:51.932888Z 0 [Note] [MY-000000] [Galera] Votes over ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120:
   878ded7898c83a72:   1/3
Waiting for more votes.
2021-04-24T13:52:51.936525Z 0 [Note] [MY-000000] [Galera] Member 1(node3) responds to vote on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120,0000000000000000: Success
2021-04-24T13:52:51.936626Z 0 [Note] [MY-000000] [Galera] Votes over ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120:
   0000000000000000:   1/3
   878ded7898c83a72:   1/3
Waiting for more votes.
2021-04-24T13:52:52.003615Z 0 [Note] [MY-000000] [Galera] Member 2(node1) responds to vote on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120,0000000000000000: Success
2021-04-24T13:52:52.003722Z 0 [Note] [MY-000000] [Galera] Votes over ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120:
   0000000000000000:   2/3
   878ded7898c83a72:   1/3
Winner: 0000000000000000

As you can see, each node informs the cluster about the success or failure of the operation, and the majority wins.

Once the majority has identified the operation as legitimate, the node that initiated the voting must leave the cluster:

2021-04-24T13:52:52.038510Z 12 [ERROR] [MY-000000] [Galera] Inconsistency detected: Inconsistent by consensus on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120
	 at galera/src/replicator_smm.cpp:process_apply_error():1433
2021-04-24T13:52:52.062666Z 12 [Note] [MY-000000] [Galera] Closing send monitor...
2021-04-24T13:52:52.062750Z 12 [Note] [MY-000000] [Galera] Closed send monitor.
2021-04-24T13:52:52.062796Z 12 [Note] [MY-000000] [Galera] gcomm: terminating thread
2021-04-24T13:52:52.062880Z 12 [Note] [MY-000000] [Galera] gcomm: joining thread
2021-04-24T13:52:52.063372Z 12 [Note] [MY-000000] [Galera] gcomm: closing backend
2021-04-24T13:52:52.085853Z 12 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,65a111c6-bb0f,23)
memb {
	65a111c6-bb0f,2
	}
joined {
	}
left {
	}
partitioned {
	aae38617-8dd5,2
	dc4eaa39-b39a,2
	}
)
2021-04-24T13:52:52.086241Z 12 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0
2021-04-24T13:52:52.086391Z 12 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view ((empty))
2021-04-24T13:52:52.150106Z 12 [Note] [MY-000000] [Galera] gcomm: closed
2021-04-24T13:52:52.150340Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1

It is also nice to note that we now have a decent level of information about what happened on the other nodes; the log below is from the Primary:

2021-04-24T13:52:51.932829Z 0 [Note] [MY-000000] [Galera] Member 0(node2) initiates vote on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120,878ded7898c83a72:  Duplicate entry '18' for key 'test_voting.PRIMARY', Error_code: 1062;
2021-04-24T13:52:51.978123Z 0 [Note] [MY-000000] [Galera] Votes over ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120:
…<snip>
2021-04-24T13:52:51.981647Z 0 [Note] [MY-000000] [Galera] Votes over ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120:
   0000000000000000:   2/3
   878ded7898c83a72:   1/3
Winner: 0000000000000000
2021-04-24T13:52:51.981887Z 11 [Note] [MY-000000] [Galera] Vote 0 (success) on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120 is consistent with group. Continue.
2021-04-24T13:52:52.064685Z 0 [Note] [MY-000000] [Galera] declaring aae38617-8dd5 at tcp://10.0.0.31:4567 stable
2021-04-24T13:52:52.064885Z 0 [Note] [MY-000000] [Galera] forgetting 65a111c6-bb0f (tcp://10.0.0.21:4567)
2021-04-24T13:52:52.066916Z 0 [Note] [MY-000000] [Galera] Node aae38617-8dd5 state primary
2021-04-24T13:52:52.071577Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,aae38617-8dd5,24)
memb {
	aae38617-8dd5,2
	dc4eaa39-b39a,2
	}
joined {
	}
left {
	}
partitioned {
	65a111c6-bb0f,2
	}
)
2021-04-24T13:52:52.071683Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2021-04-24T13:52:52.075293Z 0 [Note] [MY-000000] [Galera] forgetting 65a111c6-bb0f (tcp://10.0.0.21:4567)
2021-04-24T13:52:52.075419Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2

At this point, a DBA can start to investigate and manually fix the inconsistency and have the node rejoin the cluster. In the meantime, the rest of the cluster continues to operate:

+----+--------------------------------------+---------------------+
| id | what                                 | when                |
+----+--------------------------------------+---------------------+
|  3 | 05de43720080-938a-be11-305a-6d135601 | 2021-04-24 14:43:34 |
|  6 | 05de43720080-938a-be11-305a-7eb60711 | 2021-04-24 14:43:36 |
|  9 | 05de43720080-938a-be11-305a-6861c221 | 2021-04-24 14:43:37 |
| 12 | 05de43720080-938a-be11-305a-d43f0031 | 2021-04-24 14:43:38 |
| 15 | 05de43720080-938a-be11-305a-53891c31 | 2021-04-24 14:43:39 |
| 18 | 05de43720080-938a-be11-405a-d02c7bc5 | 2021-04-24 14:52:51 |
+----+--------------------------------------+---------------------+

Conclusion

Cluster Error Voting (CEV) is a nice feature to have. It helps us better understand what goes wrong, increases the stability of the cluster, and, with voting, provides a better way to manage node expulsion.

Another aspect is visibility: never underestimate the value of having the information available on other nodes as well. It may help investigations if the log on the failing node is lost (for any reason).

We still do not have active tuple certification, but it is a good step, especially given the history we have seen of data drift in PXC/Galera in these 12 years of utilization.

My LAST comment: while I agree WSREP_ON can be a very powerful tool in the hands of experts, as indicated in my colleague’s blog How to Perform Compatible Schema Changes in Percona XtraDB Cluster (Advanced Alternative), that option remains DANGEROUS, and you should never use it UNLESS your name is Przemysław Malkowski and you really know what you are doing.

Great MySQL to everybody!

References

https://www.percona.com/doc/percona-xtradb-cluster/8.0/release-notes/Percona-XtraDB-Cluster-8.0.21-12.1.html

Galera Clustering in MariaDB 10.5 and beyond – Seppo Jaakola – MariaDB Server Fest 2020

 


Apr
14
2021
--

Percona Distribution for MySQL: High Availability with Group Replication Solution

This blog provides high availability (HA) guidelines using group replication architecture and deployment recommendations in MySQL, based on our best practices.

Every architecture and deployment depends on the customer requirements and application demands for high availability and the estimated level of usage. For example, an application may be read-heavy, write-heavy, or both, and may need 99.999% availability.

Here, we give architecture and deployment recommendations, along with a technical overview, for a solution that provides a high level of availability and assumes high read/write application workloads (20k or more queries per second).

Layout

MySQL High Availability with Group Replication

Components

This architecture is composed of two main layers:

  • Connection and distribution layer
  • RDBMS (Relational Database Management System) layer

Connection Layer

The connection layer is composed of:

  • Application to proxy redirection mechanism which can be anything from a Virtual IP managed by Keepalived local service to a DNS resolution service like Amazon Route 53. Its function is to redirect the traffic to the active Proxy node.
  • Proxy connection distribution is composed of two or more nodes. Its role is to redirect the traffic to the active nodes of the Group Replication cluster. In cases like ProxySQL, where the proxy is a layer 7 proxy able to perform read/write splitting, this layer is also in charge of redirecting writes to the Primary node and reads to the Replicas, and of providing HA to prevent a single point of failure

Data Layer

The data layer is composed of:

  • Primary node serving writes (or source) – this is the node that will accept writes and DDL modifications. Data will be processed following the ACID paradigm (atomicity, consistency, isolation, durability) and replicated to all other nodes.
  • Replica nodes are the elements serving read requests. Some replica nodes can be elected Primary in case of Primary node failure. A replica node should be able to leave and join back a healthy cluster without impacting the service.
  • Replication mechanism to distribute changes across nodes and in this solution is done with Group Replication. Group Replication is a tightly coupled solution which means that the database cluster is based on a Datacentric approach (single state of the data, distributed commit). In this case, the data is consistent in time across nodes and replication requires a high performant link. Given that, geographic distribution is strongly discouraged and Disaster Recovery (DR) is not implicitly supported by the main Group Replication mechanism. 

The node characteristics (CPU/RAM/Storage) are not relevant to the main solution design.  They instead must reflect the estimated workload the solution will have to cover, which is a case-by-case identification. 

What is important to keep in mind is that all nodes that are part of the cluster must have the same characteristics.  If they don’t, the cluster will be imbalanced and service will be affected.

As a generic indication, we recommend using solutions with at least 8 cores and 16GB of RAM when in production.

High Availability

How do we measure availability and at what point does it become “high” availability?

Generally speaking, availability is measured by establishing a measurement time frame and dividing the time the system was available by that time frame. This ratio will rarely be exactly 1, which would equal 100% availability. At Percona, we don’t consider a solution to be highly available if it is not at least 99% or “two nines” available.

Availability %               Downtime per year   Downtime per month   Downtime per week   Downtime per day
99% ("two nines")            3.65 days           7.31 hours           1.68 hours          14.40 minutes
99.5% ("two nines five")     1.83 days           3.65 hours           50.40 minutes       7.20 minutes
99.9% ("three nines")        8.77 hours          43.83 minutes        10.08 minutes       1.44 minutes
99.95% ("three nines five")  4.38 hours          21.92 minutes        5.04 minutes        43.20 seconds
99.99% ("four nines")        52.60 minutes       4.38 minutes         1.01 minutes        8.64 seconds
99.995% ("four nines five")  26.30 minutes       2.19 minutes         30.24 seconds       4.32 seconds
99.999% ("five nines")       5.26 minutes        26.30 seconds        6.05 seconds        864.00 milliseconds

How is High Availability Achieved?

There are three key components to high availability:

  1. Infrastructure – This is the physical or virtual hardware that database systems rely on to run. Without enough infrastructure (VMs, networking, etc.), there cannot be high availability. The simplest example: there is no way to make a single server highly available.
  2. Topology Management – This is the software management related specifically to the database and to its ability to stay consistent in the event of a failure. Many clustering or synchronous replication solutions offer this capability out of the box; for asynchronous replication, however, it must be handled by additional software.
  3. Connection Management – This is the software management related specifically to the networking and connectivity aspect of the database. Clustering solutions typically bundle a connection manager; in asynchronous clusters, however, deploying a connection manager is mandatory for high availability.

This Solution Provides:

The proposed solution, based on a tightly coupled database cluster, offers an HA level of 99.995% when coupled with the Group Replication setting group_replication_consistency=AFTER.


Failovers

If properly planned and architected, a database failure or a configuration change that requires a restart shouldn't affect the stability of the database infrastructure. Failovers are an integral part of a stability strategy, and aligning the business requirements for availability and uptime with failover methodologies is critical to achieving those goals. Below are the three main types of failover that can occur in database environments.

  • Planned Failover: A planned failover is one that has been scheduled in advance or occurs at a regular interval. There can be many reasons for planned failovers, including patching, large data operations, retiring existing infrastructure, or simply testing the failover strategy.
  • Unplanned Failover: An unplanned failover occurs when a database unexpectedly becomes unresponsive or experiences instability. This can also include emergency changes that do not fall under the planned failover cadence or scheduling parameters. Unplanned failovers are generally considered higher-risk operations due to the high stress and the high potential for data corruption or data fragmentation.
  • Regional or Disaster Recovery Failover: Unplanned failovers still assume that additional database infrastructure is immediately available and in a usable state. In a regional or DR failover, we assume a large-scale infrastructure outage that requires the business to move its operations away from its current availability zone.

This solution covers both planned and unplanned failovers.

Maintenance Windows

Major vs. Minor Maintenance: Although it may not be obvious at first, not all maintenance activities are created equal, and they do not all have the same dependencies. It is good to separate maintenance that demands downtime or a failover from maintenance that can be done without impacting those important stability metrics. Defining these maintenance dependencies can change the actual maintenance process and allow for a different cadence.

Maintenance Without Service Interruption: With rolling restarts and a proper version-upgrade procedure, it is possible to cover both major and minor maintenance without service interruption.

Uptime

When referring to database stability, uptime is likely the largest indicator and is often the most obvious symptom of an unstable database environment. Uptime is composed of three key components and, contrary to common perception, is defined by whether the database software is able to take incoming requests at all, not by whether it can take requests without errors.

Recovery Time Objective (RTO): This can be boiled down to a very simple question: "How long can the business sustain a database outage?" Once the business is aligned on a minimum viable recovery time objective, it is much more straightforward to plan and invest in the infrastructure required to meet it. It is important to acknowledge that while everyone desires 100% uptime, expectations need to be realistic and aligned with the business needs, not a technical desire.

Recovery Point Objective (RPO): There is a big distinction between the recovery point and the recovery time for database infrastructure. The database can be available, but not in the exact state it was in when it became unavailable; that is where the recovery point comes in. The question we ask here is: "How much data can the business lose during a database outage?" All businesses have their own requirements here, and it is worth noting that the goal is always to never sustain any data loss. But the RPO frames, for a worst-case scenario, how much data could be lost while the business maintains the ability to continue.

Disaster Recovery: While RTO and RPO are great for unplanned outages or small-scale hiccups in the infrastructure, disaster recovery concerns a major, large-scale outage, not strictly of the database infrastructure. How capable is the business of restarting operations with the assumption that all resources in the main availability zone are completely unavailable? The assumption here is that there is no viable restoration point or time that aligns with the business requirements. While each DR scenario is unique, depending on the available infrastructure, backup strategy, and technology stack, there are some common threads for any scenario.

This solution helps improve uptime:

Using this solution will help you significantly reduce both RPO and RTO. Given the tightly coupled cluster approach, the failure of a single node will not result in service interruption.
Increasing the number of nodes will also improve the cluster's resilience, following the formula:
F = floor((N - 1) / 2)

Where:

F – Number of admissible failures

N – number of nodes in the cluster

Examples:

In a cluster of 5 nodes, F = floor((5 - 1) / 2) = 2, so the cluster can support up to 2 failures.

In a cluster of 4 nodes, F = floor((4 - 1) / 2) = 1, so the cluster can support up to 1 failure.
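
You can sanity-check this on a running cluster directly from performance_schema. A minimal sketch (the table and column names are the standard MySQL 8.0 Group Replication ones):

-- Count the ONLINE members and derive F = floor((N - 1) / 2)
SELECT COUNT(*) AS online_nodes,
       FLOOR((COUNT(*) - 1) / 2) AS admissible_failures
FROM performance_schema.replication_group_members
WHERE MEMBER_STATE = 'ONLINE';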

This solution also allows for a more restrictive backup policy, dedicating a node to the backup cycle, which helps keep the RPO low. As previously mentioned, DR is not covered by default by this solution; it requires an additional replication setup and controller.

Measurement and Monitoring

To ensure the database infrastructure is performing as intended, or at its best, it is necessary to measure specific metrics and alert when some of them are not in line with expectations. Periodic review of these measurements is also encouraged, to promote stability and to understand potential risks associated with the database infrastructure. Below are the three aspects of database performance measurement and monitoring.

Measurement: To understand how a database infrastructure is performing there is a need to measure multiple aspects of the infrastructure. With measurement, it’s important to understand the impact of the sample sizes, sample timing, and sample types.

Metrics: Metrics refer to the actual parts of the database infrastructure being measured. When we discuss metrics, more isn’t always better as it could introduce unintentional noise or allow for troubleshooting to become overly burdensome.

Alerting: When one or more metrics of the database infrastructure are not within a normal or acceptable range, an alert should be generated so that the team responsible for the appropriate portion of the database infrastructure can investigate and remedy it.

Monitoring for this solution is covered by:

Percona Monitoring and Management (https://www.percona.com/doc/percona-monitoring-and-management/2.x/introduction.html) has a specific dashboard to monitor the Group Replication state and the cluster status as a whole.

How to Implement the Infrastructure

In this section, we provide step-by-step instructions on how to implement the above solution.

The Elements

The following will be used:

  • 1 Virtual IP for ProxySQL failover – 192.168.4.194
  • 2 ProxySQL nodes
    • Proxy1 192.168.4.191
    • Proxy2 192.168.4.192
  • 4 MySQL nodes in Single Primary mode
    • Gr1 192.168.4.81 – Initial Primary
    • Gr2 192.168.4.82 – Replica / failover
    • Gr3 192.168.4.83 – Replica / failover
    • Gr4 192.168.4.84 – Replica / Backup
  • Ports. The ports listed below must be open; if a firewall or any other restriction mechanism such as AppArmor or SELinux is in place, it must allow them.
    • Proxysql
      • 6033
      • 6032
      • 3306
    • MySQL – GR
      • 3306
      • 33060
      • 33061

Software Installation

First, you need to install the Percona Distribution for MySQL (the Percona Server for MySQL-based variant) on each node. Follow the instructions at https://www.percona.com/doc/percona-distribution-mysql/8.0/installing.html to install Percona Server for MySQL v8.0.

Configure the Nodes

Before anything else, make sure that all the nodes use the same time zone and time:

[root@gr1 ps8]# date
    Tue Aug 18 08:22:12 EDT 2020

Also check that the ntpd service is present and enabled, and be sure that each node resolves the other nodes by name:

[root@gr1 ps8]# for i in 1 2 3 4 ; do ping -c 1 gr$i > /dev/null; echo $?; done

If they cannot be resolved, add the entries to the /etc/hosts file.

Once the instances are up and running, check the Percona Server for MySQL version on each node:

(root@node1) [(none)]>\s
--------------
/opt/mysql_templates/PS-8P/bin/mysql  Ver 8.0.20-11 for Linux on x86_64 (Percona Server (GPL), Release 11, Revision 159f0eb)

Step 1

Create a proper user for administration:

CREATE user dba@localhost identified by 'dbapw';
CREATE user dba@'192.168.%' identified by 'dbapw'; 

GRANT ALL on *.* to dba@localhost with grant option;
GRANT ALL on *.* to dba@'192.168.%' with grant option;

Exit from the client as user root and log in as user dba.

Be sure to have a good and unique server_id value on each node:

(dba@node1) [(none)]>show global variables like 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id     |     1 | <--- Not good given the same for all nodes
+---------------+-------+
1 row in set (0.01 sec)
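
One way to give each node a distinct value without editing my.cnf is sketched below, assuming MySQL 8.0's SET PERSIST; the numbers simply mirror the host IPs used in this layout and are just our own convention:

-- On gr1 (use 82, 83, 84 on gr2, gr3, gr4 respectively):
SET PERSIST server_id = 81;
-- Verify:
SHOW GLOBAL VARIABLES LIKE 'server_id';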

It’s now time to add group replication settings to the instances.

Step 2

Stop all running nodes; then add the following to my.cnf:

#####################
# Replication + binlog settings
#####################
auto-increment-increment                = 1
auto-increment-offset                   = 1

log-bin                                 = <path_to_logs>/binlog
log-bin-index                           = binlog.index
binlog-checksum                         = NONE
binlog-format                           = ROW
binlog-row-image                        = FULL
log-slave-updates                       = 1
binlog-transaction-dependency-tracking  = WRITESET

enforce-gtid-consistency                = TRUE
gtid-mode                               = ON

master-info-file                        = master.info
master-info-repository                  = TABLE
relay_log_info_repository               = TABLE
relay-log                               = <path_to_logs>/relay

sync-binlog                             = 1

### SLAVE SECTION
skip-slave-start
slave-parallel-type                     = LOGICAL_CLOCK
slave-parallel-workers                  = 4
slave-preserve-commit-order             = 1

######################################
# Group Replication
######################################
plugin_load_add                         = 'group_replication.so'
plugin-load-add                         = 'mysql_clone.so'
# Use a group name that helps you identify the GR transactions and where
# they come from, e.g. "dc1euz1-aaaa-aaaa-aaaa-aaaaaaaaaaaa":
group_replication_group_name            = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
group_replication_start_on_boot         = off
# Change this to match each node's local IP:
group_replication_local_address         = "192.168.4.81/2/3/4:33061"
group_replication_group_seeds           = "192.168.4.81:33061,192.168.4.82:33061,192.168.4.83:33061,192.168.4.84:33061"
group_replication_bootstrap_group       = off
transaction-write-set-extraction        = XXHASH64

Restart all nodes and connect to them.
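
After the restart, it is worth a quick check that both plugins actually loaded. A minimal sketch using information_schema (the plugin names are those registered by group_replication.so and mysql_clone.so):

SELECT PLUGIN_NAME, PLUGIN_STATUS
FROM information_schema.PLUGINS
WHERE PLUGIN_NAME IN ('group_replication', 'clone');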

Step 3

Create a user for replication (on all nodes):

SET SQL_LOG_BIN=0;
CREATE USER replica@'192.168.4.%' IDENTIFIED BY 'replicapw';   # <-- note: filtering by IP is more restrictive
GRANT REPLICATION SLAVE ON *.* TO replica@'192.168.4.%';
FLUSH PRIVILEGES;
SET SQL_LOG_BIN=1;

Link the nodes with the replication channel (on all nodes):

CHANGE MASTER TO MASTER_USER='replica', MASTER_PASSWORD='replicapw' FOR CHANNEL 'group_replication_recovery';

Check the current status:

(dba@node1) [(none)]>\u performance_schema
(dba@node1) [performance_schema]>show tables like '%repl%';
+-------------------------------------------+
| Tables_in_performance_schema (%repl%)     |
+-------------------------------------------+
| replication_applier_configuration         |
| replication_applier_filters               |
| replication_applier_global_filters        |
| replication_applier_status                |
| replication_applier_status_by_coordinator |
| replication_applier_status_by_worker      |
| replication_connection_configuration      |
| replication_connection_status             |
| replication_group_member_stats            |
| replication_group_members                 | <------------------------
+-------------------------------------------+

(dba@node1) [performance_schema]>select * from replication_group_members\G
*************************** 1. row ***************************
  CHANNEL_NAME: group_replication_applier
     MEMBER_ID:
   MEMBER_HOST:
   MEMBER_PORT:
  MEMBER_STATE:
   MEMBER_ROLE: OFFLINE
MEMBER_VERSION:
1 row in set (0.00 sec)

At this stage, you should be able to start the first (Primary) cluster node.

Only on GR1:

(dba@node1)[none]> SET GLOBAL group_replication_bootstrap_group=ON;
(dba@node1)[none]> START GROUP_REPLICATION;
(dba@node1)[none]> SET GLOBAL group_replication_bootstrap_group=OFF;

And then check it:

(dba@node1) [none]>select * from performance_schema.replication_group_members\G
*************************** 1. row ***************************
  CHANNEL_NAME: group_replication_applier
     MEMBER_ID: 90a353b8-e6dc-11ea-98fa-08002734ed50
   MEMBER_HOST: gr1
   MEMBER_PORT: 3306
  MEMBER_STATE: ONLINE
   MEMBER_ROLE: PRIMARY
MEMBER_VERSION: 8.0.20

Once the Primary is running, connect to the second node, GR2, and start Group Replication:

(dba@node2) [none]>START GROUP_REPLICATION;
Query OK, 0 rows affected (4.60 sec)

Check if it registered correctly:

(dba@node2) [performance_schema]>select * from replication_group_members\G
*************************** 1. row ***************************
  CHANNEL_NAME: group_replication_applier
     MEMBER_ID: 58ffd118-e6dc-11ea-8af8-08002734ed50
   MEMBER_HOST: gr2
   MEMBER_PORT: 3306
  MEMBER_STATE: ONLINE
   MEMBER_ROLE: SECONDARY
MEMBER_VERSION: 8.0.20
*************************** 2. row ***************************
  CHANNEL_NAME: group_replication_applier
     MEMBER_ID: 90a353b8-e6dc-11ea-98fa-08002734ed50
   MEMBER_HOST: gr1
   MEMBER_PORT: 3306
  MEMBER_STATE: ONLINE
   MEMBER_ROLE: PRIMARY
MEMBER_VERSION: 8.0.20
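
From here on, you can always ask the cluster which member currently holds the Primary role. A small sketch against the same performance_schema table:

SELECT MEMBER_HOST, MEMBER_PORT
FROM performance_schema.replication_group_members
WHERE MEMBER_ROLE = 'PRIMARY';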

Test if replication works:

On GR1

(dba@node1) [performance_schema]>create schema test;
Query OK, 1 row affected (0.76 sec)

(dba@node1) [performance_schema]>\u test
Database changed

(dba@node1) [test]>create table test1 (`id` int auto_increment primary key);
Query OK, 0 rows affected (0.32 sec)

(dba@node1) [test]>insert into test1 values(null);
Query OK, 1 row affected (0.34 sec)

On GR2

(dba@node2) [performance_schema]>\u test
Database changed
(dba@node2) [test]>select * from test1;
+----+
| id |
+----+
|  1 |
+----+
1 row in set (0.00 sec)

Start Group Replication on the other two nodes, GR3 and GR4:

(dba@node3) [performance_schema]>START GROUP_REPLICATION;
(dba@node4) [performance_schema]>START GROUP_REPLICATION;

Proxy Setup

Step 1

In our solution we will use two ProxySQL nodes:

  • Proxy1 192.168.4.191
  • Proxy2 192.168.4.192

First, you need to install ProxySQL on the nodes you have selected; in our case, the two above.

To install the software, follow the instructions in How to Install ProxySQL From the Percona Repository. Once you have installed the software, we need to grant the ProxySQL monitor user access to the Percona Server for MySQL nodes.

Create the monitor user on the MySQL Group Replication nodes:
create user monitor@'192.168.4.%' identified by 'monitor';
grant usage on *.* to 'monitor'@'192.168.4.%';
grant select on sys.* to 'monitor'@'192.168.4.%';

Then define some basic variables:

update global_variables set Variable_Value='admin:admin;cluster1:clusterpass'  where Variable_name='admin-admin_credentials';
update global_variables set variable_value='cluster1' where variable_name='admin-cluster_username';
update global_variables set variable_value='clusterpass' where variable_name='admin-cluster_password';
update global_variables set Variable_Value=0  where Variable_name='mysql-hostgroup_manager_verbose';
update global_variables set Variable_Value='true'  where Variable_name='mysql-query_digests_normalize_digest_text';
update global_variables set Variable_Value='8.0.20'  where Variable_name='mysql-server_version';
update global_variables set Variable_Value='utf8'  where Variable_name='mysql-default_charset';
update global_variables set Variable_Value=300  where Variable_name='mysql-tcp_keepalive_time';
update global_variables set Variable_Value='true'  where Variable_name='mysql-use_tcp_keepalive';
update global_variables set Variable_Value='true'  where Variable_name='mysql-verbose_query_error';
update global_variables set Variable_Value='true'  where Variable_name='mysql-show_processlist_extended';
update global_variables set Variable_Value=50000  where Variable_name='mysql-max_stmts_cache';
update global_variables set Variable_Value='false'  where Variable_name='admin-web_enabled';
update global_variables set Variable_Value='0'  where Variable_name='mysql-set_query_lock_on_hostgroup';

load admin variables to run;save admin variables to disk;
load mysql variables to run;save mysql variables to disk;

The user name and password need to reflect your standards. The ones used above are just an example. Then set up the nodes as a cluster:

INSERT INTO proxysql_servers (hostname,port,weight,comment) VALUES('192.168.4.191',6032,100,'PRIMARY');
INSERT INTO proxysql_servers (hostname,port,weight,comment) VALUES('192.168.4.192',6032,100,'SECONDARY');
load proxysql servers to run;save proxysql servers to disk;
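
To verify that the two ProxySQL nodes see each other and are syncing configuration, you can query the cluster checksum stats. A sketch: stats_proxysql_servers_checksums is part of the ProxySQL cluster feature, and the exact columns may vary slightly by version:

SELECT hostname, port, name, version, epoch, checksum
FROM stats_proxysql_servers_checksums;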

Step 2

Define user(s), servers, and query rules to perform the read/write split. Create one or more valid users; for instance, a user named app_gr with the password test, which has access to your Group Replication cluster:

insert into mysql_users (username,password,active,default_hostgroup,default_schema,transaction_persistent,comment) values ('app_gr','test',1,400,'mysql',1,'application test user GR');
LOAD MYSQL USERS TO RUNTIME;SAVE MYSQL USERS TO DISK;

Define servers:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.81',400,3306,10000,2000,'GR1');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.81',401,3306,100,2000,'GR1');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.82',401,3306,10000,2000,'GR2');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.83',401,3306,10000,2000,'GR3');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.84',401,3306,1,2000,'GR4');
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;

Define query rules to get read-write split:

INSERT INTO mysql_query_rules (rule_id,proxy_port,username,destination_hostgroup,active,retries,match_digest,apply) values(4040,6033,'app_gr',400,1,3,'^SELECT.*FOR UPDATE',1);
INSERT INTO mysql_query_rules (rule_id,proxy_port,username,destination_hostgroup,active,retries,match_digest,multiplex,apply) values(4042,6033,'app_gr',401,1,3,'^SELECT.*$',1,1);
LOAD MYSQL QUERY RULES TO RUN;SAVE MYSQL QUERY RULES TO DISK;

Step 3

Once we have all the configuration ready, we need to add a special view in the SYS schema on our Percona Server for MySQL nodes. The view, which works for server version 8 and above, can be found at https://github.com/Percona-Lab/group_replication_tools/blob/master/GR_sys_view_forProxysql_v1.sql.

Run that SQL on the PRIMARY node of the Group Replication cluster.
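
A minimal sketch of applying it, assuming you have downloaded the file to /tmp on the Primary (the path is an assumption; SOURCE is a mysql client built-in):

-- From a mysql client session on the Primary node:
SOURCE /tmp/GR_sys_view_forProxysql_v1.sql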

Step 4

Now we are ready to activate the native support for Group Replication in ProxySQL. We will use the following group definition:

Writer HG-> 400
Reader HG-> 401
BackupW HG-> 402
Offline HG-> 9401 

INSERT INTO mysql_group_replication_hostgroups (writer_hostgroup,backup_writer_hostgroup,reader_hostgroup, offline_hostgroup,active,max_writers,writer_is_also_reader,max_transactions_behind) 
values (400,402,401,9401,1,1,1,100);
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;

A few comments here about the parameters (for a full reference, see https://proxysql.com/documentation/main-runtime#mysql_group_replication_hostgroups). We recommend always setting the number of writers to 1, and writer_is_also_reader to 1 as well, to obtain the most reliable results.

max_writers: 1
writer_is_also_reader: 1

max_transactions_behind is a subjective parameter that you should calculate based on your needs. If, for instance, you cannot tolerate stale reads, it is safe to set this value to a low number (i.e., 50) and to set, on all Group Replication nodes:

set global group_replication_consistency=AFTER;

If, instead, you have no strict requirements around stale reads, you can relax the parameter and ignore the group_replication_consistency setting. Our recommended combination is group_replication_consistency=AFTER and max_transactions_behind: 100.
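
Putting the two recommendations together, here is a sketch of applying them, reusing the hostgroup IDs defined earlier (the UPDATE runs on the ProxySQL admin interface; the SET runs on every Group Replication node and assumes MySQL 8.0's SET PERSIST):

-- On ProxySQL (admin interface):
UPDATE mysql_group_replication_hostgroups
   SET max_transactions_behind = 100
 WHERE writer_hostgroup = 400;
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;

-- On each Group Replication node:
SET PERSIST group_replication_consistency = 'AFTER';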

Proxy HA

The final step is to enable High Availability for the ProxySQL layer. In this approach, we will use the well-known keepalived service. First, install the keepalived software using yum or apt-get on each ProxySQL node:

sudo yum install -y keepalived
or
sudo apt-get install -y keepalived

Then modify the /etc/keepalived/keepalived.conf file according to your setup. In our case:

  • Proxy1 192.168.4.0/24 dev enp0s9 proto kernel scope link src 192.168.4.191
  • Proxy2 192.168.4.0/24 dev enp0s9 proto kernel scope link src 192.168.4.192
  • VIP       192.168.4.194

We want Proxy1 to be the primary and Proxy2 to be the failover node. Given that, the config will look like this:

cat /etc/keepalived/keepalived.conf 
global_defs {
  # Keepalived process identifier
  router_id  proxy_HA
}
# Script used to check if Proxy is running
vrrp_script check_proxy {
  script "killall -0 proxysql"
  interval 2
  weight 2
}
# Virtual interface
# The priority determines which node takes over the VIP in a failover
vrrp_instance VI_01 {
  state MASTER 
  interface enp0s9
  virtual_router_id 51
  priority 100  <----- This needs to be different for each ProxySQL node, like 100/99 
 
  # The virtual ip address shared between the two load balancers
  virtual_ipaddress {
   192.168.4.194  dev enp0s9
  }
  track_script {
    check_proxy
  }
}

Once done, start the keepalived service (e.g., systemctl start keepalived); from now on, the VIP will be associated with Proxy1 unless that service goes down.

In the system log:

proxysql1 Keepalived_vrrp[17422]: VRRP sockpool: [ifindex(4), proto(112), unicast(0), fd(10,11)]
proxysql1 Keepalived_vrrp[17422]: VRRP_Script(check_proxy) succeeded
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Transition to MASTER STATE
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Received advert with higher priority 101, ours 100
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Entering BACKUP STATE
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Changing effective priority from 100 to 102
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) forcing a new MASTER election
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Transition to MASTER STATE
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Entering MASTER STATE
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) setting protocol VIPs.
proxysql1 Keepalived_vrrp[17422]: Sending gratuitous ARP on enp0s9 for 192.168.4.194
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Sending/queueing gratuitous ARPs on enp0s9 for 192.168.4.194
proxysql1 Keepalived_vrrp[17422]: Sending gratuitous ARP on enp0s9 for 192.168.4.194
..
proxysql1 Keepalived_vrrp[17422]: Sending gratuitous ARP on enp0s9 for 192.168.4.194
proxysql1 avahi-daemon[989]: Registering new address record for 192.168.4.194 on enp0s9.IPv4.

Disaster Recovery Implementation

The implementation of a DR (Disaster Recovery) site follows the same directions provided for the main site. There are only some generic rules that should be followed:

  • A DR site should be located in a different geographic location than the main site (several hundred kilometers/miles away).
  • The connection link between the main site and the DR site can only be established using asynchronous replication (a standard MySQL replication setup).

Monitoring

There are a few ways to monitor a Group Replication cluster. The easiest is to have Percona Monitoring and Management (version 2.10 or later) deployed to do it for you. For an easy installation of Percona Monitoring and Management, check out this quickstart.

Percona Monitoring and Management

The only important thing to remember is that when registering the Percona Server for MySQL node (or the MySQL node), you should specify the --replication-set flag, e.g.:

pmm-admin add mysql --username=pmm --password=pmm --query-source=perfschema --replication-set=gr_test_lab group_rep4 127.0.0.1:3306

Then you can use the Group Replication dashboard and monitor your cluster in great detail.

The sections are:

  • Overview (3 panels)
  • Replication Delay Details (3 panels)
  • Transactions (8 panels)
  • Conflicts

From Command Line

From the command line, you need to manually query the tables in performance_schema:

+----------------------------------------------+
| replication_applier_configuration            |
| replication_applier_filters                  |
| replication_applier_global_filters           |
| replication_applier_status                   |
| replication_applier_status_by_coordinator    |
| replication_applier_status_by_worker         |
| replication_connection_configuration         |
| replication_connection_status                |
| replication_group_member_stats               |
| replication_group_members                    |
+----------------------------------------------+

For instance, to get the lag in the number of transactions on a node:

select @last_exec:=SUBSTRING_INDEX(SUBSTRING_INDEX(@@global.GTID_EXECUTED,':',-1),'-',-1) last_executed;
select @last_rec:=SUBSTRING_INDEX(SUBSTRING_INDEX(Received_transaction_set,':',-1),'-',-1) last_received FROM performance_schema.replication_connection_status WHERE Channel_name = 'group_replication_applier';
select (@last_rec - @last_exec) as real_lag;
+---------------+
| last_executed |
+---------------+
| 125624        |
+---------------+
1 row in set, 1 warning (0.03 sec)

+---------------+
| last_received |
+---------------+
| 125624        |
+---------------+
1 row in set, 1 warning (0.00 sec)

+----------+
| real_lag |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

Or use a more composite query:

SELECT
  conn_status.channel_name as channel_name,
  conn_status.service_state as IO_thread,
  applier_status.service_state as SQL_thread,
  conn_status.LAST_QUEUED_TRANSACTION as last_queued_transaction,
  applier_status.LAST_APPLIED_TRANSACTION as last_applied_transaction,
  LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP - 
                            LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP 'rep delay (sec)',
  LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP - 
                           LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP 'transport time', 
  LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP - 
                           LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP 'time RL',
  LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP - 
                           LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP 'apply time',  
  if(GTID_SUBTRACT(LAST_QUEUED_TRANSACTION, LAST_APPLIED_TRANSACTION) = "","0" ,        abs(time_to_sec(if(time_to_sec(APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP)=0,0,timediff(APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,now()))))) `lag_in_sec` 
FROM
  performance_schema.replication_connection_status AS conn_status
JOIN performance_schema.replication_applier_status_by_worker AS applier_status
  ON applier_status.channel_name = conn_status.channel_name
ORDER BY lag_in_sec desc\G

Which will provide information about each applier:

*************************** 1. row ***************************
channel_name: group_replication_applier
IO_thread: ON
SQL_thread: ON
last_queued_transaction: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:125624
last_applied_transaction: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:125621
rep delay (sec): 3.153038
transport time: 0.061327
time RL: 0.001005
apply time: 0.388680
lag_in_sec: 0

As you can see, Percona Monitoring and Management will give you a better view without compromising the details. 

Conclusions

Using these steps and recommendations, you can set up a highly available database infrastructure based on Group Replication and use Percona Monitoring and Management to monitor the infrastructure's performance and health.

Keep in mind that we are constantly working on improving our recommendations. As such, what is illustrated here is subject to change and revision, especially given the increasing adoption of Group Replication: the more GR is used, the more edge cases or deviations we will identify, and those significantly help us refine our best practices.

Nov
30
2020
--

Support for Percona XtraDB Cluster in ProxySQL (Part One)


How ProxySQL's native support stands up in failover scenarios (both v2.0.15 and v2.1.0)

In recent times, I have been designing several solutions focused on high availability and disaster recovery. Some of them use Percona Server for MySQL with Group Replication, some use Percona XtraDB Cluster (PXC). What many of them have in common is the use of ProxySQL for the connection layer. This is because I consider a layer 7 proxy preferable, given the possible advantages it provides in read/write split and SQL filtering.

The other positive aspect provided by ProxySQL, at least for Group Replication, is the native support which allows us to have a very quick resolution of possible node failures.

ProxySQL has Galera support as well but, in the past, that support had shown itself to be pretty unstable, and the old method of using the scheduler was still the best way to go.

After Percona Live Online 2020 I decided to try it again and to see if at least the basics were now working fine. 

What I Have Tested

I was not looking for complicated tests that would have included different levels of transaction isolation. I was instead interested in the simpler, more basic ones. My scenario was:

1 ProxySQL node v2.0.15 (192.168.4.191)
1 ProxySQL node v2.1.0 (192.168.4.108)
3 PXC 8.0.20 nodes (192.168.4.22/23/233) with an internal network (10.0.0.22/23/33)

ProxySQL was freshly installed. 

All the commands used to modify the configuration are here. Tests were done first using ProxySQL v2.0.15, then v2.1.0; only where the results diverge will I report the version-specific results.

PXC Failover Scenario

As mentioned above, I am going to focus on the failover needs, period. I will have two different scenarios:

  • Maintenance
  • Node crash 

From the ProxySQL point of view, I will have three scenarios, always with a single Primary:

  • Writer is NOT a reader (writer_is_also_reader = 0 and 2)
  • Writer is also a reader (writer_is_also_reader = 1)

The configuration of the native support will be:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.22',100,3306,10000,2000,'Preferred writer');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.23',100,3306,1000,2000,'Second preferred ');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.233',100,3306,100,2000,'Last chance');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.22',101,3306,100,2000,'last reader');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.23',101,3306,10000,2000,'reader1');    
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.233',101,3306,10000,2000,'reader2');

Galera host groups:

  • Writer: 100
  • Reader: 101
  • Backup_writer: 102
  • Offline_hostgroup: 9101

Before going ahead, let us analyze the mysql_servers settings. As you can notice, I am using the weight attribute to indicate to ProxySQL which node is my preferred writer. But I also use the weight in the READ host group to indicate which servers should be used, and how.

Given that, we have:

  • Write
    • 192.168.4.22 is the preferred Primary
    • 192.168.4.23 is the first failover
    • 192.168.4.233 is the last chance
  • Read
    • 192.168.4.23/233 have the same weight, and the load should be balanced between the two of them
    • 192.168.4.22, being the preferred writer, should NOT receive the same read load, and therefore has a lower weight value.

The Tests

First Test

The first test is to see how the cluster behaves in the case of 1 writer and 2 readers, with the option writer_is_also_reader = 0.
To achieve this, the settings for ProxySQL will be:

insert into mysql_galera_hostgroups (writer_hostgroup,backup_writer_hostgroup,reader_hostgroup, offline_hostgroup,active,max_writers,writer_is_also_reader,max_transactions_behind) values (100,102,101,9101,1,1,0,10);

As soon as I load this to runtime, ProxySQL should move the nodes to the relevant host groups. But this is not happening; instead, it keeps the readers in the writer HG and marks them SHUNNED.

+---------+--------------+---------------+--------------+
| weight  | hostgroup_id | srv_host      | status       |
+---------+--------------+---------------+--------------+
| 100     | 100          | 192.168.4.233 | SHUNNED      |
| 1000    | 100          | 192.168.4.23  | SHUNNED      |
| 10000   | 100          | 192.168.4.22  | ONLINE       |
| 100     | 102          | 192.168.4.233 | ONLINE       |
| 1000    | 102          | 192.168.4.23  | ONLINE       |
+---------+--------------+---------------+--------------+

This is, of course, wrong. But why does it happen?

The reason is simple: ProxySQL expects to see all nodes in the reader group with the read_only flag set to 1.

In the ProxySQL documentation, we can read:

writer_is_also_reader=0: nodes with read_only=0 will be placed either in the writer_hostgroup and in the backup_writer_hostgroup after a topology change, these will be excluded from the reader_hostgroup.

This is conceptually wrong. 

A PXC cluster is a tightly coupled replication cluster, with virtually synchronous replication. One of its benefits is to have the node “virtually” aligned with respect to the data state. 

In this kind of model, the cluster is data-centric, and each node shares the same data view.

What it also means is that, if correctly set up, the nodes will be fully consistent for data reads.

The other characteristic of the cluster is that ANY node can become a writer at any time.
While best practices indicate that it is better to use one writer at a time as Primary to prevent certification conflicts, this does not mean that the nodes not currently elected as Primary should be prevented from becoming a writer.

Which is exactly what the READ_ONLY flag does when activated.

What's more, the need to have READ_ONLY set means that we must change it BEFORE the node is able to become a writer in case of failover.

This, in short, means we need either a topology manager or a script that does that, with all the related checks and logic, to be safe. At failover time, this adds delay and complexity when it is not really needed, and it goes against the concept of the tightly coupled cluster itself.
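
For reference, this is the kind of per-node juggling such a script would have to perform. A sketch using the standard MySQL read_only/super_read_only globals, shown only to illustrate the extra moving parts:

-- On every node NOT currently elected as Primary:
SET GLOBAL super_read_only = 1;   -- implicitly sets read_only = 1

-- At failover, before the chosen node can accept writes:
SET GLOBAL read_only = 0;         -- implicitly clears super_read_only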

Given the above, we can say that this ProxySQL method related to writer_is_also_reader = 0, as it is implemented today for Galera, is, at best, useless.

Why does it work for Group Replication? That is easy: when the cluster is used in single-Primary mode, Group Replication internally uses a mechanism to lock/unlock the non-Primary nodes. That internal mechanism was implemented as a safety guard to prevent random writes on multiple nodes, and it also manages the READ_ONLY flag.

Second Test

Let us move on and test with writer_is_also_reader = 2. Again from the documentation:

writer_is_also_reader=2 : Only the nodes with read_only=0 which are placed in the backup_writer_hostgroup are also placed in the reader_hostgroup after a topology change i.e. the nodes with read_only=0 exceeding the defined max_writers.

Given the settings as indicated above, my layout before using Galera support is:

+---------+--------------+---------------+--------------+
| weight  | hostgroup_id | srv_host      | status       |
+---------+--------------+---------------+--------------+
| 100     | 100          | 192.168.4.233 | ONLINE       |
| 1000    | 100          | 192.168.4.23  | ONLINE       |
| 10000   | 100          | 192.168.4.22  | ONLINE       |
| 10000   | 101          | 192.168.4.233 | ONLINE       |
| 10000   | 101          | 192.168.4.23  | ONLINE       |
| 100     | 101          | 192.168.4.22  | ONLINE       |
+---------+--------------+---------------+--------------+

After enabling Galera support:

+--------+-----------+---------------+----------+---------+
| weight | hostgroup | srv_host      | srv_port | status  |
+--------+-----------+---------------+----------+---------+
| 100    | 100       | 192.168.4.233 | 3306     | SHUNNED |
| 1000   | 100       | 192.168.4.23  | 3306     | SHUNNED |
| 10000  | 100       | 192.168.4.22  | 3306     | ONLINE  |
| 100    | 101       | 192.168.4.233 | 3306     | ONLINE  |
| 1000   | 101       | 192.168.4.23  | 3306     | ONLINE  |
| 100    | 102       | 192.168.4.233 | 3306     | ONLINE  |
| 1000   | 102       | 192.168.4.23  | 3306     | ONLINE  |
+--------+-----------+---------------+----------+---------+

So the node ending in 22 (the elected Primary) is not in the reader pool, which can be OK, I assume.

But what is not OK at all is that the READERS now have completely different weights. Nodes x.23 and x.233 are no longer balancing the load, because the weights are not the ones I defined; they were instead copied over from the WRITER settings.

Well of course this is wrong and not what I want. Anyhow, let’s test the READ failover.

I will use sysbench read-only:

sysbench ./src/lua/windmills/oltp_read.lua  --mysql-host=192.168.4.191 --mysql-port=6033 --mysql-user=app_test --mysql-password=test --mysql-db=windmills_s --db-driver=mysql --tables=10 --table_size=10000  --rand-type=zipfian --rand-zipfian-exp=0.5 --skip_trx=true  --report-interval=1  --mysql_storage_engine=innodb --auto_inc=off --histogram --table_name=windmills  --stats_format=csv --db-ps-mode=disable --point-selects=50 --range-selects=true --threads=50 --time=2000   run

mysql> select * from  runtime_mysql_galera_hostgroups \G
*************************** 1. row ***************************
       writer_hostgroup: 100
backup_writer_hostgroup: 102
       reader_hostgroup: 101
      offline_hostgroup: 9101
                 active: 1
            max_writers: 1
  writer_is_also_reader: 2
max_transactions_behind: 10
                comment: NULL

Test Running

+--------+-----------+---------------+----------+---------+----------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed |
+--------+-----------+---------------+----------+---------+----------+
| 100    | 100       | 192.168.4.233 | 3306     | SHUNNED | 0        |
| 1000   | 100       | 192.168.4.23  | 3306     | SHUNNED | 0        |
| 10000  | 100       | 192.168.4.22  | 3306     | ONLINE  | 0        |
| 100    | 101       | 192.168.4.233 | 3306     | ONLINE  | 1        |
| 1000   | 101       | 192.168.4.23  | 3306     | ONLINE  | 51       |
| 100    | 102       | 192.168.4.233 | 3306     | ONLINE  | 0        |
| 1000   | 102       | 192.168.4.23  | 3306     | ONLINE  | 0        |
+--------+-----------+---------------+----------+---------+----------+

As indicated above, the reads are not balanced. Next, I remove node x.23 from serving traffic by setting wsrep_reject_queries=all on it.
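
On the node itself this is a single statement; a sketch (wsrep_reject_queries is a PXC global variable, and ALL rejects client queries while Galera replication keeps running):

SET GLOBAL wsrep_reject_queries = ALL;

After that, the layout becomes: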

+---------+--------------+---------------+--------------+----------+
| weight  | hostgroup_id | srv_host      | status       | ConnUsed |
+---------+--------------+---------------+--------------+----------+
| 100     | 100          | 192.168.4.233 | SHUNNED      | 0        |
| 10000   | 100          | 192.168.4.22  | ONLINE       | 0        |
| 100     | 101          | 192.168.4.233 | ONLINE       | 48       |
| 100     | 102          | 192.168.4.233 | ONLINE       | 0        |
+---------+--------------+---------------+--------------+----------+

The remaining node, x.233, is now taking all the reads, good. If I set wsrep_reject_queries=all on x.233 as well:

+---------+--------------+---------------+--------------+
| weight  | hostgroup_id | srv_host      | status       |
+---------+--------------+---------------+--------------+
| 10000   | 100          | 192.168.4.22  | ONLINE       |
| 100     | 9101         | 192.168.4.233 | SHUNNED      |
| 10000   | 9101         | 192.168.4.23  | ONLINE       |
+---------+--------------+---------------+--------------+

And the application failed:

FATAL: mysql_drv_query() returned error 9001 (Max connect timeout reached while reaching hostgroup 101 after 10000ms) for query ‘SELECT id, millid, date,active,kwatts_s FROM windmills2 WHERE id=9364’

Now, this may be by design, but I have serious difficulties understanding the reasoning here, given that we allow the platform to stop serving while we still have a healthy server.

Last but not least, I am not allowed to decide WHICH nodes the backup writers are; ProxySQL chooses them from my writer list of servers. So why not also include the one I have declared as Primary, at least in case of need? ¯\_(ツ)_/¯

Third Test

OK, one last try, with writer_is_also_reader = 1.

mysql> select * from  runtime_mysql_galera_hostgroups \G
*************************** 1. row ***************************
       writer_hostgroup: 100
backup_writer_hostgroup: 102
       reader_hostgroup: 101
      offline_hostgroup: 9101
                 active: 1
            max_writers: 1
  writer_is_also_reader: 1
max_transactions_behind: 10
                comment: NULL
1 row in set (0.01 sec)

And now I have:

+---------+--------------+---------------+--------------+----------+
| weight  | hostgroup_id | srv_host      | status       | ConnUsed |
+---------+--------------+---------------+--------------+----------+
| 100     | 100          | 192.168.4.233 | SHUNNED      | 0        |
| 1000    | 100          | 192.168.4.23  | SHUNNED      | 0        |
| 10000   | 100          | 192.168.4.22  | ONLINE       | 0        |
| 100     | 101          | 192.168.4.233 | ONLINE       | 0        |
| 1000    | 101          | 192.168.4.23  | ONLINE       | 0        |
| 10000   | 101          | 192.168.4.22  | ONLINE       | 35       | <-- :(
| 100     | 102          | 192.168.4.233 | ONLINE       | 0        |
| 1000    | 102          | 192.168.4.23  | ONLINE       | 0        |
+---------+--------------+---------------+--------------+----------+

Then remove one reader at a time, as before:

+---------+--------------+---------------+--------------+----------+
| weight  | hostgroup_id | srv_host      | status       | ConnUsed |
+---------+--------------+---------------+--------------+----------+
| 100     | 100          | 192.168.4.233 | SHUNNED      | 0        |
| 10000   | 100          | 192.168.4.22  | ONLINE       | 0        |
| 100     | 101          | 192.168.4.233 | ONLINE       | 0        |
| 10000   | 101          | 192.168.4.22  | ONLINE       | 52       | <-- :(
| 100     | 102          | 192.168.4.233 | ONLINE       | 0        |
| 10000   | 9101         | 192.168.4.23  | ONLINE       | 0        |
+---------+--------------+---------------+--------------+----------+

+---------+--------------+---------------+--------------+----------+
| weight  | hostgroup_id | srv_host      | status       | ConnUsed |
+---------+--------------+---------------+--------------+----------+
| 10000   | 100          | 192.168.4.22  | ONLINE       | 0        |
| 100     | 101          | 192.168.4.22  | ONLINE       | 39       | <-- :(
| 100     | 9101         | 192.168.4.233 | SHUNNED     | 0        |
| 10000   | 9101         | 192.168.4.23  | ONLINE       | 0        |
+---------+--------------+---------------+--------------+----------+

Now, as you may have already realized, the point here is that YES, my node x.22 (Primary) is able to get the READs as well, but that node was taking the whole load from the beginning. This is because of the shift ProxySQL made to the weights.

This happens because, while ProxySQL initially populates the internal table mysql_servers_incoming with the data from mysql_servers, after several steps that information is overwritten, for the readers as well, with the data coming from the writer.

Messing up the desired results.

Fourth Test

Failover due to maintenance. In this case, I will set pxc_maint_mode = MAINTENANCE on the writer to fail over to another writer.
The sysbench command used:

sysbench ./src/lua/windmills/oltp_read_write.lua  --mysql-host=192.168.4.191 --mysql-port=6033 --mysql-user=app_test --mysql-password=test --mysql-db=windmills_s --db-driver=mysql --tables=10 --table_size=10000  --rand-type=zipfian --rand-zipfian-exp=0.5 --skip_trx=false  --report-interval=1  --mysql_storage_engine=innodb --auto_inc=off --histogram --table_name=windmills  --stats_format=csv --db-ps-mode=disable --point-selects=50 --range-selects=true --threads=50 --time=2000   run

After starting sysbench, I set the writer in maintenance mode.
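
On the writer node, this is a single statement; a sketch (pxc_maint_mode is a PXC global variable, and pxc_maint_transition_period controls how long the node drains before transitioning):

SET GLOBAL pxc_maint_mode = 'MAINTENANCE';

The node now reports: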

+-----------------------------+-------------+
| Variable_name               | Value       |
+-----------------------------+-------------+
| pxc_encrypt_cluster_traffic | OFF         |
| pxc_maint_mode              | MAINTENANCE |
| pxc_maint_transition_period | 10          |
| pxc_strict_mode             | ENFORCING   |
+-----------------------------+-------------+

ProxySQL sets the node as SHUNNED, but it is not able to hand the connections over, given that sysbench uses sticky connections.

+---------+--------------+---------------+--------------+----------+
| weight  | hostgroup_id | srv_host      | status       | ConnUsed |
+---------+--------------+---------------+--------------+----------+
| 100     | 100          | 192.168.4.233 | SHUNNED      | 0        |
| 1000    | 100          | 192.168.4.23  | ONLINE       | 0        |
| 10000   | 100          | 192.168.4.22  | SHUNNED      | 50       |
| 100     | 101          | 192.168.4.233 | ONLINE       | 2        |
| 1000    | 101          | 192.168.4.23  | ONLINE       | 13       |
| 100     | 102          | 192.168.4.233 | ONLINE       | 0        |
| 10000   | 9101         | 192.168.4.22  | ONLINE       | 0        |
+---------+--------------+---------------+--------------+----------+

THIS IS EXPECTED!
If your application uses sticky connections and never reconnects, you must restart the application. Adding --reconnect=50 to the sysbench command, I can see that the connections shift to the new Primary as expected:

+---------+--------------+---------------+--------------+----------+
| weight  | hostgroup_id | srv_host      | status       | ConnUsed |
+---------+--------------+---------------+--------------+----------+
| 100     | 100          | 192.168.4.233 | SHUNNED      | 0        |
| 1000    | 100          | 192.168.4.23  | ONLINE       | 26       | <-- New Primary
| 10000   | 100          | 192.168.4.22  | SHUNNED      | 19       | <-- shift
| 100     | 101          | 192.168.4.233 | ONLINE       | 0        |
| 10000   | 101          | 192.168.4.23  | ONLINE       | 21       |
| 100     | 102          | 192.168.4.233 | ONLINE       | 0        |
| 10000   | 9101         | 192.168.4.23  | ONLINE       | 0        | <-- ??
| 10000   | 9101         | 192.168.4.22  | ONLINE       | 0        |
+---------+--------------+---------------+--------------+----------+

As we can see, ProxySQL does the failover to node x.23 as expected. But it also adds the node to HG 9101, which is supposed to host the offline servers.

So why move the Primary there? 

Once maintenance is over, disabling pxc_maint_mode will restore the Primary. In short, ProxySQL will fail back.

The whole process will not be impactful if the application is NOT using sticky connections; otherwise, the application will have to deal with:

  • Errors with the connection
  • A retry cycle to re-run the dropped DML

Failover Because of a Crash

To check the next case, I will add --mysql-ignore-errors=all to sysbench, to be able to see how many errors I get, and for how long, when a failover is needed. To simulate a crash, I will kill -9 the mysqld process on the writer.

After Kill:

98,50,53.00,6472.71,6070.73,221.99,179.99,1327.91,0.00,1.00 <--
99,50,0.00,2719.17,2719.17,0.00,0.00,0.00,0.00,50.00        <--start
100,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
101,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
102,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
103,50,0.00,2849.89,2549.90,193.99,106.00,0.00,0.00,0.00
104,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
105,50,49.85,2663.99,2556.31,23.93,83.75,7615.89,0.00,6.98  <-- done

In this case, it takes 6 seconds for a failover.

+--------+-----------+---------------+----------+---------+----------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | 
+--------+-----------+---------------+----------+---------+----------+
| 100    | 100       | 192.168.4.233 | 3306     | SHUNNED | 0        | 
| 1000   | 100       | 192.168.4.23  | 3306     | ONLINE  | 48       | 
| 100    | 101       | 192.168.4.233 | 3306     | ONLINE  | 1        | 
| 1000   | 101       | 192.168.4.23  | 3306     | ONLINE  | 18       | 
| 100    | 102       | 192.168.4.233 | 3306     | ONLINE  | 0        | 
| 10000  | 9101      | 192.168.4.22  | 3306     | SHUNNED | 0        | 
+--------+-----------+---------------+----------+---------+----------+

So all good here. But during one of my tests, ONLY on v2.0.15 and when using the same weights, I saw the following weird behavior: once the failover was done, ProxySQL was sending connections to BOTH remaining nodes.

Check the data below, sampled one snapshot after the other as the nodes start to take over; keep in mind that the PRIMARY here was node 192.168.4.233:

+--------+-----------+---------------+----------+---------+----------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed |
+--------+-----------+---------------+----------+---------+----------+
| 10000  | 100       | 192.168.4.233 | 3306     | SHUNNED | 0        |
| 10000  | 100       | 192.168.4.23  | 3306     | ONLINE  | 10       |<--
| 10000  | 100       | 192.168.4.22  | 3306     | SHUNNED | 40       |<--
| 10000  | 101       | 192.168.4.233 | 3306     | SHUNNED | 0        |
| 10000  | 101       | 192.168.4.23  | 3306     | ONLINE  | 3        |
| 10000  | 101       | 192.168.4.22  | 3306     | ONLINE  | 12       |
| 10000  | 102       | 192.168.4.22  | 3306     | ONLINE  | 0        |
+--------+-----------+---------------+----------+---------+----------+
...
+--------+-----------+---------------+----------+---------+----------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed |
+--------+-----------+---------------+----------+---------+----------+
| 10000  | 100       | 192.168.4.233 | 3306     | SHUNNED | 0        |
| 10000  | 100       | 192.168.4.23  | 3306     | ONLINE  | 37       |<--
| 10000  | 100       | 192.168.4.22  | 3306     | SHUNNED | 13       |<--
| 10000  | 101       | 192.168.4.233 | 3306     | SHUNNED | 0        |
| 10000  | 101       | 192.168.4.23  | 3306     | ONLINE  | 7        |
| 10000  | 101       | 192.168.4.22  | 3306     | ONLINE  | 12       |
| 10000  | 102       | 192.168.4.22  | 3306     | ONLINE  | 0        |
+--------+-----------+---------------+----------+---------+----------+
...
+--------+-----------+---------------+----------+---------+----------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed |
+--------+-----------+---------------+----------+---------+----------+
| 10000  | 100       | 192.168.4.233 | 3306     | SHUNNED | 0        |
| 10000  | 100       | 192.168.4.23  | 3306     | ONLINE  | 49       |<--
| 10000  | 100       | 192.168.4.22  | 3306     | SHUNNED | 0        |<--
| 10000  | 101       | 192.168.4.233 | 3306     | SHUNNED | 0        |
| 10000  | 101       | 192.168.4.23  | 3306     | ONLINE  | 10       |
| 10000  | 101       | 192.168.4.22  | 3306     | ONLINE  | 10       |
| 10000  | 102       | 192.168.4.22  | 3306     | ONLINE  | 0        |
+--------+-----------+---------------+----------+---------+----------+

In the end, only one node remained as Primary, but for some amount of time both were serving, even though only ONE node was declared ONLINE.

A Problem Along the Road… (only with v2.0.15)

While I was trying to “fix” the issue with the weight for READERS…

Let’s say we have this:

+--------+-----------+---------------+----------+---------+----------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed |
+--------+-----------+---------------+----------+---------+----------+
| 10000  | 100       | 192.168.4.23  | 3306     | ONLINE  | 686      |
| 10000  | 100       | 192.168.4.22  | 3306     | SHUNNED | 0        |
| 10000  | 101       | 192.168.4.233 | 3306     | ONLINE  | 62       |
| 10000  | 101       | 192.168.4.23  | 3306     | ONLINE  | 43       |
| 10000  | 101       | 192.168.4.22  | 3306     | ONLINE  | 19       |
| 10000  | 102       | 192.168.4.22  | 3306     | ONLINE  | 0        |
+--------+-----------+---------------+----------+---------+----------+

And I want to shift some of the READ load away from the WRITER (currently 192.168.4.23).

If I simply do:

update mysql_servers set weight=100 where hostgroup_id=101 and hostname='192.168.4.23';

+--------------+---------------+------+-----------+--------+--------+
| hostgroup_id | hostname      | port | gtid_port | status | weight | 
+--------------+---------------+------+-----------+--------+--------+
| 100          | 192.168.4.23  | 3306 | 0         | ONLINE | 10000  | 
| 101          | 192.168.4.22  | 3306 | 0         | ONLINE | 10000  | 
| 101          | 192.168.4.23  | 3306 | 0         | ONLINE | 100    | 
| 101          | 192.168.4.233 | 3306 | 0         | ONLINE | 10000  | 
+--------------+---------------+------+-----------+--------+--------+
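Note that at this point the new weight exists only in the admin configuration; the standard ProxySQL admin commands to push it to runtime (and persist it to disk) are:

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;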

After loading the change into runtime… if I am lucky:

+--------+-----------+---------------+----------+---------+
| weight | hostgroup | srv_host      | srv_port | status  |
+--------+-----------+---------------+----------+---------+
| 10000  | 100       | 192.168.4.23  | 3306     | ONLINE  |
| 10000  | 100       | 192.168.4.22  | 3306     | SHUNNED |
| 10000  | 101       | 192.168.4.233 | 3306     | ONLINE  |
| 100    | 101       | 192.168.4.23  | 3306     | ONLINE  |
| 10000  | 101       | 192.168.4.22  | 3306     | ONLINE  |
| 10000  | 102       | 192.168.4.22  | 3306     | ONLINE  |
+--------+-----------+---------------+----------+---------+

And then it is changed to:

+--------+-----------+---------------+----------+---------+
| weight | hostgroup | srv_host      | srv_port | status  |
+--------+-----------+---------------+----------+---------+
| 10000  | 100       | 192.168.4.23  | 3306     | ONLINE  |
| 10000  | 100       | 192.168.4.22  | 3306     | SHUNNED |
| 10000  | 101       | 192.168.4.233 | 3306     | ONLINE  |
| 10000  | 101       | 192.168.4.23  | 3306     | ONLINE  |
| 10000  | 101       | 192.168.4.22  | 3306     | ONLINE  |
| 10000  | 102       | 192.168.4.22  | 3306     | ONLINE  |
+--------+-----------+---------------+----------+---------+

As you can see, ProxySQL initially sets the weight to the value I chose, then changes it back to what was set for HG 100. Worse, if I am not lucky:

+--------+-----------+---------------+----------+---------+----------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed |
+--------+-----------+---------------+----------+---------+----------+
| 100    | 100       | 192.168.4.23  | 3306     | SHUNNED | 0        |
| 10000  | 100       | 192.168.4.22  | 3306     | ONLINE  | 0        |
| 10000  | 101       | 192.168.4.233 | 3306     | ONLINE  | 718      |
| 100    | 101       | 192.168.4.23  | 3306     | ONLINE  | 0        |
| 10000  | 101       | 192.168.4.22  | 3306     | SHUNNED | 0        |
| 100    | 102       | 192.168.4.23  | 3306     | ONLINE  | 0        |
+--------+-----------+---------------+----------+---------+----------+

it also (randomly) changes the value for HG 102, which impacts the WRITER HG as well and causes a failover. At this point, I stopped testing: too many things were making any failover scenario too unpredictable.

Conclusions

ProxySQL is built on a great concept and certainly fills a much-needed gap in the MySQL ecosystem, optimizing and strengthening the connection layer between the application and the data layer.

But in regard to Galera support, we are not there yet. The support provided is not only limited, it is flawed and can lead to serious and/or unexpected problems. Even with the option writer_is_also_reader=1, the only one worth using, we still see too many issues in how the nodes are managed during serious events such as failover.

ProxySQL v2.1.0 seems to have fixed some of the instabilities, but there are still too many open issues to trust the native Galera support. My advice is to stay away from it and use the scheduler to manage your Galera cluster instead. Write a robust script that covers your specific needs if you must customize the actions; a scheduler will serve you well (a minimal registration sketch follows).
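For reference, wiring such a check script into ProxySQL takes only a few admin statements. A minimal sketch, assuming a hypothetical script path and argument (adapt both to the check script you actually deploy):

-- run from the ProxySQL admin interface
INSERT INTO scheduler (id, active, interval_ms, filename, arg1)
VALUES (1, 1, 2000, '/var/lib/proxysql/galera_check.pl', '--config=/var/lib/proxysql/galera_check.cnf');
LOAD SCHEDULER TO RUNTIME;
SAVE SCHEDULER TO DISK;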

If you are too lazy to write your own, there is a sample in Percona-Lab: the old script used with ProxySQL 1.4.x, modified to work with ProxySQL 2.x. I also wrote one a long time ago that can help as well. Both come without any guarantee, and I advise you to use them as examples for your own; see Part 2 of this post for details.

Finally, let me say that ProxySQL is a great tool, but no tool can cover everything. Those of us who have been around long enough have seen this happen many times, and it comes as no surprise.

Great MySQL to all.

References

https://www.percona.com/doc/percona-xtradb-cluster/LATEST/install/index.html

https://galeracluster.com/

https://proxysql.com/blog/proxysql-native-galera-support/

https://www.percona.com/blog/2019/02/20/proxysql-native-support-for-percona-xtradb-cluster-pxc/

https://proxysql.com/documentation/galera-configuration/

Nov
17
2020
--

Tame Black Friday Gremlins — Optimize Your Database for High Traffic Events


It’s that time of year! The Halloween decorations have come down, the leaves have started to change, and the Black Friday/Cyber Monday buying season is upon us!

For consumers, it can be a magical time of year, but for those of us who have worked in e-commerce or retail, it usually brings up… different emotions. It’s much like the Gremlins — cute and cuddly unless you break the RULES:

  1. Don’t expose them to sunlight,
  2. Don’t let them come in contact with water,
  3. NEVER feed them after midnight!

I love this analogy and how it parallels the difficulties that we experience in the database industry — especially this time of year. When things go well, it’s a great feeling. When things go wrong, they can spiral out of control in destructive and lasting ways.

Let’s put these fun examples to work and optimize your database!

Don’t Expose Your Database to “Sunlight”

One sure-fire way to make sure your persistent data storage cannot do its job, and to effectively kill it, is to let it run out of space. Before entering the high-traffic holiday selling season, make sure you have ample storage to make it all the way to the other side. This may sound basic, but so is keeping a cute, fuzzy pet out of the sunlight; it’s much harder than you think!

Here are some great ways to ensure the storage needs for your database are met (most obvious to least obvious):

  1. If you are on a DBaaS such as Amazon RDS, leverage something like Amazon RDS Storage Auto Scaling
  2. In a cloud or elastic infrastructure:
    1. make sure network-attached storage is extensible on the fly, or
    2. properly tune the database mount point to use logical volume management (LVM) or software RAID, so additional volumes (capacity) can be added on the fly.
  3. In an on-premise or pre-purchased infrastructure, make sure you are overprovisioned by ~25%, even against end-of-season estimates.
  4. Put your logs somewhere other than the main drive. The database may not be happy about running out of log space, but logs can be deleted easily, while data files cannot (see the sketch after this list)!
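On the last point, it pays to know exactly where the server writes before the rush starts, and to remember that binary logs are usually the fastest space to reclaim. A minimal sketch, assuming MySQL:

-- check where data and logs currently live
SELECT @@datadir, @@log_error, @@slow_query_log_file, @@log_bin_basename;

-- emergency space relief: drop binary logs older than three days
-- (only safe if no replica or backup still needs them!)
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 3 DAY;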

Don’t Let Your Database Come in “Contact With Water”

We don’t want to feed simple issues or allow them to multiply. Actions we take to get out of a bind in the near term can cause problems that require more attention in the future, just like when you put water on a Gremlin: it multiplies!

What are some of these scenarios?

  1. Not having a documented plan of action can cause confusion and chaos if something doesn’t go quite right. Having a plan documented and distributed will keep things from getting overly complicated when issues occur.
  2. Throwing hardware at a problem. Unless you know how it will actually fix the issue, it can be like throwing gasoline on a fire, sending your stack into disarray with blocked and unblocked queries. Added hardware also requires database tuning to be effective.
  3. Understanding (or misunderstanding) how users behave when or if the database slows down:
    1. Do users click to retry five times in five seconds causing even more load?
    2. Is there a way to divert attention to retry later?
    3. Can your application(s) ignore retries within a certain time frame?
  4. Not keeping a few well-protected sources of truth, with as much availability as possible:
    1. Have at least one failover candidate
    2. Have off-server transaction storage (can you rebuild in a disaster?)
    3. If you have the two above, then delayed replicas are your friend (see the sketch after this list)!
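On delayed replicas: they are cheap insurance against fat-finger disasters, because the replica applies changes only after a configured lag. A minimal sketch, assuming MySQL 8.0 replication syntax (older versions use CHANGE MASTER TO ... MASTER_DELAY):

-- on the replica: stay one hour behind the source
STOP REPLICA;
CHANGE REPLICATION SOURCE TO SOURCE_DELAY = 3600;
START REPLICA;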

Never “Feed” Your Database After “Midnight”

What’s the one thing that can ensure that all heck breaks loose on Black Friday? CHANGE is the food here, and typically, BLACK FRIDAY is the midnight.

Have you ever felt like there is just one thing that you missed and want to get off your backlog? It could be a schema change, a data type change, or an application change from an adjacent team. The ‘no feeding’ rule is parallel to CODE FREEZE in production.

Most companies start this freeze at the beginning of November, because the most stable production system is the one that is already out there, not the one you have to stabilize after a new release:

  1. Change Management is your friend; change that needs to happen should still have a way to happen.
  2. Observability is also your friend; know in absolute terms what is happening to your database and stack so you don’t throw a wrench in it (Percona Monitoring and Management can help).
  3. Educate business stakeholders on the release or change process BEFORE the event, not DURING the event.
  4. Don’t be afraid to “turn it off” when absolute chaos is happening. A short downtime is better than an unusable site over a longer period of time.

Conclusion

Black Friday, Cyber Monday, and the Holidays can be the most wonderful time of the year — and now that we’ve covered the rules, some of the “Gremlins” can stay small and fuzzy and your business won’t get wrecked by pesky database issues or outages.

How Percona Can Help

Percona experts optimize your database performance with open source database support, highly-rated training, managed services, and professional services.

Contact Us to Tame Your Database Gremlins!
