With the average cost of unplanned downtime running from $300,000 to $500,000 per hour, businesses are increasingly using high availability (HA) technologies to maximize application uptime. Unfortunately, using certain open source database software as part of an HA architecture can present significant challenges. Despite all its upside, PostgreSQL software presents such challenges.
PostgreSQL provides the foundation to start high availability, including features and configurations for physical and logical replication and consistent physical backups with point-in-time recovery (PITR) capabilities. But PostgreSQL does not offer a complete HA solution. Therefore, DBAs must take a piecemeal approach to achieve HA with PostgreSQL, bringing together open source extensions and tools from multiple sources.
This mish-mash of technologies creates the potential for a single point of failure (SPOF) and an interruption of services to end users — the very thing high availability is supposed to prevent. Because most SPOFs are a consequence of scale, the risk of downtime only gets worse as the environment grows. Where a high availability design once worked well, it can no longer keep up with more complex requirements.
This blog highlights considerations for keeping your own PostgreSQL databases highly available and healthy. We’ll examine the intricacies of establishing high availability with PostgreSQL software and provide links to Percona HA PostgreSQL reference architectures for businesses of every size (startup/small, medium, large, and enterprise).
Defining PostgreSQL high availability
High availability is essential for any business that relies on digital interactions — and today, that means just about everyone. The tolerable downtime depends on business size, operations, and objectives, but that downtime must be minimal to minuscule (see the “Measuring high availability” chart below). HA in PostgreSQL databases delivers virtually continuous availability, fault tolerance, and disaster recovery. It enables businesses and organizations to meet uptime and service level objectives, protect business-critical data, and maintain performant databases.
In general terms, to achieve HA in PostgreSQL, there must be:
- Redundancy: To ensure data redundancy and provide continuous performance when a primary server fails, multiple copies of the database reside in replica servers. The same changes made in the primary database are made in the replicas. This ensures continuous business operations. PostgreSQL software supports synchronous streaming replication, asynchronous streaming replication, and logical replication.
- No single point of failure (SPOF): If the failure of a database infrastructure component could cause downtime, that component is considered an SPOF. The existence of even one SPOF means high availability has not been achieved. The majority of SPOFs are a consequence of scale, where designs that were once adequate (excellent, even) no longer meet evolving, more complex requirements. Downtime due to SPOFs can also be attributed to bottlenecks from architectures designed for applications instead of databases.
- Failover: This functionality is all about switching — in both unplanned and planned situations — from a failed primary server to a standby server. When the primary server fails, a standby server takes over as the new primary. Failover is essential to minimizing downtime.
Also, in general terms, a high availability PostgreSQL solution must cover four key areas:
- Infrastructure: This is the physical or virtual hardware database systems rely on to run. Without enough infrastructure (physical or virtualized servers, networking, etc.), there cannot be high availability.
- Topology management: This is the software management related specifically to the database and its ability to stay consistent in the event of a failure.
- Connection management: This is the software management related specifically to the networking and connectivity aspect of the database. Clustering solutions typically bundle with a connection manager. However, in asynchronous clusters, deploying a connection manager is mandatory for high availability.
- Backup and continuous archiving: PostgreSQL asynchronous replication relies on the write-ahead log (WAL) stream, and it’s vital to have centralized storage that all nodes can access. This is the location where the primary node will archive the WAL segments, and the replicas will restore any segment that isn’t present on the primary anymore. It is of extreme importance to keep the replication working if any replication delay happens and the replica node isn’t able to work at the primary’s pace. Those files can also be used for point-in-time recovery if any disaster occurs.
Key considerations for determining high availability needs
Before going into more detail, it’s important to determine whether your workloads and services require an HA architecture and, if so, to what extent your HA should be. Answers to these questions will help you evaluate the need for a high availability PostgreSQL solution based on your business requirements, potential risks, and operational limitations:
- How much downtime can your business tolerate before taking a financial hit, losing customers, or both?
- Does your business’s industry have high availability requirements? If so, what are the legal liabilities?
- How long does it take to recover from a database failure in your current setup? Can you afford that recovery time?
- Can you afford the necessary hardware, software, and operational costs of maintaining a PostgreSQL HA solution?
- Do you have the expertise and resources to design, implement, and manage a high availability solution for PostgreSQL?
Measuring high availability
Depending on the answers to the previous questions, achieving high availability can mean different things for different businesses. It depends on how much downtime an organization can withstand. With streaming services, for example, excessive downtime could result in significant financial and reputational losses for the business. Other organizations can tolerate a few minutes of downtime without negatively affecting end users.
The following table shows the amount of downtime for each level of availability, from “two nines” to “five nines” (the gold standard of HA). You’ll see that high availability doesn’t deliver 100% uptime, but it’s close.
Popularity of PostgreSQL
Before delving further into the inner workings of PostgreSQL high availability, let’s briefly examine the burgeoning popularity of this open source relational database software.
PostgreSQL has rapidly gained favor among professional developers in recent years. StackOverflow statistics show that 26% of developers preferred it in 2017, 34% in 2019, and 40% in 2021. Most recently, in StackOverflow’s 2022 Stack Developer Survey, PostgreSQL took a slight lead over MySQL (46.48% to 45.68%) as the most popular database platform among professional developers.
PostgreSQL is favored among relational database options for its complex data analysis, data science, graphing, and AI-related capabilities. PostgreSQL is known for powerful and advanced features, including synchronous and asynchronous replication, full-text searches of the database, and native support for JSON-style storage, key-value storage, and XML.
PostgreSQL is highly extensible, enabling users to add custom functionality through plug-ins and extensions. It also features tools such as repmgr and Patroni for automatic failover and cluster management.
Being a more advanced database management system, PostgreSQL is well-suited for performing complex queries in a large environment quickly. Because it readily supports failover and full redundancy, it’s often preferred by financial institutions and manufacturers. It’s also preferred for use with geographic information systems (GIS) and geospatial data. PostgreSQL ranks as the fourth most popular database management system (DB-Engines, March 2023).
And it’s open source
Because PostgreSQL software is open source, it’s free of proprietary restrictions that can come with vendor lock-in. Developers can customize the source code and try new applications without a big budget hit. Companies can more easily scale infrastructure — up or down — to meet economic conditions and changing business objectives. With open source software, a business is not trapped into using one provider’s software, support, or services. Instead, the business may design and redesign systems as customer expectations change and business objectives evolve.
Also, with open source, there’s a global community of dedicated volunteers driving the development of PostgreSQL database technology. Open source standards and community support enable developers and DBAs to focus on accelerating PostgreSQL feature creation and enhancing availability, performance, scalability, and security.
How does PostgreSQL high availability work?
Now, let’s go into more detail about high availability with PostgreSQL. Several methods of the aforementioned replication form the backbone of PostgreSQL high availability. They allow for data redundancy, fault tolerance, failover amid disasters, power outages, human incursions — just about any scenario. Those methods include:
Streaming replication
With streaming replication, the entire database cluster is replicated from one server, known as the primary, to one or more standby servers. The primary server continuously streams the write-ahead logs (WAL) to the standby servers, which apply the changes to their own database copies. Streaming replication comes in two modes:
- Synchronous streaming replication: In this mode, the primary server waits for confirmation from at least one standby server before committing a transaction. That confirmation ensures the data is replicated to a certain level of redundancy with minimal data loss.
- Asynchronous streaming replication: This is the default way of replicating data in PostgreSQL setups. In this mode, the primary server does not wait for confirmation before committing to transactions. The upside is enhanced performance, but there can be data loss in the event of a failure.
Logical replication
This data replication method occurs at the logical level of the database objects, such as tables, rather than at the physical level of behind-the-scenes storage. Unlike streaming replication, which replicates the entire cluster, logical replication allows for more control over what data is replicated or not replicated.
The two main components of logical replication are publishers and subscribers. A publisher shares data and a subscriber receives and applies the replicated data. This approach provides more flexibility for replication configuration.
Beyond — and connected to — the replication methods, these components also are essential in high availability:
Failover
As stated, this is when a standby server becomes the new primary server due to various reasons, planned or sudden and unexpected. In PostgreSQL, failover can be triggered manually, or it can be automated — with monitoring systems initiating the action.
Failover includes these steps:
- The monitoring system sends an alert that the primary server is unavailable, triggering the failover process.
- Another server is chosen as the new primary server.
- Applications are redirected to the new primary server, and business-as-usual continues using the new primary server.
Automatic switchover
This method is used to minimize downtown and replace the need for manual intervention. The database system triggers automatic failover the instant that predetermined conditions occur.
With automatic switchover, cluster management tools or outside integrated tools monitor the primary server 24/7, checking that it’s connected and doing 100% of its job. The automatic switchover process kicks in if any failure of the primary server is detected, and a standby server becomes the new primary server.
Cluster management tools
Several cluster management tools, such as Patroni and Replication Manager, simplify the configuration, monitoring, and management of PostgreSQL high availability setups. As introduced above, those tools automate tasks like failover, monitoring, and maintenance, making it easier to maintain a highly available PostgreSQL environment.
Load balancing
High availability often involves distributing the database workload across multiple servers to ensure optimal performance and avoid bottlenecks. PostgreSQL supports various load balancing techniques — including connection pooling and connection routing — to distribute incoming queries and transactions among multiple database servers.
Connection pooling includes the availability of established database connections that can be made by multiple applications. Pgbouncer and pgpool-II are connection managers that accept client connections, manage the connection pool, and direct requests to the appropriate backend server.
Proxy load balancers like pgpool-II, pgbouncer, and HAProxy have the capacity to route connections to different database servers. Such a proxy acts as the traffic cop between the applications and the database servers. Steered by load balancing configurations, the proxy forwards requests from applications to the appropriate backend server. Pgpool-II has the extra capacity of a seamless split of reads and writes, while the other two have to do it using a separate port, and the read/write split has to be managed by the application.
DNS load balancing distributes application connections across multiple PostgreSQL database servers using the Domain Name System (DNS). With DNS load balancing, DNS records with the same hostname but different IP addresses are configured. When applications connect to the PostgreSQL database, the DNS server connects the app to available IP addresses in some kind of prioritized or revolving manner. This method is fairly easy to set up, but it’s not so intuitively or easily adjusted to meet changing server setup configurations.
Monitoring and alerting
To maintain high availability in PostgreSQL databases, it is crucial to have monitoring and alerting mechanisms in place. There are multiple tools available for monitoring the health and performance of PostgreSQL instances. Those tools ensure that notifications are sent as issues and potential issues are detected.
Bringing it all together in PostgreSQL architectures
When implemented successfully, the components we’ve examined can make for successful high availability in PostgreSQL databases. But, as stated, the PostgreSQL community version does not come with ready-to-go HA. It takes work, the right architecture, and often some outside help.
Let’s start with the architectures.
There’s an unfortunate misconception that high availability solutions are too expensive for startups and small businesses. But when using open source tools, coupled with an architecture such as the one offered by Percona, high availability on PostgreSQL can be achieved without a big price tag or the need for an overly complex environment. It can be done by building the HA infrastructure within a single data center.
As your business grows, so should your high availability architecture. For medium and large businesses, the consequences of downtime, both in terms of lost revenue and erosion of customer sentiment, can be significant. High availability requires more fault-tolerant, redundant systems and probably larger investments in IT staff. Still, when using open source tools, high availability can be achieved cost-effectively and without the threat of vendor lock-in that can come from paid enterprise SQLs. For medium and large businesses, Percona provides an architecture that spreads availability across data centers to add more layers of availability to the cluster.
For enterprises, the challenges and potential consequences increase exponentially. An architecture must address a lot more. The Percona architecture, for example, features two disaster recovery sites and adds more layers to the infrastructure in order to stay highly available and keep the applications up and running. This architecture, based on tightly coupled database clusters spread across data centers and availability zones, can offer an HA level up to 99.999% when using synchronous streaming replication, the same hardware configuration in all nodes, and fast internode connection.
You can get more details — and view actual architectures — at the Percona Highly Available PostgreSQL web page or by downloading our white paper, Percona Distribution for PostgreSQL: High Availability With Streaming Replication. But for starters, here are elements of a minimalist high availability architecture for PostgreSQL. Such architecture will include at least these four components in the design:
- Database nodes: In PostgreSQL high availability architectures — such as those depicting streaming replication, logical replication, or shared-disk clustering — multiple data nodes distribute the workload and maintain data consistency. Typically, each data node contains a copy or subset of the data.
- HA controller: This mechanism identifies when a failure of the primary node occurs and makes sure that there’s not a network partition/split brain. This same component needs a mechanism to perform the failover in case of failure on the primary node. It will elect one of the replicas to be promoted as the new primary node.
- Connection router: This mechanism changes the query routing so that application requests reach the new primary node.
- Backup component: This software or mechanism provides fast, consistent, and reliable backup and restores for PostgreSQL and also guarantees that all WAL logs are shipped to a universally accessible storage area.
Again, the descriptions immediately above are for a minimalist HA architecture. It can, and usually does, get a lot more complicated. Percona provides proven architectures to get you going, and we offer cost-effective options for help.
Steps to deploying and maintaining PostgreSQL high availability
Once the method of replication is determined and the architecture is designed, it’s time to deploy it. As with the architecture itself, deployment can be easier and more cost-effective when enlisting high availability support for PostgreSQL from outside experts. It depends on what expertise you have on staff.
Every database environment is different, so deployment procedures can vary, but here are some general steps:
- Configure the primary server. This server, the primary read-write node, will orchestrate data replication to standby servers.
- Create a primary server backup on the standby server or servers.
- Configure the standby server to run in hot standby mode. The hot standby allows the standby server to be used for reads. It prevents complications and losses in the event of a failure or scheduled maintenance of the primary server.
- Configure load balancing. Using pgpool-II, HAProxy, or other PostgreSQL load balancing tools, your database will be ready for high traffic and to distribute read queries across multiple servers.
- Implement backup and disaster recovery. Since HA alone does not guarantee data protection, you should implement mechanisms for data durability and disaster recovery.
- Regularly schedule backups and test the restore process to ensure data integrity.
- Test the setup. Here are a couple options. Ideally, both could be used.
- End-to-end (E2E) testing, though time-consuming, will show whether all components and applications of your HA setup work as intended.
- Chaos engineering is another option. This is when engineers introduce problems throughout the database infrastructure, so they can identify failure points and create solutions — avoiding costly downtime and potential loss of frustrated customers.
Security considerations for high availability PostgreSQL
Keeping databases secure demands attention, of course. Yet again, it should be emphasized that every environment is unique, and specific security requirements will vary accordingly. It’s important to check out PostgreSQL documentation. In some cases, if there is no on-staff expertise to maintain adequate security levels of your high availability cluster, it’s wise to consider support.
But whether you can keep your environment secure on your own or need that outside help, there are some general best practices for securing a cluster in a high availability environment:
Configure SSL/TLS encryption of communication with the database cluster. You can disable unnecessary procedures and make it so connections occur only across trusted networks.
Control access. Implementing role-based access control (RBAC) for PostgreSQL can be a helpful way of managing permissions and restricting access to sensitive data.
Use strong authentication. That means strong user passwords that are regularly changed. In more sensitive situations, it’s advisable to implement more secure methods like certificate-based authentication or LDAP integration. PostgreSQL supports a variety of external authentication methods, including GSSAPI, LDAP, and RADIUS.
Audit, log, and regularly review. You can configure PostgreSQL to log queries, failed logins, administrative activity, and any other events. Then, you can examine the logs at any time to spot potential security breaches or suspicious activity.
Conduct regular updates. Apply the latest PostgreSQL security patches and bug fixes ASAP when available. Stay active, or at least observant, as a participant in the global open source community. Stay informed about the latest PostgreSQL security information, including news about vulnerabilities and recommended patches.
Percona high availability architectures and support
There’s no getting around it; achieving high availability with PostgreSQL takes some work — whether you do it or have someone else do it. But if you’d like to avoid or ditch the piecemeal approach to high availability described earlier, Percona HA PostgreSQL architectures pull together open source tools and components to enhance and harden interoperability, ensuring your applications always have access to the data they need. You can implement an architecture yourself, call on Percona as needed, or have us manage your PostgreSQL HA database completely.
Regardless of the level of support you choose (and you can change your mind at any time), you’ll have a proven architecture, and you’ll avoid vendor lock-in, potentially exorbitant costs, and other pitfalls of going with proprietary arrangements.
A good place to start is our Highly Available PostgreSQL web page. There, you’ll find ready-to-use PostgreSQL high availability architectures designed specifically and separately to address the needs of small businesses and startups, medium and large businesses, and enterprise and globally distributed businesses.
Also, check out our white paper — Percona Distribution for PostgreSQL: High Availability With Streaming Replication. It will show you how to pull together open source tools and components for achieving the highest levels of HA on PostgreSQL, even as you scale. The white paper also provides the ready-to-use architectures.
High Availability PostgreSQL from Percona
Download white paper: Percona Distribution for PostgreSQL: High Availability With Streaming Replication
FAQs
The following are commonly asked questions and short answers about high availability in PostgreSQL databases. More detailed answers are presented in the sections above.
What is the difference between high availability and disaster recovery?
High availability provides continuous access to a PostgreSQL database through replication, failover mechanisms, and the absence of single points of failure. Disaster recovery is about restoring data and services after a catastrophic event by using backups, replication, and recovery procedures.
Can high availability architecture eliminate all downtime?
High availability doesn’t deliver 100% uptime, but it’s close. The gold standard of HA is “five nines,” which means a database is available 99.999% of the time.
What are common challenges when implementing high availability architecture?
The PostgreSQL community version doesn’t come with high availability features and functionality out of the box. That often requires a piecemeal approach to achieving HA. DBAs must bring together a few open source extensions and tools from multiple sources.
What technologies can be used for high availability architecture?
A multitude of technologies must be used together to achieve high availability in a PostgreSQL database. Essential technologies include replication, failover, load balancing, and cluster management.
How often should high availability architecture be tested and validated?
The frequency of testing and validation depends on several factors, including what’s at stake for the business and the complexity of the HA architecture. Certainly, a quarterly in-depth check is advisable, and for more critical situations, monthly might might be advisable..