Dec 29, 2023

Human Factors Behind Incidents: Why Settings Like “idle_session_timeout” Can Be a Bad Idea


PostgreSQL 14 introduced the parameter idle_session_timeout, and, unfortunately, many DBAs jumped to use it without understanding, or while ignoring, the consequences. In a short span of time, it has become one of the most misused parameters in many PostgreSQL installations. There is nothing wrong with idle_session_timeout from a technical perspective; even without this parameter, I know frustrated DBAs who run scheduled shell scripts to keep terminating old idle sessions. The new parameter just made it easier for them to nuke all idle connections. The problem is the collateral damage it can cause.

Note for novice DBAs: There are human and organizational factors behind almost every incident, not just technical ones. Trying a quick-and-dirty fix by changing parameters can have far-reaching implications. Not everything can be addressed on the database side. This is especially evident in connection-related incidents. Blame games and bad decisions following one incident/outage can lead to further bad decisions and further problems.

For example, many such incidents will have a common root cause: something like an unexpected, sudden connection explosion — a big bang of connections, with hundreds or thousands of idle connections. The RCA investigation may end with only the technical side of the problem, such as loose limits that invited disaster. But the right questions to address would be: Why do DBAs end up setting such loose limits? What are the factors that influenced and forced DBAs to do so? Unfortunately, this gets addressed very rarely in many organizations.

Common factors contributing to connection-related outages

1. Unrestricted connection settings: Leaving connection settings, such as max_connections and per-user limits, wide open exposes the system to malicious Denial-of-Service (DoS) attacks. These attacks can rapidly consume available connections, leaving legitimate users unable to connect and disrupting system functionality.

2. Inadequate application-side pooling: Suboptimal application-side pooling can lead to opening a large number of connections, and they may remain open and unused, consuming valuable resources. These connections can accumulate, eventually exhausting server resources and causing the system to become unresponsive.

3. Unexpected connection leaks: Unanticipated connection leaks can also arise from programming errors or faulty application logic. These leaks can quickly deplete the available connection pool, hindering system performance and potentially leading to crashes.

4. Human factor: In an ideal case, there must be stringent restrictions and enforcement on what load and how many connections can be onboarded to the database. Scrutiny should be as tight as airline security.

The problem generally begins as a conflict of interests. App-Dev often demands unrestricted, uncontrolled access to the database. A “NO” from the Ops/DBAs at the right time may save the entire system from outages. However, gatekeeping is not an easy job. Pressure builds up, and DBAs give up in most cases and allow settings that the database server cannot accommodate. The first decision goes wrong here. Obviously, an incident is expected down the line.

5. Organizational factor of wrong decisions: A surprising set of wrong decisions generally happens after the investigation following major incidents. The investigation may point fingers at an excessive number of connections, and it is common for Ops/DBAs to be made responsible for managing the “idle” connections — because “it is a database problem”. I would say the very wrong person takes charge.

The point forgotten is that connections are owned by clients but maintained by the database. It’s like a bank account: the customer owns it, but the bank maintains it. We shouldn’t be terminating or closing connections from the database side. Instead of investigating who owns them and addressing the problem at its root, DBAs often go for shortcuts like idle_session_timeout or a script to terminate all “idle” connections. The boundaries of ownership are violated, inviting the next level of problems.

There is a lot of misunderstanding about “idle” connections among the user community. We should remember that a connection will be “idle” immediately after it is established. It becomes “active” when there is an SQL statement to execute and goes back to “idle” when the processing is over. Most of the connections from any connection pooler will also be “idle” most of the time. So, there is nothing wrong with a connection being “idle”. The point I want to make is that a connection appearing as “idle” doesn’t make it a candidate for termination/killing. This may sound too obvious to an experienced user, but the reality is that I have ended up sitting on emergency calls to explain it.

What matters is the number of connections. The right question to ask in an RCA call is, “Who is responsible for creating the large number of connections that are idle most of the time?”
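A quick way to answer that question before the call even starts is to group the standard pg_stat_activity view by the connection’s owner; a minimal sketch:

-- who owns the idle connections, and how long have they been idle?
select usename, application_name, client_addr,
       count(*) as idle_connections,
       max(now() - state_change) as longest_idle
from pg_stat_activity
where state = 'idle'
group by usename, application_name, client_addr
order by idle_connections desc;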

Common symptoms of poor connection management

  1. Gradual server performance degradation due to the high number of connections
  2. Often, the server becomes unresponsive and/or crashes
  3. The server is running out of memory, and OOM killer terminates PostgreSQL
  4. The server is running out of available connections
  5. Applications reporting connection failures and unexpected connection drops

Problems of backend termination

I often get the question, “What is the problem with terminating connections from backends?” In layman’s terms, it is like a bank closing your account and taking all the money in it. That can come as a big surprise to the database client/application. Most applications are not designed and tested to handle such surprises. Even psql is caught by surprise:

postgres=> \l+
FATAL:  terminating connection due to idle-session timeout
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

So unless the application is designed, developed, and tested for handling unexpected connection drops, it won’t be able to handle this. The majority of applications just exit with error reporting. In critical systems, this causes serious outages for business.

Despite being tailored for PostgreSQL, pgBouncer is not immune to sudden connection terminations:

2023-11-14 16:17:21.967 UTC [4630] WARNING S-0x10b7940: postgres/authuser@127.0.0.1:5432 got packet 'E' from server when not linked
2023-11-14 16:17:21.967 UTC [4630] LOG S-0x10b7940: postgres/authuser@127.0.0.1:5432 closing because: server conn crashed? (age=218s)
2023-11-14 16:17:22.927 UTC [4630] LOG stats: 0 xacts/s, 0 queries/s, 0 client parses/s, 0 server parses/s, 0 binds/s, in 0 B/s, out 11 B/s, xact 0 us, query 0 us, wait 12 us
2023-11-14 16:17:40.533 UTC [4630] LOG S-0x10b7bf8: postgres/pgbtestapp@[::1]:5432 closing because: server conn crashed? (age=236s)
2023-11-14 16:17:40.533 UTC [4630] LOG C-0x10af130: postgres/pgbtestapp@[::1]:39470 closing because: server conn crashed? (age=28s)
2023-11-14 16:17:40.533 UTC [4630] WARNING C-0x10af130: postgres/pgbtestapp@[::1]:39470 pooler error: server conn crashed?
2023-11-14 16:18:06.370 UTC [4630] LOG S-0x10b7bf8: postgres/authuser@127.0.0.1:5432 new connection to server (from 127.0.0.1:46632)

Even if we ignore all other problems, the connection pooler will end up re-establishing pretty much every connection in the pool. This completely destroys the very purpose of the connection pooler.

It is not that the PostgreSQL community is not aware of this possible abuse of parameters. This has been discussed in the community, and the implications on the pooler are known and well documented.

However, users still overlook the implications, and serious mess-ups happen.

From the discussions in the PostgreSQL community, the only strong argument in favor of this parameter is one-off users who directly log in to the database to execute custom statements. They may accidentally leave their sessions open for a long time, which causes damage.

There were discussions about making the documentation more explicit, along these lines: consider setting this for specific users instead of as a server default. Client connections managed by connection poolers, or initiated indirectly like those from a remote server using postgres_fdw, should probably be excluded from this timeout.

What can we do about it without causing any serious consequences?

Obviously, DBAs may have questions like: what if user accounts or application connections really are abusive? As mentioned above, the solution is not just technical; understanding the system and implementing appropriate strategies is what matters.

Segregation of database accounts and role-based settings

Accounts used by individual users for interactive logins need to be differentiated from the accounts used by applications — the service accounts — in terms of user/role-level settings. PostgreSQL allows us to set parameters at different levels and scopes. More stringent restrictions should be placed on the interactive login accounts. It is very much OK to have settings like idle_session_timeout for those accounts, preferably not exceeding five minutes. Most importantly, idle_in_transaction_session_timeout for them should not exceed a few seconds.
On the other hand, I would recommend NOT using idle_session_timeout for service accounts. Using idle_in_transaction_session_timeout of a couple of minutes is acceptable, though, as it helps us find problematic application logic.
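A minimal sketch of such role-level settings (the role names here are hypothetical; adjust the timeouts to your policy):

-- interactive login account: tight limits
alter role jane_interactive set idle_session_timeout = '5min';
alter role jane_interactive set idle_in_transaction_session_timeout = '15s';

-- application/service account: no idle_session_timeout,
-- but a short idle_in_transaction_session_timeout to expose problematic logic
alter role app_service reset idle_session_timeout;
alter role app_service set idle_in_transaction_session_timeout = '2min';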

Use network timeouts

There are alternative approaches if we suspect that connections are simply leaked, with no application actually holding them. TCP timeouts could be the best solution for such cases. PostgreSQL exposes TCP-related timeouts as parameters, which can be set like any other PostgreSQL parameter.

tcp_keepalives_idle: This parameter specifies the number of seconds of inactivity after which a keepalive message will be sent to the client by the PostgreSQL server. For example, if we set a value of 120, the server will send a keepalive message to the client after two minutes of inactivity and wait for a response. If there is an acknowledgment from the client, the connection is validated. What if there is no acknowledgment from the client? That is where the next parameter helps.

tcp_keepalives_interval:  This is the number of seconds after which the TCP keepalive message will be retransmitted if there is no acknowledgment from the client. For example, setting a value of 20 results in sending keepalive packets every 20 seconds until an acknowledgment is obtained. What if there is no acknowledgment for any of the keepalive attempts? This is where the next parameter comes into play.

tcp_keepalives_count: This is the number of keepalive messages that can be lost before the client connection is considered “dead” and terminated.

So, the overall settings summary discussed is:

tcp_keepalives_idle = 120
tcp_keepalives_interval = 20
tcp_keepalives_count = 6

But you may adjust the values according to the needs of your environment.
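For example, the summary above could be applied cluster-wide like this (new sessions pick up the values after a reload):

alter system set tcp_keepalives_idle = 120;
alter system set tcp_keepalives_interval = 20;
alter system set tcp_keepalives_count = 6;
select pg_reload_conf();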

There is one more parameter, tcp_user_timeout, which comes with documented caveats and reportedly affects the behavior of the other keepalive parameters. So this is not something we should consider on day one.

From PostgreSQL 14 onwards, we have an additional option: client_connection_check_interval, which can detect whether the connection to the client has gone away. This is most useful if a client dies during a long-running statement, like those complex reporting queries on OLAP systems. By default, this check is off (value 0). Setting it to 5 to 10 seconds is of great value if you expect client connectivity failures. Please note that the value is specified in milliseconds. PostgreSQL allows us to set these values even at the user level.
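For instance, an account used for long reporting queries (the role name here is hypothetical) could get its own check interval:

-- time units are accepted; '10s' is equivalent to 10000 milliseconds
alter role reporting_user set client_connection_check_interval = '10s';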

With all TCP timeouts in place, you may see entries like the following in the PostgreSQL logs when a timeout occurs.

2023-12-20 04:37:12.026 UTC [1230] LOG:  could not receive data from client: Connection timed out
2023-12-20 04:37:12.026 UTC [1230] LOG:  disconnection: session time: 0:22:23.419 user=jobin database=postgres host=fd42:8aba:4133:fb7d:216:3eff:fe3c:7bcb port=52984

Summary

There should be a cap on what a system can accept as a load. At the PostgreSQL level, it is the max_connections. I would consider it the primary duty of any Production DBA to protect the system from any possible abuse.

Dear DBAs, please consider the following points to keep the system stable and protect it from abuse.

  1. Configure connection settings appropriately. Set connection limits, such as max_connections and per-user limits, to reasonable values that align with expected usage patterns. Regularly review and adjust these settings as needed. The max_connections setting of a PostgreSQL instance should not be too far from 10x the number of available CPUs.
  2. Encourage the use of robust application-side pooling: Implement well-configured application-side pooling mechanisms to manage connections and prevent leaks efficiently. Employ libraries or frameworks that provide reliable pooling capabilities. In case the application framework doesn’t have a good connection pooler, consider external connection poolers like pgBouncer.
  3. Connections are owned by those who create them. If there is an excessive number of idle connections, the part of the system causing them to be created needs to be fixed. Terminating them silently from the backend is not a solution; it invites more problems.
  4. Never terminate connections from the backend unless there is a request to do so from the application/client who owns the connections.
  5. Avoid using parameters like idle_session_timeout at the global level; use it only at the user/role level, especially for interactive login accounts.
  6. Avoid idle_session_timeout for application/service accounts. It can be used for accounts that are used for interactive logins where the consequence of timeout is predictable.
  7. Use the alternate options discussed above (TCP keepalives, client_connection_check_interval) to handle leaked connections.
  8. Monitor the connections; that information is precious.

And most importantly: empower yourself or your DBAs with the ability to say NO to connection abuse. Junior DBAs are more vulnerable to pressure from other parts of the organization to take the wrong steps.


Dec 28, 2023

Introducing the MySQL Audit Log Filter Plugin


We’re happy to introduce Audit Log Filter — our newly upgraded audit plugin. Its functionality has been significantly improved in comparison to Audit Log and currently mirrors the functionality of the MySQL Enterprise Audit plugin. Starting with Percona Server for MySQL 8.0.34-26, the Audit Log Filter is available in a technical preview mode. You’re welcome to try it out and share your feedback with us.

New functionality

See what you can do with the Audit Log Filter plugin: 

  • Configure rule-based auditable events filtering. You can filter events based on a user account, audit event class name, audit event subclass name, and audit event fields (database name, table name, operation status, and so on).
  • The plugin contains built-in functions for its own configuration. Use these functions to create, modify, and remove filtering rules and assign them to user accounts.
  • Create rules to replace sensitive data in SQL statements written to the log.
  • Block events that match specific criteria.
  • Write information received with an SQL query via Query Attributes to the audit log.
  • Observe the following information in the optional data fields:
    • query execution time 
    • the number of bytes sent or received 
    • the number of rows returned to the client, and 
    • the number of examined rows 

Note: This information can be printed on the audit log along with the audit event data.

  • Encrypt the audit log files by using AES-256 encryption.
  • Compress the audit log files to reduce the storage space occupied by log files.
  • Dynamically enable or disable the auditing. A server restart is not required to add or adjust existing filtering rules.

Install the plugin

To install Audit Log Filter, proceed with the following steps:

  1. Locate the audit_log_filter_linux_install.sql script in the share directory of your Percona Server for MySQL installation. 
  2. Run the script as follows:
$ mysql -u root -p < audit_log_filter_linux_install.sql

The script creates the audit_log_filter and audit_log_user tables in the MySQL database and installs the plugin. You will need these tables to configure the plugin later. They are empty right after installation, so the plugin cannot execute any tasks immediately.

Configure the plugin

The audit_log_filter plugin uses JSON-encoded rules to configure its own behavior. To start working with the plugin, we suggest you create a filtering rule for logging all auditable events. Assign it as default. Filtering rules may be manipulated using an SQL interface based on function calls.

SELECT audit_log_filter_set_filter('log_all', '{"filter": {"log": true}}');
SELECT audit_log_filter_set_user('%', 'log_all');

The filter assigned to % is used for connections from any account that has no explicitly assigned filtering rule. Once the configuration above is done, the plugin starts logging all auditable events into the audit_filter.log log file located in the data directory by default.
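Filtering does not have to be all-or-nothing. As an illustrative sketch (the rule name and account below are made up; the JSON format mirrors MySQL Enterprise Audit), a rule that logs only connection-class events for one account could look like this:

SELECT audit_log_filter_set_filter('log_connections', '{"filter": {"class": {"name": "connection"}}}');
SELECT audit_log_filter_set_user('app_user@%', 'log_connections');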

We hope you’re excited to try the new plugin out! See the recommended documentation for more information about its capabilities and usage examples.

Recommended documentation

Audit Log Filter in the Percona Server for MySQL guide

MySQL Enterprise Audit in the Oracle guide


Dec 27, 2023

Cloud Native Predictions for 2024


The evolution of cloud-native technology has been nothing short of revolutionary. As we step into 2024, the cornerstone of cloud-native technology, Kubernetes, will turn ten years old. It continues to solidify its position, and the market is anticipated to reach USD 5,575.67 million by 2028, with a forecasted Compound Annual Growth Rate (CAGR) of 18.51% in the coming years, as reported by Industry Research Biz.

The Cloud Native landscape continues to encompass both micro-trends and IT macro-trends, influencing and transforming the way businesses operate and deliver value to their customers.

As we at Percona wind down 2023 and look into what the next year holds, our attention is drawn to the cloud-native landscape and how it is maturing, growing, and evolving. 

KubeCon NA 2023 recap

The theme for KubeCon NA was very clear — AI and Large Language Models (LLMs). Keynotes were focused on how Kubernetes and Cloud Native help businesses embrace the new AI era. And it is understandable, as Kubernetes slowly becomes what it is intended to be – the Platform.

The field of Platform Engineering has witnessed significant advancements, as evidenced by the publication of the CNCF platform whitepaper and the introduction of a dedicated Platform Engineering day at the upcoming KubeCon event. At Percona, we observe a growing trend among companies utilizing Kubernetes as a means to offer services to their teams, fostering expedited software delivery and driving business growth.

Declarative GitOps management, with ArgoCD and Flux, is the community’s way of adding orchestration on top of orchestration. In our conversations with developers and engineers during the conference, we confirmed the CNCF GitOps Microsurvey data — 91% are already using GitOps.

According to the Dynatrace Kubernetes in the Wild 2023 report, a significant 71% (with 48% year-over-year growth!) of respondents are currently utilizing databases in Kubernetes (k8s).  This finding aligns with the observations made at the Data on Kubernetes (DoK) day, where discussions surrounding this topic transitioned from niche, tech-oriented conversations a year ago to more widespread, enterprise-level interest in adopting diverse use cases. These indicators suggest that the adoption of databases on k8s is in its early stages and is likely to continue growing in the future.

Predictions

Multi-cloud is a requirement

While this wave has been building for years, in 2024, we expect it to peak. According to a 2023 Forrester survey commissioned by Hashicorp, 61% of respondents had implemented, were expanding, or were upgrading their multi-cloud strategy. We expect that number to rise higher in 2024.

Nearly every vendor at Kubecon and every person we spoke to had some form of a multi-cloud requirement or strategy. Sometimes, this comes from necessity through acquisition or mergers. Oftentimes, it is a pillar of modern infrastructure strategy to avoid cloud vendor lock-in. At this point, it is ubiquitous, and if it is not part of your strategy, you are falling behind.

The business value of adopting this strategy is multi-fold:

  • Freedom from vendor lock-in, which leads to increased negotiating power
  • Agility in capitalizing on cloud-vendor advancements to innovate faster
  • Improved application and database architecture RPO and RTO
  • Adhering to security and governance requirements of customers

Percona’s Operators for MySQL, MongoDB, and PostgreSQL are designed with this value in mind. We want adopters of our technology to be able to deploy their critical open source databases and applications across any public or private cloud environment. All of the database automation for running a highly available, resilient, and secure database is built into the operator to simplify the operation and management of your clusters. 

Simplify and secure

Looking through various State of Kubernetes reports (VMWare, RedHat, SpectroCloud), it becomes clear that Complexity and Security are the top concerns for platform engineering teams.  

Simplification might come from different angles. Deployment is mostly solved already, whereas management and operations are still not. We expect to see various tooling and core patches to automate scaling, upgrades, migrations, troubleshooting, and more. 

Operators are an integral part of solving the complexity problem, where they take away the need for learning k8s primitives and application configuration internals. They also remove toil and allow engineers to focus on application development vs platform engineering work. Not only will new operators appear, but existing operators will mature and provide capabilities that meet or exceed managed services that users can get on public clouds. 

The latest report on Kubernetes adoption, security, and market trends in 2023 revealed that 67% reported delaying or slowing down deployment due to Kubernetes security concerns. Additionally, 37% of respondents experienced revenue or customer loss due to a container/Kubernetes security incident.

Considering the open source software vulnerability as one of the top concerns and the rapid increase in supply chain attacks (the SolarWinds attack and vulnerabilities like Log4Shell and Spring4Shell), along with container and Kubernetes strategies, there’s a growing emphasis on cybersecurity and operational understanding in development. 

Another significant issue within security concerns is the escalating complexity of modern systems, especially in platforms like Kubernetes, which highlights the need for unified threat models and scanning tools to address vulnerabilities. Standardization and collaboration are key to sharing common knowledge and patterns across teams and infrastructures, as is creating repositories of memory-safe patterns for cloud systems to improve overall security.

A majority of RedHat’s security research respondents have a DevSecOps initiative underway. Most organizations are embracing DevSecOps, a term that covers processes and tooling enabling security to be integrated into the application development life cycle rather than treated as a separate process. However, 17% of organizations operate security separately from DevOps, lacking any DevSecOps initiatives. Consequently, they might miss out on the benefits of integrating security into the SDLC, such as enhanced efficiency, speed, and quality in software delivery.

AI and MLOps

Kubernetes has become a new web server for many production AI workloads, focusing on facilitating the development and deployment of AI applications, including model training. The newly formed Open Source AI Alliance, led by META and IBM, promises to support open-source AI. It comprises numerous organizations from various sectors, including software, hardware, nonprofit, public, and academic. The goal is to collaboratively develop tools and programs facilitating open development and run scalable and distributed training jobs for popular frameworks such as PyTorch, TensorFlow, MPI, MXNet, PaddlePaddle, and XGBoost.

While integrating AI and machine learning into cloud-native architectures, there’s an increasing demand from users for AI to be open and collaborative. The emergence of trends stemming from ‘AI Advancements and Ethical Concerns’ cannot be ignored.

Addressing ethical concerns and biases will necessitate the implementation of transparent AI frameworks and ethical guidelines during application development. Customers will increasingly prioritize AI efficiency and education to tackle legal and ethical concerns. This marks the end of an era of chaos, paving the way for efficiency gains, quicker innovation, and standardized practices.

Conclusion

At Percona, we prioritize staying ahead of market trends by adhering to industry best practices and leveraging our team’s expertise.

We’ve always made sure to focus on security in our software development, and weaving multi-cloud deployment into our products has been a crucial part of our strategy. Our commitment to open source software drives us to take additional precautions, ensuring operational security through best practices and principles such as least privilege, security in layers, and separation of roles/responsibilities through policy and software controls. And with multi-cloud in mind, we consistently incorporate new sharding functionalities into our roadmaps, such as the upcoming shard-per-location support in the Percona Operator for MongoDB.

At the same time, we are not hesitating to rock the cloud-native community by incorporating top-notch features to address any new rising trends. You mentioned ‘More Simple Kubernetes’? Well, here we are – with storage autoscaling for databases in Kubernetes, slated for release in Q1, 2024 after a year of hard work. This fully automated scaling and tuning will enable a serverless-like experience in our Operators and Everest. Developers will receive the endpoint without needing to consider resources and tuning at all. It’s worry-free and doesn’t require human intervention.

Finally, the rising popularity of generative AI and engines like OpenAI or Bard has prompted our team to bring vector-handling capabilities to Percona-powered database software by adding support for the pgvector extension.

Our team always focuses on innovation to accelerate progress for everyone, and we will continue to push the boundaries further for our community and the rest of the world.


Dec 26, 2023

Grafana Dashboards: A PoC Implementing the PostgreSQL Extension pg_stat_monitor


This PoC demonstrates how to install and configure pg_stat_monitor in order to extract useful and actionable metrics from a PostgreSQL database and display them on a Grafana dashboard.

About the environment

  • Grafana: version 10.0.0
  • Grafana database backend: Prometheus version 2.15.2+d
  • PostgreSQL version 13
  • pgbench version 13

In order to investigate the potential opportunities for implementing constructive and useful metrics derived from PostgreSQL into Grafana, it is necessary to generate load using pgbench.

Configuring Grafana

For our purposes, the Grafana datasource used in this PoC is also the Postgres data cluster that is generating the data to be monitored.

(Figure: Grafana pgbench data source)

pg_stat_monitor

About

pg_stat_monitor is a Query Performance Monitoring tool for PostgreSQL. It collects various statistics data such as query statistics, query plan, SQL comments, and other performance insights. The collected data is aggregated and presented in a single view. 

pg_stat_monitor takes its inspiration from pg_stat_statements. Unlike pg_stat_statements, which aggregates its metrics from the last time it was zeroed, pg_stat_monitor can bucket its output into a set number of aggregated time intervals, thus saving users the effort of doing it themselves.
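For example, per-bucket metrics for a single normalized statement can be pulled straight from the view (the column names appear in the view definition shown later in this post):

-- execution time of one query, bucket by bucket
select bucket, bucket_start_time, calls, total_exec_time
from pg_stat_monitor
where query like 'UPDATE pgbench_accounts%'
order by bucket_start_time;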

pg_stat_monitor tracks the following operations:

  • statements
  • queries
  • functions
  • stored procedures and other non-utility statements

Features

  • Time Interval Grouping: Instead of supplying one set of ever-increasing counts, pg_stat_monitor computes stats for a configured number of time intervals – time buckets. This allows for much better data accuracy, especially in the case of high-resolution or unreliable networks.
  • Multi-Dimensional Grouping: While pg_stat_statements groups counters by userid, dbid, queryid, pg_stat_monitor uses a more detailed group for higher precision. This allows a user to drill down into the performance of queries.
  • Capture Actual Parameters in the Queries: pg_stat_monitor allows you to choose whether you want to see queries with placeholders for parameters or the actual parameter data. This simplifies debugging and analysis by enabling users to re-execute the same query with real values.
  • Query Plan: Each SQL is now accompanied by its actual plan that was constructed for its execution. That’s a huge advantage if you want to understand why a particular query is slower than expected.
  • Tables Access Statistics for a Statement: This allows us to easily identify all queries that accessed a given table. This set is at par with the information provided by the pg_stat_statements.
  • Histogram: Visual representation is very helpful as it can help identify issues. With the help of the histogram function, one can now view a timing/calling data histogram in response to an SQL query. And yes, it even works in psql.

Installation (example: CENTOS8, pg14)

The simplest way to get pg_stat_monitor is to install it via Percona Distribution for PostgreSQL.

The following instructions demonstrate installing Percona Distribution for PostgreSQL and pg_stat_monitor on a CentOS 8 Linux distribution:

# Install The Percona Repository
dnf install -y https://repo.percona.com/yum/percona-release-latest.noarch.rpm
percona-release setup ppg14

# Install The postgres Community Repository
dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm
dnf update -y
dnf install -y pg_stat_monitor_14

# perform standard initialization and systemd configurations
/usr/pgsql-14/bin/postgresql-14-setup initdb

# configure postgres to use pg_stat_monitor
echo "
shared_preload_libraries=pg_stat_monitor
" >> /var/lib/pgsql/14/data/postgresql.auto.conf

# complete postgres configuration
systemctl enable postgresql-14
systemctl start postgresql-14

Create extension

The pg_stat_monitor extension can be created in any database, but for the purposes of this PoC, it is placed in the database pgbench.

postgres=# create database pgbench;
postgres=# \c pgbench
pgbench=# create extension pg_stat_monitor;

pgbench=# \d
             List of relations
 Schema |      Name       | Type |  Owner   
--------+-----------------+------+----------
 public | pg_stat_monitor | view | postgres

View "public.pg_stat_monitor"
       Column        |           Type           | Collation | Nullable | Default 
---------------------+--------------------------+-----------+----------+---------
 bucket              | bigint                   |           |          | 
 bucket_start_time   | timestamp with time zone |           |          | 
 userid              | oid                      |           |          | 
 username            | text                     |           |          | 
 dbid                | oid                      |           |          | 
 datname             | text                     |           |          | 
 client_ip           | inet                     |           |          | 
 pgsm_query_id       | bigint                   |           |          | 
 queryid             | bigint                   |           |          | 
 toplevel            | boolean                  |           |          | 
 top_queryid         | bigint                   |           |          | 
 query               | text                     |           |          | 
 comments            | text                     |           |          | 
 planid              | bigint                   |           |          | 
 query_plan          | text                     |           |          | 
 top_query           | text                     |           |          | 
 application_name    | text                     |           |          | 
 relations           | text[]                   |           |          | 
 cmd_type            | integer                  |           |          | 
 cmd_type_text       | text                     |           |          | 
 elevel              | integer                  |           |          | 
 sqlcode             | text                     |           |          | 
 message             | text                     |           |          | 
 calls               | bigint                   |           |          | 
 total_exec_time     | double precision         |           |          | 
 min_exec_time       | double precision         |           |          | 
 max_exec_time       | double precision         |           |          | 
 mean_exec_time      | double precision         |           |          | 
 stddev_exec_time    | double precision         |           |          | 
 rows                | bigint                   |           |          | 
 shared_blks_hit     | bigint                   |           |          | 
 shared_blks_read    | bigint                   |           |          | 
 shared_blks_dirtied | bigint                   |           |          | 
 shared_blks_written | bigint                   |           |          | 
 local_blks_hit      | bigint                   |           |          | 
 local_blks_read     | bigint                   |           |          | 
 local_blks_dirtied  | bigint                   |           |          | 
 local_blks_written  | bigint                   |           |          | 
 temp_blks_read      | bigint                   |           |          | 
 temp_blks_written   | bigint                   |           |          | 
 blk_read_time       | double precision         |           |          | 
 blk_write_time      | double precision         |           |          | 
 resp_calls          | text[]                   |           |          | 
 cpu_user_time       | double precision         |           |          | 
 cpu_sys_time        | double precision         |           |          | 
 wal_records         | bigint                   |           |          | 
 wal_fpi             | bigint                   |           |          | 
 wal_bytes           | numeric                  |           |          | 
 bucket_done         | boolean                  |           |          | 
 plans               | bigint                   |           |          | 
 total_plan_time     | double precision         |           |          | 
 min_plan_time       | double precision         |           |          | 
 max_plan_time       | double precision         |           |          | 
 mean_plan_time      | double precision         |           |          | 
 stddev_plan_time    | double precision         |           |          |

About pgbench

pgbench is a simple program executing benchmark tests on PostgreSQL by running the same sequence of SQL commands over and over. pgbench is capable of executing multiple concurrent database sessions and can calculate the average transaction rate (TPS) at the end of a run. Although the default configuration simulates loading based loosely upon TPC-B, it is nevertheless easy to test other use cases by writing one’s own transaction script files.

Querying the data

While it is reasonable to create panels showing real-time load in order to better explore the types of queries that can be run against pg_stat_monitor, it is more practical to copy the data into tables and query it there after the benchmarking run has completed.

Table: pg_stat_monitor_archive

Save the data generated from a recently completed benchmark run into an archive table:

select * into pg_stat_monitor_archive 
from pg_stat_monitor 
order by bucket_start_time asc

Table "public.pg_stat_monitor_archive"
      Column        |           Type           | Collation | Nullable | Default  
--------------------+--------------------------+-----------+----------+---------
bucket              | bigint                   |           |          |  
bucket_start_time   | timestamp with time zone |           |          |  
userid              | oid                      |           |          |  
username            | text                     |           |          |  
dbid                | oid                      |           |          |  
datname             | text                     |           |          |  
client_ip           | inet                     |           |          |  
pgsm_query_id       | bigint                   |           |          |  
queryid             | bigint                   |           |          |  
toplevel            | boolean                  |           |          |  
top_queryid         | bigint                   |           |          |  
query               | text                     |           |          |  
comments            | text                     |           |          |  
planid              | bigint                   |           |          |  
query_plan          | text                     |           |          |  
top_query           | text                     |           |          |  
application_name    | text                     |           |          |  
relations           | text[]                   |           |          |  
cmd_type            | integer                  |           |          |  
cmd_type_text       | text                     |           |          |  
elevel              | integer                  |           |          |  
sqlcode             | text                     |           |          |  
message             | text                     |           |          |  
calls               | bigint                   |           |          |  
total_exec_time     | double precision         |           |          |  
min_exec_time       | double precision         |           |          |  
max_exec_time       | double precision         |           |          |  
mean_exec_time      | double precision         |           |          |  
stddev_exec_time    | double precision         |           |          |  
rows                | bigint                   |           |          |  
shared_blks_hit     | bigint                   |           |          |  
shared_blks_read    | bigint                   |           |          |  
shared_blks_dirtied | bigint                   |           |          |  
shared_blks_written | bigint                   |           |          |  
local_blks_hit      | bigint                   |           |          |  
local_blks_read     | bigint                   |           |          |  
local_blks_dirtied  | bigint                   |           |          |  
local_blks_written  | bigint                   |           |          |  
temp_blks_read      | bigint                   |           |          |  
temp_blks_written   | bigint                   |           |          |  
blk_read_time       | double precision         |           |          |  
blk_write_time      | double precision         |           |          |  
resp_calls          | text[]                   |           |          |  
cpu_user_time       | double precision         |           |          |  
cpu_sys_time        | double precision         |           |          |  
wal_records         | bigint                   |           |          |  
wal_fpi             | bigint                   |           |          |  
wal_bytes           | numeric                  |           |          |  
bucket_done         | boolean                  |           |          |  
plans               | bigint                   |           |          |  
total_plan_time     | double precision         |           |          |  
min_plan_time       | double precision         |           |          |  
max_plan_time       | double precision         |           |          |  
mean_plan_time      | double precision         |           |          |  
stddev_plan_time    | double precision         |           |          |

Table: pg_stat_monitor_qry

Extract this metric of interest, i.e., time vs total execution time:

select bucket_start_time, pgsm_query_id, queryid, total_exec_time
into pg_stat_monitor_qry
from pg_stat_monitor 
order by bucket_start_time asc

pgbench=# \d pg_stat_monitor_qry
                      Table "public.pg_stat_monitor_qry"
      Column       |           Type           | Collation | Nullable | Default 
-------------------+--------------------------+-----------+----------+---------
 bucket_start_time | timestamp with time zone |           |          | 
 pgsm_query_id     | bigint                   |           |          | 
 queryid           | bigint                   |           |          | 
 total_exec_time   | double precision         |           |          |

Table: pg_stat_monitor_shared_blk_io

Extract this metric of interest, i.e., time vs shared_blk io:

select bucket_start_time, pgsm_query_id, queryid,
       shared_blks_hit, shared_blks_read,
       shared_blks_dirtied, shared_blks_written
    into pg_stat_monitor_shared_blk_io
    from pg_stat_monitor_archive
    order by bucket_start_time asc;

pgbench=# \d pg_stat_monitor_shared_blk_io
                 Table "public.pg_stat_monitor_shared_blk_io"
      Column        |           Type           | Collation | Nullable | Default  
---------------------+--------------------------+-----------+----------+---------
bucket_start_time   | timestamp with time zone |           |          |  
pgsm_query_id       | bigint                   |           |          |  
queryid             | bigint                   |           |          |  
shared_blks_hit     | bigint                   |           |          |  
shared_blks_read    | bigint                   |           |          |  
shared_blks_dirtied | bigint                   |           |          |  
shared_blks_written | bigint                   |           |          |

Table: pg_stat_monitor_blk_io

Note: this metric requires runtime parameter track_io_timing to be set on.
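If it is not already enabled, track_io_timing can be turned on without a server restart; a minimal sketch:

alter system set track_io_timing = on;
select pg_reload_conf();
show track_io_timing;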

Extract this metric of interest, i.e., time vs. blk io:

select bucket_start_time, pgsm_query_id, queryid, blk_read_time, blk_write_time
    into pg_stat_monitor_blk_io
    from pg_stat_monitor_archive
    order by bucket_start_time asc;

Table: pg_stat_monitor_uniq_id

Save a copy of all unique query IDs in order to parse out future queries from the view. 

Column pgsm_query_id identifies the query in such a manner that one can still identify the same query even when generated on other platforms under different loading conditions with  different data:

with a as (select distinct on (pgsm_query_id) *
            from pg_stat_monitor_archive
            where application_name='pgbench')
select cmd_type, cmd_type_text,pgsm_query_id, queryid,query as example_query
    into pg_stat_monitor_uniq_id
    from a
    order by cmd_type;

pgbench=# \d pg_stat_monitor_uniq_id
          Table "public.pg_stat_monitor_uniq_id"
    Column     |  Type   | Collation | Nullable | Default
---------------+---------+-----------+----------+---------
 cmd_type      | integer |           |          |
 cmd_type_text | text    |           |          |
 pgsm_query_id | bigint  |           |          |
 queryid       | bigint  |           |          |
 example_query | text    |           |          |

This is an example set of queries generated by pgbench. Note the numbers in column pgsm_query_id are always the same values irrespective of hosts or environments:

select cmd_type_text, pgsm_query_id, example_query 
from pg_stat_monitor_uniq_id where cmd_type > 0;

cmd_type_text |    pgsm_query_id     |           example_query
---------------+----------------------+-----------------------------------------------------------------
 SELECT        | -7455620703706695456 | SELECT abalance FROM pgbench_accounts WHERE aid = 16416498
 UPDATE        |  -510321339504955469 | UPDATE pgbench_accounts SET abalance = abalance + 2063 
                                        WHERE aid = 1482568
 UPDATE        |  5276535447716615446 | UPDATE pgbench_branches SET bbalance = bbalance + 1384 
                                        WHERE bid = 7
 UPDATE        |  3629195281782908951 | UPDATE pgbench_tellers SET tbalance = tbalance + -2966 
                                        WHERE tid = 330
 INSERT        | -8751124061964589929 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) 
                                        VALUES (321, 56, 21104880, 4211, CURRENT_TIMESTAMP)

Benchmarking

Two types of performance monitoring are profiled:

  1. Real-time loading performance
  2. Aggregate performance over a specific time period, i.e., a snapshot.

Although the results of the benchmarking can be viewed by querying the view pg_stat_monitor, you will note, as demonstrated by the bash script and SQL statements below, that the contents of the view are immediately copied and saved into a collection of tables. This is because the data will disappear over time as pg_stat_monitor cycles through its allotted number of buckets.

A script executing a benchmarking run:

#!/bin/bash

# REFERENCE
#   https://docs.percona.com/pg-stat-monitor/reference.html
#
set -e

export PGPASSWORD="MYPASSWORD"
export PGHOST='MYHOST'
export PGPORT=5434
export PGUSER=postgres
#
# initialize benchmarking database
#
dropdb --if-exists pgbench
createdb pgbench
/usr/pgsql-12/bin/pgbench -i --foreign-keys -s 300 pgbench
psql pgbench -c 'create extension pg_stat_monitor'

#
# configure pg_stat_monitor, requires system restart
#
psql postgres <<_eof_
-- set bucket time range,  default is normally 60 seconds
   alter system set pg_stat_monitor.pgsm_bucket_time = '1min';
-- set number of buckets, default is normally 10
   alter system set pg_stat_monitor.pgsm_max_buckets = 75;
_eof_

systemctl restart postgresql@13-main

psql postgres <<_eof_
-- zero pg_stat_monitor stats
    select * from pg_stat_monitor_reset();
_eof_

#
# begin benchmarking run
#
#   4500 seconds (75 minutes)
/usr/pgsql-12/bin/pgbench -U postgres -c 4 -j 2 -T 4500 -P 5 -b tpcb-like pgbench

#
# copy and save the benchmarking run into tables
#
psql postgres <<_eof_
    drop table if exists pg_stat_monitor_archive, 
                           pg_stat_monitor_qry, 
                           pg_stat_monitor_uniq_id;

    select * into pg_stat_monitor_archive from pg_stat_monitor order by bucket_start_time;

    select bucket_start_time, pgsm_query_id, queryid, total_exec_time
    into pg_stat_monitor_qry
    from pg_stat_monitor_archive
    where application_name='pgbench';

    with a as (select distinct on (pgsm_query_id) * 
               from pg_stat_monitor_archive 
               where application_name='pgbench')    
    select cmd_type, cmd_type_text,pgsm_query_id, queryid,query as example_query
        into pg_stat_monitor_uniq_id
        from a
        order by cmd_type;
_eof_


echo "DONE"

progress: 4435.0 s, 341.2 tps, lat 11.718 ms stddev 3.951
progress: 4440.0 s, 361.2 tps, lat 11.075 ms stddev 3.519
progress: 4445.0 s, 348.0 tps, lat 11.483 ms stddev 5.246
progress: 4450.0 s, 383.8 tps, lat 10.418 ms stddev 4.514
progress: 4455.0 s, 363.6 tps, lat 10.988 ms stddev 4.326
progress: 4460.0 s, 344.0 tps, lat 11.621 ms stddev 3.981
progress: 4465.0 s, 360.4 tps, lat 11.093 ms stddev 4.457
progress: 4470.0 s, 383.8 tps, lat 10.423 ms stddev 5.615
progress: 4475.0 s, 369.6 tps, lat 10.811 ms stddev 3.784
progress: 4480.0 s, 355.6 tps, lat 11.227 ms stddev 3.954
progress: 4485.0 s, 378.8 tps, lat 10.580 ms stddev 2.890
progress: 4490.0 s, 370.8 tps, lat 10.770 ms stddev 2.879
progress: 4495.0 s, 365.2 tps, lat 10.947 ms stddev 4.997
progress: 4500.0 s, 379.2 tps, lat 10.549 ms stddev 2.832

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 4
number of threads: 2
duration: 4500 s
number of transactions actually processed: 1564704
latency average = 11.497 ms
latency stddev = 4.800 ms
tps = 347.711175 (including connections establishing)
tps = 347.711731 (excluding connections establishing)

Dashboard example 1: Querying saved data

Top panel (Query execution time vs. DML)

Five (5) SQL statements are used to create this panel:

-- SELECT --
select  bucket_start_time,total_exec_time as "SELECT"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='SELECT'
order by 1 asc;

-- INSERT --
select  bucket_start_time,total_exec_time as "INSERT"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='INSERT'
order by 1 asc;

-- UPDATE 1 --
select  bucket_start_time,total_exec_time as "UPDATE 1"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = -510321339504955469
order by 1 asc;

-- UPDATE 2 --
select  bucket_start_time,total_exec_time as "UPDATE 2"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 5276535447716615446
order by 1 asc;

-- UPDATE 3 --
select  bucket_start_time,total_exec_time  as "UPDATE 3"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 3629195281782908951
order by 1 asc;

Bottom panel (Query execution time vs. shared blocks)

-- INSERT (ins_[hit|read|dirty|write]) --
select bucket_start_time,
       shared_blks_hit as ins_hit, shared_blks_read as ins_read,
       shared_blks_dirtied as ins_dirt, shared_blks_written as ins_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='INSERT'
order by 1 asc;

-- UPDATE 1 (update1_[hit|read|dirty|write]) --
select bucket_start_time,
       shared_blks_hit as update1_hit, shared_blks_read as update1_read,
       shared_blks_dirtied as update1_dirt, shared_blks_written as update1_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = -510321339504955469
order by 1 asc;

-- UPDATE 2 (update2_[hit|read|dirty|write]) --
select bucket_start_time,
       shared_blks_hit as update2_hit, shared_blks_read as update2_read,
       shared_blks_dirtied as update2_dirt, shared_blks_written as update2_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 5276535447716615446
order by 1 asc;

-- UPDATE 3 (update3_[hit|read|dirty|write]) --
select bucket_start_time,
       shared_blks_hit as update3_hit, shared_blks_read as update3_read,
       shared_blks_dirtied as update3_dirt, shared_blks_written as update3_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 3629195281782908951
order by 1 asc;

Analysis

Here are some example patterns that can be discerned:

  1. The SELECT statements are the fastest DML operations (top panel).
  2. Although SQL statement UPDATE 1 (top panel) takes up the most time, its contents do not have much presence in the shared buffer relative to the other update statements.
  3. Inserts (top) are the second-slowest set of statements, yet they show very little shared-buffer activity compared to the UPDATEs (bottom).

Dashboard example 2: Monitoring in real time

These two panels show read/write IO performance to the persistent storage while benchmarking a live run.

Top panel (Execution time vs. DML)

-- SELECT --
select  bucket_start_time,total_exec_time as "SELECT"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='SELECT'
order by 1 asc;

-- INSERT --
select  bucket_start_time,total_exec_time as "INSERT"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='INSERT'
order by 1 asc;

-- UPDATE 1 --
select  bucket_start_time,total_exec_time as "UPDATE 1"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='UPDATE'
and pgsm_query_id = -510321339504955469
order by 1 asc;

-- UPDATE 2 --
select  bucket_start_time,total_exec_time as "UPDATE 2"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='UPDATE'
and pgsm_query_id = 5276535447716615446
order by 1 asc;

-- UPDATE 3 --
select  bucket_start_time,total_exec_time  as "UPDATE 3"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='UPDATE'
and pgsm_query_id = 3629195281782908951
order by 1 asc;

Bottom panel (Time vs. IO)

-- time vs read/write blocks (blk_read_time, blk_write_time)
--    track_io_timing is on
select bucket_start_time, blk_read_time, blk_write_time from public.pg_stat_monitor;

Analysis

It’s quite easy to observe that SQL statement UPDATE 1 represents the bulk of the read operations.

-- example SQL statement
UPDATE pgbench_accounts SET abalance = abalance + 2063  WHERE aid = 1482568

Interestingly, writes are not as significant as reads.

Conclusion

I’m excited about pg_stat_monitor. Not only can it be used in Grafana, but it’s easily implemented in any monitoring solution, including our own Percona Monitoring and Management. It’s also incorporated in our latest version of Percona Operator for PostgreSQL.

Pg_stat_monitor is an obvious, common-sense improvement over pg_stat_statements’ greatest limitation, i.e., its inability to bucket metrics over time intervals. And to be frank, I can see the pg_stat_monitor extension eventually replacing pg_stat_statements as the de facto extension for monitoring Postgres when it comes to real-time analysis.

Happy monitoring!


Dec 25, 2023

Talking Drupal #430 – Drupal in 2024

Today we are talking about Drupal in 2024, what we are looking forward to with Drupal 11, and the Drupal Advent Calendar with James Shields. We’ll also cover Drupal 10.2 as our module of the week.

For show notes visit: www.talkingDrupal.com/430

Topics

  • Advent calendar
  • Selection process
  • Popularity
  • Next year
  • Drupal features in 2024
  • Drupal 11
    • Project browser
    • Recipes / Starter templates
    • Automated updates
    • Gitlab
    • Smaller core
  • Predictions

Resources

Guests

James Shields – lostcarpark.com lostcarpark

Hosts

Nic Laflin – nLighteneddevelopment.com nicxvan
John Picozzi – epam.com johnpicozzi
Martin Anderson-Clutz – mandclu
Ron Northcutt – community.appsmith.com rlnorthcutt

MOTW Correspondent

Martin Anderson-Clutz – mandclu Drupal 10.2

  • Improvements include
    • Technology Updates
      • PHP 8.3
      • Includes capabilities that previously required contrib projects
      • File name sanitization
      • A search filter on the permissions page
    • End Users
      • Performance enhancements and improved caching APIs
      • Support for PHP Fibers to accelerate handling things like asynchronous remote calls
    • Content Creators
      • Revision UI for media
      • Wider editing area in Claro on large screens
      • The return of “Show blocks” in CKEditor 5, missing until now
    • Site Builders
      • Field creation UI has a new, more visual interface, and an updated workflow
      • Block visibility can now be based on the HTTP response status, for example to make it visible or invisible on 404 or 403 responses
      • Tour module is no longer enabled by default for the Standard and Umami profiles
      • New “negated regular expression” operator for views filters (string/integer), to exclude results matching a provided pattern
    • Site Owners
      • Announcements Feed is now stable and included in the Standard profile
      • The functionality in the experimental Help Topics module has been merged into the main Help module, so the Help Topics module is now deprecated
      • New permission: Use help pages
    • Developers
      • A fairly sizable change is a move to use native PHP attributes instead of doctrine annotations to declare metadata for plugin classes. Work is already underway to get core code converted, and an issue has been opened to have rector do this conversion for contrib projects
      • A new DeprecationHelper::backwardsCompatibleCall() method to help write Drupal extensions that support multiple versions of core
      • A PerformanceTestBase is now in core, to support automated testing of performance metrics
      • A new #config_target property in ConfigFormBase to simplify creating configuration forms
      • Symfony mailer is now a composer dependency of core
      • New decimal primitive data type
      • Expanded configuration validation, Symfony autowiring support, HTML5 output from the HTML utility class is now default, and more
      • In addition to these and the features highlighted in the official announcement, there are three pages of change records for the 10.2.0 release, and we’ll include a link to those in the show notes
Dec
22
2023
--

What’s New in Percona Toolkit 3.5.6

Percona Toolkit

Percona Toolkit 3.5.6 was released on December 21, 2023. This blog post covers the main changes in this release.

New tool: pt-galera-log-explainer

We continue adding more instruments for the Support teams.
pt-galera-log-explainer was written by my Percona Support colleague Yoann La Cancellera.
pt-galera-log-explainer filters, aggregates, and summarizes multiple Galera library logs together. The purpose of this tool is to help find useful information in Percona XtraDB Cluster/Galera library logs.

pt-galera-log-explainer takes raw log files, usually very verbose ones like those available in the regression test suite, and makes a short summary out of them:

$ ./bin/pt-galera-log-explainer list --all src/go/pt-galera-log-explainer/tests/logs/merge_rotated_daily/node1.20230315.log
identifier                     node1                                               
current path                   .../tests/logs/merge_rotated_daily/node1.20230315.log   
last known ip                                                                          
last known name                node1                                               
mysql version                                                                          
                                                                                        
2023-03-15T20:10:57.784904+02:00   node2 joined                                        
2023-03-15T20:10:57.785568+02:00   node3 left                                          
2023-03-15T20:10:57.791959+02:00   node3 left                                          
2023-03-15T20:10:57.797221+02:00   PRIMARY(n=2)                                        
2023-03-15T20:20:12.714291+02:00   node2 joined                                        
2023-03-15T20:20:12.714331+02:00   node3 joined                                        
2023-03-15T20:20:13.776977+02:00   PRIMARY(n=3)                                        
2023-03-15T20:20:14.839684+02:00   local node will resync node3                        
2023-03-15T20:20:14.839723+02:00   SYNCED -> DONOR                                     
2023-03-15T20:20:15.799020+02:00   IST will be used                                    
2023-03-15T20:20:16.850525+02:00   finished sending IST to node3                       
2023-03-15T20:20:16.850549+02:00   DESYNCED -> JOINED                                  
2023-03-15T20:20:16.865312+02:00   JOINED -> SYNCED

The tool can process logs from multiple nodes and draw a timeline. You can find a usage example with sample output in the user reference manual. You can also filter events to have a more compact output.
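For instance, a merged timeline for a whole cluster can be produced by passing several files to the same command. This is a hypothetical invocation; the file names are assumptions, so adjust the paths to your own logs.

# build one merged timeline from the logs of three nodes
./bin/pt-galera-log-explainer list --all node1.log node2.log node3.log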

pt-galera-log-explainer is an excellent tool for analyzing the large log files produced by the Galera library.

Better macOS and ARM support

This release contains a few improvements for platforms and operating systems that Percona does not officially support.

PR-516, contributed by Ivan Kruglov, makes pt-online-schema-change and other tools respect case-insensitive lookup on Windows and macOS, the default behavior on these operating systems.

PR-720 simplifies the build process for the Percona Toolkit. Historically, Percona Toolkit was written in Perl and Shell programming languages. Build instructions for all code were simple:

perl Makefile.PL
make
(optionally) make test
make install

Since the first tool, written in the Go programming language, was introduced, package maintainers had to perform one extra step:

cd src/go
make <PLATFORM such as linux-amd64>

Now, this extra step is optional because the top-level Makefile has instructions for building Go tools on the current platform.

PR-712 adds the darwin-arm64 platform to the list of platforms for which Go binaries can be built. While we do not officially support ARM and macOS, you can build the tools yourself without extra effort.

To create macOS ARM binaries on the same platform, simply run the following from the top-level directory.

perl Makefile.PL
make

If you want to create binaries on another platform, change the directory to
src/go, then run

make darwin-arm64

We also have unofficial ARM packages for Percona Toolkit and PMM Dump in our labs at https://github.com/Percona-Lab/percona-on-arm.

Tools improvements

Explain output for slow query in JSON report for pt-query-digest

Earlier, pt-query-digest did not print EXPLAIN output with the option --output=json. PR-471, sent by Ayush Goyal, adds this possibility.
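As a quick sketch (the file name and DSN are assumptions), the JSON report with EXPLAIN plans for the sampled queries can be produced like this:

# --explain takes a DSN used to run EXPLAIN for each sample query
pt-query-digest --output=json --explain h=127.0.0.1,u=msandbox,p=msandbox slow.log > digest.json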

Hook before_die for pt-online-schema-change

PR-509, sent by Ilaria Migliozzi, introduces a new hook for pt-online-schema-change. If pt-online-schema-change stops ungracefully, this hook lets you print diagnostic information about failing operations. You can find an example plugin on GitHub.
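A minimal plugin could look like the sketch below. The module structure follows the documented pt-online-schema-change --plugin interface; the exact arguments passed to before_die are an assumption, so check the example plugin on GitHub for the authoritative version.

# sketch of a --plugin module for pt-online-schema-change
package pt_online_schema_change_plugin;

use strict;
use warnings;

sub new {
   my ($class, %args) = @_;
   # %args carries the tool's context (connections, options, ...)
   return bless { %args }, $class;
}

# called when pt-online-schema-change is about to die ungracefully
sub before_die {
   my ($self, %args) = @_;
   print STDERR "pt-osc is exiting; dump diagnostic information here\n";
}

1;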

Option --unstop for pt-archiver

pt-archiver supports the option --stop, which terminates running instances by creating a sentinel file. However, there was no option to remove this file and restart pt-archiver. PR-429 by fraff resolves this gap by adding the option --unstop. If pt-archiver is called with this option, it removes the sentinel file and restarts the operation.
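A sketch of how this fits into a typical run (the DSN, table, and WHERE clause are assumptions):

# resume a previously stopped archiving job; --unstop removes the sentinel file first
pt-archiver --source h=127.0.0.1,D=test,t=events --where "created < NOW() - INTERVAL 90 DAY" --purge --unstop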

Quality improvements

In my last Percona Toolkit release blog, What’s New in Percona Toolkit 3.5.5, I wrote about how we plan to make regression tests part of the release process. This partially happened, and we are testing Percona Toolkit with Percona Server for MySQL 5.7 and 8.0 on all supported platforms now. There are still a few issues with tests, as reported at PT-2295. We are planning to fix them and then add support for other products.

Viktor Szépe continued to contribute to the quality of the Percona Toolkit source code. He introduced EditorConfig for our GitHub repository, fixed typos, and made style improvements for Go code.

In addition to Kushal Haldar’s reports about vulnerabilities in Go, we enabled better automation for our GitHub repository. This release was built with the latest version of Go and fixes all known module vulnerabilities.

Percona Toolkit Docker images

We released an official Docker image for the Percona Toolkit at https://hub.docker.com/r/percona/percona-toolkit. It can be installed using the following command.

docker pull percona/percona-toolkit

To call any tool using docker, run the following.

docker run percona/percona-toolkit <TOOL NAME> <OPTIONS>

For example:

$ docker run --network="host" percona/percona-toolkit pt-online-schema-change 
> h=127.0.0.1,P=12345,u=msandbox,p=msandbox,D=test,t=t1 --alter='ADD COLUMN f2 INT' 
> --execute
Found 2 slaves:
s76 -> 127.0.0.1:12346
s76 -> 127.0.0.1:12347
Will check slave lag on:
s76 -> 127.0.0.1:12346
s76 -> 127.0.0.1:12347
Operation, tries, wait:
analyze_table, 10, 1
copy_rows, 10, 0.25
create_triggers, 10, 1
drop_triggers, 10, 1
swap_tables, 10, 1
update_foreign_keys, 10, 1
Altering `test`.`t1`...
Creating new table...
Created new table test._t1_new OK.
Altering new table...
Altered `test`.`_t1_new` OK.
2023-12-21T15:29:22 Creating triggers...
2023-12-21T15:29:22 Created triggers OK.
2023-12-21T15:29:22 Copying approximately 1 rows...
2023-12-21T15:29:22 Copied rows OK.
2023-12-21T15:29:22 Analyzing new table...
2023-12-21T15:29:22 Swapping tables...
2023-12-21T15:29:22 Swapped original and new tables OK.
2023-12-21T15:29:22 Dropping old table...
2023-12-21T15:29:22 Dropped old table `test`.`_t1_old` OK.
2023-12-21T15:29:22 Dropping triggers...
2023-12-21T15:29:22 Dropped triggers OK.
Successfully altered `test`.`t1`

Docker images are not only needed for those who prefer to install software via docker but will allow us to easily ship the Percona Toolkit together with other software, such as Percona Monitoring and Management or Percona Kubernetes Operators.

Community contributions

This release includes many contributions from community members. We want to thank:

  • Ivan Kruglov for fixing case-sensitivity issues in pt-online-schema-change and other tools
  • Chrys Swingler for fixing a bug with the option --skip-check-slave-lag in pt-online-schema-change and pt-table-checksum
  • Ayush Goyal for adding EXPLAIN output for slow queries in the JSON report for pt-query-digest
  • Ilaria Migliozzi for the new hook before_die for pt-online-schema-change
  • Jakob for fixing pt-archiver bug PT-2064
  • Viktor Szépe for introducing EditorConfig and fixing typos and issues with the Go code
  • fraff for the option --unstop for pt-archiver
  • Zongzhi Chen for fixing pt-pmp bug PT-2211
  • Jason Ng for fixing pt-visual-explain bug PT-2277
  • Kushal Haldar for his reports about vulnerabilities in Go-based tools

Percona Toolkit: Free your DBAs with advanced open source command-line tools.

 

Learn more

Dec
22
2023
--

The Pros and Cons of Wildcard Indexes on MongoDB


MongoDB is a schemaless database that is extremely flexible. When you create a collection, you don’t have to specify a structure in advance, providing field names and data types. Just start inserting JSON documents, and MongoDB will store them, no matter which fields and data types you provide. As a consequence, a collection can store completely different documents.

MongoDB does not require any ALTER statement to modify the schema of a collection, like in a relational database. If you need at some point to add new fields, just do it. Start inserting new JSON documents with the additional fields. Very easy.

But how can we manage the creation of indexes on a collection where we cannot foresee which fields it will contain? In this article, I’ll show wildcard indexes and their pros and cons.

Create a wildcard index on a single field

The simple idea of a wildcard index is to provide the possibility to create an index without knowing in advance the fields we are expecting in the documents. You can put whatever you need and MongoDB will index everything, no matter the field’s name, no matter the data type. The feature looks amazing, but it comes at some cost.

To test wildcard indexes, let’s create a small collection for storing our users’ details. We have some fixed fields like name, date_of_birth, and gender, but also we have a subdocument userMetadata for any other attribute we don’t know in advance. This way, we can store everything we need.

db.user.insert( { name: "John", date_of_birth: new ISODate("2001-02-05"), gender: 'M', userMetadata: { "likes" : [ "dogs", "cats" ] } } )
db.user.insert( { name: "Marie", date_of_birth: new ISODate("2008-03-12"), gender: 'F', userMetadata: { "dislikes" : "hamsters" } } )
db.user.insert( { name: "Tom", date_of_birth: new ISODate("1998-12-23"), gender: 'M', userMetadata: { "age" : 25 } } )
db.user.insert( { name: "Adrian", date_of_birth: new ISODate("1991-06-22"), gender: 'M', userMetadata: "inactive" } )
db.user.insert( { name: "Janice", date_of_birth: new ISODate("1995-09-04"), gender: 'F', userMetadata: { "shoeSize": 8, "likes": [ "horses", "dogs" ] } } )
db.user.insert( { name: "Peter", date_of_birth: new ISODate("2004-01-25"), gender: 'M', userMetadata: { "drivingLicense": { class: "A", "expirationDate": new ISODate("2030-05-05") } } } )

db.user.find()
[
{
_id: ObjectId('658452229a147dcb1198d9df'),
name: 'John',
date_of_birth: ISODate('2001-02-05T00:00:00.000Z'),
gender: 'M',
userMetadata: { likes: [ 'dogs', 'cats' ] }
},
{
_id: ObjectId('658452289a147dcb1198d9e0'),
name: 'Marie',
date_of_birth: ISODate('2008-03-12T00:00:00.000Z'),
gender: 'F',
userMetadata: { dislikes: 'hamsters' }
},
{
_id: ObjectId('6584522e9a147dcb1198d9e1'),
name: 'Tom',
date_of_birth: ISODate('1998-12-23T00:00:00.000Z'),
gender: 'M',
userMetadata: { age: 25 }
},
{
_id: ObjectId('658452519a147dcb1198d9e2'),
name: 'Adrian',
date_of_birth: ISODate('1991-06-22T00:00:00.000Z'),
gender: 'M',
userMetadata: 'inactive'
},
{
_id: ObjectId('658452d69a147dcb1198d9e3'),
name: 'Janice',
date_of_birth: ISODate('1995-09-04T00:00:00.000Z'),
gender: 'F',
userMetadata: { shoeSize: 8, likes: [ 'horses', 'dogs' ] }
},
{
_id: ObjectId('658453a09a147dcb1198d9e4'),
name: 'Peter',
date_of_birth: ISODate('2004-01-25T00:00:00.000Z'),
gender: 'M',
userMetadata: {
drivingLicense: {
class: 'A',
expirationDate: ISODate('2030-05-05T00:00:00.000Z')
}
}
}
]

As you can see, the userMetadata subdocument contains different fields, but none of those fields are indexed. Let’s suppose our collection contains several million documents; how can we retrieve, for example, all users with a specific driving license class or a specific shoe size without triggering a full collection scan? We can create a wildcard index on the userMetadata field using the special syntax $**.

Let’s do it:

db.user.createIndex({ "userMetadata.$**" : 1 })
db.user.getIndexes()
[
{ v: 2, key: { _id: 1 }, name: '_id_' },
{ v: 2, key: { 'userMetadata.$**': 1 }, name: 'userMetadata.$**_1' }
]

This tells MongoDB to create an entry in the index for every single field and for any array member inside userMetadata.

Now, we can benefit from the index to execute any kind of query.

db.user.find({ "userMetadata.likes": "dogs" })
[
{
_id: ObjectId('658452229a147dcb1198d9df'),
name: 'John',
date_of_birth: ISODate('2001-02-05T00:00:00.000Z'),
gender: 'M',
userMetadata: { likes: [ 'dogs', 'cats' ] }
},
{
_id: ObjectId('658452d69a147dcb1198d9e3'),
name: 'Janice',
date_of_birth: ISODate('1995-09-04T00:00:00.000Z'),
gender: 'F',
userMetadata: { shoeSize: 8, likes: [ 'horses', 'dogs' ] }
}
]

We can simply run explain() to certify the wildcard index is used and the query is an IXSCAN.

db.user.find({ "userMetadata.likes": "dogs" }).explain()
{
explainVersion: '1',
queryPlanner: {
namespace: 'test.user',
indexFilterSet: false,
parsedQuery: { 'userMetadata.likes': { '$eq': 'dogs' } },
queryHash: 'E2BC0D70',
planCacheKey: '7C6EEF39',
maxIndexedOrSolutionsReached: false,
maxIndexedAndSolutionsReached: false,
maxScansToExplodeReached: false,
winningPlan: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { '$_path': 1, 'userMetadata.likes': 1 },
indexName: 'userMetadata.$**_1',
isMultiKey: true,
multiKeyPaths: {
'$_path': [],
'userMetadata.likes': [ 'userMetadata.likes' ]
},
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: {
'$_path': [ '["userMetadata.likes", "userMetadata.likes"]' ],
'userMetadata.likes': [ '["dogs", "dogs"]' ]
}
}
},
rejectedPlans: []
...
...

For example, all the following queries can benefit from the same index. You can test them on your own using explain().

db.user.find( { "userMetadata.age" : { $gt: 20 }  } )
db.user.find( { "userMetadata": "inactive" } )
db.user.find( { "userMetadata.drivingLicense.class": "A", "userMetadata.drivingLicense.expirationDate": { $lt: ISODate("2032-01-01") } } )
db.user.find( { "userMetadata.shoeSize": 8})

Create a wildcard index on the entire document

What about creating a wildcard index on the entire document? Is it possible?

Yes, that’s possible. We can do this if we don’t know anything in advance about the documents we’re going to get in the collection.

There is another special syntax for doing that. Use the $** again without specifying a field name. Let’s do it on our user collection.

db.user.createIndex( { "$**" : 1 } )

Again, you can test the same queries we did before. You can notice all fields of the document are indexed now.

db.user.find( { name: "Marie" } )
[
{
_id: ObjectId('658452289a147dcb1198d9e0'),
name: 'Marie',
date_of_birth: ISODate('2008-03-12T00:00:00.000Z'),
gender: 'F',
userMetadata: { dislikes: 'hamsters' }
}
]

db.user.find( { name: "Marie" } ).explain()
{
explainVersion: '1',
queryPlanner: {
namespace: 'test.user',
indexFilterSet: false,
parsedQuery: { name: { '$eq': 'Marie' } },
queryHash: '64908032',
planCacheKey: 'A6C0273F',
maxIndexedOrSolutionsReached: false,
maxIndexedAndSolutionsReached: false,
maxScansToExplodeReached: false,
winningPlan: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { '$_path': 1, name: 1 },
indexName: '$**_1',
isMultiKey: false,
multiKeyPaths: { '$_path': [], name: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: {
'$_path': [ '["name", "name"]' ],
name: [ '["Marie", "Marie"]' ]
}
}
},
rejectedPlans: []
...
...

Pros and cons

The good thing about wildcard indexes is easy to state: it’s the great flexibility they provide. Just index everything, even what you are not expecting.

The bad thing is the index size. As you can imagine, what happens if I start to insert an impressive number of fields, or arrays with thousands of items, into userMetadata? More and more index entries will be created. A single document can generate thousands of index entries. Am I supposed to use such a huge amount of disk and memory to manage this very large index structure?

Remember, indexes are most effective when they fit in memory. The size of a wildcard index can simply explode if we cannot control (or were not able to foresee) the amount of data we create.

Our collection is very small, so the numbers are not worrying yet. But think about what could happen in the case of very large collections: the size of the indexes can go out of control.

Let’s use a simple trick to increase the size of the collection. Each run of the following statement doubles the number of documents; run it eight or ten times, depending on how many documents you want.

db.user.find( {}, {_id:0}).forEach(function (doc) { db.user.insertOne(doc); } )

In my case, I ended up with a collection of 19K documents, large enough for testing index size.

db.user.stats()
...
...
 size: 2388744,
count: 19147,
numOrphanDocs: 0,
storageSize: 540672,
totalIndexSize: 1343488,
totalSize: 1884160,
indexSizes: { _id_: 630784, 'userMetadata.$**_1': 192512, '$**_1': 520192 },
avgObjSize: 124,
ns: 'test.user',
nindexes: 3,
...

The data in the collection is around 2.3MB, and the overall size of the indexes is 1.3MB. Indexes represent more than 50% of the data size. If this is not impressive enough, think about what the percentage could be if you have very large subdocuments. I have seen collections in production environments being over-indexed because of a single wildcard index.

The wildcard index, which was beneficial at the beginning for making things more flexible, instead ended up as a serious performance bottleneck, causing more memory utilization and swapping.

Also, remember that most of the time, only a few fields are used by your most frequent queries. Using a wildcard index does not always make sense.

Conclusions

Wildcard indexes may be useful in some specific cases when you don’t know much about the documents you expect. The main suggestion is not to abuse wildcard indexes since they can end in serious bottlenecks. Single-field, compound, and multikey indexes are usually better choices than wildcard ones.

Monitor the index size for all the collections where you have a wildcard index. If you see the size increasing rapidly, drop it and instead create indexes on the fields that are most frequently used in your queries.

There are some mitigations you can apply, like excluding some of the fields from the wildcard index. Have a look at the following page for more details: https://www.mongodb.com/docs/manual/core/indexes/index-types/index-wildcard/create-wildcard-index-multiple-fields/
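As a quick sketch of that idea (the excluded path is just an assumption based on the example collection above), a document-wide wildcard index can leave out a sub-path that might grow unbounded:

// drop the earlier $**_1 index first to avoid an index options conflict
db.user.createIndex(
  { "$**" : 1 },
  { wildcardProjection: { "userMetadata.likes": 0 } }  // exclude this path from the index
)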

MongoDB 7.0 now supports the creation of compound wildcard indexes. Have a look at the following page for more details: https://www.mongodb.com/docs/manual/core/indexes/index-types/index-wildcard/index-wildcard-compound/
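For example, a sketch of a compound wildcard index on the same collection (requires MongoDB 7.0 or newer; the field combination is an assumption):

// a known field combined with a wildcard component
db.user.createIndex( { gender: 1, "userMetadata.$**": 1 } )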

Further reading about indexes on MongoDB: MongoDB Indexes Explained: A Comprehensive Guide to Better MongoDB Performance.

Percona Distribution for MongoDB is a freely available MongoDB database alternative, giving you a single solution that combines the best and most important enterprise components from the open source community, designed and tested to work together.

 

Download Percona Distribution for MongoDB Today!

Dec
21
2023
--

Percona Monitoring and Management High Availability – A Proof of Concept

Percona Monitoring and Management (PMM) is a state-of-the-art piece of software that exists in part thanks to great open source projects like VictoriaMetrics, PostgreSQL, and ClickHouse. The integration of those projects, plus the years of Percona expertise in the database space, makes PMM one of the best database monitoring solutions on the market.

Being software composed of different, multiple technologies can add complexity to a well-known concept: High Availability (HA). Achieving HA for PMM, as a whole, has proved to be a challenge but not an impossible task.

The easy part

Setting up the PMM cluster is the easy part. All it needs is a reverse proxy in front of a couple or more PMM instances. The go-to proxy is HAProxy configured for active/passive topology, that is, without load distribution.

For the purpose of the PoC, a single HAProxy instance is used (running as a docker container). The configuration file looks like this:

global
stats socket /var/run/api.sock user haproxy group haproxy mode 660 level admin expose-fd listeners
log stdout format raw local0 info

defaults
mode http
timeout client 10s
timeout connect 5s
timeout server 10s
timeout http-request 10s
log global

frontend stats
bind *:8404
stats enable
stats uri /
stats refresh 10s

frontend pmmfrontend
bind :80
default_backend pmmservers

backend pmmservers
option tcp-check
server pmm1 172.31.12.174:80 check port 80
server pmm2 172.31.11.132:80 check port 80 backup

The Docker container is run with this command:

docker run -d --name haproxy -v $(pwd):/usr/local/etc/haproxy:ro -p 80:80 -p 443:443 -p 8404:8404 haproxytech/haproxy-alpine:2.

The -v for the volume guarantees that the local copy of the haproxy.cfg file is the one used inside the container. Whenever you make a change in the cfg file, for the haproxy container to use it, just execute:

docker kill -s HUP haproxy

And to follow the haproxy logs:

docker logs -f haproxy

We have two frontends: one for the HAProxy stats and another for PMM itself. There is a single backend, where the “passive” PMM instance (the one that is a pure “read replica”) is marked as “backup” so that traffic is only routed there if the primary fails the health check.

For simplicity, the PMM instances are configured to listen on port 80 (http) on their private IPs. This avoids SSL certificates, since everything goes through the same VPC (everything runs on EC2 instances on AWS). The health check, then, can be a simple “tcp-check” against port 80.

Percona Monitoring and Management high availability

As you can see, stats are available via port 8404. With this, the easy part is done.

For this example, the PMM SERVER endpoint will be where the HAProxy frontend is listening, and that’s the one used when registering a new PMM CLIENT.

PMM CLIENT

And you can access PMM always using the same endpoint.

The not-so-easy part (made easy)

The proxy is configured to be aware of two different PMM instances, pmm1 and pmm2 (using the private IPs 172.31.12.174 and 172.31.11.132 in this case), but we haven’t said anything yet about those PMM instances.

And here is the not-so-easy part: One has to deploy at least two PMM instances on at least two different servers AND set up replicas. How to do it? This is the actual Proof of Concept: Enter the PMM HA script: https://github.com/nethalo/pmmha

The script will take care of installing PMM (if you already have it, you can skip this step), preparing the primary, and setting up the secondary. Simple as that. PMM remains a black box.

The requirements are:

– Docker 23.0.3 and higher
– Docker installation will fail on Amazon Linux or RHEL9/EL9 (unless you are on an s390x architecture machine) since the Percona Easy-Install script relies on the “Get Docker” (https://get.docker.com/) script for the Docker install. You will need to install Docker on your own in those cases
– SSH Access to the host servers
– sudo capabilities
– Ports 443 and 9000 accessible from outside the Primary host machine

Steps to run the script:

git clone https://github.com/nethalo/pmmha.git
cd pmmha
bash pmm.sh

Failover

The failover will be handled automatically by HAProxy when it detects that the current primary is no longer available. Traffic will be routed to the backup server in the backend, which, if properly set up as a replica, will already have the historical data for metrics and QAN, and its inventory will be ready to continue data scraping from the exporters.
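To confirm which backend server is currently taking traffic, the stats frontend on port 8404 can be queried in CSV form. This is only a sketch; the host name is an assumption, and the field positions follow HAProxy’s CSV stats format.

# print server name and status for every server in the pmmservers backend
curl -s 'http://haproxy-host:8404/;csv' | awk -F, '$1=="pmmservers" {print $2, $18}'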

Your feedback is key

We want to hear from you! Suggestions, feature requests, and comments in general are welcome, either in the comment section or via the GitHub repository.

Percona Monitoring and Management is a best-of-breed open source database monitoring solution. It helps you reduce complexity, optimize performance, and improve the security of your business-critical database environments, no matter where they are located or deployed.

 

Download Percona Monitoring and Management Today

Dec
21
2023
--

Revamp MySQL Query Optimization and Overcome Slowness of ORDER BY with LIMIT Queries

Slowness of ORDER BY with LIMIT Queries

The efficiency of database queries in MySQL can make all the difference in the performance and responsiveness of applications. In this blog post, I’ll dig into MySQL query optimization and show how MySQL uses indexes in cases of queries using sorting and limiting. While sorting may seem simple, it’s important to understand how to do it efficiently and effectively to ensure your queries are optimized and use better indexing.

Since sorting your rows is not free, it can take a significant amount of resources and time to sort large data sets; thus, it’s important to do it cautiously. If you don’t need your rows in a certain order, don’t order them.

However, if you need to order your rows, doing it efficiently and effectively is essential to optimize your queries. You must understand how to use indexes to make sorting cheaper. 

Looking at this, can you say which is faster: LIMIT 1 or LIMIT 10? Presumably, fetching fewer rows is faster than fetching more records. However, for 16 years since 2007, the MySQL query optimizer has had a “bug” that not only makes LIMIT 1 slower than LIMIT 10 but can also turn the former into a table scan, which tends to cause problems. I went through the case details for one of our clients last week, which led me to pen down this article. I’m writing this blog post for developers/DBAs to illustrate and explain more clearly how the MySQL query optimizer works for queries using GROUP BY or ORDER BY with LIMIT, and how we can now control the optimization of these queries via the optimizer_switch flag prefer_ordering_index, which is covered under Switchable Optimizations in MySQL.

Before looking at the problematic query, I will walk you through a little detail about the optimizer.  The Query Optimizer is the part of query execution that chooses the query plan.  A Query Execution Plan is the way a database chooses to run a specific query.  It includes index choices, join types, table query order, temporary table usage, sorting type, etc. The execution plan for a specific query can be obtained using the EXPLAIN command.

There is a concept called Switchable Optimizations, where MySQL lets you control the query optimizer, which is managed by the optimizer_switch variable. This system variable enables control over optimizer behavior. The value of this variable is a set of flags, each of which has a value of on or off to indicate whether the corresponding optimizer behavior is enabled or disabled. This variable has global and session values and can be changed at runtime. The global default can be set at server startup.

To see the current set of optimizer flags, select the variable value:

mysql> SELECT @@optimizer_switch\G
*************************** 1. row ***************************
@@optimizer_switch: index_merge=on,index_merge_union=on,
                    index_merge_sort_union=on,index_merge_intersection=on,
                    engine_condition_pushdown=on,index_condition_pushdown=on,
                    mrr=on,mrr_cost_based=on,block_nested_loop=on,
                    batched_key_access=off,materialization=on,semijoin=on,
                    loosescan=on,firstmatch=on,duplicateweedout=on,
                    subquery_materialization_cost_based=on,
                    use_index_extensions=on,condition_fanout_filter=on,
                    derived_merge=on,use_invisible_indexes=off,skip_scan=on,
                    hash_join=on,subquery_to_derived=off,
                    prefer_ordering_index=on,hypergraph_optimizer=off,
                    derived_condition_pushdown=on

Optimizer flag

Let’s take a deeper look into one of the opt_name flags, i.e., prefer_ordering_index. This flag controls whether, in the case of a query having an ORDER BY or GROUP BY with a LIMIT clause, the optimizer tries to use an ordered index instead of an unordered index, a filesort, or some other optimization. This optimization is performed by default whenever the optimizer determines that using it would allow for faster query execution. Because the algorithm that makes this determination cannot handle every conceivable case (due in part to the assumption that the distribution of data is always more or less uniform), there are cases in which this optimization may not be desirable. Prior to MySQL 8.0.21, it was not possible to disable this optimization, but in MySQL 8.0.21 and later, while it remains the default behavior, it can be disabled by setting the prefer_ordering_index flag to off.

Here, we will look through a case study with an example. Let’s first understand the problem.

Problem statement:

The below query was taking too long to execute. The table being queried had a size of around 850G. But what went wrong? The query had the worst performance, taking around three hours to return a single row. Isn’t it crazy? And why is it taking so much time? Let’s get into more details to find out the answer.

mysql> select `tokenId` from `test_db`.`Tokens` order by `tokenId` desc limit 1;

Here is the structure of the table:

show create table `test_db`.`Tokens` \G
*************************** 1. row ***************************
      Table: Tokens
Create Table: CREATE TABLE `Tokens` (
  `tokenId` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `tokenTypeId` int(10) unsigned NOT NULL DEFAULT '1',
  `accountId` bigint(20) unsigned DEFAULT NULL,
  `token` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `validFrom` datetime NOT NULL,
  `validTo` datetime NOT NULL,
  `dateCreated` datetime NOT NULL,
  `lastUpdated` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `status` enum('INACTIVE','ACTIVE','REDEEMED','CANCELLED','EXPIRED','INVALIDATED','USED') COLLATE utf8_unicode_ci NOT NULL,
  `passphrase` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  PRIMARY KEY (`tokenId`),                         
  KEY `idx_dateCreated` (`dateCreated`),
  KEY `idx_status_validTo` (`status`,`validTo`),
  KEY `idx_token` (`token`),
  KEY `idx_tokenTypeId` (`tokenTypeId`),  <<---This index is being used by the query and scans 4065011580 rows.
  KEY `idx_accountId_status` (`accountId`,`status`)
) ENGINE=InnoDB AUTO_INCREMENT=5984739122 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci ROW_FORMAT=COMPRESSED
1 row in set (0.00 sec)

mysql> select table_schema, table_name, table_rows, round(data_length / 1024 / 1024 / 1024) DATA_GB, round(index_length / 1024 / 1024 / 1024) INDEX_GB, round(data_free / 1024 / 1024 / 1024) FREE_GB, round(((data_length / 1024 / 1024)+round(index_length / 1024 / 1024)+round(data_free / 1024 / 1024))/1024) TOTAL_GB from information_schema.tables where table_name='Tokens';
+----------------+------------+------------+---------+----------+---------+----------+
| table_schema   | table_name | table_rows | DATA_GB | INDEX_GB | FREE_GB | TOTAL_GB |
+----------------+------------+------------+---------+----------+---------+----------+
| test_db        | Tokens     | 4069894019 |     360 |      438 |      52 |      850 |
+----------------+------------+------------+---------+----------+---------+----------+
1 row in set (0.01 sec)

prefer_ordering_index=OFF

mysql> show variables like '%optimizer_switch%' \G
*************************** 1. row ***************************
Variable_name: optimizer_switch
        Value: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on,prefer_ordering_index=off,favor_range_scan=off
1 row in set (0.01 sec)

This is the execution plan with prefer_ordering_index=OFF:

mysql> explain select `tokenId` from `test_db`.`Tokens` order by `tokenId` desc limit 1\G
*************************** 1. row ***************************
          id: 1
  select_type: SIMPLE
        table: Tokens
  partitions:
        type: index
possible_keys: NULL
          key: idx_tokenTypeId
      key_len: 4
          ref: NULL
        rows: 3187489428
    filtered: 100.00
        Extra: Using index; Using filesort
1 row in set, 1 warning (0.00 sec)

  • Here, the query uses the secondary index idx_tokenTypeId, and filesort is used for sorting the result set; you can see it causes a full index scan.

prefer_ordering_index=ON 

This is how it behaves when switching prefer_ordering_index to ON, where you can see the index is being used.

mysql> set optimizer_switch='prefer_ordering_index=on';
Query OK, 0 rows affected (0.00 sec)

mysql> explain select `tokenId` from `test_db`.`Tokens` order by `tokenId` desc limit 1\G
*************************** 1. row ***************************
          id: 1
  select_type: SIMPLE
        table: Tokens
  partitions:
        type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 8
          ref: NULL
        rows: 1
    filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (0.00 sec)

And this is how it is executed within less than a second! Great!

mysql> select `tokenId` from `test_db`.`Tokens` order by `tokenId` desc limit 1 ;
+------------+
| tokenId    |
+------------+
| 5984755269 |
+------------+
1 row in set (0.00 sec)

The prefer_ordering_index flag is ON by default.

Let me take you into more detail now. The usage of the index in the above case is based on the following aspects: 

  1. Index cardinality

An estimate of the number of unique values in the index. CARDINALITY is counted based on statistics stored as integers, so the value is not necessarily exact, even for small tables. The higher the cardinality, the greater the chance MySQL uses the index.

  2. File sort

Filesort is the catch-all algorithm for producing sorted results for ORDER-BY or GROUP-BY queries. This is how file sort works:

  • Read the rows that match the WHERE clause.
  • For each row, record a tuple of values consisting of the sort key value and the additional fields referenced by the query.
  • When the sort buffer becomes full, sort the tuples by sort key value in memory and write it to a temporary file.
  • After merge-sorting the temporary file, retrieve the rows in sorted order and read the required columns directly from the sorted tuples.
  3. How does the query optimizer prefer the ordering index?

In the case of a query having an ORDER BY or GROUP BY and a LIMIT clause, the optimizer tries to choose an ordered index by default when it appears that doing so would speed up query execution. It uses an ordered index instead of an unordered index; an unordered index may require a filesort, which increases the query execution time. This optimization is performed by default whenever the optimizer determines that using it would allow for faster execution of the query.

Prior to MySQL 8.0.21, there was no way to override this behavior, even in cases where using some other optimization might be faster. Beginning with MySQL 8.0.21, this optimization can be turned off by setting the optimizer_switch system variable’s prefer_ordering_index flag to off.

Caution:

Disabling prefer_ordering_index causes another bug: MySQL does not scan the primary key for SELECT … FROM t ORDER BY pk_col LIMIT n. Instead, it does a full table scan plus sort, which is unnecessary and likely to cause problems. This was the case with the client that disabled prefer_ordering_index, the reason being that satisfying the WHERE clause was their preference. However, this caused a real issue, and we would call it a bug.

Also, please note that when you control optimizer strategies by setting the optimizer_switch system variable, the change affects the execution of all subsequent queries if you set it at the global level or in the my.cnf file. This may affect the performance of other queries in your application. To affect one query differently from another, you need to change optimizer_switch before each one, which can be done at the session level. The other way is to use FORCE INDEX in your queries. As of version 8.2.0, there is still no optimizer hint for the prefer_ordering_index flag; hopefully, MySQL will add one in an upcoming release.
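As a sketch of both approaches (reusing the Tokens example from above):

-- scope the flag to the current session instead of changing it globally
SET SESSION optimizer_switch = 'prefer_ordering_index=off';

-- or leave the flag at its default and pin a single query to an index
SELECT `tokenId` FROM `test_db`.`Tokens` FORCE INDEX (PRIMARY)
ORDER BY `tokenId` DESC LIMIT 1;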

Another example

Table: employee_details

id  employee_id  department_id  hire_date
1   101          1001           2017-11-09
2   102          1003           2020-02-06
3   103          1006           2021-05-15
4   104          1002           2022-07-10
5   105          1006           2022-02-06
6   106          1004           2023-06-14
Primary Key    : id
Secondary index: <department_id,employee_id>

mysql> SELECT * FROM employee_details WHERE department_id = '1006' ORDER BY id LIMIT 1;

How should MySQL execute that query? Developers tend to say, “Use the secondary index for the WHERE condition department_id = ‘1006’.” That’s reasonable; it makes sense. 

The secondary index has two matching records: <‘1006’,103> and <‘1006’, 105>. That will cause four lookups total: two secondary index reads + two corresponding primary key reads. Furthermore, the query needs to be ordered by id, which is not the order of the secondary index, so MySQL will also sort the results after those four lookups. That means EXPLAIN will say, “Using filesort”.

Let’s walk through the secondary index access step by step:

  1. Match Secondary Index <‘1006’, 105>
  2. Read corresponding PK row 5 into sort buffer
  3. Match Secondary Index <‘1006’,103>
  4. Read corresponding PK row 3 into sort buffer
  5. Sort the buffer for ORDER BY: [5, 3] → [3, 5]
  6. Apply LIMIT 1 to return PK row <3, ‘1006’,103>

That’s not a bad execution plan, but the query optimizer can choose a completely different plan: an index scan on the ORDER BY column, which happens to be the primary key: id. (Remember: an index scan on the primary is the same as a table scan because the primary key is the table.) Why? In the source code, a code comment explains:

/*Switch to index that gives order if its scan time is smaller than read_time of current chosen access method*/.

Reading rows in order might be faster than unordered secondary index lookups plus sorting. With this optimization, the new query execution plan would be:

  1. Read PK row 1 and discard (department_id value doesn’t match)
  2. Read PK row 2 and discard (department_id value doesn’t match)
  3. Read PK row 3 (department_id value matches)

Looks like MySQL is correct: by scanning the primary key in order, it reads one less row and avoids the filesort. For now, the point is that this query optimization works this way and might be faster.

— BEFORE: Secondary index lookup

mysql> explain SELECT * FROM employee_details WHERE department_id = '1006' ORDER BY id LIMIT 1G
*************************** 1. row ***************************
          id: 1
  select_type: SIMPLE
        table: employee_details
  partitions:
        type: ref
possible_keys: idx_dept_empid
          key: idx_dept_empid
      key_len: 16
          ref: const
        rows: 1000      
    filtered: 100.00
        Extra: Using filesort
1 row in set, 1 warning (0.00 sec)

But after the change, you would see an EXPLAIN plan like:

mysql> explain SELECT * FROM employee_details WHERE department_id = '1006' ORDER BY id LIMIT 1G
*************************** 1. row ***************************
          id: 1
  select_type: SIMPLE
        table: employee_details
  partitions:
        type: ref
possible_keys: idx_dept_empid
          key: PRIMARY
      key_len: 8
          ref: NULL
        rows: 2
    filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (0.00 sec)

Notice that fields type, key, ref, and Extra all change. Also, PRIMARY is not listed for possible_keys before, but after (when MySQL changes the execution plan) it appears as the chosen key.

Conclusion

The optimization that switches from a non-ordering index to an ordering index for GROUP BY and ORDER BY queries with a LIMIT clause goes very wrong for certain queries. The MySQL team introduced a new optimizer switch, prefer_ordering_index, to disable this limit optimization. The MySQL team tested and analyzed most of the bugs reported in this area and identified the above problems with the algorithm. This change does not solve the problem itself, but it gives users an option to control the optimization. With it, you can disable or enable the optimizer flag prefer_ordering_index, work around the issue by increasing the LIMIT value to change the cost calculations, or use FORCE INDEX to force the optimal secondary index. Whatever solution you use, you just need to be aware of the performance penalties that may come with it and have a clear understanding of the optimization techniques you choose.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

 

Try Percona Distribution for MySQL today!

Dec
20
2023
--

Using Huge Pages with PostgreSQL Running Inside Kubernetes

Huge pages make PostgreSQL faster; can we implement them in Kubernetes? Modern servers operate with terabytes of RAM, and by default, processors work with virtual memory address translation for each 4KB page. The OS maintains a huge list of allocated and free pages to perform slow but reliable address translation from virtual to physical.

Please check out the Why Linux HugePages are Super Important for Database Servers: A Case with PostgreSQL blog post for more information.

Setup

I recommend starting with 2MB huge pages because they are trivial to set up. Unfortunately, the performance in benchmarks is almost the same as for 4KB pages. Kubernetes worker nodes should be configured with GRUB_CMDLINE_LINUX or sysctl vm.nr_hugepages=N: https://kubernetes.io/docs/tasks/manage-hugepages/scheduling-hugepages/
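For example, on a self-managed worker node, either of the following approaches works; the page count here is only illustrative.

# allocate 2MB pages at runtime and persist the setting across reboots
sysctl vm.nr_hugepages=4096
echo "vm.nr_hugepages=4096" >> /etc/sysctl.conf

# alternatively, reserve the pages at boot time via the kernel command line
# in /etc/default/grub, then regenerate the GRUB config and reboot:
#   GRUB_CMDLINE_LINUX="... hugepages=4096"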

This step could be hard with managed Kubernetes services, like GCP, but easy for kubeadm, kubespray, k3d, and kind installations.

Kubectl helps to check the amount of huge pages available.

kubectl describe nodes NODENAME
…
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      1Gi (25%)  1Gi (25%)
…

The tool reports that only 2MB pages are available in the above output. During the deployment procedure, at the custom resource apply stage, Percona Operator for PostgreSQL 2.2.0 is not able to start on such nodes:

$ kubectl -n pgo get pods -l postgres-operator.crunchydata.com/data=postgres
NAME                        READY   STATUS             RESTARTS       AGE
cluster1-instance1-f65t-0   3/4     CrashLoopBackOff   6 (112s ago)   8m35s
cluster1-instance1-2bss-0   3/4     CrashLoopBackOff   6 (100s ago)   8m35s
cluster1-instance1-89v7-0   3/4     CrashLoopBackOff   6 (104s ago)   8m35s

Logs are very confusing:

kubectl -n pgo logs cluster1-instance1-f65t-0 -c database
selecting dynamic shared memory implementation ... posix
sh: line 1:   737 Bus error               (core dumped) "/usr/pgsql-15/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=100 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1

By default, PostgreSQL is configured to use huge pages, but Kubernetes needs to allow it first. .spec.instances.resources.limits should be modified to mention huge pages. On a node with huge pages enabled, PG pods are not able to start without proper limits.

instances:
  - name: instance1
    replicas: 3
    resources:
      limits:
        hugepages-2Mi: 1024Mi
        memory: 1Gi
        cpu: 500m

hugepages-2Mi works in combination with the memory parameter; you can’t just specify huge pages limits.

Finally, let’s verify huge pages usage in postmaster memory map:

$ kubectl -n pgo exec -it cluster1-instance1-hgrp-0 -c database -- bash

ps -eFH # check process tree and find “first” postgres process

pmap -X -p 107|grep huge

         Address Perm   Offset Device     Inode   Size   Rss  Pss Pss_Dirty Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping

    7f35c5c00000 rw-s 00000000  00:0f 145421787 432128     0    0         0          0         0        0              0             0          18432          264192    0       0      0           0 /anon_hugepage (deleted)

Both the Shared_Hugetlb and Private_Hugetlb columns are set (18432 and 264192). This confirms that PostgreSQL can use huge pages.

Don’t set huge pages to the exact value of shared_buffers, as shared memory could also be consumed by extensions and many internal structures.

postgres=# SELECT sum(allocated_size)/1024/1024 FROM pg_shmem_allocations ;
       ?column?       
----------------------
 422.0000000000000000
(1 row)
postgres=# select * from pg_shmem_allocations order by allocated_size desc LIMIT 10;
         name         |    off    |   size    | allocated_size 
----------------------+-----------+-----------+----------------
 <anonymous>          |           | 275369344 |      275369344
 Buffer Blocks        |   6843520 | 134217728 |      134217728
 pg_stat_monitor      | 147603584 |  20971584 |       20971648
 XLOG Ctl             |     54144 |   4208200 |        4208256
                      | 439219200 |   3279872 |        3279872
 Buffer Descriptors   |   5794944 |   1048576 |        1048576
 CommitTs             |   4792192 |    533920 |         534016
 Xact                 |   4263040 |    529152 |         529152
 Checkpointer Data    | 146862208 |    393280 |         393344
 Checkpoint BufferIds | 141323392 |    327680 |         327680
(10 rows)

Pg_stat_statements and pg_stat_monitor can add a significant amount of shared memory on top of a small shared_buffers value. Thus, you need “hugepages-2Mi: 512Mi” for “shared_buffers: 128MB”.
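As a rough sketch, the corresponding limits in the custom resource would look like the block below; the values are illustrative, and shared_buffers itself is configured through the cluster’s PostgreSQL parameters, which are not shown here.

instances:
  - name: instance1
    replicas: 3
    resources:
      limits:
        hugepages-2Mi: 512Mi   # headroom above shared_buffers for extensions
        memory: 1Gi
        cpu: 500m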

Now you know all the caveats and may want to repeat the configuration.

It’s easy with anydbver and k3d. Allocate 2MB huge pages:

sysctl vm.nr_hugepages=2048

Verify huge pages availability:

egrep 'Huge|Direct' /proc/meminfo
AnonHugePages:    380928 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:    2048
HugePages_Free:     2048
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         4194304 kB
DirectMap4k:     1542008 kB
DirectMap2M:    19326976 kB
DirectMap1G:           0 kB

  1. Install and configure anydbver.

    git clone https://github.com/ihanick/anydbver.git
    cd anydbver
    ansible-galaxy collection install theredgreek.sqlite
    echo PROVIDER=docker > .anydbver
    (cd images-build;./build.sh)
  2. Start k3d cluster and install Percona Operator for PostgreSQL 2.2.0:

    ./anydbver deploy k8s-pg:2.2.0
  3. The command hangs on the cluster deployment stage, and a second terminal shows the CrashLoopBackOff state:

    kubectl -n pgo get pods -l postgres-operator.crunchydata.com/data=postgres
  4. Change data/k8s/percona-postgresql-operator/deploy/cr.yaml
    Uncomment .spec.instances[0].resources.limits and set memory: 1Gi, hugepages-2Mi: 1024Mi
  5. Apply CR again:

    kubectl -n pgo apply -f data/k8s/percona-postgresql-operator/deploy/cr.yaml

In summary:

  • Huge pages are not supported out of the box in public clouds
  • Database crashes can occur if huge pages allocation fails with a bus error
  • Huge pages are not a silver bullet.
    • Without frequent CPU context switches and massively random large shared buffer access, default 4K pages show comparable results.
    • Workloads with less than 4-5k transactions per second are fine even without huge pages

 

Learn more about Percona Operator for PostgreSQL
