Nov 26, 2019

Comparing S3 Streaming Tools with Percona XtraBackup


Making backups over the network can be done in two ways: either save them to disk first and then transfer, or just transfer without saving. Both approaches have their strong and weak points. The second one, in particular, depends heavily on the upload speed, which can shorten or lengthen the backup time. Other factors that influence it are the chunk size and the number of upload threads.

Percona XtraBackup 2.4.14 has gained S3 streaming, which is the capability to upload backups directly to s3-compatible storage without saving locally first. This feature was developed because we wanted to improve the upload speeds of backups in Percona Operator for XtraDB Cluster.
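
For reference, a streamed backup looks roughly like the sketch below (the bucket name, backup name, and credential handling are illustrative, and flag spellings vary a bit between XtraBackup versions):

xtrabackup --backup --stream=xbstream --target-dir=/tmp/backup | \
xbcloud put --storage=s3 \
    --s3-endpoint='s3.amazonaws.com' \
    --s3-bucket='my-backup-bucket' \
    --parallel=16 \
    full_backup_2019-11-26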

There are many implementations of S3 Compatible Storage: AWS S3, Google Cloud Storage, Digital Ocean Spaces, Alibaba Cloud OSS, MinIO, and Wasabi.

We’ve measured the speed of AWS CLI, gsutil, MinIO client, rclone, gof3r, and the xbcloud tool (part of Percona XtraBackup) on AWS (in single- and multi-region setups) and on Google Cloud. XtraBackup was compared in two variants: the default configuration and one with a tuned chunk size and number of upload threads.

Here are the results.

AWS (Same Region)

The backup data was streamed from an AWS EC2 instance to AWS S3, both in the us-east-1 region.

tool          settings                 CPU    max mem   speed      speed comparison
AWS CLI       default settings         66%    149MB     130MiB/s   baseline
AWS CLI       10MB block, 16 threads   68%    169MB     141MiB/s   +8%
MinIO client  not changeable           10%    679MB     59MiB/s    -55%
rclone rcat   not changeable           102%   7138MB    139MiB/s   +7%
gof3r         default settings         69%    252MB     97MiB/s    -25%
gof3r         10MB block, 16 threads   77%    520MB     108MiB/s   -17%
xbcloud       default settings         10%    96MB      25MiB/s    -81%
xbcloud       10MB block, 16 threads   60%    185MB     134MiB/s   +3%

Tip: If you run MySQL on an EC2 instance and back up within the same region, take snapshots instead.

AWS (From US to EU)

The backup data was streamed from AWS EC2 in us-east-1 to AWS S3 in eu-central-1.

tool          settings                 CPU    max mem   speed     speed comparison
AWS CLI       default settings         31%    149MB     61MiB/s   baseline
AWS CLI       10MB block, 16 threads   33%    169MB     66MiB/s   +8%
MinIO client  not changeable           3%     679MB     20MiB/s   -67%
rclone rcat   not changeable           55%    9307MB    77MiB/s   +26%
gof3r         default settings         69%    252MB     97MiB/s   +59%
gof3r         10MB block, 16 threads   77%    520MB     108MiB/s  +77%
xbcloud       default settings         4%     96MB      10MiB/s   -84%
xbcloud       10MB block, 16 threads   59%    417MB     123MiB/s  +101%

Tip: Think about disaster recovery and what you will do if a whole region becomes unavailable. Backing up within the same region makes little sense for that purpose; always transfer backups to another region.

Google Cloud (From US to EU)

The backup data was streamed from a Compute Engine instance in us-east1 to Cloud Storage in europe-west3. Interestingly, Google Cloud Storage supports both its native protocol and an S3 (interoperability) API, so Percona XtraBackup can transfer data to Google Cloud Storage directly via the S3 (interoperability) API.

tool         settings                              CPU   max mem   speed      speed comparison
gsutil       not changeable, native protocol       8%    246MB     23MiB/s    baseline
rclone rcat  not changeable, native protocol       6%    61MB      16MiB/s    -30%
xbcloud      default settings, s3 protocol         3%    97MB      9MiB/s     -61%
xbcloud      10MB block, 16 threads, s3 protocol   50%   417MB     133MiB/s   +478%

Tip: A cloud provider can block your account for many reasons: human or robot mistakes, content abuse after a hack, credit card expiration, sanctions, and so on. Think about disaster recovery and what you will do if a cloud provider blocks your account; it may make sense to back up to another cloud provider or on-premises.

Conclusion

With tuned settings, the xbcloud tool (part of Percona XtraBackup) is 2-5 times faster over long distances than the native cloud vendor tools, and with the same settings it is about 14% faster and requires about 20% less memory than its analogs. xbcloud is also the most reliable tool for transferring backups to S3-compatible storage, for two reasons:

  • It calculates MD5 sums during the upload, stores them in a .md5/filename.md5 file, and verifies the sums on download, as in the sketch after this list (gof3r does the same).
  • xbcloud sends data in 10MB chunks and resends any chunk if a network failure happens.
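
A matching download sketch (again with illustrative bucket and backup names) is:

xbcloud get --storage=s3 \
    --s3-endpoint='s3.amazonaws.com' \
    --s3-bucket='my-backup-bucket' \
    full_backup_2019-11-26 | xbstream -xv -C /var/lib/mysql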

PS: Please find instructions on GitHub if you would like to reproduce this article’s results.

Nov 22, 2019

UUIDs are Popular, but Bad for Performance — Let’s Discuss


If you do a quick web search about UUIDs and MySQL, you’ll get a fair number of results.

So, does a well-covered topic like this one need any more attention? Apparently, yes. Even though most posts warn people against the use of UUIDs, they are still very popular. This popularity comes from the fact that these values can easily be generated by remote devices, with a very low probability of collision. With this post, my goal is to summarize what has already been written by others and, hopefully, bring in a few new ideas.

What are UUIDs?

UUID stands for Universally Unique IDentifier and is defined in RFC 4122. It is a 128-bit number, normally written in hexadecimal and split by dashes into five groups. A typical UUID value looks like this:

yves@laptop:~$ uuidgen 
83fda883-86d9-4913-9729-91f20973fa52

There are officially five versions of UUID values, version 1 to 5, but the most common are: time-based (version 1 or version 2) and purely random (version 4). The time-based UUIDs encode the number of 100ns intervals since October 15, 1582 in 7.5 bytes (60 bits), which is split in a “time-low”-“time-mid”-“time-hi” fashion. The missing 4 bits are the version number, used as a prefix to the time-hi field. This yields the 64 bits of the first 3 groups. The last 2 groups are the clock sequence, a value incremented every time the clock is modified, and a host unique identifier. Most of the time, the MAC address of the main network interface of the host is used as the unique identifier.

There are important points to consider when you use time-based UUID values:

  • It is possible to determine the approximate time when the value was generated from the first 3 fields (see the SQL sketch after this list)
  • There are many repetitive fields between consecutive UUID values
  • The first field, “time-low”, rolls over every 429s
  • The MySQL UUID function produces version one values
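
For instance, here is a sketch of how the approximate generation time can be recovered from a version 1 value in SQL; the constant is the number of 100ns intervals between the UUID epoch (October 15, 1582) and the Unix epoch:

SET @uuid = '572e4122-0625-11ea-9f44-8c16456798f1';
SELECT FROM_UNIXTIME(
    (CAST(CONV(CONCAT(MID(@uuid, 16, 3),   -- time-hi without the version nibble
                      MID(@uuid, 10, 4),   -- time-mid
                      LEFT(@uuid, 8)),     -- time-low
               16, 10) AS UNSIGNED)
     - 122192928000000000) DIV 10000000) AS approx_generation_time;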

Here’s an example using the “uuidgen” Unix tool to generate time-based values:

yves@laptop:~$ for i in $(seq 1 500); do echo "$(date +%s): $(uuidgen -t)"; sleep 1; done
1573656803: 572e4122-0625-11ea-9f44-8c16456798f1
1573656804: 57c8019a-0625-11ea-9f44-8c16456798f1
1573656805: 586202b8-0625-11ea-9f44-8c16456798f1
...
1573657085: ff86e090-0625-11ea-9f44-8c16456798f1
1573657086: 0020a216-0626-11ea-9f44-8c16456798f1
...
1573657232: 56b943b2-0626-11ea-9f44-8c16456798f1
1573657233: 57534782-0626-11ea-9f44-8c16456798f1
1573657234: 57ed593a-0626-11ea-9f44-8c16456798f1
...

The first field rolls over (at t=1573657086) and the second field is incremented. It takes about 429s before similar values reappear in the first field. The third field changes only about once a year. The last field is static on a given host; on my laptop it is the MAC address:

yves@laptop:~$ ifconfig | grep ether | grep 8c
    	ether 8c:16:45:67:98:f1  txqueuelen 1000  (Ethernet)

The other frequently seen UUID version is 4, the purely random one. By default, the Unix “uuidgen” tool produces UUID version 4 values:

yves@laptop:~$ for i in $(seq 1 3); do uuidgen; done
6102ef39-c3f4-4977-80d4-742d15eefe66
14d6e343-028d-48a3-9ec6-77f1b703dc8f
ac9c7139-34a1-48cf-86cf-a2c823689a91

The only “repeated” value is the version, “4”, at the beginning of the 3rd field. Aside from the version and the two variant bits, all the other 122 bits are random.

What is so Wrong with UUID Values?

In order to appreciate the impact of using UUID values as a primary key, it is important to review how InnoDB organizes the data. InnoDB stores the rows of a table in the b-tree of the primary key. In database terminology, we call this a clustered index. The clustered index orders the rows automatically by the primary key.

When you insert a new row with a random primary key value, InnoDB has to find the page where the row belongs, load it into the buffer pool if it is not already there, insert the row and then, eventually, flush the page back to disk. With purely random values and large tables, all b-tree leaf pages are equally likely to receive the new row; there are no hot pages. Rows inserted out of primary key order cause page splits, which lead to a low fill factor. For tables much larger than the buffer pool, an insert will very likely need to read a table page from disk. The page in the buffer pool where the new row has been inserted will then be dirty. The odds that the page will receive a second row before it needs to be flushed to disk are very low. Most of the time, every insert will cause two IOPs – one read and one write. The first major impact is on the rate of IOPs, and it is a major limiting factor for scalability.

The only way to get decent performance is thus to use storage with low latency and high endurance, and that’s where you’ll find the second major performance impact. With a clustered index, the secondary indexes use the primary key values as pointers. While the leaves of the b-tree of the primary key store rows, the leaves of the b-tree of a secondary index store primary key values.

Let’s assume a table of 1B rows having UUID values as primary key and five secondary indexes. If you read the previous paragraph, you know the primary key values are stored six times for each row. That means a total of 6B char(36) values representing 216 GB. That is just the tip of the iceberg, as tables normally have foreign keys, explicit or not, pointing to other tables. When the schema is based on UUID values, all these columns and indexes supporting them are char(36). I recently analyzed a UUID based schema and found that about 70 percent of storage was for these values.

As if that’s not enough, there’s a third important impact of using UUID values. Integer values are compared up to 8 bytes at a time by the CPU, but UUID values are compared character by character. Databases are rarely CPU bound, but this nevertheless adds to query latency. If you are not convinced, look at this performance comparison between integers and strings:

mysql> select benchmark(100000000,2=3);
+--------------------------+
| benchmark(100000000,2=3) |
+--------------------------+
|                        0 |
+--------------------------+
1 row in set (0.96 sec)

mysql> select benchmark(100000000,'df878007-80da-11e9-93dd-00163e000002'='df878007-80da-11e9-93dd-00163e000003');
+----------------------------------------------------------------------------------------------------+
| benchmark(100000000,'df878007-80da-11e9-93dd-00163e000002'='df878007-80da-11e9-93dd-00163e000003') |
+----------------------------------------------------------------------------------------------------+
|                                                                                                  0 |
+----------------------------------------------------------------------------------------------------+
1 row in set (27.67 sec)

Of course, the above example is a worst-case scenario, but it at least gives a sense of the scale of the issue. Comparing integers is about 28 times faster. Even when the difference appears early in the char values, the comparison is still about 2.5 times slower:

mysql> select benchmark(100000000,'df878007-80da-11e9-93dd-00163e000002'='ef878007-80da-11e9-93dd-00163e000003');
+----------------------------------------------------------------------------------------------------+
| benchmark(100000000,'df878007-80da-11e9-93dd-00163e000002'='ef878007-80da-11e9-93dd-00163e000003') |
+----------------------------------------------------------------------------------------------------+
|                                                                                                  0 |
+----------------------------------------------------------------------------------------------------+
1 row in set (2.45 sec)

Let’s explore a few solutions to address those issues.

Size of the Values

The default representation for UUID, hash, and token values is often the hexadecimal notation. With a cardinality (the number of possible values) of only 16 per character, it is far from efficient. What about using another representation like base64 or even straight binary? How much do we save? How is performance affected?

Let’s begin with the base64 notation. Each base64 character encodes 6 bits, so four characters represent 3 bytes of actual value. A UUID value consists of 16 bytes of data; since 16 divided by 3 leaves a remainder of 1, the base64 encoding adds ‘==’ padding at the end:

mysql> select to_base64(unhex(replace(uuid(),'-','')));
+------------------------------------------+
| to_base64(unhex(replace(uuid(),'-',''))) |
+------------------------------------------+
| clJ4xvczEeml1FJUAJ7+Fg==                 |
+------------------------------------------+
1 row in set (0.00 sec)

If the length of the encoded entity is known, as it is for a UUID, we can remove the ‘==’, as it is just dead weight. A UUID encoded in base64 thus has a length of 22 characters.

The next logical step is to store the value directly in binary format. This is the most compact format, but displaying the values in the mysql client is less convenient.
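
As a small sketch (the table name is illustrative), storing and reading back binary UUID values looks like this:

CREATE TABLE `data_bin` (
  `id` binary(16) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

INSERT INTO data_bin VALUES (unhex(replace(uuid(),'-','')));

-- hex() makes the stored value readable again in the mysql client
SELECT hex(id) FROM data_bin;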

So, how’s the size impacting performance? To illustrate the impact, I inserted random UUID values in a table with the following definition…

CREATE TABLE `data_uuid` (
  `id` char(36) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

… for the default hexadecimal representation. For base64, the ‘id’ column is defined as char(22), while binary(16) is used for the binary example. The database server has a buffer pool size of 128M and its IOPS are limited to 500. The insertions are done from a single thread.

Insertion rates for tables using different representation for UUID values

In all cases, the insertion rate is at first CPU bound, but as soon as the table is larger than the buffer pool, insertion rapidly becomes IO bound. This is expected and shouldn’t surprise anyone. The use of a smaller representation for the UUID values just allows more rows to fit in the buffer pool, but in the long run it doesn’t really help performance, as the random insertion order dominates. If you are using random UUID values as primary keys, your performance is limited by the amount of memory you can afford.

Option 1: Saving IOPs with Pseudo-Random Order

As we have seen, the most important issue is the random nature of the values. A new row may end up in any of the table leaf pages. So unless the whole table is loaded in the buffer pool, it means a read IOP and eventually a write IOP. My colleague David Ducos gave a nice solution to this problem but some customers do not want to allow for the possibility of extracting information from the UUID values, like, for example, the generation timestamp.

What if we just reduced the randomness of the values so that a prefix of a few bytes stays constant for a time interval? During that interval, only a fraction of the whole table, corresponding to the cardinality of the prefix, would need to be in memory to save the read IOPs. This would also increase the likelihood that a page receives a second write before being flushed to disk, thus reducing the write load. Let’s consider the following UUID generation function:

drop function if exists f_new_uuid; 
delimiter ;;
CREATE DEFINER=`root`@`%` FUNCTION `f_new_uuid`() RETURNS char(36)
    NOT DETERMINISTIC
BEGIN
    DECLARE cNewUUID char(36);
    DECLARE cMd5Val char(32);


    set cMd5Val = md5(concat(rand(),now(6)));
    set cNewUUID = concat(left(md5(concat(year(now()),week(now()))),4),left(cMd5Val,4),'-',
        mid(cMd5Val,5,4),'-4',mid(cMd5Val,9,3),'-',mid(cMd5Val,13,4),'-',mid(cMd5Val,17,12));

    RETURN cNewUUID;
END;;
delimiter ;

The first four characters of the UUID value come from the MD5 hash of the concatenation of the current year and week number. This value is, of course, static over a week. The remainder of the UUID value comes from the MD5 of a random value and the current time at a precision of 1us. The third field is prefixed with a “4” to indicate it is a version 4 UUID type. There are 65536 possible prefixes, so during a week only 1/65536 of the table rows need to be in memory to avoid a read IOP upon insertion. That’s much easier to manage: a 1TB table needs only about 16MB in the buffer pool to support the inserts.

Option 2: Mapping UUIDs to Integers

Even if you use pseudo-ordered UUID values stored using binary(16), it is still a very large data type which will inflate the size of the dataset. Remember the primary key values are used as pointers in the secondary indexes by InnoDB. What if we store all the UUID values of a schema in a mapping table? The mapping table will be defined as:

CREATE TABLE `uuid_to_id` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `uuid` char(36) NOT NULL,
  `uuid_hash` int(10) unsigned GENERATED ALWAYS AS (crc32(`uuid`)) STORED NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_hash` (`uuid_hash`)
) ENGINE=InnoDB AUTO_INCREMENT=2590857 DEFAULT CHARSET=latin1;

It is important to notice that the uuid_to_id table does not enforce the uniqueness of uuid. The idx_hash index acts a bit like a bloom filter: we’ll know for sure a UUID value is not present in the table when there is no matching hash value, but we’ll have to validate against the stored UUID value when there is a matching hash. To help us here, let’s create a SQL function:

DELIMITER ;;
CREATE DEFINER=`root`@`%` FUNCTION `f_uuid_to_id`(pUUID char(36)) RETURNS int(10) unsigned
    DETERMINISTIC
BEGIN
        DECLARE iID int unsigned;
        DECLARE iOUT int unsigned;

        select get_lock('uuid_lock',10) INTO iOUT;

        SELECT id INTO iID
        FROM uuid_to_id WHERE uuid_hash = crc32(pUUID) and uuid = pUUID;

        IF iID IS NOT NULL THEN
            select release_lock('uuid_lock') INTO iOUT;
            SIGNAL SQLSTATE '23000'
                SET MESSAGE_TEXT = 'Duplicate entry', MYSQL_ERRNO = 1062;
        ELSE
            insert into uuid_to_id (uuid) values (pUUID);
            select release_lock('uuid_lock') INTO iOUT;
            set iID = last_insert_id();
        END IF;

        RETURN iID;
END ;;
DELIMITER ;

The function checks whether the UUID value passed exists in the uuid_to_id table; if it does, the function returns the matching id value, otherwise it inserts the UUID value and returns last_insert_id. To protect against concurrent submissions of the same UUID value, I added a database lock. The lock limits the scalability of the solution; if your application cannot submit the same request twice within a very short time frame, it could be removed. I also have another version of the function with no lock calls, using a small dedup table where recent rows are kept for only a few seconds. See my GitHub if you are interested.
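
As a usage sketch (the table and column names are hypothetical), an application table then stores only integers and calls the function on insert:

CREATE TABLE `orders` (
  `id` int(10) unsigned NOT NULL,
  `created_at` datetime NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

INSERT INTO orders (id, created_at)
VALUES (f_uuid_to_id('83fda883-86d9-4913-9729-91f20973fa52'), now());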

Results for the Alternate Approaches

Now, let’s have a look at the insertion rates using these alternate approaches.

Insertion on tables using UUID values as primary keys, alternative solutions

The pseudo-order results are great. Here I modified the algorithm to keep the UUID prefix constant for one minute instead of one week in order to better fit the test environment. Even if the pseudo-order solution performs well, keep in mind it is still bloating the schema and overall the performance gains may not be that great.

The mapping to integer values, although the insert rates are lower due to the additional DML required, decouples the schema from the UUID values. The tables now use integers as primary keys. This mapping removes nearly all the scalability concerns of using UUID values. Still, even on a small VM with limited CPU and IOPS, the UUID mapping technique yields nearly 4000 inserts/s. Put into context, this means 14M rows per hour, 345M rows per day, and 126B rows per year. Such rates likely fit most requirements. The only growth-limiting factor is the size of the hash index: when it becomes too large to fit in the buffer pool, performance will start to decrease.

Other Options than UUID Values?

Of course, there are other possibilities to generate unique IDs.  The method used by the MySQL function UUID_SHORT() is interesting. A remote device like a smartphone could use the UTC time instead of the server uptime. Here’s a proposal:

(Seconds since January 1st 1970) << 32
+ (lower 2 bytes of the wifi MAC address) << 16
+ 16_bits_unsigned_int++;

The 16-bit counter should be initialized at a random value and allowed to roll over. The odds of two devices producing the same ID are very small: it would have to happen at approximately the same time, and both devices would need the same lower MAC bytes and their 16-bit counters at the same increment.

Notes

All the data related to this post can be found in my github.

Nov 22, 2019

Tips for Designing Grafana Dashboards


As Grafana powers our star product – Percona Monitoring and Management (PMM) – we have developed a lot of experience creating Grafana dashboards over the last few years. In this article, I will share some of the considerations for designing Grafana dashboards. As usual, when it comes to questions of design, they are quite subjective, and I do not expect you to choose to apply all of them to your dashboards, but I hope they will help you to think through your dashboard design better.

Design Practical Dashboards

Grafana features many panel types, and even more are available as plugins. It may be very tempting to use many of them in your dashboards with many different visualization options. Don’t! Stick to a few data visualization patterns and add other visualizations only when they provide additional practical value, not because they are cool. The Graph and Singlestat panel types probably cover 80% of use cases.

Do Not Place Too Many Graphs Side by Side

This will depend a lot on how your dashboards are used. If your dashboard is designed for large screens placed on the wall, you may be able to fit more graphs side by side; if your dashboard needs to scale down to a lower-resolution small laptop screen, I would suggest sticking to 2-3 graphs in a row.

Use Proper Units

Grafana allows you to specify a unit for the data type displayed. Use it! Without a unit set, values will not be properly shortened and will be very hard to read:

Grafana Dashboards

Compare this to

Grafana Dashboards2

Mind Decimals

You can specify the number of digits to display after the decimal point or leave it at the default. I found the default choice does not always work very well, for example here:

Grafana Dashboards3

For some reason, the panel axis shows way too many digits after the decimal point. Grafana also often picks three digits after the decimal point, as in the table below, which I find inconvenient – at a glance it is hard to tell whether we’re dealing with a decimal point or with “,” as a thousands separator, so I may be looking at 2462 GiB there. While a mix-up is not feasible in this case, there are cases, such as data rates, where a 1000x difference in values is quite possible. Instead, I prefer setting it to one decimal (or zero if that is enough), which makes it clear we’re not looking at thousands.

Label your Axis

You can label your axes, which especially makes sense when the presentation is not obvious, as in this example where we’re using a negative value to plot writes to a swap file.

Grafana Dashboards4

Use Shared Crosshair or Tooltip

In Dashboard Settings, you will find the “Graph Tooltip” option, which can be set to “Default”, “Shared Crosshair”, or “Shared Tooltip”. This is how these will look:

Grafana Dashboards5

Grafana Dashboards 6


Shared crosshair shows a line matching the same time on all panels, while shared tooltip shows the tooltip value on all panels at the same time. You can pick what makes sense for you; my favorite is the tooltip setting, because it allows me to visually compare the same point in time without making the dashboard too slow and busy.

Note there is a handy shortcut, CTRL-O, which allows you to cycle between these settings for any dashboard.

Pick Colors

If you’re displaying truly dynamic information you will likely have to rely on Grafana’s automatic color assignment, but if not, you can pick specific colors for all values being plotted.  This will prevent colors from potentially being re-assigned to different values without you planning to do so.

Grafana Dashboards 7

When picking colors, you also want to make sure they make logical sense. For example, I think “green” is a better color for free memory than “red”. As you pick the colors, use the same colors for the same type of information when you show it on different panels, if possible, because it makes the dashboards easier to understand.

I would even suggest sticking to the same (or similar) colors for the same kind of data – if you have many panels that show disk input and output, using similar colors for them can be a good idea.

Fill Stacking Graphs

Grafana does not require it, but I would suggest you use filling when you display stacked data and avoid filling when you’re plotting multiple independent values. Take a look at these graphs:

In the first graph, I need to look at the actual plotted value to understand what I’m looking at. In the second graph, that value is meaningless; what is valuable is the filled amount, and I can see at a glance how much the Cache (the blue value) has shrunk.

I prefer using a fill factor of 6+ so it is easier to match the fill colors with colors in the table.   For the same reason, I prefer not to use the fill gradient on such graphs as it makes it much harder to see the color and the filled volume.

Do Not Abuse Double Axis

Graphs that use a double axis are much harder to understand. I used to use them very often, but now I avoid them when possible, only resorting to them when I absolutely need to limit the number of panels.

Note in this case I think gradient fits OK because there is only one value displayed as the line, so you can’t get confused if you need to look at total value or “filled volume”.

Separate Data of Different Scales on Different Graphs

I used to plot InnoDB rows read and written on the same graph. It is quite common for reads to be 100x higher in volume than writes, crowding the writes out and making even significant changes in them very hard to see. Splitting them into different graphs solved this issue.

Consider Staircase Graphs

In monitoring applications, we often display average rates computed over a period of time. If this is the case, we do not know how the rate changed within that period, and it would be misleading to show that. This especially matters if you’re displaying only a few data points.

Let’s look at this graph which is being viewed with one-hour resolution:

This visually suggests the number of rows read was falling from 16:00 to 18:00. Compare it to the staircase graph:

It simply shows us that the value at 18:00 was higher than at 17:00, but makes no claim about the change in between.

This display, however, has another issue. Let’s look at the same data set with 5min resolution:

We can see the average value from 16:00 to 17:00 was lower than from 17:00 to 18:00, which is NOT what the lower-resolution staircase graph shows – there, the value for 17:00 to 18:00 is actually lower!

The reason is that if we compute rate() over 1 hour on the Prometheus side, the value computed at 17:00 is returned as a data point for 17:00 even though the average rate really covers 16:00 to 17:00, while the staircase graph plots it from 17:00 to 18:00, until a new value is available. It is off by one hour.

To fix it, you need to shift the data appropriately. In Prometheus, which we use in PMM, I can use an offset operator to shift the data to be displayed correctly:

Provide Multiple Resolutions

I’m a big fan of being able to see the data on the same dashboard with different resolutions, which can be done through a special dashboard variable of type “Interval”.  High-resolution data can provide a great level of detail but can be very volatile.

While lower resolution can hide this level of detail, it does show trends better.

Multiple Aggregates for the Same Metrics

To get even more insights, you can consider plotting the same metrics with different aggregates applied to it:

In this case, we are looking at the same variable – threads_running – but at its average value over a period of time versus max (peak) value. Both of them are meaningful in a different way.

You can also notice here that points are used for the max value instead of a line. This is in general a good practice for highly volatile data, as plotting a line for something that changes wildly is messy and does not provide much value.

Use Help and Panel Links

If you fill out a description for the panel, it will be visible when you place your mouse over the tiny “i” sign. This is very helpful for explaining what the panel shows and how to use its data. You can use Markdown for formatting. You can also provide one or more panel links to additional help or drill-downs.

With newer Grafana versions, you can even define a more advanced drill-down, which can contain different URLs based on the series you are looking at, as well as other templating variables:

Summary

This list of considerations for designing Grafana Dashboards and best practices is by no means complete, but I hope you pick up an idea or two which will allow you to create better dashboards!

Nov 20, 2019

Profiling Software Using perf and Flame Graphs


In this blog post, we will see how to use perf (a.k.a. perf_events) together with Flame Graphs to generate a graphical representation of what functions are being called within our software of choice. Percona Server for MySQL is used here, but the approach can be extended to any software you can take a resolved stack trace from.

Before moving forward, a word of caution. As with any profiling tool, DON’T run this in production systems unless you know what you are doing.

Installing Packages Needed

For simplicity, I’ll use commands for CentOS 7, but things should be the same for Debian-based distros (apt-get install linux-tools-$(uname -r) instead of the yum command is the only difference in the steps).

To install perf, simply issue:

SHELL> sudo yum install -y perf

To get Flame Graphs project:

SHELL> mkdir -p ~/src
SHELL> cd ~/src
SHELL> git clone https://github.com/brendangregg/FlameGraph

That’s it! We are good to go.

Capturing Samples

Flame Graphs are a way of visualizing data, so we need to have some samples we can base off of. There are three ways in which we can do this. (Note that we will use the -p flag to only capture data from our process of interest, but we can potentially capture data from all the running processes if needed.)

1- Capture for a set amount of time only (ten seconds here):

SHELL> sudo perf record -a -F 99 -g -p $(pgrep -x mysqld) -- sleep 10

2- Capture until we send the interrupt signal (CTRL-C):

SHELL> sudo perf record -a -F 99 -g -p $(pgrep -x mysqld)

3- Capture for the whole lifetime of the process:

SHELL> sudo perf record -a -F 99 -g -- /sbin/mysqld \
--defaults-file=/etc/percona-server.conf.d/mysqld.cnf --user=mysql

or

SHELL> sudo perf record -a -F 99 -g -p $(pgrep -x mysqld) -- mysql -e "SELECT * FROM db.table"

In the first case of the third variant, we are forced to capture data from all processes, since it’s impossible to know the process ID (PID) beforehand (with that command, we are actually starting the MySQL service itself). This type of command comes in handy when you want data from the exact beginning of the process, which is not possible otherwise.

In the second variant, we are running a query on an already-running MySQL service, so we can use the -p flag to capture data on the server process. This is handy if you want to capture data at the exact moment a job is running, for instance.

Preparing the Samples

After the initial capture, we will need to make the collected data “readable”. This is needed because it is stored in binary format by perf record. For this we will use:

SHELL> sudo perf script > perf.script

It reads perf.data by default, which is also the default output file name used by perf record. These can be overridden with the -i flag (for perf script) and the -o flag (for perf record), respectively.
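
For example (the file names are illustrative):

SHELL> sudo perf record -o mysql.perf.data -a -F 99 -g -p $(pgrep -x mysqld) -- sleep 10
SHELL> sudo perf script -i mysql.perf.data > perf.script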

We will now be able to read the generated text file, as it will be in a human-readable form. However, when doing so, you will quickly realize why we need to aggregate all this data into a more intelligible form.

Generating the Flame Graphs

We can do the following in a one-liner, by piping the output of the first as input to the second. Since we didn’t add the FlameGraph git folder to our path, we will need to use full paths.

SHELL> ~/src/FlameGraph/stackcollapse-perf.pl perf.script | ~/src/FlameGraph/flamegraph.pl > flamegraph.svg

We can now open the .svg file in any browser and start analyzing the information-rich graphs.

How Does it Look?

As an example, I will leave full commands, their outputs, and a screenshot of a flame graph generated by the process using data capture method #2. We will run an INSERT INTO … SELECT query to the database, so we can then analyze its execution.

SHELL> time sudo perf record -a -F 99 -g \
-p $(pgrep -x mysqld) \
-- mysql test -e "INSERT INTO joinit SELECT NULL, uuid(), time(now()),  (FLOOR( 1 + RAND( ) *60 )) FROM joinit;"
Warning:
PID/TID switch overriding SYSTEM
[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 1.909 MB perf.data (8214 samples) ]

real 1m24.366s
user 0m0.133s
sys 0m0.378s

SHELL> sudo perf script | \
~/src/FlameGraph/stackcollapse-perf.pl | \
~/src/FlameGraph/flamegraph.pl > mysql_select_into_flamegraph.svg

The keen-eyed reader will notice we went one step further here and joined steps #2 and #3 via a pipe (|) to avoid writing to and reading from the perf.script output file. Additionally, there are time outputs so we can estimate the amount of data the tool generates (~2MB in 1min 24sec); this will, of course, vary depending on many factors, so take it with a pinch of salt and test in your own environment.

The resulting flame graph is:

perf and Flame Graphs

One clear candidate for optimization is the work around write_record: if we can make that function faster, there is a lot of potential for reducing overall execution time (boxed in blue in the bottom left corner, we can see that a total of ~60% of the samples were taken in this codepath). In the last section below we link to a blog post explaining more about how to interpret a Flame Graph, but for now, know that you can mouse over the function names and the information shown at the bottom left corner will change dynamically. You may also visualize it better with the following guides in place:

flame graphs

Conclusion

For the Support team, we use this procedure in many cases where we need an in-depth view of what MySQL is executing and for how long. This way, we gain better insight into what operations are behind a specific workload and can act accordingly. This procedure can be used either for optimizing or for troubleshooting and is a very powerful tool in our tool belt! It’s known that humans are better at processing images than text, and this tool exploits that brilliantly, in my opinion.

Related links

Interpreting Flame Graphs (scroll down to the “Flame Graph Interpretation” section)

Flame Graphs 201, a great webinar by Marcos, if you want to dig deeper into this

Of course, Brendan Gregg (the mastermind behind the Flame Graph project) has even more information on this

Nov 19, 2019

Installing MySQL with Docker


I often need to install a certain version of MySQL, MariaDB, or Percona Server for MySQL to run some experiments, whether to check for behavior differences or to provide tested instructions. In this blog series, I will look into how you can install MySQL, MariaDB, or Percona Server for MySQL with Docker. This post, part one, is focused on MySQL Server.

Docker is actually not my preferred method, as it does not match a typical production install: if you look at service control behavior or file layout, it is quite different. What is great about Docker, though, is that it allows installing the latest MySQL version – as well as any other version – very easily.

Docker also is easy to use when you need a simple, single instance.  If you’re looking into some replication-related behaviors, DBDeployer may be a better tool for that.

These instructions are designed to get a test instance running quickly and easily; you do not want to use these for production deployments. All instructions below assume Docker is already installed.

First, you should know there are not one but two “official” MySQL Docker repositories. One of them is maintained by the Docker team and is available via a simple docker run mysql:latest. The other one is maintained by the MySQL team at Oracle and uses the docker run mysql/mysql-server:latest syntax. In the examples below, we will use the MySQL team’s Docker images, though the Docker team’s images work in a similar way.

Installing the Latest MySQL Version with Docker

docker run --name mysql-latest  \
-p 3306:3306 -p 33060:33060  \
-e MYSQL_ROOT_HOST='%' -e MYSQL_ROOT_PASSWORD='strongpassword'   \
-d mysql/mysql-server:latest

This will start the latest MySQL version in an instance that is remotely accessible from anywhere with the specified root password. This is easy for testing, but not a good security practice (which is why it is not the default).
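
If you only need local access, a sketch of a more restrictive variant is to omit MYSQL_ROOT_HOST, in which case the image creates root@localhost only, reachable just from inside the container (the container name is illustrative):

docker run --name mysql-local  \
-e MYSQL_ROOT_PASSWORD='strongpassword'   \
-d mysql/mysql-server:latest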

Connecting to MySQL Server Docker Container

Installing with Docker means you do not get any tools, utilities, or libraries available on your host directly, so you either install these separately, access the created instance from a remote host, or use the command-line tools shipped with the Docker image.

To start the MySQL command-line client with Docker, run:

docker exec -it mysql-latest mysql -uroot -pstrongpassword

To start MySQL Shell with Docker, run:

docker exec -it mysql-latest mysqlsh -uroot -pstrongpassword

Managing MySQL Server in Docker Container

When you want to stop the MySQL Server Docker container, run:

docker stop mysql-latest

If you want to restart a stopped MySQL Docker container, you should not try to use docker run to start it again. Instead, you should use:

docker start mysql-latest

If something is not right, for example, if the container is not starting, you can access its logs using this command:

docker logs mysql-latest

If you want to re-create a fresh docker container from scratch you can run:

docker stop mysql-latest
docker rm mysql-latest

Followed by the docker run command described above.

Passing Command Line Options to MySQL Server in Docker Container

If you want to pass some command line options to MySQL Server, you can do it this way:

docker run --name mysql-latest  \
-p 3306:3306 -p 33060:33060  \
-e MYSQL_ROOT_HOST='%' -e MYSQL_ROOT_PASSWORD='strongpassword'   \
-d mysql/mysql-server:latest \
--innodb_buffer_pool_size=256M \
--innodb_flush_method=O_DIRECT

Running Different MySQL Server Versions in Docker

If you just want to run one MySQL version at a time in a Docker container, it is easy – pick the version you want with the Docker image tag and use a different name to avoid conflicts:

docker run --name mysql-8.0.17  \
-p 3306:3306 -p 33060:33060  \
-e MYSQL_ROOT_HOST='%' -e MYSQL_ROOT_PASSWORD='strongpassword'   \
-d mysql/mysql-server:8.0.17

This will start MySQL 8.0.17 in Docker Container.

docker run --name mysql-5.7  \
-p 3306:3306 -p 33060:33060  \
-e MYSQL_ROOT_HOST='%' -e MYSQL_ROOT_PASSWORD='strongpassword'   \
-d mysql/mysql-server:5.7

And this will start the latest MySQL 5.7 in Docker.

Running Multiple MySQL Server Versions at the Same Time in Docker

The potential problem with running multiple MySQL versions in Docker at the same time is a TCP port conflict. If you do not access the Docker containers from outside and only run the utilities included in the same container, you can just remove the port mapping (-p option) and run multiple containers:

docker run --name mysql-latest  \
-e MYSQL_ROOT_HOST='%' -e MYSQL_ROOT_PASSWORD='strongpassword'   \
-d mysql/mysql-server:latest

docker run --name mysql-8.0.17  \
-e MYSQL_ROOT_HOST='%' -e MYSQL_ROOT_PASSWORD='strongpassword'   \
-d mysql/mysql-server:8.0.17

In the more common case, when you need to access the Docker containers externally, you will want to map them to different external ports. For example, to start the latest MySQL 8 on ports 3306/33060 and MySQL 8.0.17 on ports 3307/33070, we can use:

docker run --name mysql-latest  \
-p 3306:3306 -p 33060:33060  \
-e MYSQL_ROOT_HOST='%' -e MYSQL_ROOT_PASSWORD='strongpassword'   \
-d mysql/mysql-server:latest


docker run --name mysql-8.0.17  \
-p 3307:3306 -p 33070:33060  \
-e MYSQL_ROOT_HOST='%' -e MYSQL_ROOT_PASSWORD='strongpassword'   \
-d mysql/mysql-server:8.0.17

There are a lot more things to consider if you’re going to use MySQL on Docker for anything beyond testing. For more information, check out the MySQL Server page on Docker Hub and the MySQL Manual.

Nov 12, 2019

Watch Out for Disk I/O Performance Issues when Running EXT4


Recently, at Percona Live Europe 2019, Dimitri Kravchuk from Oracle mentioned that he observed an unclear drop in performance for MySQL on an ext4 filesystem with the latest Linux kernels. I decided to check this case out on my side and found that indeed, starting from Linux kernel 4.9, there are cases with notable (up to 2x) performance drops for the ext4 filesystem in direct I/O mode.

So what’s wrong with ext4? It started in 2016 with a patch that went into kernel 4.9: “ext4: Allow parallel DIO reads”. The purpose of that patch was to improve read scalability in direct I/O mode. However, along with improvements in pure read workloads, it also introduced a regression in intense mixed random read/write scenarios. Oddly, this issue went unnoticed for 3 years; only this summer was the performance regression reported and discussed on LKML. As a result of this discussion, there is an attempt to fix it, but from my current understanding, that fix will land only in the upcoming 5.4/5.5 kernels. Below I will describe what this regression looks like, how it affects MySQL workloads, and what workarounds we can apply to mitigate it.

ext4 Performance Regression

Let’s start by defining the scope of this ext4 performance regression. It will only have an impact if the setup/workload meets all of the following conditions:
– fast SSD/NVMe storage
– Linux kernel >= 4.9
– files reside on an ext4 filesystem
– files are opened with the O_DIRECT flag
– at least some of the I/O is synchronous

In the original report to LKML, the issue was observed/reproduced with a mixed random read/write scenario using sync I/O and O_DIRECT. But how do these factors relate to MySQL? The only files opened by InnoDB in O_DIRECT mode are tablespaces (*.ibd files), and the I/O pattern for tablespaces consists of the following operations:

– reads ibd data in synchronous mode
– writes ibd data in asynchronous mode
– posix_fallocate to extend the tablespace file, followed by a synchronous write
– fsync

There is also extra I/O from the WAL (redo) log files:

– writes data to log files in synchronous mode
– fsync

So in the case of InnoDB tablespaces opened with O_DIRECT, we have a mix of sync reads and async writes, and it turns out that such a combination, along with sync writes to the InnoDB log file, is enough to cause a notable performance regression as well. I have sketched the workload for the fio tool (see below) to simulate the I/O access pattern of InnoDB and have run it on SSD and NVMe drives for Linux kernels 4.4.0, 5.3.0, and 5.3.0 with the ext4 scalability fix.

[global]
filename=tablespace1.ibd:tablespace2.ibd:tablespace3.ibd:tablespace4.ibd:tablespace5.ibd
direct=1
bs=16k
iodepth=1

#read data from *.ibd tablespaces
[ibd_sync_read]
rw=randread
ioengine=psync

#write data to *.ibd tablespaces
[ibd_async_write]
rw=randwrite
ioengine=libaio

#write data to ib* log file
[ib_log_sync_write]
rw=write
bs=8k
direct=0
ioengine=psync
fsync=1
filename=log.ib
numjobs=1

fio results on the chart:

Observations:

– for the SATA/SSD drive there is almost no difference in throughput, and only at 16 threads do we see a drop in reads for ext4/kernel-5.3.0. For ext4/kernel-5.3.0 mounted with dioread_nolock (which enables the scalability fixes), we see that IOPS are back and even look better.
– for the NVMe drive the situation looks quite different – up to 8 I/O threads, IOPS for both reads and writes are more or less similar, but after increasing the pressure on I/O we see a notable spike for writes and a similar drop for reads. And again, mounting ext4 with dioread_nolock helps to get the same throughput as for kernels < 4.9.

The similar performance data for the original issue reported to LKML (with more details and analysis) can be found in the patch itself.

How it Affects MySQL

O_DIRECT

Now let’s check the impact of this issue on an IO-bound sysbench/OLTP_RW workload in O_DIRECT mode. I ran a test for the following setup:

– filesystem: xfs, ext4/default, ext4/dioread_nolock
– drives: SATA/SSD and NVMe
– kernels: 4.4.0, 5.3.0, 5.3.0+ilock_fix

Observations

– in the case of SATA/SSD, the ext4 scalability issue has an impact on the tps rate after 256 threads, and the drop is 10-15%
– in the case of NVMe, regular ext4 with kernel 5.3.0 causes a performance drop of ~30-80%. If we apply the fix by mounting ext4 with dioread_nolock or use xfs, throughput looks good.

O_DSYNC

As ext4 regression affects O_DIRECT, let’s replace O_DIRECT with O_DSYNC and look at results of the same sysbench/OLTP_RW workload on kernel 5.3.0:

Note: In order to make the results between O_DIRECT and O_DSYNC comparable, I limited the available memory for the MySQL instance with a cgroup.
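
For reference, a minimal sketch of such a memory cap with the cgroup v1 tools (the group name and limit are illustrative) could look like:

sudo cgcreate -g memory:mysql_limited
sudo cgset -r memory.limit_in_bytes=4G mysql_limited
sudo cgexec -g memory:mysql_limited /usr/sbin/mysqld --defaults-file=/etc/my.cnf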

Observations:

In the case of O_DSYNC and regular ext4, performance is just 10% lower than for O_DIRECT/ext4/dioread_nolock and O_DIRECT/xfs, and ~35% better than for O_DIRECT/ext4. That means O_DSYNC can be used as a workaround when you have fast storage and ext4 as the filesystem but can’t switch to xfs or upgrade the kernel.

Conclusions/workarounds

If your workload/setup is affected, there are the following options that you may consider as a workaround:

– downgrade the Linux kernel to 4.8
– install kernel 5.3.0 with the fix and mount ext4 with the dioread_nolock option (see the sketch below)
– if O_DIRECT is important, switch to the xfs filesystem
– if changing the filesystem is not an option, replace O_DIRECT with O_DSYNC+cgroup
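
A mount sketch for the second option (device and mount point are illustrative):

mount -o dioread_nolock /dev/nvme0n1p1 /var/lib/mysql
# or persistently, via a line in /etc/fstab:
# /dev/nvme0n1p1  /var/lib/mysql  ext4  defaults,dioread_nolock  0 0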

Nov 04, 2019

Choose Your EC2 Instance Type Wisely on AWS


Recently I was doing some small tests using EC2 instances on AWS, and I noticed that execution time and performance depended highly on which time of day I ran my scripts. I was using the t3.xlarge instance type as I didn’t need many CPUs or much memory for my tests, but from time to time I planned to use all the resources for a short period (a few minutes), and this is when I noticed the difference.

First, let’s see what AWS says about T3 instances:

T3 instances start in Unlimited mode by default, giving users the ability to sustain high CPU performance over any desired time frame while keeping cost as low as possible.

In theory, I should not have had any issues or performance differences. I also monitored the CPU credit balance, and there was no correlation between the balance and the performance at all; because these were Unlimited-mode instances, the balance should not have any impact.

I decided to start a longer sysbench test on 3 threads to see how the QPS changed over the day.

As you can see, the queries per second could go down by almost 90%, which is a lot. It’s important to highlight that the sysbench script should have generated a very steady workload. So what explains this big difference? After checking all the graphs, I found this:

Stealing! A lot of stealing! Here is a good article that explains CPU steal very well. So probably I had a noisy neighbor. This instance was running in N. California. I stopped it and started new instances to repeat the test, but I always got very similar results: there was a lot of stealing, which hurt performance badly, probably because that region is very popular and resources are limited.
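
To see whether you are affected, watch the steal columns while the workload runs, for example:

vmstat 5          # the "st" column is the percentage of stolen CPU time
mpstat -P ALL 5   # the "%steal" column shows the same per CPU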

Out of curiosity, I have started two similar instances in the Stockholm region and repeated the same test and I got very steady performance as you can see here:

I guess this region is not as popular or as full yet, and we can see there is a huge difference depending on where you start your instance.

I also repeated the tests with the m5.xlarge instance type to see if it has the same behavior or not.

N. California

Stockholm

After I changed the instance type, we can see that both regions give very similar, steady performance, but if we take a closer look:

N. California

Stockholm

The instance in Stockholm still performs almost 5% more QPS than the one in N. California, and uses more CPU as well.

Conclusion

If you are using the T2 and T3 instance types, you should monitor CPU usage very closely because noisy neighbors can hurt your performance a lot. If you need stable performance, T2 and T3 are not recommended; if you only need a short burst it might work, but you still have to monitor the steal. Other instance types can give you much more stable performance, but you could still see some difference between the regions.

Nov 01, 2019

Use MySQL Without a Password (And Still Be Secure)


Some say that the best password is the one you don’t have to remember. That’s possible with MySQL, thanks to the auth_socket plugin and its MariaDB version, unix_socket.

Neither of these plugins is new, and some words have been written about auth_socket on this blog before, for example: how to change passwords in MySQL 5.7 when using plugin: auth_socket. But while reviewing what’s new with MariaDB 10.4, I saw that unix_socket now comes installed by default and is one of the authentication methods (“one of them” because in MariaDB 10.4 a single user can have more than one authentication plugin, as explained in the Authentication from MariaDB 10.4 document).
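
As a sketch of that MariaDB 10.4 capability (the password is illustrative), a single user can accept either the socket or a password:

ALTER USER 'root'@'localhost'
  IDENTIFIED VIA unix_socket
  OR mysql_native_password USING PASSWORD('strongpassword');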

As already mentioned, this is not news, and even when one installs MySQL using the .deb packages maintained by the Debian team, the root user is created so that it uses socket authentication. This is true for both MySQL and MariaDB:

root@app:~# apt-cache show mysql-server-5.7 | grep -i maintainers
Original-Maintainer: Debian MySQL Maintainers <pkg-mysql-maint@lists.alioth.debian.org>
Original-Maintainer: Debian MySQL Maintainers <pkg-mysql-maint@lists.alioth.debian.org>

Using the Debian packages of MySQL, the root is authenticated as follows:

root@app:~# whoami
root
root@app:~# mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 4
Server version: 5.7.27-0ubuntu0.16.04.1 (Ubuntu)

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select user, host, plugin, authentication_string from mysql.user where user = 'root';
+------+-----------+-------------+-----------------------+
| user | host      | plugin      | authentication_string |
+------+-----------+-------------+-----------------------+
| root | localhost | auth_socket |                       |
+------+-----------+-------------+-----------------------+
1 row in set (0.01 sec)

Same for the MariaDB .deb package:

10.0.38-MariaDB-0ubuntu0.16.04.1 Ubuntu 16.04

MariaDB [(none)]> show grants;
+------------------------------------------------------------------------------------------------+
| Grants for root@localhost                                                                      |
+------------------------------------------------------------------------------------------------+
| GRANT ALL PRIVILEGES ON *.* TO 'root'@'localhost' IDENTIFIED VIA unix_socket WITH GRANT OPTION |
| GRANT PROXY ON ''@'%' TO 'root'@'localhost' WITH GRANT OPTION                                  |
+------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

For Percona Server, the .deb packages from the official Percona Repo are also setting the root user authentication to auth_socket. Here is an example of Percona Server for MySQL 8.0.16-7 and Ubuntu 16.04:

root@app:~# whoami
root
root@app:~# mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 8.0.16-7 Percona Server (GPL), Release '7', Revision '613e312'

Copyright (c) 2009-2019 Percona LLC and/or its affiliates
Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select user, host, plugin, authentication_string from mysql.user where user ='root';
+------+-----------+-------------+-----------------------+
| user | host      | plugin      | authentication_string |
+------+-----------+-------------+-----------------------+
| root | localhost | auth_socket |                       |
+------+-----------+-------------+-----------------------+
1 row in set (0.00 sec)

So, what’s the magic? The plugin checks that the Linux user matches the MySQL user using the SO_PEERCRED socket option to obtain information about the user running the client program. Thus, the plugin can be used only on systems that support the SO_PEERCRED option, such as Linux. The SO_PEERCRED socket option allows retrieving the uid of the process that is connected to the socket. It is then able to get the user name associated with that uid.

Here’s an example with the user “vagrant”:

vagrant@mysql1:~$ whoami
vagrant
vagrant@mysql1:~$ mysql
ERROR 1698 (28000): Access denied for user 'vagrant'@'localhost'

Since no user “vagrant” exists in MySQL, the access is denied. Let’s create the user and try again:

MariaDB [(none)]> GRANT ALL PRIVILEGES ON *.* TO 'vagrant'@'localhost' IDENTIFIED VIA unix_socket;
Query OK, 0 rows affected (0.00 sec)

vagrant@mysql1:~$ mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 45
Server version: 10.0.38-MariaDB-0ubuntu0.16.04.1 Ubuntu 16.04
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show grants;
+---------------------------------------------------------------------------------+
| Grants for vagrant@localhost                                                    |
+---------------------------------------------------------------------------------+
| GRANT ALL PRIVILEGES ON *.* TO 'vagrant'@'localhost' IDENTIFIED VIA unix_socket |
+---------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Success!

Now, what about a non-Debian distro, where this is not the default? Let’s try Percona Server for MySQL 8 installed on CentOS 7:

mysql> show variables like '%version%comment';
+-----------------+---------------------------------------------------+
| Variable_name   | Value                                             |
+-----------------+---------------------------------------------------+
| version_comment | Percona Server (GPL), Release 7, Revision 613e312 |
+-----------------+---------------------------------------------------+
1 row in set (0.01 sec)

mysql> CREATE USER 'percona'@'localhost' IDENTIFIED WITH auth_socket;
ERROR 1524 (HY000): Plugin 'auth_socket' is not loaded

Failed. What is missing? The plugin is not loaded:

mysql> pager grep socket
PAGER set to 'grep socket'
mysql> show plugins;
47 rows in set (0.00 sec)

Let’s add the plugin in runtime:

mysql> nopager
PAGER set to stdout
mysql> INSTALL PLUGIN auth_socket SONAME 'auth_socket.so';
Query OK, 0 rows affected (0.00 sec)

mysql> pager grep socket; show plugins;
PAGER set to 'grep socket'
| auth_socket                     | ACTIVE | AUTHENTICATION | auth_socket.so | GPL     |
48 rows in set (0.00 sec)
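
A side note: INSTALL PLUGIN also registers the plugin in the mysql.plugin system table, so it is loaded again automatically after a restart. If you want to verify that, a query like this should do it:

mysql> SELECT name, dl FROM mysql.plugin WHERE name = 'auth_socket';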

We got all we need now. Let’s try again:

mysql> CREATE USER 'percona'@'localhost' IDENTIFIED WITH auth_socket;
Query OK, 0 rows affected (0.01 sec)
mysql> GRANT ALL PRIVILEGES ON *.* TO 'percona'@'localhost';
Query OK, 0 rows affected (0.01 sec)

And now we can log in as the OS user “percona”.

[percona@ip-192-168-1-111 ~]$ whoami
percona
[percona@ip-192-168-1-111 ~]$ mysql -upercona
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 19
Server version: 8.0.16-7 Percona Server (GPL), Release 7, Revision 613e312

Copyright (c) 2009-2019 Percona LLC and/or its affiliates
Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select user, host, plugin, authentication_string from mysql.user where user ='percona';
+---------+-----------+-------------+-----------------------+
| user    | host      | plugin      | authentication_string |
+---------+-----------+-------------+-----------------------+
| percona | localhost | auth_socket |                       |
+---------+-----------+-------------+-----------------------+
1 row in set (0.00 sec)

Success again!

Question: can I log in as the MySQL user percona from another OS user?

[percona@ip-192-168-1-111 ~]$ logout
[root@ip-192-168-1-111 ~]# mysql -upercona
ERROR 1698 (28000): Access denied for user 'percona'@'localhost'

No, you can’t.

Conclusion

MySQL is flexible in several aspects, one of them being the authentication methods. As we have seen in this post, it is possible to get access without passwords by relying on OS users. This can be helpful in several scenarios; to mention just one: when migrating from RDS/Aurora, where IAM Database Authentication was used, to regular MySQL, so that access can keep working without passwords.

Oct
30
2019
--

Understanding Hash Joins in MySQL 8

In MySQL 8.0.18 there is a new feature called Hash Joins, and I wanted to see how it works and in which situations it can help us. Here you can find a nice detailed explanation about how it works under the hood.

The high-level basics are the following: for a join, MySQL builds an in-memory hash table from one of the tables, then reads the other table row by row, calculates the hash of the join key, and looks it up in the in-memory hash table.

Great, but does this give us any performance benefits?

First of all, this only works on fields that are not indexed, which means an immediate table scan, and we usually do not recommend doing joins without indexes because they are slow. Here is where Hash Joins in MySQL can help, because they use an in-memory hash table instead of a Nested Loop.

Let’s do some tests and see. First I created the following tables:

CREATE TABLE `t1` (
`id` int(11) NOT NULL AUTO_INCREMENT ,
`c1` int(11) NOT NULL DEFAULT '0',
`c2` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `idx_c1` (`c1`)
) ENGINE=InnoDB;

CREATE TABLE `t2` (
`id` int(11) NOT NULL AUTO_INCREMENT ,
`c1` int(11) NOT NULL DEFAULT '0',
`c2` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `idx_c1` (`c1`)
) ENGINE=InnoDB;

I have inserted 131072 random rows into both tables.
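
For reference, here is a minimal sketch of how such random data could be generated (the exact values do not matter, only the row count): start from a single row and keep doubling the table until it reaches 2^17 = 131072 rows.

INSERT INTO t1 (c1, c2) VALUES (FLOOR(RAND()*1000000), FLOOR(RAND()*1000000));
-- run the next statement 17 times; each run doubles the row count (2^17 = 131072)
INSERT INTO t1 (c1, c2) SELECT FLOOR(RAND()*1000000), FLOOR(RAND()*1000000) FROM t1;
-- fill t2 with the same number of (independently randomized) rows
INSERT INTO t2 (c1, c2) SELECT FLOOR(RAND()*1000000), FLOOR(RAND()*1000000) FROM t1;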

mysql> select count(*) from t1;
+----------+
| count(*) |
+----------+
| 131072   |
+----------+

First Test – Hash Joins

Run a join based on c2 which is not indexed.

mysql> explain format=tree select count(*) from t1 join t2 on t1.c2 = t2.c2\G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)
    -> Inner hash join (t2.c2 = t1.c2)  (cost=1728502115.04 rows=1728488704)
        -> Table scan on t2  (cost=0.01 rows=131472)
        -> Hash
            -> Table scan on t1  (cost=13219.45 rows=131472)

1 row in set (0.00 sec)

We have to use explain format=tree to see whether Hash Join will be used or not, as the traditional explain still says it is going to be a Nested Loop, which I think is very misleading. I have already filed a bug report because of this, and in the ticket you can see some comments from developers saying:

The solution is to stop using traditional EXPLAIN (it will eventually go away).

So this is not going to be fixed in the traditional explain, and we should start using the new format.

Back to the query; we can see it is going to use Hash Join for this query, but how fast is it?

mysql> select count(*) from t1 join t2 on t1.c2 = t2.c2;
+----------+
| count(*) |
+----------+
| 17172231 |
+----------+
1 row in set (0.73 sec)

0.73s for a join that produces more than 17 million rows. Looks promising.

Second Test – Non-Hash Joins

We can disable it with an optimizer switch or optimizer hint.

mysql> select /*+ NO_HASH_JOIN (t1,t2) */ count(*) from t1 join t2 on t1.c2 = t2.c2;
+----------+
| count(*) |
+----------+
| 17172231 |
+----------+
1 row in set (13 min 36.36 sec)

Now the same query takes more than 13 minutes. That is a huge difference and we can see Hash Join helps a lot here.
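
For the record, the hint only affects that single query. In 8.0.18 the behavior can also be toggled session-wide with the optimizer_switch variable (later releases changed how this is controlled), along these lines:

SET optimizer_switch='hash_join=off';
select count(*) from t1 join t2 on t1.c2 = t2.c2;
SET optimizer_switch='hash_join=on';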

Third Test – Joins Based on Indexes

Let’s create indexes and see how fast a join based on indexes is.

create index idx_c2 on t1(c2);
create index idx_c2 on t2(c2);

mysql> select count(*) from t1 join t2 on t1.c2 = t2.c2;
+----------+
| count(*) |
+----------+
| 17172231 |
+----------+
1 row in set (2.63 sec)

2.6s. Hash Join is even faster than the index-based join in this case.

However, I was able to force the optimizer to use Hash Joins even if an index is available, by using IGNORE INDEX:

mysql> explain format=tree select count(*) from t1 ignore index (idx_c2) join t2 ignore index (idx_c2) on t1.c2 = t2.c2 where t1.c2=t2.c2\G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)
    -> Inner hash join (t2.c2 = t1.c2)  (cost=1728502115.04 rows=17336898)
        -> Table scan on t2  (cost=0.00 rows=131472)
        -> Hash
            -> Table scan on t1  (cost=13219.45 rows=131472)

1 row in set (0.00 sec)

I still think it would be nice if I could tell the optimizer with a hint to use Hash Joins even if an index is available, so we would not have to ignore indexes on all the tables. I have created a feature request for this.

However, if you read my first bug report carefully you can see a comment from a MySQL developer which indicates this might not be necessary:

BNL (Block Nested-Loop) will also go away entirely at some point, at which point this hint will be ignored.

That could mean they are planning to remove BNL joins in the future and maybe replace them with Hash Joins.

Limitations

We can see Hash Join can be powerful, but there are limitations:

  • As I mentioned, it only works on columns that do not have indexes (or you have to ignore them).
  • It only works with equi-join conditions.
  • It does not work with LEFT or RIGHT JOIN.

I would like to see a status metric as well to monitor how many times Hash Join was used, and for this, I filed another feature request.

Conclusion

Hash Join seems to be a very powerful new join option, and we should keep an eye on it, because I would not be surprised if we get more related features in the future. In theory, it could handle LEFT and RIGHT joins as well, and as we can see in the comments on the bug report, Oracle has plans for it in the future.

Oct
29
2019
--

Column Histograms on Percona Server and MySQL 8.0

From time to time you may have experienced that MySQL was not able to find the best execution plan for a query. You felt the query should have been faster. You felt that something didn’t work, but you didn’t realize exactly what.

Maybe some of you did tests and discovered there was a better execution plan that MySQL wasn’t able to find (forcing the order of the tables with STRAIGHT_JOIN for example).

In this article, we’ll see a new interesting feature available on MySQL 8.0 as well as Percona Server for MySQL 8.0: the histogram-based statistics.

Today, we’ll see what a histogram is, how you can create and manage it, and how MySQL’s optimizer can use it.

Just for completeness, histogram statistics have been available on MariaDB since version 10.0.2, with a slightly different implementation. Anyway, what we’ll see here is related to Percona Server and MySQL 8.0 only.

 

What is a histogram

We can define a histogram as a good approximation of the data distribution of the values in a column.

Histogram-based statistics were introduced to give the optimizer more execution plans to investigate when resolving a query. Until then, in some cases, the optimizer was not able to find the best possible execution plan because non-indexed columns were ignored.

With histogram statistics, the optimizer now has more options because non-indexed columns can also be considered. In some specific cases, a query can run faster than usual.

Let’s consider the following table that stores the departure times of trains:

CREATE TABLE train_schedule(
id INT PRIMARY KEY,
train_code VARCHAR(10),
departure_station VARCHAR(100),
departure_time TIME);

We can assume that during peak hours, from 7 AM until 9 AM, there are more rows, and during the night hours we have very few rows.

Let’s take a look at the following two queries:

SELECT * FROM train_schedule WHERE departure_time BETWEEN '07:30:00' AND '09:15:00';
SELECT * FROM train_schedule WHERE departure_time BETWEEN '01:00:00' AND '03:00:00';

Without any kind of statistics, the optimizer assumes by default that the values in the departure_time column are evenly distributed, but they aren’t. In reality, the first query returns many more rows than the second, something the optimizer cannot know from the default assumption.

Histograms were invented to provide the optimizer with a good estimation of the number of rows returned. This seems trivial for the simple queries we have seen so far. But let’s now think about the same table being involved in JOINs with other tables. In such a case, the number of rows returned can be very important for the optimizer when deciding the order in which to consider the tables in the execution plan.

A good estimation of the rows returned lets the optimizer process the table in the first stages in case it returns few rows. This minimizes the total number of rows in the final cartesian product, so the query can run faster.
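
To anticipate the syntax we’ll use in the next sections, a histogram on that column could be created with a single statement like this (a sketch; the bucket count is up to you):

ANALYZE TABLE train_schedule UPDATE HISTOGRAM ON departure_time WITH 100 BUCKETS;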

MySQL supports two different types of histograms: “singleton” and “equi-height”. Common to all histogram types is that they split the data set into a set of “buckets”; MySQL automatically distributes the values into the buckets and also automatically decides which type of histogram to create.

Singleton histogram

  • one value per bucket
  • each bucket stores
    • value
    • cumulative frequency
  • well suited for equality and range conditions

Equi-height histogram

  • multiple values per bucket
  • each bucket stores
    • minimum value
    • maximum value
    • cumulative frequency
    • number of distinct values
  • not really equi-height: frequent values are in separate buckets
  • well suited for range conditions

How to use histograms

The histogram feature is available on the server, but histograms are not created automatically, so the optimizer cannot use them out of the box. Without an explicit creation, the optimizer works the same as usual and cannot get any benefit from histogram-based statistics.

There is some manual operation to do. Let’s see.

In the next examples, we’ll use the world sample database you can download from here: https://dev.mysql.com/doc/index-other.html

Let’s start by executing a query that joins two tables to find out all the languages spoken in the largest cities of the world, those with more than 10 million people.

mysql> select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000; 
+-----------------+-----------+ 
| name            | language  | 
+-----------------+-----------+ 
| Mumbai (Bombay) | Asami     | 
| Mumbai (Bombay) | Bengali   | 
| Mumbai (Bombay) | Gujarati  | 
| Mumbai (Bombay) | Hindi     | 
| Mumbai (Bombay) | Kannada   | 
| Mumbai (Bombay) | Malajalam | 
| Mumbai (Bombay) | Marathi   | 
| Mumbai (Bombay) | Orija     | 
| Mumbai (Bombay) | Punjabi   | 
| Mumbai (Bombay) | Tamil     | 
| Mumbai (Bombay) | Telugu    | 
| Mumbai (Bombay) | Urdu      | 
+-----------------+-----------+ 
12 rows in set (0.04 sec)

The query takes 0.04 seconds. It’s not a lot, but consider that the database is very small. Use the BENCHMARK function to have more relevant response times if you like.

Let’s see the EXPLAIN:

mysql> explain select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000; 
+----+-------------+-----------------+------------+-------+---------------------+-------------+---------+-----------------------------------+------+----------+-------------+ 
| id | select_type | table           | partitions | type  | possible_keys       | key         | key_len | ref                               | rows | filtered | Extra       | 
+----+-------------+-----------------+------------+-------+---------------------+-------------+---------+-----------------------------------+------+----------+-------------+ 
| 1  | SIMPLE      | countrylanguage | NULL       | index | PRIMARY,CountryCode | CountryCode | 3       | NULL                              | 984  | 100.00   | Using index | 
| 1  | SIMPLE      | city            | NULL       | ref   | CountryCode         | CountryCode | 3       | world.countrylanguage.CountryCode | 18   | 33.33    | Using where | 
+----+-------------+-----------------+------------+-------+---------------------+-------------+---------+-----------------------------------+------+----------+-------------+

Indexes are used for both the tables and the estimated cartesian product has 984 * 18 = 17,712 rows.

Now generate the histogram on the Population column. It’s the only column used for filtering the data and it’s not indexed.

For that, we have to use the ANALYZE command:

mysql> ANALYZE TABLE city UPDATE HISTOGRAM ON population WITH 1024 BUCKETS; 
+------------+-----------+----------+-------------------------------------------------------+ 
| Table      | Op        | Msg_type | Msg_text                                              | 
+------------+-----------+----------+-------------------------------------------------------+ 
| world.city | histogram | status   | Histogram statistics created for column 'Population'. | 
+------------+-----------+----------+-------------------------------------------------------+

We have created a histogram using 1024 buckets. The number of buckets is not mandatory, and it can be any number from 1 to 1024. If omitted, the default value is 100.

The number of buckets affects the reliability of the statistics. The more distinct values you have, the more buckets you need.
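
One way to pick a reasonable bucket count is to check how many distinct values the column actually contains, for example:

mysql> SELECT COUNT(DISTINCT Population) FROM city;

If the result is larger than the number of buckets, MySQL builds an equi-height histogram; otherwise a singleton histogram fits.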

Let’s have a look now at the execution plan and execute the query again.

mysql> explain select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000;
+----+-------------+-----------------+------------+------+---------------------+-------------+---------+------------------------+------+----------+-------------+
| id | select_type | table           | partitions | type | possible_keys       | key         | key_len | ref                    | rows | filtered | Extra       |
+----+-------------+-----------------+------------+------+---------------------+-------------+---------+------------------------+------+----------+-------------+
|  1 | SIMPLE      | city            | NULL       | ALL  | CountryCode         | NULL        | NULL    | NULL                   | 4188 |     0.06 | Using where |
|  1 | SIMPLE      | countrylanguage | NULL       | ref  | PRIMARY,CountryCode | CountryCode | 3       | world.city.CountryCode |  984 |   100.00 | Using index |
+----+-------------+-----------------+------------+------+---------------------+-------------+---------+------------------------+------+----------+-------------+

mysql> select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000;
+-----------------+-----------+
| name            | language  |
+-----------------+-----------+
| Mumbai (Bombay) | Asami     |
| Mumbai (Bombay) | Bengali   |
| Mumbai (Bombay) | Gujarati  |
| Mumbai (Bombay) | Hindi     |
| Mumbai (Bombay) | Kannada   |
| Mumbai (Bombay) | Malajalam |
| Mumbai (Bombay) | Marathi   |
| Mumbai (Bombay) | Orija     |
| Mumbai (Bombay) | Punjabi   |
| Mumbai (Bombay) | Tamil     |
| Mumbai (Bombay) | Telugu    |
| Mumbai (Bombay) | Urdu      |
+-----------------+-----------+
12 rows in set (0.00 sec)

The execution plan is different, and the query runs faster.

We can notice that the order of the tables is the opposite of before. Even though it requires a full scan, the city table is processed in the first stage. This is because the filtered value is only 0.06, which means that only 0.06% of the rows returned by the full scan will be used in the join with the following table. So, it’s only 4188 * 0.06% = 2.5 rows. In total, the estimated cartesian product is 2.5 * 984 = 2,460 rows. This is significantly lower than the previous execution and explains why the query is faster.

What we have seen sounds a little counterintuitive, doesn’t it? In fact, until MySQL 5.7, we were used to considering full scans as very bad in most cases. In our case, instead, forcing a full scan thanks to a histogram statistic on a non-indexed column lets the query be optimized. Awesome.

 

Where are the histogram statistics

Histogram statistics are stored in the column_statistics table in the data dictionary and are not directly accessible by users. Instead, the INFORMATION_SCHEMA.COLUMN_STATISTICS table, which is implemented as a view over the data dictionary, can be used for the same purpose.

Let’s see the statistics for our table.

mysql> SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM)  
    -> FROM information_schema.column_statistics  
    -> WHERE COLUMN_NAME = 'population'\G
*************************** 1. row ***************************
           SCHEMA_NAME: world
            TABLE_NAME: city
           COLUMN_NAME: Population
JSON_PRETTY(HISTOGRAM): {
  "buckets": [
    [
      42,
      455,
      0.000980632507967639,
      4
    ],
    [
      503,
      682,
      0.001961265015935278,
      4
    ],
    [
      700,
      1137,
      0.0029418975239029173,
      4
    ],
...
...
    [
      8591309,
      9604900,
      0.9990193674920324,
      4
    ],
    [
      9696300,
      10500000,
      1.0,
      4
    ]
  ],
  "data-type": "int",
  "null-values": 0.0,
  "collation-id": 8,
  "last-updated": "2019-10-14 22:24:58.232254",
  "sampling-rate": 1.0,
  "histogram-type": "equi-height",
  "number-of-buckets-specified": 1024
}

For each bucket we can see the minimum and maximum values, the cumulative frequency, and the number of distinct values. Also, we can see that MySQL decided to use an equi-height histogram.

Let’s try to generate a histogram on another table and column.

mysql> ANALYZE TABLE country UPDATE HISTOGRAM ON Region;
+---------------+-----------+----------+---------------------------------------------------+
| Table         | Op        | Msg_type | Msg_text                                          |
+---------------+-----------+----------+---------------------------------------------------+
| world.country | histogram | status   | Histogram statistics created for column 'Region'. |
+---------------+-----------+----------+---------------------------------------------------+
1 row in set (0.01 sec)

mysql> SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM)  FROM information_schema.column_statistics  WHERE COLUMN_NAME = 'Region'\G
*************************** 1. row ***************************
           SCHEMA_NAME: world
            TABLE_NAME: country
           COLUMN_NAME: Region
JSON_PRETTY(HISTOGRAM): {
  "buckets": [
    [
      "base64:type254:QW50YXJjdGljYQ==",
      0.02092050209205021
    ],
    [
      "base64:type254:QXVzdHJhbGlhIGFuZCBOZXcgWmVhbGFuZA==",
      0.04184100418410042
    ],
    [
      "base64:type254:QmFsdGljIENvdW50cmllcw==",
      0.05439330543933054
    ],
    [
      "base64:type254:QnJpdGlzaCBJc2xhbmRz",
      0.06276150627615062
    ],
    [
      "base64:type254:Q2FyaWJiZWFu",
      0.1631799163179916
    ],
    [
      "base64:type254:Q2VudHJhbCBBZnJpY2E=",
      0.20083682008368198
    ],
    [
      "base64:type254:Q2VudHJhbCBBbWVyaWNh",
      0.23430962343096232
    ],
    [
      "base64:type254:RWFzdGVybiBBZnJpY2E=",
      0.3179916317991631
    ],
    [
      "base64:type254:RWFzdGVybiBBc2lh",
      0.35146443514644343
    ],
    [
      "base64:type254:RWFzdGVybiBFdXJvcGU=",
      0.39330543933054385
    ],
    [
      "base64:type254:TWVsYW5lc2lh",
      0.41422594142259406
    ],
    [
      "base64:type254:TWljcm9uZXNpYQ==",
      0.44351464435146437
    ],
    [
      "base64:type254:TWljcm9uZXNpYS9DYXJpYmJlYW4=",
      0.4476987447698744
    ],
    [
      "base64:type254:TWlkZGxlIEVhc3Q=",
      0.5230125523012552
    ],
    [
      "base64:type254:Tm9yZGljIENvdW50cmllcw==",
      0.5523012552301255
    ],
    [
      "base64:type254:Tm9ydGggQW1lcmljYQ==",
      0.5732217573221757
    ],
    [
      "base64:type254:Tm9ydGhlcm4gQWZyaWNh",
      0.602510460251046
    ],
    [
      "base64:type254:UG9seW5lc2lh",
      0.6443514644351465
    ],
    [
      "base64:type254:U291dGggQW1lcmljYQ==",
      0.7029288702928871
    ],
    [
      "base64:type254:U291dGhlYXN0IEFzaWE=",
      0.7489539748953975
    ],
    [
      "base64:type254:U291dGhlcm4gQWZyaWNh",
      0.7698744769874477
    ],
    [
      "base64:type254:U291dGhlcm4gYW5kIENlbnRyYWwgQXNpYQ==",
      0.8284518828451883
    ],
    [
      "base64:type254:U291dGhlcm4gRXVyb3Bl",
      0.891213389121339
    ],
    [
      "base64:type254:V2VzdGVybiBBZnJpY2E=",
      0.9623430962343097
    ],
    [
      "base64:type254:V2VzdGVybiBFdXJvcGU=",
      1.0
    ]
  ],
  "data-type": "string",
  "null-values": 0.0,
  "collation-id": 8,
  "last-updated": "2019-10-14 22:29:13.418582",
  "sampling-rate": 1.0,
  "histogram-type": "singleton",
  "number-of-buckets-specified": 100
}

In this case, a singleton histogram was generated.

Using the following query we can see more human-readable statistics.

mysql> SELECT SUBSTRING_INDEX(v, ':', -1) value, concat(round(c*100,1),'%') cumulfreq,         
    -> CONCAT(round((c - LAG(c, 1, 0) over()) * 100,1), '%') freq   
    -> FROM information_schema.column_statistics, JSON_TABLE(histogram->'$.buckets','$[*]' COLUMNS(v VARCHAR(60) PATH '$[0]', c double PATH '$[1]')) hist  
    -> WHERE schema_name  = 'world' and table_name = 'country' and column_name = 'region';
+---------------------------+-----------+-------+
| value                     | cumulfreq | freq  |
+---------------------------+-----------+-------+
| Antarctica                | 2.1%      | 2.1%  |
| Australia and New Zealand | 4.2%      | 2.1%  |
| Baltic Countries          | 5.4%      | 1.3%  |
| British Islands           | 6.3%      | 0.8%  |
| Caribbean                 | 16.3%     | 10.0% |
| Central Africa            | 20.1%     | 3.8%  |
| Central America           | 23.4%     | 3.3%  |
| Eastern Africa            | 31.8%     | 8.4%  |
| Eastern Asia              | 35.1%     | 3.3%  |
| Eastern Europe            | 39.3%     | 4.2%  |
| Melanesia                 | 41.4%     | 2.1%  |
| Micronesia                | 44.4%     | 2.9%  |
| Micronesia/Caribbean      | 44.8%     | 0.4%  |
| Middle East               | 52.3%     | 7.5%  |
| Nordic Countries          | 55.2%     | 2.9%  |
| North America             | 57.3%     | 2.1%  |
| Northern Africa           | 60.3%     | 2.9%  |
| Polynesia                 | 64.4%     | 4.2%  |
| South America             | 70.3%     | 5.9%  |
| Southeast Asia            | 74.9%     | 4.6%  |
| Southern Africa           | 77.0%     | 2.1%  |
| Southern and Central Asia | 82.8%     | 5.9%  |
| Southern Europe           | 89.1%     | 6.3%  |
| Western Africa            | 96.2%     | 7.1%  |
| Western Europe            | 100.0%    | 3.8%  |
+---------------------------+-----------+-------+

 

Histogram maintenance

Histogram statistics are not automatically recalculated. If you have a table that is updated very frequently with a lot of INSERTs, UPDATEs, and DELETEs, the statistics can become out of date very quickly. Unreliable histograms can lead the optimizer to the wrong choice.

When you find that a histogram is useful to optimize a query, you also need a plan to refresh the statistics from time to time, in particular after massive modifications to the table.

To refresh a histogram you just need to run the same ANALYZE command we have seen before.

To completely drop a histogram you may run the following:

ANALYZE TABLE city DROP HISTOGRAM ON population;
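
Once dropped, the corresponding row disappears from INFORMATION_SCHEMA.COLUMN_STATISTICS, which is an easy way to verify it (an empty result set means no histogram is defined for that column):

SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME
FROM information_schema.column_statistics
WHERE TABLE_NAME = 'city' AND COLUMN_NAME = 'Population';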

 

Sampling

The histogram_generation_max_mem_size system variable controls the maximum amount of memory available for histogram generation. The global and session values may be set at runtime.

If the estimated amount of data to be read into memory for histogram generation exceeds the limit defined by the variable, MySQL samples the data rather than reading all of it into memory. Sampling is evenly distributed over the entire table.

The default value is 20000000 (bytes), but you can increase it in the case of a large column if you want more accurate statistics. For very large columns, pay attention not to raise the threshold beyond the available memory, in order to avoid excessive overhead or an outage.
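
To check or adjust the limit, something along these lines should work (the value is in bytes):

SHOW VARIABLES LIKE 'histogram_generation_max_mem_size';
-- raise the limit for the current session only before regenerating the histogram
SET SESSION histogram_generation_max_mem_size = 500 * 1024 * 1024;
ANALYZE TABLE city UPDATE HISTOGRAM ON population WITH 1024 BUCKETS;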

 

Conclusion

Histogram statistics are particularly useful for non-indexed columns, as shown in the example.

Execution plans that can rely on indexes are usually the best, but histograms can help in some edge cases or when creating a new index is a bad idea.

Since this is not an automatic feature, some manual testing is required to investigate if you really can get the benefit of a histogram. Also, the maintenance requires some scheduled and manual activity.

Use histograms if you really need them, but don’t abuse them, since generating histograms on very large tables can consume a lot of memory.

Usually, the best candidates for a histogram are the columns with:

  • values that do not change much over time
  • low cardinality values
  • uneven distribution

Install Percona Server 8.0, test and enjoy the histograms.

 

Further reading on the same topic: Billion Goods in Few Categories – How Histograms Save a Life?