UUIDs are Popular, but Bad for Performance — Let’s Discuss

UUID popular

If you do a quick web search about UUIDs and MySQL, you’ll get a fair number of results. Here are just a few examples:

So, does a well-covered topic like this one needs any more attention? Well, apparently – yes. Even though most posts are warning people against the use of UUIDs, they are still very popular. This popularity comes from the fact that these values can easily be generated by remote devices, with a very low probability of collision. With this post, my goal is to summarize what has already been written by others and, hopefully, bring in a few new ideas.

What are UUIDs?

UUID stands for Universally Unique IDentifier and is defined in the RFC 4122. It is a 128 bits number, normally written in hexadecimal and split by dashes into five groups. A typical UUID value looks like:

yves@laptop:~$ uuidgen 

There are officially 5 types of UUID values, version 1 to 5, but the most common are: time-based (version 1 or version 2) and purely random (version 3). The time-based UUIDs encode the number of 10ns since January 1st, 1970 in 7.5 bytes (60 bits), which is split in a “time-low”-“time-mid”-“time-hi” fashion. The missing 4 bits is the version number used as a prefix to the time-hi field.  This yields the 64 bits of the first 3 groups. The last 2 groups are the clock sequence, a value incremented every time the clock is modified and a host unique identifier. Most of the time, the MAC address of the main network interface of the host is used as a unique identifier.

There are important points to consider when you use time-based UUID values:

  • It is possible to determine the approximated time when the value was generated from the first 3 fields
  • There are many repetitive fields between consecutive UUID values
  • The first field, “time-low”, rolls over every 429s
  • The MySQL UUID function produces version one values

Here’s an example using the “uuidgen” Unix tool to generate time-based values:

yves@laptop:~$ for i in $(seq 1 500); do echo "$(date +%s): $(uuidgen -t)"; sleep 1; done
1573656803: 572e4122-0625-11ea-9f44-8c16456798f1
1573656804: 57c8019a-0625-11ea-9f44-8c16456798f1
1573656805: 586202b8-0625-11ea-9f44-8c16456798f1
1573657085: ff86e090-0625-11ea-9f44-8c16456798f1
1573657086: 0020a216-0626-11ea-9f44-8c16456798f1
1573657232: 56b943b2-0626-11ea-9f44-8c16456798f1
1573657233: 57534782-0626-11ea-9f44-8c16456798f1
1573657234: 57ed593a-0626-11ea-9f44-8c16456798f1

The first field rolls over (at t=1573657086) and the second field is incremented. It takes about 429s to see similar values again for the first field. The third field changes only once per about a year. The last field is static on a given host, the MAC address is used on my laptop:

yves@laptop:~$ ifconfig | grep ether | grep 8c
    	ether 8c:16:45:67:98:f1  txqueuelen 1000  (Ethernet)

The other frequently seen UUID version is 4, the purely random one. By default, the Unix “uuidgen” tool produces UUID version 4 values:

yves@laptop:~$ for i in $(seq 1 3); do uuidgen; done

The only “repeated” value is the version, “4”, at the beginning of the 3rd field. All the other 124 bits are random.

What is so Wrong with UUID Values?

In order to appreciate the impact of using UUID values as a primary key, it is important to review how InnoDB organizes the data. InnoDB stores the rows of a table in the b-tree of the primary key. In database terminology, we call this a clustered index. The clustered index orders the rows automatically by the primary key.

When you insert a new row with a random primary key value, InnoDB has to find the page where the row belongs, load it in the buffer pool if it is not already there, insert the row and then, eventually, flush the page back to disk. With purely random values and large tables, all b-tree leaf pages are susceptible to receive the new row, there are no hot pages. Rows inserted out of the primary key order cause page splits causing a low filling factor. For tables much larger than the buffer pool, an insert will very likely need to read a table page from disk. The page in the buffer pool where the new row has been inserted will then be dirty.  The odds the page will receive a second row before it needs to be flushed to disk are very low. Most of the time, every insert will cause two IOPs – one read and one write. The first major impact is on the rate of IOPs and it is a major limiting factor for scalability.

The only way to get decent performance is thus to use storage with low latency and high endurance. That’s where you’ll the second major performance impact. With a clustered index, the secondary indexes use the primary key values as the pointers. While the leaves of the b-tree of the primary key store rows, the leaves of the b-tree of a secondary index store primary key values.

Let’s assume a table of 1B rows having UUID values as primary key and five secondary indexes. If you read the previous paragraph, you know the primary key values are stored six times for each row. That means a total of 6B char(36) values representing 216 GB. That is just the tip of the iceberg, as tables normally have foreign keys, explicit or not, pointing to other tables. When the schema is based on UUID values, all these columns and indexes supporting them are char(36). I recently analyzed a UUID based schema and found that about 70 percent of storage was for these values.

As if that’s not enough, there’s a third important impact of using UUID values. Integer values are compared up to 8 bytes at a time by the CPU but UUID values are compared char per char. Databases are rarely CPU bound, but nevertheless this adds to the latencies of the queries. If you are not convinced, look at this performance comparison between integers vs strings:

mysql> select benchmark(100000000,2=3);
| benchmark(100000000,2=3) |
|                        0 |
1 row in set (0.96 sec)

mysql> select benchmark(100000000,'df878007-80da-11e9-93dd-00163e000002'='df878007-80da-11e9-93dd-00163e000003');
| benchmark(100000000,'df878007-80da-11e9-93dd-00163e000002'='df878007-80da-11e9-93dd-00163e000003') |
|                                                                                                  0 |
1 row in set (27.67 sec)

Of course, the above example is a worst-case scenario but it at least gives the span of the issue. Comparing integers is about 28 times faster. Even if the difference appears rapidly in the char values, it is still about 2.5 times slower:

mysql> select benchmark(100000000,'df878007-80da-11e9-93dd-00163e000002'='ef878007-80da-11e9-93dd-00163e000003');
| benchmark(100000000,'df878007-80da-11e9-93dd-00163e000002'='ef878007-80da-11e9-93dd-00163e000003') |
|                                                                                                  0 |
1 row in set (2.45 sec)

Let’s explore a few solutions to address those issues.

Size of the Values

The default representation for UUID, hash, and token values is often the hexadecimal notation. With a cardinality, the number of possible values, of only 16 per byte, it is far from efficient. What about using another representation like base64 or even straight binary? How much do we save? How is the performance affected?

Let’s begin by the base64 notation. The cardinality of each byte is 64 so it takes 3 bytes in base64 to represent 2 bytes of actual value. A UUID value consists of 16 bytes of data, if we divide by 3, there is a remainder of 1. To handle that, the base64 encoding adds ‘=’ at the end:

mysql> select to_base64(unhex(replace(uuid(),'-','')));
| to_base64(unhex(replace(uuid(),'-',''))) |
| clJ4xvczEeml1FJUAJ7+Fg==                 |
1 row in set (0.00 sec)

If the length of the encoded entity is known, like for a UUID, we can remove the ‘==’, as it is just dead weight. A UUID encoded in base64 thus has a length of 22.

The next logical step is to directly store the value in binary format. This the most optimal format but displaying the values in the mysql client is less convenient.

So, how’s the size impacting performance? To illustrate the impact, I inserted random UUID values in a table with the following definition…

CREATE TABLE `data_uuid` (
  `id` char(36) NOT NULL,
  PRIMARY KEY (`id`)

… for the default hexadecimal representation. For base64, the ‘id’ column is defined as char(22) while binary(16) is used for the binary example. The database server has a buffer pool size at 128M and its IOPs are limited to 500. The insertions are done over a single thread.

Insertion rates for tables using different representation for UUID values

In all cases, the insertion rate is at first CPU bound but as soon the table is larger than the buffer pool, the insertion rapidly becomes IO bound. This is expected and shouldn’t surprise anyone. The use of a smaller representation for the UUID values just allows more rows to fit in the buffer pool but in the long run, it doesn’t really help the performance, as the random insertion order dominates. If you are using random UUID values as primary keys, your performance is limited by the amount of memory you can afford.

Option 1: Saving IOPs with Pseudo-Random Order

As we have seen, the most important issue is the random nature of the values. A new row may end up in any of the table leaf pages. So unless the whole table is loaded in the buffer pool, it means a read IOP and eventually a write IOP. My colleague David Ducos gave a nice solution to this problem but some customers do not want to allow for the possibility of extracting information from the UUID values, like, for example, the generation timestamp.

What if we somewhat just reduce then the randomness of the values in a way that a prefix of a few bytes is constant for a time interval? During the time interval, only a fraction of the whole table, corresponding to the cardinality of the prefix, would be required to be in the memory to save the read IOPs. This would also increase the likelihood a page receives a second write before being flushed to disk, thus reducing the write load. Let’s consider the following UUID generation function:

drop function if exists f_new_uuid; 
delimiter ;;
CREATE DEFINER=`root`@`%` FUNCTION `f_new_uuid`() RETURNS char(36)
    DECLARE cNewUUID char(36);
    DECLARE cMd5Val char(32);

    set cMd5Val = md5(concat(rand(),now(6)));
    set cNewUUID = concat(left(md5(concat(year(now()),week(now()))),4),left(cMd5Val,4),'-',

delimiter ;

The first four characters of the UUID value comes from the MD5 hash of the concatenation of the current year and week number. This value is, of course, static over a week. The remaining of the UUID value comes from the MD5 of a random value and the current time at a precision of 1us. The third field is prefixed with a “4” to indicate it is a version 4 UUID type. There are 65536 possible prefixes so, during a week, only 1/65536 of the table rows are required in the memory to avoid a read IOP upon insertion. That’s much easier to manage, a 1TB table will need to have only about 16MB in the buffer pool to support the inserts.

Option 2: Mapping UUIDs to Integers

Even if you use pseudo-ordered UUID values stored using binary(16), it is still a very large data type which will inflate the size of the dataset. Remember the primary key values are used as pointers in the secondary indexes by InnoDB. What if we store all the UUID values of a schema in a mapping table? The mapping table will be defined as:

CREATE TABLE `uuid_to_id` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `uuid` char(36) NOT NULL,
  `uuid_hash` int(10) unsigned GENERATED ALWAYS AS (crc32(`uuid`)) STORED NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_hash` (`uuid_hash`)

It is important to notice the uuid_to_id table does not enforce the uniqueness of uuid. The idx_hash index acts a bit like a bloom filter. We’ll know for sure a UUID value is not present in the table when there is no matching hash value but we’ll have to validate with the stored UUID value when there is a matching hash. To help us here, let’s create a SQL function:

CREATE DEFINER=`root`@`%` FUNCTION `f_uuid_to_id`(pUUID char(36)) RETURNS int(10) unsigned
        DECLARE iID int unsigned;
        DECLARE iOUT int unsigned;

        select get_lock('uuid_lock',10) INTO iOUT;

        SELECT id INTO iID
        FROM uuid_to_id WHERE uuid_hash = crc32(pUUID) and uuid = pUUID;

            select release_lock('uuid_lock') INTO iOUT;
            SIGNAL SQLSTATE '23000'
                SET MESSAGE_TEXT = 'Duplicate entry', MYSQL_ERRNO = 1062;
            insert into uuid_to_id (uuid) values (pUUID);
            select release_lock('uuid_lock') INTO iOUT;
            set iID = last_insert_id();
        END IF;

        RETURN iID;
END ;;

The function checks if the UUID values passed exist in the uuid_to_id table, and if it does it returns the matching id value otherwise it inserts the UUID value and returns the last_insert_id. To protect against the concurrent submission of the same UUID values, I added a database lock. The database lock limits the scalability of the solution. If your application cannot submit twice the request over a very short time frame, the lock could be removed. I have also another version of the function with no lock calls and using a small dedup table where recent rows are kept for only a few seconds. See my github if you are interested.

Results for the Alternate Approaches

Now, let’s have a look at the insertion rates using these alternate approaches.

Insertion on tables using UUID values as primary keys, alternative solutions

The pseudo-order results are great. Here I modified the algorithm to keep the UUID prefix constant for one minute instead of one week in order to better fit the test environment. Even if the pseudo-order solution performs well, keep in mind it is still bloating the schema and overall the performance gains may not be that great.

The mapping to integer values, although the insert rates are smaller due to the additional DMLs required, decouples the schema from the UUID values. The tables now use integers as primary keys. This mapping removes nearly all the scalability concerns of using UUID values. Still, even on a small VM with limited CPU and IOPS, the UUID mapping technique yields nearly 4000 inserts/s. Put into context, this means 14M rows per hour, 345M rows per day and 126B rows per year. Such rates likely fit most requirements. The only growth limitation factor is the size of the hash index. When the hash index will be too large to fit in the buffer pool, performance will start to decrease.

Other Options than UUID Values?

Of course, there are other possibilities to generate unique IDs.  The method used by the MySQL function UUID_SHORT() is interesting. A remote device like a smartphone could use the UTC time instead of the server uptime. Here’s a proposal:

(Seconds since January 1st 1970) << 32
+ (lower 2 bytes of the wifi MAC address) << 16
+ 16_bits_unsigned_int++;

The 16 bits counter should be initialized at a random value and allowed to roll over. The odds of two devices producing the same ID are very small. It has to happen at approximately the same time, both devices must have the same lower bytes for the MAC and their 16 bits counter at the same increment.


All the data related to this post can be found in my github.


Column Histograms on Percona Server and MySQL 8.0

MySQL Column HIstorgrams

MySQL Column HIstorgramsFrom time to time you may have experienced that MySQL was not able to find the best execution plan for a query. You felt the query should have been faster. You felt that something didn’t work, but you didn’t realize exactly what.

Maybe some of you did tests and discovered there was a better execution plan that MySQL wasn’t able to find (forcing the order of the tables with STRAIGHT_JOIN for example).

In this article, we’ll see a new interesting feature available on MySQL 8.0 as well as Percona Server for MySQL 8.0: the histogram-based statistics.

Today, we’ll see what a histogram is, how you can create and manage it, and how MySQL’s optimizer can use it.

Just for completeness, histogram statistics have been available on MariaDB since version 10.0.2, with a slightly different implementation. Anyway, what we’ll see here is related to Percona Server and MySQL 8.0 only.


What is a histogram

We can define a histogram as a good approximation of the data distribution of the values in a column.

Histogram-based statistics were introduced to give the optimizer more execution plans to investigate and solve a query. Until then, in some cases, the optimizer was not able to find out the best possible execution plan because non-indexed columns were ignored.

With histogram statistics, now the optimizer may have more options because also non-indexed columns can be considered. In some specific cases, a query can run faster than usual.

Let’s consider the following table to store departing times of the trains:

CREATE TABLE train_schedule(
train_code VARCHAR(10),
departure_station VARCHAR(100),
departure_time TIME);

We can assume that during peak hours, from 7 AM until 9 AM, there are more rows, and during the night hours we have very few rows.

Let’s take a look at the following two queries:

SELECT * FROM train_schedule WHERE departure_time BETWEEN '07:30:00' AND '09:15:00';
SELECT * FROM train_schedule WHERE departure_time BETWEEN '01:00:00' AND '03:00:00';

Without any kind of statistics, the optimizer assumes by default that the values in the departure_time column are evenly distributed, but they aren’t. In fact, the first query returns more rows because of this assumption.

Histograms were invented to provide to the optimizer a good estimation of the rows returned. This seems to be trivial for the simple queries we have seen so far. But let’s think now about having the same table involved in JOINs with other tables. In such a case, the number of rows returned can be very important for the optimizer to decide the order to consider the tables in the execution plan.

A good estimation of the rows returned gives the optimizer the capability to open the table in the first stages in case it returns few rows. This minimizes the total amount of rows for the final cartesian product. Then the query can run faster.

MySQL supports two different types of histograms: “singleton” and “equi-height”. Common for all histogram types is that they split the data set into a set of “buckets”, and MySQL automatically divides the values into the buckets and will also automatically decide what type of histogram to create.

Singleton histogram

  • one value per bucket
  • each bucket stores
    • value
    • cumulative frequency
  • well suited for equality and range conditions

Equi-height histogram

  • multiple values per bucket
  • each bucket stores
    • minimum value
    • maximum value
    • cumulative frequency
    • number of distinct values
  • not really equi-height: frequent values are in separated buckets
  • well suited for range conditions

How to use histograms

The histogram feature is available and enabled on the server, but not usable by the optimizer. Without an explicit creation, the optimizer works the same as usual and cannot get any benefit from the histogram-bases statistics.

There is some manual operation to do. Let’s see.

In the next examples, we’ll use the world sample database you can download from here:

Let’s start executing a query joining two tables to find out all the languages spoken on the largest cities of the world, with more than 10 million people.

mysql> select, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000; 
| name            | language  | 
| Mumbai (Bombay) | Asami     | 
| Mumbai (Bombay) | Bengali   | 
| Mumbai (Bombay) | Gujarati  | 
| Mumbai (Bombay) | Hindi     | 
| Mumbai (Bombay) | Kannada   | 
| Mumbai (Bombay) | Malajalam | 
| Mumbai (Bombay) | Marathi   | 
| Mumbai (Bombay) | Orija     | 
| Mumbai (Bombay) | Punjabi   | 
| Mumbai (Bombay) | Tamil     | 
| Mumbai (Bombay) | Telugu    | 
| Mumbai (Bombay) | Urdu      | 
12 rows in set (0.04 sec)

The query takes 0.04 seconds. It’s not a lot, but consider that the database is very small. Use the BENCHMARK function to have more relevant response times if you like.

Let’s see the EXPLAIN:

mysql> explain select, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000; 
| id | select_type | table           | partitions | type  | possible_keys       | key         | key_len | ref                               | rows | filtered | Extra       | 
| 1  | SIMPLE      | countrylanguage | NULL       | index | PRIMARY,CountryCode | CountryCode | 3       | NULL                              | 984  | 100.00   | Using index | 
| 1  | SIMPLE      | city            | NULL       | ref   | CountryCode         | CountryCode | 3       | world.countrylanguage.CountryCode | 18   | 33.33    | Using where | 

Indexes are used for both the tables and the estimated cartesian product has 984 * 18 = 17,712 rows.

Now generate the histogram on the Population column. It’s the only column used for filtering the data and it’s not indexed.

For that, we have to use the ANALYZE command:

| Table      | Op        | Msg_type | Msg_text                                              | 
| | histogram | status   | Histogram statistics created for column 'Population'. | 

We have created a histogram using 1024 buckets. The number of buckets is not mandatory, and it can be any number from 1 to 1024. If omitted, the default value is 100.

The number of chunks affects the reliability of the statistics. The more distinct values you have, the more the chunks you need.

Let’s have a look now at the execution plan and execute the query again.

mysql> explain select, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000;
| id | select_type | table           | partitions | type | possible_keys       | key         | key_len | ref                    | rows | filtered | Extra       |
|  1 | SIMPLE      | city            | NULL       | ALL  | CountryCode         | NULL        | NULL    | NULL                   | 4188 |     0.06 | Using where |
|  1 | SIMPLE      | countrylanguage | NULL       | ref  | PRIMARY,CountryCode | CountryCode | 3       | |  984 |   100.00 | Using index |

mysql> select, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000;
| name            | language  |
| Mumbai (Bombay) | Asami     |
| Mumbai (Bombay) | Bengali   |
| Mumbai (Bombay) | Gujarati  |
| Mumbai (Bombay) | Hindi     |
| Mumbai (Bombay) | Kannada   |
| Mumbai (Bombay) | Malajalam |
| Mumbai (Bombay) | Marathi   |
| Mumbai (Bombay) | Orija     |
| Mumbai (Bombay) | Punjabi   |
| Mumbai (Bombay) | Tamil     |
| Mumbai (Bombay) | Telugu    |
| Mumbai (Bombay) | Urdu      |
12 rows in set (0.00 sec)

The execution plan is different, and the query runs faster.

We can notice that the order of the tables is the opposite as before. Even if it requires a full scan, the city table is in the first stage. It’s because of the filtered value that is only 0.06. It means that only 0.06% of the rows returned by the full scan will be used to be joined with the following table. So, it’s only 4188 * 0.06% = 2.5 rows. In total, the estimated cartesian product is 2.5 * 984 = 2.460 rows. This is significantly lower than the previous execution and explains why the query is faster.

What we have seen sounds a little counterintuitive, doesn’t it? In fact, until MySQL 5.7, we were used to considering full scans as very bad in most cases. In our case, instead, forcing a full scan using a histogram statistic on a non-indexed column lets the query to get optimized. Awesome.


Where are the histogram statistics

Histogram statistics are stored in the column_statistics table in the data dictionary and are not directly accessible by the users. Instead the INFORMATION_SCHEMA.COLUMN_STATISTICS table, which is implemented as a view of the data dictionary, can be used for the same purpose.

Let’s see the statistics for our table.

    -> FROM information_schema.column_statistics  
    -> WHERE COLUMN_NAME = 'population'\G
*************************** 1. row ***************************
           SCHEMA_NAME: world
            TABLE_NAME: city
           COLUMN_NAME: Population
  "buckets": [
  "data-type": "int",
  "null-values": 0.0,
  "collation-id": 8,
  "last-updated": "2019-10-14 22:24:58.232254",
  "sampling-rate": 1.0,
  "histogram-type": "equi-height",
  "number-of-buckets-specified": 1024

We can see for any chunk the min and max values, the cumulative frequency, and the number of items. Also, we can see that MySQL decided to use an equi-height histogram.

Let’s try to generate a histogram on another table and column.

| Table         | Op        | Msg_type | Msg_text                                          |
| | histogram | status   | Histogram statistics created for column 'Region'. |
1 row in set (0.01 sec)

*************************** 1. row ***************************
           SCHEMA_NAME: world
            TABLE_NAME: country
           COLUMN_NAME: Region
  "buckets": [
  "data-type": "string",
  "null-values": 0.0,
  "collation-id": 8,
  "last-updated": "2019-10-14 22:29:13.418582",
  "sampling-rate": 1.0,
  "histogram-type": "singleton",
  "number-of-buckets-specified": 100

In this case, a singleton histogram was generated.

Using the following query we can see more human-readable statistics.

mysql> SELECT SUBSTRING_INDEX(v, ':', -1) value, concat(round(c*100,1),'%') cumulfreq,         
    -> CONCAT(round((c - LAG(c, 1, 0) over()) * 100,1), '%') freq   
    -> FROM information_schema.column_statistics, JSON_TABLE(histogram->'$.buckets','$[*]' COLUMNS(v VARCHAR(60) PATH '$[0]', c double PATH '$[1]')) hist  
    -> WHERE schema_name  = 'world' and table_name = 'country' and column_name = 'region';
| value                     | cumulfreq | freq  |
| Antarctica                | 2.1%      | 2.1%  |
| Australia and New Zealand | 4.2%      | 2.1%  |
| Baltic Countries          | 5.4%      | 1.3%  |
| British Islands           | 6.3%      | 0.8%  |
| Caribbean                 | 16.3%     | 10.0% |
| Central Africa            | 20.1%     | 3.8%  |
| Central America           | 23.4%     | 3.3%  |
| Eastern Africa            | 31.8%     | 8.4%  |
| Eastern Asia              | 35.1%     | 3.3%  |
| Eastern Europe            | 39.3%     | 4.2%  |
| Melanesia                 | 41.4%     | 2.1%  |
| Micronesia                | 44.4%     | 2.9%  |
| Micronesia/Caribbean      | 44.8%     | 0.4%  |
| Middle East               | 52.3%     | 7.5%  |
| Nordic Countries          | 55.2%     | 2.9%  |
| North America             | 57.3%     | 2.1%  |
| Northern Africa           | 60.3%     | 2.9%  |
| Polynesia                 | 64.4%     | 4.2%  |
| South America             | 70.3%     | 5.9%  |
| Southeast Asia            | 74.9%     | 4.6%  |
| Southern Africa           | 77.0%     | 2.1%  |
| Southern and Central Asia | 82.8%     | 5.9%  |
| Southern Europe           | 89.1%     | 6.3%  |
| Western Africa            | 96.2%     | 7.1%  |
| Western Europe            | 100.0%    | 3.8%  |


Histogram maintenance

Histogram statistics are not automatically recalculated. If you have a table that is very frequently updated with a lot of INSERTs, UPDATEs, and DELETEs, the statistics can run out of date very soon. Having unreliable histograms can lead the optimizer to the wrong choice.

When you find a histogram was useful to optimize a query, you need to also have a scheduled plan to refresh the statistics from time to time, in particular after doing massive modifications to the table.

To refresh a histogram you just need to run the same ANALYZE command we have seen before.

To completely drop a histogram you may run the following:




The histogram_generation_max_mem_size system variable controls the maximum amount of memory available for histogram generation. The global and session values may be set at runtime.

If the estimated amount of data to be read into memory for histogram generation exceeds the limit defined by the variable, MySQL samples the data rather than reading all of it into memory. Sampling is evenly distributed over the entire table.

The default value is 20000000 but you can increase it in the case of a large column if you want more accurate statistics. For very large columns, pay attention not to increase the threshold more than the memory available in order to avoid excessive overhead or outage.



Histogram statistics are particularly useful for non-indexed columns, as shown in the example.

Execution plans that can rely on indexes are usually the best, but histograms can help in some edge cases or when creating a new index is a bad idea.

Since this is not an automatic feature, some manual testing is required to investigate if you really can get the benefit of a histogram. Also, the maintenance requires some scheduled and manual activity.

Use histograms if you really need them, but don’t abuse them since histograms on very large tables can consume a lot of memory.

Usually, the best candidates for a histogram are the columns with:

  • values that do not change much over time
  • low cardinality values
  • uneven distribution

Install Percona Server 8.0, test and enjoy the histograms.


Further reading on the same topic: Billion Goods in Few Categories – How Histograms Save a Life?



MongoDB: Impact-free Index Builds using Detached ReplicaSet Nodes

MongoDB Impact-free Index Builds

MongoDB Impact-free Index BuildsCreating an index on a MongoDB collection is simple; just run the command CreateIndex and that’s all there is to it. There are several index types available, and in a previous post, you can find the more important index types: MongoDB index types and explain().

The command is quite simple, but for MongoDB, building an index is probably the most complicated task. I’m going to explain what the potential issues are and the best way to create any kind of index on a Replica Set environment.

A Replica Set is a cluster of mongod servers, at least 3, where the complete database is replicated. Some of the main advantages of this kind of structure are automatic failover and read scalability. If you need more familiarity with Replica Set, you may take a look at the following posts:

Deploy a MongoDB Replica Set with Transport Encryption (Part 1)

Deploy a MongoDB Replica Set with Transport Encryption (Part 2)

Deploy a MongoDB Replica Set with Transport Encryption (Part 3)

Create Index Impact

As said, creating an index for MongoDB has really a severe impact. A simple index creation on a field like the following blocks all other operations on the database:

db.people.Createindex( { email: 1 } )

This could be ok for a very small collection, let’s say where the building will take a few milliseconds. But for larger collections, this is absolutely forbidden.

We call this way of building an index the “foreground” mode.

The foreground index creation is the fastest way, but since it is blocking we have to use something different in the production environments. Fortunately, we can also create an index in “background” mode. We may use the following command:

db.people.CreateIndex( { email : 1}, { background: true } )

The index will be built in the background by mongod using a different incremental approach. The advantage is that the database can continue to operate normally without any lock. Unfortunately, background creation takes much longer than the foreground build.

The first hint is then to create the indexes using the background option. This is OK, but not in all the cases. More on that later.

Another impact when building an index is memory usage. MongoDB uses, by default,  up to 500MB of memory for building the index, but you can override it if the index is larger. The larger the index, the higher will be the impact if you don’t have the capability to assign more memory for the task.

To increase the amount of memory for index builds, set the following in the configuration file:

maxIndexBuildMemoryUsageMegabytes: 1024

Example: set it for 1 GB.

Create Index on a Replica Set

As long as the index creation command is replicated on all the nodes of the cluster in the same way all the other write commands are replicated. the index creation is replicated on a Replica Set cluster. A foreground creation on the PRIMARY is replicated as foreground on SECONDARY nodes. A background creation is replicated as background on SECONDARY nodes as well.

The same limitation applies for the Replica Set as the standalone server. The foreground build is fast but blocking and the background build is not blocking, but it is significantly slower for very large collections.

So, what can we do?

If you need to create a small index, let’s say the size is less than the available memory, you can rely on the background creation on the PRIMARY node. The operation will be correctly replicated to the SECONDARY nodes and the overall impact won’t be too bad.

But if you need to create an index larger than the memory, on a huge collection, then even the background build is bad. The creation will have a significant impact on the server resources and you can get overall performance problems on all the nodes. In this case, we have to follow another procedure. The procedure requires more manual steps, but it’s the only way to properly build such a large index.

The idea is to detach from the Replica Set one node at the time, create the index, and rejoin the node to the cluster. But first, you need to take care of the oplog size. The oplog window should be large enough to give you the time for the index build when a node is detached.

Note: the oplog window is the timestamp difference between the first entry in the oplog and the more recent one. It represents the maximum amount of time you can have a node detached from the cluster for any kind of task (software upgrades, index builds, backups). If you rejoin the node inside the window, the node will be able to catch up with the PRIMARY just executing the missing operations from the oplog. If you rejoin the node after the window, it will have to copy completely all the collections. This will take a long time and an impressive bandwidth usage for large deployments. 

The following is the step by step guide:

  • choose one of the SECONDARY nodes
  • detach the node from the Replica Set
    • comment into the configuration file the replSetName and the port options
    • set a different port number
    • set the parameter disableLogicalSessionCacheRefresh to true
   bindIp: localhost,
   port: 27777
#  port: 27017
#   replSetName: RSCluster
   disableLogicalSessionCacheRefresh: true

    • restart mongod
    • now the server is running as standalone; any query won’t be replicated
  • connect using the alternative port and build the index in foreground mode
db.people.CreateIndex( { email: 1} )

  • connect the node to the Replica Set
    • uncomment the options in the configuration file
    • remove the disableLogicalSessionCacheRefresh option
    • restart mongod
    • now the node is a member of the Replica Set
    • wait some time for the node to catch up with the PRIMARY
  • repeat the previous steps for all the remaining SECONDARY nodes
  • stepdown the PRIMARY node to force an election
    • run rs.stepDown() command. This forces an election. Wait for some time for the PRIMARY to become a SECONDARY node.
  • restart it as standalone
    • use the same procedure we saw before
  • build the index in foreground mode
  • restart the node and connect it to the Replica Set

That’s all. We have created the index on all the nodes without any impact for the cluster and for the production applications.

Note: when restarting a node as standalone, the node could be exposed to mistake writes. For the sake of security, a good practice could be to disable TCP connections, allowing only local connections using the socket. Then you can put into the configuration file for example:
bindIp: /tmp/mongod.sock


This procedure is definitely more complicated than running a single command. It will require some time, but we hope you don’t have to create such large indexes every single day. ?


Stored Functions and Temporary Tables are Not a Good Fit

Stored Functions

Stored FunctionsIn this blog post, I am going to show why we have to be careful with stored functions in select list, as a single query can cause thousands of queries in the background if we aren’t cautious.

For this example, I am only going to use the SLEEP function to demonstrate the issue, but you could use any other stored functions.

Here is the test schema and data:

 id INT NOT NULL PRIMARY KEY auto_increment,
 INDEX (gcol)
INSERT INTO t1 VALUES (NULL,1,1),(NULL,2,1),(NULL,3,1),(NULL,4,2),(NULL,5,1);

And the first query:

mysql [localhost] {msandbox} (test) > SELECT gcol, SLEEP(1) FROM t1 WHERE 1 GROUP BY gcol LIMIT 1;
| gcol | SLEEP(1) |
|    1 |        0 |
1 row in set (1.00 sec)

The query takes one second, which means the SLEEP was executed after/during the LIMIT step.

The second query creates a temp table:

mysql [localhost] {msandbox} (test) > SELECT fcol, SLEEP(1) FROM t1 WHERE 1 GROUP BY fcol LIMIT 1;
| fcol | SLEEP(1) |
|    1 |        0 |
1 row in set (5.02 sec)

It takes 5 seconds, meaning the SLEEP was executed for every row before LIMIT, and the only differences between these two queries is the second one uses temp tables. (Yes, the schema and query could be improved to avoid temp tables, but this is only a test case and you can’t avoid temp tables all the time.)

MySQL uses a different order of steps here and does the select list evaluation before/during creating the temp tables.

As you can see, there are functions in the select list, and MySQL will create temp tables, and that could cause thousands of extra queries in the background. I recently had a case where running a single query with functions in the select list resulted in 333,906 queries in the background! In the query, there was a Limit 0,10.

There is a very good blog post describing the order of the operations in SQL, which should be the following:

- From
- Where
- Group By
- Aggregations
- Having
- Window
- Select
- Distinct
- Uninon
- Order by
- Offset
- Limit

Based on this, if MySQL strictly follows the SQL order, the LIMIT should be the last, and the first query should take 5s because the Select list evaluation should happen before that for every row. But I guess this is a performance optimization in MySQL, to only run the evaluation for the limited number of rows.

When temp tables are involved, MySQL will use a different logic and do the evaluation before/during the group by when it creates the temp tables.

When MySQL creates a temp table, it has to materialize every single row, and that includes the functions as well. If your query is going to create a huge temp table, MySQL will execute the function for all those rows.

MySQL does not have generated columns in internal temp tables, but what might do the trick here is to just point to a function that needs to be executed when MySQL reads the data out from the temp table.

I have created a ticket, but we are still debating if there is any room for improvement here.

Using DML queries in these functions

If you are using DML (insert/update/delete) queries in stored functions, and you are calling them in the select list, you have to be careful because if MySQL creates temp tables it will call those queries for all the rows in the temp table as well. So you might just want to insert only one row in another table, as you will end up with thousands if not.

How can we avoid this?

If you are facing this issue, you have to analyze your queries. You might only need an extra index to avoid temp tables, or you could rewrite the query.

Where can I see if this is happening?

Unfortunately, in this case, the slow query log cannot help us, because these function calls and the queries from the functions are not logged. This could also be a feature request because it would make debugging much easier. (I have already created a feature request in Percona Server to log queries from stored functions.)

However, in the slow query log for the original query, the Rows_examined would be very high if you are using stored functions and that number is way higher than it should be and might cause problems. Also, the general log can be helpful for investigating the problem here, because in the general log we can see all these function calls as well.


If you are using stored functions in Select list, you have to make sure the query does not use temp tables, otherwise, it could cause many extra queries in the background.


How to Report Bugs, Improvements, New Feature Requests for Percona Products

report bugs

report bugsClear and structured bug, improvement, and new feature request reports are always helpful for validating and fixing issues. In a few cases, we have received these reports with incomplete information which can cause a delay in the verification of issues. The most effective method to avoid this situation is to ensure complete information about the issue when filing a report.

In this post we will discuss:

  • The best ways to report an issue for Percona products
  • Including a “how to reproduce” test case to verify the issue
  • The purpose of bug/improvement/new feature verification is the central place to report a bug/improvement/new feature request for all Percona products.

Let’s first discuss a few important entries which you should update when reporting an issue.

Project: The product name for which you wish to report an issue.

Issue Type: Provides options for the type of request. Example: Bug/Improvement/New Feature Request.

Summary: Summary of the issue which will serve as the title. It should be a one-line summary.

Affects Version/s: The version number of the Percona software for which you are reporting an issue.

Description: This field is to describe your issue in detail. Issue description should be clear and concise.

Bug report:

  • Describe the actual issue.
  • Add test case/steps to reproduce the issue if possible.
  • If a crash bug, provide my.cnf and error log as additional information along with the details mentioned in this blog_post.
  • In some cases, the supporting file for bug reports such as a coredump, sqldump, or error log is prohibitively large. For these cases, we have an SFTP server where these files can be uploaded.

      Documentation bug:

  • Provide the documentation link, describe what is wrong, and suggest how the documentation could be improved.

Improvement/New Feature  Request:

  • For new features, describe the need and use case. Include answers to such questions as “What problem it will solve?” and “How will it benefit the users?”
  • In the case of improvements, mention what is problematic with the current behavior. What is your expectation as an improvement in a particular product feature?

Note: When reporting an issue, be sure to remove/replace sensitive information such as IP addresses, usernames, passwords, etc. from the report description and attached files.

Upstream Bugs:

Percona Server for MySQL and Percona Server for MongoDB are patched versions of their upstream codebases. It is possible that the particular issue originated from the upstream version. For these products, it would be helpful if the reporter also checks upstream for the same issue. If issues exist upstream, use the following URLs to report an issue for these respective products.


If you are a Percona customer, please file a support request to let us know how the bug affects you.

Purpose of Bug/Improvement/New Feature Request verification

  • Gather the required information from the reporter and identify whether the reported issue is a valid bug/improvement/new feature.
  • For bugs, create a reproducible test case. To effectively address, the bug must be repeatable on demand.

Any incorrect assumptions can break other parts of the code while fixing a particular bug; this is why the verification process is important to identify the exact problem. Another benefit of having a reproducible test case is that it can then be used to verify the fix.

While feature requests and improvements are about ideas on how to improve Percona products, they still need to be verified. We need to ensure that behavior reported as a new feature or improvement:

  • Is not a bug
  • Is not implemented yet
  • For new feature verification, we also check whether there is an existing, different way to achieve the same.

Once bugs, improvement, and new features are validated, the issue status will be “Open” and it will move forward for implementation.


MySQL InnoDB Sorted Index Builds

Bulk load of 7.7

It’s not essential to understand how MySQL® and Percona Server for MySQL build indexes. However, if you have an understanding of the processing, it could help when you want to reserve an appropriate amount of space for data inserts. From MySQL 5.7, developers changed the way they built secondary indexes for InnoDB, applying a bottom-up rather than the top-down approach used in earlier releases. In this post, I’ll walk through an example to show how an InnoDB index is built. At the end, I’ll explain how you can use this understanding to set an appropriate value for innodb_fill_factor.

Index building process

To build an index on a table with existing data, there are the following phases in InnoDB

  1. Read phase (read from clustered index and build secondary index entries)
  2. Merge sort phase
  3. Insert phase (insert sorted records into the secondary index)

Until version 5.6, MySQL built the secondary index by inserting one record at a time. This is a “top-down” approach. The search for the insert position starts from root (top) and reaches the appropriate leaf page (down). The record is inserted on leaf page pointed to by the cursor. It is expensive in terms of finding insert position and doing page splits and merges (at both root and non-root levels). How do you know that there are too many page splits and merges happening? You can read about that in an earlier blog by my colleague Marco Tusa, here.

From MySQL 5.7, the insert phase during add index uses “Sort Index Build”, also known as “Bulk Load for Index”. In this approach,  the index is built “bottom-up”. i.e.  Leaf pages (bottom) are built first and then the non-leaf levels up to root (up).

Use cases

A sorted index build is used in these cases:


For the last two use cases, ALTER creates a intermediate table. The intermediate table indexes (both primary and secondary) are built using “sorted index build”.


  1. Create a page at level 0. Also create a cursor to this page.
  2. Insert into the page using the cursor at Level 0 until is full.
  3. Once the page is full, create a sibling page (do not insert into sibling page yet).
  4. Create a node pointer (minimum key in child page, child page number) for the current full page and insert a node pointer into one level above (parent page).
  5. At upper level, check if cursor is already positioned. If not, create a parent page and a cursor for the level.
  6. Insert a node pointer at the parent page
  7. If parent page is full, repeat steps 3, 4, 5, 6
  8. Now insert into the sibling page and make the cursor point to the sibling page.
  9. At the end of all inserts,  there is cursor at each level pointing to the rightmost page. Commit all the cursors (meaning commit the mini-transaction that modified the  pages, release all the latches).

For the sake of simplicity, the above algorithm skipped the details about compressed pages and the handling of BLOBs (externally stored BLOBs).

Walk through of building an index, bottom-up

Using an example, let’s see how a secondary index is built, bottom-up. Again for simplicity’s sake, assume the maximum number of records allowed in leaf and non leaf pages is three.

INSERT INTO t1 VALUES (1, 11, 'hello111');
INSERT INTO t1 VALUES (2, 22, 'hello222');
INSERT INTO t1 VALUES (3, 33, 'hello333');
INSERT INTO t1 VALUES (4, 44, 'hello444');
INSERT INTO t1 VALUES (5, 55, 'hello555');
INSERT INTO t1 VALUES (6, 66, 'hello666');
INSERT INTO t1 VALUES (7, 77, 'hello777');
INSERT INTO t1 VALUES (8, 88, 'hello888');
INSERT INTO t1 VALUES (9, 99, 'hello999');
INSERT INTO t1 VALUES (10, 1010, 'hello101010');

InnoDB appends the primary key fields to the secondary index. The records of secondary index k1 are of format (b, a).  After the sort phase, the records are

(11,1), (22,2), (33,3), (44,4), (55,5), (66,6), (77,7), (88,8), (99,9), (1010, 10)

Initial insert phase

Let’s start with record (11,1).

  1. Create a page at Level 0 (leaf level).
  2. Create a cursor to the page.
  3. All inserts go to this page until it is full.

The arrow shows where the cursor is currently pointing. It is currently at page number 5, and the next inserts go to this page.

There are two more free slots, so insertion of the records (22,2) and (33,3) is straightforward.

For the next record (44, 4), page number 5 is full. Here are the steps

Index building as pages become filled

  1. Create a sibling page – page number 6
  2. Do not insert into the sibling page yet.
  3. Commit the page at the cursor i.e. mini transaction commit, release latch etc
  4. As part of the commit, create a node pointer and insert it into the parent page at [current Level + 1]. i.e at Level 1
  5. The node pointer is of the format (minimum key in child page, child page number). Minimum key in page number 5 is (11,1). Insert record ((11,1),5) at the parent level.
  6. The parent page at Level 1 doesn’t exist yet. MySQL creates page number 7 and a cursor pointing to page number 7.
  7. Insert ((11,1),5) into page number 7
  8. Now, return back to Level 0 and create links from page number 5 to 6 and vice versa
  9. The cursor at Level 0 now points to sibling page number 6.
  10. Insert (44,4) into page number 6


The next inserts – of (55,5) and (66,6) – are straightforward and they go to page 6.

Insertion of record (77,7) is similar to (44,4) except that the parent page (page number 7) already exists and it has space for two more records. Insert node pointer ((44,4),8) into page 7 first, and then record (77,7) into sibling page 8.

Insertion of records (88,8) and (99,9) is straightforward because page 8 has two free slots.

Next insert (1010, 10). Insert node pointer ((77,7),8) to the parent page (page number 7) at Level 1.

MySQL creates sibling page number 9 at Level 0. Insert record (1010,10) into page 9 and change the cursor to this page.

Commit the cursors at all levels. In the above example, the database commits page 9 at Level 0 and page 7 at Level 1. We now have a complete B+-tree index that is built from bottom-up!

Index fill factor

The global variable “innodb_fill_factor” sets the amount of space in a Btree page to be used for inserts. The default value is 100, which means the entire page is used (exclude page headers, trailers). A clustered index has an exemption with innodb_fill_factor = 100. In this case, 1/16th of clustered index page space is kept free. ie. 6.25% space is reserved for future DMLs.

A value of 80 means that MySQL uses 80% of the page for inserts, and leaves 20% for future updates.

If innodb_fill_factor is 100, there is no free space left for future inserts into the secondary index.  If you expect more DMLs on the table after add index, this can lead to page splits and merges again. In such cases, it is advisable to use values between 80-90. This variable value also affects the indexes recreated with OPTIMIZE TABLE or ALTER TABLE DROP COLUMN, ALGORITHM=INPLACE.

You should not use values that are too low – for example below 50 – because the index then occupies significantly more disk space. With low values there are more pages in an index, and index statistics sampling might not be optimal. The optimizer could choose wrong query plans with sub-optimal statistics.

Advantages of Sorted Index Build

  1. No page splits (excluding compressed tables) and merges.
  2. No repeated searches for insert position.
  3. Inserts are not redo logged (except for page allocations), so there is less pressure on the redo log subsystem.


None… Well, OK, there is one and it deserves a separate blog post ?  Stay tuned!


Watch Webinar: Speaking the same language as Developers and DBAs

Speaking the same language as Developers and DBAs

Speaking the same language as Developers and DBAsPlease join, Percona’s Product Manager, Michael Coburn as he presents his talk Speaking the same language as Developers and DBAs.

Watch the Recorded Webinar

In this talk, we’ll go over some of the high-level monitoring concepts that PMM enables for you, the CEO, to leverage in order to speak the same language as your technical team. For instance, we’ll cover:

– Which thresholds to follow for CEOs
– Host-specific resources – CPU, Memory, Disk, Network
– Database-specific resources – Connections, Queries per Second, Slow Queries, Replication lag
– Building custom KPI metrics using PMM
– Visualize Application logic
– Combine Application and Databases under one view

In order to learn more and speak the same language as Developers and DBAs, watch our webinar.


Creating Custom Sysbench Scripts

sysbench-lua for benchmark tooling

sysbench-lua for benchmark toolingSysbench has long been established as the de facto standard when it comes to benchmarking MySQL performance. Percona relies on it daily, and even Oracle uses it when blogging about new features in MySQL 8. Sysbench comes with several pre-defined benchmarking tests. These tests are written in an easy-to-understand scripting language called Lua. Some of these tests are called: oltp_read_write, oltp_point_select, tpcc, oltp_insert. There are over ten such scripts to emulate various behaviors found in standard OLTP applications.

But what if your application does not fit the pattern of traditional OLTP? How can you continue to utilize the power of load-testing, benchmarking, and results analysis with sysbench? Just write your own Lua script!

For those that want to jump ahead and see the full source, here you go.

Sysbench API

To start off, each Lua script you create must implement three core sysbench-Lua API functions. These are thread_init, thread_done, and event. You can read the comments in the code below for the meaning of each function and what is happening inside.

-- Called by sysbench one time to initialize this script
function thread_init()
  -- Create globals to be used elsewhere in the script
  -- drv - initialize the sysbench mysql driver
  drv = sysbench.sql.driver()
  -- con - represents the connection to MySQL
  con = drv:connect()
-- Called by sysbench when script is done executing
function thread_done()
  -- Disconnect/close connection to MySQL
-- Called by sysbench for each execution
function event()
  -- If user requested to disable transactions,
  -- do not execute BEGIN statement
  if not sysbench.opt.skip_trx then
  -- Run our custom statements
  -- Like above, if transactions are disabled,
  -- do not execute COMMIT
  if not sysbench.opt.skip_trx then

That’s all pretty simple and should function as a good template in your scripts. Now let’s take a look at the rest of the script.

Sanity checks and options

Now let’s get into the core code. At the top you’ll find the following sections:

if sysbench.cmdline.command == nil then
   error("Command is required. Supported commands: run")
sysbench.cmdline.options = {
  point_selects = {"Number of point SELECT queries to run", 5},
  skip_trx = {"Do not use BEGIN/COMMIT; Use global auto_commit value", false}

The first section is a sanity check to make sure the user actually wants to run this test. Other test scripts, mentioned above, support commands like prepare, run, and cleanup. Our script only supports run as the data we are using is pre-populated by our core application.

The second section allows us, the script writer, to let the user pass some options specific to our test script. In the code above, we can see an option for the number of SELECT statements that will be ran on each thread/iteration (default is 5) and the other option allows the user to disable BEGIN/COMMIT if they so desire (default is false). If you want more customization in your script, simply add more options. You’ll see how to reference these parameters later on.

The queries

Now it is time to define the custom queries we want to execute in our script.

-- Array of categories to be use in the INSERTs
local page_types = { "actor", "character", "movie" }
-- Array of COUNT(*) queries
local select_counts = {
  "SELECT COUNT(*) FROM imdb.title"
-- Array of SELECT statements that have 1 integer parameter
local select_points = {
  "SELECT * FROM imdb.title WHERE id = %d",
  "SELECT * FROM imdb.comments ORDER BY id DESC limit 10",
  "SELECT AVG(rating) avg FROM imdb.movie_ratings WHERE movie_id = %d",
  "SELECT * FROM imdb.users ORDER BY RAND() LIMIT 1"
-- Array of SELECT statements that have 1 string parameter
local select_string = {
  "SELECT * FROM imdb.title WHERE title LIKE '%s%%'"
-- INSERT statements
local inserts = {
  "INSERT INTO imdb.users (email_address, first_name, last_name) VALUES ('%s', '%s', '%s')",
  "INSERT INTO imdb.page_views (type, viewed_id, user_id) VALUES ('%s', %d, %d)"

The above code defines several arrays/lists of different queries. Why is this necessary? Later on in the code, we will have to parse each SQL statement and populate/replace the various parameters with randomly generated values. It would not do us any good to repeat the same SELECT * FROM fooTable WHERE id = 44 every time, now would it? Certainly not. We want to generate random numbers and have our queries select from the entire dataset.

Some queries have no parameters, some have integer-based, and some string-based. We will handle these differently below, which is why they are in different arrays above. This method also allows for future expansion. When you want to run additional queries within the script, just add another line to each array; no need to change any other code.

Parse and execute

The function below, execute_selects, will be called from the parent function, event, which we discussed earlier in the post. You can see for-loops for each of the three SELECT categories we created above. The comments inline should help explain what is happening. Note the use of the user-provided option –point-selects in the second loop below, which we created previously in the ‘Sanity and Options’ section.

function execute_selects()
  -- Execute each simple, no parameters, SELECT
  for i, o in ipairs(select_counts) do
  -- Loop for however many queries the
  -- user wants to execute in this category
  for i = 1, sysbench.opt.point_selects do
    -- select random query from list
    local randQuery = select_points[math.random(#select_points)]
    -- generate random id and execute
    local id = sysbench.rand.pareto(1, 3000000)
    con:query(string.format(randQuery, id))
  -- generate random string and execute
  for i, o in ipairs(select_string) do
    local str = sysbench.rand.string(string.rep("@", sysbench.rand.special(2, 15)))
    con:query(string.format(o, str))

Two more things to mention for this code. First, you will notice the use of sysbench.rand.pareto to generate a random number between 1 and 3,000,000. For our dataset, we know that each table referenced in all queries relating to WHERE id = ? has that many number of rows, at minimum. This is specific to our data. Your values will certainly be different. Second, notice the use of sysbench.rand.string, and string.rep. The string.rep segment will generate a string comprised of ‘@’ symbols, between 2 and 15 characters long. That string of ‘@’ symbols will then be passed to sysbench.rand.string, which will swap out each ‘@’ for a random alphanumeric value. For example, ‘@@@@@@’ could be changed to ‘Hk9EdC’ which will then replace the ‘%s’ inside the query string (string.format) and be executed.

Handle inserts

Our INSERT statements require values. Again, sysbench calls the function execute_inserts from event on each iteration. Inside execute_inserts, we generate some fake string data using built-in functions described above.

Those strings are then formatted into the SQL and executed.

function create_random_email()
  local username = sysbench.rand.string(string.rep("@",sysbench.rand.uniform(5,10)))
  local domain = sysbench.rand.string(string.rep("@",sysbench.rand.uniform(5,10)))
  return username .. "@" .. domain .. ".com"
function execute_inserts()
  -- generate fake email/info
  local email = create_random_email()
  local firstname = sysbench.rand.string("first-" .. string.rep("@", sysbench.rand.special(2, 15)))
  local lastname = sysbench.rand.string("last-" .. string.rep("@", sysbench.rand.special(2, 15)))
  -- INSERT for new imdb.user
  con:query(string.format(inserts[1], email, firstname, lastname))
  -- INSERT for imdb.page_view
  local page = page_types[math.random(#page_types)]
  con:query(string.format(inserts[2], page, sysbench.rand.special(2, 500000), sysbench.rand.special(2, 500000)))

Example run

$ sysbench imdb_workload.lua \
    --mysql-user=imdb --mysql-password=imdb \
    --mysql-db=imdb --report-interval=1 \
    --events=0 --time=0 run
WARNING: Both event and time limits are disabled, running an endless test
sysbench 1.0.17 (using system LuaJIT 2.0.4)
Running the test with following options:
Number of threads: 1
Report intermediate results every 1 second(s)
Initializing random number generator from current time
Initializing worker threads...
Threads started!
[ 1s ] thds: 1 tps: 15.96 qps: 177.54 (r/w/o: 112.71/31.92/32.91) lat (ms,95%): 158.63 err/s: 0.00 reconn/s: 0.00
[ 2s ] thds: 1 tps: 15.01 qps: 169.09 (r/w/o: 109.06/30.02/30.02) lat (ms,95%): 137.35 err/s: 0.00 reconn/s: 0.00
[ 3s ] thds: 1 tps: 26.00 qps: 285.00 (r/w/o: 181.00/52.00/52.00) lat (ms,95%): 108.68 err/s: 0.00 reconn/s: 0.00
[ 4s ] thds: 1 tps: 15.00 qps: 170.00 (r/w/o: 108.00/32.00/30.00) lat (ms,95%): 164.45 err/s: 0.00 reconn/s: 0.00

And there we have it! Custom queries specific to our application and dataset. Most of the sysbench parameters are self-explanatory, but let me talk about –report-interval=1 which shows statistics every 1 second. Normally sysbench does not output stats until the end of the run, however, the example execution will run forever (–events=0 –time=0) so we need stats to show all the time. You can adjust the parameters to your liking. For instance, if you only want to run a test for 5 minutes, set –events=0 –run-time=300.


Sysbench is a very well designed application that allows you to load-test your MySQL instances using pre-defined and custom queries. Using the Lua scripting language, you can create just about any scenario to fit your needs. The above is just one example that we use within Percona’s Training and Education department. It is by no means an exhaustive example of all of the capabilities of sysbench-Lua.

Photo by Lachlan Donald on Unsplash


Writing PostgreSQL Extensions is Fun – C Language

postgresql extensions

PostgreSQL is a powerful open source relational database management system. It extends the SQL language with additional features. A DBMS is not only defined by its performance and out of the box features, but also its ability to support bespoke/additional user-specific functionality. Some of these functionalities may be in the form of database constructs or modules, like stored procedures or functions, but their scope is generally limited to the functionality being exposed by the DBMS. For instance, how will you write a custom query-analyzing application that resides within your DBMS?

To support such options, PostgreSQL provides a pluggable architecture that allows you to install extensions. Extensions may consist of a configuration (control) file, a combination of SQL files, and dynamically loadable libraries.

This means you can write your own code as per the defined guidelines of an extension and plug it in a PostgreSQL instance without changing the actual PostgreSQL code tree. An extension by very definition extends what PostgreSQL can do, but more than that, it gives you the ability to interact with external entities. These external entities can be other database management systems like ClickHouse, Mongo or HDFs (normally these are called foreign data wrappers), or other interpreters or compilers (thus allowing us to write database functions in another language like Java, Python, Perl or TCL, etc.). Another potential use case of an extension can be for code obfuscation which allows you to protect your super secret code from prying eyes.

Build your own

To build your own extension, you don’t need a complete PostgreSQL code base. You can build and install an extension using installed PostgreSQL (it may require you to install a devel RPM or Debian package). Details about extensions can be found in PostgreSQL’s official documentation[1]. There many extensions available for different features in the contrib directory of PostgreSQL source. Other than the contrib directory, people are also writing extensions readily available on the internet but currently not part of the PostgreSQL source tree. The pg_stat_statements, PL/pgSQL, and PostGIS are examples of the best known or most widely used extensions.

Generally available PostgreSQL extensions may be classified into four main categories:

  • Add support of a new language extension (PL/pgSQL, PL/Python, and PL/Java)
  • Data type extensions where you can introduce new ((Hstore, cube, and hstore)
  • Miscellaneous extensions (contrib folder has many miscellaneous extensions)
  • Foreign Data Wrapper extension (postgres_fdw, mysqldb_fdw, clickhousedb_fdw)

There are four basic file types that are required for building an extension:

  • Makefile: Which uses PGXS PostgreSQL’s build infrastructure for extensions.
  • Control File: Carries information about the extension.
  • SQL File(s): If the extension has any SQL code, it may reside in form SQL files (optional)
  • C Code: The shared object that we want to build (optional).

Extension Makefile

To compile the C code, we need a makefile. It’s a very simple makefile with the exception of “PGXS”, which is PostgreSQL’s infrastructural makefile for creating extensions. The inclusion of “PGXS” is done by invoking pg_config binary with “–pgxs” flag. The detail of that file can be found at GitHub[2].

This is a sample makefile, which can be used to compile the C code.

MODULE_big = log
DATA = log--0.0.1.sql
OBJS = log.o
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

Extension Control File

This file must be named as [EXTENSION NAME]. control file. The control file can contain many options which can be found at official documentation [3]. But in this example, I have used some basic options.

comments: Comments about the extension.

default_version: This is the extension SQL version. The name of the SQL file contains this information in the file name.

relocatable:  Tells PostgreSQL if it is possible to move contained objects into a different schema.

module_pathname:  This information is replaced with the actual lib file path.

comment = 'PostgreSQL Utility Command Logger
'default_version = '0.0.1'
relocatable = true
module_pathname = '$libdir/log'

The contents of the file can be seen using the psql command \dx psql.

postgres=# \dx log      
List of installed extensions Name   | Version | Schema |       Description
log                                 | 0.0.1   | public | PostgreSQL Utility Command Logger

Extension SQL File

This is a mapping file, which I used to map the PostgreSQL function with the corresponding C function. Whenever you call the SQL function then the corresponding C function is called. The name of the file must be [EXTENSION NAME]–[default-version].sql. This is the same default_version as defined in the control file.

CREATE FUNCTION pg_all_queries(OUT query TEXT, pid OUT TEXT)

Extension C Code

There are three kinds of functions you can write in c code.

The first is where you call your c code function using SQL function written in SQL file of the extension.

The second type of function is callbacks. You register that callback by assigning the pointer of the function. There is no need for an SQL function here. You call this function automatically when a specific event occurs.

The third type of function is called automatically, even without registering. These functions are called on events such as extension load/unload time etc.

This is the C file containing the definition of the C code. There is no restriction for the name, or of the number of C files.

#include "postgres.h"
/* OS Includes */
/* PostgreSQL Includes */
void _PG_init(void);
void _PG_fini(void);
Datum pg_all_queries(PG_FUNCTION_ARGS);
static void process_utility(PlannedStmt *pstmt, const char *queryString,ProcessUtilityContext context,ParamListInfo params,QueryEnvironment *queryEnv,DestReceiver *dest,       char *completionTag);

You need to include postgres.h for the extension. You can include other PostgreSQL as needed. PG_MODULE_MAGIC is macro which you need to include in the C file for the extension. _PG_init and _PG_fini are functions that are called when the extension is loaded or unloaded, respectively.

Here is an example of the load and unload functions of an extension.

void _PG_init(void)
    /* ... C code here at time of extension loading ... */
    ProcessUtility_hook = process_utility;
Void _PG_fini(void)
    /* ... C code here at time of extension unloading ... */

Here is an example of a callback function that you can invoke whenever you call a utility statement e.g. any DDL statement. The “queryString” variable contains the actual query text.

static void process_utility(PlannedStmt *pstmt,
                           const char *queryString,
                           ProcessUtilityContext context,
                           ParamListInfo params,
                           QueryEnvironment *queryEnv,DestReceiver *dest,
                           char *completionTag)
    /* ... C code here ... */
    standard_ProcessUtility(pstmt,  &nbsp;
                            queryString,  &nbsp;
                            context,  &nbsp;
                            params,  &nbsp;
                            queryEnv,  &nbsp;
    /* ... C code here ... */

Finally, here’s an example of a C function that is invoked through a user-defined SQL function. This internally calls the C function contained in our shared object.

Datum pg_all_queries(PG_FUNCTION_ARGS)
    /* ... C code here ... */
    tupstore = tuplestore_begin_heap(true, false, work_mem);
    /* ... C code here ... */     
    values[0] = CStringGetTextDatum(query);
    values[1] = CStringGetTextDatum(pid);
    /* ... C code here ... */
    return (Datum) 0;

Compile and Install

Before compilation, you need to set the PATH for PostgreSQL’s bin directory if there is no pg_config available in the system paths.

export PATH=/usr/local/pgsql/bin:$PATH

make USE_PGXS=1

make USE_PGXS=1 install


We can now use our extension through a simple SQL query. Following is the output that is coming straight out of extension written in C programming language.

postgres=# select * from pg_all_queries();          
query                      | pid
create table foo(a int);  +| 8196
create table bar(a int);  +| 8196
drop table foo;           +| 8196
(3 rows)

I hope this example can serve as a starting point for you to create more useful extensions that will not only help you and your company, but will also give you an opportunity to share and help the PostgreSQL community grow.

The complete example can be found at Github[4].





Photo from Pexels


Upcoming Webinar Thurs 1/24: Databases Gone Serverless?

Databases Gone Serverless Webinar

Databases Gone Serverless WebinarPlease join Percona’s Senior Technical Manager, Alkin Tezuysal, as he presents Databases Gone Serverless? on Thursday, January 24th, at 6:00 AM PDT (UTC-7) / 9:00 AM EDT (UTC-4).

Register Now

Serverless computing is becoming more popular with developers. For instance, it enables them to build and run applications without needing to operate and manage servers. This talk will provide a high-level overview of serverless applications in the database world, including the use cases, possible solutions, services and benefits provided through the cloud ecosystem. In particular, we will focus on the capabilities of the AWS serverless platform.

In order to learn more, register for this webinar on Databases Gone Serverless.

Powered by WordPress | Theme: Aeros 2.0 by