MongoDB: Impact-free Index Builds using Detached ReplicaSet Nodes


Creating an index on a MongoDB collection is simple: just run the createIndex command and that’s all there is to it. There are several index types available; in a previous post you can find the most important ones: MongoDB index types and explain().

The command is quite simple, but for MongoDB, building an index is probably the most complicated task. I’m going to explain what the potential issues are and the best way to create any kind of index on a Replica Set environment.

A Replica Set is a cluster of at least three mongod servers in which the complete database is replicated. Some of the main advantages of this kind of structure are automatic failover and read scalability. If you need more familiarity with Replica Sets, you may take a look at the following posts:

Deploy a MongoDB Replica Set with Transport Encryption (Part 1)

Deploy a MongoDB Replica Set with Transport Encryption (Part 2)

Deploy a MongoDB Replica Set with Transport Encryption (Part 3)

Create Index Impact

As we said, creating an index can have a severe impact on MongoDB. A simple index creation on a field, like the following, blocks all other operations on the database:

db.people.createIndex( { email: 1 } )

This could be acceptable for a very small collection, where the build takes a few milliseconds, but for larger collections it is out of the question.

We call this way of building an index the “foreground” mode.

The foreground index creation is the fastest way, but since it is blocking we have to use something different in production environments. Fortunately, we can also create an index in “background” mode, using the following command:

db.people.createIndex( { email: 1 }, { background: true } )

The index will be built in the background by mongod using a different incremental approach. The advantage is that the database can continue to operate normally without any lock. Unfortunately, background creation takes much longer than the foreground build.

The first hint, then, is to create indexes using the background option. This is fine, but not in all cases; more on that later.

Another impact of building an index is memory usage. By default, MongoDB uses up to 500MB of memory for building an index, and you can override that if the index is larger. The larger the index, the higher the impact if you don’t have the capability to assign more memory to the task.

To increase the amount of memory available for index builds, set the maxIndexBuildMemoryUsageMegabytes parameter in the configuration file. For example, to allow 1 GB:

setParameter:
  maxIndexBuildMemoryUsageMegabytes: 1024

Create Index on a Replica Set

The index creation command is replicated to all the nodes of the cluster the same way all the other write commands are replicated: a foreground creation on the PRIMARY is replicated as a foreground build on the SECONDARY nodes, and a background creation is replicated as a background build as well.

The same limitations apply to a Replica Set as to a standalone server: the foreground build is fast but blocking, and the background build is not blocking but significantly slower for very large collections.

So, what can we do?

If you need to create a small index, let’s say the size is less than the available memory, you can rely on the background creation on the PRIMARY node. The operation will be correctly replicated to the SECONDARY nodes and the overall impact won’t be too bad.

But if you need to create an index larger than the available memory, on a huge collection, then even the background build is bad. The build will have a significant impact on server resources, and you can get overall performance problems on all the nodes. In this case, we have to follow another procedure. It requires more manual steps, but it’s the only way to properly build such a large index.

The idea is to detach one node at a time from the Replica Set, create the index, and rejoin the node to the cluster. But first, you need to take care of the oplog size: the oplog window should be large enough to give you the time for the index build while a node is detached.

Note: the oplog window is the timestamp difference between the first entry in the oplog and the most recent one. It represents the maximum amount of time a node can stay detached from the cluster for any kind of task (software upgrades, index builds, backups). If you rejoin the node within the window, it will catch up with the PRIMARY by simply replaying the missing operations from the oplog. If you rejoin it after the window has passed, it will have to copy all the collections from scratch, which takes a long time and a lot of bandwidth in large deployments.
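The check behind that decision is simple arithmetic. Here is a rough Python sketch (the function names and the two-times safety factor are our own illustration, not anything MongoDB provides): given the timestamps of the oldest and newest oplog entries, it tells you whether an estimated build fits comfortably inside the window.

```python
from datetime import datetime


def oplog_window_hours(first_entry_ts: datetime, last_entry_ts: datetime) -> float:
    """Timestamp difference between the newest and oldest oplog entries, in hours."""
    return (last_entry_ts - first_entry_ts).total_seconds() / 3600.0


def can_detach(window_hours: float, estimated_build_hours: float,
               safety_factor: float = 2.0) -> bool:
    """Only detach a node if the index build fits comfortably inside the window."""
    return window_hours >= estimated_build_hours * safety_factor


# Example: 48 hours of oplog, an index build estimated at 6 hours
window = oplog_window_hours(datetime(2019, 5, 1), datetime(2019, 5, 3))
print(window)            # 48.0
print(can_detach(window, 6))   # True
```

In a real deployment you would take the two timestamps from the node's oplog (the same numbers the server reports as the replication window) rather than hardcoding them.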

The following is the step by step guide:

  • choose one of the SECONDARY nodes
  • detach the node from the Replica Set
    • comment out the replSetName and port options in the configuration file
    • set a different port number
    • set the parameter disableLogicalSessionCacheRefresh to true
net:
  bindIp: localhost
  port: 27777
#  port: 27017
#replication:
#  replSetName: RSCluster
setParameter:
  disableLogicalSessionCacheRefresh: true

    • restart mongod
    • now the server is running as a standalone; nothing it executes will be replicated
  • connect using the alternative port and build the index in foreground mode
db.people.createIndex( { email: 1 } )

  • connect the node to the Replica Set
    • uncomment the options in the configuration file
    • remove the disableLogicalSessionCacheRefresh option
    • restart mongod
    • now the node is a member of the Replica Set
    • wait some time for the node to catch up with the PRIMARY
  • repeat the previous steps for all the remaining SECONDARY nodes
  • stepdown the PRIMARY node to force an election
    • run the rs.stepDown() command; this forces an election. Wait some time for the PRIMARY to become a SECONDARY node.
  • restart it as standalone
    • use the same procedure we saw before
  • build the index in foreground mode
  • restart the node and connect it to the Replica Set

That’s all. We have created the index on all the nodes without any impact on the cluster or on the production applications.

Note: when restarted as a standalone, the node is exposed to accidental writes. To be safe, a good practice is to disable TCP connections and allow only local connections through the Unix socket. To do so, you can put the following into the configuration file:
bindIp: /tmp/mongod.sock


This procedure is definitely more complicated than running a single command. It will require some time, but we hope you don’t have to create such large indexes every single day.


Stored Functions and Temporary Tables are Not a Good Fit


In this blog post, I am going to show why we have to be careful with stored functions in the select list, as a single query can cause thousands of queries in the background if we aren’t cautious.

For this example, I am only going to use the SLEEP function to demonstrate the issue, but you could use any other stored function.

Here is the test schema and data:

CREATE TABLE t1 (
 id INT NOT NULL PRIMARY KEY auto_increment,
 gcol INT,
 fcol INT,
 INDEX (gcol)
);
INSERT INTO t1 VALUES (NULL,1,1),(NULL,2,1),(NULL,3,1),(NULL,4,2),(NULL,5,1);

And the first query:

mysql [localhost] {msandbox} (test) > SELECT gcol, SLEEP(1) FROM t1 WHERE 1 GROUP BY gcol LIMIT 1;
+------+----------+
| gcol | SLEEP(1) |
+------+----------+
|    1 |        0 |
+------+----------+
1 row in set (1.00 sec)

The query takes one second, which means the SLEEP was executed after/during the LIMIT step.

The second query creates a temp table:

mysql [localhost] {msandbox} (test) > SELECT fcol, SLEEP(1) FROM t1 WHERE 1 GROUP BY fcol LIMIT 1;
+------+----------+
| fcol | SLEEP(1) |
+------+----------+
|    1 |        0 |
+------+----------+
1 row in set (5.02 sec)

It takes five seconds, meaning the SLEEP was executed for every row before the LIMIT, and the only difference between the two queries is that the second one uses a temp table. (Yes, the schema and query could be improved to avoid the temp table, but this is only a test case, and you can’t avoid temp tables all the time.)

MySQL uses a different order of steps here and does the select list evaluation before/during creating the temp tables.

So when there are functions in the select list and MySQL creates a temp table, that can cause thousands of extra queries in the background. I recently had a case where running a single query with functions in the select list resulted in 333,906 queries in the background, even though the query had LIMIT 0,10.

There is a very good blog post describing the order of the operations in SQL, which should be the following:

- From
- Where
- Group By
- Aggregations
- Having
- Window
- Select
- Distinct
- Union
- Order by
- Offset
- Limit

Based on this, if MySQL strictly followed the SQL order, the LIMIT would come last, and the first query should also take 5s, because the select list evaluation would happen for every row before it. But I guess this is a performance optimization in MySQL: only run the evaluation for the limited number of rows.

When temp tables are involved, MySQL will use a different logic and do the evaluation before/during the group by when it creates the temp tables.

When MySQL creates a temp table, it has to materialize every single row, and that includes the functions as well. If your query is going to create a huge temp table, MySQL will execute the function for all those rows.

MySQL does not support generated columns in internal temp tables, but what might do the trick here is to store just a pointer to the function, and execute it only when MySQL reads the data out of the temp table.

I have created a ticket, but we are still debating if there is any room for improvement here.

Using DML queries in these functions

If you are using DML (insert/update/delete) queries in stored functions that are called in the select list, you have to be careful: if MySQL creates a temp table, it will execute those queries for every row in the temp table as well. You might intend to insert only one row into another table, but end up inserting thousands.

How can we avoid this?

If you are facing this issue, you have to analyze your queries. You might only need an extra index to avoid temp tables, or you could rewrite the query.

Where can I see if this is happening?

Unfortunately, in this case, the slow query log cannot help us, because these function calls and the queries from the functions are not logged. This could also be a feature request because it would make debugging much easier. (I have already created a feature request in Percona Server to log queries from stored functions.)

However, in the slow query log for the original query, the Rows_examined would be very high if you are using stored functions and that number is way higher than it should be and might cause problems. Also, the general log can be helpful for investigating the problem here, because in the general log we can see all these function calls as well.


If you are using stored functions in the select list, you have to make sure the query does not use temp tables; otherwise the functions could cause many extra queries in the background.


How to Report Bugs, Improvements, New Feature Requests for Percona Products


Clear and structured bug, improvement, and new feature request reports are always helpful for validating and fixing issues. In a few cases, we have received reports with incomplete information, which can delay the verification of issues. The most effective way to avoid this situation is to include complete information about the issue when filing a report.

In this post we will discuss:

  • The best ways to report an issue for Percona products
  • Including a “how to reproduce” test case to verify the issue
  • The purpose of bug/improvement/new feature verification

Percona’s Jira (jira.percona.com) is the central place to report a bug/improvement/new feature request for all Percona products.

Let’s first discuss a few important entries which you should update when reporting an issue.

Project: The product name for which you wish to report an issue.

Issue Type: Provides options for the type of request. Example: Bug/Improvement/New Feature Request.

Summary: Summary of the issue which will serve as the title. It should be a one-line summary.

Affects Version/s: The version number of the Percona software for which you are reporting an issue.

Description: This field is to describe your issue in detail. Issue description should be clear and concise.

Bug report:

  • Describe the actual issue.
  • Add test case/steps to reproduce the issue if possible.
  • If it is a crash bug, provide the my.cnf file and the error log as additional information, along with the details mentioned in this blog post.
  • In some cases, the supporting file for bug reports such as a coredump, sqldump, or error log is prohibitively large. For these cases, we have an SFTP server where these files can be uploaded.

Documentation bug:

  • Provide the documentation link, describe what is wrong, and suggest how the documentation could be improved.

Improvement/New Feature Request:

  • For new features, describe the need and the use case. Include answers to questions such as “What problem will it solve?” and “How will it benefit users?”
  • In the case of improvements, mention what is problematic with the current behavior. What is your expectation as an improvement in a particular product feature?

Note: When reporting an issue, be sure to remove/replace sensitive information such as IP addresses, usernames, passwords, etc. from the report description and attached files.

Upstream Bugs:

Percona Server for MySQL and Percona Server for MongoDB are patched versions of their upstream codebases. It is possible that the particular issue originated from the upstream version. For these products, it would be helpful if the reporter also checks upstream for the same issue. If issues exist upstream, use the following URLs to report an issue for these respective products.


If you are a Percona customer, please file a support request to let us know how the bug affects you.

Purpose of Bug/Improvement/New Feature Request verification

  • Gather the required information from the reporter and identify whether the reported issue is a valid bug/improvement/new feature.
  • For bugs, create a reproducible test case. To address a bug effectively, it must be repeatable on demand.

Any incorrect assumptions can break other parts of the code while fixing a particular bug; this is why the verification process is important to identify the exact problem. Another benefit of having a reproducible test case is that it can then be used to verify the fix.

While feature requests and improvements are about ideas on how to improve Percona products, they still need to be verified. We need to ensure that behavior reported as a new feature or improvement:

  • Is not a bug
  • Is not implemented yet
  • For new feature verification, we also check whether there is an existing, different way to achieve the same.

Once bugs, improvements, and new features are validated, the issue status becomes “Open” and the issue moves forward for implementation.


MySQL InnoDB Sorted Index Builds


It’s not essential to understand how MySQL® and Percona Server for MySQL build indexes. However, understanding the process can help when you want to reserve an appropriate amount of space for data inserts. Starting with MySQL 5.7, developers changed the way secondary indexes are built for InnoDB, applying a bottom-up rather than the top-down approach used in earlier releases. In this post, I’ll walk through an example to show how an InnoDB index is built. At the end, I’ll explain how you can use this understanding to set an appropriate value for innodb_fill_factor.

Index building process

To build an index on a table with existing data, InnoDB goes through the following phases:

  1. Read phase (read from clustered index and build secondary index entries)
  2. Merge sort phase
  3. Insert phase (insert sorted records into the secondary index)

Until version 5.6, MySQL built the secondary index by inserting one record at a time. This is a “top-down” approach: the search for the insert position starts from the root (top) and reaches the appropriate leaf page (down), and the record is inserted on the leaf page pointed to by the cursor. It is expensive in terms of finding the insert position and doing page splits and merges (at both root and non-root levels). How do you know that there are too many page splits and merges happening? You can read about that in an earlier blog post by my colleague Marco Tusa, here.

From MySQL 5.7, the insert phase during ADD INDEX uses a “Sorted Index Build”, also known as “Bulk Load for Index”. In this approach, the index is built “bottom-up”: the leaf pages (bottom) are built first, and then the non-leaf levels up to the root (top).

Use cases

A sorted index build is used in these cases:


For the last two use cases, ALTER creates an intermediate table. The intermediate table’s indexes (both primary and secondary) are built using a sorted index build.


  1. Create a page at level 0. Also create a cursor to this page.
  2. Insert into the page using the cursor at Level 0 until it is full.
  3. Once the page is full, create a sibling page (do not insert into the sibling page yet).
  4. Create a node pointer (minimum key in child page, child page number) for the current full page and insert the node pointer one level above (the parent page).
  5. At the upper level, check if a cursor is already positioned. If not, create a parent page and a cursor for that level.
  6. Insert the node pointer into the parent page.
  7. If the parent page is full, repeat steps 3, 4, 5, 6.
  8. Now insert into the sibling page and make the cursor point to the sibling page.
  9. At the end of all inserts, there is a cursor at each level pointing to its rightmost page. Commit all the cursors (meaning commit the mini-transactions that modified those pages and release all the latches).

For the sake of simplicity, the above algorithm skipped the details about compressed pages and the handling of BLOBs (externally stored BLOBs).
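The numbered steps can be simulated with a small Python sketch. This is a toy model of the algorithm, not InnoDB code: pages hold at most three records, page numbering starts at 5 only to match the walk-through below, and the “minimum key” carried up from non-leaf levels is simplified to the page’s first entry.

```python
PAGE_CAPACITY = 3  # max records per page, as in the walk-through


def bulk_load(sorted_records):
    """Build B+-tree levels bottom-up from pre-sorted records.

    Returns a list of levels; levels[0] is the leaf level. Each level is a
    list of (page_no, records) pairs, and each node pointer inserted at an
    upper level is (minimum entry of child page, child page number).
    """
    next_page_no = 5  # numbering starts at 5 to match the example
    levels = []

    def new_page():
        nonlocal next_page_no
        page_no = next_page_no
        next_page_no += 1
        return page_no

    def insert(level, record):
        # Create the level (and its first page plus implicit cursor) on first use.
        while len(levels) <= level:
            levels.append([(new_page(), [])])
        page_no, records = levels[level][-1]
        if len(records) == PAGE_CAPACITY:
            # Page full: allocate the sibling first (step 3), push the full
            # page's node pointer to the parent (steps 4-7), then move the
            # cursor to the sibling (step 8).
            sibling_no = new_page()
            insert(level + 1, (records[0], page_no))
            levels[level].append((sibling_no, []))
            page_no, records = levels[level][-1]
        records.append(record)

    for rec in sorted_records:
        insert(0, rec)
    return levels


records = [(11, 1), (22, 2), (33, 3), (44, 4), (55, 5),
           (66, 6), (77, 7), (88, 8), (99, 9), (1010, 10)]
levels = bulk_load(records)
print([page_no for page_no, _ in levels[0]])   # leaf pages: [5, 6, 8, 9]
print(levels[1])   # [(7, [((11, 1), 5), ((44, 4), 6), ((77, 7), 8)])]
```

Running it on the ten records used below reproduces the walk-through: leaf pages 5, 6, 8, 9 and a single Level 1 page (number 7) holding the three node pointers.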

Walk through of building an index, bottom-up

Using an example, let’s see how a secondary index is built, bottom-up. Again, for simplicity’s sake, assume the maximum number of records allowed in both leaf and non-leaf pages is three.

INSERT INTO t1 VALUES (1, 11, 'hello111');
INSERT INTO t1 VALUES (2, 22, 'hello222');
INSERT INTO t1 VALUES (3, 33, 'hello333');
INSERT INTO t1 VALUES (4, 44, 'hello444');
INSERT INTO t1 VALUES (5, 55, 'hello555');
INSERT INTO t1 VALUES (6, 66, 'hello666');
INSERT INTO t1 VALUES (7, 77, 'hello777');
INSERT INTO t1 VALUES (8, 88, 'hello888');
INSERT INTO t1 VALUES (9, 99, 'hello999');
INSERT INTO t1 VALUES (10, 1010, 'hello101010');

InnoDB appends the primary key fields to the secondary index. The records of secondary index k1 are of the format (b, a). After the sort phase, the records are:

(11,1), (22,2), (33,3), (44,4), (55,5), (66,6), (77,7), (88,8), (99,9), (1010, 10)
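Building and sorting those (b, a) records from the inserted rows can be expressed in a couple of lines of Python (just an illustration of the record format, not InnoDB's sort implementation):

```python
# (a, b) values from the INSERTs above; column c is not part of index k1
rows = [(1, 11), (2, 22), (3, 33), (4, 44), (5, 55),
        (6, 66), (7, 77), (8, 88), (9, 99), (10, 1010)]

# Secondary index records are (indexed column, primary key) = (b, a),
# sorted by the index key during the merge-sort phase
k1_records = sorted((b, a) for a, b in rows)
print(k1_records[0], k1_records[-1])   # (11, 1) (1010, 10)
```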

Initial insert phase

Let’s start with record (11,1).

  1. Create a page at Level 0 (leaf level).
  2. Create a cursor to the page.
  3. All inserts go to this page until it is full.

The arrow shows where the cursor is currently pointing. It is currently at page number 5, and the next inserts go to this page.

There are two more free slots, so insertion of the records (22,2) and (33,3) is straightforward.

For the next record (44,4), page number 5 is full. Here are the steps:

Index building as pages become filled

  1. Create a sibling page – page number 6
  2. Do not insert into the sibling page yet.
  3. Commit the page at the cursor i.e. mini transaction commit, release latch etc
  4. As part of the commit, create a node pointer and insert it into the parent page at [current Level + 1]. i.e at Level 1
  5. The node pointer is of the format (minimum key in child page, child page number). Minimum key in page number 5 is (11,1). Insert record ((11,1),5) at the parent level.
  6. The parent page at Level 1 doesn’t exist yet. MySQL creates page number 7 and a cursor pointing to page number 7.
  7. Insert ((11,1),5) into page number 7
  8. Now, return back to Level 0 and create links from page number 5 to 6 and vice versa
  9. The cursor at Level 0 now points to sibling page number 6.
  10. Insert (44,4) into page number 6


The next inserts – of (55,5) and (66,6) – are straightforward and they go to page 6.

Insertion of record (77,7) is similar to (44,4), except that the parent page (page number 7) already exists and it has space for two more records. Insert node pointer ((44,4),6) into page 7 first, and then record (77,7) into sibling page 8.

Insertion of records (88,8) and (99,9) is straightforward because page 8 has two free slots.

Next, insert record (1010,10). Page number 8 is full, so insert node pointer ((77,7),8) into the parent page (page number 7) at Level 1.

MySQL creates sibling page number 9 at Level 0. Insert record (1010,10) into page 9 and change the cursor to this page.

Commit the cursors at all levels. In the above example, the database commits page 9 at Level 0 and page 7 at Level 1. We now have a complete B+-tree index that is built from bottom-up!

Index fill factor

The global variable innodb_fill_factor sets the amount of space in a B-tree page to be used for inserts. The default value is 100, which means the entire page is used (excluding page headers and trailers). A clustered index is an exception: with innodb_fill_factor = 100, 1/16th of the clustered index page space is kept free, i.e. 6.25% of the space is reserved for future DMLs.

A value of 80 means that MySQL uses 80% of the page for inserts, and leaves 20% for future updates.

If innodb_fill_factor is 100, there is no free space left for future inserts into the secondary index. If you expect more DMLs on the table after the ADD INDEX, this can lead to page splits and merges again. In such cases, it is advisable to use a value between 80 and 90. This variable also affects indexes recreated with OPTIMIZE TABLE or ALTER TABLE DROP COLUMN, ALGORITHM=INPLACE.

You should not use values that are too low, for example below 50, because the index then occupies significantly more disk space. With low values there are more pages in an index, and index statistics sampling might not be optimal. The optimizer could choose wrong query plans with sub-optimal statistics.
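Ignoring page header and trailer overhead, the arithmetic is simple. Here is a back-of-the-envelope Python sketch, assuming the default 16KB InnoDB page size:

```python
PAGE_SIZE = 16 * 1024   # default InnoDB page size in bytes


def reserved_bytes(fill_factor: int, clustered: bool = False) -> int:
    """Approximate bytes left free per page for future DMLs
    (page header/trailer overhead is ignored)."""
    if fill_factor == 100 and clustered:
        return PAGE_SIZE // 16   # the 1/16th (6.25%) clustered-index exception
    return PAGE_SIZE * (100 - fill_factor) // 100


print(reserved_bytes(100))                   # 0    - secondary index, no room left
print(reserved_bytes(100, clustered=True))   # 1024 - 6.25% of 16KB
print(reserved_bytes(80))                    # 3276 - ~20% kept for future changes
```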

Advantages of Sorted Index Build

  1. No page splits (excluding compressed tables) and merges.
  2. No repeated searches for insert position.
  3. Inserts are not redo logged (except for page allocations), so there is less pressure on the redo log subsystem.


None… Well, OK, there is one, and it deserves a separate blog post. Stay tuned!


Watch Webinar: Speaking the same language as Developers and DBAs


Please join Percona’s Product Manager, Michael Coburn, as he presents his talk “Speaking the same language as Developers and DBAs”.

Watch the Recorded Webinar

In this talk, we’ll go over some of the high-level monitoring concepts that PMM enables for you, the CEO, to leverage in order to speak the same language as your technical team. For instance, we’ll cover:

– Which thresholds to follow for CEOs
– Host-specific resources – CPU, Memory, Disk, Network
– Database-specific resources – Connections, Queries per Second, Slow Queries, Replication lag
– Building custom KPI metrics using PMM
– Visualize Application logic
– Combine Application and Databases under one view

In order to learn more and speak the same language as Developers and DBAs, watch our webinar.


Creating Custom Sysbench Scripts


Sysbench has long been established as the de facto standard when it comes to benchmarking MySQL performance. Percona relies on it daily, and even Oracle uses it when blogging about new features in MySQL 8. Sysbench comes with several pre-defined benchmarking tests, written in an easy-to-understand scripting language called Lua. Some of these tests are called oltp_read_write, oltp_point_select, tpcc, and oltp_insert. There are over ten such scripts to emulate various behaviors found in standard OLTP applications.

But what if your application does not fit the pattern of traditional OLTP? How can you continue to utilize the power of load-testing, benchmarking, and results analysis with sysbench? Just write your own Lua script!

For those that want to jump ahead and see the full source, here you go.

Sysbench API

To start off, each Lua script you create must implement three core sysbench-Lua API functions. These are thread_init, thread_done, and event. You can read the comments in the code below for the meaning of each function and what is happening inside.

-- Called by sysbench one time to initialize this script
function thread_init()
  -- Create globals to be used elsewhere in the script
  -- drv - initialize the sysbench mysql driver
  drv = sysbench.sql.driver()
  -- con - represents the connection to MySQL
  con = drv:connect()
end

-- Called by sysbench when script is done executing
function thread_done()
  -- Disconnect/close connection to MySQL
  con:disconnect()
end

-- Called by sysbench for each execution
function event()
  -- If user requested to disable transactions,
  -- do not execute BEGIN statement
  if not sysbench.opt.skip_trx then
    con:query("BEGIN")
  end
  -- Run our custom statements
  execute_selects()
  execute_inserts()
  -- Like above, if transactions are disabled,
  -- do not execute COMMIT
  if not sysbench.opt.skip_trx then
    con:query("COMMIT")
  end
end

That’s all pretty simple and should function as a good template in your scripts. Now let’s take a look at the rest of the script.

Sanity checks and options

Now let’s get into the core code. At the top you’ll find the following sections:

if sysbench.cmdline.command == nil then
   error("Command is required. Supported commands: run")
end

sysbench.cmdline.options = {
  point_selects = {"Number of point SELECT queries to run", 5},
  skip_trx = {"Do not use BEGIN/COMMIT; Use global auto_commit value", false}
}

The first section is a sanity check to make sure the user actually wants to run this test. Other test scripts, mentioned above, support commands like prepare, run, and cleanup. Our script only supports run as the data we are using is pre-populated by our core application.

The second section allows us, the script writers, to let the user pass options specific to our test script. In the code above, we can see an option for the number of SELECT statements that will be run on each thread/iteration (default is 5), and another option that allows the user to disable BEGIN/COMMIT if they so desire (default is false). If you want more customization in your script, simply add more options. You’ll see how to reference these parameters later on.

The queries

Now it is time to define the custom queries we want to execute in our script.

-- Array of categories to be used in the INSERTs
local page_types = { "actor", "character", "movie" }

-- Array of COUNT(*) queries
local select_counts = {
  "SELECT COUNT(*) FROM imdb.title"
}

-- Array of SELECT statements that have 1 integer parameter
local select_points = {
  "SELECT * FROM imdb.title WHERE id = %d",
  "SELECT * FROM imdb.comments ORDER BY id DESC LIMIT 10",
  "SELECT AVG(rating) avg FROM imdb.movie_ratings WHERE movie_id = %d",
  "SELECT * FROM imdb.users ORDER BY RAND() LIMIT 1"
}

-- Array of SELECT statements that have 1 string parameter
local select_string = {
  "SELECT * FROM imdb.title WHERE title LIKE '%s%%'"
}

-- INSERT statements
local inserts = {
  "INSERT INTO imdb.users (email_address, first_name, last_name) VALUES ('%s', '%s', '%s')",
  "INSERT INTO imdb.page_views (type, viewed_id, user_id) VALUES ('%s', %d, %d)"
}

The above code defines several arrays/lists of different queries. Why is this necessary? Later on in the code, we will have to parse each SQL statement and populate/replace the various parameters with randomly generated values. It would not do us any good to repeat the same SELECT * FROM fooTable WHERE id = 44 every time, now would it? Certainly not. We want to generate random numbers and have our queries select from the entire dataset.

Some queries have no parameters, some have integer-based, and some string-based. We will handle these differently below, which is why they are in different arrays above. This method also allows for future expansion. When you want to run additional queries within the script, just add another line to each array; no need to change any other code.

Parse and execute

The function below, execute_selects, will be called from the parent function, event, which we discussed earlier in the post. You can see for-loops for each of the three SELECT categories we created above. The comments inline should help explain what is happening. Note the use of the user-provided option --point-selects in the second loop below, which we created previously in the “Sanity checks and options” section.

function execute_selects()
  -- Execute each simple, no parameters, SELECT
  for i, o in ipairs(select_counts) do
    con:query(o)
  end
  -- Loop for however many queries the
  -- user wants to execute in this category
  for i = 1, sysbench.opt.point_selects do
    -- select random query from list
    local randQuery = select_points[math.random(#select_points)]
    -- generate random id and execute
    local id = sysbench.rand.pareto(1, 3000000)
    con:query(string.format(randQuery, id))
  end
  -- generate random string and execute
  for i, o in ipairs(select_string) do
    local str = sysbench.rand.string(string.rep("@", sysbench.rand.special(2, 15)))
    con:query(string.format(o, str))
  end
end

Two more things to mention about this code. First, notice the use of sysbench.rand.pareto to generate a random number between 1 and 3,000,000. For our dataset, we know that each table referenced in queries with WHERE id = ? has at least that many rows. This is specific to our data; your values will certainly be different. Second, notice the use of sysbench.rand.string and string.rep. The string.rep segment generates a string of ‘@’ symbols between 2 and 15 characters long. That string is then passed to sysbench.rand.string, which swaps out each ‘@’ for a random alphanumeric value. For example, ‘@@@@@@’ could become ‘Hk9EdC’, which then replaces the ‘%s’ inside the query string (string.format) before execution.
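What sysbench.rand.string does with the ‘@’ template can be mimicked in a few lines of Python (an illustration only; sysbench’s actual random number generator differs):

```python
import random
import string


def rand_string(template: str) -> str:
    """Replace every '@' in the template with a random alphanumeric
    character, as sysbench.rand.string does; other characters pass
    through unchanged."""
    return "".join(
        random.choice(string.ascii_letters + string.digits) if ch == "@" else ch
        for ch in template
    )


# '@' repeated a random 2-15 times, like string.rep("@", sysbench.rand.special(2, 15))
template = "@" * random.randint(2, 15)
value = rand_string(template)
print(len(value) == len(template))   # True: same length, '@'s swapped for alphanumerics
```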

Handle inserts

Our INSERT statements require values. Again, sysbench calls the function execute_inserts from event on each iteration. Inside execute_inserts, we generate some fake string data using built-in functions described above.

Those strings are then formatted into the SQL and executed.

function create_random_email()
  local username = sysbench.rand.string(string.rep("@",sysbench.rand.uniform(5,10)))
  local domain = sysbench.rand.string(string.rep("@",sysbench.rand.uniform(5,10)))
  return username .. "@" .. domain .. ".com"
end

function execute_inserts()
  -- generate fake email/info
  local email = create_random_email()
  local firstname = sysbench.rand.string("first-" .. string.rep("@", sysbench.rand.special(2, 15)))
  local lastname = sysbench.rand.string("last-" .. string.rep("@", sysbench.rand.special(2, 15)))
  -- INSERT for new imdb.user
  con:query(string.format(inserts[1], email, firstname, lastname))
  -- INSERT for imdb.page_view
  local page = page_types[math.random(#page_types)]
  con:query(string.format(inserts[2], page, sysbench.rand.special(2, 500000), sysbench.rand.special(2, 500000)))
end

Example run

$ sysbench imdb_workload.lua \
    --mysql-user=imdb --mysql-password=imdb \
    --mysql-db=imdb --report-interval=1 \
    --events=0 --time=0 run
WARNING: Both event and time limits are disabled, running an endless test
sysbench 1.0.17 (using system LuaJIT 2.0.4)
Running the test with following options:
Number of threads: 1
Report intermediate results every 1 second(s)
Initializing random number generator from current time
Initializing worker threads...
Threads started!
[ 1s ] thds: 1 tps: 15.96 qps: 177.54 (r/w/o: 112.71/31.92/32.91) lat (ms,95%): 158.63 err/s: 0.00 reconn/s: 0.00
[ 2s ] thds: 1 tps: 15.01 qps: 169.09 (r/w/o: 109.06/30.02/30.02) lat (ms,95%): 137.35 err/s: 0.00 reconn/s: 0.00
[ 3s ] thds: 1 tps: 26.00 qps: 285.00 (r/w/o: 181.00/52.00/52.00) lat (ms,95%): 108.68 err/s: 0.00 reconn/s: 0.00
[ 4s ] thds: 1 tps: 15.00 qps: 170.00 (r/w/o: 108.00/32.00/30.00) lat (ms,95%): 164.45 err/s: 0.00 reconn/s: 0.00

And there we have it! Custom queries specific to our application and dataset. Most of the sysbench parameters are self-explanatory, but --report-interval=1 deserves a mention: it prints statistics every second. Normally sysbench does not output stats until the end of the run; however, this example execution runs forever (--events=0 --time=0), so we need stats shown continuously. You can adjust the parameters to your liking. For instance, if you only want to run a test for 5 minutes, set --events=0 --time=300.


Sysbench is a very well-designed application that allows you to load-test your MySQL instances using both predefined and custom queries. Using the Lua scripting language, you can create just about any scenario to fit your needs. The above is just one example that we use within Percona’s Training and Education department. It is by no means an exhaustive demonstration of all of the capabilities of sysbench-Lua.

Photo by Lachlan Donald on Unsplash


Writing PostgreSQL Extensions is Fun – C Language

postgresql extensions

PostgreSQL is a powerful open source relational database management system. It extends the SQL language with additional features. A DBMS is not only defined by its performance and out-of-the-box features, but also by its ability to support bespoke, user-specific functionality. Some of that functionality may come in the form of database constructs or modules, like stored procedures or functions, but their scope is generally limited to what the DBMS itself exposes. For instance, how would you write a custom query-analyzing application that resides within your DBMS?

To support such options, PostgreSQL provides a pluggable architecture that allows you to install extensions. Extensions may consist of a configuration (control) file, a combination of SQL files, and dynamically loadable libraries.

This means you can write your own code following the defined guidelines for an extension and plug it into a PostgreSQL instance without changing the actual PostgreSQL code tree. An extension by its very definition extends what PostgreSQL can do, but more than that, it gives you the ability to interact with external entities. These external entities can be other database management systems like ClickHouse, MongoDB, or HDFS (extensions of this kind are normally called foreign data wrappers), or other interpreters or compilers (thus allowing us to write database functions in another language like Java, Python, Perl, or TCL). Another potential use case of an extension is code obfuscation, which allows you to protect your super-secret code from prying eyes.

Build your own

To build your own extension, you don’t need the complete PostgreSQL code base. You can build and install an extension against an installed PostgreSQL (this may require installing a devel RPM or Debian package). Details about extensions can be found in PostgreSQL’s official documentation[1]. There are many extensions available for different features in the contrib directory of the PostgreSQL source. Beyond the contrib directory, people also write extensions that are readily available on the internet but not currently part of the PostgreSQL source tree. pg_stat_statements, PL/pgSQL, and PostGIS are examples of the best-known and most widely used extensions.

Generally available PostgreSQL extensions may be classified into four main categories:

  • Language extensions, which add support for a new procedural language (PL/pgSQL, PL/Python, and PL/Java)
  • Data type extensions, which introduce new data types (hstore and cube)
  • Miscellaneous extensions (the contrib folder has many miscellaneous extensions)
  • Foreign Data Wrapper extensions (postgres_fdw, mysqldb_fdw, clickhousedb_fdw)

There are four basic file types that are required for building an extension:

  • Makefile: Uses PGXS, PostgreSQL’s build infrastructure for extensions.
  • Control File: Carries information about the extension.
  • SQL File(s): If the extension has any SQL code, it resides in one or more SQL files (optional).
  • C Code: The shared object that we want to build (optional).
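Taken together, the files for the sample “log” extension used below might sit in a single directory like this (the file names here match the Makefile and control file shown in the following sections):

```
log/
├── Makefile          # PGXS-based build file
├── log.control       # extension metadata
├── log--0.0.1.sql    # SQL mapping file
└── log.c             # C source built into log.so
```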

Extension Makefile

To compile the C code, we need a makefile. It’s a very simple makefile, with the exception of “PGXS”, which is PostgreSQL’s infrastructural makefile for creating extensions. PGXS is included by invoking the pg_config binary with the “--pgxs” flag. The details of that file can be found at GitHub[2].

This is a sample makefile, which can be used to compile the C code.

MODULE_big = log
DATA = log--0.0.1.sql
OBJS = log.o
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

Extension Control File

This file must be named [EXTENSION NAME].control. The control file can contain many options, which can be found in the official documentation[3], but in this example I have used only some basic ones.

comment: A comment about the extension.

default_version: The extension’s SQL version. The name of the SQL file contains this version.

relocatable: Tells PostgreSQL whether the contained objects can be moved into a different schema.

module_pathname: Replaced at load time with the actual library file path.

comment = 'PostgreSQL Utility Command Logger'
default_version = '0.0.1'
relocatable = true
module_pathname = '$libdir/log'

The installed extension can be seen using the psql command \dx.

postgres=# \dx log
                 List of installed extensions
 Name | Version | Schema |            Description
------+---------+--------+-----------------------------------
 log  | 0.0.1   | public | PostgreSQL Utility Command Logger

Extension SQL File

This is a mapping file, which maps a PostgreSQL SQL function to the corresponding C function: whenever you call the SQL function, the corresponding C function is invoked. The file must be named [EXTENSION NAME]--[default-version].sql, where default-version is the same default_version as defined in the control file.

CREATE FUNCTION pg_all_queries(OUT query TEXT, OUT pid TEXT)
RETURNS SETOF RECORD
AS 'MODULE_PATHNAME', 'pg_all_queries'
LANGUAGE C STRICT;

Extension C Code

There are three kinds of functions you can write in the C code.

The first kind is called through an SQL function written in the extension’s SQL file.

The second kind is a callback. You register the callback by assigning a function pointer to a hook; no SQL function is needed. The callback is then invoked automatically when a specific event occurs.

The third kind is called automatically, without any registration, on events such as extension load/unload.

This is the C file containing the extension’s code. There is no restriction on the name or the number of C files.

#include "postgres.h"
/* OS Includes */
/* PostgreSQL Includes */

PG_MODULE_MAGIC;

void _PG_init(void);
void _PG_fini(void);
Datum pg_all_queries(PG_FUNCTION_ARGS);
static void process_utility(PlannedStmt *pstmt,
                            const char *queryString,
                            ProcessUtilityContext context,
                            ParamListInfo params,
                            QueryEnvironment *queryEnv,
                            DestReceiver *dest,
                            char *completionTag);

You need to include postgres.h in the extension. You can include other PostgreSQL headers as needed. PG_MODULE_MAGIC is a macro that you must include in one C file of the extension. _PG_init and _PG_fini are functions that are called when the extension is loaded or unloaded, respectively.

Here is an example of the load and unload functions of an extension.

void _PG_init(void)
{
    /* ... C code here at time of extension loading ... */
    ProcessUtility_hook = process_utility;
}

void _PG_fini(void)
{
    /* ... C code here at time of extension unloading ... */
}

Here is an example of a callback function that is invoked whenever a utility statement (e.g. any DDL statement) is executed. The queryString argument contains the actual query text.

static void process_utility(PlannedStmt *pstmt,
                            const char *queryString,
                            ProcessUtilityContext context,
                            ParamListInfo params,
                            QueryEnvironment *queryEnv,
                            DestReceiver *dest,
                            char *completionTag)
{
    /* ... C code here ... */
    /* ... C code here ... */
}

Finally, here’s an example of a C function that is invoked through a user-defined SQL function. This internally calls the C function contained in our shared object.

Datum pg_all_queries(PG_FUNCTION_ARGS)
{
    /* ... C code here ... */
    tupstore = tuplestore_begin_heap(true, false, work_mem);
    /* ... C code here ... */
    values[0] = CStringGetTextDatum(query);
    values[1] = CStringGetTextDatum(pid);
    /* ... C code here ... */
    return (Datum) 0;
}

Compile and Install

Before compilation, you need to set the PATH for PostgreSQL’s bin directory if there is no pg_config available in the system paths.

export PATH=/usr/local/pgsql/bin:$PATH

make USE_PGXS=1

make USE_PGXS=1 install


We can now use our extension through a simple SQL query. The following output comes straight out of the extension written in C.

postgres=# select * from pg_all_queries();
           query            | pid
----------------------------+------
 create table foo(a int);  +| 8196
 create table bar(a int);  +| 8196
 drop table foo;           +| 8196
(3 rows)

I hope this example can serve as a starting point for you to create more useful extensions that will not only help you and your company, but will also give you an opportunity to share and help the PostgreSQL community grow.

The complete example can be found at GitHub[4].





Photo from Pexels


Upcoming Webinar Thurs 1/24: Databases Gone Serverless?

Databases Gone Serverless Webinar

Please join Percona’s Senior Technical Manager, Alkin Tezuysal, as he presents Databases Gone Serverless? on Thursday, January 24th, at 6:00 AM PDT (UTC-7) / 9:00 AM EDT (UTC-4).

Register Now

Serverless computing is becoming more popular with developers because it enables them to build and run applications without needing to operate and manage servers. This talk will provide a high-level overview of serverless applications in the database world, including the use cases, possible solutions, services, and benefits provided through the cloud ecosystem. In particular, we will focus on the capabilities of the AWS serverless platform.

To learn more, register for this webinar on Databases Gone Serverless.


Upcoming Webinar Thurs 1/17: How to Rock with MyRocks

How to Rock with MyRocks

Please join Percona’s Chief Technology Officer, Vadim Tkachenko, as he presents How to Rock with MyRocks on Thursday, January 17th at 10:00 AM PDT (UTC-7) / 1:00 PM EDT (UTC-4).

Register Now

MyRocks is a new storage engine from Facebook and is available in Percona Server for MySQL. In what cases will you want to use it? We will check different workloads and when MyRocks is most suitable for you. Also, as for any new engine, it’s important to set it up and tune it properly. So, we will review the most important settings to pay attention to.

Register for this webinar to learn How to Rock with MyRocks.


Upcoming Webinar Wed 1/9: Walkthrough of Percona Server MySQL 8.0

Walkthrough of Percona Server for MySQL 8.0

Please join Percona’s MySQL Product Manager, Tyler Duzan, as he presents Walkthrough of Percona Server MySQL 8.0 on Wednesday, January 9th at 11:00 AM PDT (UTC-7) / 2:00 PM (UTC-4).

Register Now

Our Percona Server for MySQL 8.0 software is the company’s free, enhanced, drop-in replacement for MySQL Community Edition. The software includes all of the great features in MySQL Community Edition 8.0. Additionally, it includes enterprise-class features from Percona made available free and open source. Thousands of enterprises trust Percona Server for MySQL to deliver excellent performance and reliability for their databases and mission-critical applications. Furthermore, our open source software meets their need for a mature, proven and cost-effective MySQL solution.

Register for this webinar for a walkthrough of Percona Server for MySQL 8.0.
