MariaDB 5.3/5.5 has introduced a new join type, “Hash Join”, which is an implementation of the Classic Block-based Hash Join Algorithm. In this post we will see what the Hash Join is, how it works, and for what types of queries it is the right choice. I will show the results of benchmarks for different queries and explain them, so that you have a better understanding of when using the Hash Join is best and when it is not. Although Hash Joins have been available since MariaDB 5.3, I will be running my benchmarks on the newer MariaDB 5.5.
Overview
Hash Join is a new algorithm introduced in MariaDB 5.3/5.5 that can be used for joining tables that have equijoin conditions of the form tbl1.col1 = tbl2.col1, etc. As mentioned above, what is actually implemented is the Classic Hash Join, but it is known as Block Nested Loop Hash (BNLH) Join in MariaDB.
The Classic Hash Join Algorithm consists of two phases, a build phase and a probe phase. Let’s consider the case of joining two tables on an equijoin condition. The first step is to designate the smaller of the two tables as the left operand, and the bigger table as the right operand. When the algorithm begins, the first phase is the build phase, in which a hash table is created over the join attributes and rows of the left operand. Next comes the probe phase, in which the matching rows from the right operand are found: the right operand is scanned and, for each row scanned, a lookup is performed in the hash table using the values of the columns participating in the equijoin condition. The hash table is accessed by applying a hash function to the values of the join condition, so lookups are quite efficient.

But what about the restriction on the size of the hash table? The size of the hash table is restricted by the value of join_buffer_size, so if the left operand is big enough that a hash table built on it would exceed join_buffer_size, multiple hash tables are created. For example, if the left operand has “n” rows and its size is three times the value of join_buffer_size, then 3 hash tables would be created, each built on n/3 rows. Both the build and probe phases would then be done three times, which means that the right operand would be scanned three times.
Wikipedia has a nicely simplified version of the Classic Hash Join algorithm (http://en.wikipedia.org/wiki/Hash_join#Classic_hash_join) which I will quote below for better understanding:
- For each tuple r in the build input R
  - Add r to the in-memory hash table
  - If the size of the hash table equals the maximum in-memory size:
    - Scan the probe input S, and add matching join tuples to the output relation
    - Reset the hash table
- Do a final scan of the probe input S and add the resulting join tuples to the output relation
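To make the pseudocode above concrete, here is a minimal Python sketch of the classic block-based hash join. This is only a toy illustration, not MariaDB's implementation: `max_buffer_rows` stands in for join_buffer_size, and rows are plain dicts.

```python
from collections import defaultdict

def classic_hash_join(build_rows, probe_rows, build_key, probe_key,
                      max_buffer_rows):
    """Toy sketch of the classic block-based hash join.

    build_rows is the left operand, probe_rows the right operand;
    build_key / probe_key extract the equijoin column. When the hash
    table fills up, we probe and reset it, which forces another full
    scan of the probe input.
    """
    output = []
    hash_table = defaultdict(list)
    probe_scans = 0

    def probe():
        nonlocal probe_scans
        probe_scans += 1
        for s in probe_rows:                       # scan the probe input S
            for r in hash_table.get(probe_key(s), []):
                output.append((r, s))              # matching join tuples

    for r in build_rows:                           # build phase
        hash_table[build_key(r)].append(r)
        if sum(len(v) for v in hash_table.values()) == max_buffer_rows:
            probe()                                # buffer full: probe now
            hash_table.clear()                     # reset the hash table
    if hash_table:
        probe()                                    # final scan of S
    return output, probe_scans
```

With 6 build rows and a buffer of 2 rows, the probe input is scanned 3 times, mirroring the "3 hash tables" example above.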
Now that we have seen how the hash join works, let's see how it performs for different test cases.
Benchmarks
For the purpose of the benchmarks I used the DBT3 dataset of scale factor 2, which means the total dataset size is 4.8G. Let me show the breakdown of dataset size by the tables that I have used in the benchmarks:
Table ‘lineitem’: 3.8G
Table ‘supplier’: 11M
Table ‘orders’: 468M
I have benchmarked two different kinds of workloads, IO-bound and in-memory. The IO-bound benchmark was performed with a buffer pool size of 1G, while the in-memory benchmark was performed with a buffer pool size of 6G. The benchmarks compare the Block Nested Loop (BNL) Join of MySQL 5.5.24, the Batched Key Access (BKA) Join of MySQL 5.6.5 and the Block Nested Loop Hash (BNLH) Join of MariaDB 5.5.20. The configurations used with the three variants of MySQL are listed below.
Configuration
Let’s first take a look at the configuration used with different MySQL flavors.
MySQL 5.5.24 Configuration
innodb_file_per_table=1
innodb_file_format=barracuda
innodb_log_file_size=512M
innodb_log_files_in_group=2
innodb_flush_log_at_trx_commit=2
innodb_flush_method=O_DIRECT
query_cache_size=0
query_cache_type=0
MySQL 5.6.5 Configuration
innodb_file_per_table=1
innodb_file_format=barracuda
innodb_log_file_size=512M
innodb_log_files_in_group=2
innodb_flush_log_at_trx_commit=2
innodb_flush_method=O_DIRECT
query_cache_size=0
query_cache_type=0
optimizer_switch='index_condition_pushdown=on'
optimizer_switch='mrr=on'
optimizer_switch='mrr_cost_based=off'
read_rnd_buffer_size=32M
optimizer_switch='batched_key_access=on'
join_buffer_size=32M
MariaDB 5.5.20 Configuration
innodb_file_per_table=1
innodb_file_format=barracuda
innodb_log_file_size=512M
innodb_log_files_in_group=2
innodb_flush_log_at_trx_commit=2
innodb_flush_method=O_DIRECT
query_cache_size=0
query_cache_type=0
optimizer_switch='index_condition_pushdown=on'
optimizer_switch='mrr=on'
optimizer_switch='mrr_sort_keys=on'
optimizer_switch='mrr_cost_based=off'
mrr_buffer_size=32M
optimizer_switch='join_cache_incremental=on'
optimizer_switch='join_cache_hashed=on'
optimizer_switch='join_cache_bka=on'
join_cache_level=4
join_buffer_size=32M
join_buffer_space_limit=32M
Note that MariaDB includes a new variable ‘join_cache_level’, which controls which join algorithms are allowed to be used; a value of 4 here means that the Nested Loop Join and Hash Join algorithms are allowed. As we know, ‘join_buffer_size’ controls the size of the join buffer allocated for each join in a query; MariaDB introduces another variable, ‘join_buffer_space_limit’, which controls the maximum allowed size of the join buffers for the whole query. By default it has a value of 1024*128*10 bytes (1.25M), which means that your effective join_buffer_size is not bigger than this value. Hence the reason I have set join_buffer_space_limit=32M.
Benchmark Machine Specs
The machine that I used for the benchmarks is a dual-core machine with the following CPU configuration: 2x Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz. The amount of memory installed is 8G, and the MySQL datadir is on a 4-disk software RAID5 volume; the disks are 5.4K RPM. The filesystem used is XFS, and the OS installed is CentOS 6.2.
Table Structure
Before moving on, let’s take a look at the structure of the tables involved in the benchmark tests.
CREATE TABLE `supplier` (
`s_suppkey` int(11) NOT NULL DEFAULT '0',
`s_name` char(25) DEFAULT NULL,
`s_address` varchar(40) DEFAULT NULL,
`s_nationkey` int(11) DEFAULT NULL,
`s_phone` char(15) DEFAULT NULL,
`s_acctbal` decimal(10,2) DEFAULT NULL,
`s_comment` varchar(101) DEFAULT NULL,
PRIMARY KEY (`s_suppkey`),
KEY `i_s_nationkey` (`s_nationkey`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `lineitem` (
`l_orderkey` int(11) NOT NULL DEFAULT '0',
`l_partkey` int(11) DEFAULT NULL,
`l_suppkey` int(11) DEFAULT NULL,
`l_linenumber` int(11) NOT NULL DEFAULT '0',
`l_quantity` decimal(10,2) DEFAULT NULL,
`l_extendedprice` decimal(10,2) DEFAULT NULL,
`l_discount` decimal(10,2) DEFAULT NULL,
`l_tax` decimal(10,2) DEFAULT NULL,
`l_returnflag` char(1) DEFAULT NULL,
`l_linestatus` char(1) DEFAULT NULL,
`l_shipDATE` date DEFAULT NULL,
`l_commitDATE` date DEFAULT NULL,
`l_receiptDATE` date DEFAULT NULL,
`l_shipinstruct` char(25) DEFAULT NULL,
`l_shipmode` char(10) DEFAULT NULL,
`l_comment` varchar(44) DEFAULT NULL,
PRIMARY KEY (`l_orderkey`,`l_linenumber`),
KEY `i_l_shipdate` (`l_shipDATE`),
KEY `i_l_suppkey_partkey` (`l_partkey`,`l_suppkey`),
KEY `i_l_partkey` (`l_partkey`,`l_quantity`,`l_shipmode`,`l_shipinstruct`),
KEY `i_l_suppkey` (`l_suppkey`),
KEY `i_l_receiptdate` (`l_receiptDATE`),
KEY `i_l_orderkey` (`l_orderkey`),
KEY `i_l_orderkey_quantity` (`l_orderkey`,`l_quantity`),
KEY `i_l_commitdate` (`l_commitDATE`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `orders` (
`o_orderkey` int(11) NOT NULL DEFAULT '0',
`o_custkey` int(11) DEFAULT NULL,
`o_orderstatus` char(1) DEFAULT NULL,
`o_totalprice` decimal(10,2) DEFAULT NULL,
`o_orderDATE` date DEFAULT NULL,
`o_orderpriority` char(15) DEFAULT NULL,
`o_clerk` char(15) DEFAULT NULL,
`o_shippriority` int(11) DEFAULT NULL,
`o_comment` varchar(79) DEFAULT NULL,
PRIMARY KEY (`o_orderkey`),
KEY `i_o_orderdate` (`o_orderDATE`),
KEY `i_o_custkey` (`o_custkey`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Test Cases
Now let’s see the test cases and then see how the joins perform for each test case.
Test Case A – Join a small table that fits in memory to a large table with no WHERE clause
The SQL used for this test together with its EXPLAIN output as returned by MySQL 5.5 is as follows:
SELECT s_nationkey, l_shipmode, count(*)
FROM supplier INNER JOIN lineitem ON s_suppkey = l_suppkey
GROUP BY s_nationkey, l_shipmode;
+----+-------------+----------+-------+---------------+---------------+---------+-------------------------+-------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+-------+---------------+---------------+---------+-------------------------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | supplier | index | PRIMARY | i_s_nationkey | 5 | NULL | 20266 | 100.00 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | lineitem | ref | i_l_suppkey | i_l_suppkey | 5 | dbt3.supplier.s_suppkey | 250 | 100.00 | Using where |
+----+-------------+----------+-------+---------------+---------------+---------+-------------------------+-------+----------+----------------------------------------------+
And the results in seconds of time taken to complete the above query:
First thing to note is that I have scaled down the time taken by MySQL 5.5 to finish the query on the IO-bound workload so that it fits in the chart; in actuality the query took 32077 seconds to finish. We can clearly see from the above chart that hash join comprehensively beats BKA and BNL; hash join is perfect in cases like this, where you are joining a small table with a very large table and there are no indexed WHERE conditions on the big table. BNLH takes half the time to complete the query for the in-memory workload, 6.6x less time compared to BKA in MySQL 5.6, and 965x less time compared to BNL in MySQL 5.5. So hash join gives us an improvement by a very large factor for both the IO-bound and the in-memory workload.
Test Case B – Join a small table that fits in memory to a large table with a selective WHERE clause on an indexed column
The SQL used for this test together with its EXPLAIN output as returned by MySQL 5.5 is as follows:
SELECT s_nationkey, l_shipmode, count(*)
FROM supplier INNER JOIN lineitem ON s_suppkey = l_suppkey
WHERE s_nationkey='24'
GROUP BY s_nationkey, l_shipmode;
+----+-------------+----------+------+-----------------------+---------------+---------+-------------------------+------+----------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------+-----------------------+---------------+---------+-------------------------+------+----------+-----------------------------------------------------------+
| 1 | SIMPLE | supplier | ref | PRIMARY,i_s_nationkey | i_s_nationkey | 5 | const | 808 | 100.00 | Using where; Using index; Using temporary; Using filesort |
| 1 | SIMPLE | lineitem | ref | i_l_suppkey | i_l_suppkey | 5 | dbt3.supplier.s_suppkey | 292 | 100.00 | Using where |
+----+-------------+----------+------+-----------------------+---------------+---------+-------------------------+------+----------+-----------------------------------------------------------+
And the results in seconds of time taken to complete the above query:
First thing to note is that I have scaled down the time taken by MySQL 5.5 to finish the query on the IO-bound workload so that it fits in the chart; in actuality the query took 1280 seconds to finish. In this test hash join is not ideal, because the highly selective WHERE clause reduces the size of the joined data set; hash join actually performs badly here, taking 7x more time for the in-memory workload. For the IO-bound workload, hash join takes 53x less time to execute the query compared to MySQL 5.5, but slightly more time compared to the BKA algorithm of MySQL 5.6.
Test Case C – Join a small table with a large table with a WHERE clause on a non-indexed column
The SQL used for this test together with its EXPLAIN output as returned by MySQL 5.5 is as follows:
SELECT s_nationkey, l_shipmode, count(*)
FROM supplier INNER JOIN lineitem ON s_suppkey = l_suppkey
WHERE l_shipmode='AIR'
GROUP BY s_nationkey, l_shipmode;
+----+-------------+----------+-------+---------------+---------------+---------+-------------------------+-------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+-------+---------------+---------------+---------+-------------------------+-------+----------+-------------+
| 1 | SIMPLE | supplier | index | PRIMARY | i_s_nationkey | 5 | NULL | 20174 | 100.00 | Using index |
| 1 | SIMPLE | lineitem | ref | i_l_suppkey | i_l_suppkey | 5 | dbt3.supplier.s_suppkey | 270 | 100.00 | Using where |
+----+-------------+----------+-------+---------------+---------------+---------+-------------------------+-------+----------+-------------+
And the results in seconds of time taken to complete the above query:
First thing to note is that I have scaled down the time taken by MySQL 5.5 to finish the query on the IO-bound workload so that it fits in the chart; in actuality the query took 31654 seconds to finish. Again, hash join beats BKA and BNL comprehensively. Hash join outperforms the other join types when you are joining a small table with a very large table with a WHERE clause on a non-indexed column. In this test we can clearly see that Hash Join gives a large reduction in query time: for the IO-bound workload the reduction is 1266x compared to MySQL 5.5 and 9x compared to MySQL 5.6, while for the in-memory workload the reduction is 3.5x compared to both MySQL 5.5 and MySQL 5.6.
Another interesting thing to note is that in both Test B and Test C, Hash Join takes a similar amount of time for the IO-bound and the in-memory workload. This is because Hash Join implies scanning the table lineitem (right operand) in both test cases. Since in Test B we have a limited set of rows in the supplier table (left operand) to join to the lineitem table (right operand), scanning the lineitem table (BNLH) proves to be costly compared to doing batched index lookups (BKA). However, in Test C the cost of hash join remains the same, but the cost of BKA increases, as many more random index lookups need to be performed because of the increase in the number of rows to be joined from the supplier table (left operand).
Test Case D – Join a large data set (>1M rows) from one table with a large table
The SQL used for this test together with its EXPLAIN output as returned by MySQL 5.5 is as follows:
SELECT o.*, count(*) as num_items
FROM orders AS o INNER JOIN lineitem AS l ON o_orderkey=l_orderkey
WHERE o_orderdate > '1996-05-01' GROUP BY o_orderkey
ORDER BY num_items DESC LIMIT 10;
+----+-------------+-------+-------+--------------------------------------------+---------+---------+-------------------+---------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+-------+--------------------------------------------+---------+---------+-------------------+---------+----------+----------------------------------------------+
| 1 | SIMPLE | o | index | PRIMARY,i_o_orderdate | PRIMARY | 4 | NULL | 2993459 | 50.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | l | ref | PRIMARY,i_l_orderkey,i_l_orderkey_quantity | PRIMARY | 4 | dbt3.o.o_orderkey | 1 | 100.00 | Using index |
+----+-------------+-------+-------+--------------------------------------------+---------+---------+-------------------+---------+----------+----------------------------------------------+
And the results in seconds of time taken to complete the above query:
Here we can clearly see that MySQL 5.5 beats both BKA of MySQL 5.6 and Hash Join of MariaDB 5.5. In the IO-bound test, MySQL 5.5 takes 2.5x less time to complete the query compared to the BKA algorithm of MySQL 5.6, and 6.6x less time compared to MariaDB 5.5 Hash Join; Hash Join also performs worse than BKA, taking 2.5x more time. For the in-memory workload test, MySQL 5.5 takes 5x less time compared to MySQL 5.6 and 13x less time compared to MariaDB 5.5 Hash Join.

First thing to note is that the above query reads 1/3 of the rows in the table orders (left operand), so MySQL 5.5 prefers to do a PRIMARY index scan of the table orders, resulting in sequential IO, while both MySQL 5.6 and MariaDB 5.5 prefer to do an index range scan on the secondary key o_orderdate, which results in random scans of the PK to fetch the columns that are not part of the secondary key. Even though MySQL 5.6 uses MRR to offset the effect of random access of the PK, it still proves to be costly. Also note that the table lineitem is joined on the column l_orderkey, which is the left-most PK column, so reading the table orders in PK order has the added benefit of implying that the table lineitem is also read in PK order. These benefits are why MySQL 5.5 wins.

But why does Hash Join take so much more time? The reason is that the amount of data to be read from the left operand, the table orders, is far greater than the size of the join buffer. The join buffer is 32M, while the size of the left operand is 186M, which means roughly 6 scans of the right operand, the table lineitem. Hash join is slow in this case because the join buffer has to be refilled with rows from the orders table many times, and hash join is not a good choice when many scans of the right operand (here, the table lineitem) are needed.
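The arithmetic behind that "roughly 6 scans" figure is simple: the number of probe passes over the right operand equals the number of join-buffer refills. A quick sketch:

```python
import math

def right_operand_scans(left_operand_bytes, join_buffer_bytes):
    # one probe pass over the right operand per join-buffer refill
    return max(1, math.ceil(left_operand_bytes / join_buffer_bytes))

# Test Case D: ~186M of `orders` data vs the 32M join buffer
scans = right_operand_scans(186 * 1024**2, 32 * 1024**2)  # -> 6
```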
Another difference with the query in this test case: while with the queries in the previous test cases the joining key from the table supplier would match approximately 600 rows from the table lineitem for each distinct key value, in Test Case D the joining key from the table orders matches approximately 5 rows from the table lineitem for each distinct key value. Also, the joining key in Test Case D is the PK in one table and the left-most part of the PK in the second table.
How does optimizer work with the different Join Algorithms available?
Currently, the part of the optimizer that is responsible for choosing the join algorithm for a particular query and QEP is not advanced enough, and there is still work to be done. As I understand it, the MariaDB folks are working on cost-based choice of join algorithm. It's not easy, because the current cost model is primitive and must be enhanced to account for the existence of different join algorithms. So what does that mean for MariaDB/MySQL users right now? With the current state of the optimizer, you have to manually enable and disable the join algorithms for the optimizer to choose from.
In MariaDB, every algorithm has a number given to it:
1 – flat BNL
2 – incremental BNL
3 – flat BNLH
4 – incremental BNLH
5 – flat BKA
6 – incremental BKA
7 – flat BKAH
8 – incremental BKAH
The variable join_cache_level controls which algorithms are enabled: if join_cache_level=4, all algorithms numbered 1 to 4 are enabled; if join_cache_level=8, all algorithms numbered 1 to 8 are enabled. The optimizer is naive in the sense that it always uses the highest-numbered enabled join algorithm. If join_cache_level=4 it always uses BNLH (hash join); if join_cache_level=8 it always uses BKAH (a variant of BKA). The optimizer does not try to check which algorithm is the best one to use; it simply assumes that the algorithm with the highest numeric value is the best one.
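These semantics can be sketched as a toy Python mapping; this is just an illustration of the rule, not MariaDB code:

```python
# Mapping from the numbered list above: level N enables algorithms 1..N,
# and the optimizer simply picks the highest-numbered enabled one.
JOIN_ALGORITHMS = {
    1: "flat BNL",
    2: "incremental BNL",
    3: "flat BNLH",
    4: "incremental BNLH",
    5: "flat BKA",
    6: "incremental BKA",
    7: "flat BKAH",
    8: "incremental BKAH",
}

def enabled_algorithms(join_cache_level):
    return [JOIN_ALGORITHMS[n] for n in range(1, join_cache_level + 1)]

def chosen_algorithm(join_cache_level):
    # the naive choice described above: always the max enabled algorithm
    return JOIN_ALGORITHMS[join_cache_level]
```

So with join_cache_level=4 the optimizer lands on BNLH, exactly as in the benchmarks above.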
So we can force the join algorithm by setting an appropriate value of “join_cache_level”. For example, in my tests I forced the optimizer to use hash join by setting join_cache_level=4. We can establish rules for which join algorithms work best for which kinds of queries, and then pick an algorithm by making use of the variable “join_cache_level”.
Conclusion
Based on the above information and the benchmark results for the different test cases, we can see where Hash Joins work best and where they don't. First of all, Hash Joins only work for equijoins. Hash Joins work best when you are joining very big tables with no WHERE clause, or with a WHERE clause on a non-indexed column. They also provide a big improvement in query response time when you are joining tables with no indexes on the join condition (full join). The best performance with Hash Join is achieved when the left table fits completely in the join buffer, or when the fewest buffer refills are needed, as each buffer refill means a scan of the right-side table. However, Hash Joins do not outperform BNL or BKA when you are joining a really small subset of rows, as then scanning the right-side table becomes costly in comparison. Block Nested Loop Join can perform better than Hash Join when you are joining two tables on a PK column such that both tables are read in PK order. One use case I can think of for Hash Joins is data warehouse applications that run reporting queries joining on lookup tables, which tend to be small. What use cases can you think of?