Working on a test case with sysbench, I encountered this:
mysql> select * from sbtest1 where id = round(rand()*10000, 0); +------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+ | id | k | c | pad | +------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+ | 179 | 499871 | 09833083632-34593445843-98203182724-77632394229-31240034691-22855093589-98577647071-95962909368-34814236148-76937610370 | 62233363025-41327474153-95482195752-11204169522-13131828192 | | 1606 | 502031 | 81212399253-12831141664-41940957498-63947990218-16408477860-15124776228-42269003436-07293216458-45216889819-75452278174 | 25423822623-32136209218-60113604068-17409951653-00581045257 | +------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+ 2 rows in set (0.30 sec)
I was really surprised. First, and the most important, id is a primary key and the rand() function should produce just one value. How come it returns two rows? Second, why is the response time 0.30 sec? That seems really high for a primary key access.
Looking further:
CREATE TABLE `sbtest1` ( `id` int(11) NOT NULL AUTO_INCREMENT, `k` int(11) NOT NULL DEFAULT '0', `c` char(120) NOT NULL DEFAULT '', `pad` char(60) NOT NULL DEFAULT '', PRIMARY KEY (`id`), KEY `k_1` (`k`) ) ENGINE=InnoDB AUTO_INCREMENT=1000001 DEFAULT CHARSET=latin1 mysql> explain select * from sbtest1 where id = round(rand()*10000, 0); +----+-------------+---------+------------+------+---------------+------+---------+------+--------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+---------+------------+------+---------------+------+---------+------+--------+----------+-------------+ | 1 | SIMPLE | sbtest1 | NULL | ALL | NULL | NULL | NULL | NULL | 986400 | 10.00 | Using where | +----+-------------+---------+------------+------+---------------+------+---------+------+--------+----------+-------------+
So it is a primary key, but MySQL does not use an index, and it returns two rows. Is this a bug?
Deterministic vs nondeterministic functions
Turned out it is not a bug at all. It is pretty logical behavior from MySQL, but it is not what we would expect. First, why a full table scan? Well, rand() is nondeterministic function. That means we do not know what it will return ahead of time, and actually that is exactly the purpose of rand() – to return a random value. In this case, it is only logical to evaluate the function for each row, each time, and compare the results. i.e. in our case
- Read row 1, get the value of id, evaluate the value of RAND(), compare
- Proceed using the same algorithm with the remaining rows.
In other words, as the value of rand() is not known (not evaluated) beforehand, so we can’t use an index.
And in this case – rand() function – we have another interesting consequence. For larger tables with an auto_increment primary key, the probability of matching the rand() value and the auto_increment value is higher, so we can get multiple rows back. In fact, if we read the whole table from the beginning and keep comparing the auto_inc sequence with “the roll of the dice”, we can get many rows back.
That behavior is totally counter-intuitive. Nevertheless, to me, it’s also the only correct behavior.
We expect to have the rand() function evaluated before running the query. This can actually be achieved by assigning rand() to a variable:
mysql> set @id=round(rand()*10000, 0); select @id; select * from sbtest1 where id = @id; Query OK, 0 rows affected (0.00 sec) +------+ | @id | +------+ | 6068 | +------+ 1 row in set (0.00 sec) +------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+ | id | k | c | pad | +------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+ | 6068 | 502782 | 84971025350-12845068791-61736600622-38249796467-85706778555-74134284808-24438972515-17848828748-86869270666-01547789681 | 17507194006-70651503059-23792945260-94159543806-65683812344 | +------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+ 1 row in set (0.00 sec) mysql> explain select * from sbtest1 where id = @id; +----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------+ | 1 | SIMPLE | sbtest1 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | NULL | +----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------+ 1 row in set, 1 warning (0.01 sec)
This would meet our expectations.
There are (at least) two bug reports filed, with very interesting discussion:
- rand() used in scalar functions returns multiple rows
- SELECT on PK with ROUND(RAND()) give wrong errors
Other databases
I wanted to see how it works in other SQL databases. In PostgreSQL, the behavior is exactly the same as MySQL:
postgres=# select * from t2 where id = cast(random()*10000 as int); id | c ------+--------- 4093 | asdasda 9378 | asdasda (2 rows) postgres=# select * from t2 where id = cast(random()*10000 as int); id | c ------+--------- 5988 | asdasda 6674 | asdasda (2 rows) postgres=# explain select * from t2 where id = cast(random()*10000 as int); QUERY PLAN -------------------------------------------------------------------- Seq Scan on t2 (cost=0.00..159837.60 rows=1 width=12) Filter: (id = ((random() * '10000'::double precision))::integer) (2 rows)
And SQLite seems different, evaluating the random() function beforehand:
sqlite> select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int); 16239|asdsadasdsa sqlite> select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int); 32910|asdsadasdsa sqlite> select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int); 58658|asdsadasdsa sqlite> explain select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int); addr opcode p1 p2 p3 p4 p5 comment ---- ------------- ---- ---- ---- ------------- -- ------------- 0 Init 0 12 0 00 Start at 12 1 OpenRead 0 30182 0 2 00 root=30182 iDb=0; t2 2 Function0 0 0 3 random(0) 00 r[3]=func(r[0]) 3 Cast 3 69 0 00 affinity(r[3]) 4 Function0 0 3 2 abs(1) 01 r[2]=func(r[3]) 5 Divide 4 2 1 00 r[1]=r[2]/r[4] 6 Cast 1 68 0 00 affinity(r[1]) 7 SeekRowid 0 11 1 00 intkey=r[1]; pk 8 Copy 1 5 0 00 r[5]=r[1] 9 Column 0 1 6 00 r[6]=t2.c 10 ResultRow 5 2 0 00 output=r[5..6] 11 Halt 0 0 0 00 12 Transaction 0 0 2 0 01 usesStmtJournal=0 13 Int64 0 4 0 92233720368547 00 r[4]=92233720368547 14 Goto 0 1 0 00
Conclusion
Be careful when using MySQL nondeterministic functions in a “where” condition – rand() is the most interesting example – as their behavior may surprise you. Many people believe this to be a bug that should be fixed. Let me know in the comments: do you think it is a bug or not (and why)? I would also be interested to know how it works in other, non-opensource databases (Microsoft SQL Server, Oracle, etc)
PS: Finally, I’ve got a “clever” idea – what if I “trick” MySQL by using the deterministic keyword…
MySQL stored functions: deterministic vs not deterministic
So, I wanted to see how it works with MySQL stored functions if they are assigned “deterministic” and “not deterministic” keywords. First, I wanted to “trick” mysql and pass the deterministic to the stored function but use rand() inside. Ok, this is not what you really want to do!
DELIMITER $$ CREATE FUNCTION myrand() RETURNS INT DETERMINISTIC BEGIN RETURN round(rand()*10000, 0); END$$ DELIMITER ;
From MySQL manual about MySQL stored routines we can read:
Assessment of the nature of a routine is based on the “honesty” of the creator: MySQL does not check that a routine declared
DETERMINISTIC
is free of statements that produce nondeterministic results. However, misdeclaring a routine might affect results or affect performance. Declaring a nondeterministic routine asDETERMINISTIC
might lead to unexpected results by causing the optimizer to make incorrect execution plan choices. Declaring a deterministic routine asNONDETERMINISTIC
might diminish performance by causing available optimizations not to be used.
The result is interesting:
mysql> select myrand(); +----------+ | myrand() | +----------+ | 4202 | +----------+ 1 row in set (0.00 sec) mysql> select myrand(); +----------+ | myrand() | +----------+ | 7548 | +----------+ 1 row in set (0.00 sec) mysql> explain select * from t2 where id = myrand()\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: NULL partitions: NULL type: NULL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: NULL filtered: NULL Extra: Impossible WHERE noticed after reading const tables 1 row in set, 1 warning (0.00 sec) mysql> show warnings; +-------+------+--------------------------------------------------------------------------------+ | Level | Code | Message | +-------+------+--------------------------------------------------------------------------------+ | Note | 1003 | /* select#1 */ select '2745' AS `id`,'asasdas' AS `c` from `test`.`t2` where 0 | +-------+------+--------------------------------------------------------------------------------+ 1 row in set (0.00 sec) mysql> select * from t2 where id = 4202; +------+---------+ | id | c | +------+---------+ | 4202 | asasdas | +------+---------+ 1 row in set (0.00 sec) mysql> select * from t2 where id = 2745; +------+---------+ | id | c | +------+---------+ | 2745 | asasdas | +------+---------+ 1 row in set (0.00 sec)
So MySQL optimizer detected the problem (somehow).
If I use the NOT DETERMINISTIC keyword, then MySQL works the same as when using the rand() function:
DELIMITER $$ CREATE FUNCTION myrand2() RETURNS INT NOT DETERMINISTIC BEGIN RETURN round(rand()*10000, 0); END$$ DELIMITER ; mysql> explain select * from t2 where id = myrand2()\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: t2 partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 262208 filtered: 10.00 Extra: Using where 1 row in set, 1 warning (0.00 sec)
—
Photo by dylan nolte on Unsplash