There was a discussion on LinkedIn one month ago that caught my eye:
Database search by “within x number of miles” radius?
Anyone out there created a zipcode database and created a “search within x numer of miles” function ?
Thankful for any tips you can throw my way..J
A few people commented that some solutions wouldn’t scale. To understand why these sorts of geographic search queries are problematic in MySQL, it’s best to show some execution plans on dummy data:
-
EXPLAIN SELECT * FROM coordinates FORCE INDEX (x_y_col_a) WHERE x BETWEEN 30 AND 40
-
AND y BETWEEN 50 AND 60 AND col_a = ‘set1’;
-
+—-+————-+————-+——-+—————+———–+———+——+——+————-+
-
| id | select_type | TABLE | type | possible_keys | KEY | key_len | ref | rows | Extra |
-
+—-+————-+————-+——-+—————+———–+———+——+——+————-+
-
| 1 | SIMPLE | coordinates | range | x_y_col_a | x_y_col_a | 38 | NULL | 4032 | USING WHERE |
-
+—-+————-+————-+——-+—————+———–+———+——+——+————-+
-
1 row IN SET (0.00 sec)
-
-
EXPLAIN SELECT * FROM coordinates FORCE INDEX (x_y_col_a) WHERE x BETWEEN 30 AND 40;
-
+—-+————-+————-+——-+—————+———–+———+——+——+————-+
-
| id | select_type | TABLE | type | possible_keys | KEY | key_len | ref | rows | Extra |
-
+—-+————-+————-+——-+—————+———–+———+——+——+————-+
-
| 1 | SIMPLE | coordinates | range | x_y_col_a | x_y_col_a | 3 | NULL | 4032 | USING WHERE |
-
+—-+————-+————-+——-+—————+———–+———+——+——+————-+
-
1 row IN SET (0.01 sec)
-
-
SELECT count(*) FROM coordinates FORCE INDEX (x_y_col_a) WHERE x BETWEEN 30 AND 40
-
AND y BETWEEN 50 AND 60 AND col_a = ‘set1’;
-
+———-+
-
| count(*) |
-
+———-+
-
| 1 |
-
+———-+
-
1 row IN SET (0.00 sec)
-
-
SELECT count(*) FROM coordinates FORCE INDEX (x_y_col_a) WHERE x BETWEEN 30 AND 40;
-
+———-+
-
| count(*) |
-
+———-+
-
| 1664 |
-
+———-+
-
1 row IN SET (0.01 sec)
Did you notice that we estimate just as many rows on the first EXPLAIN as the second one? That doesn’t make any sense! The index covers x,y and col_a and should be eliminating a lot of searching, since there is only one row which meets this condition!
The reason for this is simply a missing feature of the MySQL optimizer – and it has to do with using x BETWEEN 30 and 40 (and it’s also true with x >= 30 AND x <= 40). Using a range like this prevents us from using the rest of the index. There is a workaround, but it’s not pretty:
-
EXPLAIN SELECT * FROM coordinates WHERE x IN
-
(30.30,30.61,31.18,31.26,31.72,32.11,32.25,32.30,32.42,32.91,33.27,
-
33.69,33.79,33.93,34.62,34.78,35.10,35.41,36.62,36.93,37.17,38.93,39.20,
-
39.56,39.84,39.87) AND y IN (59.58,56.81,57.27,54.14,56.43,51.87,54.59,
-
59.56,57.42,54.13,56.79,59.45) AND col_a = ‘set1’;
-
+—-+————-+————-+——-+—————+———–+———+——+——+————-+
-
| id | select_type | TABLE | type | possible_keys | KEY | key_len | ref | rows | Extra |
-
+—-+————-+————-+——-+—————+———–+———+——+——+————-+
-
| 1 | SIMPLE | coordinates | range | x_y_col_a | x_y_col_a | 38 | NULL | 312 | USING WHERE |
-
+—-+————-+————-+——-+—————+———–+———+——+——+————-+
-
1 row IN SET (0.00 sec)
The ugliest thing about this, is that in real life you wouldn’t know all your possible values of X or Y, and so you may end up with a very big IN list. The workaround to this, is to create steppings of the value X and Y that we can use for indexes:
-
ALTER TABLE coordinates ADD x_floor INT NOT NULL, ADD y_floor INT NOT NULL, DROP INDEX x_y_col_a,
-
ADD INDEX x_floor_y_floor_col_a (x_floor, y_floor, col_a);
-
-
UPDATE coordinates SET x_floor = FLOOR(x), y_floor = FLOOR(y);
-
-
EXPLAIN SELECT * FROM coordinates WHERE x_floor IN (30,31,32,33,34,35,36,37,38,39,40)
-
AND y_floor IN (50,51,52,53,54,55,56,57,58,59,60) AND col_a = ‘set1’\G
-
*************************** 1. row ***************************
-
id: 1
-
select_type: SIMPLE
-
TABLE: coordinates
-
type: range
-
possible_keys: x_floor_y_floor_col_a
-
KEY: x_floor_y_floor_col_a
-
key_len: 40
-
ref: NULL
-
rows: 121
-
Extra: USING WHERE
-
1 row IN SET (0.00 sec)
Fantastic! The only remaining problem with this query is that it’s not quite identical to our original. In this query 60.79 will be floored to 60 (and erroneously included in our results):
-
SELECT * FROM coordinates WHERE x_floor IN (30,31,32,33,34,35,36,37,38,39,40)
-
AND y_floor IN (50,51,52,53,54,55,56,57,58,59,60) AND col_a = ‘set1’;
-
+—–+——-+——-+——-+——-+———+———+
-
| id | x | y | col_a | col_b | x_floor | y_floor |
-
+—–+——-+——-+——-+——-+———+———+
-
| 144 | 33.79 | 54.59 | set1 | NULL | 33 | 54 |
-
| 38 | 39.20 | 60.79 | set1 | NULL | 39 | 60 |
-
+—–+——-+——-+——-+——-+———+———+
-
2 rows IN SET (0.00 sec)
However, this is a quick fix by re-including the original WHERE conditions (we are just no longer using an index on them):
-
EXPLAIN SELECT * FROM coordinates WHERE x_floor IN (30,31,32,33,34,35,36,37,38,39,40)
-
AND y_floor IN (50,51,52,53,54,55,56,57,58,59,60)
-
AND col_a = ‘set1’ AND x BETWEEN 30 AND 40 AND y BETWEEN 50 AND 60\G
-
*************************** 1. row ***************************
-
id: 1
-
select_type: SIMPLE
-
TABLE: coordinates
-
type: range
-
possible_keys: x_floor_y_floor_col_a
-
KEY: x_floor_y_floor_col_a
-
key_len: 40
-
ref: NULL
-
rows: 121
-
Extra: USING WHERE
-
1 row IN SET (0.00 sec)
-
-
SELECT * FROM coordinates WHERE x_floor IN (30,31,32,33,34,35,36,37,38,39,40)
-
AND y_floor IN (50,51,52,53,54,55,56,57,58,59,60)
-
AND col_a = ‘set1’ AND x BETWEEN 30 AND 40 AND y BETWEEN 50 AND 60;
-
+—–+——-+——-+——-+——-+———+———+
-
| id | x | y | col_a | col_b | x_floor | y_floor |
-
+—–+——-+——-+——-+——-+———+———+
-
| 144 | 33.79 | 54.59 | set1 | NULL | 33 | 54 |
-
+—–+——-+——-+——-+——-+———+———+
-
1 row IN SET (0.00 sec)
Conclusion:
My examples were only on a small amount of data (16 000 rows) that fitted in memory, but the original query would have full table scanned if I didn’t use a FORCE INDEX hint. Add more data, and if X can’t filter out enough rows by itself this can create a real problem.
Workarounds are all very good, but they also make applications more difficult to maintain. If you really want to do these types of queries, you should give Sphinx a try.
Entry posted by Morgan Tocker |
6 comments