MySQL 5.5 and Percona Server 5.5 do not solve all scalability problems, even for read-only workloads. Workloads which have received a lot of attention, such as Sysbench and DBT2/TPC-C, scale pretty well, but there are other quite typical workloads which do not scale nearly as well. This is why it is important to test performance and scalability of your application on your hardware if you want results that are most relevant to your situation.
In this example I spotted the query pattern responsible for a large portion of the application load by aggregating the query log with mk-query-digest. I then filtered out only the queries matching this pattern, which gave me a simplified slice of the production workload: it focuses on a single query, yet mimics the real-world distribution of values.
The query looks something like this:
SELECT * FROM table WHERE deleted = 0 AND group_id IN (62715996, 62716592, 62717660, 62717716, 62717722, 62717728, 62721292, 62722236, 62722238, 62722240, 62722282, 62722284, 62722286, 62722288, 62722292, 62722294, 62728432, 62728434, 62728436, 62728438, 62728440, 62728442, 62728596, 62728688, 62728882, 62728884, 62728886, 62728888, 62728890, 62728892, 62729006, 62729008, 62729010, 62729012, 62729014, 62729042, 62729044, 62729068, 62729070, 62729072, 62729224, 62729226, 62729382, 62729560, 62729564, 62729618, 62729620, 62729624, 62729640, 62729642, 62729644, 62729750, 62729752, 62729754, 62729756, 62729760, 62729762, 62730046, 62730050, 62730052, 62730054, 62730268, 62730270, 62730272, 62730274, 62730276, 62730278, 62730280, 62730284, 62730286, 62730324, 62730326, 62730668, 62730672, 62730676, 62730678, 62730682, 62730686, 62730688, 62730692, 62730696, 62730710, 62730724, 62732448, 62733114, 62734150, 62734442, 62734446, 62734448, 62734456, 62734460, 62734682, 62734684, 62734842, 62734844, 62734846, 62734848, 62735354, 62735356, 62735358, 62736110, 62736112, 62736114, 62736116, 62736118, 62736120, 62737096, 62737098, 62737100, 62737102, 62737104, 62737106, 62737114, 62737118, 62737120, 62737122, 62737124, 62737170, 62737172, 62737174, 62737198, 62737200, 62737226, 62737240, 62737242, 62737244, 62737246, 62737250, 62737252, 62737254, 62737368, 62737370, 62737372, 62737374, 62737376, 62737378, 62737382, 62737384, 62737650, 62737652, 62737654, 62737922, 62737924, 62737926, 62737928, 62737930, 62737932, 62737934, 62737936, 62739648, 62739652, 62739654, 62739658, 62739662, 62739668, 62739680, 62739844, 62739856, 62739860, 62739868, 62740030, 62740036, 62740042, 62740044, 62740046, 62740154, 62740156, 62740160, 62740164, 62740170, 62740386, 62740388, 62740390, 62740392, 62740394, 62740396, 62740398, 62740400, 62740402, 62740404, 62740406, 62740408, 62740964, 62740966, 62740968, 62740970, 62740972, 62742370, 62742374, 62742380, 62742384, 62742388, 62742394, 62742398, 
62742402, 62743720, 62743962, 62744142, 62744144, 62744146)
There is a secondary key defined on (group_id, deleted).
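For reference, here is a minimal sketch of what such a table could look like. The real table name and column list are not shown in the post, so everything except the (group_id, deleted) secondary key is an assumption:

```sql
-- Hypothetical DDL matching the query pattern above;
-- the real table has more columns, but the index shape is what matters.
CREATE TABLE groups_data (
  id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  group_id BIGINT UNSIGNED NOT NULL,
  deleted  TINYINT NOT NULL DEFAULT 0,
  payload  VARCHAR(255),
  PRIMARY KEY (id),
  KEY group_deleted (group_id, deleted)
) ENGINE=InnoDB;
```

With this index, the optimizer turns the IN list into one tight index range per value, so a query with roughly 200 values in the list performs roughly 200 separate index dives, which is consistent with btr_cur_search_to_nth_level showing up at the top of the profile below.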
This pattern of medium-size IN lists is typical for a lot of modern applications. The system I used for the test was a 4-socket R910 with X7542 CPUs, giving 24 real cores. The workload was completely in memory in this test. I compared Percona Server 5.1.56-12.7 and Percona Server 5.5.11-20.2, with the latter configured with innodb_buffer_pool_instances=16 and a 400GB buffer pool. Here are the results:
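For anyone trying to reproduce a similar setup, the two settings mentioned above would go into my.cnf like this (only these two values come from the post; treating everything else as defaults is an assumption):

```ini
# Percona Server 5.5 run; innodb_buffer_pool_instances
# is a 5.5 feature and is not available in 5.1.
[mysqld]
innodb_buffer_pool_size      = 400G
innodb_buffer_pool_instances = 16
```

Splitting the buffer pool into 16 instances reduces contention on the buffer pool mutex, which matters at this level of concurrency.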
First, it is worth saying that we see great improvements for this workload in Percona Server 5.5, showing up to 2.5 times better performance, or 150%, which makes for a great headline number. If we look at the details, however, we can see this is an improvement from a complete disaster to something still absolutely unacceptable. If we look at the gain from single-thread performance to peak performance, the number is about 2.3x for Percona Server 5.1 and around 6x for Percona Server 5.5, which is far too little for what you would expect from a 24-core system. For comparison, you can check the Sysbench results which Vadim published: for a read-only workload we get a 17x gain from 1 thread to peak performance on a system with 12 physical cores and 24 threads, which is almost perfect scalability.
So there are workloads where the problem is solved, and there are others where a lot of work still needs to be done and where you still cannot effectively make use of more than 8 cores (which corresponds to a single CPU socket these days).
Here are some more details. The top functions from oprofile:
867964 15.1181 mysqld btr_cur_search_to_nth_level
792651 13.8063 mysqld btr_search_guess_on_hash
716506 12.4800 mysqld mutex_spin_wait
467974 8.1511 mysqld rec_get_offsets_func
376898 6.5648 mysqld buf_page_get_gen
The most frequent stack traces from PMP (Poor Man's Profiler):
10 libaio::??,os_aio_linux_collect,os_aio_linux_handle,fil_aio_wait,io_handler_thread,start_thread,clone
5 pthread_cond_wait,os_cond_wait,os_event_wait_low,sync_array_wait_event,rw_lock_s_lock_spin,pfs_rw_lock_s_lock_func,btr_search_build_page_hash_index,btr_search_info_update_slow,btr_search_info_update,btr_cur_search_to_nth_level,btr_pcur_open_with_no_init_func,row_sel_get_clust_rec_for_mysql,row_search_for_mysql,ha_innobase::index_read,handler::read_range_first,handler::read_multi_range_next,QUICK_RANGE_SELECT::get_next,rr_quick,sub_select,do_select,JOIN::exec,mysql_select,handle_select,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
4 pthread_cond_wait,os_cond_wait,os_event_wait_low,sync_array_wait_event,mutex_spin_wait,mutex_enter_func,pfs_mutex_enter_func,buf_page_get_mutex_enter,buf_page_get_gen,btr_cur_search_to_nth_level,btr_pcur_open_with_no_init_func,row_search_for_mysql,ha_innobase::index_read,handler::read_range_first,handler::read_multi_range_next,QUICK_RANGE_SELECT::get_next,rr_quick,sub_select,do_select,JOIN::exec,mysql_select,handle_select,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
2 pthread_cond_wait,os_cond_wait,os_event_wait_low,sync_array_wait_event,rw_lock_x_lock_func,pfs_rw_lock_x_lock_func,btr_search_build_page_hash_index,btr_search_info_update_slow,btr_search_info_update,btr_cur_search_to_nth_level,btr_pcur_open_with_no_init_func,row_sel_get_clust_rec_for_mysql,row_search_for_mysql,ha_innobase::index_read,handler::read_range_first,handler::read_multi_range_next,QUICK_RANGE_SELECT::get_next,rr_quick,sub_select,do_select,JOIN::exec,mysql_select,handle_select,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
2 pthread_cond_wait,os_cond_wait,os_event_wait_low,sync_array_wait_event,rw_lock_s_lock_spin,pfs_rw_lock_s_lock_func,btr_search_build_page_hash_index,btr_search_info_update_slow,btr_search_info_update,btr_cur_search_to_nth_level,btr_pcur_open_with_no_init_func,row_sel_get_clust_rec_for_mysql,row_search_for_mysql,ha_innobase::general_fetch,handler::read_multi_range_next,QUICK_RANGE_SELECT::get_next,rr_quick,sub_select,do_select,JOIN::exec,mysql_select,handle_select,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone
P.S. Disabling the adaptive hash index makes things a lot worse for this workload, and innodb_adaptive_hash_index_partitions=16, which is available in Percona Server for partitioning the adaptive hash index, does not help because everything hits the same index.
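For reference, this is how the two adaptive hash index knobs mentioned above are set. The first is dynamic; the second is Percona Server specific and can only be set at server startup:

```sql
-- Dynamic variable, available in both MySQL and Percona Server
-- (turning it off made this workload worse, per the P.S. above):
SET GLOBAL innodb_adaptive_hash_index = OFF;

-- Percona Server only, startup-time my.cnf setting, not dynamic:
-- innodb_adaptive_hash_index_partitions = 16
```

Partitioning the AHI spreads contention across multiple latches, but since it partitions by index, a workload hammering a single index still serializes on one partition.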