This Week in Data with Colin Charles 42: Security Focus on Redis and Docker a Timely Reminder to Stay Alert

Colin Charles

Colin CharlesJoin Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Much of last week, there was a lot of talk around this article: New research shows 75% of ‘open’ Redis servers infected. It turns out, it helps that one should always read beyond the headlines because they tend to be more sensationalist than you would expect. From the author of Redis, I highly recommend reading Clarifications on the Incapsula Redis security report, because it turns out that in this case, it is beyond the headline. The content is also suspect. Antirez had to write this to help the press (we totally need to help keep reportage accurate).

Not to depart from the Redis world just yet, but Antirez also had some collaboration with the Apple Information Security Team with regards to the Redis Lua subsystem. The details are pretty interesting as documented in Redis Lua scripting: several security vulnerabilities fixed because you’ll note that the Alibaba team also found some other issues. Antirez also ensured that the Redis cloud providers (notably: Redis Labs, Amazon, Alibaba, Microsoft, Google, Heroku, Open Redis and Redis Green) got notified first (and in the comments, compose.io was missing, but now added to the list). I do not know if Linux distributions were also informed, but they will probably be rolling out updates soon.

In the “be careful where you get your software” department: some criminals have figured out they could host some crypto-currency mining software that you would get pre-installed if you used their Docker containers. They’ve apparently made over $90,000. It is good to note that the Backdoored images downloaded 5 million times finally removed from Docker Hub. This, however, was up on the Docker Hub for ten months and they managed to get over 5 million downloads across 17 images. Know what images you are pulling. Maybe this is again more reason for software providers to run their own registries?

James Turnbull is out with a new book: Monitoring with Prometheus. It just got released, I’ve grabbed it, but a review will come shortly. He’s managed all this while pulling off what seems to be yet another great O’Reilly Velocity San Jose Conference.


A quiet week on this front.

Link List

  • INPLACE upgrade from MySQL 5.7 to MySQL 8.0
  • PostgreSQL relevant: What’s is the difference between streaming replication vs hot standby vs warm standby ?
  • A new paper on Amazon Aurora is out: Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes. It was presented at SIGMOD 2018, and an abstract: “One of the more novel differences between Aurora and other relational databases is how it pushes redo processing to a multi-tenant scale-out storage service, purpose-built for Aurora. Doing so reduces networking traffic, avoids checkpoints and crash recovery, enables failovers to replicas without loss of data, and enables fault-tolerant storage that heals without database involvement. Traditional implementations that leverage distributed storage would use distributed consensus algorithms for commits, reads, replication, and membership changes and amplify cost of underlying storage.” Aurora, as you know, avoids distributed consensus under most circumstances. Short 8-page read.
  • Dormando is blogging again, and this was of particular interest — Caching beyond RAM: the case for NVMe. This is done in the context of memcached, which I am certain many use.
  • It is particularly heartening to note that not only does MongoDB use Linkbench for some of their performance testing, they’re also contributing to making it better via a pull request.

Industry Updates

Trying something new here… To cover fundraising, and people on the move in the database industry.

  • Kenny Gorman — who has been on the program committee for several Percona Live conferences, and spoken at the event multiple times before — is the founder and CEO of Eventador, a stream-processing as a service company built on Apache Kafka and Apache Flink, has just raised $3.8 million in funding to fuel their growth. They are also naturally spending this on hiring. The full press release.
  • Jimmy Guerrero (formerly of MySQL and InfluxDB) is now VP Marketing & Community at YugaByte DB. YugaByte was covered in column 13 as having raised $8 million in November 2017.

Upcoming appearances

  • DataOps Barcelona – Barcelona, Spain – June 21-22, 2018 – code dataopsbcn50 gets you a discount
  • OSCON – Portland, Oregon, USA – July 16-19, 2018
  • Percona webinar on Maria Server 10.3 – June 26, 2018


I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

The post This Week in Data with Colin Charles 42: Security Focus on Redis and Docker a Timely Reminder to Stay Alert appeared first on Percona Database Performance Blog.


When it’s faster to use SQL in MySQL NDB Cluster over memcache API

Memcache access for MySQL Cluster (or NDBCluster) provides faster access to the data because it avoids the SQL parsing overhead for simple lookups – which is a great feature. But what happens if I try to get multiple records via memcache API (multi-GET) and via SQL (SELECT with IN())? I’ve encountered this a few times now, so I decided to blog about it. I did a very simple benchmark with the following script:

mysql_cmd="mysql -h${mysql_server} --silent --silent"
function populate_data () {
  $mysql_cmd -e "delete from ${mysql_table};" $mysql_schema > /dev/null 2>&1
  for rec in `seq 1 $nrec`
    $mysql_cmd -e "insert into ${mysql_table} values ($rec, repeat('a',10), 0, 0);" $mysql_schema > /dev/null 2>&1
function mget_via_sql() {
  for rec in `seq 1 $nrec`
    if [ $rec -lt $nrec ]
  start_time=`date +%s%N`
  $mysql_cmd -e "select id,value from ${mysql_table} where id in (${in_list});" ${mysql_schema} > /dev/null 2>&1
  stop_time=`date +%s%N`
  time_ms=`echo "scale=3; $[ $stop_time - $start_time ] /1000 /1000" | bc -l`
  echo -n "${time_ms} "
function mget_via_mc() {
  for rec in `seq 1 $nrec`
    get_str="${get_str} ${mc_prefix}${rec}"
  start_time=`date +%s%N`
  echo "get ${get_str}" | nc $mc_server $mc_port > /dev/null 2>&1
  stop_time=`date +%s%N`
  time_ms=`echo "scale=3; $[ $stop_time - $start_time ] /1000 /1000" | bc -l`
  echo -n "${time_ms} "
function print_header() {
  echo "records mget_via_sql mget_via_mc"
populate_data $records
sleep 10
for records in `seq 1 50`
  echo -n "$records "
  mget_via_sql $records
  mget_via_mc $records

The test table looked like the following.

mysql> show create table percona.memcache_t\G
*************************** 1. row ***************************
       Table: memcache_t
Create Table: CREATE TABLE `memcache_t` (
  `id` int(11) NOT NULL DEFAULT '0',
  `value` varchar(20) DEFAULT NULL,
  `flags` int(11) DEFAULT NULL,
  `cas_value` int(11) DEFAULT NULL,
) ENGINE=ndbcluster DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

The definitions for memcache access in the ndbmemcache schema were the following.

mysql> select * from key_prefixes where key_prefix='mt:';
| server_role_id | key_prefix | cluster_id | policy   | container  |
|              0 | mt:        |          0 | ndb-only | memcache_t |
1 row in set (0.00 sec)
mysql> select * from containers where name='memcache_t';
| name       | db_schema | db_table   | key_columns | value_columns | flags | increment_column | cas_column | expire_time_column | large_values_table |
| memcache_t | percona   | memcache_t | id          | value         | flags | NULL             | cas_value  | NULL               | NULL               |
1 row in set (0.00 sec)
mysql> select * from memcache_server_roles where role_id=1;
| role_name | role_id | max_tps | update_timestamp    |
| db-only   |       1 | 1000000 | 2013-04-07 21:59:02 |
1 row in set (0.00 sec)

I had the following results – the variance is there because I did this benchmark on a cluster running in virtualbox on my workstation, but the trend shows clearly.

ndb memcache multi-get benchmark

The surprising result is that if we fetch 1 or a few records, the memcached protocol access is indeed faster. But the more records we fetch, the speed of the SQL won’t change too much, while the time required to perform the memcache multi-get is proportional with the number of record fetched. This result actually makes sense if we dig deeper. The memcache access can’t use batching, because of the way multi-get is implemented in memcached itself. On the server side, there is simply no multi-get command. The get commands are done in a loop, one by one. With a regular memcache server, one multi-get command will need one network roundtrip between the client and the server. In NDB’s case, for each key access, a roundtrip still has to be made to the storage node, and this overhead is not present in the SQL node’s case (the api and the storage nodes were running on different virtual macines). If we are using the memcache API nodes with caching, the situations gets somewhat better if the key we are looking for is in memcached’s memory (the network roundtrip can be skipped in this case).

Does this mean that memcache API is bad and unusable? I don’t think so. Most workloads, which are in need of the memcache protocol access, will most likely use it for getting one record at a time. It shines there compared to SQL (response time is less than half). This example shows that for the “Which is faster?” question, the correct answer is still, “It depends on the workload.” For most cases, anyhow.

The post When it’s faster to use SQL in MySQL NDB Cluster over memcache API appeared first on MySQL Performance Blog.


Caching could be the last thing you want to do

I recently had a run-in with a very popular PHP ecommerce package which makes me want to voice a recurring mistake I see in how many web applications are architected.

What is that mistake?

The ecommerce package I was working with depended on caching.  Out of the box it couldn’t serve 10 pages/second unless I enabled some features which were designed to be “optional” (but clearly they weren’t).

I think with great tools like memcached it is easy to get carried away and use it as the mallet for every performance problem, but in many cases it should not be your first choice.  Here is why:

  • Caching might not work for all visitors – You look at a page, it loads fast.  But is this the same for every user?  Caching can sometimes be an optimization that makes the average user have a faster experience, but in reality you should be caring more that all users get a good experience (Peter explains why here, talking about six sigma).  In practice it can often be the same user that has all the cache misses, which can make this problem even worse.
  • Caching can reduce visibility – You look at the performance profile of what takes the most time for a page to load and start trying to apply optimization.  The problem is that the profile you are looking at may skew what you should really be optimizing.  The real need (thinking six sigma again) is to know what the miss path costs, but it is somewhat hidden.
  • Cache management is really hard – have you planned for cache stampeding, or many cache items being invalidated at the same time?

What alternative approach should be taken?

Caching should be seen more as a burden that many applications just can’t live without.  You don’t want that burden until you have exhausted all other easily reachable optimizations.

What other optimizations are possible?

Before implementing caching, here is a non-exhaustive checklist to run through:

  • Do you understand every execution plan of every query? If you don’t, set long_query_time=0 and use mk-query-digest to capture queries.  Run them through MySQL’s EXPLAIN command.
  • Do your queries SELECT *, only to use subset of columns?  Or do you extract many rows, only to use a subset? If so, you are extracting too much data, and (potentially) limiting further optimizations like covering indexes.
  • Do you have information about how many queries were required to generate each page? Or more specifically do you know that each one of those queries is required, and that none of those queries could potentially be eliminated or merged?

I believe this post can be summed up as “Optimization rarely decreases complexity. Avoid adding complexity by only optimizing what is necessary to meet your goals.”  – a quote from Justin’s slides on instrumentation-for-php.  In terms of future-proofing design, many applications are better off keeping it simple and (at least initially) refusing the temptation to try and solve some problems “like the big guys do”.

Entry posted by Morgan Tocker |

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com