Mar 08, 2018

Binlog Encryption with Percona Server for MySQL

binlog encryption

In this blog post, we’ll look at how to turn on binlog encryption in Percona Server for MySQL.

Why do I need this?

As you probably know, the binlog in Percona Server for MySQL contains sensitive information. Replication uses the binlog to copy events between servers: it carries all the changes made on one server so they can be applied on another. In other words, anyone with access to a binlog effectively has access to all the data in the server. Moreover, such a person (or "hacker") could build a clone of our server simply by setting up a replica of it, since that gives them our binlog stream. This shows how important protecting the binlog really is: a leaked binlog does not just expose a particular table, tablespace or group of tables to an attacker, it puts literally the whole server at risk. The same applies to the relay log, which is essentially a copy of the binlog on the slave server.

But have no fear, a new feature comes to the rescue: binary log encryption. Starting with Percona Server for MySQL 5.7.20-19 (where it is a beta feature), it is possible to enable binlog encryption for all the binlogs and relay logs produced by the server.

How do you turn it on?

To enable binlog encryption, you need to start the server with --encrypt-binlog=1. This, in turn, requires both --master_verify_checksum and --binlog_checksum to be ON. You also need to install one of the keyring plugins.
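
For reference, here is a minimal configuration sketch reflecting those requirements. The keyring_file paths are only illustrative examples, and binlog_checksum takes CRC32 rather than a literal ON, since CRC32 is its enabled value:

[mysqld]
early-plugin-load=keyring_file.so
keyring_file_data=/var/lib/mysql-keyring/keyring
encrypt-binlog=1
master_verify_checksum=ON
binlog_checksum=CRC32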

From now on all the binlogs and relay logs produced by the server get encrypted. However, for the replication to be safe as a whole the connection between servers also has to be encrypted. See https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-encrypted-connections.html for details on how to do this.
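
As a quick illustration, assuming the CA and client certificates are already in place on the slave, the replication channel can be switched to TLS with something like the following (the certificate paths are placeholders):

mysql> STOP SLAVE;
mysql> CHANGE MASTER TO
    ->   MASTER_SSL=1,
    ->   MASTER_SSL_CA='/etc/mysql/certs/ca.pem',
    ->   MASTER_SSL_CERT='/etc/mysql/certs/client-cert.pem',
    ->   MASTER_SSL_KEY='/etc/mysql/certs/client-key.pem';
mysql> START SLAVE;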

Please note that this alone does not mean that all the logs in our replication topology get encrypted. Remember, you need to turn on encrypt-binlog on the slave servers too, even if they do not produce binlog files, because slave servers still produce relay logs when replicating from a master. Turning on encrypt-binlog on the slave servers ensures that their relay logs get encrypted as well.

How does this work in the big picture?

The master encrypts each event before writing it into the binlog. The slave connects to the master and asks for events. The master decrypts the events from the binary log and sends them over to the slave.

Note that events sent between the master and slave servers are not encrypted! This is why the connection between the master and slave needs to use a secure channel, i.e., TLS.

The slave receives events from the master, encrypts them and writes them down into the relay log.

That is why we need to enable encrypt-binlog on a slave. The relay log has to get encrypted too.

Next, the slave decrypts events from the relay log and applies them. After applying an event, the slave encrypts it and writes it into its own binlog file (provided the binlog is enabled on the slave).

In summary, to make our replication secure, we need to:

  • Turn on encrypt-binlog on the master
  • Turn on encrypt-binlog on the slave
  • Use TLS for the connection between the master and the slave

It’s worth noting that servers in replication have no idea if other servers are encrypted or not.

Why do master_verify_checksum and binlog_checksum need to be turned ON?

This is needed for “authenticated encryption”. Simply put, this is how we make sure that what we decrypt has not been changed by a third party. It also checks whether the key used to decrypt the event was the correct one.

Digging deeper with mysqlbinlog

Mysqlbinlog is a standalone application that lets you read binlog files. As I write this blog post, it is not capable of decrypting binary logs – at least not by itself. However, it can still read encrypted binlog files with the help of a running Percona Server for MySQL. Use the --read-from-remote-server option to read the binary log produced by a given server.

Let’s see what happens when we try to read an encrypted binlog with mysqlbinlog without read-from-remote-server enabled. You will get something like this:

As you can see, it is only possible to read the binary log until event type 9f is read. This event is the Start_encryption_event. After this event, the rest of the binlog is encrypted. One thing to note is that Start_encryption_event is never propagated in replication. For instance, say the master server is run with --encrypt_binlog. This means that the server writes Start_encryption_event to its binary logs. However, it is never sent to the slave server (the slave has no idea whether the master is encrypted).

Another option you can use with mysqlbinlog is --force. It forces mysqlbinlog to read all the events from the binlog, even if they are encrypted. You will see something like this in the output:

As you can see, it is only possible to read the first two events – until the Start_encryption_event. However, this time we can see that there are other events following it, which are encrypted.

Running mysqlbinlog (without --read-from-remote-server) on encrypted binary logs may only make sense if we want to check whether a given binary log is encrypted. For point-in-time recovery, and for other purposes that might require reading an encrypted binlog, we would use mysqlbinlog with the --read-from-remote-server option.

For instance, if we want to read binlog master-bin.000001, and Percona Server for MySQL is running on 127.0.0.1, port 3033, with user:robert, password:hard_password, we would use mysqlbinlog like this:

mysqlbinlog --read-from-remote-server --protocol=tcp --host=127.0.0.1 --port=3033 --user=robert --password=hard_password master-bin.000001

When you look at the output of this command, you see something like this:

You can now see the decrypted binlog. One interesting thing to note here is that we do not see our Start_encryption_event (type 9f). This proves my point: Start_encryption_event never leaves the server (we are reading from the server now, as we use --read-from-remote-server).

For more information how to use mysqlbinlog for point-in-time recovery see https://dev.mysql.com/doc/refman/5.7/en/point-in-time-recovery.html.

However, for more modern approaches for point-in-time recovery that do not use mysqlbinlog and make use of parallel appliers, see here:

Have fun with binlog encryption!

Mar 07, 2018

Percona Live 2018 Featured Talk: Securing Your Data on PostgreSQL with Payal Singh

Payal PostgreSQL 1

Welcome to another interview blog for the rapidly-approaching Percona Live 2018. Each post in this series highlights a Percona Live 2018 featured talk at the conference and gives a short preview of what attendees can expect to learn from the presenter.

This blog post highlights Payal Singh, DBA at OmniTI Computer Consulting Inc. Her talk is titled Securing Your Data on PostgreSQL. There is often a lack of understanding about how best to manage minimum basic application security features – especially with major security features being released with every major version of PostgreSQL. In our conversation, we discussed how Payal works to improve application security using Postgres:

Percona: Who are you, and how did you get into databases? What was your path to your current responsibilities?

Payal: I’m primarily a data addict. I fell in love with databases when it was first taught to me in high school. The declarative SQL syntax was intuitive to me, and efficient compared to other languages I had used (C and C++). I realized that if given the opportunity, I’d choose to become a database administrator. I joined OmniTI in summer of 2012 as a web engineer intern during my Masters, but grabbed the chance to work on an internal database migration project. Working with the DBA team gave me a lot of new insight and exposure, especially into open source databases. The more I learned, the more I loved my job. Right after completing my Masters I joined OmniTI as a full-time database administrator, and never looked back!

Percona: Your talk is titled “Securing Your Data on PostgreSQL”. Why do you think that security (or the lack of it) is such an issue?

Payal: Securing your data is critical. In my experience, the one reason people using commercial databases are apprehensive of switching to open source alternatives is a lack of exposure to security features. If you look at open source databases today, specifically PostgreSQL, it has the most advanced security features: data encryption, auditing and row-level security, to name a few. People don’t know about them, though. As a FOSS project, we don’t have a centralized marketing team to advertise these features to our potential user base, which makes it necessary to spread information through other channels. Speaking about it at a popular conference like Percona Live is one of them!

In addition to public awareness, Postgres is advancing at a lightning pace. With each new major version released every year, a bunch of new security feature additions and major improvements in existing security features are added. So much so that it becomes challenging to keep up with all these features, even for existing Postgres users. My talk on Postgres security aims to inform current as well as prospective Postgres users about the advanced security features that exist and their use case, useful tips to use them, the gotchas, what’s lacking and what’s currently under development.

Percona: Is PostgreSQL better or worse with security and security options than either MySQL or MongoDB? Why?

Payal: I may be a little biased, but I think Postgres is the best database from a security point of view. MySQL is pretty close though! There are quite a few reasons why I consider Postgres to be the best, but I’d like to save that discussion for my talk at Percona Live! For starters though, I think that Postgres’s authentication and role architecture is significantly clearer and more straightforward than MySQL’s implementation. Focusing strictly on security, I’d also say that access control and management is more granular and customizable in Postgres than it is in MySQL – although here I’d have to say MySQL’s ACL is easier and more intuitive to manage.

Percona: What is the biggest challenge for database security we are facing?

Payal: For all the databases? I’d say with the rapid growth of IoT, encrypted data processing is a huge requirement that none of the well-known databases currently provide. Even encryption of data at rest outside of the IoT context requires more attention. It is one of the few things that a DBMS can do as a last-ditch effort to protect its data in SQL injection attacks, if all other layers of security (network, application layer, etc.) have failed (which very often is the case).

Percona: Why should people attend your talk? What do you hope people will take away from it? 

Payal: My talk is a run-through of all current and future Postgres security features, from the basic to the very advanced and niche. It is not an isolated talk that assumes Postgres is the only database in the world. I often compare and contrast other database implementations of similar security features as well. Not only is it a decent one-hour primer for people new and interested in Postgres, but also a good way to weigh the pros and cons among databases from a security viewpoint.

Percona: What are you looking forward to at Percona Live (besides your talk)?

Payal: I’m looking forward to all the great talks! I got a lot of information out of the talks at Percona Live last year. The tutorials on new MySQL features were especially great!

Want to find out more about this Percona Live 2018 featured talk, and Payal and PostgreSQL security? Register for Percona Live 2018, and see her talk Securing Your Data on PostgreSQL. Register now to get the best price! Use the discount code SeeMeSpeakPL18 for 10% off.

Percona Live Open Source Database Conference 2018 is the premier open source event for the data performance ecosystem. It is the place to be for the open source community. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Open Source Database Conference will be April 23-25, 2018 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.

Feb 21, 2018

Percona Live 2018 Open Source Database Conference Full Schedule Now Available

Percona Live 2018 Featured Talk

The conference session schedule for the seventh annual Percona Live 2018 Open Source Database Conference, taking place April 23-25 at the Santa Clara Convention Center in Santa Clara, CA is now live and available for review! Advance Registration Discounts can be purchased through March 4, 2018, 11:30 p.m. PST.

Percona Live Open Source Database Conference 2018 is the premier open source database event. With a theme of “Championing Open Source Databases,” the conference will feature multiple tracks, including MySQL, MongoDB, Cloud, PostgreSQL, Containers and Automation, Monitoring and Ops, and Database Security. Once again, Percona will be offering a low-cost database 101 track for beginning users who want to start learning how to use and operate open source databases.

Major areas of focus at the conference include:

  • Database operations and automation at scale, featuring speakers from Facebook, Slack, Github and more
  • Databases in the cloud – how database-as-a-service (DBaaS) is changing the DB Landscape, featuring speakers from AWS, Microsoft, Alibaba and more
  • Security and compliance – how GDPR and other government regulations are changing the way we manage databases, featuring speakers from Fastly, Facebook, Pythian, Percona and more
  • Bridging the gap between developers and DBAs – finding common ground, featuring speakers from Square, Oracle, Percona and more

Conference Session Schedule

Conference sessions take place April 24-25 and will feature 90+ in-depth talks by industry experts related to each of the key areas. Several sessions from Oracle and Percona will focus on how the new features and enhancements in the upcoming release of MySQL 8.0 will impact businesses. Conference session examples include:

Sponsorships

Sponsorship opportunities for Percona Live Open Source Database Conference 2018 are available and offer the opportunity to interact with the DBAs, sysadmins, developers, CTOs, CEOs, business managers, technology evangelists, solution vendors and entrepreneurs who typically attend the event. Contact live@percona.com for sponsorship details.

  • Diamond Sponsors – Continuent, VividCortex
  • Platinum – Microsoft
  • Gold Sponsors – Facebook, Grafana
  • Bronze Sponsors – Altinity, BlazingDB, SolarWinds, Timescale, TwinDB, Yelp
  • Other Sponsors – cPanel
  • Media Sponsors – Database Trends & Applications, Datanami, EnterpriseTech, HPCWire, ODBMS.org, Packt

Hyatt Regency Santa Clara & The Santa Clara Convention Center

Percona Live 2018 Open Source Database Conference is held at the Hyatt Regency Santa Clara & The Santa Clara Convention Center, at 5101 Great America Parkway Santa Clara, CA 95054.

The Hyatt Regency Santa Clara & The Santa Clara Convention Center is a prime location in the heart of the Silicon Valley. Enjoy this spacious venue with complimentary wifi, on-site expert staff and three great restaurants. You can reserve a room by booking through the Hyatt’s dedicated Percona Live reservation site.

Book your hotel using Percona’s special room block rate!

Feb 14, 2018

Amazon Aurora MySQL Monitoring with Percona Monitoring and Management (PMM)

Amazon Aurora MySQL Monitoring small

In this blog post, we’ll review additional Amazon Aurora MySQL monitoring capabilities we’ve added in Percona Monitoring and Management (PMM) 1.7.0. You can see them in action in the MySQL Amazon Aurora Metrics dashboard.

Amazon Aurora MySQL Transaction Commits

This graph looks at the number of commits the Amazon Aurora engine performed, as well as the average commit latency. As you can see from this graph, latency does not always correlate with the number of commits performed and can be quite high in certain situations.

Amazon Aurora MySQL Load

In Percona Monitoring and Management, we often use the concept of “Load” – which roughly corresponds to the number of operations of a type in progress. This graph shows us what statements contribute the most load on the system, as well as what load corresponds to the Amazon Aurora transaction commits (which we observed in the graph before).

Amazon Aurora MySQL Memory Usage

Amazon Aurora MySQL Monitoring 3

This graph is pretty self-explanatory. It shows how much memory is used by the Amazon Aurora lock manager, as well as the amount of memory Amazon Aurora uses to store the Data Dictionary.

Amazon Aurora MySQL Statement Latency

Amazon Aurora MySQL Monitoring 4

This graph shows the average latency for the most important types of statements. Latency spikes, as shown in this example, are often indicative of instance overload.

Amazon Aurora MySQL Special Command Counters

Amazon Aurora MySQL Monitoring 5

Amazon Aurora MySQL allows a number of commands that are not available in standard MySQL. This graph shows the usage of such commands. Regular “unit_test” calls can be seen in the default Amazon Aurora install, and the rest depends on your workload.

Amazon Aurora MySQL Problems

Amazon Aurora MySQL Monitoring 6

This graph is where you want to see a flat line. It shows different kinds of internal Amazon Aurora MySQL problems, which in normal operation should generally be zero.

I hope you find these Amazon Aurora MySQL monitoring improvements useful. Let us know if there is any other Amazon Aurora information that would be helpful to display!

Feb 08, 2018

Tutorial Schedule for Percona Live 2018 Is Live

Percona Live 2018

Percona has revealed the line-up of in-depth tutorials for the Percona Live 2018 Open Source Database Conference, taking place April 23-25, 2018 at the Santa Clara Convention Center in Santa Clara, Calif. Secure your spot now with Advanced Registration prices (available until March 4, 2018). Sponsorship opportunities for the conference are still available.

Percona Live 2018 Open Source Database Conference is the premier open source database event. The theme for the upcoming conference is “Championing Open Source Databases,” with a range of topics on MySQL, MongoDB and other open source databases, including time series databases, PostgreSQL and RocksDB. Session tracks include Developers, Operations and Business/Case Studies. 

Tutorials take place throughout the day on April 23, 2018. Tutorials provide practical, in-depth knowledge of critical open source database issues. Topics include:

Hyatt Regency Santa Clara & The Santa Clara Convention Center

Percona Live 2018 Open Source Database Conference will be held at the Hyatt Regency Santa Clara & The Santa Clara Convention Center, at 5101 Great America Parkway Santa Clara, CA 95054.

The Hyatt Regency Santa Clara & The Santa Clara Convention Center is a prime location in the heart of the Silicon Valley. Enjoy this spacious venue with complimentary wifi, on-site expert staff and three great restaurants. You can reserve a room by booking through the Hyatt’s dedicated Percona Live reservation site.

Book your hotel using Percona’s special room block rate!

Sponsorships

Sponsorship opportunities for Percona Live 2018 Open Source Database Conference are available and offer the opportunity to interact with the DBAs, sysadmins, developers, CTOs, CEOs, business managers, technology evangelists, solution vendors, and entrepreneurs who typically attend the event. Contact live@percona.com for sponsorship details.

  • Diamond Sponsors – Continuent, VividCortex
  • Gold Sponsors – Facebook, Grafana
  • Bronze Sponsors – SolarWinds, TwinDB, Yelp
  • Media Sponsors – Datanami, EnterpriseTech, HPCWire, ODBMS.org

Feb 07, 2018

ProxySQL Query Cache: What It Is, How It Works

ProxySQL query cache

In this blog post, I’ll present the ProxySQL query cache functionality. This is a query caching mechanism on top of ProxySQL. As there are already many how-tos regarding the ProxySQL prerequisites and installation process, we are going to skip these steps. For those who are already familiar with ProxySQL query cache configuration, let’s go directly to the query rules and the performance results.

Before talking about the ProxySQL query cache, let’s take a look at other caching mechanisms available for MySQL environments.

MySQL query cache is a query caching mechanism – deprecated as of MySQL 5.7.20 and removed in MySQL 8.0 – on top of MySQL itself (based on the official MySQL documentation).

The MySQL query cache stores the text of a SELECT statement together with the corresponding result sent to the client. If an identical statement is received later, the server retrieves the results from the query cache rather than parsing and executing the statement again. The query cache is shared among sessions, so a result set generated by one client can be sent in response to the same query issued by another client.

Although MySQL query cache is supposed to improve performance, there are cases where MySQL query cache is not scaling well and can degrade performance due to its locking and invalidation algorithms.

You can find a really interesting post regarding MySQL query cache here.

There is also another method to cache results in a MySQL environment. It’s external caching (i.e., Memcached, Redis, etc.), but this also has some drawbacks. Introducing such a mechanism requires some changes on the application side.

But what is ProxySQL query cache if there are already the MySQL query cache and other external caching mechanisms? At the moment, although we’ve done some tests, we are not going to compare ProxySQL query cache performance against other caching mechanisms. We’ll address this in a future blog post. We will only focus on ProxySQL itself.

What is ProxySQL Query Cache

ProxySQL query cache is an in-memory key-value storage that uses:

  • as key: a combination of username, schema and query text. It is a hash derived from the username, schema name and the query itself. Combining these ensures that users access only their own resultsets, and only for the correct schema.
  • as value: the resultset returned by the backend (mysqld or another proxy).

There is some more metadata stored for each resultset:

  • length: length of the resultset
  • expire_ms: defines when the entry will expire
  • access_ms: records the last time an entry was accessed
  • ref_count: a reference count to identify resultsets currently in use

Based on the configuration, the resultsets are cached on the wire while queries are executed, and the resultset is returned to the application. If the application re-executes the same query within the time slot defined by “expire_ms”, the resultset is returned by the embedded ProxySQL query cache.

The only way to invalidate entries from the ProxySQL query cache is through a time-to-live in milliseconds. This is in contrast to MySQL query cache, where the query cache gets invalidated each time a table gets updated. At the moment, it is only possible to tune the total amount of memory used by the query cache, using the variable “mysql-query_cache_size_MB”. The current implementation of mysql-query_cache_size_MB doesn’t impose a hard limit. Instead, it is used as an argument by the purging thread.
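
For example, assuming we wanted the purging thread to target roughly 256 MB of cached resultsets, the variable could be adjusted from the ProxySQL admin interface like this (the value is arbitrary):

Admin> UPDATE global_variables SET variable_value='256' WHERE variable_name='mysql-query_cache_size_MB';
Admin> LOAD MYSQL VARIABLES TO RUNTIME; SAVE MYSQL VARIABLES TO DISK;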

It’s obvious that it’s not easy to directly compare these two cache mechanisms, as each of them has its own way to invalidate results. Please also note a significant difference between MySQL and ProxySQL when query cache is enabled. ProxySQL query cache may serve stale resultsets due to the way it invalidates cached data (cached data are valid for the interval specified by “cache_ttl”, while MySQL’s cached data get invalidated each time data change). Every query that is cached may return stale data, and this may or may not be acceptable by the application.

How it Works

Before doing any benchmarks, I will try to give you a short description of how the ProxySQL query cache gets enabled. Unlike MySQL query cache, where a common caching space exists for all tables, in ProxySQL query cache we have to define what traffic gets cached. This is done by defining query rules that match the traffic to be cached and setting a “cache_ttl” for the cached results. There are many ways to define matches for incoming traffic, either by query or digest using patterns. All we need to cache a resultset is to define the matching criteria and the TTL. If a query passes the matching criteria, the resultset is cached, so subsequent requests are served directly from the ProxySQL instance instead of querying the hostgroup DB nodes (as long as cache_ttl has not expired).
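
For example, a pattern-based rule (as opposed to the digest-based rules we use later in this post) could look like the sketch below, where the regular expression and the rule_id are arbitrary:

Admin> INSERT INTO mysql_query_rules (rule_id,active,match_digest,cache_ttl,apply) VALUES (10,1,'^SELECT c FROM sbtest1 WHERE id=\?$',5000,1);
Admin> LOAD MYSQL QUERY RULES TO RUNTIME;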

Let’s use an example to make it clear how ProxySQL query cache is enabled.

In our setup we have three backend DB servers in a master-slave topology, with Percona Server for MySQL 5.7 and a ProxySQL ver. 1.4.3 instance with sysbench 1.0 installed. Backend servers are within the same reader hostgroup, and traffic is balanced among these servers using the same priority.

As I’ve already said, we won’t look at the ProxySQL installation. There are many topologies you can implement: deploying ProxySQL on each application server, thus removing the “single point of failure” weakness, for example. But in our case, we will just present the ProxySQL query cache using a single instance. In general, you would expect better performance with the ProxySQL instance closer to the application.

Configuration

With the ProxySQL instance up and running, let’s confirm that all servers are OK. Querying ProxySQL admin shows that all servers are ONLINE:

Admin> select * from mysql_servers;
+--------------+-----------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname  | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+-----------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 2        	| 10.0.2.12 | 3306 | ONLINE | 1  	| 0       	| 1000        	| 0               	| 0   	| 0          	|     	|
| 1        	| 10.0.2.11 | 3306 | ONLINE | 1  	| 0       	| 1000        	| 0               	| 0   	| 0          	|     	|
| 2        	| 10.0.2.13 | 3306 | ONLINE | 1  	| 0       	| 1000            | 0               	| 0   	| 0          	|     	|
| 2        	| 10.0.2.11 | 3306 | ONLINE | 1  	| 0       	| 1000        	| 0               	| 0   	| 0          	|     	|
+--------------+-----------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+

As you can see, there are two hostgroups: the “1” used for WRITES and the “2” used for READS.
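
Although routing configuration is not the focus of this post (and is not shown here), a read/write split behind hostgroups like these is often implemented with query rules along these lines, on top of the users' default_hostgroup. This is only a simplified sketch with arbitrary rule_ids:

Admin> INSERT INTO mysql_query_rules (rule_id,active,match_digest,destination_hostgroup,apply) VALUES (20,1,'^SELECT.*FOR UPDATE$',1,1);
Admin> INSERT INTO mysql_query_rules (rule_id,active,match_digest,destination_hostgroup,apply) VALUES (21,1,'^SELECT',2,1);
Admin> LOAD MYSQL QUERY RULES TO RUNTIME;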

A few random connections prove that traffic is correctly routed to the DB backends:

[RDBA] percona@atsaloux-proxysql: ~ $ mysql -h 127.0.0.1 -P 6033 -e "select @@hostname"
+--------------+
| @@hostname   |
+--------------+
| db1-atsaloux |
+--------------+
[RDBA] percona@atsaloux-proxysql: ~ $ mysql -h 127.0.0.1 -P 6033 -e "select @@hostname"
+--------------+
| @@hostname   |
+--------------+
| db2-atsaloux |
+--------------+
[RDBA] percona@atsaloux-proxysql: ~ $ mysql -h 127.0.0.1 -P 6033 -e "select @@hostname"
+--------------+
| @@hostname   |
+--------------+
| db1-atsaloux |
+--------------+
[RDBA] percona@atsaloux-proxysql: ~ $ mysql -h 127.0.0.1 -P 6033 -e "select @@hostname"
+--------------+
| @@hostname   |
+--------------+
| db3-atsaloux |
+--------------+

Let’s first take a look at some statistics. Before using sysbench, the “stats_mysql_query_digest” table (where digests are stored) is empty:

Admin> SELECT count_star,sum_time,hostgroup,digest,digest_text FROM stats_mysql_query_digest_reset ORDER BY sum_time DESC;
Empty set (0.00 sec)

The “stats_mysql_query_digest” table contains statistics related to the queries routed through the ProxySQL server. How many times each query was executed and the total execution time are two of the several statistics provided.

Before doing any benchmarks, I had to create some data. The following sysbench commands were used for selects by PK or by RANGE. For simplicity, we are not going to execute benchmarks inside transactions, although the ProxySQL query cache is effective there as well.

--threads will be adjusted for each benchmark:

sysbench --threads=16 --max-requests=0 --time=60 --mysql-user=percona --mysql-password=percona --mysql-db=sbtest --mysql-host=127.0.0.1 --mysql-port=6033 --oltp-table-size=1000000 --oltp-read-only=on --oltp-skip-trx=on --oltp-test-mode=simple --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-point-selects=1 --oltp-simple-ranges=0 /usr/share/sysbench/tests/include/oltp_legacy/oltp.lua run
sysbench --threads=16 --max-requests=0 --time=60 --mysql-user=percona --mysql-password=percona --mysql-db=sbtest --mysql-host=127.0.0.1 --mysql-port=6033 --oltp-table-size=1000000 --oltp-read-only=on --oltp-skip-trx=on --oltp-test-mode=simple --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-point-selects=0 --oltp-simple-ranges=1 /usr/share/sysbench/tests/include/oltp_legacy/oltp.lua run

Before running the full benchmark, a simple sysbench run was used to collect the query digests needed for the ProxySQL query cache configuration.

After running the first benchmark with the ProxySQL query cache disabled, I queried the “stats_mysql_query_digest” table again and got the following results, which log all executed queries.

Admin> SELECT count_star,sum_time,hostgroup,digest,digest_text FROM stats_mysql_query_digest_reset ORDER BY sum_time DESC;
+------------+-------------+-----------+--------------------+------------------------------------------------+
| count_star | sum_time    | hostgroup | digest             | digest_text                                    |
+------------+-------------+-----------+--------------------+------------------------------------------------+
| 301536     | 20962929791 | 2         | 0xBF001A0C13781C1D | SELECT c FROM sbtest1 WHERE id=?               |
| 3269       | 30200073    | 2         | 0x290B92FD743826DA | SELECT c FROM sbtest1 WHERE id BETWEEN ? AND ? |
+------------+-------------+-----------+--------------------+------------------------------------------------+
2 row in set (0.01 sec)

Add mysql_query_rules To Be Cached

The output above provides all the information needed to enable the ProxySQL query cache. What we need to do now is add query rules that match the queries whose resultsets should be cached. In this case we match on the query digest and set a cache_ttl of 5000 ms. Taking this into consideration, we added the following rules:

Admin> INSERT INTO mysql_query_rules (rule_id,active,digest,cache_ttl,apply) VALUES (1,1,'0xBF001A0C13781C1D',5000,1);
Query OK, 1 row affected (0.00 sec)
Admin> INSERT INTO mysql_query_rules (rule_id,active,digest,cache_ttl,apply) VALUES (2,1,'0x290B92FD743826DA',5000,1);
Query OK, 1 row affected (0.00 sec)

We shouldn’t forget that we must load query rules at runtime. If we don’t want to lose these rules (i.e., after a ProxySQL restart), we should also save to disk:

Admin> LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;
Query OK, 0 rows affected (0.00 sec)

Now let’s reset the stats_mysql_query_digest results:

Admin> SELECT 1 FROM stats_mysql_query_digest_reset LIMIT 1; -- we reset the counters
+---+
| 1 |
+---+
| 1 |
+---+
1 row in set (0.01 sec)

And re-run the benchmarks with query cache enabled. To confirm what traffic was cached, we have to query the stats_mysql_query_digest once again:

Admin> SELECT count_star,sum_time,hostgroup,digest,digest_text FROM stats_mysql_query_digest ORDER BY sum_time DESC;
+------------+------------+-----------+--------------------+------------------------------------------------+
| count_star | sum_time   | hostgroup | digest             | digest_text                                    |
+------------+------------+-----------+--------------------+------------------------------------------------+
| 108681     | 6304585632 | 2         | 0x290B92FD743826DA | SELECT c FROM sbtest1 WHERE id BETWEEN ? AND ? |
| 343277     | 0          | -1        | 0x290B92FD743826DA | SELECT c FROM sbtest1 WHERE id BETWEEN ? AND ? |
+------------+------------+-----------+--------------------+------------------------------------------------+
2 rows in set (0.00 sec)

Admin> SELECT count_star,sum_time,hostgroup,digest,digest_text FROM stats_mysql_query_digest ORDER BY sum_time DESC;
+------------+-----------+-----------+--------------------+----------------------------------+
| count_star | sum_time  | hostgroup | digest             | digest_text                      |
+------------+-----------+-----------+--------------------+----------------------------------+
| 79629      | 857050510 | 2         | 0xBF001A0C13781C1D | SELECT c FROM sbtest1 WHERE id=? |
| 441194     | 0         | -1        | 0xBF001A0C13781C1D | SELECT c FROM sbtest1 WHERE id=? |
+------------+-----------+-----------+--------------------+----------------------------------+
2 rows in set (0.00 sec)

Cached queries are the ones marked with a special hostgroup -1 (this means that these queries were not sent to any hostgroup), and the total execution time for the queries cached is 0 (this means that the request was served within the same events loop).
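
If you want to look at the cache itself rather than the digests, ProxySQL also exposes query cache counters in its stats schema; for example (a sketch, and the exact counter names can vary between ProxySQL versions):

Admin> SELECT * FROM stats_mysql_global WHERE variable_name LIKE 'Query_Cache%';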

Below you can see the benchmark results. Let’s look at what happens for selects by PK and selects by RANGE:

ProxySQL query cache

 

ProxySQL query cache

Points of Interest

  • In all cases, when threads and backend servers are increasing, ProxySQL performs better. This is achieved due to its connection pooling and multiplexing capabilities.
  • Enabling ProxySQL query cache provides a significant performance boost.
  • ProxySQL query cache can achieve a ~2X performance boost at a minimum.
  • This boost can be considerably valuable in cases where MySQL performance may fall to 50% (i.e., select by RANGE).
  • We shouldn’t forget that results are affected by hardware specs as well, but it’s obvious that ProxySQL with query cache enabled gives a really high throughput.

ProxySQL Query Cache Limitations

Current known limitations:

  • It is not possible to define query cache invalidation other than with cache_ttl.
  • There is no way to enforce query cache purge.
  • mysql-query_cache_size_MB is not strictly enforced, but only used as a metric to trigger automatic purging of expired entries.
  • Although access_ms is recorded, it is not used as a metric to expire unused entries when mysql-query_cache_size_MB is reached.
  • Query cache does not support prepared statements.
  • Query cache may serve stale data.

Conclusion

ProxySQL is generally a very powerful and easy-to-use tool.

With regards to query cache, it seems to scale very well and achieve a significant performance boost. Although having complex configs (not only for query cache) can add some extra overhead, it’s easy to maintain.

cache_ttl can be a limitation, but if it’s correctly configured in conjunction with max_replication_lag when configuring nodes in hostgroups, it does not add any significant drawback. In any case, it depends on whether or not this is acceptable to the application.

Feb 06, 2018

Announcing Experimental Percona Monitoring and Management (PMM) Functionality via Percona Labs

Experimental Percona Monitoring and Management

In this blog post, we’ll introduce how you can look at some experimental Percona Monitoring and Management (PMM) features using Percona Labs builds on GitHub.

Note: PerconaLabs and Percona-QA are open source GitHub repositories for unofficial scripts and tools created by Percona staff. While not covered by Percona support or services agreements, these handy utilities can help you save time and effort.

Percona software builds located in the PerconaLabs and Percona-QA repositories are not officially released software, and also aren’t covered by Percona support or services agreements. 

Percona Monitoring and Management (PMM) is a free and open-source platform for managing and monitoring MySQL® and MongoDB® performance. You can run PMM in your environment for maximum security and reliability. It provides thorough time-based analysis for MySQL and MongoDB servers to ensure that your data works as efficiently as possible.

This month we’re announcing access to Percona Labs builds of Percona Monitoring and Management so that you can experiment with new functionality that’s not yet in our mainline product. You can identify the unique builds at:

https://hub.docker.com/r/perconalab/pmm-server/tags/

Most of the entries here are the pre-release candidate images we use for QA, and they follow a format of all integers (for example “201802061627”). You’re fine to use these images, but they aren’t the ones that have the experimental functionality.

Today we have two builds of note (these DO have the experimental functionality):

  • 1.6.0-prom2.1
  • 1.5.3-prometheus2

We’re highlighting Prometheus 2.1 on top of our January 1.6 release (1.6.0-prom2.1), available in Docker format. Some of the reasons you might want to deploy this experimental build to take advantage of the Prometheus 2 benefits are:

  • Reduced CPU usage by Prometheus, meaning you can add more hosts to your PMM Server
  • Performance improvements, meaning dashboards load faster
  • Reduced disk I/O, disk space usage

Please keep in mind that this is a Percona Labs build (see our note above), so also note the following two criteria:

  • Support is available from our Percona Monitoring and Management Forums
  • Upgrades might not work – don’t count on upgrading out of this version to a newer release (although it’s not guaranteed to block upgrades)

How to Deploy an Experimental Build from Percona Labs

The great news is that you can follow our Deployment Instructions for Docker, and the only change is where you specify a different Docker container to pull. For example, the standard way to deploy the latest stable PMM Server release with Docker is:

docker pull percona/pmm-server:latest

To use the Percona Labs build 1.6.0-prom2.1 with Prometheus 2.1, execute the following:

docker pull perconalab/pmm-server:1.6.0-prom2.1
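
From there, you start the container the same way as the mainline image. As a minimal sketch (skipping the persistent data container described in the deployment instructions, and using an arbitrary container name):

docker run -d -p 80:80 --name pmm-server-prom2 perconalab/pmm-server:1.6.0-prom2.1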

Please share your feedback on this build on our Percona Monitoring and Management Forums.

If you’re looking to deploy Percona’s officially released PMM Server (not the Percona Labs release, but our mainline version which currently is release 1.7) into a production environment, I encourage you to consider a Percona Support contract, which includes PMM at no additional charge!

Feb 05, 2018

Four Ways MySQL Executes GROUP BY

MySQL GROUP BY

In this blog post, I’ll look into four ways MySQL executes GROUP BY.

In my previous blog post, we learned that indexes or other means of finding data might not be the most expensive part of query execution. For example, MySQL GROUP BY could potentially be responsible for 90% or more of the query execution time. 

The main complexity when MySQL executes GROUP BY is computing aggregate functions in a GROUP BY statement. How this works is shown in the documentation for UDF Aggregate Functions. As we see, the requirement is that UDF functions get all values that constitute the single group one after another. That way, it can compute the aggregate function value for the single group before moving to another group.

The problem, of course, is that in most cases the source data values aren’t grouped. Values coming from a variety of groups follow one another during processing. As such, we need a special step to handle MySQL GROUP BY.

Let’s look at the same table we looked at before:

mysql> show create table tbl\G
*************************** 1. row ***************************
      Table: tbl
Create Table: CREATE TABLE `tbl` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `k` int(11) NOT NULL DEFAULT '0',
 `g` int(10) unsigned NOT NULL,
 PRIMARY KEY (`id`),
 KEY `k` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=2340933 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

And the same GROUP BY statements executed in different ways:

1: Index Ordered GROUP BY in MySQL

mysql> select k, count(*) c from tbl group by k order by k limit 5;
+---+---+
| k | c |
+---+---+
| 2 | 3 |
| 4 | 1 |
| 5 | 2 |
| 8 | 1 |
| 9 | 1 |
+---+---+
5 rows in set (0.00 sec)
mysql> explain select k, count(*) c from tbl group by k order by k limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: index
possible_keys: k
         key: k
     key_len: 4
         ref: NULL
        rows: 5
    filtered: 100.00
       Extra: Using index
1 row in set, 1 warning (0.00 sec)

In this case, we have an index on the column we use for GROUP BY. This way, we can just scan data group by group and perform GROUP BY on the fly (inexpensively).

It works especially well when we use LIMIT to restrict the number of groups we retrieve or when a “covering index” is in use, as a sequential index-only scan is a very fast operation.

If you have a small number of groups though, and no covering index, index order scans can cause a lot of IO. So this might not be the most optimal plan.

2: External Sort GROUP BY in MySQL

mysql> explain select SQL_BIG_RESULT g, count(*) c from tbl group by g limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 998490
    filtered: 100.00
       Extra: Using filesort
1 row in set, 1 warning (0.00 sec)
mysql> select SQL_BIG_RESULT g, count(*) c from tbl group by g limit 5;
+---+---+
| g | c |
+---+---+
| 0 | 1 |
| 1 | 2 |
| 4 | 1 |
| 5 | 1 |
| 6 | 2 |
+---+---+
5 rows in set (0.88 sec)

If we do not have an index that allows us to scan the data in group order, we can instead get data sorted through an external sort (also referred to as “filesort” in MySQL).

You may notice I’m using an SQL_BIG_RESULT hint here to get this plan. Without it, MySQL won’t choose this plan in this case.

In general, MySQL prefers to use this plan only if we have a large number of groups, because in this case sorting is more efficient than having a temporary table (which we will talk about next).

3: Temporary Table GROUP BY in MySQL

mysql> explain select  g, sum(g) s from tbl group by g limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 998490
    filtered: 100.00
       Extra: Using temporary
1 row in set, 1 warning (0.00 sec)
mysql> select  g, sum(g) s from tbl group by g order by null limit 5;
+---+------+
| g | s    |
+---+------+
| 0 |    0 |
| 1 |    2 |
| 4 |    4 |
| 5 |    5 |
| 6 |   12 |
+---+------+
5 rows in set (7.75 sec)

In this case, MySQL also does a full table scan. But instead of running additional sort passes, it creates a temporary table instead. This temporary table contains one row per group, and with each incoming row the value for the corresponding group is updated. Lots of updates! While this might be reasonable in-memory, it becomes very expensive if the resulting table is so large that updates are going to cause a lot of disk IO. In this case, external sort plans are usually better.

Note that while MySQL selects this plan by default for this use case, if we do not supply any hints it is almost 10x slower than the plan we get using the SQL_BIG_RESULT hint.

You may notice I added “ORDER BY NULL” to this query. This is to show you the “clean” temporary-table-only plan. Without it, we get this plan:

mysql> explain select  g, sum(g) s from tbl group by g limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 998490
    filtered: 100.00
       Extra: Using temporary; Using filesort
1 row in set, 1 warning (0.00 sec)

In it, we get the “worst of both worlds,” with both Using temporary and Using filesort.

MySQL 5.7 always returns GROUP BY results sorted in group order, even if the query doesn’t require it (which can then require an expensive additional sort pass). ORDER BY NULL signals that the application doesn’t need sorted results.

You should note that in some cases – such as JOIN queries with aggregate functions accessing columns from different tables – using temporary tables for GROUP BY might be the only option.

If you want to force MySQL to use a plan that does temporary tables for GROUP BY, you can use the SQL_SMALL_RESULT  hint.
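
The hint goes right after SELECT, just like SQL_BIG_RESULT above; for example (shown without the EXPLAIN output):

mysql> select SQL_SMALL_RESULT g, count(*) c from tbl group by g order by null limit 5;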

4:  Index Skip-Scan-Based GROUP BY in MySQL

The previous three GROUP BY execution methods apply to all aggregate functions. Some of them, however, have a fourth method.

mysql> explain select k,max(id) from tbl group by k\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: range
possible_keys: k
         key: k
     key_len: 4
         ref: NULL
        rows: 2
    filtered: 100.00
       Extra: Using index for group-by
1 row in set, 1 warning (0.00 sec)
mysql> select k,max(id) from tbl group by k;
+---+---------+
| k | max(id) |
+---+---------+
| 0 | 2340920 |
| 1 | 2340916 |
| 2 | 2340932 |
| 3 | 2340928 |
| 4 | 2340924 |
+---+---------+
5 rows in set (0.00 sec)

This method applies only to very special aggregate functions: MIN() and MAX(). These do not really need to go through all the rows in the group to compute the value at all.

They can just jump directly to the minimum or maximum value in the group (if there is such an index).

How can you find the MAX(id) value for each group if the index is built only on the (k) column? This is an InnoDB table, and InnoDB tables effectively append the PRIMARY KEY to all secondary indexes. (k) becomes (k, id), allowing us to use the skip-scan optimization for this query.

This optimization is only enabled if there is a large number of rows per group. Otherwise, MySQL prefers more conventional means to execute this query (like Index Ordered GROUP BY detailed in approach #1).

While we’re on MIN()/MAX() aggregate functions, other optimizations apply to them as well. For example, if you have an aggregate function with no GROUP BY (effectively having one group for the whole table), MySQL fetches those values from indexes during the statistics analysis phase and avoids reading the tables during the execution stage altogether:

mysql> explain select max(k) from tbl\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: NULL
  partitions: NULL
        type: NULL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: NULL
    filtered: NULL
       Extra: Select tables optimized away
1 row in set, 1 warning (0.00 sec)

Filtering and Group By

We have looked at four ways MySQL executes GROUP BY.  For simplicity, I used GROUP BY on the whole table, with no filtering applied. The same concepts apply when you have a WHERE clause:

mysql> explain select  g, sum(g) s from tbl where k>4 group by g order by NULL limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: range
possible_keys: k
         key: k
     key_len: 4
         ref: NULL
        rows: 1
    filtered: 100.00
       Extra: Using index condition; Using temporary
1 row in set, 1 warning (0.00 sec)

In this case, we use a range scan on the k column for data filtering/lookup and perform the GROUP BY using a temporary table.

In some cases, the methods do not conflict. In others, however, we have to choose either to use one index for GROUP BY or another index for filtering:

mysql> alter table tbl add key(g);
Query OK, 0 rows affected (4.17 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> explain select  g, sum(g) s from tbl where k>1 group by g limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: index
possible_keys: k,g
         key: g
     key_len: 4
         ref: NULL
        rows: 16
    filtered: 50.00
       Extra: Using where
1 row in set, 1 warning (0.00 sec)
mysql> explain select  g, sum(g) s from tbl where k>4 group by g limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: range
possible_keys: k,g
         key: k
     key_len: 4
         ref: NULL
        rows: 1
    filtered: 100.00
       Extra: Using index condition; Using temporary; Using filesort
1 row in set, 1 warning (0.00 sec)

Depending on specific constants used in this query, we can see that we either use an index ordered scan for GROUP BY (and  “give up”  benefiting from the index to resolve the WHERE clause), or use an index to resolve the WHERE clause (but use a temporary table to resolve GROUP BY).

In my experience, this is where MySQL GROUP BY does not always make the right choice. You might need to use FORCE INDEX to execute queries the way you want them to.
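
For example, to force the last query above to resolve the WHERE clause through the k index, something like this would do (shown without the output):

mysql> explain select g, sum(g) s from tbl FORCE INDEX (k) where k>1 group by g limit 5\G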

Summary

I hope this article provides a good overview of how MySQL executes GROUP BY.  In my next blog post, we will look into techniques you can use to optimize GROUP BY queries.

Jan 16, 2018

Updating/Deleting Rows From Clickhouse (Part 2)

ClickHouse

In this post, we’ll look at updating and deleting rows with ClickHouse. It’s the second of two parts.

In the first part of this post, we described the high-level overview of implementing incremental refresh on a ClickHouse table as an alternative support for UPDATE/DELETE. In this part, we will show you the actual steps and sample code.

Prepare Changelog Table

First, we create the changelog table below. This can be stored on any other MySQL instance separate from the source of our analytics table. When we run the change capture script, it will record the data on this table that we can consume later with the incremental refresh script:

CREATE TABLE `clickhouse_changelog` (
  `db` varchar(255) NOT NULL DEFAULT '',
  `tbl` varchar(255) NOT NULL DEFAULT '',
  `created_at` date NOT NULL,
  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `log_file` varchar(255) NOT NULL,
  `log_pos` int(10) unsigned NOT NULL,
  PRIMARY KEY (`db`,`tbl`,`created_at`),
  KEY `log_file` (`log_file`,`log_pos`)
) ENGINE=InnoDB;

Create ClickHouse Table

Next, let’s create the target ClickHouse table. Remember that the corresponding MySQL table is the one below:

CREATE TABLE `hits` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `type` varchar(100) DEFAULT NULL,
  `user_id` int(11) NOT NULL,
  `location_id` int(11) NOT NULL,
  `created_at` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `created_at` (`created_at`)
) ENGINE=InnoDB;

Converting this table to ClickHouse looks like below, with the addition of a “created_day” column that serves as the partitioning key:

CREATE TABLE hits (
  id Int32,
  created_day Date,
  type String,
  user_id Int32,
  location_id Int32,
  created_at Int32
) ENGINE = MergeTree PARTITION BY toMonday(created_day)
ORDER BY (created_at, id) SETTINGS index_granularity = 8192;

Run Changelog Capture

Once the tables are ready, we can run the change capture script. An example script can be found in this gist, which is written in Python and uses the python-mysql-replication library. This library acts as a replication client, continuously downloads the binary logs from the source and sifts through them to find any UPDATE/DELETE executed against our source table.

There are a few configuration options that need to be customized in the script.

  • LOG_DB_HOST: The MySQL host where we created the clickhouse_changelog table.
  • LOG_DB_NAME: The database name where the clickhouse_changelog table is created.

  • SRC_DB_HOST: The MySQL host we will be downloading binary logs from. This can be either a primary or a secondary/replica, as long as it’s the same server where our raw table is also located.
  • MYSQL_USER: MySQL username.
  • MYSQL_PASS: MySQL password.
  • TABLE: The table we want to watch for changes.

When the script is successfully configured and running, the clickhouse_changelog table should start populating with data like below.

mysql> select * from mydb.clickhouse_changelog;
+------+------+------------+---------------------+------------------+-----------+
| db   | tbl  | created_at | updated_at          | log_file         | log_pos   |
+------+------+------------+---------------------+------------------+-----------+
| mydb | hits | 2014-06-02 | 2017-12-23 17:19:33 | mysql-bin.016353 |  18876747 |
| mydb | hits | 2014-06-09 | 2017-12-23 22:10:29 | mysql-bin.016414 |   1595314 |
| mydb | hits | 2014-06-16 | 2017-12-23 02:59:37 | mysql-bin.016166 |  33999981 |
| mydb | hits | 2014-06-23 | 2017-12-23 18:09:33 | mysql-bin.016363 |  84498477 |
| mydb | hits | 2014-06-30 | 2017-12-23 06:08:59 | mysql-bin.016204 |  23792406 |
| mydb | hits | 2014-08-04 | 2017-12-23 18:09:33 | mysql-bin.016363 |  84499495 |
| mydb | hits | 2014-08-18 | 2017-12-23 18:09:33 | mysql-bin.016363 |  84500523 |
| mydb | hits | 2014-09-01 | 2017-12-23 06:09:19 | mysql-bin.016204 |  27120145 |
+------+------+------------+---------------------+------------------+-----------+

Full Table Import

So we have our changelog capture in place, the next step is to initially populate the ClickHouse table from MySQL. Normally, we can easily do this with a mysqldump into a tab-separated format, but remember we have to transform the created_at column from MySQL into ClickHouse’s Date format to be used as the partitioning key.

A simple way to do this is by using a simple set of shell commands like below:

SQL=$(cat <<EOF
SELECT
	id, DATE_FORMAT(created_at, "%Y-%m-%d"),
	type, user_id, location_id, UNIX_TIMESTAMP(created_at)
FROM hits
EOF
)
mysql -h source_db_host mydb -BNe "$SQL" > hist.txt
cat hist.txt | clickhouse-client -d mydb --query="INSERT INTO hits FORMAT TabSeparated"

One thing to note about this process is that the MySQL client buffers the results for the whole query, and it could eat up all the memory on the server you run this from if the table is really large. To avoid this, chunk the table into several million rows at a time. Since we already have the changelog capture running and in place from the previous step, you do not need to worry about any changes between chunks. We will consolidate those changes during the incremental refreshes.
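
A rough sketch of such chunking, assuming we walk the auto-increment id column in steps of one million rows (the chunk size is arbitrary), could look like this:

# Export the table to ClickHouse in chunks of CHUNK rows, keyed on the id column
CHUNK=1000000
MAX_ID=$(mysql -h source_db_host mydb -BNe "SELECT MAX(id) FROM hits")
for START in $(seq 0 $CHUNK $MAX_ID); do
  END=$((START + CHUNK))
  mysql -h source_db_host mydb -BNe "SELECT id, DATE_FORMAT(created_at, '%Y-%m-%d'), type, user_id, location_id, UNIX_TIMESTAMP(created_at) FROM hits WHERE id > $START AND id <= $END" \
    | clickhouse-client -d mydb --query="INSERT INTO hits FORMAT TabSeparated"
done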

Incremental Refresh

After initially populating the ClickHouse table, we then set up our continuous incremental refresh using a separate script. A template script we use for the table on our example can be found in this gist.

What this script does is twofold:

  • Determines the list of weeks recently modified based on clickhouse_changelog, dumps the rows for those weeks and re-imports them to ClickHouse (a rough sketch of this step follows this list).

  • If the current week is not on the list of those with modifications, it also checks for new rows based on the auto-incrementing primary key and appends them to the ClickHouse table.
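
For illustration, the first step boils down to a query against the changelog table along these lines (a simplified sketch; the actual script in the gist also keeps track of what it has already processed, and the 10-minute window here is arbitrary):

SELECT DISTINCT created_at AS modified_week
FROM clickhouse_changelog
WHERE db = 'mydb' AND tbl = 'hits'
  AND updated_at > NOW() - INTERVAL 10 MINUTE;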

An example output of this script would be:

ubuntu@mysql~/clickhouse$ bash clickhouse-incr-refresh.sh hits
2017-12-24_00_20_19 incr-refresh Starting changelog processing for hits
2017-12-24_00_20_19 incr-refresh Current week is: 2017-12-18
2017-12-24_00_20_19 incr-refresh Processing week: 2017-12-18
2017-12-24_00_20_20 incr-refresh Changelog import for hits complete
ubuntu@mysql~/clickhouse$ bash clickhouse-incr-refresh.sh hits
2017-12-24_00_20_33 incr-refresh Starting changelog processing for hits
2017-12-24_00_20_33 incr-refresh Current week is: 2017-12-18
2017-12-24_00_20_33 incr-refresh Weeks is empty, nothing to do
2017-12-24_00_20_33 incr-refresh Changelog import for hits complete
2017-12-24_00_20_33 incr-refresh Inserting new records for hits > id: 5213438
2017-12-24_00_20_33 incr-refresh No new rows found
2017-12-24_00_20_33 incr-refresh Incremental import for hits complete
ubuntu@mysql~/clickhouse$

Note that, at this step, if you imported a really large table and the changelog had accumulated a large number of changes to refresh, the initial incremental execution might take some time. After that though, it should be faster. This script can be run every minute, longer or shorter, depending on how often you want the ClickHouse table to be refreshed.

To wrap up, here is a query from MySQL on the same table, versus ClickHouse.

mysql> SELECT COUNT(DISTINCT user_id) FROM hits WHERE created_at
    -> BETWEEN '2016-01-01 00:00:00' AND '2017-01-01 00:00:00';
+-------------------------+
| COUNT(DISTINCT user_id) |
+-------------------------+
|                 3023028 |
+-------------------------+
1 row in set (25.89 sec)

:) SELECT COUNT(DISTINCT user_id) FROM hits WHERE created_at BETWEEN 1451606400 AND 1483228800;
SELECT COUNTDistinct(user_id)
FROM hits
WHERE (created_at >= 1451606400) AND (created_at <= 1483228800)
┌─uniqExact(user_id)─┐
│            3023028 │
└────────────────────┘
1 rows in set. Elapsed: 0.491 sec. Processed 35.82 million rows, 286.59 MB (73.01 million rows/s., 584.06 MB/s.)

Enjoy!

Dec 21, 2017

This Week in Data with Colin Charles 20: cPanel changes strategy, Percona Live CFP extended

Colin Charles

Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

I think the biggest news from last week was from cPanel – if you haven’t already read the post, please do – on Being a Good Open Source Community Member: Why we hesitated on MySQL 5.7. cPanel anticipated MariaDB being the eventual replacement for MySQL, based on movements from Red Hat, Wikipedia and Google. The advantage focused on transparency around security disclosure, and the added features/improvements. Today though, “MySQL now consistently matches or outpaces MariaDB when it comes to development and releases, which in turn is increasing the demand on us for providing those upgraded versions of MySQL by our users.” And maybe a little more telling, “when MariaDB 10.2 became stable in May 2017 it included many features found in MySQL 5.7. However, MySQL reached stable nearly 18 months earlier in October 2015.” (emphasis mine).

So cPanel is going forth and supporting MySQL 5.7. They will continue supporting MariaDB Server for the foreseeable future. This really is cPanel ensuring they are responsive to users: “The people using and building database-driven applications are doing so with MySQL in mind, and are hesitant to add support for MariaDB. Responding to our community’s desires is one of the most important things to us, and this is something that we are hearing asked for from our community consistently.”

I, of course, think this is a great move. Users deserve choice. And MySQL has features that are sometimes still not included in MariaDB Server. Have you seen the Complete list of new features in MySQL 5.7? Or my high-level response to a MariaDB Corporation white paper?

I can only hope to see more people think pragmatically like cPanel. Ubuntu as a Linux distribution still does – you get MySQL 5.7 as a default (very unlike the upstream Debian which ships MariaDB Server nowadays). I used to be a proponent of MariaDB Server being everywhere, when it was community-developed, feature-enhanced, and backward-compatible. However, the moment it stopped being a branch and a true fork is the moment where trouble lies for users. I think it was still marginally fine with 10.0, and maybe even 10.1, but the ability to maintain feature parity with enhanced features has long gone. Short of a rebase? But then… what would be different to the already popular branch of MySQL called Percona Server for MySQL?

While there are wins and support from cloud vendors, like Amazon AWS RDS and Microsoft Azure, you’ll notice that they offer both MySQL and MariaDB Server. Google Cloud SQL notably only offers MySQL. IBM may be a sponsor of the MariaDB Foundation, but I don’t see their services like Compose offering anything other than MySQL (with group replication nonetheless!). Platinum member Alibaba Cloud offers MySQL and PostgreSQL. However, Tencent seems to suggest that MariaDB is coming soon? One interesting statistic to watch would be user uptake naturally.

Events

From an events standpoint, the Percona Live 2018 Call for Papers has been extended to January 12, 2018. We expect an early announcement of maybe ten talks in the week of  January 5. Please submit to the CFP. Have you got your tickets yet? Nab them during our Percona Live 2018 super saver registration when they are the best price!

FOSDEM has got Sveta and myself speaking in the MySQL and Friends DevRoom, but we also have good news in the sense that Peter Zaitsev is also going to be at FOSDEM – speaking in the main track. We’ll also have plenty of schwag at the stand.

I think it’s important to take note of the updates to Percona bug tracking: yes, it’s Jira all the way. It would be good for everyone to also start looking at how the sausage is made.

Dgraph, a “distributed fast graph database“, just raised $3m and released 1.0. Have you used it?

On a lighter note, there seems to be a tweet going around by many, so I thought I’d share it here. Merry Christmas and Happy Holidays.

He’s making a database
He’s sorting it twice
SELECT * FROM girls_boys WHERE behaviour = “nice”
SQL Claus is coming to town!

Releases

Link List

Upcoming appearances

  • FOSDEM 2018 – Brussels, Belgium – February 3-4 2018
  • SCALE16x – Pasadena, California, USA – March 8-11 2018

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.
