In this post, we’ll look at updating and deleting rows with ClickHouse. It’s the second of two parts.
In the first part of this post, we gave a high-level overview of implementing incremental refresh on a ClickHouse table as an alternative to native UPDATE/DELETE support. In this part, we will show you the actual steps and sample code.
Prepare Changelog Table
First, we create the changelog table below. It can live on any MySQL instance, including one separate from the source of our analytics table. When we run the change capture script, it will record data in this table that we can later consume with the incremental refresh script:
CREATE TABLE `clickhouse_changelog` (
  `db` varchar(255) NOT NULL DEFAULT '',
  `tbl` varchar(255) NOT NULL DEFAULT '',
  `created_at` date NOT NULL,
  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `log_file` varchar(255) NOT NULL,
  `log_pos` int(10) unsigned NOT NULL,
  PRIMARY KEY (`db`,`tbl`,`created_at`),
  KEY `log_file` (`log_file`,`log_pos`)
) ENGINE=InnoDB;
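Recording a modified day in this table boils down to an upsert keyed on the (db, tbl, created_at) primary key. The helper below is a hypothetical sketch (not the exact statement the capture script uses): repeated changes to the same day collapse into a single changelog row, with log_file/log_pos tracking the latest binlog position that touched it.

```python
def changelog_upsert(db, tbl, day, log_file, log_pos):
    """Build an upsert that records one modified day in clickhouse_changelog.

    The (db, tbl, created_at) primary key makes repeated changes to the
    same day collapse into one row; log_file/log_pos keep the latest
    binlog position that touched it.
    """
    sql = (
        "INSERT INTO clickhouse_changelog"
        " (db, tbl, created_at, log_file, log_pos)"
        " VALUES (%s, %s, %s, %s, %s)"
        " ON DUPLICATE KEY UPDATE"
        " log_file = VALUES(log_file), log_pos = VALUES(log_pos)"
    )
    return sql, (db, tbl, day, log_file, log_pos)
```

Passing the statement and parameters separately to the MySQL driver keeps the values properly escaped.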
Create ClickHouse Table
Next, let’s create the target ClickHouse table. As a reminder, the corresponding MySQL table is:
CREATE TABLE `hits` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `type` varchar(100) DEFAULT NULL,
  `user_id` int(11) NOT NULL,
  `location_id` int(11) NOT NULL,
  `created_at` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `created_at` (`created_at`)
) ENGINE=InnoDB;
Converting this table to ClickHouse looks like below, with the addition of a “created_day” column that serves as the partitioning key:
CREATE TABLE hits (
  id Int32,
  created_day Date,
  type String,
  user_id Int32,
  location_id Int32,
  created_at Int32
) ENGINE = MergeTree
PARTITION BY toMonday(created_day)
ORDER BY (created_at, id)
SETTINGS index_granularity = 8192;
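To illustrate the column mapping, here is a hypothetical Python helper (the function name is ours, not from the linked scripts): each MySQL row contributes both a created_day Date string for partitioning and the original created_at as a Unix-timestamp Int32.

```python
from datetime import datetime

def to_clickhouse_row(row):
    """Map one MySQL `hits` row (as a dict) onto the ClickHouse column
    layout above: created_at becomes both the created_day partitioning
    Date and a Unix-timestamp Int32."""
    created = row["created_at"]          # a datetime coming from MySQL
    return (
        row["id"],
        created.strftime("%Y-%m-%d"),    # created_day, the partitioning column
        row["type"] or "",               # String column cannot hold NULL
        row["user_id"],
        row["location_id"],
        int(created.timestamp()),        # created_at stored as Int32
    )
```

Note that NULLable MySQL columns need a concrete default (here an empty string), since this ClickHouse schema has no Nullable types.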
Run Changelog Capture
Once the tables are ready, we can run the change capture script. An example script can be found in this gist; it is written in Python and uses the python-mysql-replication library. This library acts as a replication client: it continuously downloads the binary logs from the source server and sifts through them to find any UPDATE/DELETE executed against our source table.
There are a few configuration options that need to be customized in the script.
- LOG_DB_HOST: The MySQL host where we created the clickhouse_changelog table.
- LOG_DB_NAME: The database name where the clickhouse_changelog table is created.
- SRC_DB_HOST: The MySQL host we will be downloading binary logs from. This can be either a primary or a secondary/replica, as long as it is the same server where our raw table is located.
- MYSQL_USER: MySQL username.
- MYSQL_PASS: MySQL password.
- TABLE: The table we want to watch for changes.
When the script is successfully configured and running, the clickhouse_changelog table should start populating with data like below.
mysql> select * from mydb.clickhouse_changelog;
+------+------+------------+---------------------+------------------+----------+
| db   | tbl  | created_at | updated_at          | log_file         | log_pos  |
+------+------+------------+---------------------+------------------+----------+
| mydb | hits | 2014-06-02 | 2017-12-23 17:19:33 | mysql-bin.016353 | 18876747 |
| mydb | hits | 2014-06-09 | 2017-12-23 22:10:29 | mysql-bin.016414 |  1595314 |
| mydb | hits | 2014-06-16 | 2017-12-23 02:59:37 | mysql-bin.016166 | 33999981 |
| mydb | hits | 2014-06-23 | 2017-12-23 18:09:33 | mysql-bin.016363 | 84498477 |
| mydb | hits | 2014-06-30 | 2017-12-23 06:08:59 | mysql-bin.016204 | 23792406 |
| mydb | hits | 2014-08-04 | 2017-12-23 18:09:33 | mysql-bin.016363 | 84499495 |
| mydb | hits | 2014-08-18 | 2017-12-23 18:09:33 | mysql-bin.016363 | 84500523 |
| mydb | hits | 2014-09-01 | 2017-12-23 06:09:19 | mysql-bin.016204 | 27120145 |
+------+------+------------+---------------------+------------------+----------+
Full Table Import
Now that we have our changelog capture in place, the next step is to initially populate the ClickHouse table from MySQL. Normally, we could easily do this with a mysqldump into a tab-separated format, but remember that we have to transform the created_at column from MySQL into ClickHouse's Date format to use it as the partitioning key.
A simple way to do this is with a short set of shell commands like the ones below:
SQL=$(cat <<EOF
SELECT
  id,
  DATE_FORMAT(created_at, "%Y-%m-%d"),
  type,
  user_id,
  location_id,
  UNIX_TIMESTAMP(created_at)
FROM hits
EOF
)
mysql -h source_db_host mydb -BNe "$SQL" > hist.txt
cat hist.txt | clickhouse-client -d mydb --query="INSERT INTO hits FORMAT TabSeparated"
One thing to note about this process: the mysql client buffers the results for the whole query, which could eat up all the memory on the server you run this from if the table is really large. To avoid this, dump the table in chunks of several million rows at a time. Since the changelog capture is already running from the previous step, you do not need to worry about changes made between chunks; those will be consolidated during the incremental refreshes.
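One hypothetical way to drive that chunking is to walk inclusive id ranges over the auto-incrementing primary key; this helper is a sketch, not part of the linked scripts.

```python
def id_chunks(min_id, max_id, chunk_size=1_000_000):
    """Yield inclusive (lo, hi) primary-key ranges covering the table,
    so each export query reads at most chunk_size ids at a time."""
    lo = min_id
    while lo <= max_id:
        hi = min(lo + chunk_size - 1, max_id)
        yield lo, hi
        lo = hi + 1
```

Each (lo, hi) pair gets appended to the export query as WHERE id BETWEEN lo AND hi, keeping the mysql client's result buffer bounded.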
Incremental Refresh
After initially populating the ClickHouse table, we set up our continuous incremental refresh using a separate script. A template script we use for the table in our example can be found in this gist.
What this script does is twofold:
- It determines the list of weeks recently modified based on clickhouse_changelog, dumps the rows for those weeks, and re-imports them to ClickHouse.
- If the current week is not in the list of modified weeks, it also checks for new rows based on the auto-incrementing primary key and appends them to the ClickHouse table.
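Under the hood, re-importing a modified week amounts to dropping the matching ClickHouse partition and reloading it from a fresh MySQL dump. Since the table is partitioned by toMonday(created_day), the partition value is the week's Monday. The helper below is a hypothetical sketch of the statement pair involved, not the gist's actual code.

```python
def refresh_statements(weeks, db="mydb", table="hits"):
    """For each modified week (a Monday, as toMonday() yields), emit the
    statement pair: drop the ClickHouse partition for that week, then
    re-insert rows freshly dumped from MySQL via clickhouse-client."""
    stmts = []
    for monday in weeks:
        stmts.append(f"ALTER TABLE {db}.{table} DROP PARTITION '{monday}'")
        stmts.append(f"INSERT INTO {db}.{table} FORMAT TabSeparated")
    return stmts
```

Dropping and reloading whole partitions is what makes this approach cheap: ClickHouse discards the old parts instantly instead of rewriting rows in place.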
An example output of this script would be:
ubuntu@mysql~/clickhouse$ bash clickhouse-incr-refresh.sh hits
2017-12-24_00_20_19 incr-refresh Starting changelog processing for hits
2017-12-24_00_20_19 incr-refresh Current week is: 2017-12-18
2017-12-24_00_20_19 incr-refresh Processing week: 2017-12-18
2017-12-24_00_20_20 incr-refresh Changelog import for hits complete
ubuntu@mysql~/clickhouse$ bash clickhouse-incr-refresh.sh hits
2017-12-24_00_20_33 incr-refresh Starting changelog processing for hits
2017-12-24_00_20_33 incr-refresh Current week is: 2017-12-18
2017-12-24_00_20_33 incr-refresh Weeks is empty, nothing to do
2017-12-24_00_20_33 incr-refresh Changelog import for hits complete
2017-12-24_00_20_33 incr-refresh Inserting new records for hits > id: 5213438
2017-12-24_00_20_33 incr-refresh No new rows found
2017-12-24_00_20_33 incr-refresh Incremental import for hits complete
ubuntu@mysql~/clickhouse$
Note that if you imported a really large table and the changelog had accumulated a large number of changes to refresh, the initial incremental execution might take some time. After that, though, it should be faster. This script can be run every minute, or at a longer or shorter interval, depending on how often you want the ClickHouse table to be refreshed.
To wrap up, here is a query from MySQL on the same table, versus ClickHouse.
mysql> SELECT COUNT(DISTINCT user_id) FROM hits WHERE created_at
    -> BETWEEN '2016-01-01 00:00:00' AND '2017-01-01 00:00:00';
+-------------------------+
| COUNT(DISTINCT user_id) |
+-------------------------+
|                 3023028 |
+-------------------------+
1 row in set (25.89 sec)
:) SELECT COUNT(DISTINCT user_id) FROM hits WHERE created_at BETWEEN 1451606400 AND 1483228800;

SELECT countDistinct(user_id)
FROM hits
WHERE (created_at >= 1451606400) AND (created_at <= 1483228800)

┌─uniqExact(user_id)─┐
│            3023028 │
└────────────────────┘

1 rows in set. Elapsed: 0.491 sec. Processed 35.82 million rows, 286.59 MB (73.01 million rows/s., 584.06 MB/s.)
Enjoy!