Jan
16
2018
--

Updating/Deleting Rows From Clickhouse (Part 2)

ClickHouse

ClickHouseIn this post, we’ll look at updating and deleting rows with ClickHouse. It’s the second of two parts.

In the first part of this post, we described the high-level overview of implementing incremental refresh on a ClickHouse table as an alternative support for UPDATE/DELETE. In this part, we will show you the actual steps and sample code.

Prepare Changelog Table

First, we create the changelog table below. This can be stored on any other MySQL instance separate from the source of our analytics table. When we run the change capture script, it will record the data on this table that we can consume later with the incremental refresh script:

CREATE TABLE `clickhouse_changelog` (
  `db` varchar(255) NOT NULL DEFAULT '',
  `tbl` varchar(255) NOT NULL DEFAULT '',
  `created_at` date NOT NULL,
  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `log_file` varchar(255) NOT NULL,
  `log_pos` int(10) unsigned NOT NULL,
  PRIMARY KEY (`db`,`tbl`,`created_at`),
  KEY `log_file` (`log_file`,`log_pos`)
) ENGINE=InnoDB;

Create ClickHouse Table

Next, let’s create the target ClickhHouse table. Remember, that the corresponding MySQL table is below:

CREATE TABLE `hits` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `type` varchar(100) DEFAULT NULL,
  `user_id` int(11) NOT NULL,
  `location_id` int(11) NOT NULL,
  `created_at` datetime DEFAULT NULL
  PRIMARY KEY (`id`),
  KEY `created_at` (`created_at`)
) ENGINE=InnoDB;

Converting this table to ClickHouse looks like below, with the addition of a “created_day” column that serves as the partitioning key:

CREATE TABLE hits (
  id Int32,
  created_day Date,
  type String,
  user_id Int32,
  location_id Int32,
  created_at Int32
) ENGINE = MergeTree PARTITION BY toMonday(created_day)
ORDER BY (created_at, id) SETTINGS index_granularity = 8192;

Run Changelog Capture

Once the tables are ready, running the change capture script. An example script can be found in this gist, which is written in Python and uses the python-mysql-replication library. This library acts as replication client, continuously downloads the binary logs from the source and sifts through it to find any UPDATE/DELETE executed against our source table.

There are a few configuration options that need to be customized in the script.

  • LOG_DB_HOST: The MySQL host where we created the
    clickhouse_changelog

     table.

  • LOG_DB_NAME: The database name where the
    clickhouse_changelog

     table is created.

  • SRC_DB_HOST: The MySQL host where we will be downloading binary logs from. This can either be a primary or secondary/replica as long as its the same server where our raw table is also located.
  • MYSQL_USER: MySQL username.
  • MYSQL_PASS: MySQL password.
  • TABLE: The table we want to watch for changes.

When the script is successfully configured and running, the

clickhouse_changelog

 table should start populating with data like below.

mysql> select * from mydb.clickhouse_changelog;
+------+------+------------+---------------------+------------------+-----------+
| db   | tbl  | created_at | updated_at          | log_file         | log_pos   |
+------+------+------------+---------------------+------------------+-----------+
| mydb | hits | 2014-06-02 | 2017-12-23 17:19:33 | mysql-bin.016353 |  18876747 |
| mydb | hits | 2014-06-09 | 2017-12-23 22:10:29 | mysql-bin.016414 |   1595314 |
| mydb | hits | 2014-06-16 | 2017-12-23 02:59:37 | mysql-bin.016166 |  33999981 |
| mydb | hits | 2014-06-23 | 2017-12-23 18:09:33 | mysql-bin.016363 |  84498477 |
| mydb | hits | 2014-06-30 | 2017-12-23 06:08:59 | mysql-bin.016204 |  23792406 |
| mydb | hits | 2014-08-04 | 2017-12-23 18:09:33 | mysql-bin.016363 |  84499495 |
| mydb | hits | 2014-08-18 | 2017-12-23 18:09:33 | mysql-bin.016363 |  84500523 |
| mydb | hits | 2014-09-01 | 2017-12-23 06:09:19 | mysql-bin.016204 |  27120145 |
+------+------+------------+---------------------+------------------+-----------+

Full Table Import

So we have our changelog capture in place, the next step is to initially populate the ClickHouse table from MySQL. Normally, we can easily do this with a 

mysqldump

 into a tab-separated format, but remember we have to transform the

created_at

 column from MySQL into ClickHouse’s

Date

 format to be used as partitioning key.

A simple way to do this is by using a simple set of shell commands like below:

SQL=$(cat <<EOF
SELECT
	id, DATE_FORMAT(created_at, "%Y-%m-%d"),
	type, user_id, location_id, UNIX_TIMESTAMP(created_at)
FROM hits
EOF
)
mysql -h source_db_host mydb -BNe "$sql" > hist.txt
cat hist.txt | clickhouse-client -d mydb --query="INSERT INTO hits FORMAT TabSeparated"

One thing to note about this process is that the

MySQL

 client buffers the results for the whole query, and it could eat up all the memory on the server you run this from if the table is really large. To avoid this, chunk the table into several million rows at a time. Since we already have the changelog capture running and in place from the previous step, you do not need to worry about any changes between chunks. We will consolidate those changes during the incremental refreshes.

Incremental Refresh

After initially populating the ClickHouse table, we then set up our continuous incremental refresh using a separate script. A template script we use for the table on our example can be found in this gist.

What this script does is twofold:

  • Determines the list of weeks recently modified based on
    clickhouse_changelog

    , dump rows for those weeks and re-imports to ClickHouse.

  • If the current week is not on the list of those with modifications, it also checks for new rows based on the auto-incrementing primary key and appends them to the ClickHouse table.

An example output of this script would be:

ubuntu@mysql~/clickhouse$ bash clickhouse-incr-refresh.sh hits
2017-12-24_00_20_19 incr-refresh Starting changelog processing for hits
2017-12-24_00_20_19 incr-refresh Current week is: 2017-12-18
2017-12-24_00_20_19 incr-refresh Processing week: 2017-12-18
2017-12-24_00_20_20 incr-refresh Changelog import for hits complete
ubuntu@mysql~/clickhouse$ bash clickhouse-incr-refresh.sh hits
2017-12-24_00_20_33 incr-refresh Starting changelog processing for hits
2017-12-24_00_20_33 incr-refresh Current week is: 2017-12-18
2017-12-24_00_20_33 incr-refresh Weeks is empty, nothing to do
2017-12-24_00_20_33 incr-refresh Changelog import for hits complete
2017-12-24_00_20_33 incr-refresh Inserting new records for hits > id: 5213438
2017-12-24_00_20_33 incr-refresh No new rows found
2017-12-24_00_20_33 incr-refresh Incremental import for hits complete
ubuntu@mysql~/clickhouse$

Note that, on step 4, if you imported a really large table and the changelog had accumulated a large number of changes to refresh, the initial incremental execution might take some time. After that though, it should be faster. This script can be run every minute, longer or shorter, depending on how often you want the ClickHouse table to be refreshed.

To wrap up, here is a query from MySQL on the same table, versus ClickHouse.

mysql> SELECT COUNT(DISTINCT user_id) FROM hits WHERE created_at
    -> BETWEEN '2016-01-01 00:00:00' AND '2017-01-01 00:00:00';
+-------------------------+
| COUNT(DISTINCT user_id) |
+-------------------------+
|                 3023028 |
+-------------------------+
1 row in set (25.89 sec)

:) SELECT COUNT(DISTINCT user_id) FROM hits WHERE created_at BETWEEN 1451606400 AND 1483228800;
SELECT COUNTDistinct(user_id)
FROM hits
WHERE (created_at >= 1451606400) AND (created_at <= 1483228800)
??uniqExact(user_id)??
?            3023028 ?
??????????????????????
1 rows in set. Elapsed: 0.491 sec. Processed 35.82 million rows, 286.59 MB (73.01 million rows/s., 584.06 MB/s.)

Enjoy!

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com