May
26
2022
--

Is MySQL Statement-Based / Mixed Replication Really Safe?

MySQL Statement-Based

MySQL Statement-BasedThe binary logging format in MySQL has been ROW by default since MySQL 5.7, yet there are still many users sticking with STATEMENT or MIXED formats for various reasons. In some cases, there is just simple hesitation from changing something that has worked for years on legacy applications. But in others, there may be serious blockers, most typically missing primary keys in badly designed schemas, which would lead to serious performance issues on the replicas.

As a Support Engineer, I can still see quite a few customers using STATEMENT or MIXED formats, even if they are already on MySQL 8.0. In many cases this is OK, but recently I had to deal with a pretty nasty case, where not using ROW format was found to cause the replicas to silently lose data updates, without raising any replication errors! Was it some really rare edge use case? Not at all! Let me demonstrate a very simple test case below to illustrate how easy it is to end up in such a bad situation.

— source 

mysql> select @@binlog_format,@@system_time_zone;
+-----------------+--------------------+
| @@binlog_format | @@system_time_zone |
+-----------------+--------------------+
| STATEMENT       | BST                |
+-----------------+--------------------+
1 row in set (0.00 sec)

mysql> CREATE TABLE `test1` (
    ->   `id` int(11) NOT NULL AUTO_INCREMENT,
    ->   `d` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    ->   `a` varchar(30) NOT NULL,
    ->   `name` varchar(25) DEFAULT NULL,
    ->   PRIMARY KEY (`a`),
    ->   UNIQUE KEY `id` (`id`)
    -> ) ENGINE=InnoDB;
Query OK, 0 rows affected, 1 warning (0.02 sec)

mysql> insert into test1 values (null,now(),"test1",0);
Query OK, 1 row affected (0.00 sec)

mysql> insert into test1 values (null,now(),"test2",0);
Query OK, 1 row affected (0.01 sec)

mysql> insert into test1 values (null,now(),"test3",0);
Query OK, 1 row affected (0.01 sec)

mysql> select * from test1;
+----+---------------------+-------+------+
| id | d                   | a     | name |
+----+---------------------+-------+------+
|  1 | 2022-05-22 10:13:37 | test1 | 0    |
|  2 | 2022-05-22 10:13:37 | test2 | 0    |
|  3 | 2022-05-22 10:13:38 | test3 | 0    |
+----+---------------------+-------+------+
3 rows in set (0.00 sec)

— replica

mysql> select @@system_time_zone;
+--------------------+
| @@system_time_zone |
+--------------------+
| UTC                |
+--------------------+
1 row in set (0.00 sec)

mysql> select * from db1.test1;
+----+---------------------+-------+------+
| id | d                   | a     | name |
+----+---------------------+-------+------+
|  1 | 2022-05-22 09:13:37 | test1 | 0    |
|  2 | 2022-05-22 09:13:37 | test2 | 0    |
|  3 | 2022-05-22 09:13:38 | test3 | 0    |
+----+---------------------+-------+------+
3 rows in set (0.00 sec)

— source 

mysql> UPDATE test1 SET name = 'foobar', d = CURRENT_TIMESTAMP WHERE a = 'test1' AND d = '2022-05-22 10:13:37';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from test1;
+----+---------------------+-------+--------+
| id | d                   | a     | name   |
+----+---------------------+-------+--------+
|  1 | 2022-05-22 10:16:15 | test1 | foobar |
|  2 | 2022-05-22 10:13:37 | test2 | 0      |
|  3 | 2022-05-22 10:13:38 | test3 | 0      |
+----+---------------------+-------+--------+
3 rows in set (0.00 sec)

— replica

mysql> select * from db1.test1;
+----+---------------------+-------+------+
| id | d                   | a     | name |
+----+---------------------+-------+------+
|  1 | 2022-05-22 09:13:37 | test1 | 0    |
|  2 | 2022-05-22 09:13:37 | test2 | 0    |
|  3 | 2022-05-22 09:13:38 | test3 | 0    |
+----+---------------------+-------+------+
3 rows in set (0.00 sec)

mysql> pager egrep "Running|SQL_Error"
PAGER set to 'egrep "Running|SQL_Error"'

mysql > show replica status\G
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
               Last_SQL_Error: 
    Replica_SQL_Running_State: Replica has read all relay log; waiting for more updates
     Last_SQL_Error_Timestamp: 
1 row in set (0.00 sec)

 

Another test, using UTC_TIME() compared against a column that was populated using the CURRENT_TIMESTAMP function:

— source 

mysql> select * from test1 WHERE TIME(d) > DATE_SUB(UTC_TIME(), INTERVAL 11 HOUR) AND id=3;
+----+---------------------+-------+------+
| id | d                   | a     | name |
+----+---------------------+-------+------+
|  3 | 2022-05-22 10:13:38 | test3 | 0    |
+----+---------------------+-------+------+
1 row in set (0.00 sec)

— replica

mysql> select * from test1 WHERE TIME(d) > DATE_SUB(UTC_TIME(), INTERVAL 11 HOUR) AND id=3;
Empty set (0.00 sec)

Therefore, when a similar condition is used for update:

— source

mysql> update test1 set name="bar" WHERE TIME(d) > DATE_SUB(UTC_TIME(), INTERVAL 11 HOUR) AND id=3;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from test1 where id=3;
+----+---------------------+-------+------+
| id | d                   | a     | name |
+----+---------------------+-------+------+
|  3 | 2022-05-22 22:12:15 | test3 | bar  |
+----+---------------------+-------+------+
1 row in set (0.01 sec)

— replica

mysql> select * from test1 where id=3;
+----+---------------------+-------+------+
| id | d                   | a     | name |
+----+---------------------+-------+------+
|  3 | 2022-05-22 09:13:38 | test3 | 0    |
+----+---------------------+-------+------+
1 row in set (0.01 sec)

mysql > show replica status\G
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
               Last_SQL_Error: 
    Replica_SQL_Running_State: Replica has read all relay log; waiting for more updates
     Last_SQL_Error_Timestamp: 
1 row in set (0.00 sec)

Again replica ignored the update while there is no replication error reported. This particular scenario can be expected to happen really often in geographically-distributed database environments.

As the functions used here are not considered unsafe for replication, two usual safety actions are not performed:

  • no warning is printed in the error log when the STATEMENT format used
  • replication event is not logged in RBR format when the MIXED format is used but rather left the original query as it is

I find this potentially dangerous and hence reported it: https://bugs.mysql.com/bug.php?id=107293

Summary

ROW-based replication has become the standard in MySQL and is the most reliable one. It is also the only one permitted for virtually synchronous replication solutions like Percona XtraDB Cluster/Galera and MySQL Group Replication.

At the same time, STATEMENT or even MIXED format, may lead to data consistency issues, which can be undetected for a long time, making it very difficult to investigate when finally replication error happens as a result.

If there is anything that has prevented you from switching to ROW format yet, the sooner you deal with it the better.

Dec
14
2016
--

Row Store and Column Store Databases

Row Store and Column Store

Row Store and Column StoreIn this blog post, we’ll discuss the differences between row store and column store databases.

Clients often ask us if they should or could be using columnar databases. For some applications, a columnar database is a great choice; for others, you should stick with the tried and true row-based option.

At a basic level, row stores are great for transaction processing. Column stores are great for highly analytical query models. Row stores have the ability to write data very quickly, whereas a column store is awesome at aggregating large volumes of data for a subset of columns.

One of the benefits of a columnar database is its crazy fast query speeds. In some cases, queries that took minutes or hours are completed in seconds. This makes columnar databases a good choice in a query-heavy environment. But you must make sure that the queries you run are really suited to a columnar database.

Data Storage

Let’s think about a basic database, like a stockbroker’s transaction records. In a row store, each client would have a record with their basic information – name, address, phone number, etc. – in a single table. It’s likely that each record would have a unique identifier. In our case, it would probably be an

account_number

.

There is another table that stored stock transactions. Again, each transaction is uniquely identified by something like a

transaction_id

. Each transaction is associated to one

account_number

, but each

account_number

 is associated with multiple transactions. This provides us with a one-to-many relationship, and is a classic example of a transactional database.

We store all these tables on a disk and, when we run a query, the system might access lots of data before it determines what information is relevant to the specific query. If we want to know the

account_number

,

first_name

,

last_name

,

stock

, and

purchase_price

 for a given time period, the system needs to access all of the information for the two tables, including fields that may not be relevant to the query. It then performs a join to relate the two tables’ data, and then it can return the information. This can be inefficient at scale, and this is just one example of a query that would probably run faster on a columnar database.

With a columnar database, each field from each table is stored in its own file or set of files. In our example database, all

account_number

 data is stored in one file, all

transaction_id

 data is stored in another file, and so on. This provides some efficiencies when running queries against wide tables, since it is unlikely that a query needs to return all of the fields in a single table. In the query example above, we’d only need to access the files that contained data from the requested fields. You can ignore all other fields that exist in the table. This ability to minimize i/o is one of the key reasons columnar databases can perform much faster.

Normalization Versus Denormalization

Additionally, many columnar databases prefer a denormalized data structure. In the example above, we have two separate tables: one for account information and one for transaction information. In many columnar databases, a single table could represent this information. With this denormalized design, when a query like the one presented is run, no joins would need to be processed in the columnar database, so the query will likely run much faster.

The reason for normalizing data is that it allows data to be written to the database in a highly efficient manner. In our row store example, we need to record just the relevant transaction details whenever an existing customer makes a transaction. The account information does not need to be written along with the transaction data. Instead, we reference the

account_number

 to gain access to all of the fields in the accounts table.

The place where a columnar database really shines is when we want to run a query that would, for example, determine the average price for a specific stock over a range of time. In the case of the columnar database, we only need a few fields – 

symbol

,

price

, and

transaction_date

– in order to complete the query. With a row store, we would gather additional data that was not needed for the query but was still part of the table structure.

Normalization of data also makes updates to some information much more efficient in a row store. If you change an account holder’s address, you simply update the one record in the accounts table. The updated information is available to all transactions completed by that account owner. In the columnar database, since we might store the account information with the transactions of that user, many records might need updating in order update the available address information.

Conclusion

So, which one is right for you? As with so many things, it depends. You can still perform data analysis with a row-based database, but the queries may run slower than they would on a column store. You can record transactions in a column-based model, but the writes may takes longer to complete. In an ideal world, you would have both options available to you, and this is what many companies are doing.

In most cases, the initial write is to a row-based system. We know them, we love them, we’ve worked with them forever. They’re kind of like that odd relative who has some real quirks. We’ve learned the best ways to deal with them.

Then, we write the data (or the relevant parts of the data) to a column based database to allow for fast analytic queries.

Both databases incurred write transactions, and both also likely incur read transactions. Due to the fact that a column-based database has each column’s data in a separate file, it is less than ideal for a “SELECT * FROM…” query, since the request must access numerous files to process the request. Similarly, any query that selects a single or small subset of files will probably perform better in a row store. The column store is awesome for performing aggregation over large volumes of data. Or when you have queries that only need a few fields from a wide table.

It can be tough to decide between the two if you only have one database. But it is more the norm that companies support multiple database platforms for multiple uses. Also, your needs might change over time. The sports car you had when you were single is less than optimal for your current family of five. But, if you could, wouldn’t you want both the sports car and the minivan? This is why we often see both database models in use within a single company.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com