Apr
02
2019
--

How to handle dark data compliance risk at your company

Slack and other consumer-grade productivity tools have been taking off in workplaces large and small — and data governance hasn’t caught up.

Whether it’s litigation, compliance with regulations like GDPR or concerns about data breaches, legal teams need to account for new types of employee communication. And that’s hard when work is happening across the latest messaging apps and SaaS products, which make data searchability and accessibility more complex.

Here’s a quick look at the problem, followed by our suggestions for best practices at your company.

Problems

The increasing frequency of reported data breaches and expanding jurisdiction of new privacy laws are prompting conversations about dark data and risks at companies of all sizes, even small startups. Data risk discussions necessarily include the risk of a data breach, as well as preservation of data. Just two weeks ago it was reported that Jared Kushner used WhatsApp for official communications and screenshots of those messages for preservation, which commentators say complies with record keeping laws but raises questions about potential admissibility as evidence.

Dec
07
2018
--

MySQL 8 and The FRM Drop… How To Recover Table DDL

MySQL 8 frm drop recover ddl

… or what I should keep in mind in case of disaster

MySQL 8 frm drop recover ddl

To retrieve and maintain in SQL format the definition of all tables in a database, is a best practice that we all should adopt. To have that under version control is also another best practice to keep in mind.

While doing that may seem redundant, it can become a life saver in several situations. From the need to review what has historically changed in a table, to knowing who changed what and why… to when you need to recover your data and have your beloved MySQL instance not start…

But let’s be honest, only a few do the right thing, and even fewer keep that information up to date. Given that’s the case, what can we do when we have the need to discover/recover the table structure?

From the beginning, MySQL has used some external files to describe its internal structure.

For instance, if I have a schema named windmills and a table named wmillAUTOINC1, on the file system I will see this:

-rw-r-----. 1 mysql mysql     8838 Mar 14 2018 wmillAUTOINC1.frm
-rw-r-----. 1 mysql mysql   131072 Mar 14 2018 wmillAUTOINC1.ibd

The ibd file contains the data, while the frm file contains the structure information.

Putting aside ANY discussion about if this is safe, if it’s transactional and more… when we’ve experienced some major crash and data corruption this approach has been helpful. Being able to read from the frm file was the easiest way to get the information we need.
Simple tools like DBSake made the task quite trivial, and allowed us to script table definition when needed to run long, complex tedious data recovery:

[root@master1 windmills]# /opt/tools/dbsake frmdump wmillAUTOINC1.frm
--
-- Table structure for table `wmillAUTOINC1`
-- Created with MySQL Version 5.7.20
--
CREATE TABLE `wmillAUTOINC1` (
  `id` bigint(11) NOT NULL AUTO_INCREMENT,
  `uuid` char(36) COLLATE utf8_bin NOT NULL,
  `millid` smallint(6) NOT NULL,
  `kwatts_s` int(11) NOT NULL,
  `date` date NOT NULL,
  `location` varchar(50) COLLATE utf8_bin NOT NULL,
  `active` tinyint(2) NOT NULL DEFAULT '1',
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `strrecordtype` char(3) COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`),
  KEY `IDX_millid` (`millid`,`active`),
  KEY `IDX_active` (`id`,`active`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin ROW_FORMAT=DYNAMIC;

Of course, if the frm file was also corrupt, then we could try to get the information from the ibdata dictionary. If that is corrupted too (trust me I’ve seen all of these situations) … well a last resource was hoping the customer has a recent table definition stored somewhere, but as mentioned before, we are not so diligent, are we?

Now, though, in MySQL8 we do not have FRM files, they were dropped. Even more interesting is that we do not have the same dictionary, most of the things that we knew have changed, including the dictionary location. So what can be done?

Well Oracle have moved the FRM information—and more—to what is called Serialized Dictionary Information (SDI), the SDI is written INSIDE the ibd file, and represents the redundant copy of the information contained in the data dictionary.

The SDI is updated/modified by DDL operations on tables that reside in that tablespace. This is it: if you have one file per table normally, then you will have in that file ONLY the SDI for that table, but if you have multiple tables in a tablespace, the SDI information will refer to ALL of the tables.

To extract this information from the IBD files, Oracle provides a utility called ibd2sdi. This application parses the SDI information and reports a JSON file that can be easily manipulated to extract and build the table definition.

One exception is represented by Partitioned tables. The SDI information is contained ONLY in the first partition, and if you drop it, it is moved to the next one. I will show that later.

But let’s see how it works. In the next examples I will look for the table’s name, attributes, and datatype starting from the dictionary tables.

To obtain the info I will do this:

/opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/mysql.ibd |jq  '.[]?|.[]?|.dd_object?|("------------------------------------"?,"TABLE NAME = ",.name?,"****",(.columns?|.[]?|(.name?,.column_type_utf8?)))'

The result will be something like:

"------------------------------------"
"TABLE NAME = "
"tables"
"****"
"id"
"bigint(20) unsigned"
"schema_id"
"bigint(20) unsigned"
"name"
"varchar(64)"
"type"
"enum('BASE TABLE','VIEW','SYSTEM VIEW')"
"engine"
"varchar(64)"
"mysql_version_id"
"int(10) unsigned"
"row_format"
"enum('Fixed','Dynamic','Compressed','Redundant','Compact','Paged')"
"collation_id"
"bigint(20) unsigned"
"comment"
"varchar(2048)"
<snip>
"------------------------------------"
"TABLE NAME = "
"tablespaces"
"****"
"id"
"bigint(20) unsigned"
"name"
"varchar(259)"
"options"
"mediumtext"
"se_private_data"
"mediumtext"
"comment"
"varchar(2048)"
"engine"
"varchar(64)"
"DB_TRX_ID"
""
"DB_ROLL_PTR"
""

I cut the output for brevity, but if you run the above command yourself you’ll be able to see that this retrieves the information for ALL the tables residing in the IBD.

The other thing I hope you noticed is that I am NOT parsing ibdata, but mysql.ibd. Why? Because the dictionary was moved out from ibdata and is now in mysql.ibd.

Look what happens if I try to parse ibdata:

[root@master1 ~]# /opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/ibdata1 |jq '.'
[INFO] ibd2sdi: SDI is empty.

Be very careful here to not mess up your mysql.ibd file.

Now what can I do to get information about my wmillAUTOINC1 table in MySQL8?

That is quite simple:

/opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/windmills/wmillAUTOINC.ibd |jq '.'
[
  "ibd2sdi",
  {
    "type": 1,
    "id": 1068,
    "object": {
      "mysqld_version_id": 80013,
      "dd_version": 80013,
      "sdi_version": 1,
      "dd_object_type": "Table",
      "dd_object": {
        "name": "wmillAUTOINC",
        "mysql_version_id": 80011,
        "created": 20180925095853,
        "last_altered": 20180925095853,
        "hidden": 1,
        "options": "avg_row_length=0;key_block_size=0;keys_disabled=0;pack_record=1;row_type=2;stats_auto_recalc=0;stats_sample_pages=0;",
        "columns": [
          {
            "name": "id",
            "type": 9,
            "is_nullable": false,
            "is_zerofill": false,
            "is_unsigned": false,
            "is_auto_increment": true,
            "is_virtual": false,
            "hidden": 1,
            "ordinal_position": 1,
            "char_length": 11,
            "numeric_precision": 19,
            "numeric_scale": 0,
            "numeric_scale_null": false,
            "datetime_precision": 0,
            "datetime_precision_null": 1,
            "has_no_default": false,
            "default_value_null": false,
            "srs_id_null": true,
            "srs_id": 0,
            "default_value": "AAAAAAAAAAA=",
            "default_value_utf8_null": true,
            "default_value_utf8": "",
            "default_option": "",
            "update_option": "",
            "comment": "",
            "generation_expression": "",
            "generation_expression_utf8": "",
            "options": "interval_count=0;",
            "se_private_data": "table_id=1838;",
            "column_key": 2,
            "column_type_utf8": "bigint(11)",
            "elements": [],
            "collation_id": 83,
            "is_explicit_collation": false
          },
<SNIP>
        "indexes": [
          {
            "name": "PRIMARY",
            "hidden": false,
            "is_generated": false,
            "ordinal_position": 1,
            "comment": "",
            "options": "flags=0;",
            "se_private_data": "id=2261;root=4;space_id=775;table_id=1838;trx_id=6585972;",
            "type": 1,
            "algorithm": 2,
            "is_algorithm_explicit": false,
            "is_visible": true,
            "engine": "InnoDB",
<Snip>
        ],
        "foreign_keys": [],
        "partitions": [],
        "collation_id": 83
      }
    }
  },
  {
    "type": 2,
    "id": 780,
    "object": {
      "mysqld_version_id": 80011,
      "dd_version": 80011,
      "sdi_version": 1,
      "dd_object_type": "Tablespace",
      "dd_object": {
        "name": "windmills/wmillAUTOINC",
        "comment": "",
        "options": "",
        "se_private_data": "flags=16417;id=775;server_version=80011;space_version=1;",
        "engine": "InnoDB",
        "files": [
          {
            "ordinal_position": 1,
            "filename": "./windmills/wmillAUTOINC.ibd",
            "se_private_data": "id=775;"
          }
        ]
      }
    }
  }
]

The JSON will contains:

  • A section describing the DB object at high level
  • Array of columns and related information
  • Array of indexes
  • Partition information (not here but in the next example)
  • Table space information

That is a lot more detail compared to what we had in the FRM, and it is quite relevant and interesting information as well.

Once you have extracted the SDI, any JSON parser tool script can generate the information for the SQL DDL.

I mention partitions, so let’s look at this a bit more, given they can be tricky.

As mentioned, the SDI information is present ONLY in the first partition. All other partitions hold ONLY the tablespace information. Given that, then the first thing to do is to identify which partition is the first… OR simply try to access all partitions, and when you are able to get the details, extract them.

The process is the same:

[root@master1 ~]# /opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170301.ibd |jq '.'
[
  "ibd2sdi",
  {
    "type": 1,
    "id": 1460,
    "object": {
      "mysqld_version_id": 80013,
      "dd_version": 80013,
      "sdi_version": 1,
      "dd_object_type": "Table",
      "dd_object": {
        "name": "wmillAUTOINCPART",
        "mysql_version_id": 80013,
        "created": 20181125110300,
        "last_altered": 20181125110300,
        "hidden": 1,
        "options": "avg_row_length=0;key_block_size=0;keys_disabled=0;pack_record=1;row_type=2;stats_auto_recalc=0;stats_sample_pages=0;",
        "columns": [<snip>
    	  "schema_ref": "windmills",
        "se_private_id": 18446744073709552000,
        "engine": "InnoDB",
        "last_checked_for_upgrade_version_id": 80013,
        "comment": "",
        "se_private_data": "autoinc=31080;version=2;",
        "row_format": 2,
        "partition_type": 7,
        "partition_expression": "to_days(`date`)",
        "partition_expression_utf8": "to_days(`date`)",
        "default_partitioning": 1,
        "subpartition_type": 0,
        "subpartition_expression": "",
        "subpartition_expression_utf8": "",
        "default_subpartitioning": 0,
       ],
<snip>
        "foreign_keys": [],
        "partitions": [
          {
            "name": "PT20170301",
            "parent_partition_id": 18446744073709552000,
            "number": 0,
            "se_private_id": 1847,
            "description_utf8": "736754",
            "engine": "InnoDB",
            "comment": "",
            "options": "",
            "se_private_data": "autoinc=0;version=0;",
            "values": [
              {
                "max_value": false,
                "null_value": false,
                "list_num": 0,
                "column_num": 0,
                "value_utf8": "736754"
              }
            ],

The difference, as you can see, is that the section related to partitions and sub partitions will be filled with all the details you might need to recreate the partitions.

We will have:

  • Partition type
  • Partition expression
  • Partition values
  • …more

Same for sub partitions.

Now again see what happens if I parse the second partition:

[root@master1 ~]# /opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170401.ibd |jq '.'
[
  "ibd2sdi",
  {
    "type": 2,
    "id": 790,
    "object": {
      "mysqld_version_id": 80011,
      "dd_version": 80011,
      "sdi_version": 1,
      "dd_object_type": "Tablespace",
      "dd_object": {
        "name": "windmills/wmillAUTOINCPART#P#PT20170401",
        "comment": "",
        "options": "",
        "se_private_data": "flags=16417;id=785;server_version=80011;space_version=1;",
        "engine": "InnoDB",
        "files": [
          {
            "ordinal_position": 1,
            "filename": "./windmills/wmillAUTOINCPART#P#PT20170401.ibd",
            "se_private_data": "id=785;"
          }
        ]
      }
    }
  }
]

I will get only the information about the tablespace, not the table.

As promised let me show you now what happens if I delete the first partition, and the second partition becomes the first:

(root@localhost) [windmills]>alter table wmillAUTOINCPART drop partition PT20170301;
Query OK, 0 rows affected (1.84 sec)
Records: 0  Duplicates: 0  Warnings: 0
[root@master1 ~]# /opt/mysql_templates/mysql-8P/bin/./ibd2sdi   /opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170401.ibd |jq '.'|more
[
  "ibd2sdi",
  {
    "type": 1,
    "id": 1461,
    "object": {
      "mysqld_version_id": 80013,
      "dd_version": 80013,
      "sdi_version": 1,
      "dd_object_type": "Table",
      "dd_object": {
        "name": "wmillAUTOINCPART",
        "mysql_version_id": 80013,
        "created": 20181129130834,
        "last_altered": 20181129130834,
        "hidden": 1,
        "options": "avg_row_length=0;key_block_size=0;keys_disabled=0;pack_record=1;row_type=2;stats_auto_recalc=0;stats_sample_pages=0;",
        "columns": [
          {
            "name": "id",
            "type": 9,
            "is_nullable": false,
            "is_zerofill": false,
            "is_unsigned": false,
            "is_auto_increment": true,
            "is_virtual": false,
            "hidden": 1,
            "ordinal_position": 1,

As I mentioned before, each DDL updates the SDI, and here we go: I will have all the information on what’s NOW the FIRST partition. Please note the value of the attribute “created” between the first time I queried the other partition, and the one that I have now:

/opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170301.ibd
       "created": 20181125110300,
/opt/mysql_instances/master8/data/windmills/wmillAUTOINCPART#P#PT20170401.ibd
       "created": 20181129130834,

To be clear the second created is NOW (PT20170401) from when I dropped the other partition (PT20170301).

Conclusions

In the end, this solution is definitely more powerful than the FRM files. It will allow us to parse the file and identify the table definition more easily, providing us with much more detail and information.

The problems will arise if and when the IBD file becomes corrupt.

As for the manual:  For InnoDB, an SDI record requires a single index page, which is 16KB in size by default. However, SDI data is compressed to reduce the storage footprint.

By which it means that for each table I have a page, if I associate record=table. Which means that in case of IBD corruption I should (likely) be able to read those pages. Unless I have bad (very bad) luck.

I still wonder how the dimension of an IBD affects the SDI retrieval, but given I have not tried it yet I will have to let you know.

As an aside, I am working on a script to facilitate the generation of the SQL, it’s not yet ready but you can find it here

Last note but keep this in mind! It is stated in the manual but in a hidden place and in small letters:
DDL operations take longer due to writing to storage, undo logs, and redo logs instead of .frm files.

References

https://stedolan.github.io/jq/

https://dev.mysql.com/doc/refman/8.0/en/ibd2sdi.html

https://dev.mysql.com/doc/refman/8.0/en/serialized-dictionary-information.html

https://dev.mysql.com/doc/refman/8.0/en/data-dictionary-limitations.html


Photo by chuttersnap on Unsplash

Nov
01
2018
--

How To Best Use Percona Server Column Compression With Dictionary

Database Compression

column compressionVery often, database performance is affected by the inability to cache all the required data in memory. Disk IO, even when using the fastest devices, takes much more time than a memory access. With MySQL/InnoDB, the main memory cache is the InnoDB buffer pool. There are many strategies we can try to fit as much data as possible in the buffer pool, and one of them is data compression.

With regular MySQL, to compress InnoDB data you can either use “Barracuda page compression” or “transparent page compression with punch holes”. The use of the ZFS filesystem is another possibility, but it is external to MySQL and doesn’t help with caching. All these solutions are transparent, but often they also have performance and management implications. If you are using Percona Server for MySQL, you have yet another option, “column compression with dictionary“. This feature is certainly not receiving the attention it merits. I think it is really cool—let me show you why.

We all know what compression means, who has not zipped a file before attaching it to an email? Compression removes redundancy from a file. What about the dictionary? A compression dictionary is a way to seed the compressor with expected patterns, in order to improve the compression ratio. Because you can specify a dictionary, the scope of usefulness of column compression with the Percona Server for MySQL feature is greatly increased. In the following sections, we’ll review the impacts of a good dictionary, and devise a way to create a good one without any guessing.

A simple use case

A compression algorithm needs a minimal amount of data in order to achieve a reasonable compression ratio. Typically, if the object is below a few hundred bytes, there is rarely enough data to have repetitive patterns and when the compression header is added, the compressed data can end up larger than the original.

mysql> select length('Hi!'), length(compress('Hi!'));
+---------------+-------------------------+
| length('Hi!') | length(compress('Hi!')) |
+---------------+-------------------------+
|             3 |                      15 |
+---------------+-------------------------+
1 row in set (0.02 sec)

Compressing a string of three bytes results in a binary object of 15 bytes. That’s counter productive.

In order to illustrate the potential of the dictionary, I used this dataset:

http://skeeto.s3.amazonaws.com/share/JEOPARDY_QUESTIONS1.json.gz

It is a set of 100k Jeopardy questions written in JSON. To load the data in MySQL, I created the following table:

mysql> show create table TestColCompression\G
*************************** 1. row ***************************
Table: TestColCompression
Create Table: CREATE TABLE `TestColCompression` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`question` text NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=79977 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Then, I did some formatting to create insert statements:

zcat JEOPARDY_QUESTIONS1.json.gz | perl -p -e 's/\[\{/\{/g' | perl -p -e 's/\}, \{/\}\n\{/g' | perl -p -e "s/'/''/g" | \
  (while read line; do echo "insert into testColComp (questionJson) values ('$line');"; done )

And I executed the inserts. About 20% of the rows had some formatting issues but nevertheless, I ended up with close to 80k rows:

mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 78110
Avg_row_length: 316
Data_length: 24690688
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 15:16:41
Update_time: 2018-10-26 15:40:34
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

The average row length is 316 bytes for a total data size of 23.55MB. The question JSON objects are large enough to matter, but barely large enough for compression. Here are the first five rows:

mysql> select question from TestColCompression limit 5\G
*************************** 1. row ***************************
question: {"category": "HISTORY", "air_date": "2004-12-31", "question": "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'", "value": "$200", "answer": "Copernicus", "round": "Jeopardy!", "show_number": "4680"}
*************************** 2. row ***************************
question: {"category": "ESPN's TOP 10 ALL-TIME ATHLETES", "air_date": "2004-12-31", "question": "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'", "value": "$200", "answer": "Jim Thorpe", "round": "Jeopardy!", "show_number": "4680"}
*************************** 3. row ***************************
question: {"category": "EVERYBODY TALKS ABOUT IT...", "air_date": "2004-12-31", "question": "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'", "value": "$200", "answer": "Arizona", "round": "Jeopardy!", "show_number": "4680"}
*************************** 4. row ***************************
question: {"category": "OLD FOLKS IN THEIR 30s", "air_date": "2009-05-08", "question": "'The district of conservative rep. Patrick McHenry in this state includes Mooresville, a home of NASCAR'", "value": "$800", "answer": "North Carolina", "round": "Jeopardy!", "show_number": "5690"}
*************************** 5. row ***************************
question: {"category": "MOVIES & TV", "air_date": "2009-05-08", "question": "'Tim Robbins played a public TV newsman in "Anchorman: The Legend of" him'", "value": "$800", "answer": "Ron Burgundy", "round": "Jeopardy!", "show_number": "5690"}

Let’s begin by a straight column compression without specifying a dictionary:

mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED;
Query OK, 79976 rows affected (4.25 sec)
Records: 79976 Duplicates: 0 Warnings: 0
mysql> analyze table TestColCompression;
+----------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| colcomp.TestColCompression | analyze | status | OK |
+----------------------------+---------+----------+----------+
mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 78995
Avg_row_length: 259
Data_length: 20496384
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 15:47:56
Update_time: 2018-10-26 15:47:56
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

As expected the data didn’t compress much. The compression ration is 0.82 or if expressed as a percentage, 18%. Since the JSON headers are always the same, and are present in all questions, we should minimally use them for the dictionary. Trying a minimal dictionary made of the headers gives:

mysql> SET @dictionary_data = 'category' 'air_date' 'question' 'value' 'answer' 'round' 'show_number' ;
Query OK, 0 rows affected (0.01 sec)
mysql> CREATE COMPRESSION_DICTIONARY simple_dictionary (@dictionary_data);
Query OK, 0 rows affected (0.00 sec)
mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY simple_dictionary;
Query OK, 79976 rows affected (4.72 sec)
Records: 79976 Duplicates: 0 Warnings: 0
mysql> analyze table TestColCompression;
+----------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| colcomp.TestColCompression | analyze | status | OK |
+----------------------------+---------+----------+----------+
1 row in set (0.00 sec)
mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 78786
Avg_row_length: 246
Data_length: 19447808
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 17:58:17
Update_time: 2018-10-26 17:58:17
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

There is a little progress, we now have a compression ratio of 0.79. Obviously, we could do more but without a tool, we’ll have to guess. A compressor like zlib builds a dictionary as part of its compression effort, could we use that? Yes, but only if we can generate it correctly and access the result. That’s not readily available with the common compressors I know. Fortunately, someone else had the same issue and wrote a compressor able to save its dictionary. Please let me introduce femtozip.

Femtozip to the rescue

The tool, by itself, has no magic algorithm. It is based on zlib, from what I can understand from the code. Anyway, we won’t compress anything with it, we’ll use it to generate a good dictionary. In order to create a dictionary, the tool looks at a set of files and try to see patterns between them. The use of a single big file defeat the purpose. So, I generated one file per question with:

mkdir questions
cd questions
l=1; mysql -u blog -pblog colcomp -e 'select question from TestColCompression' | (while read line; do echo $line > ${l}; let l=l+1; done)

Then, I used the following command to generate a 1024 bytes dictionary using all the files starting by “1”:

../femtozip/cpp/fzip/src/fzip --model ../questions_1s.mod --build --dictonly --maxdict 1024 1*
Building dictionary...

In about 10s the job was done. I tried with all the 80k files and… I had to kill the process after thirty minutes. Anyway, there are 11111 files starting with “1”, a very decent sample. Our generated dictionary looks like:

cat ../questions_1s.mod
", "air_date", "round": "Double Jeopardy!", "show_number": " of this for 00", "answer": "the 0", "question": "'e", "round": "Jeopardy!", "show_number": "r", "round": "{"cate gory": "S", "air_date": "1998-s", "round": "Double Jeopardy!", "show_number": " of the ", "air_date": "2008-{"category": "THE {"category": "As", "round": "Jeopardy!", "show_number": "4", "question": "'Jeopardy!", "show_number": "2'", "value": "$1000", "answer": "7", "question": "'The ", "question": "'A'", "value": "$600", "answer": "9", "questi on": "'In ", "question": "'This 3", "question": "'2", "question": "'e'", "value": "$", "round": "Double Jeopardy!", "show_number": "4", "round": "Jeopardy!", "show_number": "4"'", "value": "$S", "air_date": "199", "round": "Double Jeopardy!", "show_number": "5s'", "value": "$", "round": "Double Jeopardy!", "show_number": "3", "round": "Jeopardy !", "show_number": "3", "round": "Jeopardy!", "show_number": "5'", "value": "$200", "answer": "'", "value": "$800", "answer": "'", "value": "$400", "answer": "

With some formatting, I was able to create a dictionary with the above data:

mysql> SET @dictionary_data = '", "air_date", "round": "Double Jeopardy!", "show_number": " of this for 00", "answer": "the 0", "question": "''e", "round": "Jeopardy!", "show_number": "r", "round": "{"category": "S", "air_date": "1998-s", "round": "Double Jeopardy!", "show_number": " of the ", "air_date": "2008-{"category": "THE {"category": "As", "round": "Jeopardy!", "show_number": "4", "question": "''Jeopardy!", "show_number": "2''", "value": "$1000", "answer": "7", "question": "''The ", "question": "''A''", "value": "$600", "answer": "9", "question": "''In ", "question": "''This 3", "question": "''2", "question": "''e''", "value": "$", "round": "Double Jeopardy!", "show_number": "4", "round": "Jeopardy!", "show_number": "4"''", "value": "$S", "air_date": "199", "round": "Double Jeopardy!", "show_number": "5s''", "value": "$", "round": "Double Jeopardy!", "show_number": "3", "round": "Jeopardy!", "show_number": "3", "round": "Jeopardy!", "show_number": "5''", "value": "$200", "answer": "''", "value": "$800", "answer": "''", "value": "$400", "answer": "' ;
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE COMPRESSION_DICTIONARY femtozip_dictionary (@dictionary_data);
Query OK, 0 rows affected (0.00 sec)
<\pre>
And then, I altered the table to use the new dictionary:

mysql> alter table TestColCompression modify question text COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY femtozip_dictionary;
Query OK, 79976 rows affected (4.05 sec)
Records: 79976 Duplicates: 0 Warnings: 0
mysql> analyze table TestColCompression;
+----------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| colcomp.TestColCompression | analyze | status | OK |
+----------------------------+---------+----------+----------+
1 row in set (0.00 sec)
mysql> show table status\G
*************************** 1. row ***************************
Name: TestColCompression
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 79861
Avg_row_length: 190
Data_length: 15220736
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: 79977
Create_time: 2018-10-26 17:56:09
Update_time: 2018-10-26 17:56:09
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

That’s interesting, we are now achieving a ratio of 0.61, a significant improvement. I pushed my luck and tried with a 2048 bytes dictionary. It further reduced the ratio to 0.57 but that was about the best I got. Larger dictionaries didn’t lower the ratio below 0.57. Zlib supports up to 32KB for the dictionary.

So, to recap:

  • column compression without dictionary, ratio of 0.82
  • column compression with simple dictionary, ratio of 0.79
  • column compression with a 1k dictionary from femtozip, ratio of 0.61
  • column compression with a 2k dictionary from femtozip, ratio of 0.57

The above example stores a JSON document in a text column. MySQL 5.7 includes a JSON datatype which behaves a bit differently regarding the dictionary. Delimiting characters like ‘{}’ are removed in the on disk representation of a JSON column. If you have TBs of data in similar tables, you should really consider column compression and a systematic way of determining the dictionary with femtozip. In addition to improve the compression, it is likely to be the less performance impacting solution. Would it be interesting to generate a dictionary from existing data with a command like this one?

CREATE COMPRESSION_DICTIONARY_FROM_DATA A_good_dictionary (2048, select questions from TestColCompression limit 10000);

where the dictionary creation process would implicitly includes steps similar to the ones I did with femtozip.

Jul
20
2018
--

InnoDB Cluster in a Nutshell Part 3: MySQL Shell

MySQL InnoDB Cluster MySQL Shell

MySQL InnoDB Cluster MySQL ShellWelcome to the third part of this series. I’m glad you’re still reading, as hopefully this means you find this subject interesting at least. Previously we presented the first two components of MySQL InnoDB Cluster: Group Replication and MySQL Router and now we will discuss the last component, MySQL Shell.

MySQL Shell

This is the last component in the cluster and I love it. Oracle have created this tool to centralize cluster management, providing a friendly, command-line based user interface.

The tool can be defined as an advanced MySQL shell, which is much more powerful than the well known MySQL client. With the capacity to work with both relational and document (JSON) data, the tool provides an extended capability to interact with the database from a single place.

MySQL Shell is also able to understand different languages:

  • JavaScript (default) which includes several built-in functions to administer the cluster—create, destroy, restart, etc.—in a very easy way.
  • Python it provides an easy way to write Python code to interact with the database. This is particularly useful for developers who don’t need to have SQL skills or run applications to test code.
  • SQL to work in classic mode to query database as we used to do with the old MySQL client.

A very interesting feature provided with MySQL Shell is the ability to establish different connections to different servers/clusters from within the same shell. There is no need to exit to connect to a different server, just issuing the command \connect will make this happen. As DBA, I find this pretty useful when handling multiple clusters/servers.

Some of the features present in this tool:

  • Capacity to use both Classic and X protocols.
  • Online switch mode to change languages (JavaScript, Python and SQL)
  • Auto-completion of commands using tab, a super expected feature in MySQL client.
  • Colored formatting output that also supports different formats like Table, Tab-separated and Json formats.
  • Batch mode that processes batches of commands allowing also an interactive mode to print output according each line is processed.

Some sample commands

Samples of new tool and execution modes:

#switch modes
\sql
\js
\py
#connect to instance
\connect user@host:[port]
#create a cluster (better to handle through variables)
var cluster=dba.createCluster('percona')
#add instances to cluster
cluster.addInstance(‘root@192.168.70.2:3306’)
#check cluster status
cluster.status()
#using another variable
var cluster2=dba.getCluster(‘percona’)
cluster.status()
#get cluster structure
cluster.describe()
#rejoin instance to cluster - needs to be executed locally to the instance
cluster.rejoinInstance()
#rejoin instance to cluster - needs to be executed locally to the instance
cluster.rejoinInstance()
#recover from lost quorum
cluster.forceQuorumUsingPartitionOf(‘root@localhost:3306’)
#recover from lost quorum
cluster.rebootClusterFromCompleteOutage()
#destroy cluster
cluster.dissolve({force:true});

Personally, I think this tool is a very good replacement for the classic MySQL client. Sadly, mysql-server installations do not include MySQL shell by default, but it is worth getting used to. I recommend you try it.

Conclusion

We finally reached the end of this series. I hope you have enjoyed this short introduction to what seems to be Oracle’s bid to have a built-in High Availability solution based on InnoDB. It may become a good competitor to Galera-based solutions. Still, there is a long way to go, as the tool was only just released as GA (April 2018). There are a bunch of things that need to be addressed before it becomes consistent enough to be production-ready. In my personal opinion, it is not—yet. Nevertheless, I think it is a great tool that will eventually be a serious player in the HA field as it’s an excellent, flexible and easy to deploy solution.

The post InnoDB Cluster in a Nutshell Part 3: MySQL Shell appeared first on Percona Database Performance Blog.

Mar
23
2018
--

This Week in Data with Colin Charles 33: Reporting from FOSSASIA 2018 and Azure Announces Database Services for MySQL and PostgreSQL

Colin Charles

Colin CharlesJoin Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Writing to you on the ground from FOSSASIA 2018, where I gave a track introduction yesterday since we have a pretty awesome database track most Saturday, and generally, all MySQL focused on Sunday. There’s even a list of talks by Oracle MySQL’ers (yes, there’s more than just Oracle folk, but for that, you got to get the schedule).

The Percona Live Community Dinner happens again this year during Percona Live 2018, at Pedro’s on 24 April 2018. It starts at 7 pm, and I highly recommend you purchase the $30 ticket. It usually sells out, so don’t wait till it’s too late.

Some big news from a MySQL in the cloud perspective: Announcing general availability of Azure database services for MySQL and PostgreSQL. There’s also a quick guide: Create an Azure Database for MySQL server by using the Azure portal. Note that next comes MariaDB Server too.

Releases

Link List

Upcoming appearances

The post This Week in Data with Colin Charles 33: Reporting from FOSSASIA 2018 and Azure Announces Database Services for MySQL and PostgreSQL appeared first on Percona Database Performance Blog.

Nov
20
2017
--

InnoDB Page Compression: the Good, the Bad and the Ugly

InnoDB Page CompressionIn this blog post, we’ll look at some of the facets of InnoDB page compression.

Somebody recently asked me about the best way to handle JSON data compression in MySQL. I took a quick look at InnoDB page compression and wanted to share my findings.

There is also some great material on this topic that was prepared and presented by Yura Sorokin at Percona Live Europe 2017: https://www.percona.com/live/e17/sessions/percona-xtradb-compressed-columns-with-dictionaries-an-alternative-to-innodb-table-compression. Yura also implemented Compressed Columns in Percona Server.

First, the good part.

InnoDB page compression is actually really easy to use and provides a decent compression ratio. To use it, I just ran

CREATE TABLE commententry (...) COMPRESSION="zlib";

 – and that’s all. By the way, for my experiment I used the subset of Reddit comments stored in JSON (described here: Big Dataset: All Reddit Comments – Analyzing with ClickHouse).

This method got me a compressed table of 3.9GB. Compare this to 8.4GB for an uncompressed table and it’s about a 2.15x compression ratio.

Now, the bad part.

As InnoDB page compression uses “hole punching,” the standard Linux utils do not always properly support files created this way. In fact, to see the size “3.9GB” I had to use

du --block-size=1 tablespace_name.ibd

 , as the standard

ls -l tablespace_name.ibd

 shows the wrong size (8.4GB). There is a similar limitation on copying files. The standard way

cp old_file new_file

 may not always work, and to be sure I had to use

cp --sparse=always old_file new_file

.

Speaking about copying, here’s the ugly part.

The actual time to copy the sparse file was really bad.

On a fairly fast device (a Samsung SM863), copying the sparse file mentioned above in its compressed size of 3.9GB took 52 minutes! That’s shocking, so let me repeat it again: 52 minutes to copy a 3.9GB file on an enterprise SATA SSD.

By comparison, copying regular 8.4GB file takes 9 seconds! Compare 9 sec and 52 mins.

To be fair, the NMVe device (Intel® SSD DC D3600) handles sparse files much better. It took only 12 seconds to copy the same sparse file on this device.

Having considered all this, it is hard to recommend that you use InnoDB page compression for serious production. Well, unless you power your database servers with NVMe storage.

For JSON data, the Compressed Columns in Percona Server for MySQL should work quite well using Dictionary to store JSON keys – give it a try!

Sep
07
2017
--

Percona Live Europe Featured Talks: NoSQL Best Practices for PostgreSQL with Dmitry Dolgov

Colin Charles

Percona Live EuropeWelcome to another post our series of interview blogs for the upcoming Percona Live Europe 2017 in Dublin. This series highlights a number of talks that will be at the conference and gives a short preview of what attendees can expect to learn from the presenter.

This blog post is with Dmitry Dolgov, a software engineer at Zalando SE. His talk is titled NoSQL Best Practices for PostgreSQL. One of PostgreSQL’s most attractive features is the Jsonb data type. It allows efficient work with semi-structured data without sacrificing strong consistency and ability to use all the power of proven relational technology.  In our conversation, we discussed how to use this NoSQL feature in PostgreSQL:

Percona: How did you get into databases? What do you love about it?  

Dmitry: I grew extremely interested in databases not so long ago, mostly due to the influence of Oleg Bartunov, who is a longtime contributor to PostgreSQL. Initially, I just implemented one patch for the Jsonb data type that was eventually included in the core. After that I couldn’t stop. So I still try to help the PostgreSQL community as much as I can.

What I love is just that: PostgreSQL has an awesome community. And I mean it, there are a lot of people that are excited about databases and possess valuable expertise in this area. My most vivid memory so far about the community was someone asking a question in the hackers mailing list that got answered within minutes – even before I started to type my own reply.

Percona: How can NoSQL Jsonb data type get used effectively with PostgreSQL?

Dmitry: The trick is that you don’t need to do anything supernatural for that. Jsonb is already effective enough right out of the box. But as always there are some limitations, implementation details and tricks (which I’ll show in my talk).

Percona: What do you want attendees to take away from your session? Why should they attend?

Dmitry: The biggest idea behind this talk is to show that we live in interesting times. It’s not that easy to stick with only one data model/data storage. And to mitigate this issue, most modern databases are trying to provide more that one approach. We have to evaluate them each carefully.

Or you can attend if you expect a holy war of PostgreSQL vs. MongoDB vs. MySQL vs. whatever else. But you won’t see anything like that, because we’re all grown up people. ?

Percona: What are you most looking forward to at Percona Live Europe 2017?

Dmitry: I look forward to meeting a lot of interesting people to collaborate with, and to share my own experiences.

Want to find out more about Dmitry and PostgreSQL and the Jsonb data type? Register for Percona Live Europe 2017, and see his talk NoSQL Best Practices for PostgreSQL. Register now to get the best price! Use discount code SeeMeSpeakPLE17 to get 10% off your registration.

Percona Live Open Source Database Conference Europe 2017 in Dublin is the premier European open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, MariaDB, MongoDB, time series database, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Open Source Database Conference Europe will be September 25-27, 2017 at the Radisson Blu Royal Hotel, Dublin.

Aug
17
2017
--

IMDb Data in a Graph Database

Graph Database 1

Graph Database 1In this first of its kind, Percona welcomes Dehowe Feng, Software Developer from Bitnine as a guest blogger. In his blog post, Dehowe discusses how viewing imported data from IMDb into a graph database (AgensGraph) lets you quickly see how data nodes relate to each other. This blog echoes a talk given by Bitnine at the Percona Live Open Source Database Conference 2017.

Graphs help illustrate the relationships between entities through nodes, drawing connections between people and objects. Relationships in IMDb are inherently visual. Seeing how things are connected grants us a better understanding of the context underneath. By importing IMDb data as graph data, you simplify the schema can obtain key insights.

In this post, we will examine how importing IMDb into a graph database (in this case, AgensGraph) allows us to look at data relationships in a much more visual way, providing more intuitive insights into the nature of related data.

For install instructions to the importing scripts, go here.

Internet Movie Database (IMDb) owned by Amazon.com is one of the largest movie databases. It contains 4.1 million titles and 7.7 million personalities (https://en.wikipedia.org/wiki/IMDb).

Relational Schema for IMDb

Graph Database 2

Relational Schema of IMDb Info

Picture courtesy of user ofthelit on StackOverflow, https://goo.gl/SpS6Ca

Because IMDb’s file format is not easy to read and parse, rather than implementing the file directly we use an additional step to load it into relational tables. For this project, we used IMDbpy to load relational data into AgensGraph in relational form. The above figure is the relational schema which IMDbpy created. This schema is somewhat complicated, but essentially there are four basic entries: Production, Person, Company and Keyword. Because there are many N-to-N relationships between these entities, the relational schema has more tables than the number of entities. This makes the schema harder to understand. For example, a person can be related to many movies and a movie can have many characters.

Concise Graph Modeling

From there, we developed our own graph schema using Production, Person, Company and Keyword as our nodes (or end data points).

Productions lie at the “center” of the graph, with everything leading to them. Keywords describing Productions, Persons and Companies are credited for their contributions to Productions. Productions are linked to other productions as well.

Graph Database 3

Simplified Graph Database Schema

With the data in graph form, one can easily see the connections between all the nodes. The data can be visualized as a network and querying the data with Cypher allows users to explore the connections between entities.

Compared to the relational schema of IMDb, the graph schema is much simpler to understand. By merging related information for the main entities into nodes, we can access all relevant information to that node through that node, rather than having to match IDs across tables to get the information that we need. If we want to examine how a node relates to another node, we can query its edges to see the connections it forms. Being able to visually “draw a connection” from one node to another helps to illustrate how they are connected.

Furthermore, the labels of the edges describe how the nodes are connected. Edge labels in the IMDb Graph describe what kind of connection is formed, and pertinent information may be stored in attributes in the edges. For example, for the connections ACTOR_IN and ACTRESS_IN, we store role data, such as character name and character id.

Data Migration

To make vertexes’ and edges’ properties we use “views”, which join related tables. The data is migrated into a graph format by querying the relational data using selects and joins into a single table with the necessary information for creating each node.

For example, here is the SQL query used to create the jsonb_keyword view:

CREATE VIEW jsonb_keyword AS
SELECT row_to_json(row(keyword)) AS data
FROM keyword;

We use a view to make importing queries simpler. Once this view is created, its content can be migrated into the graph. After the graph is created, the graph_path is set, and the VLABEL is created, we can use the convenient LOAD keyword to load the JSON values from the relational table into the graph:

LOAD FROM jsonb_keyword AS keywords
CREATE (a:Keyword = data(keywords) );

Note that here LOAD is used to load data in from a relational table, but LOAD can also be used to load data from external sources as well.

Creating edges is a similar process. We load edges from the tables that store id tuples of the between the entities after creating their ELABELs:

LOAD FROM movie_keyword AS rel_key_movie
MATCH (a:Keyword), (b:Production)
WHERE a.id::int = (rel_key_movie).keyword_id AND
b.id::int = (rel_key_movie).movie_id
CREATE (a)-[:KEYWORD_OF]->(b);

As you can see, AgensGraph is not restricted to the CSV format when importing data. We can import relational data into its graph portion using the LOAD feature and SQL statements to refine our data sets.

How is information stored?

Most of the pertinent information is held in the nodes (vertexes). Nodes are labeled either as Productions, Persons, Companies or Keywords, and their relative information is stored as JSONs. Since IMDB information is constantly updated, many fields for certain entities are left incomplete. Since JSON is semi-structured, if an entity does not have a certain piece of information the field will not exist at all – rather than having a field and marking it as NULL.

We also use nested JSON arrays to store data that may have multiple fields, such as quotes that persons might have said or alternate titles to productions. This makes it possible to store “duplicate” fields in each node.

How can this information be used?

In the graph IMDb database, querying between entities is very easy to learn. Using the Cypher Query Language, a user can find things such as all actors that acted in a certain production, all productions that a person has worked on or all other companies that have worked with a certain company on any production. Graph database strength is the simplicity of visualizing the data. There are many ways you can query a graph database to find what you need!

Find the name of all actors that acted in Night at the Museum:

MATCH (a:Person)-[:ACTOR_IN]->(b:Production)
WHERE title = 'Night at the Museum'
RETURN a.name,b.title;

Result:

name | title
-----------------------+---------------------
Asprinio, Stephen | Night at the Museum
Blais, Richard | Night at the Museum
Bougere, Teagle F. | Night at the Museum
Bourdain, Anthony | Night at the Museum
Cherry, Jake | Night at the Museum
Cheng, Paul Chih-Ping | Night at the Museum
...
(56 rows)

Find all productions that Ben Stiller worked on:

MATCH (a:Person)-[b]->(c:Production)
WHERE a.name = 'Stiller, Ben'
RETURN a.name,label(b),c.title;

Result:

name | label | title
-------------+-------------+-----------------------------------------------
...
Stiller, Ben | actor_in | The Heartbreak Kid: The Egg Toss
Stiller, Ben | producer_of | The Hardy Men
Stiller, Ben | actor_in | The Heartbreak Kid: Ben & Jerry
Stiller, Ben | producer_of | The Polka King
Stiller, Ben | actor_in | The Heartbreak Kid
Stiller, Ben | actor_in | The Watch
Stiller, Ben | actor_in | The History of 'Walter Mitty'
Stiller, Ben | producer_of | The Making of 'The Pick of Destiny'
Stiller, Ben | actor_in | The Making of 'The Pick of Destiny'
...
(901 rows)

Find all actresses that worked with Sarah Jessica Parker:

MATCH (a:Person)-[b:ACTRESS_IN]->(c:Production)<-[d:ACTRESS_IN]-(e:Person)
WHERE a.name = 'Parker, Sarah Jessica'
RETURN DISTINCT e.name;

Result:

name
---------------------------------
Aaliyah
Aaron, Caroline
Aaron, Kelly
Abascal, Nati
Abbott, Diane
Abdul, Paula
...
(3524 rows)

Summary

The most powerful aspects of a graph database are flexibility and visualization capabilities.

In the future, we plan to implement a one-step importing script. Currently, the importing script is two-phased: the first step is to load into relational tables and the second step is to load into the graph. Additionally, AgensGraph has worked with Gephi to release a data import plugin. The Gephi Connector allows for graph visualization and analysis. For more information, please visit www.bitnine.net and www.agensgraph.com.

Mar
09
2017
--

Services Monitoring with Probabilistic Fault Detection

Services Monitoring

In this blog post, we’ll discuss services monitoring using probabilistic fault detection.

Let’s admit it, the task of monitoring services is one of the most difficult. It is time-consuming, error-prone and difficult to automate. The usual monitoring approach has been pretty straightforward in the last few years: setup a service like Nagios, or pay money to get a cloud-based monitoring tool. Then choose the metrics you are interested in and set the thresholds. This is a manual process that works when you have a small number of services and servers, and you know exactly how they behave and what you should monitor. These days, we have hundred of servers with thousands of services sending us millions of metrics. That is the first problem: the manual approach to configuration doesn’t work.

That is not the only problem. We know that no two servers perform the same because no two servers have exactly the same workload. The thresholds that you setup for one server might not be the correct one for all of the other thousand. There are some approaches to the problem that will make it even worse (like taking averages and setting the thresholds based on those, for example, hoping it will work). Let me tell you a secret: it won’t work. Here we have a second problem: instances of the same type can demonstrate very different behaviors.

The last problem is that new shiny services you company may want to use are announced every week. It is impossible, because of time constraints, to know all of those services well enough to create a perfect monitoring template. In other words: sometimes we are asked to monitor software we don’t completely understand.

In summary, you have thousands of services, some of them you don’t even know how they work, that are sending you million of metrics that mean nothing to you. Now, set the thresholds and enable the pager alert. The nightmare has started. Is there a different approach?

Machine Learning

We have to stop thinking that monitoring is a bunch of config files with thresholds that we copy from one server to another. There are no magic templates that will work. We need to use a different technique that removes us from the process. That template is “machine learning.” As stated in Wikipedia, it is a subfield of computer science that gives computers the ability to learn without being explicitly programmed. In it’s most basic form, it can be used to solve classification problems. For example, open pet photos and identify if it is a cat or a dog. This is a classification problem that both humans and computers can solve, but we are much much slower. The computer has to take the time to learn the patterns, but at some point it will do the classification in no time.

I hope you are starting to see a pattern here. Why do we need to care about monitoring and its configuration if we have computers that can learn patterns and classify things for us?

There are two main ways of doing probabilistic fault detection: Novelty Detection and Outlier Detection.

Novelty Detection

Novelty Detection is easy to visualize and understand. It takes a series of inputs and tries to find anomalies, something that hasn’t been seen before. For example, our credit card company has a function that takes “category, expense, date, hour, country” as arguments and returns an integer so that they can classify and identify all the purchases. Your monthly use of the credit card looks like this:

[0,4,4,5,5,5,4,3]

That is the normal model that defines your use of the credit card. Now, it can be used to detect anomalies.

  • [0] – OK
  • [4] – OK
  • [4] – OK
  • [1] – Anomaly! Operation canceled.

Easy and straightforward. It is simple and very useful in a lot of areas to generate alerts when something anomalous happens. One of the machine learning models that can do this is One-Class Support Vector Machines, but since this is not the kind of fault detection we are looking for I won’t go into details. If you are interested, follow these links:

Outlier Detection

Let’s say we have this data:

[0, 3, 5, 1, -2, 19, 2, 10, -9, 2, 1, 8, 3, 21, -1, 3]

Now we know how to find anomalies, but how do we find outliers? Looking at the numbers above, it seems 21, 19 and -9 could be outliers. But a more exact definition is needed (not just intuition). The most simple and usual way of doing it is the following:

We divide our data into three pieces. One cut will be done at 25%, the second cut at 75%. The number that it is at 25% is called the First Quartile, and the value of the second cut is called the Third Quartile. The IQR or Interquartile Range is the subtraction of the Third Quartile with the First Quartile. Now, an outlier is any number that falls in one of these two categories:

  • If the value is below: (First Quartile) – (1.5 × IQR)
  • If the value is above: (Third Quartile + (1.5 × IQR)

Using Python:

inputs = [0, 3, 5, 1, -2, 19, 2, 10, -9, 2, 1, 8, 3, 21, -1, 3]
Q1 = np.percentile(inputs,25)
Q3 = np.percentile(inputs,75)
step = (Q3-Q1)*1.5
Q1
0.75
Q3
5.75
step
7.5
outliers = [x for x in inputs if x < Q1-step or x > Q3+step]
outliers
[19, -9, 21]

This looks more like what we need. If we are monitoring a metric, and outliers are detected, then something is happening there that requires investigation. Some of the most used outlier detection models in scikit-learn are:

  • Elliptic Envelope: a robust co-variance estimate that assumes that our data is Gaussian distributed. It will define the shape of the data we have, creating a frontier that delimits the contour. As you probably guessed, it will be elliptical in shape. Don’t worry about the assumption of Gaussian distribution, data can be standardized. More about this later on.

 

  • Isolation Forest: this is the well-known “forest of random trees,” but applied to outlier detection. This is more suitable when we have many different input metrics. In the example I use below, I just use a single metric, so this model would not work that well.

Therefore, Elliptic Envelope looks like the best option for our proof-of-concept.

For visual reference, this is how the three models look like when they try to shape two data inputs:

Services Monitoring
Source: scikit-learn.org

 

Proof-of-Concept

I haven’t explained the model in detail, but a high level explanation should be enough to understand the problem and the possible solution. Let’s start building a proof-of-concept.

For this test, I got data from our Prometheus setup, where all the time-series monitoring data from our customers is stored. In this particular example, I got numbers from the “Threads Running” metric. Those will be used to train our Elliptical Envelope. It is important to take the following into account:

    • We need to collect enough data so that it captures the correct shape of our baseline performance. For example, usually nighttime hours have less of a workload than during the day (same with weekend days, in some cases).
    • As explained before, it assumes a Gaussian distribution, which means that the data needs to be scaled. I am going to standardize the data so that it has 0 mean and 1 variance. The same standardization needs to be applied to the data we test after the training process, when the monitoring is already in place. That standardization also needs to be applied to each metric individually. This is the formula:
Services Monitoring
Source: dataminingblog.com

With ? as the mean and ? as the standard deviation.

This is the summary of what our proof-of-concept will do:

  • Read Prometheus JSON dump.
  • Separate some data for training, standardizing it first.
  • Separate some data for testing, standardizing it first as well.
  • Make predictions on test data.
  • For those rows identified as outliers, get the original non-standardize data to see the number of threads running.

So, let’s start:

Import the Libraries

import pandas as pd
import numpy as np
import json
from datetime import datetime
from sklearn.preprocessing import StandardScaler
from sklearn.covariance import EllipticEnvelope

Load the Data

All the information is in a JSON output from Prometheus that has the “threads_running” of a particular server. It has one second granularity for the last four weeks. I also converted “timestamps” to a normal “datetime” object so that it is easier to read:

with open('query.json') as data_file:
    data = json.load(data_file)
data = pd.DataFrame(data["data"]["result"][0]["values"])
data[0] = data[0].astype(int)
data[0] = data[0].apply(lambda x: datetime.fromtimestamp(x))

The data looks like this:

DateTime Threads Running
2017-01-19 20:32:44 1
2017-01-19 20:32:45 1
2017-01-19 20:32:46 2

 

Create the Training and Testing Dataset

First, separate some of the data for use as training:

train_data = data[(data[0] >= "2017-01-22") & (data[0] <= "2017-01-28" )]

Ignore the date column, and just store the metrics:

train_data = train_data.iloc[:,[1]]

Standardize it:

escaler = StandardScaler()
train_data = escaler.fit_transform(train_data)

Now the data looks like this:

Standardized Threads Running
-0.4072634
-0.4072634
0.47153585

To create the test dataset we need to follow the exact same procedure, only select a different timeframe:

test_original_data = data[(data[0] >= "2017-02-2") & (data[0] <= "2017-02-17" )]
test_data = test_original_data.iloc[:,[1]]
test_data = escaler.transform(test_data)

Train the Model

Let’s create our model with the training data! I am using two parameters here:

  • assume_centered: to specify that our data is already Gaussian distributed.
  • contamination: to specify the ratio of outliers our training data has.
clf = EllipticEnvelope(assume_centered=True,contamination=0)
clf.fit(train_data)

Search for Outliers

Now that we’ve trained the model and we have our test data, we can ask the model if it finds any outliers. It will return

1

 or

-1

 for each row. “1” means that the value of threads running is normal and within the boundaries, while “-1” means that the value is an outlier:

predictions = clf.predict(test_data)
outliers = np.where(predictions==-1)

The array “outliers” stores the row numbers where -1 was predicted.

At this point we have three important variables:

  • test_data: standardized testing data.
  • test_original_data: the original test data without modification.
  • outliers: the row numbers where an outlier was detected (-1).

Investigate the Outliers

Since we have the row number where an outlier was detected, now we can just query

test_original_data

 and search for those rows. In this example, I show some random ones:

for indice in outliers[0]:
    if np.random.randn() > 2.5:
        print("{} - {} threads running".format(test_original_data.iloc[indice][0], test_original_data.iloc[indice][1]))
2017-02-03 11:26:03 - 41 threads running
2017-02-03 11:26:40 - 43 threads running
2017-02-03 11:27:50 - 48 threads running
2017-02-03 11:32:07 - 78 threads running
2017-02-03 11:33:25 - 90 threads running
2017-02-12 10:06:58 - 36 threads running
2017-02-12 10:12:11 - 60 threads running
2017-02-12 10:12:30 - 64 threads running

And there we have it! Dates and hours when something really out of the ordinary happened. No need to create a config file for each service, guess thresholds, adjust them … nothing. Just let the model learn, and you get alerts when something unexpected happens. Push all the metrics from your services to these models, and let them do the hard work.

Summary

Most companies have similar situations. Companies add new services on hundred of servers, and monitoring is an essential part of the infrastructure. The old method of monolithic config files with some thresholds doesn’t scale, because it needs a lot of manual work with a trial/error approach. The types of techniques explained in this blog post can help us deploy monitoring on hundred of servers, not really caring about the different nuances of each service or workload. It is even possible to start monitoring a service without even knowing anything about it — just let the probabilistic model take care of it.

It is important to clarify that, in my opinion, these fault detection models are not going to be a substitute for software like Nagios. In those areas where a binary test is needed (service is up/down for example), Nagios and other similar services do a good job. Actually, a Nagios check can use the procedure explained here. When there are many metrics to analyze, probabilistic methods can save us from a nightmare.

Jan
18
2017
--

Elasticsearch Ransomware: Open Source Database Security Part 2

Elasticsearch Ransomware

Elasticsearch RansomwareIn this blog post, we’ll look at a new Elasticsearch ransomware outbreak and what you can do to prevent it happening to you.

Mere weeks after reports of MongoDB servers getting hacked and infected with ransomware, Elasticsearch clusters are experiencing the same difficulties. David Murphy’s blog discussed the situation and the solution for MongoDB servers. In this blog post, we look at how you can prevent ransomware attacks on your Elasticsearch clusters.

First off, what is Elasticsearch? Elasticsearch is an open source distributed index based on Apache Lucene. It provides a full-text search with an HTTP API, using schemaless JSON documents. By its nature, it is also distributed and redundant. Companies use Elasticsearch with logging via the ELK stack and data-gathering software, to assist with data analytics and visualizations. It is also used to back search functionality in a number of popular apps and web services.

In this new scenario, the ransomware completed wiped away the cluster data, and replaced it with the following warning index:

“SEND 0.2 BTC TO THIS WALLET: 1DAsGY4Kt1a4LCTPMH5vm5PqX32eZmot4r IF YOU WANT RECOVER YOUR DATABASE! SEND TO THIS EMAIL YOUR SERVER IP AFTER SENDING THE BITCOINS.”

As with the MongoDB situation, this isn’t a flaw in the Elasticsearch software. This vulnerability stems from not correctly using the security settings provided by Elasticsearch. As the PCWorld article sums up:

According to experts, there is no reason to expose Elasticsearch clusters to the internet. In response to these recent attacks, search technologies and distributed systems architect Itamar Syn-Hershko has published a blog post with recommendations for securing Elasticsearch deployments.

The blog post they reference has excellent advice and examples of how to protect your Elasticsearch installations from exploitation. To summarize its advice (from the post itself):

Whatever you do, never expose your cluster nodes to the web directly.

So how do you prevent hackers from getting into your Elasticsearch cluster? Using the advice from Syn-Hershko’s blog, here are some bullet points for shoring up your Elasticsearch security:

  • HTTP-enabled nodes should only listen to private IPs. You can configure what IPs Elasticsearch listens to: localhost, private IPs, public IPs or several of these options.
    network.bind_host

     and 

    network.host

     control the IP types (manual). Never set Elasticsearch to listen to a public IP or a publicly accessible DNS name.

  • Use proxies to communicate with clients. You should pass any application queries to Elasticsearch through some sort of software that can filter requests, perform audit-logging and password-protect the data. Your client-side javascript shouldn’t talk to Elastic directly, and should only communicate with your server-side software. That software can translate all client-side requests to Elasticsearch DSL, execute the query, and then send the response in a format the clients expect.
  • Don’t use default ports. Once again for clarity: DON’T USE DEFAULT PORTS. You can easily change Elasticsearch’s default ports by modifying the .YML file. The relevant parameters are
    http.port

     and

    transport.tcp.port

     (manual).

  • Disable HTTP if you don’t need it. Only Elasticsearch client nodes should enable HTTP, and your private network applications should be the only ones with access to them. You can completely disable the HTTP module by setting
    http.enabled

     to

    false

     (manual).

  • Secure publicly available client nodes. You should protect your Elasticsearch client and any UI it communicates with (such as Kibana and Kopf) behind a VPN. If you choose to allow some nodes access to the public network, use HTTPS and don’t transmit data and credentials as plain-text. You can use plugins like Elastic’s Shield or SearchGuard to secure your cluster.
  • Disable scripting (pre-5.x). Malicious scripts can hack clusters via the Search API. Earlier versions of Elasticscript allowed unsecured scripts to access the software. If you are using an older version (pre-5.x), upgrade to a newer version or disable dynamic scripting completely.

Go to Syn-Hershko’s blog for more details.

This should get you started on correctly protecting yourself against Elasticsearch ransomware (and other security threats). If you want to have someone review your security, please contact us.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com