Dec
04
2018
--

Fivetran announces $15M Series A to build automated data pipelines

Fivetran, a startup that builds automated data pipelines between data repositories and cloud data warehouses and analytics tools, announced a $15 million Series A investment led by Matrix Partners.

Fivetran helps move data from source repositories like Salesforce and NetSuite to data warehouses like Snowflake or analytics tools like Looker. Company CEO and co-founder George Fraser says the automation is the key differentiator here between his company and competitors like Informatica and SnapLogic.

“What makes Fivetran different is that it’s an automated data pipeline to basically connect all your sources. You can access your data warehouse, and all of the data just appears and gets kept updated automatically,” Fraser explained. While he acknowledges that there is a great deal of complexity behind the scenes to drive that automation, he stresses that his company is hiding that complexity from the customer.

The company launched out of Y Combinator in 2012, and other than $4 million in seed funding along the way, it has relied solely on revenue up until now. That’s a rather refreshing approach to running an enterprise startup, which typically requires piles of cash to build out sales and marketing organizations to compete with the big guys they are trying to unseat.

One of the key reasons they’ve been able to take this approach has been the company’s partner strategy. Having the ability to get data into another company’s solution with a minimum of fuss and expense has attracted data-hungry applications. In addition to the previously mentioned Snowflake and Looker, the company counts Google BigQuery, Microsoft Azure, Amazon Redshift, Tableau, Periscope Data, Salesforce, NetSuite and PostgreSQL as partners.

Ilya Sukhar, general partner at Matrix Partners, who will be joining the Fivetran board under the terms of the deal, sees a lot of potential here. “We’ve gone from companies talking about the move to the cloud to preparing to execute their plans, and the most sophisticated are making Fivetran, along with cloud data warehouses and modern analysis tools, the backbone of their analytical infrastructure,” Sukhar said in a statement.

They currently have 100 employees spread out across four offices in Oakland, Denver, Bangalore and Dublin. They boast 500 customers using their product including Square, WeWork, Vice Media and Lime Scooters, among others.

Oct
22
2018
--

One Billion Tables in MySQL 8.0 with ZFS


The short version

I created > one billion InnoDB tables in MySQL 8.0 (tables, not rows) just for fun. Here is the proof:

$ mysql -A
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1425329
Server version: 8.0.12 MySQL Community Server - GPL
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> select count(*) from information_schema.tables;
+------------+
| count(*)   |
+------------+
| 1011570298 |
+------------+
1 row in set (6 hours 57 min 6.31 sec)

Yes, it took 6 hours and 57 minutes to count them all!

Why does anyone need one billion tables?

In my previous blog post, I created and tested MySQL 8.0 with 40 million tables (that was a real case study). The One Billion Tables project is not a real-world scenario, however. I was challenged by the Billion Tables Project (BTP) in PostgreSQL, and decided to repeat it with MySQL, creating one billion InnoDB tables.

As an aside: I think MySQL 8.0 is the first MySQL version where creating 1 billion InnoDB tables is even practically possible.

Challenges with one billion InnoDB tables

Disk space

The first, and one of the most important, challenges is disk space. InnoDB allocates data pages on disk when creating .ibd files. Without disk-level compression we would need > 25TB of disk space. The good news: we have ZFS, which provides transparent disk compression. Here’s how the disk utilization looks:

Actual data (apparent-size):

# du -sh --apparent-size /mysqldata/
26T     /mysqldata/

Compressed data:

# du -sh /mysqldata/
2.4T    /mysqldata/

Compression ratio:

# zfs get compression,compressratio
...
mysqldata/mysql/data             compressratio         7.14x                      -
mysqldata/mysql/data             compression           gzip                       inherited from mysqldata/mysql

(The reported compression ratio does not look 100% correct; we expected a ~10x compression ratio.)

Too many tiny files

This is usually the big issue for databases that create a file per table. With MySQL 8.0 we can create a shared tablespace and “assign” tables to it. I created one tablespace per database, and created 1000 tables in each database.
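
A minimal sketch of that pattern (names are illustrative; the full provisioning script is shown further down):

# one general tablespace per schema, with every table in the schema assigned to it
mysql -e "CREATE DATABASE sbtest_1"
mysql sbtest_1 -e "CREATE TABLESPACE sbtest_1 ADD DATAFILE 'sbtest_1.ibd' ENGINE=InnoDB"
mysql sbtest_1 -e "CREATE TABLE t1 (id INT PRIMARY KEY) ENGINE=InnoDB TABLESPACE sbtest_1"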

The result:

mysql> select count(*) from information_schema.schemata;
+----------+
| count(*) |
+----------+
|  1011575 |
+----------+
1 row in set (1.31 sec)

Creating tables

Another big challenge is how to create tables fast enough so that it does not take months. I used three approaches:

  1. Disabled all possible consistency checks in MySQL, and decreased the InnoDB page size to 4K (these config options are NOT for production use)
  2. Created tables in parallel: as the mutex contention bug in MySQL 8.0 has been fixed, creating tables in parallel works fine.
  3. Used local NVMe cards on top of an AWS EC2 i3.8xlarge instance

my.cnf config file (I repeat: do not use this in production):

[mysqld]
default-authentication-plugin = mysql_native_password
performance_schema=0
datadir=/mysqldata/mysql/data
socket=/mysqldata/mysql/data/mysql.sock
log-error = /mysqldata/mysql/log/error.log
skip-log-bin=1
innodb_log_group_home_dir = /mysqldata/mysql/log/
innodb_doublewrite = 0
innodb_checksum_algorithm=none
innodb_log_checksums=0
innodb_flush_log_at_trx_commit=0
innodb_log_file_size=2G
innodb_buffer_pool_size=100G
innodb_page_size=4k
innodb_flush_method=nosync
innodb_io_capacity_max=20000
innodb_io_capacity=5000
innodb_buffer_pool_instances=32
innodb_stats_persistent = 0
tablespace_definition_cache = 524288
schema_definition_cache = 524288
table_definition_cache = 524288
table_open_cache=524288
table_open_cache_instances=32
open-files-limit=1000000

ZFS pool:

# zpool status
  pool: mysqldata
 state: ONLINE
  scan: scrub repaired 0B in 1h49m with 0 errors on Sun Oct 14 02:13:17 2018
config:
        NAME        STATE     READ WRITE CKSUM
        mysqldata   ONLINE       0     0     0
          nvme0n1   ONLINE       0     0     0
          nvme1n1   ONLINE       0     0     0
          nvme2n1   ONLINE       0     0     0
          nvme3n1   ONLINE       0     0     0
errors: No known data errors
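
For reference, a pool and dataset layout like the one above could be assembled roughly as follows. Device names and mount points are illustrative, and the exact options used for this test are not shown, so treat this as a sketch:

# striped pool across the four NVMe cards; gzip compression is inherited by child datasets
zpool create mysqldata nvme0n1 nvme1n1 nvme2n1 nvme3n1
zfs create mysqldata/mysql
zfs set compression=gzip mysqldata/mysql
zfs create -o mountpoint=/mysqldata/mysql/data mysqldata/mysql/data
zfs create -o mountpoint=/mysqldata/mysql/log mysqldata/mysql/log
zfs get compression,compressratio mysqldata/mysql/data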

A simple “deploy” script to create tables in parallel (includes the sysbench table structure):

#!/bin/bash
# Create sbtest_* databases in parallel; each gets its own general tablespace
# and 1,000 sbtest tables assigned to it.
function do_db {
        db_exist=$(mysql -A -s -Nbe "select 1 from information_schema.schemata where schema_name = '$db'")
        if [ "$db_exist" == "1" ]; then echo "Already exists: $db"; return 0; fi;
        tbspace="create database $db; use $db; CREATE TABLESPACE $db ADD DATAFILE '$db.ibd' engine=InnoDB";
        #echo "Tablespace $db.ibd created!"
        tables=""
        for i in {1..1000}
        do
                table="CREATE TABLE sbtest$i ( id int(10) unsigned NOT NULL AUTO_INCREMENT, k int(10) unsigned NOT NULL DEFAULT '0', c varchar(120) NOT NULL DEFAULT '', pad varchar(60) NOT NULL DEFAULT '', PRIMARY KEY (id), KEY k_1 (k) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 tablespace $db;"
                tables="$tables; $table;"
        done
        echo "$tbspace;$tables" | mysql
}
c=0
echo "starting..."
c=$(mysql -A -s -Nbe "select max(cast(SUBSTRING_INDEX(schema_name, '_', -1) as unsigned)) from information_schema.schemata where schema_name like 'sbtest_%'")
for m in {1..100000}
do
        echo "m=$m"
        for i in {1..30}
        do
                let c=$c+1
                echo $c
                db="sbtest_$c"
                do_db &
        done
        wait
done

How fast did we create tables? Here are some stats:

# mysqladmin -i 10 -r ex|grep Com_create_table
...
| Com_create_table                                      | 6497 |
| Com_create_table                                      | 6449 |

So we created ~650 tables per second. The counters above are per 10-second interval.
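
A quick back-of-the-envelope check of what that rate means for the whole run (simple shell arithmetic):

# at ~650 tables/second, one billion tables take roughly 1.5 million seconds
echo "$(( 1000000000 / 650 )) seconds = ~$(( 1000000000 / 650 / 86400 )) days"
# prints "1538461 seconds = ~17 days" of continuous CREATE TABLE work,
# which is in the same ballpark as the ~2 weeks mentioned in the conclusions below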

Counting the tables

It took > 6 hours to do “count(*) from information_schema.tables”! Here is why:

  1. MySQL 8.0 uses a new data dictionary (this is great as it avoids creating 1 billion frm files). Everything is stored in this file:
    # ls -lah /mysqldata/mysql/data/mysql.ibd
    -rw-r----- 1 mysql mysql 6.1T Oct 18 15:02 /mysqldata/mysql/data/mysql.ibd
  2. information_schema.tables is actually a view:
mysql> show create table information_schema.tables\G
*************************** 1. row ***************************
                View: TABLES
         Create View: CREATE ALGORITHM=UNDEFINED DEFINER=`mysql.infoschema`@`localhost` SQL SECURITY DEFINER VIEW `information_schema`.`TABLES` AS select `cat`.`name` AS `TABLE_CATALOG`,`sch`.`name` AS `TABLE_SCHEMA`,`tbl`.`name` AS `TABLE_NAME`,`tbl`.`type` AS `TABLE_TYPE`,if((`tbl`.`type` = 'BASE TABLE'),`tbl`.`engine`,NULL) AS `ENGINE`,if((`tbl`.`type` = 'VIEW'),NULL,10) AS `VERSION`,`tbl`.`row_format` AS `ROW_FORMAT`,internal_table_rows(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`table_rows`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `TABLE_ROWS`,internal_avg_row_length(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`avg_row_length`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `AVG_ROW_LENGTH`,internal_data_length(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`data_length`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `DATA_LENGTH`,internal_max_data_length(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`max_data_length`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `MAX_DATA_LENGTH`,internal_index_length(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`index_length`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `INDEX_LENGTH`,internal_data_free(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`data_free`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `DATA_FREE`,internal_auto_increment(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`auto_increment`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0),`tbl`.`se_private_data`) AS `AUTO_INCREMENT`,`tbl`.`created` AS `CREATE_TIME`,internal_update_time(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(cast(`stat`.`update_time` as unsigned),0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `UPDATE_TIME`,internal_check_time(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(cast(`stat`.`check_time` as unsigned),0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `CHECK_TIME`,`col`.`name` AS `TABLE_COLLATION`,internal_checksum(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`checksum`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `CHECKSUM`,if((`tbl`.`type` = 'VIEW'),NULL,get_dd_create_options(`tbl`.`options`,if((ifnull(`tbl`.`partition_expression`,'NOT_PART_TBL') = 'NOT_PART_TBL'),0,1))) AS 
`CREATE_OPTIONS`,internal_get_comment_or_error(`sch`.`name`,`tbl`.`name`,`tbl`.`type`,`tbl`.`options`,`tbl`.`comment`) AS `TABLE_COMMENT` from (((((`mysql`.`tables` `tbl` join `mysql`.`schemata` `sch` on((`tbl`.`schema_id` = `sch`.`id`))) join `mysql`.`catalogs` `cat` on((`cat`.`id` = `sch`.`catalog_id`))) left join `mysql`.`collations` `col` on((`tbl`.`collation_id` = `col`.`id`))) left join `mysql`.`tablespaces` `ts` on((`tbl`.`tablespace_id` = `ts`.`id`))) left join `mysql`.`table_stats` `stat` on(((`tbl`.`name` = `stat`.`table_name`) and (`sch`.`name` = `stat`.`schema_name`)))) where (can_access_table(`sch`.`name`,`tbl`.`name`) and is_visible_dd_object(`tbl`.`hidden`))
character_set_client: utf8
collation_connection: utf8_general_ci

and the explain plan looks like this:

mysql> explain select count(*) from information_schema.tables \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: cat
   partitions: NULL
         type: index
possible_keys: PRIMARY
          key: name
      key_len: 194
          ref: NULL
         rows: 1
     filtered: 100.00
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: tbl
   partitions: NULL
         type: ALL
possible_keys: schema_id
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1023387060
     filtered: 100.00
        Extra: Using where; Using join buffer (Block Nested Loop)
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: sch
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY,catalog_id
          key: PRIMARY
      key_len: 8
          ref: mysql.tbl.schema_id
         rows: 1
     filtered: 11.11
        Extra: Using where
*************************** 4. row ***************************
           id: 1
  select_type: SIMPLE
        table: stat
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 388
          ref: mysql.sch.name,mysql.tbl.name
         rows: 1
     filtered: 100.00
        Extra: Using index
*************************** 5. row ***************************
           id: 1
  select_type: SIMPLE
        table: ts
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: mysql.tbl.tablespace_id
         rows: 1
     filtered: 100.00
        Extra: Using index
*************************** 6. row ***************************
           id: 1
  select_type: SIMPLE
        table: col
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: mysql.tbl.collation_id
         rows: 1
     filtered: 100.00
        Extra: Using index
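
Given the layout used here (1,000 sbtest tables per sbtest_* schema), a much cheaper sanity check is to count schemas instead and multiply — counting schemas only took 1.31 seconds above. A rough sketch:

# ~1.01 million sbtest_* schemas x 1,000 tables each = ~1.01 billion user tables,
# consistent with the 1,011,570,298 total reported by information_schema.tables
# (the small remainder is system tables)
mysql -A -s -Nbe "SELECT COUNT(*)*1000 FROM information_schema.schemata WHERE schema_name LIKE 'sbtest%'"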

Conclusions

  1. I have created more than 1 billion real InnoDB tables with indexes in MySQL 8.0, just for fun, and it worked. Creating them took about two weeks.
  2. MySQL 8.0 is probably the first version where it is even practically possible to create one billion InnoDB tables.
  3. ZFS compression together with NVMe cards makes it reasonably cheap to do, for example by using i3.4xlarge or i3.8xlarge instances on AWS.


Oct
10
2018
--

Nvidia launches Rapids to help bring GPU acceleration to data analytics

Nvidia, together with partners like IBM, HPE, Oracle, Databricks and others, is launching a new open-source platform for data science and machine learning today. Rapids, as the company is calling it, is all about making it easier for large businesses to use the power of GPUs to quickly analyze massive amounts of data and then use that to build machine learning models.

“Businesses are increasingly data-driven,” Nvidia’s VP of Accelerated Computing Ian Buck told me. “They sense the market and the environment and the behavior and operations of their business through the data they’ve collected. We’ve just come through a decade of big data and the output of that data is using analytics and AI. But most of it is still using traditional machine learning to recognize complex patterns, detect changes and make predictions that directly impact their bottom line.”

The idea behind Rapids then is to work with the existing popular open-source libraries and platforms that data scientists use today and accelerate them using GPUs. Rapids integrates with these libraries to provide accelerated analytics, machine learning and — in the future — visualization.

Rapids is based on Python, Buck noted; it has interfaces that are similar to Pandas and scikit-learn, two very popular machine learning and data analysis libraries, and it’s based on Apache Arrow for in-memory data processing. It can scale from a single GPU to multiple nodes, and IBM notes that the platform can achieve improvements of up to 50x for some specific use cases when compared to running the same algorithms on CPUs (though that’s not all that surprising, given what we’ve seen from other GPU-accelerated workloads in the past).

Buck noted that Rapids is the result of a multi-year effort to develop a rich enough set of libraries and algorithms, get them running well on GPUs and build the relationships with the open-source projects involved.

“It’s designed to accelerate data science end-to-end,” Buck explained. “From the data prep to machine learning and for those who want to take the next step, deep learning. Through Arrow, Spark users can easily move data into the Rapids platform for acceleration.”

Indeed, Spark is surely going to be one of the major use cases here, so it’s no wonder that Databricks, the company founded by the team behind Spark, is one of the early partners.

“We have multiple ongoing projects to integrate Spark better with native accelerators, including Apache Arrow support and GPU scheduling with Project Hydrogen,” said Spark founder Matei Zaharia in today’s announcement. “We believe that RAPIDS is an exciting new opportunity to scale our customers’ data science and AI workloads.”

Nvidia is also working with Anaconda, BlazingDB, PyData, Quansight and scikit-learn, as well as Wes McKinney, the head of Ursa Labs and the creator of Apache Arrow and Pandas.

Another partner is IBM, which plans to bring Rapids support to many of its services and platforms, including its PowerAI tools for running data science and AI workloads on GPU-accelerated Power9 servers, IBM Watson Studio and Watson Machine Learning and the IBM Cloud with its GPU-enabled machines. “At IBM, we’re very interested in anything that enables higher performance, better business outcomes for data science and machine learning — and we think Nvidia has something very unique here,” Rob Thomas, the GM of IBM Analytics, told me.

“The main benefit to the community is that through an entirely free and open-source set of libraries that are directly compatible with the existing algorithms and subroutines that they’re used to — they now get access to GPU-accelerated versions of them,” Buck said. He also stressed that Rapids isn’t trying to compete with existing machine learning solutions. “Part of the reason why Rapids is open source is so that you can easily incorporate those machine learning subroutines into their software and get the benefits of it.”

Sep
20
2018
--

AI could help push Neo4j graph database growth

Graph databases have always been useful to help find connections across a vast data set, and it turns out that capability is quite handy in artificial intelligence and machine learning too. Today, Neo4j, the makers of the open source and commercial graph database platform, announced the release of Neo4j 3.5, which has a number of new features aimed specifically at AI and machine learning.

Neo4j founder and CEO Emil Eifrem says he recognized the connection between AI and machine learning and graph databases a while ago, but it has taken some time for the market to catch up to the idea.

“There has been a lot of momentum around AI and graphs…Graphs are very fundamental to AI. At the same time we were seeing some early use cases, but not really broad adoption, and that’s what we’re seeing right now,” he explained.

AI graph use cases. Graphic: Neo4j

To help advance AI use cases, today’s release includes a new full-text search capability, which Eifrem says has been one of the most requested features. This is important because when you are making connections between entities, you have to be able to find all of the examples regardless of how they are worded — for example, human versus humans versus people.

Part of that was building their own indexing engine to increase indexing speed, which becomes essential with ever more data to process. “Another really important piece of functionality is that we have improved our data ingestion very significantly. We have 5x end-to-end performance improvements when it comes to importing data. And this is really important for connected feature extraction, where obviously, you need a lot of data to be able to train the machine learning,” he said. That also means faster sorting of data.

Other features in the new release include improvements to the company’s own Cypher database query language and better graph visualization, which is useful for showing how machine learning algorithms reach their results — a property known as AI explainability. They also announced support for the Go language and increased security.

Graph databases are growing increasingly important as we look to find connections between data. The most common use case is the knowledge graph, which is what lets us see connections in huge data sets. Common examples include who we are connected to on a social network like Facebook, or, if we bought one item, suggestions for similar items on an ecommerce site.

Other use cases include connected feature extraction, a common machine learning training technique that can look at a lot of data and extract the connections, the context and the relationships for a particular piece of data, such as suspects in a criminal case and the people connected to them.

Neo4j has over 300 large enterprise customers including Adobe, Microsoft, Walmart, UBS and NASA. The company launched in 2007 and has raised $80 million. The last round was $36 million in November 2016.

Sep
03
2018
--

40 million tables in MySQL 8.0 with ZFS


In my previous blog post about millions of tables in MySQL 8, I was able to create one million tables and test their performance. My next challenge is to create 40 million tables in MySQL 8 using shared tablespaces (one tablespace per schema). In this blog post I’m showing how to do it and what challenges we can expect.

Background

Once again – why do we need so many tables in MySQL? What is the use case? The main reason is customer isolation. With the new focus on security and privacy (take GDPR, for example) it is much easier and more beneficial to create a separate schema (or “database” in MySQL terms) for each customer. That creates a new set of challenges that we will need to solve. Here is the summary:

  1. Too many files. For each table, MySQL traditionally creates an FRM file. With MySQL 8.0 and the new data dictionary, this is not the case for InnoDB tables: it does not create FRM files, only an IBD file.
  2. Too much storage overhead. Just to create 40 million tables we will need ~4-5TB of space. The ZFS filesystem can help here a lot, through compression – see below.
  3. MySQL does not work well with so many tables. We have observed a lot of overhead (MySQL needs to open/close table definition files) and contention (table definitions need to be stored in memory to avoid a performance penalty, which introduces mutex contention)

Challenges

When I approached the task of creating 40 million tables, my first challenge was disk space. Just to create them, I needed at least 5TB of fast disk storage. The good news is: we have the ZFS filesystem, which provides compression out of the box. With compression I was able to use just a 250G drive with ZFS – the compression ratio is > 10x:

# du -sh --apparent-size /var/lib/mysql-data
4.7T    /var/lib/mysql-data
# du -sh /var/lib/mysql-data
131G    /var/lib/mysql-data

The second challenge is how to create those tables in a reasonable amount of time. I created a script to “provision” the databases (create all 40 million tables). The good news is that the CREATE TABLE performance regression and scalability bug has been fixed, so I was able to use this script to create 40 million tables using shared tablespaces (one tablespace per schema):

#!/bin/bash
# Provision sbtest_* databases in parallel: one general tablespace per schema,
# with 100 sbtest tables assigned to it.
function do_db {
        db_exist=$(mysql -A -s -Nbe "select 1 from information_schema.schemata where schema_name = '$db'")
        if [ "$db_exist" == "1" ]; then echo "Already exists $db"; return 0; fi;
        mysql -vvv -e "create database $db";
        mysql -vvv $db -e "CREATE TABLESPACE $db ADD DATAFILE '$db.ibd' engine=InnoDB;"
        for i in {1..100}
        do
                table="CREATE TABLE sbtest$i ( id int(10) unsigned NOT NULL AUTO_INCREMENT, k int(10) unsigned NOT NULL DEFAULT '0', c varchar(120) NOT NULL DEFAULT '', pad varchar(60) NOT NULL DEFAULT '', PRIMARY KEY (id), KEY k_1 (k) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 tablespace $db;"
                mysql $db -e "$table"
        done
}
c=0
for m in {1..4000000}
do
        for i in {1..40}
        do
                let c=$c+1
                echo $c
                db="sbtest_$c"
                do_db &
        done
        wait
        #if [ $c > 4000000 ]; then exit; fi
done

40 million tables in MySQL 8

Now it’s time for a real test. I’m using the latest MySQL 8 version (at the time of writing): 8.0.12. This implements the new data dictionary.

MySQL config file:

[mysqld]
datadir=/var/lib/mysql-data
socket=/var/lib/mysql-data/mysql.sock
log-error = /var/lib/mysql-log/error.log
server_id = 12345
log_bin = /var/lib/mysql-log/binlog
relay_log=/var/lib/mysql-log/relay-bin
skip-log-bin=1
innodb_log_group_home_dir = /var/lib/mysql-log
innodb_doublewrite = 0
innodb_flush_log_at_trx_commit=0
innodb_log_file_size=2G
innodb_buffer_pool_size=4G
tablespace_definition_cache = 524288
schema_definition_cache = 524288
table_definition_cache = 524288
table_open_cache=524288
open-files-limit=1000000

Sysbench shell script:

function run_sb() {
conn=" --db-driver=mysql --mysql-socket=/var/lib/mysql-data/mysql.sock  --mysql-db=sbtest_1 --mysql-user=sbtest --mysql-password=abc "
sysbench $conn --oltp_db_count=$db_count --oltp_tables_count=$table_count --oltp-table-size=10000 --report-interval=1 --num-threads=$num_threads --max-requests=0 --max-time=$max_time ./select_custom.lua run | tee -a sysbench_2.txt
}
let db_count=400000
table_count=100
max_time=10000
num_threads=32
run_sb

Sysbench lua script:

pathtest = "/usr/share/sysbench/tests/include/oltp_legacy/"
if pathtest then
   dofile(pathtest .. "common.lua")
else
   require("common")
end
function thread_init(thread_id)
   set_vars()
end
function event()
   local table_name
   local i
   local c_val
   local k_val
   local pad_val
   oltp_db_count = tonumber(oltp_db_count) or 1
   -- local oltp_db_count = 4
   table_name = "sbtest_" .. sb_rand(1, oltp_db_count)..".sbtest".. sb_rand(1, oltp_tables_count)
   k_val = sb_rand(1, oltp_table_size)
   c_val = sb_rand_str([[
###########-###########-###########-###########-###########-###########-###########-###########-###########-###########]])
   pad_val = sb_rand_str([[
###########-###########-###########-###########-###########]])
      rs = db_query("SELECT id FROM " .. table_name .." LIMIT 1")
end

Please note that the tables are empty – no data.

Now we can run the benchmark. Unfortunately, we have serious mutex contention in the data dictionary. Here are the results:

[ 453s ] thds: 32 tps: 1203.96 qps: 1203.96 (r/w/o: 1203.96/0.00/0.00) lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[ 454s ] thds: 32 tps: 1202.32 qps: 1202.32 (r/w/o: 1202.32/0.00/0.00) lat (ms,95%): 42.61 err/s: 0.00 reconn/s: 0.00
[ 455s ] thds: 32 tps: 1196.74 qps: 1196.74 (r/w/o: 1196.74/0.00/0.00) lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[ 456s ] thds: 32 tps: 1197.18 qps: 1197.18 (r/w/o: 1197.18/0.00/0.00) lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[ 457s ] thds: 32 tps: 887.11 qps: 887.11 (r/w/o: 887.11/0.00/0.00) lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[ 458s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 459s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 460s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 461s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 462s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 463s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 464s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 465s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 466s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 467s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 468s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 469s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 470s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 471s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 472s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 473s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 474s ] thds: 32 tps: 403.96 qps: 403.96 (r/w/o: 403.96/0.00/0.00) lat (ms,95%): 16819.24 err/s: 0.00 reconn/s: 0.00
[ 475s ] thds: 32 tps: 1196.00 qps: 1196.00 (r/w/o: 1196.00/0.00/0.00) lat (ms,95%): 41.85 err/s: 0.00 reconn/s: 0.00
[ 476s ] thds: 32 tps: 1208.96 qps: 1208.96 (r/w/o: 1208.96/0.00/0.00) lat (ms,95%): 41.85 err/s: 0.00 reconn/s: 0.00
[ 477s ] thds: 32 tps: 1192.06 qps: 1192.06 (r/w/o: 1192.06/0.00/0.00) lat (ms,95%): 41.85 err/s: 0.00 reconn/s: 0.00
[ 478s ] thds: 32 tps: 1173.89 qps: 1173.89 (r/w/o: 1173.89/0.00/0.00) lat (ms,95%): 43.39 err/s: 0.00 reconn/s: 0.00

As we can see, for ~15 seconds no queries were processed: a complete MySQL stall. That situation – a complete stall – happens repeatedly, every ~25-30 seconds.
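
One crude way to watch for these stalls from the shell is to track how the per-second query counters move — during a stall the delta drops to zero. A sketch using mysqladmin's relative mode:

# print per-second deltas of Com_select; the value stops increasing during a stall
mysqladmin -r -i 1 extended-status | grep -w Com_select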

The SHOW ENGINE INNODB STATUS output shows the mutex contention:

SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 498635
--Thread 140456572004096 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140451898689280 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140451896919808 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140456571119360 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140457044215552 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140456572299008 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140457043035904 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140456571709184 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140451897214720 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140451896624896 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140457042740992 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140451899279104 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
--Thread 140457042446080 has waited at que0que.cc line 1072 for 0.00 seconds the semaphore:
Mutex at 0x34a3fe0, Mutex PARSER created pars0pars.cc:98, lock var 1
OS WAIT ARRAY INFO: signal count 89024
RW-shared spins 11216, rounds 14847, OS waits 3641

I’ve filed a new MySQL bug: DICT_SYS mutex contention causes complete stall when running with 40 mill tables.

I’ve also tested with the pareto distribution in sysbench; even with the ratio set to 0.05 (5%) and 0.01 (1%), mutex contention is still an issue. I used the following updated sysbench script:

function run_sb() {
conn=" --db-driver=mysql --mysql-socket=/var/lib/mysql-data/mysql.sock  --mysql-db=sbtest_1 --mysql-user=sbtest --mysql-password=abc "
sysbench $conn --rand-type=$rand_type --rand-pareto-h=$pareto_h --oltp_db_count=$db_count --oltp_tables_count=$table_count --oltp-table-size=10000 --report-interval=1 --num-threads=$num_threads --max-requests=0 --max-time=$max_time $test_name run | tee -a sysbench_2.txt
}
let db_count=400000
table_count=100
max_time=10000
num_threads=32
rand_type="pareto"
pareto_h=0.01
test_name="./select_custom.lua"
echo "Now running $rand_type for $max_time seconds, test=$test_name"
run_sb

And the results with 0.01 (1%) are the following:

[ 55s ] thds: 32 tps: 72465.29 qps: 72465.29 (r/w/o: 72465.29/0.00/0.00) lat (ms,95%): 0.53 err/s: 0.00 reconn/s: 0.00
[ 56s ] thds: 32 tps: 68641.04 qps: 68641.04 (r/w/o: 68641.04/0.00/0.00) lat (ms,95%): 0.61 err/s: 0.00 reconn/s: 0.00
[ 57s ] thds: 32 tps: 70479.82 qps: 70479.82 (r/w/o: 70479.82/0.00/0.00) lat (ms,95%): 0.57 err/s: 0.00 reconn/s: 0.00
[ 58s ] thds: 32 tps: 31395.55 qps: 31395.55 (r/w/o: 31395.55/0.00/0.00) lat (ms,95%): 0.49 err/s: 0.00 reconn/s: 0.00
[ 59s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 61s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 62s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 63s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 64s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 65s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 66s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 67s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 68s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 69s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 32 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 71s ] thds: 32 tps: 18879.04 qps: 18879.04 (r/w/o: 18879.04/0.00/0.00) lat (ms,95%): 0.75 err/s: 0.00 reconn/s: 0.00
[ 72s ] thds: 32 tps: 70924.82 qps: 70924.82 (r/w/o: 70924.82/0.00/0.00) lat (ms,95%): 0.48 err/s: 0.00 reconn/s: 0.00
[ 73s ] thds: 32 tps: 72395.57 qps: 72395.57 (r/w/o: 72395.57/0.00/0.00) lat (ms,95%): 0.47 err/s: 0.00 reconn/s: 0.00
[ 74s ] thds: 32 tps: 72483.22 qps: 72484.22 (r/w/o: 72484.22/0.00/0.00) lat (ms,95%): 0.58 err/s: 0.00 reconn/s: 0.00

ZFS

The ZFS filesystem provides compression, which helps tremendously in this case. When MySQL creates an InnoDB table it creates a new .ibd file and pre-allocates some pages, most of which are blank and therefore compress very well. I have configured ZFS compression and can see a > 10x compression ratio:

# zfs get all | grep compressratio
mysqldata                       compressratio         12.47x                      -
mysqldata                       refcompressratio      1.00x                       -
mysqldata/mysql                 compressratio         12.47x                      -
mysqldata/mysql                 refcompressratio      1.00x                       -
mysqldata/mysql/data            compressratio         12.51x                      -
mysqldata/mysql/data            refcompressratio      12.54x                      -
mysqldata/mysql/log             compressratio         2.79x                       -
mysqldata/mysql/log             refcompressratio      4.57x                       -
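
The same effect is visible for a single pre-allocated .ibd file by comparing its apparent size with the blocks actually written (the path below is illustrative):

# logical size that InnoDB pre-allocated vs. physical size after ZFS gzip compression
F=/var/lib/mysql-data/sbtest_1.ibd
du -h --apparent-size "$F"
du -h "$F"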

Conclusion

It is possible to create 40 million tables with MySQL 8.0 using shared tablespaces. ZFS provides an excellent compression ratio (with gzip), which can help by reducing the overhead of a “schema per customer” architecture. Unfortunately, the new data dictionary in MySQL 8.0.12 suffers from DICT_SYS mutex contention, which causes constant “stalls”.


Jul
24
2018
--

Outlier raises $6.2 M Series A to change how companies use data

Traditionally, companies have gathered data from a variety of sources, then used spreadsheets and dashboards to try and make sense of it all. Outlier wants to change that and deliver a handful of insights right to your inbox that matter most for your job, company and industry. Today the company announced a $6.2 million Series A to further develop that vision.

The round was led by Ridge Ventures with assistance from 11.2 Capital, First Round Capital, Homebrew, Susa Ventures and SV Angel. The company has raised over $8 million.

The startup is trying to solve a difficult problem around delivering meaningful insight without requiring the customer to ask the right questions. With traditional BI tools, you get your data and you start asking questions and seeing if the data can give you some answers. Outlier wants to bring a level of intelligence and automation by pointing out insights without the user having to explicitly ask the right question.

Company founder and CEO Sean Byrnes says his previous company, Flurry, helped deliver mobile analytics to customers, but in his travels meeting customers in that previous iteration, he always came up against the same question: “This is great, but what should I look for in all that data?”

The question was so compelling that after he sold Flurry to Yahoo in 2014 for more than $200 million, it stuck in the back of his mind and he decided to start a business to solve it. He contends that the first 15 years of BI was about getting answers to basic questions about company performance, but the next 15 will be about finding a way to get the software to ask good questions based on the huge amounts of data.

Byrnes admits that when he launched, he didn’t have much sense of how to put this notion into action, and most people he approached didn’t think it was a great idea. He says he heard “No” from a fair number of investors early on because the artificial intelligence required to fuel a solution like this really wasn’t ready in 2015 when he started the company.

He says that it took four or five iterations to get to today’s product, which lets you connect to various data sources and, using artificial intelligence and machine learning, delivers a list of four or five relevant questions to the user’s email inbox that point out data you might not have noticed — what he calls “shifts below the surface.” If you’re a retailer, that could be changing market conditions that signal you might want to change your production goals.

Outlier email example. Photo: Outlier

The company launched in 2015. It took some time to polish the product, but today they have 14 employees and 14 customers including Jack Rogers, Celebrity Cruises and Swarovski.

This round should allow them to continue working to grow the company. “We feel like we hit the right product-market fit because we have customers [generating] reproducible results and really changing the way people use the data,” he said.

Jul
23
2018
--

SessionM customer loyalty data aggregator snags $23.8 M investment

SessionM announced a $23.8 million Series E investment led by Salesforce Ventures. A bushel of existing investors including Causeway Media Partners, CRV, General Atlantic, Highland Capital and Kleiner Perkins Caufield & Byers also contributed to the round. The company has now raised over $97 million.

At its core, SessionM aggregates loyalty data for brands to help them understand their customer better, says company co-founder and CEO Lars Albright. “We are a customer data and engagement platform that helps companies build more loyal and profitable relationships with their consumers,” he explained.

Essentially that means they are pulling data from a variety of sources and helping brands offer customers more targeted incentives, offers and product recommendations. “We give [our users] a holistic view of that customer and what motivates them,” he said.

Screenshot: SessionM (cropped)

To achieve this, SessionM takes advantage of machine learning to analyze the data stream and integrates with partner platforms like Salesforce, Adobe and others. This certainly fits in with Adobe’s goal to build a customer service experience system of record and Salesforce’s acquisition of Mulesoft in March to integrate data from across an organization, all in the interest of better understanding the customer.

When it comes to using data like this, especially with the advent of GDPR in the EU in May, Albright recognizes that companies need to be more careful with data, and that the regulation has really enhanced the sensitivity around stewardship for all data-driven businesses like his.

“We’ve been at the forefront of adopting the right product requirements and features that allow our clients and businesses to give their consumers the necessary control to be sure we’re complying with all the GDPR regulations,” he explained.

The company was not discussing valuation or revenue. Their most recent round prior to today’s announcement was a Series D in 2016 for $35 million, also led by Salesforce Ventures.

SessionM, which was founded in 2011, has around 200 employees with headquarters in downtown Boston. Customers include Coca-Cola, L’Oreal and Barney’s.

Jun
12
2018
--

Sumo Logic brings data analysis to containers

Sumo Logic has long held the goal of helping customers understand their data wherever it lives. As we move into the era of containers, that goal becomes more challenging because containers by their nature are ephemeral. The company announced a product enhancement today designed to instrument containerized applications in spite of that.

They are debuting these new features at DockerCon, Docker’s customer conference taking place this week in San Francisco.

Sumo’s CEO Ramin Sayer says containers have begun to take hold over the last 12-18 months with Docker and Kubernetes emerging as tools of choice. Given their popularity, Sumo wants to be able to work with them. “[Docker and Kubernetes] are by far the most standard things that have developed in any new shop, or any existing shop that wants to build a brand new modern app or wants to lift and shift an app from on prem [to the cloud], or have the ability to migrate workloads from Vendor A platform to Vendor B,” he said.

He’s not wrong of course. Containers and Kubernetes have been taking off in a big way over the last 18 months and developers and operations alike have struggled to instrument these apps to understand how they behave.

“But as that standardization of adoption of that technology has come about, it makes it easier for us to understand how to instrument, collect, analyze, and more importantly, start to provide industry benchmarks,” Sayer explained.

They do this by avoiding the use of agents. Regardless of how you run your application, whether in a VM or a container, Sumo is able to capture the data and give you feedback you might otherwise have trouble retrieving.

Screen shot: Sumo Logic (cropped)

The company has built in native support for Kubernetes and Amazon Elastic Container Service for Kubernetes (Amazon EKS). It also supports the open source tool Prometheus favored by Kubernetes users to extract metrics and metadata. The goal of the Sumo tool is to help customers fix issues faster and reduce downtime.

As they work with this technology, they can begin to understand norms and pass that information onto customers. “We can guide them and give them best practices and tips, not just on what they’ve done, but how they compare to other users on Sumo,” he said.

Sumo Logic was founded in 2010 and has raised $230 million, according to data on Crunchbase. Its most recent round was a $70 million Series F led by Sapphire Ventures last June.

Jun
07
2018
--

Devo scores $25 million and cool new name

Logtrust is now known as Devo in one of the cooler name changes I’ve seen in a long time. Whether they intended to pay homage to the late 70s band is not clear, but investors probably didn’t care, as they gave the data operations startup a bushel of money today.

The company now known as Devo announced a $25 million Series C round led by Insight Venture Partners with participation from Kibo Ventures. Today’s investment brings the total raised to $71 million.

The company changed its name because it was about much more than logs, according to CEO Walter Scott. It offers a cloud service that allows customers to stream massive amounts of data — think terabytes or even petabytes — relieving them of the need to worry about all of the scaling and hardware that processing this amount of data would require. That could be logs from web servers, security data from firewalls or transactions taking place on backend systems, to name a few examples.

The data can live on prem if required, but the processing always gets done in the cloud to provide for the scaling needs. Scott says this is about giving companies the ability to process and understand massive amounts of data that previously was only within reach of web-scale companies like Google, Facebook or Amazon.

But it involves more than simply collecting the data. “It’s the combination of us being able to collect all of that data together with running analytics on top of it all in a unified platform, then allowing a very broad spectrum of the business [to make use of it],” Scott explained.

Devo dashboard. Photo: Devo

Devo sees Sumo Logic, Elastic and Splunk as its primary competitors in this space, but like many startups they often battle companies trying to build their own systems as well, a difficult approach for any company to take when you are dealing with this amount of data.

The company, which was founded in Spain, is now based in Cambridge, Massachusetts, and has close to 100 employees. Scott says he has the budget to double that by the end of the year, although he’s not sure they will be able to hire that many people that rapidly.

May
31
2018
--

Don’t Drown in your Data Lake


A data lake is “…a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms…”1. Many companies find value in using a data lake, but aren’t clear that they need to properly plan for it and maintain it in order to prevent issues.

The idea of a data lake rose from the need to store data in a raw format that is accessible to a variety of applications and authorized users. Hadoop is often used to query the data, and the necessary structures for querying are created through the query tool (schema on read) rather than as part of the data design (schema on write). There are other tools available for analysis, and many cloud providers are actively developing additional options for creating and managing your data lake. The cloud is often viewed as an ideal place for your data lake since it is inherently elastic and can expand to meet the needs of your data.

Data Lake or Data Swamp?

One of the key components of a functioning data lake is the continuing inflow and egress of data. Some data must be kept indefinitely, but some can be archived or deleted after a defined period of time. Failure to remove stale data can result in a data swamp, where out-of-date data takes up valuable and costly space and may cause queries to take longer to complete. This is one of the first issues that companies encounter in maintaining their data lake. Often, people view the data lake as a “final resting place” for data, but it really should be used for data that is accessed often, or at least occasionally.

A natural spring-fed lake can turn into a swamp due to a variety of factors. If fresh water is not allowed to flow into the lake, this can cause stagnation, meaning that plants and animals that previously were not able to be supported by the lake take hold. Similarly, if water cannot exit the lake at some point, the borders will be breached, and the surrounding land will be inundated. Both of these conditions can cause a once pristine lake to turn into a fetid and undesirable swamp. If data is no longer being added to your data lake, the results will become dated and eventually unreliable. Also, if data is always being added to the lake but is not accessed on a regular basis, this can lead to unrestricted growth of your data lake, with no real plan for how the data will be used. This can become an expensive “cold storage” facility that is likely more expensive than archived storage.

If bad or undesirable items, like old cars or garbage, are thrown into a lake, this can damage the ecosystem, causing unwanted reactions. In a data lake, this is akin to simply throwing data into the data lake with no real rules or rationale. While the data is saved, it may not be useful and can cause negative consequences across the whole environment since it is consuming space and may slow response times. Even though a basic concept of a data lake is that the data does not need to conform to a predefined structure, like you would see with a relational database, it is important that some rules and guidelines exist regarding the type and quality of data that is included in the lake. In the absence of some guidelines, it becomes difficult to access the relevant data for your needs. Proper definition and tagging of content help to ensure that the correct data is accessible and available when needed.

Unrestricted Growth Consequences

Many people have a junk drawer somewhere in their house; a drawer that is filled with old receipts, used tickets, theater programs, and the like. Some of this may be stored for sentimental reasons, but a lot of it is put into this drawer since it was a convenient dropping place for things. Similarly, if we look to the data lake as the “junk drawer” for our company, it is guaranteed to be bigger and more expensive than it truly needs to be.

It is important that the data that is stored in your data lake has a current or expected purpose. While you may not have a current use for some data, it can be helpful to keep it around in case a need arises. An example of this is in the area of machine learning. Providing more ancillary data enables better decisions since it gives a deeper view into the decision process. Therefore, maintaining some data that may not have a specific and current need can be helpful. However, there are cases where maintaining a huge volume of data can be counterproductive. Consider temperature information delivered from a switch. If the temperature reaches a specific threshold, the switch should be shut down. Reporting on the temperature in an immediate and timely manner is important to make an informed decision, but stable temperature data from days, weeks, or months ago could be summarized and stored in a more efficient manner. The granular details can then be purged from the lake.

So, where is the balance? If you keep all the data, it can make your data lake unwieldy and costly. If you only keep data that has a specific current purpose, you may be impairing your future plans. Obviously, the key is to monitor your access and use of the data frequently, and purge or archive some of the data that is not being regularly used.
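
As a toy illustration of that kind of housekeeping, the sketch below moves files that have not been read in roughly a year out of a hypothetical lake directory into cheaper archive storage. A real deployment would normally lean on the storage layer's own lifecycle or tiering tooling rather than a shell script:

#!/bin/bash
# Toy age-based sweep: LAKE_DIR and ARCHIVE_DIR are hypothetical paths, and the
# filesystem must track access times for -atime to be meaningful.
LAKE_DIR=/data/lake
ARCHIVE_DIR=/data/archive
find "$LAKE_DIR" -type f -atime +365 -print0 |
  while IFS= read -r -d '' f; do
        dest="$ARCHIVE_DIR/${f#$LAKE_DIR/}"   # keep the directory structure under the archive
        mkdir -p "$(dirname "$dest")"
        mv "$f" "$dest"
  done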

Uncontrolled Access Concerns

Since much of the data in your data lake is company confidential, it is imperative that access to that data be controlled. The fact that the data in the lake is stored in its raw format means that it is more difficult to control access. The structures of a relational database provide some of the basis for access control, allowing us to limit who has access to specific queries, tables, fields, schemas, databases, and other objects. In the absence of these structures, controlling access requires more finesse. Determining who has access to what parts of the data in the lake must be handled, as well as isolating the data within your own network environment. Many of these restrictions may already be in place in your current environment, but they should be reviewed before being relied on fully, since the data lake may store information that was previously unavailable to some users. Access should be regularly reviewed to identify potential rogue activities. Encryption options also exist to further secure the data from unwanted access, and file system security can be used to limit access. All of these components must be considered, implemented, and reviewed to ensure that the data is secure.

User Considerations

In a relational database, the data structure inherently determines some of the consistencies and format of the data. This enables users to easily query the data and be assured that they are returning valid results. The lack of such structures in the data lake means that users must be more highly skilled at data manipulation. Having users with less skill accessing the data is possible, but it may not provide the best results. A data scientist is better positioned to access and query the complete data set. Obviously, users with a higher skill set are rare and cost more to hire, but the return may be worth it in the long run.

So What Do I Do Now?

This is an area where there are no hard and fast rules. Each company must develop and implement processes and procedures that make sense for their individual needs. Only with a plan for monitoring inputs, outputs, access patterns, and the like are you able to make a solid determination for your company’s needs. Percona can help to determine a plan for reporting usage, assess security settings, and more. As you are using the data in your data lake, we can also provide guidance regarding tools used to access the data.

1 Wikipedia, May 22, 2018

