Nov
05
2019
--

Chronosphere launches with $11M Series A to build scalable, cloud-native monitoring tool

Chronosphere, a startup from two ex-Uber engineers who helped create the open-source M3 monitoring project to handle Uber-level scale, officially launched today with the goal of building a commercial company on top of the open-source project.

It also announced an $11 million investment led by Greylock, with participation from venture capitalist Lee Fixel.

While the founders, CEO Martin Mao and CTO Rob Skillington, were working at Uber, they recognized a gap in the monitoring industry, particularly around cloud-native technologies like containers and microservices. There weren’t any tools available on the market that could handle Uber’s scaling requirements — so like any good engineers, they went out and built their own.

“We looked around at the market at the time and couldn’t find anything in open source or commercially available that could really scale to our needs. So we ended up building and open sourcing our solution, which is M3. Over the last three to four years we’ve scaled M3 to one of the largest production monitoring systems in the world today,” Mao explained.

The essential difference between M3 and other open-source, cloud-native monitoring solutions like Prometheus is that ability to scale, he says.

One of the main reasons they left to start a company, with the blessing of Uber, was that the community began asking for features that didn’t really make sense for Uber. By launching Chronosphere, Mao and Skillington would be taking on the management of the project moving forward (although sharing governance for the time being with Uber), while building those enterprise features the community has been requesting.

The new company’s first product will be a cloud version of M3 to help reduce some of the complexity associated with managing an M3 project. “M3 itself is a fairly complex piece of technology to run. It is solving a fairly complex problem at large scale, and running it actually requires a decent amount of investment to run at large scale, so the first thing we’re doing is taking care of that management,” Mao said.

Jerry Chen, who led the investment at Greylock, saw a company solving a big problem. “They were providing such a high-resolution view of what’s going on in your cloud infrastructure and doing that at scale at a cost that actually makes sense. They solved that problem at Uber, and I saw them, and I was like wow, the rest of the market needs what these guys built, and I wrote the Series A check. It was as simple as that,” Chen told TechCrunch.

The cloud product is currently in private beta; they expect to open to public beta early next year.

Nov
04
2019
--

Robocorp announces $5.6M seed to bring open-source option to RPA

Robotic Process Automation (RPA) has been a hot commodity in recent years as it helps automate tedious manual workflows inside large organizations. Robocorp, a San Francisco startup, wants to bring open source and RPA together. Today it announced a $5.6 million seed investment.

Benchmark led the round, with participation from Slow Ventures, firstminute Capital, Bret Taylor (president and chief product officer at Salesforce) and Docker CEO Rob Bearden. In addition, Benchmark’s Peter Fenton will be joining the company’s board.

Robocorp co-founder and CEO Antti Karjalainen has been around open-source projects for years, and he saw an enterprise software category that was lacking in open-source options. “We actually have a unique angle on RPA, where we are introducing open source and cloud native technology into the market and focusing on developer-led technologies,” Karjalainen said.

He sees a market that’s top-down and focused on heavy sales cycles. He wants to bring the focus back to the developers who will be using the tools. “We are all about removing friction from developers. So, we are focused on giving developers tools that they like to use, and want to use for RPA, and doing it in an open-source model where the tools themselves are free to use,” he said.

The company is built on the open-source Robot Framework project, which was originally developed as an open-source software testing environment. Karjalainen sees RPA as having a lot in common with testing, and his team has been able to take the project and apply it to RPA.

If you’re wondering how the company will make money, they are offering a cloud service to further reduce the complexity of using the open-source tools, and that includes the kinds of features enterprises tend to demand from these projects, like security, identity and access management, and so forth.

Benchmark’s Peter Fenton, who has invested in several successful open-source startups, including JBoss, SpringSource and Elastic, sees RPA as an area that’s ripe for a developer-focused open-source option. “We’re living in the era of the developer, where cloud-native and open source provide the freedom to innovate without constraint. Robocorp’s RPA approach provides developers the cloud native, open-source tools to bring RPA into their organizations without the burdensome constraints of existing offerings,” Fenton said.

The company intends to use the money to add new employees and continue scaling the cloud product, while working to build the underlying open-source community.

While UiPath, a fast-growing startup with a hefty $7.1 billion valuation, recently announced it was laying off 400 people, Gartner published a study in June showing that RPA is the fastest-growing enterprise software category.

Oct
30
2019
--

Spooktacularly Scary Database Stories

The nights are lengthening and the spookiest day of the year is nearly upon us, Halloween! In the spirit of the holiday, we asked our team to share their scariest tales of database dread, blood-curdling BIOS failures, and dastardly data destruction, and some of the responses are downright chilling.

Grab some candy and check out the stories that Perconians are too afraid to talk about after dark!

Brian Walters, Director of Solution Engineering:

Rookie DBA at a company with a shoestring budget = dev, test, and prod on the same server. What could possibly go wrong?

So, I’m about two years into my career as a DBA and part of my job is maintaining the database for the company MRP system. This system is critical, without it the whole plant shuts down.

During the implementation of the MRP system, the company had fallen into the bad practice of using the production database server for development and testing purposes as well. I’d made several requests for dedicated dev/test hardware, but this just didn’t seem to be a priority for those controlling the budget.

My days usually started with the same few tasks: checking the backups from the night before, deleting the prior day’s testing database, and creating a new testing environment by restoring last night’s backup. I had my routine pretty tight, most of it was scripted. All I had to do was change an environment variable and run the daily scripts.

This all worked fine until one day… that morning, I was just a little more tired than normal. I logged into the database server and thought that I had set the environment variables so they pointed to the testing database. By some mistaken force of habit, mental lapse, or temporary spooky hallucination, I had actually, accidentally set the environment variables to prod… and then I ran the delete-database scripts.

Somehow, I realized my mistake almost before the enter-key was fully depressed. But by that time it was already too late. It took less than ten seconds for the first phone call to arrive. Naturally, I sent it to voicemail, along with the next three phone calls that immediately followed. My next instinct was to press the Send All Calls button. Head in my hands, fully realizing what had just happened, I struggled to bring myself to understand how I did it.

After a quick conversation with my boss, who also happened to be the CFO, my schedule was cleared for the day. The remainder of the morning was spent practicing my Point in Time Recovery skills. To this day, I am grateful that I had a solid and tested backup and restore plan. I went home early that day. And that was the last time I made a mistake like that. We purchased dedicated dev/test hardware about a month later.

Robert Bernier, PostgreSQL Consultant

I was working as a newly hired Senior DBA when two developers suddenly appeared at my desk in a very agitated and excited state of mind. They were upset because they had accidentally pushed OS updates to embedded devices that were installed on many of our client devices. These premature updates effectively bricked the devices, which numbered in the tens of thousands.

I suppose I shouldn’t have been surprised, but they actually “demanded” that I execute a rollback at once. Needless to say, being a new hire, there were a lot of things I still didn’t know, and I was anxious to avoid making the situation worse. So I invited them to find a couple of chairs, sit next to me and, taking their time, “explain” who they were and what they did at the company. Eventually, I got around to the events leading up to the incident. Slowing them down was my primary concern, as the rollback’s success hinged on their state of mind. In time, they were able to demonstrate the issue and its resolution. Within a couple of hours, we staged the rollback across the affected devices and unbricked them.

In retrospect it could have been worse as the data-store I was managing held several hundred TB representing 600,000 embedded devices.

The moral of the story, heh heh… always know where to find your towel.

 

Audrey Swagerty, Customer Success Manager

This is not a tech story but a real nightmare!

When I started as a CSR, 3.5 years ago, I was not familiar with our industry and the technologies so it was quite the challenge… for some time (I guess until I got more comfortable with my new job), I used to have a recurring nightmare. I was being chased through the woods by a girl… and she would finally catch up with me, grab me, and I would ask her name (don’t ask me why… instead of asking her not to kill me, right?)… And she would say: I am MongoDB!!! Then, I would wake up!

I have not thought about that story in a long time (and have not had the same nightmare ever since) so it was fun to share it again with you! Hopefully, it won’t happen again this weekend… I have been trying to help a customer with Kubernetes questions, so you never know!

 

Marcos Albe, Principal Technical Services Engineer

The worst horror story is a hospital introducing inconsistencies into a database… I always feared someone with asthma would end up as an amputee due to broken data!

 

Patrick Birch, Senior Technical Writer

While I was a SQL Server DBA, my CTO allowed the report writers to access the production database. The report writers were very nice people but were lacking in certain qualities, such as the ability to write SQL code. I wrote their queries and asked them to run their requirements through me. Well, I went to lunch one day, and when I returned the CTO and other managers were running around. One of the report writers had written a cartesian join (cross join), and it was slowing the production database to a crawl!

I killed the connection and had a long talk with the report writers. The managers approved my building a data warehouse the next day.

 

Martin James, Vice President of Sales EMEA & APAC

At my last company, I started to look at the scary reality of data security and healthcare. As mists and mellow fruitfulness provide the perfect backdrop to the spooky traditions of Halloween, ghostly goings-on were being uncovered in unexpected areas. Not in gloomy churchyards nor crumbling manor houses, but in the databases of general practitioners in England.

In 2016, the data of 57 million patients was held within general practitioners’ records. However, census data at the time suggested this figure should have stood at only 54 million. So who are these extra 3 million people? These records belong to ‘ghost patients’: patients who have died or emigrated, or who are duplications or inaccuracies in record keeping. Either way, it has an impact on surgeries, on funding, and on the services provided, adding unnecessary pressure to the NHS and leaving it unable to provide a precise picture of its patients and the care it provides.

So, NHS England began a ghost hunt to track down and identify the owners of these records so they can update their data and save money. If a patient hasn’t seen their GP for five years, they’ll be sent a letter requesting them to respond. If they don’t respond, they’ll be sent a second letter, after which they’ll be removed from the patient register. This could be a measure that makes an instant saving, as, according to the BBC, family doctors are paid for every patient registered on their list (ghost or not) and the Times quantifies this at around £400 million a year.

Taking a deeper dive into patient data and the connections within it could bring great benefit to the health service. It would enable a tighter hold on data and drive compliance, as well as help the NHS improve precision and accuracy in its records. Its data will become an asset, and a means of national health intelligence. A utopian view? Perhaps – but without the need for ghostbusters!

What’s your scariest database story?  Let us know in the comments or reply to us on social so we can be on the lookout for database ghouls and goblins! And to learn how Percona’s experts can take the scary out of database management, check out our open source database support, managed services, and consulting services.

Oct
29
2019
--

Column Histograms on Percona Server and MySQL 8.0

From time to time you may have experienced that MySQL was not able to find the best execution plan for a query. You felt the query should have been faster. You felt that something didn’t work, but you didn’t realize exactly what.

Maybe some of you did tests and discovered there was a better execution plan that MySQL wasn’t able to find (forcing the order of the tables with STRAIGHT_JOIN for example).
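
Such a manual workaround might look like the following sketch. The orders and customers tables here are hypothetical and serve only to illustrate the hint:

-- STRAIGHT_JOIN forces MySQL to join the tables in the order they are listed in the FROM clause
SELECT STRAIGHT_JOIN o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.total > 1000;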

In this article, we’ll see a new interesting feature available on MySQL 8.0 as well as Percona Server for MySQL 8.0: the histogram-based statistics.

Today, we’ll see what a histogram is, how you can create and manage it, and how MySQL’s optimizer can use it.

Just for completeness, histogram statistics have been available on MariaDB since version 10.0.2, with a slightly different implementation. Anyway, what we’ll see here is related to Percona Server and MySQL 8.0 only.

 

What is a histogram

We can define a histogram as a good approximation of the data distribution of the values in a column.

Histogram-based statistics were introduced to give the optimizer more execution plans to investigate and solve a query. Until then, in some cases, the optimizer was not able to find out the best possible execution plan because non-indexed columns were ignored.

With histogram statistics, the optimizer now has more options because non-indexed columns can also be considered. In some specific cases, a query can run faster than usual.

Let’s consider the following table that stores the departure times of trains:

CREATE TABLE train_schedule(
id INT PRIMARY KEY,
train_code VARCHAR(10),
departure_station VARCHAR(100),
departure_time TIME);

We can assume that during peak hours, from 7 AM until 9 AM, there are more rows, and during the night hours we have very few rows.
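
For illustration only, a few made-up rows with that kind of skew could be loaded like this:

INSERT INTO train_schedule VALUES
(1, 'R1001', 'Milano Centrale', '07:35:00'),
(2, 'R1002', 'Milano Centrale', '07:50:00'),
(3, 'R1003', 'Roma Termini', '08:05:00'),
(4, 'R1004', 'Roma Termini', '08:20:00'),
(5, 'N2001', 'Roma Termini', '02:10:00');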

Let’s take a look at the following two queries:

SELECT * FROM train_schedule WHERE departure_time BETWEEN '07:30:00' AND '09:15:00';
SELECT * FROM train_schedule WHERE departure_time BETWEEN '01:00:00' AND '03:00:00';

Without any kind of statistics, the optimizer assumes by default that the values in the departure_time column are evenly distributed, but they aren’t. In fact, the first query returns many more rows than the second because it covers the peak hours, something the optimizer cannot know without statistics.

Histograms were invented to provide the optimizer with a good estimation of the rows returned. This may seem trivial for the simple queries we have seen so far. But now think about the same table being involved in JOINs with other tables. In such a case, the number of rows returned can be very important for the optimizer when deciding the order in which to consider the tables in the execution plan.

A good estimation of the rows returned lets the optimizer process a table in the first stages of the plan when it returns few rows. This minimizes the total number of rows in the final cartesian product, so the query can run faster.

MySQL supports two different types of histograms: “singleton” and “equi-height”. Common for all histogram types is that they split the data set into a set of “buckets”, and MySQL automatically divides the values into the buckets and will also automatically decide what type of histogram to create.

Singleton histogram

  • one value per bucket
  • each bucket stores
    • value
    • cumulative frequency
  • well suited for equality and range conditions

Equi-height histogram

  • multiple values per bucket
  • each bucket stores
    • minimum value
    • maximum value
    • cumulative frequency
    • number of distinct values
  • not really equi-height: frequent values are in separate buckets
  • well suited for range conditions

How to use histograms

The histogram feature is available and enabled on the server, but histograms are not created automatically. Without an explicit creation, the optimizer works the same as usual and cannot get any benefit from histogram-based statistics.

There is some manual work to do. Let’s see.

In the next examples, we’ll use the world sample database you can download from here: https://dev.mysql.com/doc/index-other.html

Let’s start by executing a query that joins two tables to find all the languages spoken in the largest cities of the world, those with more than 10 million people.

mysql> select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000; 
+-----------------+-----------+ 
| name            | language  | 
+-----------------+-----------+ 
| Mumbai (Bombay) | Asami     | 
| Mumbai (Bombay) | Bengali   | 
| Mumbai (Bombay) | Gujarati  | 
| Mumbai (Bombay) | Hindi     | 
| Mumbai (Bombay) | Kannada   | 
| Mumbai (Bombay) | Malajalam | 
| Mumbai (Bombay) | Marathi   | 
| Mumbai (Bombay) | Orija     | 
| Mumbai (Bombay) | Punjabi   | 
| Mumbai (Bombay) | Tamil     | 
| Mumbai (Bombay) | Telugu    | 
| Mumbai (Bombay) | Urdu      | 
+-----------------+-----------+ 
12 rows in set (0.04 sec)

The query takes 0.04 seconds. It’s not a lot, but consider that the database is very small. Use the BENCHMARK function to have more relevant response times if you like.

Let’s see the EXPLAIN:

mysql> explain select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000; 
+----+-------------+-----------------+------------+-------+---------------------+-------------+---------+-----------------------------------+------+----------+-------------+ 
| id | select_type | table           | partitions | type  | possible_keys       | key         | key_len | ref                               | rows | filtered | Extra       | 
+----+-------------+-----------------+------------+-------+---------------------+-------------+---------+-----------------------------------+------+----------+-------------+ 
| 1  | SIMPLE      | countrylanguage | NULL       | index | PRIMARY,CountryCode | CountryCode | 3       | NULL                              | 984  | 100.00   | Using index | 
| 1  | SIMPLE      | city            | NULL       | ref   | CountryCode         | CountryCode | 3       | world.countrylanguage.CountryCode | 18   | 33.33    | Using where | 
+----+-------------+-----------------+------------+-------+---------------------+-------------+---------+-----------------------------------+------+----------+-------------+

Indexes are used for both tables, and the estimated cartesian product has 984 * 18 = 17,712 rows.

Now generate the histogram on the Population column. It’s the only column used for filtering the data and it’s not indexed.

For that, we have to use the ANALYZE command:

mysql> ANALYZE TABLE city UPDATE HISTOGRAM ON population WITH 1024 BUCKETS; 
+------------+-----------+----------+-------------------------------------------------------+ 
| Table      | Op        | Msg_type | Msg_text                                              | 
+------------+-----------+----------+-------------------------------------------------------+ 
| world.city | histogram | status   | Histogram statistics created for column 'Population'. | 
+------------+-----------+----------+-------------------------------------------------------+

We have created a histogram using 1024 buckets. The number of buckets is not mandatory, and it can be any number from 1 to 1024. If omitted, the default value is 100.

The number of buckets affects the reliability of the statistics. The more distinct values you have, the more buckets you need.
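
A quick way to get a feeling for how many buckets you might need is to count the distinct values in the column first. For our example, something like this:

SELECT COUNT(DISTINCT population) AS distinct_values FROM city;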

Let’s have a look now at the execution plan and execute the query again.

mysql> explain select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000;
+----+-------------+-----------------+------------+------+---------------------+-------------+---------+------------------------+------+----------+-------------+
| id | select_type | table           | partitions | type | possible_keys       | key         | key_len | ref                    | rows | filtered | Extra       |
+----+-------------+-----------------+------------+------+---------------------+-------------+---------+------------------------+------+----------+-------------+
|  1 | SIMPLE      | city            | NULL       | ALL  | CountryCode         | NULL        | NULL    | NULL                   | 4188 |     0.06 | Using where |
|  1 | SIMPLE      | countrylanguage | NULL       | ref  | PRIMARY,CountryCode | CountryCode | 3       | world.city.CountryCode |  984 |   100.00 | Using index |
+----+-------------+-----------------+------------+------+---------------------+-------------+---------+------------------------+------+----------+-------------+

mysql> select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000;
+-----------------+-----------+
| name            | language  |
+-----------------+-----------+
| Mumbai (Bombay) | Asami     |
| Mumbai (Bombay) | Bengali   |
| Mumbai (Bombay) | Gujarati  |
| Mumbai (Bombay) | Hindi     |
| Mumbai (Bombay) | Kannada   |
| Mumbai (Bombay) | Malajalam |
| Mumbai (Bombay) | Marathi   |
| Mumbai (Bombay) | Orija     |
| Mumbai (Bombay) | Punjabi   |
| Mumbai (Bombay) | Tamil     |
| Mumbai (Bombay) | Telugu    |
| Mumbai (Bombay) | Urdu      |
+-----------------+-----------+
12 rows in set (0.00 sec)

The execution plan is different, and the query runs faster.

We can notice that the order of the tables is now reversed. Even though it requires a full scan, the city table is processed in the first stage. That’s because of the filtered value, which is only 0.06. It means that only 0.06% of the rows returned by the full scan will be used for the join with the following table. So, it’s only 4188 * 0.06% = 2.5 rows. In total, the estimated cartesian product is 2.5 * 984 = 2,460 rows. This is significantly lower than the previous execution and explains why the query is faster.

What we have seen sounds a little counterintuitive, doesn’t it? In fact, until MySQL 5.7, we were used to considering full scans as very bad in most cases. In our case, instead, the histogram statistics on a non-indexed column let the optimizer choose a plan with a full scan, and the query gets optimized. Awesome.

 

Where are the histogram statistics

Histogram statistics are stored in the column_statistics table in the data dictionary and are not directly accessible by users. Instead, the INFORMATION_SCHEMA.COLUMN_STATISTICS table, which is implemented as a view of the data dictionary, can be used for the same purpose.

Let’s see the statistics for our table.

mysql> SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM)  
    -> FROM information_schema.column_statistics  
    -> WHERE COLUMN_NAME = 'population'\G
*************************** 1. row ***************************
           SCHEMA_NAME: world
            TABLE_NAME: city
           COLUMN_NAME: Population
JSON_PRETTY(HISTOGRAM): {
  "buckets": [
    [
      42,
      455,
      0.000980632507967639,
      4
    ],
    [
      503,
      682,
      0.001961265015935278,
      4
    ],
    [
      700,
      1137,
      0.0029418975239029173,
      4
    ],
...
...
    [
      8591309,
      9604900,
      0.9990193674920324,
      4
    ],
    [
      9696300,
      10500000,
      1.0,
      4
    ]
  ],
  "data-type": "int",
  "null-values": 0.0,
  "collation-id": 8,
  "last-updated": "2019-10-14 22:24:58.232254",
  "sampling-rate": 1.0,
  "histogram-type": "equi-height",
  "number-of-buckets-specified": 1024
}

We can see for each bucket the minimum and maximum values, the cumulative frequency, and the number of distinct values. Also, we can see that MySQL decided to use an equi-height histogram.

Let’s try to generate a histogram on another table and column.

mysql> ANALYZE TABLE country UPDATE HISTOGRAM ON Region;
+---------------+-----------+----------+---------------------------------------------------+
| Table         | Op        | Msg_type | Msg_text                                          |
+---------------+-----------+----------+---------------------------------------------------+
| world.country | histogram | status   | Histogram statistics created for column 'Region'. |
+---------------+-----------+----------+---------------------------------------------------+
1 row in set (0.01 sec)

mysql> SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM)  FROM information_schema.column_statistics  WHERE COLUMN_NAME = 'Region'\G
*************************** 1. row ***************************
           SCHEMA_NAME: world
            TABLE_NAME: country
           COLUMN_NAME: Region
JSON_PRETTY(HISTOGRAM): {
  "buckets": [
    [
      "base64:type254:QW50YXJjdGljYQ==",
      0.02092050209205021
    ],
    [
      "base64:type254:QXVzdHJhbGlhIGFuZCBOZXcgWmVhbGFuZA==",
      0.04184100418410042
    ],
    [
      "base64:type254:QmFsdGljIENvdW50cmllcw==",
      0.05439330543933054
    ],
    [
      "base64:type254:QnJpdGlzaCBJc2xhbmRz",
      0.06276150627615062
    ],
    [
      "base64:type254:Q2FyaWJiZWFu",
      0.1631799163179916
    ],
    [
      "base64:type254:Q2VudHJhbCBBZnJpY2E=",
      0.20083682008368198
    ],
    [
      "base64:type254:Q2VudHJhbCBBbWVyaWNh",
      0.23430962343096232
    ],
    [
      "base64:type254:RWFzdGVybiBBZnJpY2E=",
      0.3179916317991631
    ],
    [
      "base64:type254:RWFzdGVybiBBc2lh",
      0.35146443514644343
    ],
    [
      "base64:type254:RWFzdGVybiBFdXJvcGU=",
      0.39330543933054385
    ],
    [
      "base64:type254:TWVsYW5lc2lh",
      0.41422594142259406
    ],
    [
      "base64:type254:TWljcm9uZXNpYQ==",
      0.44351464435146437
    ],
    [
      "base64:type254:TWljcm9uZXNpYS9DYXJpYmJlYW4=",
      0.4476987447698744
    ],
    [
      "base64:type254:TWlkZGxlIEVhc3Q=",
      0.5230125523012552
    ],
    [
      "base64:type254:Tm9yZGljIENvdW50cmllcw==",
      0.5523012552301255
    ],
    [
      "base64:type254:Tm9ydGggQW1lcmljYQ==",
      0.5732217573221757
    ],
    [
      "base64:type254:Tm9ydGhlcm4gQWZyaWNh",
      0.602510460251046
    ],
    [
      "base64:type254:UG9seW5lc2lh",
      0.6443514644351465
    ],
    [
      "base64:type254:U291dGggQW1lcmljYQ==",
      0.7029288702928871
    ],
    [
      "base64:type254:U291dGhlYXN0IEFzaWE=",
      0.7489539748953975
    ],
    [
      "base64:type254:U291dGhlcm4gQWZyaWNh",
      0.7698744769874477
    ],
    [
      "base64:type254:U291dGhlcm4gYW5kIENlbnRyYWwgQXNpYQ==",
      0.8284518828451883
    ],
    [
      "base64:type254:U291dGhlcm4gRXVyb3Bl",
      0.891213389121339
    ],
    [
      "base64:type254:V2VzdGVybiBBZnJpY2E=",
      0.9623430962343097
    ],
    [
      "base64:type254:V2VzdGVybiBFdXJvcGU=",
      1.0
    ]
  ],
  "data-type": "string",
  "null-values": 0.0,
  "collation-id": 8,
  "last-updated": "2019-10-14 22:29:13.418582",
  "sampling-rate": 1.0,
  "histogram-type": "singleton",
  "number-of-buckets-specified": 100
}

In this case, a singleton histogram was generated.
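
That is expected: MySQL builds a singleton histogram when the number of distinct values in the column fits into the requested number of buckets (the default of 100 here), and Region has only a few dozen distinct values. You can verify that quickly:

SELECT COUNT(DISTINCT Region) FROM country;
-- returns 25 on the world sample database, well below the 100 buckets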

Using the following query we can see more human-readable statistics.

mysql> SELECT SUBSTRING_INDEX(v, ':', -1) value, concat(round(c*100,1),'%') cumulfreq,         
    -> CONCAT(round((c - LAG(c, 1, 0) over()) * 100,1), '%') freq   
    -> FROM information_schema.column_statistics, JSON_TABLE(histogram->'$.buckets','$[*]' COLUMNS(v VARCHAR(60) PATH '$[0]', c double PATH '$[1]')) hist  
    -> WHERE schema_name  = 'world' and table_name = 'country' and column_name = 'region';
+---------------------------+-----------+-------+
| value                     | cumulfreq | freq  |
+---------------------------+-----------+-------+
| Antarctica                | 2.1%      | 2.1%  |
| Australia and New Zealand | 4.2%      | 2.1%  |
| Baltic Countries          | 5.4%      | 1.3%  |
| British Islands           | 6.3%      | 0.8%  |
| Caribbean                 | 16.3%     | 10.0% |
| Central Africa            | 20.1%     | 3.8%  |
| Central America           | 23.4%     | 3.3%  |
| Eastern Africa            | 31.8%     | 8.4%  |
| Eastern Asia              | 35.1%     | 3.3%  |
| Eastern Europe            | 39.3%     | 4.2%  |
| Melanesia                 | 41.4%     | 2.1%  |
| Micronesia                | 44.4%     | 2.9%  |
| Micronesia/Caribbean      | 44.8%     | 0.4%  |
| Middle East               | 52.3%     | 7.5%  |
| Nordic Countries          | 55.2%     | 2.9%  |
| North America             | 57.3%     | 2.1%  |
| Northern Africa           | 60.3%     | 2.9%  |
| Polynesia                 | 64.4%     | 4.2%  |
| South America             | 70.3%     | 5.9%  |
| Southeast Asia            | 74.9%     | 4.6%  |
| Southern Africa           | 77.0%     | 2.1%  |
| Southern and Central Asia | 82.8%     | 5.9%  |
| Southern Europe           | 89.1%     | 6.3%  |
| Western Africa            | 96.2%     | 7.1%  |
| Western Europe            | 100.0%    | 3.8%  |
+---------------------------+-----------+-------+

 

Histogram maintenance

Histogram statistics are not automatically recalculated. If you have a table that is very frequently updated with a lot of INSERTs, UPDATEs, and DELETEs, the statistics can go out of date very quickly. Unreliable histograms can lead the optimizer to make the wrong choice.

When you find that a histogram was useful to optimize a query, you also need a scheduled plan to refresh the statistics from time to time, in particular after massive modifications to the table.
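
One simple way to keep an eye on this is the last-updated timestamp stored with each histogram. As a sketch against the world schema used above:

SELECT TABLE_NAME, COLUMN_NAME, HISTOGRAM->>'$."last-updated"' AS last_updated
FROM information_schema.COLUMN_STATISTICS
WHERE SCHEMA_NAME = 'world';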

To refresh a histogram you just need to run the same ANALYZE command we have seen before.

To completely drop a histogram you may run the following:

ANALYZE TABLE city DROP HISTOGRAM ON population;

 

Sampling

The histogram_generation_max_mem_size system variable controls the maximum amount of memory available for histogram generation. The global and session values may be set at runtime.

If the estimated amount of data to be read into memory for histogram generation exceeds the limit defined by the variable, MySQL samples the data rather than reading all of it into memory. Sampling is evenly distributed over the entire table.

The default value is 20000000 (in bytes), but you can increase it for a large column if you want more accurate statistics. For very large columns, be careful not to raise the threshold beyond the memory available, in order to avoid excessive overhead or an outage.
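
As a sketch, you could raise the limit for the current session before regenerating the histogram on a large column, and then check the sampling-rate that was recorded (1.0 means no sampling was needed):

SET SESSION histogram_generation_max_mem_size = 50000000; -- roughly 50MB, adjust to your data
ANALYZE TABLE city UPDATE HISTOGRAM ON population WITH 1024 BUCKETS;
SELECT HISTOGRAM->>'$."sampling-rate"' AS sampling_rate
FROM information_schema.COLUMN_STATISTICS
WHERE SCHEMA_NAME = 'world' AND TABLE_NAME = 'city' AND COLUMN_NAME = 'population';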

 

Conclusion

Histogram statistics are particularly useful for non-indexed columns, as shown in the example.

Execution plans that can rely on indexes are usually the best, but histograms can help in some edge cases or when creating a new index is a bad idea.

Since this is not an automatic feature, some manual testing is required to investigate if you really can get the benefit of a histogram. Also, the maintenance requires some scheduled and manual activity.

Use histograms if you really need them, but don’t abuse them since histograms on very large tables can consume a lot of memory.

Usually, the best candidates for a histogram are the columns with:

  • values that do not change much over time
  • low cardinality values
  • uneven distribution

Install Percona Server 8.0, test and enjoy the histograms.

 

Further reading on the same topic: Billion Goods in Few Categories – How Histograms Save a Life?

 

Oct
29
2019
--

Datameer announces $40M investment as it pivots away from Hadoop roots

Datameer, the company that was born as a data prep startup on top of the open-source Hadoop project, announced a $40 million investment and a big pivot away from Hadoop, while staying true to its big data roots.

The investment was led by existing investor ST Telemedia. Existing investors Redpoint Ventures, Kleiner Perkins, Nextworld Capital, Citi Ventures and Top Tier Capital Partners also participated. Today’s investment brings the total raised to almost $140 million, according to Crunchbase data.

Company CEO Christian Rodatus says the company’s original mission was about making Hadoop easier to use for data scientists, business analysts and engineers. In the last year, the three biggest commercial Hadoop vendors — Cloudera, Hortonworks and MapR — fell on hard times. Cloudera and Hortonworks merged and MapR was sold to HPE in a fire sale.

Starting almost two years ago, Datameer recognized that against this backdrop, it was time for a change. It began developing a couple of new products. It didn’t want to abandon its existing customer base entirely, of course, so it began rebuilding its Hadoop product and is now calling it Datameer X. It is a modern cloud-native product built to run on Kubernetes, the popular open-source container orchestration tool. Instead of Hadoop, it will be based on Spark. He reports they are about two-thirds done with this pivot, but the product has been in the hands of customers.

The company also announced Neebo, an entirely new SaaS tool to give data scientists the ability to process data in whatever form it takes. Rodatus sees a world coming where data will take many forms, from traditional data to Python code from data analysts or data scientists to SaaS vendor dashboards. He sees Neebo bringing all of this together in a managed service with the hope that it will free data scientists to concentrate on getting insight from the data. It will work with data visualization tools like Tableau and Looker, and should be generally available in the coming weeks.

The money should help them get through this pivot, hire more engineers to continue the process and build a go-to-market team for the new products. It’s never easy pivoting like this, but the investors are likely hoping that the company can build on its existing customer base, while taking advantage of the market need for data science processing tools. Time will tell if it works.

Oct
24
2019
--

Grafana Labs nabs $24M Series A for open source-based data analytics stack

Grafana Labs, the commercial company built to support the open-source Grafana project, announced a healthy $24 million Series A investment today. Lightspeed Venture Partners led the round with participation from Lead Edge Capital.

Company CEO and co-founder Raj Dutt says the startup started life as a way to offer a commercial layer on top of the open-source Grafana tool, but it has expanded and now supports other projects, including Loki, an open-source monitoring tool not unlike Prometheus, which the company developed last year.

All of this in the service of connecting to data sources and monitoring data. “Grafana has always been about connecting data together no matter where it lives, whether it’s in a proprietary database, on-prem database or cloud database. There are over 42 data sources that Grafana connects together,” Dutt explained.

But the company has expanded far beyond that. As it describes the product set, “Our products have begun to evolve to unify into a single offering: the world’s first composable open-source observability platform for metrics, logs and traces. Centered around Grafana.” This is exactly where other monitoring and logging tools like Elastic, New Relic and Splunk have been heading this year. “Observability” is the term that has often been used to describe these combined capabilities of metrics, logging and tracing.

Grafana Labs is the commercial arm of the open-source projects, and offers a couple of products built on top of these tools. First of all it has Grafana Enterprise, a package that includes enterprise-focused data connectors, enhanced authentication and security and enterprise-class support over and above what the open-source Grafana tool offers.

The company also offers a SaaS version of the Grafana tool stack, which is fully managed and takes away a bunch of the headaches of trying to download raw open-source code, install it, manage it and deal with updates and patches. In the SaaS version, all of that is taken care of for the customer for a monthly fee.

Dutt says the startup took just $4 million in external investment over the first five years, and has been able to build a business with 100 employees and 500 customers. He is particularly proud of the fact that the company is cash flow break-even at this point.

Grafana Labs decided the time was right to take this hefty investment and accelerate the startup’s growth, something they couldn’t really do without a big cash infusion. “We’ve seen this really virtuous cycle going with value creation in the community through these open-source projects that builds mind share, and that can translate into building a sustainable business. So we really want to accelerate that, and that’s the main reason behind the raise.”

Oct
22
2019
--

Databricks announces $400M round on $6.2B valuation as analytics platform continues to grow

Databricks is a SaaS business built on top of a bunch of open-source tools, and apparently it’s been going pretty well on the business side of things. In fact, the company claims to be one of the fastest growing enterprise cloud companies ever. Today the company announced a massive $400 million Series F funding round on a hefty $6.2 billion valuation. Today’s funding brings the total raised to almost $900 million.

Andreessen Horowitz’s Late Stage Venture Fund led the round with new investors BlackRock, Inc., T. Rowe Price Associates, Inc. and Tiger Global Management also participating. The institutional investors are particularly interesting here because as a late-stage startup, Databricks likely has its eye on a future IPO, and having those investors on board already could give them a head start.

CEO Ali Ghodsi was coy when it came to the IPO, but it sure sounded like that’s a direction he wants to go. “We are one of the fastest growing cloud enterprise software companies on record, which means we have a lot of access to capital as this fundraise shows. The revenue is growing gangbusters, and the brand is also really well known. So an IPO is not something that we’re optimizing for, but it’s something that’s definitely going to happen down the line in the not-too-distant future,” Ghodsi told TechCrunch.

The company announced as of Q3 it’s on a $200 million run rate, and it has a platform that consists of four products, all built on foundational open source: Delta Lake, an open-source data lake product; MLflow, an open-source project that helps data teams operationalize machine learning; Koalas, which creates a single-machine framework for Spark and pandas, greatly simplifying working with the two tools; and, finally, Spark, the open-source analytics engine.

You can download the open-source version of all of these tools for free, but they are not easy to use or manage. The way that Databricks makes money is by offering each of these tools in the form of Software as a Service. They handle all of the management headaches associated with using these tools and they charge you a subscription price.

It’s a model that seems to be working, as the company is growing like crazy. It raised $250 million just last February on a $2.75 billion valuation. Apparently the investors saw room for a lot more growth in the intervening six months, as today’s $6.2 billion valuation shows.

Oct
08
2019
--

Blog Poll: Adding/Upgrading Instances, Hardware, and Migration

Time for a new question in our blog poll series! This time, it’s about adding or upgrading to meet database needs.  Here’s the question: In the last 24 months, how often have you added or upgraded database instances, added hardware to existing servers, or migrated to a new hosting/cloud provider?

Last year, we asked you a few questions in a blog poll and we received a great amount of feedback. We wanted to follow up on some of those same survey questions to see what may have changed. We’d love to hear from you!

This poll will be up for one month and will be maintained over in the sidebar should you wish to come back at a later date and take part. We look forward to seeing your responses!

 

Oct
04
2019
--

Percona XtraDB Cluster 8.0 (experimental release) : SST Improvements

Starting with the experimental release of Percona XtraDB Cluster 8.0, we have made changes to the SST process to make the process more robust and easier to use.

  • mysqldump and rsync are no longer supported SST methods.

    Support for mysqldump was deprecated starting with PXC 5.7 and has now been completely removed.

    MySQL 8.0 introduced a new Redo Log format that limited the use of rsync while upgrading from PXC 5.7 to 8.0. In addition, the new Galera-4 also introduced changes that further limit the use of rsync.

    The only supported SST method is xtrabackup-v2 (a quick way to verify this on a running node is shown after this list).

  • A separate Percona XtraBackup installation is no longer required.

    The required Percona XtraBackup (PXB) binaries are now shipped as part of PXC 8.0, but they are not installed for general use. So if you want to use PXB outside of an SST, you will have to install PXB separately.

  • SST logging now uses MySQL error logging

    Previously, the SST script would write directly to the error log file. Now, the SST script uses MySQL error logging. A side effect of this change is that the SST logs are not immediately visible. This is due to the logging subsystem being initialized after the SST has completed.

  • The wsrep_sst_auth variable has been removed.

    PXC 8.0 now creates an internal user (mysql.pxc.sst.user) with a random password for use by PXB to take the backup. The cleartext of the password is not saved and the user is deleted after the SST has completed.

    (This feature is still in development and may change before PXC 8.0 GA)

  • PXC SST auto-upgrade

    When PXC 8.0 detects that the SST came from a lower version, mysql_upgrade is automatically invoked. Also “RESET SLAVE ALL” is run on the new node if needed. This is invoked when receiving an SST from PXC 5.7 and PXC 8.0.

    (This feature is still in development and may change before PXC 8.0 GA)
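
As a quick sanity check on a running node (a sketch; the variable name is unchanged from earlier PXC releases), you can confirm which SST method is configured:

SHOW GLOBAL VARIABLES LIKE 'wsrep_sst_method';
-- expected value: xtrabackup-v2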

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Sep
30
2019
--

Confluent adds free tier to Kafka real-time streaming data cloud service

When Confluent launched a cloud service in 2017, it was trying to reduce some of the complexity related to running a Kafka streaming data application. Today, it introduced a free tier to that cloud service. The company hopes to expand its market beyond large technology company customers, and the free tier should make it easier for smaller companies to get started.

The new tier provides up to $50 of service a month for up to three months. Company CEO Jay Kreps says that while $50 might not sound like much, it’s actually hundreds of gigabytes of throughput and makes it easy to get started with the tool.

“We felt like we can make this technology really accessible. We can make it as easy as we can. We want to make it something where you can just get going in seconds, and not have to pay anything to start building an application that uses real-time streams of data,” Kreps said.

Kafka has been available as an open-source product since 2011, so it’s been free to download, install and build applications, but still required a ton of compute and engineering resources to pull off. The cloud service was designed to simplify that, and the free tier lets developers get comfortable building a small application without making a large financial investment.

Once they get used to working with Kafka on the free version, users can then buy in whatever increments make sense for them, and only pay for what they use. It can be pennies’ worth of Kafka or hundreds of dollars, depending on a customer’s individual requirements. “After free, you can buy 11 cents’ worth of Kafka or you can buy it $10 worth, all the way up to these massive users like Lyft that use Confluent Cloud at huge scale as part of their ridesharing service,” he said.

While a free SaaS trial might feel like a common kind of marketing approach, Kreps says for a service like Kafka, it’s actually much more difficult to pull off. “With something like a distributed system where you get a whole chunk of infrastructure, it’s actually technically an extraordinarily difficult thing to provide zero to elastic scale up capabilities. And a huge amount of engineering goes into making that possible,” Kreps explained.

Kafka processes massive streams of data in real time. It was originally developed inside LinkedIn and open-sourced in 2011. Confluent launched as a commercial entity on top of the open-source project in 2014. In January the company raised $125 million on a $2.5 billion valuation. It has raised more than $205 million, according to Crunchbase data.
