Jan
21
2020
--

Snyk snags $150M investment as its valuation surpasses $1B

Snyk, the company that wants to help developers secure their code as part of the development process, announced a $150 million investment today. The company indicated the investment brings its valuation to more than $1 billion (although it did not share the exact figure).

Today’s round was led by Stripes, a New York City investment firm, with help from Coatue, Tiger Global, BoldStart, Trend Forward, Amity and Salesforce Ventures. The company reports it has now raised more than $250 million.

The idea behind Snyk is to fit security firmly in the development process. Rather than offloading it to a separate team, something that can slow down a continuous development environment, Snyk builds in security as part of the code commit.

The company offers an open-source tool that helps developers find open-source vulnerabilities when they commit their code to GitHub, Bitbucket, GitLab or any CI/CD tool. It has built up a community of more than 400,000 developers with this approach.

Snyk makes money with a container security product, and by offering companies commercial access to the vulnerability database that underpins its open-source product.

CEO Peter McKay, who came on board last year as the company was making a move to expand into the enterprise, says the open-source product drives the revenue-producing products and helped attract this kind of investment. “Getting to [today’s] funding round was the momentum in the open source model from the community to freemium to [land] and expand — and that’s where we are today,” he told TechCrunch.

He said the company wasn’t looking for this money, but investors came knocking and gave them a good offer, based on Snyk’s growing market momentum. “Investors said we want to take advantage of the market, and we want to make sure you can invest the way you want to invest and take advantage of what we all believe is this very large opportunity,” McKay said.

In fact, the company has been raising money at a rapid clip since it came out of the gate in 2016 with a $3 million seed round. A $7 million Series A and $22 million Series B followed in 2018, with a $70 million Series C last fall.

The company reports over 4X revenue growth in 2019 (without giving exact revenue figures), and some major customer wins, including the likes of Google, Intuit, Nordstrom and Salesforce. It’s worth noting that Salesforce thought enough of the company that it also invested in this round through its Salesforce Ventures investment arm.

Jan
15
2020
--

Kadena fulfills hybrid blockchain vision with launch of public chain

For the last few years, blockchain startup Kadena has been working on a vision of bringing blockchain to the enterprise. Today it announced the final piece of that vision with the launch of the Kadena public blockchain.

In earlier releases, the company offered the ability to build private blockchains on AWS or Azure. Company co-founder and CEO Will Martino says the public network brings together for the first time public and private chains in a hybrid vision.

“The big exciting thing is that the public chain is out, smart contracts are about to turn on, and that allows us to then go and hit the market with what we’re calling these hybrid applications. These are applications that run both on a private blockchain, but have public smart contracts that allow people on the public side to interact with the private chain,” Martino explained.

The smart contracts are a set of rules that must be met and validated for the private and public chains to interact. Only valid actors and actions, as defined in the smart contract, will be allowed to move between the two chains.

Overcoming scaling issues

One of the major challenges with building a chain like this has been scaling it to meet the needs of enterprise users. Martino says that his company has solved this problem and can scale from the 10 chains today to 10,000 or more in the future as the company grows. He further claims that his company is the only one with a tractable roadmap capable of achieving this.

Martino says this could help push companies who have been dabbling in blockchain technology in the last couple of years to take a bigger leap. “This is a watershed moment for enterprises. Up until now, they’ve never had a platform that they could go and use on a public blockchain platform and know that it’s going to have the throughput they need if the product they deployed on that blockchain has legs and starts to take off.” Martino says this blockchain has that.

Kadena public blockchain in action

Kadena has also developed an open-source smart contract language called Pact that Martino says allows a lawyer with Excel-level programming understanding to write these contracts and place them on the chain.

“There are a lot of lawyers who are good with Excel, so you can actually hand the smart contracts to a lawyer and have them review them for compliance. And that’s a crazy idea, but we think it’s fundamental because when you’re representing core business workflows that are sensitive, you need to be absolutely certain they are compliant.”

Show me the money

The company is making all of the basic pieces available for free. That includes the private chain development tools on AWS and Azure, the public chain released today, along with the Pact smart contract language.

Martino says there are a couple of ways for the business to make money. For starters, it’s building partnerships where it helps companies in various sectors, from financial services to insurance and healthcare, build viable hybrid applications on the Kadena blockchain. When they make money, so will Kadena.

Secondly, they control a bushel of tokens on their public network, which have value, and if the vision comes to fruition, will have much more over time. They will be able to sell some of these tokens on the public market and make money. Right now he says the tokens have a value of between 20 cents and a dollar, but he expects that to increase as the network becomes more viable.

The blockchain has lost some of its luster as it has moved through the enterprise hype cycle in recent years, but if Kadena can succeed in building a fully decentralized, scalable blockchain, it could help push the technology deeper into the enterprise.

Dec
10
2019
--

Microsoft announces public preview of Microsoft Teams for Linux

Today, Microsoft announced a public preview of Microsoft Teams for Linux, the first Office 365 tool that’s available for the open-source operating system.

The hope is that by making it available for preview, the company can get feedback from the community and improve it before it becomes generally available. “Starting today, Microsoft Teams is available for Linux users in public preview, enabling high quality collaboration experiences for the open source community at work and in educational institutions,” the company wrote in the blog post announcing the release.

The goal here ultimately is to help get Teams into the hands of more customers by expanding the platforms it runs on. “Most of our customers have devices running on a variety of different platforms such as Windows 10, Linux and others. We are committed to supporting mixed environments across our cloud and productivity offerings, and with this announcement, we are pleased to extend the Teams experience to Linux users,” the company wrote in the blog post.

This announcement is significant for a couple of reasons. For starters, Microsoft has had a complicated history with Linux and open source, although in recent years, under Satya Nadella, it has embraced open source. This shows that Microsoft is willing to put its tools wherever customers need them, regardless of the platform or operating system.

Secondly, as it marks the first Office 365 app on Linux, if there is positive feedback, it could open the door for more apps on the platform down the road.

The announcement also comes against the backdrop of the company’s ongoing battles with Slack for enterprise collaboration platform users. In July, Microsoft announced 13 million daily active users on Teams. Meanwhile, Slack has 12 million DAUs. It’s worth noting that Slack has been available on Linux for almost two years.

Nov
22
2019
--

Making sense of a multi-cloud, hybrid world at KubeCon

More than 12,000 attendees gathered this week in San Diego to discuss all things containers, Kubernetes and cloud-native at KubeCon.

Kubernetes, the container orchestration tool, turned five this year, and the technology appears to be reaching a maturity phase where it accelerates beyond early adopters to reach a more mainstream group of larger business users.

That’s not to say that there isn’t plenty of work to be done, or that most enterprise companies have completely bought in, but it’s clearly reached a point where containerization is on the table. If you think about it, the whole cloud-native ethos makes sense for the current state of computing and how large companies tend to operate.

If this week’s conference showed us anything, it’s an acknowledgment that it’s a multi-cloud, hybrid world. That means most companies are working with multiple public cloud vendors, while managing a hybrid environment that includes those vendors — as well as existing legacy tools that are probably still on-premises — and they want a single way to manage all of this.

The promise of Kubernetes and cloud-native technologies, in general, is that it gives these companies a way to thread this particular needle, or at least that’s the theory.

Kubernetes to the rescue

Photo: Ron Miller/TechCrunch

If you were to look at the Kubernetes hype cycle, we are probably right about at the peak where many think Kubernetes can solve every computing problem they might have. That’s probably asking too much, but cloud-native approaches have a lot of promise.

Craig McLuckie, VP of R&D for cloud-native apps at VMware, was one of the original developers of Kubernetes at Google in 2014. VMware thought enough of the importance of cloud-native technologies that it bought his former company, Heptio, for $550 million last year.

As we head into this phase of pushing Kubernetes and related tech into larger companies, McLuckie acknowledges it creates a set of new challenges. “We are at this crossing the chasm moment where you look at the way the world is — and you look at the opportunity of what the world might become — and a big part of what motivated me to join VMware is that it’s successfully proven its ability to help enterprise organizations navigate their way through these disruptive changes,” McLuckie told TechCrunch.

He says that Kubernetes does actually solve this fundamental management problem companies face in this multi-cloud, hybrid world. “At the end of the day, Kubernetes is an abstraction. It’s just a way of organizing your infrastructure and making it accessible to the people that need to consume it.

“And I think it’s a fundamentally better abstraction than we have access to today. It has some very nice properties. It is pretty consistent in every environment that you might want to operate, so it really makes your on-prem software feel like it’s operating in the public cloud,” he explained.

Simplifying a complex world

One of the reasons Kubernetes and cloud-native technologies are gaining in popularity is because the technology allows companies to think about hardware differently. There is a big difference between virtual machines and containers, says Joe Fernandes, VP of product for Red Hat cloud platform.

“Sometimes people conflate containers as another form of virtualization, but with virtualization, you’re virtualizing hardware, and the virtual machines that you’re creating are like an actual machine with its own operating system. With containers, you’re virtualizing the process,” he said.

He said this means a container isn’t coupled to the hardware. The only thing it needs is the ability to run Linux, and Linux runs everywhere, which explains why containers are easier to manage across different types of infrastructure. “It’s more efficient, more affordable, and ultimately, cloud-native allows folks to drive more automation,” he said.

Bringing it into the enterprise

Photo: Ron Miller/TechCrunch

It’s one thing to convince early adopters to change the way they work, but it’s another matter entirely as this technology enters the mainstream. Gabe Monroy, partner program manager at Microsoft, says that to carry this technology to the next level, we have to change the way we talk about it.

Nov
14
2019
--

Blog Poll: Who Chooses Your Database Technology?

Who Chooses Database Tech?

It’s that time again – a new blog poll! This time around, we’re interested in hearing who chooses your database technology. Is it DBAs? Management? Here’s the question: Who Chooses the Database Technology for a New Application in Your Organization?

Last year, we asked you a few questions in a blog poll and we received a great amount of feedback. We wanted to follow up on some of those same survey questions to see what may have changed. We’d love to hear from you!


This poll will be up for one month and will be maintained over in the sidebar should you wish to come back at a later date and take part. We look forward to seeing your responses!

 

Nov
13
2019
--

AWS, Salesforce join forces with Linux Foundation on Cloud Information Model

Last year, Adobe, SAP and Microsoft came together and formed the Open Data Initiative. Not to be outdone, this week, AWS, Salesforce and Genesys, in partnership with The Linux Foundation, announced the Cloud Information Model.

The two competing data models have a lot in common. They are both about bringing together data and applying a common open model to it. The idea is to allow for data interoperability across products in the partnership without a lot of heavy lifting, a common problem for users of these big companies’ software.

Jim Zemlin, executive director at The Linux Foundation, says this project provides a neutral home for the Cloud Information model, where a community can work on the problem. “This allows for anyone across the community to collaborate and provide contributions under a central governance model. It paves the way for full community-wide engagement in data interoperability efforts and standards development, while rapidly increasing adoption rate of the community,” Zemlin explained in a statement.

Each of the companies in the initial partnership is using the model in different ways. AWS will use it in conjunction with its AWS Lake Formation tool to help customers move, catalog, store and clean data from a variety of data sources, while Genesys customers can use its cloud and AI products to communicate across a variety of channels.

Patrick Stokes from Salesforce says his company is using the Cloud Information Model as the underlying data model for his company’s Customer 360 platform of products. “We’re super excited to announce that we’ve joined together with a few partners — AWS, Genesys and The Linux Foundation — to actually open-source that data model,” Stokes told TechCrunch.

Of course, now we have two competing “open” data models, and that is going to create friction until the two projects find a way to come together. The fact is that many companies use tools from each of these vendors, and if these competing approaches persist, it will defeat the purpose of creating the initiatives in the first place.

As Satya Nadella said in 2015, “It is incumbent upon us, especially those of us who are platform vendors to partner broadly to solve real pain points our customers have.” If that’s the case, having competing models is not really achieving that.

Nov
05
2019
--

Chronosphere launches with $11M Series A to build scalable, cloud-native monitoring tool

Chronosphere, a startup from two ex-Uber engineers who helped create the open-source M3 monitoring project to handle Uber-level scale, officially launched today with the goal of building a commercial company on top of the open-source project.

It also announced an $11 million investment led by Greylock, with participation from venture capitalist Lee Fixel.

While the founders, CEO Martin Mao and CTO Rob Skillington, were working at Uber, they recognized a gap in the monitoring industry, particularly around cloud-native technologies like containers and microservices. There weren’t any tools available on the market that could handle Uber’s scaling requirements — so like any good engineers, they went out and built their own.

“We looked around at the market at the time and couldn’t find anything in open source or commercially available that could really scale to our needs. So we ended up building and open sourcing our solution, which is M3. Over the last three to four years we’ve scaled M3 to one of the largest production monitoring systems in the world today,” Mao explained.

The essential difference between M3 and other open-source, cloud-native monitoring solutions like Prometheus is the ability to scale, he says.

One of the main reasons they left to start a company, with the blessing of Uber, was that the community began asking for features that didn’t really make sense for Uber. By launching Chronosphere, Mao and Skillington would be taking on the management of the project moving forward (although sharing governance for the time being with Uber), while building those enterprise features the community has been requesting.

The new company’s first product will be a cloud version of M3 to help reduce some of the complexity associated with managing an M3 project. “M3 itself is a fairly complex piece of technology to run. It is solving a fairly complex problem at large scale, and running it actually requires a decent amount of investment to run at large scale, so the first thing we’re doing is taking care of that management,” Mao said.

Jerry Chen, who led the investment at Greylock, saw a company solving a big problem. “They were providing such a high-resolution view of what’s going on in your cloud infrastructure and doing that at scale at a cost that actually makes sense. They solved that problem at Uber, and I saw them, and I was like wow, the rest of the market needs what [you] guys built and I wrote the Series A check. It was as simple as that,” Chen told TechCrunch.

The cloud product is currently in private beta; they expect to open to public beta early next year.

Nov
04
2019
--

Robocorp announces $5.6M seed to bring open-source option to RPA

Robotic Process Automation (RPA) has been a hot commodity in recent years as it helps automate tedious manual workflows inside large organizations. Robocorp, a San Francisco startup, wants to bring open source and RPA together. Today it announced a $5.6 million seed investment.

Benchmark led the round, with participation from Slow Ventures, firstminute Capital, Bret Taylor (president and chief product officer at Salesforce) and Docker CEO Rob Bearden. In addition, Benchmark’s Peter Fenton will be joining the company’s board.

Robocorp co-founder and CEO Antti Karjalainen has been around open-source projects for years, and he saw an enterprise software category that was lacking in open-source options. “We actually have a unique angle on RPA, where we are introducing open source and cloud native technology into the market and focusing on developer-led technologies,” Karjalainen said.

He sees a market that’s top-down and focused on heavy sales cycles. He wants to bring the focus back to the developers who will be using the tools. “We are all about removing friction from developers. So, we are focused on giving developers tools that they like to use, and want to use for RPA, and doing it in an open-source model where the tools themselves are free to use,” he said.

The company is built on the open-source Robot Framework project, which was originally developed as an open-source software testing framework. Karjalainen sees RPA having a lot in common with testing, and his team has been able to take the project and apply it to RPA.

If you’re wondering how the company will make money, it is offering a cloud service that further reduces the complexity of using the open-source tools, and that includes the kinds of features enterprises tend to demand from these projects, like security, identity and access management, and so forth.

Benchmark’s Peter Fenton, who has invested in several successful open-source startups, including JBoss, SpringSource and Elastic, sees RPA as an area that’s ripe for a developer-focused open-source option. “We’re living in the era of the developer, where cloud-native and open source provide the freedom to innovate without constraint. Robocorp’s RPA approach provides developers the cloud native, open-source tools to bring RPA into their organizations without the burdensome constraints of existing offerings,” Fenton said.

The company intends to use the money to add new employees and continue scaling the cloud product, while working to build the underlying open-source community.

While UiPath, a fast-growing startup with a hefty $7.1 billion valuation, recently announced it was laying off 400 people, Gartner published a study in June showing that RPA is the fastest-growing enterprise software category.

Oct
30
2019
--

Spooktacularly Scary Database Stories


The nights are lengthening and the spookiest day of the year is nearly upon us, Halloween! In the spirit of the holiday, we asked our team to share their scariest tales of database dread, blood-curdling BIOS failures, and dastardly data destruction, and some of the responses are downright chilling.

Grab some candy and check out the stories that Perconians are too afraid to talk about after dark!

Brian Walters, Director of Solution Engineering:

Rookie DBA at a company with a shoestring budget = dev, test, and prod on the same server. What could possibly go wrong?

So, I’m about two years into my career as a DBA and part of my job is maintaining the database for the company MRP system. This system is critical, without it the whole plant shuts down.

During the implementation of the MRP system, the company had fallen into the bad practice of using the production database server for development and testing purposes as well. I’d made several requests for dedicated dev/test hardware, but this just didn’t seem to be a priority for those controlling the budget.

My days usually started with the same few tasks: checking the backups from the night before, deleting the prior day’s testing database, and creating a new testing environment by restoring last night’s backup. I had my routine pretty tight, most of it was scripted. All I had to do was change an environment variable and run the daily scripts.

This all worked fine until one day… that morning, I was just a little more tired than normal. I logged into the database server and thought that I had set the environment variables so they pointed to the testing database. By some mistaken force of habit, mental lapse, or temporary spooky hallucination, I had actually, accidentally set the environment variables to prod… and then I ran the delete-database scripts.

Somehow, I realized my mistake almost before the enter-key was fully depressed. But by that time it was already too late. It took less than ten seconds for the first phone call to arrive. Naturally, I sent it to voicemail, along with the next three phone calls that immediately followed. My next instinct was to press the Send All Calls button. Head in my hands, fully realizing what had just happened, I struggled to bring myself to understand how I did it.

After a quick conversation with my boss, who also happened to be the CFO, my schedule was cleared for the day. The remainder of the morning was spent practicing my Point in Time Recovery skills. To this day, I am grateful that I had a solid and tested backup and restore plan. I went home early that day. And that was the last time I made a mistake like that. We purchased dedicated dev/test hardware about a month later.

Robert Bernier, PostgreSQL Consultant

I was working as a newly hired Senior DBA when two developers suddenly appeared at my desk in a very agitated and excited state of mind. They were upset because they had accidentally pushed OS updates to embedded devices that were installed on many of our client devices. These premature updates effectively bricked those devices, which numbered in the tens of thousands.

I suppose I shouldn’t have been surprised, but they actually “demanded” that I execute a rollback at once. Needless to say, being a new hire, there were a lot of things I still didn’t know, and I was anxious to avoid making the situation worse. So I invited them to find a couple of chairs, sit next to me and, taking their time, “explain” who they were and what they did at the company. Eventually, I got around to the events leading up to the incident. Slowing them down was my primary concern, as the rollback’s success hinged on their state of mind. In time, they were able to demonstrate the issue and its resolution. Within a couple of hours, we staged the rollback across the affected devices and unbricked them.

In retrospect it could have been worse as the data-store I was managing held several hundred TB representing 600,000 embedded devices.

The moral of the story, heh heh… always know where to find your towel.

 

Audrey Swagerty, Customer Success Manager

This is not a tech story but a real nightmare.

When I started as a CSR, 3.5 years ago, I was not familiar with our industry and the technologies, so it was quite the challenge… for some time (I guess until I got more comfortable with my new job), I used to have a recurring nightmare. I was being chased through the woods by a girl… and she would finally catch up with me, grab me, and I would ask her name (don’t ask me why… instead of asking her not to kill me, right?)… And she would say: I am MongoDB!!! Then, I would wake up!

I have not thought about that story in a long time (and have not had the same nightmare ever since), so it was fun to share it again with you! Hopefully, it won’t happen again this weekend… I have been trying to help a customer with Kubernetes questions, so you never know!

 

Marcos Albe, Principal Technical Services Engineer

The worst horror story is a hospital introducing inconsistencies into a database… I always feared someone with asthma would end up as an amputee due to broken data!

 

Patrick Birch, Senior Technical Writer

While I was a SQL Server DBA, my CTO allowed the report writers to access the production database. The report writers were very nice people but were lacking in certain qualities, such as the ability to write SQL code. I wrote their queries and asked them to run their requirements through me. Well, I went to lunch one day, and when I returned the CTO and other managers were running around. One of the report writers had written a cartesian join (cross join), and it was slowing the production database to a crawl!

I killed the connection and had a long talk with the report writers. The managers approved my building a data warehouse the next day.

 

Martin James, Vice President of Sales EMEA & APAC

At my last company, I started to look at the scary reality of data security and healthcare. As mists and mellow fruitfulness provide the perfect backdrop to the spooky traditions of Halloween, ghostly goings-on were being uncovered in unexpected areas. Not in gloomy churchyards nor crumbling manor houses, but in the databases of general practitioners in England.

In 2016, the data of 57 million patients were held within general practitioners’ records. However, census data at the time suggested this should have stood at only 54 million. So who are these extra 3 million people? These records belong to ‘ghost patients’: patients who have died or emigrated, or whose records are duplications or inaccuracies in record keeping. Either way, it has an impact on surgeries, on funding, and on the services provided, adding unnecessary pressure to the NHS and leaving it unable to provide a precise picture of its patients and the care it provides.

So, NHS England began a ghost hunt to track down and identify the owners of these records so they can update their data and save money. If a patient hasn’t seen their GP for five years, they’ll be sent a letter requesting them to respond. If they don’t respond, they’ll be sent a second letter, after which they’ll be removed from the patient register. This could be a measure that makes an instant saving, as, according to the BBC, family doctors are paid for every patient registered on their list (ghost or not) and the Times quantifies this at around £400 million a year.

Taking a deeper dive into patient data and the connections within it could be of great benefit to the health service. It would enable a tighter hold on data and drive compliance, as well as help the NHS improve precision and accuracy in its records. Its data will become an asset, and a means of national health intelligence. A utopian view? Perhaps – but without the need for ghostbusters!

What’s your scariest database story?  Let us know in the comments or reply to us on social so we can be on the lookout for database ghouls and goblins! And to learn how Percona’s experts can take the scary out of database management, check out our open source database support, managed services, and consulting services.

Oct
29
2019
--

Column Histograms on Percona Server and MySQL 8.0


From time to time you may have experienced that MySQL was not able to find the best execution plan for a query. You felt the query should have been faster. You felt that something didn’t work, but you didn’t realize exactly what.

Maybe some of you did tests and discovered there was a better execution plan that MySQL wasn’t able to find (forcing the order of the tables with STRAIGHT_JOIN for example).

In this article, we’ll look at an interesting new feature available in MySQL 8.0 as well as Percona Server for MySQL 8.0: histogram-based statistics.

Today, we’ll see what a histogram is, how you can create and manage it, and how MySQL’s optimizer can use it.

Just for completeness, histogram statistics have been available on MariaDB since version 10.0.2, with a slightly different implementation. Anyway, what we’ll see here is related to Percona Server and MySQL 8.0 only.

 

What is a histogram

We can define a histogram as a good approximation of the data distribution of the values in a column.

Histogram-based statistics were introduced to give the optimizer more information for choosing an execution plan. Until then, in some cases, the optimizer was not able to find the best possible execution plan because non-indexed columns were ignored.

With histogram statistics, the optimizer has more options because non-indexed columns can also be considered. In some specific cases, a query can run faster than usual.

Let’s consider the following table that stores the departure times of trains:

CREATE TABLE train_schedule(
id INT PRIMARY KEY,
train_code VARCHAR(10),
departure_station VARCHAR(100),
departure_time TIME);

We can assume that during peak hours, from 7 AM until 9 AM, there are more rows, and during the night hours we have very few rows.
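
To make that assumption concrete, here is a minimal, purely illustrative batch of rows (the train codes and station names are invented for this sketch): most departures fall in the morning peak and only one is overnight.

-- Illustrative rows only: train codes and stations are made up for this example.
INSERT INTO train_schedule VALUES
(1, 'IC101', 'Roma Termini', '07:35:00'),
(2, 'IC102', 'Roma Termini', '07:50:00'),
(3, 'IC103', 'Milano Centrale', '08:05:00'),
(4, 'IC104', 'Milano Centrale', '08:20:00'),
(5, 'IC105', 'Torino Porta Nuova', '08:45:00'),
(6, 'NT201', 'Roma Termini', '02:10:00');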

Let’s take a look at the following two queries:

SELECT * FROM train_schedule WHERE departure_time BETWEEN '07:30:00' AND '09:15:00';
SELECT * FROM train_schedule WHERE departure_time BETWEEN '01:00:00' AND '03:00:00';

Without any kind of statistics, the optimizer assumes by default that the values in the departure_time column are evenly distributed, but they aren’t. In fact, the first query returns far more rows because of the peak-hour skew, something the optimizer cannot know without statistics.

Histograms were invented to give the optimizer a good estimate of the number of rows returned. This seems trivial for the simple queries we have seen so far. But now think about the same table being involved in JOINs with other tables. In such a case, the number of rows returned can be very important for the optimizer to decide the order in which to consider the tables in the execution plan.

A good estimate of the rows returned gives the optimizer the ability to put a table early in the join order when it returns few rows. This minimizes the total number of rows in the final cartesian product, and the query can then run faster.

MySQL supports two different types of histograms: “singleton” and “equi-height”. Common to all histogram types is that they split the data set into a set of “buckets”; MySQL automatically divides the values into the buckets and also automatically decides which type of histogram to create.

Singleton histogram

  • one value per bucket
  • each bucket stores
    • value
    • cumulative frequency
  • well suited for equality and range conditions

Equi-height histogram

  • multiple values per bucket
  • each bucket stores
    • minimum value
    • maximum value
    • cumulative frequency
    • number of distinct values
  • not really equi-height: frequent values are in separate buckets
  • well suited for range conditions

How to use histograms

The histogram feature is available and enabled on the server, but the optimizer cannot use it until histograms are explicitly created. Without an explicit creation, the optimizer works the same as usual and gets no benefit from histogram-based statistics.

There is some manual operation to do. Let’s see.

In the next examples, we’ll use the world sample database you can download from here: https://dev.mysql.com/doc/index-other.html

Let’s start by executing a query that joins two tables to find all the languages spoken in the world’s largest cities, those with more than 10 million people.

mysql> select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000; 
+-----------------+-----------+ 
| name            | language  | 
+-----------------+-----------+ 
| Mumbai (Bombay) | Asami     | 
| Mumbai (Bombay) | Bengali   | 
| Mumbai (Bombay) | Gujarati  | 
| Mumbai (Bombay) | Hindi     | 
| Mumbai (Bombay) | Kannada   | 
| Mumbai (Bombay) | Malajalam | 
| Mumbai (Bombay) | Marathi   | 
| Mumbai (Bombay) | Orija     | 
| Mumbai (Bombay) | Punjabi   | 
| Mumbai (Bombay) | Tamil     | 
| Mumbai (Bombay) | Telugu    | 
| Mumbai (Bombay) | Urdu      | 
+-----------------+-----------+ 
12 rows in set (0.04 sec)

The query takes 0.04 seconds. It’s not a lot, but consider that the database is very small. Use the BENCHMARK function to have more relevant response times if you like.

Let’s see the EXPLAIN:

mysql> explain select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000; 
+----+-------------+-----------------+------------+-------+---------------------+-------------+---------+-----------------------------------+------+----------+-------------+ 
| id | select_type | table           | partitions | type  | possible_keys       | key         | key_len | ref                               | rows | filtered | Extra       | 
+----+-------------+-----------------+------------+-------+---------------------+-------------+---------+-----------------------------------+------+----------+-------------+ 
| 1  | SIMPLE      | countrylanguage | NULL       | index | PRIMARY,CountryCode | CountryCode | 3       | NULL                              | 984  | 100.00   | Using index | 
| 1  | SIMPLE      | city            | NULL       | ref   | CountryCode         | CountryCode | 3       | world.countrylanguage.CountryCode | 18   | 33.33    | Using where | 
+----+-------------+-----------------+------------+-------+---------------------+-------------+---------+-----------------------------------+------+----------+-------------+

Indexes are used for both tables, and the estimated cartesian product has 984 * 18 = 17,712 rows.

Now generate the histogram on the Population column. It’s the only column used for filtering the data and it’s not indexed.

For that, we have to use the ANALYZE command:

mysql> ANALYZE TABLE city UPDATE HISTOGRAM ON population WITH 1024 BUCKETS; 
+------------+-----------+----------+-------------------------------------------------------+ 
| Table      | Op        | Msg_type | Msg_text                                              | 
+------------+-----------+----------+-------------------------------------------------------+ 
| world.city | histogram | status   | Histogram statistics created for column 'Population'. | 
+------------+-----------+----------+-------------------------------------------------------+

We have created a histogram using 1024 buckets. Specifying the number of buckets is not mandatory; it can be any number from 1 to 1024, and if omitted, the default value is 100.

The number of buckets affects the reliability of the statistics. The more distinct values you have, the more buckets you need.
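
As a quick check (a sketch, using the INFORMATION_SCHEMA view described further down in this post), you can see which histogram type MySQL chose and how many buckets were requested for each column; the column aliases here are arbitrary:

-- Field names come from the histogram JSON shown later in this post.
SELECT column_name,
       histogram->>'$."histogram-type"' AS histogram_type,
       histogram->>'$."number-of-buckets-specified"' AS buckets_specified
FROM information_schema.column_statistics
WHERE schema_name = 'world' AND table_name = 'city';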

Let’s have a look now at the execution plan and execute the query again.

mysql> explain select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000;
+----+-------------+-----------------+------------+------+---------------------+-------------+---------+------------------------+------+----------+-------------+
| id | select_type | table           | partitions | type | possible_keys       | key         | key_len | ref                    | rows | filtered | Extra       |
+----+-------------+-----------------+------------+------+---------------------+-------------+---------+------------------------+------+----------+-------------+
|  1 | SIMPLE      | city            | NULL       | ALL  | CountryCode         | NULL        | NULL    | NULL                   | 4188 |     0.06 | Using where |
|  1 | SIMPLE      | countrylanguage | NULL       | ref  | PRIMARY,CountryCode | CountryCode | 3       | world.city.CountryCode |  984 |   100.00 | Using index |
+----+-------------+-----------------+------------+------+---------------------+-------------+---------+------------------------+------+----------+-------------+

mysql> select city.name, countrylanguage.language from city join countrylanguage using(countrycode) where population>10000000;
+-----------------+-----------+
| name            | language  |
+-----------------+-----------+
| Mumbai (Bombay) | Asami     |
| Mumbai (Bombay) | Bengali   |
| Mumbai (Bombay) | Gujarati  |
| Mumbai (Bombay) | Hindi     |
| Mumbai (Bombay) | Kannada   |
| Mumbai (Bombay) | Malajalam |
| Mumbai (Bombay) | Marathi   |
| Mumbai (Bombay) | Orija     |
| Mumbai (Bombay) | Punjabi   |
| Mumbai (Bombay) | Tamil     |
| Mumbai (Bombay) | Telugu    |
| Mumbai (Bombay) | Urdu      |
+-----------------+-----------+
12 rows in set (0.00 sec)

The execution plan is different, and the query runs faster.

We can notice that the order of the tables is now reversed. Even though it requires a full scan, the city table is in the first stage. That’s because of the filtered value, which is only 0.06. It means that only 0.06% of the rows returned by the full scan will be joined with the following table. So, that’s only 4188 * 0.06% = 2.5 rows. In total, the estimated cartesian product is 2.5 * 984 = 2,460 rows. This is significantly lower than in the previous execution and explains why the query is faster.

What we have seen sounds a little counterintuitive, doesn’t it? In fact, until MySQL 5.7, we were used to considering full scans as very bad in most cases. In our case, instead, forcing a full scan using a histogram statistic on a non-indexed column lets the query get optimized. Awesome.

 

Where are the histogram statistics

Histogram statistics are stored in the column_statistics table in the data dictionary and are not directly accessible by users. Instead, the INFORMATION_SCHEMA.COLUMN_STATISTICS table, which is implemented as a view of the data dictionary, can be used for the same purpose.

Let’s see the statistics for our table.

mysql> SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM)  
    -> FROM information_schema.column_statistics  
    -> WHERE COLUMN_NAME = 'population'\G
*************************** 1. row ***************************
           SCHEMA_NAME: world
            TABLE_NAME: city
           COLUMN_NAME: Population
JSON_PRETTY(HISTOGRAM): {
  "buckets": [
    [
      42,
      455,
      0.000980632507967639,
      4
    ],
    [
      503,
      682,
      0.001961265015935278,
      4
    ],
    [
      700,
      1137,
      0.0029418975239029173,
      4
    ],
...
...
    [
      8591309,
      9604900,
      0.9990193674920324,
      4
    ],
    [
      9696300,
      10500000,
      1.0,
      4
    ]
  ],
  "data-type": "int",
  "null-values": 0.0,
  "collation-id": 8,
  "last-updated": "2019-10-14 22:24:58.232254",
  "sampling-rate": 1.0,
  "histogram-type": "equi-height",
  "number-of-buckets-specified": 1024
}

We can see for each bucket the min and max values, the cumulative frequency, and the number of distinct values. Also, we can see that MySQL decided to use an equi-height histogram.
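
If you prefer a tabular view of these buckets over the raw JSON, here is a sketch along the same lines as the readable query shown further down for the Region column (the column aliases are arbitrary):

-- Equi-height buckets store [min, max, cumulative frequency, distinct values].
SELECT jt.min_value, jt.max_value,
       ROUND(jt.cum_freq * 100, 2) AS cumulative_pct,
       jt.num_distinct
FROM information_schema.column_statistics,
     JSON_TABLE(histogram->'$.buckets', '$[*]'
       COLUMNS(min_value BIGINT PATH '$[0]',
               max_value BIGINT PATH '$[1]',
               cum_freq DOUBLE PATH '$[2]',
               num_distinct INT PATH '$[3]')) AS jt
WHERE schema_name = 'world' AND table_name = 'city' AND column_name = 'Population';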

Let’s try to generate a histogram on another table and column.

mysql> ANALYZE TABLE country UPDATE HISTOGRAM ON Region;
+---------------+-----------+----------+---------------------------------------------------+
| Table         | Op        | Msg_type | Msg_text                                          |
+---------------+-----------+----------+---------------------------------------------------+
| world.country | histogram | status   | Histogram statistics created for column 'Region'. |
+---------------+-----------+----------+---------------------------------------------------+
1 row in set (0.01 sec)

mysql> SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM)  FROM information_schema.column_statistics  WHERE COLUMN_NAME = 'Region'\G
*************************** 1. row ***************************
           SCHEMA_NAME: world
            TABLE_NAME: country
           COLUMN_NAME: Region
JSON_PRETTY(HISTOGRAM): {
  "buckets": [
    [
      "base64:type254:QW50YXJjdGljYQ==",
      0.02092050209205021
    ],
    [
      "base64:type254:QXVzdHJhbGlhIGFuZCBOZXcgWmVhbGFuZA==",
      0.04184100418410042
    ],
    [
      "base64:type254:QmFsdGljIENvdW50cmllcw==",
      0.05439330543933054
    ],
    [
      "base64:type254:QnJpdGlzaCBJc2xhbmRz",
      0.06276150627615062
    ],
    [
      "base64:type254:Q2FyaWJiZWFu",
      0.1631799163179916
    ],
    [
      "base64:type254:Q2VudHJhbCBBZnJpY2E=",
      0.20083682008368198
    ],
    [
      "base64:type254:Q2VudHJhbCBBbWVyaWNh",
      0.23430962343096232
    ],
    [
      "base64:type254:RWFzdGVybiBBZnJpY2E=",
      0.3179916317991631
    ],
    [
      "base64:type254:RWFzdGVybiBBc2lh",
      0.35146443514644343
    ],
    [
      "base64:type254:RWFzdGVybiBFdXJvcGU=",
      0.39330543933054385
    ],
    [
      "base64:type254:TWVsYW5lc2lh",
      0.41422594142259406
    ],
    [
      "base64:type254:TWljcm9uZXNpYQ==",
      0.44351464435146437
    ],
    [
      "base64:type254:TWljcm9uZXNpYS9DYXJpYmJlYW4=",
      0.4476987447698744
    ],
    [
      "base64:type254:TWlkZGxlIEVhc3Q=",
      0.5230125523012552
    ],
    [
      "base64:type254:Tm9yZGljIENvdW50cmllcw==",
      0.5523012552301255
    ],
    [
      "base64:type254:Tm9ydGggQW1lcmljYQ==",
      0.5732217573221757
    ],
    [
      "base64:type254:Tm9ydGhlcm4gQWZyaWNh",
      0.602510460251046
    ],
    [
      "base64:type254:UG9seW5lc2lh",
      0.6443514644351465
    ],
    [
      "base64:type254:U291dGggQW1lcmljYQ==",
      0.7029288702928871
    ],
    [
      "base64:type254:U291dGhlYXN0IEFzaWE=",
      0.7489539748953975
    ],
    [
      "base64:type254:U291dGhlcm4gQWZyaWNh",
      0.7698744769874477
    ],
    [
      "base64:type254:U291dGhlcm4gYW5kIENlbnRyYWwgQXNpYQ==",
      0.8284518828451883
    ],
    [
      "base64:type254:U291dGhlcm4gRXVyb3Bl",
      0.891213389121339
    ],
    [
      "base64:type254:V2VzdGVybiBBZnJpY2E=",
      0.9623430962343097
    ],
    [
      "base64:type254:V2VzdGVybiBFdXJvcGU=",
      1.0
    ]
  ],
  "data-type": "string",
  "null-values": 0.0,
  "collation-id": 8,
  "last-updated": "2019-10-14 22:29:13.418582",
  "sampling-rate": 1.0,
  "histogram-type": "singleton",
  "number-of-buckets-specified": 100
}

In this case, a singleton histogram was generated.

Using the following query we can see more human-readable statistics.

mysql> SELECT SUBSTRING_INDEX(v, ':', -1) value, concat(round(c*100,1),'%') cumulfreq,         
    -> CONCAT(round((c - LAG(c, 1, 0) over()) * 100,1), '%') freq   
    -> FROM information_schema.column_statistics, JSON_TABLE(histogram->'$.buckets','$[*]' COLUMNS(v VARCHAR(60) PATH '$[0]', c double PATH '$[1]')) hist  
    -> WHERE schema_name  = 'world' and table_name = 'country' and column_name = 'region';
+---------------------------+-----------+-------+
| value                     | cumulfreq | freq  |
+---------------------------+-----------+-------+
| Antarctica                | 2.1%      | 2.1%  |
| Australia and New Zealand | 4.2%      | 2.1%  |
| Baltic Countries          | 5.4%      | 1.3%  |
| British Islands           | 6.3%      | 0.8%  |
| Caribbean                 | 16.3%     | 10.0% |
| Central Africa            | 20.1%     | 3.8%  |
| Central America           | 23.4%     | 3.3%  |
| Eastern Africa            | 31.8%     | 8.4%  |
| Eastern Asia              | 35.1%     | 3.3%  |
| Eastern Europe            | 39.3%     | 4.2%  |
| Melanesia                 | 41.4%     | 2.1%  |
| Micronesia                | 44.4%     | 2.9%  |
| Micronesia/Caribbean      | 44.8%     | 0.4%  |
| Middle East               | 52.3%     | 7.5%  |
| Nordic Countries          | 55.2%     | 2.9%  |
| North America             | 57.3%     | 2.1%  |
| Northern Africa           | 60.3%     | 2.9%  |
| Polynesia                 | 64.4%     | 4.2%  |
| South America             | 70.3%     | 5.9%  |
| Southeast Asia            | 74.9%     | 4.6%  |
| Southern Africa           | 77.0%     | 2.1%  |
| Southern and Central Asia | 82.8%     | 5.9%  |
| Southern Europe           | 89.1%     | 6.3%  |
| Western Africa            | 96.2%     | 7.1%  |
| Western Europe            | 100.0%    | 3.8%  |
+---------------------------+-----------+-------+

 

Histogram maintenance

Histogram statistics are not automatically recalculated. If you have a table that is very frequently updated with a lot of INSERTs, UPDATEs, and DELETEs, the statistics can become stale very quickly. Unreliable histograms can lead the optimizer to the wrong choice.

When you find that a histogram is useful to optimize a query, you also need a scheduled plan to refresh the statistics from time to time, in particular after making massive modifications to the table.

To refresh a histogram you just need to run the same ANALYZE command we have seen before.
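
For example, repeating the statement used earlier rebuilds the Population histogram in place, replacing the old statistics:

-- Same statement as before; the existing histogram is replaced with fresh statistics.
ANALYZE TABLE city UPDATE HISTOGRAM ON population WITH 1024 BUCKETS;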

To completely drop a histogram you may run the following:

ANALYZE TABLE city DROP HISTOGRAM ON population;

 

Sampling

The histogram_generation_max_mem_size system variable controls the maximum amount of memory available for histogram generation. The global and session values may be set at runtime.

If the estimated amount of data to be read into memory for histogram generation exceeds the limit defined by the variable, MySQL samples the data rather than reading all of it into memory. Sampling is evenly distributed over the entire table.

The default value is 20000000 (bytes), but you can increase it for a large column if you want more accurate statistics. For very large columns, be careful not to raise the threshold above the available memory, in order to avoid excessive overhead or an outage.
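
As a sketch, you can inspect the current limit and raise it for the session before generating a histogram on a large column (the 128MB value below is only an illustrative choice, not a recommendation):

SHOW VARIABLES LIKE 'histogram_generation_max_mem_size';
-- 128MB is an arbitrary example value; size it to your available memory.
SET SESSION histogram_generation_max_mem_size = 128 * 1024 * 1024;
ANALYZE TABLE city UPDATE HISTOGRAM ON population WITH 1024 BUCKETS;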

 

Conclusion

Histogram statistics are particularly useful for non-indexed columns, as shown in the example.

Execution plans that can rely on indexes are usually the best, but histograms can help in some edge cases or when creating a new index is a bad idea.

Since this is not an automatic feature, some manual testing is required to investigate if you really can get the benefit of a histogram. Also, the maintenance requires some scheduled and manual activity.

Use histograms if you really need them, but don’t abuse them since histograms on very large tables can consume a lot of memory.

Usually, the best candidates for a histogram are the columns with:

  • values that do not change much over time
  • low cardinality values
  • uneven distribution

Install Percona Server for MySQL 8.0, test, and enjoy histograms.

 

Further reading on the same topic: Billion Goods in Few Categories – How Histograms Save a Life?

 
