Jun 24, 2021

Firebolt raises $127M more for its new approach to cheaper and more efficient Big Data analytics

Snowflake changed the conversation for many companies when it comes to the potential of data warehousing. Now one of the startups that’s hoping to disrupt the disruptor is announcing a big round of funding to expand its own business.

Firebolt, which has built a new kind of cloud data warehouse that promises much more efficient, and cheaper, analytics around whatever is stored within it, is announcing a major Series B of $127 million on the heels of huge demand for its services.

The company, which only came out of stealth mode in December, is not disclosing its valuation with this round, which brings the total raised by the Israeli company to $164 million. New backers Dawn Capital and K5 Global are in this round, alongside previous backers Zeev Ventures, TLV Partners, Bessemer Venture Partners and Angular Ventures.

Nor is it disclosing many details about its customers at the moment. CEO and co-founder Eldad Farkash told me in an interview that most of them are U.S.-based, and that the numbers have grown from the dozen or so that were using Firebolt when it was still in stealth mode (it worked quietly for a couple of years building its product and onboarding customers before finally launching six months ago). They are all migrating from existing data warehousing solutions like Snowflake or BigQuery. In other words, its customers are already cloud-native, Big Data companies: it’s not trying to proselytize on the basic concept but work with those who are already in a specific place as a business.

“If you’re not using Snowflake or BigQuery already, we prefer you come back to us later,” he said. Judging by the size and quick succession of the round, that focus is paying off.

The challenge that Firebolt set out to tackle is that while data warehousing has become a key way for enterprises to analyze, update and manage their big data stores — after all, your data is only as good as the tools you have to parse it and keep it secure — typically data warehousing solutions are not efficient, and they can cost a lot of money to maintain.

The challenge was seen firsthand by Firebolt’s three founders, Farkash (CEO), Saar Bitner (COO) and Ariel Yaroshevich (CTO), at a previous company, the business intelligence powerhouse Sisense, where Farkash was a co-founder and Bitner and Yaroshevich were members of the founding team. At Sisense, the company continually came up against an issue: when dealing in terabytes of data, cloud data warehouses strained to deliver good performance to power its analytics and other tools, and the only way to mitigate that was to pile on more cloud capacity. And that started to become very expensive.

Firebolt set out to fix that by taking a different approach, rearchitecting the concept. As Farkash sees it, while data warehousing has indeed been a big breakthrough in Big Data, it has started to feel like a dated solution as data troves have grown.

“Data warehouses are solving yesterday’s problem, which was, ‘How do I migrate to the cloud and deal with scale?’” he told me back in December. Google’s BigQuery, Amazon’s RedShift and Snowflake are fitting answers for that issue, he believes, but “we see Firebolt as the new entrant in that space, with a new take on design on technology. We change the discussion from one of scale to one of speed and efficiency.”

The startup claims that its performance is up to 182 times faster than that of other data warehouses. Its SQL-based system builds on academic research that had yet to be applied anywhere, around how to handle data in a lighter way, using new techniques in compression and in how data is parsed. Data lakes in turn can be connected with a wider data ecosystem, and what that translates to is a much smaller requirement for cloud capacity, and lower costs.

Fast forward to today, and the company says the concept is gaining a lot of traction with engineers and developers in areas like business intelligence, customer-facing services that need to parse large amounts of information to serve results to users in real time, and back-end data applications. That is proving out the shift investors suspected was coming before the startup even launched, stealthily or otherwise.

“I’ve been an investor at Firebolt since their Series A round and before they had any paying customers,” said Oren Zeev of Zeev Ventures. “What had me invest in Firebolt is mostly the team. A group of highly experienced executives mostly from the big data space who understand the market very well, and the pain organizations are experiencing. In addition, after speaking to a few of my portfolio companies and Firebolt’s initial design partners, it was clear that Firebolt is solving a major pain, so all in all, it was a fairly easy decision. The market in which Firebolt operates is huge if you consider the valuations of Snowflake and Databricks. Even more importantly, it is growing rapidly as the migration from on-premise data warehouse platforms to the cloud is gaining momentum, and as more and more companies rely on data for their operations and are building data applications.”

Dec 9, 2020

Firebolt raises $37M to take on Snowflake, Amazon and Google with a new approach to data warehousing

For many organizations, the shift to cloud computing has played out more realistically as a shift to hybrid architectures, where a company’s data is just as likely to reside in one of a number of clouds as it might in an on-premise deployment, in a data warehouse or in a data lake. Today, a startup that has built a more comprehensive way to assess, analyse and use that data is announcing funding as it looks to take on Snowflake, Amazon, Google and others in the area of enterprise data analytics.

Firebolt, which has redesigned the concept of a data warehouse to work more efficiently and at a lower cost, is today announcing that it has raised $37 million from Zeev Ventures, TLV Partners, Bessemer Venture Partners and Angular Ventures. It plans to use the funding to continue developing its product and bring on more customers.

The company is officially “launching” today but — as is the case with so many enterprise startups these days operating in stealth — it has been around for two years already building its platform and signing commercial deals. It now has some 12 large enterprise customers and is “really busy” with new business, said CEO Eldad Farkash in an interview.

The funding may sound like a large amount for a company that has not really been out in the open, but part of the reason is the track record of the founders. Farkash was one of the founders of Sisense, the successful business intelligence startup, and he has co-founded Firebolt with two others who were on Sisense’s founding team: Saar Bitner as COO and Ariel Yaroshevich as CTO.

At Sisense, these three kept coming up against an issue: when dealing in terabytes of data, cloud data warehouses strained to deliver good performance to power the company’s analytics and other tools, and the only way to mitigate that was to pile on more cloud capacity.

Farkash is something of a technical savant and said that he decided to move on and build Firebolt to see if he could tackle this, which he described as a new, difficult and “meaningful” problem. “The only thing I know how to do is build startups,” he joked.

In his opinion, while data warehousing has been a big breakthrough in how to handle the mass of data that companies now amass and want to use better, it has started to feel like a dated solution.

“Data warehouses are solving yesterday’s problem, which was, ‘How do I migrate to the cloud and deal with scale?’ ” he said, citing Google’s BigQuery, Amazon’s RedShift and Snowflake as fitting answers for that issue. “We see Firebolt as the new entrant in that space, with a new take on design on technology. We change the discussion from one of scale to one of speed and efficiency.”

The startup claims that its performance is up to 182 times faster than that of other data warehouses. It’s a SQL-based system that works on principles that Farkash said came out of academic research that had yet to be applied anywhere, around how to handle data in a lighter way, using new techniques in compression and how data is parsed. Data lakes in turn can be connected with a wider data ecosystem, and what it translates to is a much smaller requirement for cloud capacity.

This is not just a problem at Sisense. With enterprise data continuing to grow exponentially, cloud analytics is growing with it; Firebolt estimates it will be a $65 billion market by 2025.

Still, Farkash said the Firebolt concept was initially a challenging sell, even to the engineers it eventually hired to build out the business: it required building completely new warehouses from the ground up to run the platform. Five of these exist today, and more will be added on the back of this funding, he said.

And it should be pointed out that its competitors are not exactly sitting still either. Just yesterday, Dataform announced that it had been acquired by Google to help build out BigQuery and improve its performance.

“Firebolt created a SaaS product that changes the analytics experience over big data sets,” Oren Zeev of Zeev Ventures said in a statement. “The pace of innovation in the big data space has lagged the explosion in data growth rendering most data warehousing solutions too slow, too expensive, or too complex to scale. Firebolt takes cloud data warehousing to the next level by offering the world’s most powerful analytical engine. This means companies can now analyze multi Terabyte / Petabyte data sets easily at significantly lower costs and provide a truly interactive user experience to their employees, customers or anyone who needs to access the data.”

Sep 16, 2020

Narrator raises $6.2M for a new approach to data modelling that replaces star schema

Snowflake went public this week, and in a mark of the wider ecosystem that is evolving around data warehousing, a startup that has built a completely new concept for modelling warehoused data is announcing funding. Narrator — which uses an 11-column ordering model rather than standard star schema to organise data for modelling and analysis — has picked up a Series A round of $6.2 million, money that it plans to use to help it launch and build up users for a self-serve version of its product.

The funding is being led by Initialized Capital along with continued investment from Flybridge Capital Partners and Y Combinator — where the startup was in a 2019 cohort — as well as new investors, including Paul Buchheit.

Narrator has been around for three years, but its first phase was based around providing modelling and analytics directly to companies as a consultancy, helping companies bring together disparate, structured data sources from marketing, CRM, support desks and internal databases to work as a unified whole. As consultants, using an earlier build of the tool that it’s now launching, the company’s CEO Ahmed Elsamadisi said he and others each juggled queries “for eight big companies single-handedly,” while deep-dive analyses were done by another single person.

Having validated that it works, the new self-serve version aims to give data scientists and analysts a simplified way of ordering data so that queries, described as actionable analyses in a story-like format — or “Narratives,” as the company calls them — can be made across that data quickly — hours rather than weeks — and consistently. (You can see a demo of how it works below provided by the company’s head of data, Brittany Davis.)

The new data-as-a-service is also priced in SaaS tiers, with a free tier for the first 5 million rows of data, and a sliding scale of pricing after that based on data rows, user numbers and Narratives in use.


Elsamadisi, who co-founded the startup with Matt Star, Cedric Dussud and Michael Nason, said that data analysts have long lived with the problems with star schema modelling (and by extension the related format of snowflake schema), which can be summed up as “layers of dependencies, lack of source of truth, numbers not matching and endless maintenance,” he said.

“At its core, when you have lots of tables built from lots of complex SQL, you end up with a growing house of cards requiring the need to constantly hire more people to help make sure it doesn’t collapse.”

(We)Work Experience

It was while he was working as lead data scientist at WeWork — yes, he told me, maybe it wasn’t actually a tech company, but it had “tech at its core” — that he had a breakthrough moment of realising how to restructure data to get around these issues.

Before that, things were tough on the data front. WeWork had 700 tables that his team was managing using a star schema approach, covering 85 systems and 13,000 objects. The data covered everything from acquiring buildings to the flow of customers through those buildings, how usage changed and when customers might churn, along with marketing and social network activity, and so on, all growing in line with the company’s own rapidly scaling empire. It added up to a mess at the data end.

“Data analysts wouldn’t be able to do their jobs,” he said. “It turns out we could barely even answer basic questions about sales numbers. Nothing matched up, and everything took too long.”

The team had 45 people on it, but even so it ended up having to implement a hierarchy for answering questions, as there were so many and not enough time to dig through and answer them all. “And we had every data tool there was,” he added. “My team hated everything they did.”

The single-table column model that Narrator uses, he said, “had been theorised” in the past but hadn’t been figured out.

The spark, he said, was to think of data structured in the same way that we ask questions, where — as he described it — each piece of data can be bridged together and then also used to answer multiple questions.

“The main difference is we’re using a time-series table to replace all your data modelling,” Elsamadisi explained. “This is not a new idea, but it was always considered impossible. In short, we tackle the same problem as most data companies to make it easier to get the data you want but we are the only company that solves it by innovating on the lowest-level data modelling approach. Honestly, that is why our solution works so well. We rebuilt the foundation of data instead of trying to make a faulty foundation better.”

Narrator calls the composite table, which includes all of your data reformatted to fit in its 11-column structure, the Activity Stream.
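The article does not spell out the 11 columns, but the general shape of such a single time-series table is easy to sketch. The DDL below is purely illustrative, with hypothetical column names rather than Narrator’s published schema: the idea is that every source system is reduced to rows of “customer X did activity Y at time T,” plus a few generic attribute columns.

-- Hypothetical illustration only; these column names are not Narrator's actual schema.
CREATE TABLE activity_stream (
  activity_id    VARCHAR(64) PRIMARY KEY,   -- unique id for the event
  ts             TIMESTAMP    NOT NULL,     -- when the activity happened
  customer       VARCHAR(255),              -- who did it
  activity       VARCHAR(255) NOT NULL,     -- what they did, e.g. 'opened_email'
  feature_1      VARCHAR(255),              -- generic, activity-specific attributes
  feature_2      VARCHAR(255),
  feature_3      VARCHAR(255),
  revenue_impact DECIMAL(12,2),             -- optional monetary value of the event
  link           VARCHAR(1024)              -- optional pointer back to the source record
);

Questions are then answered by filtering and relating rows of this one table by customer and time, rather than by joining a web of fact and dimension tables.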

Elsamadisi said using Narrator for the first time takes about 30 minutes, and about a month to learn to use it thoroughly. “But you’re not going back to SQL after that, it’s so much faster,” he added.

Narrator’s initial market has been providing services to other tech companies, and specifically startups, but the plan is to open it up to a much wider set of verticals. And in a move that might help with that, longer term, it also plans to open source some of its core components so that third parties can build data products on top of the framework more quickly.

As for competitors, he says that it’s essentially the tools that he and other data scientists have always used, although “we’re going against a ‘best practice’ approach (star schema), not a company.” Airflow, DBT, Looker’s LookML, Chartio’s Visual SQL, Tableau Prep are all ways to create and enable the use of a traditional star schema, he added. “We’re similar to these companies — trying to make it as easy and efficient as possible to generate the tables you need for BI, reporting and analysis — but those companies are limited by the traditional star schema approach.”

So far the proof has been in the data. Narrator says that companies average around 20 transformations (the unit used to answer questions) compared to hundreds in a star schema, and that those transformations average 22 lines compared to 1,000+ lines in traditional modelling. For those that learn how to use it, the average time for generating a report or running some analysis is four minutes, compared to weeks in traditional data modelling. 

“Narrator has the potential to set a new standard in data,” said Jen Wolf, Initialized Capital COO and partner and new Narrator board member, in a statement. “We were amazed to see the quality and speed with which Narrator delivered analyses using their product. We’re confident once the world experiences Narrator this will be how data analysis is taught moving forward.”

May 15, 2018

MemSQL raises $30M Series D round for its real-time database

MemSQL, a company best known for the real-time capabilities of its eponymous in-memory database, today announced that it has raised a $30 million Series D round, bringing the company’s overall funding to $110 million. The round was led by GV (the firm you probably still refer to as Google Ventures) and Glynn Capital. Existing investors Accel, Caffeinated Capital, Data Collective and IA Ventures also participated.

MemSQL offers a distributed, relational database that uses standard SQL drivers and queries for transactions and analytics. Its defining feature is its data ingestion technology, which allows users to push millions of events per day into the service while querying the records in real time. The company recently showed that its tools can deliver a scan rate of over a trillion rows per second on a cluster with 12 servers.

The database is available for deployments on the major public clouds and on-premises.

MemSQL recently announced that it saw its fourth-quarter commercial bookings hit 200 percent year-over-year growth — and that’s typically the kind of growth that investors like to see, even as MemSQL plays in a very competitive market with plenty of incumbents, startups and even open-source projects. Current MemSQL users include the likes of Uber, Akamai, Pinterest, Dell EMC and Comcast.

“MemSQL has achieved strong enterprise traction by delivering a database that enables operational analysis at unique speed and scale, allowing customers to create dynamic, intelligent applications,” said Adam Ghobarah, general partner at GV, in today’s announcement. “The company has demonstrated measurable success with its growing enterprise customer base and we’re excited to invest in the team as they continue to scale.”

Dec 9, 2016

Segment adds Google BigQuery to expand its customer data platform

Segment helps businesses manage all their data from services like Google Analytics, Mixpanel and Salesforce — and send that data to a variety of destinations, whether it’s an attribution product or a data warehouse. The company announced yesterday that it’s expanding its offerings by integrating with BigQuery, Google’s data warehousing service. This might seem like…

Mar 31, 2016

Microsoft now lets developers embed Power BI visualizations into their own apps

Power BI is Microsoft’s tool for analyzing data and building interactive data-based dashboards and reports. At its Build developer conference in San Francisco, the company today announced the preview of Power BI Embedded. With this, developers will be able to integrate Power BI and its interactive dashboards right into their own apps.

Nov 2, 2015

Zeta Interactive Acquires eBay Enterprise’s CRM Business

eBay today finalized the sale of its previously announced eBay Enterprise business. As part of this deal, Zeta Interactive, the big data-driven marketing firm co-founded by David Steinberg and former Apple and Pepsi-Cola CEO John Sculley, today announced that it has acquired the CRM division of eBay Enterprise.

Apr 20, 2015

Amazon’s AWS Quietly Acquired NoSQL Database Migration Startup Amiato In May 2014

Amazon Web Services, the enterprise division of the e-commerce and cloud services giant, has been growing its reach in data warehousing and big data analytics by way of Redshift, a hosted product that it launched in 2012 and is now AWS’s fastest growing product “ever”, according to CTO Werner Vogels. Along with input from AWS’s own in-house engineering teams, it…

May 6, 2011

Shard-Query turbo charges Infobright community edition (ICE)

Shard-Query is an open source toolkit which helps improve the performance of queries against a MySQL database by distributing the work over multiple machines and/or multiple cores. This is similar to the divide-and-conquer approach that Hive takes in combination with Hadoop. Shard-Query applies a clever approach to parallelism which allows it to significantly improve the performance of queries by spreading the work over all available compute resources. In this test, Shard-Query averages a nearly 6x improvement (max over 10x) over the baseline.

One significant advantage of Shard-Query over Hive is that it works with existing MySQL data sets and queries. Another advantage is that it works with all MySQL storage engines.

This set of benchmarks evaluates how well Infobright community edition (ICE) performs in combination with Shard-Query.

Data set

It was important to choose a data set that was large enough to create queries that would run for a decent amount of time, but not so large that it was difficult to work with. The ontime flight performance statistics data, available online from the United States Bureau of Transportation Statistics (BTS), made a good candidate for testing, as it had been tested before:
- Another MPB post
- LucidDB testing

The raw data is a completely denormalized schema (single table). In order to demonstrate the power of Shard-Query it is important to test complex queries involving joins and aggregation. A star schema is the most common OLAP/DW data model, since it typically represents a data mart. See also: “Data mart or data warehouse?”. As it is the most common data model, it is desirable to benchmark using a star schema, even though it involves work to transform the data.

Star schema

Transforming the data was straightforward. I did this preprocessing with the MyISAM storage engine and then dumped the data to tab-delimited flat files using mysqldump. I started by loading the raw data from the BTS into a single staging table called ontime_stage.
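As a minimal sketch of that first step (the file name, and the assumption that the staging table’s columns match the CSV column order, are mine rather than from the original post), one month of the raw BTS CSV could be loaded like this:

-- Illustrative only: load one month of raw BTS data into the staging table.
LOAD DATA LOCAL INFILE 'On_Time_On_Time_Performance_2010_12.csv'
  INTO TABLE ontime_stage
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
  IGNORE 1 LINES;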

Then, the airport information was extracted:

-- Build the airport dimension: one row per airport code, taken from both the
-- origin and destination columns of the staging table (UNION removes duplicates).
create table dim_airport(
airport_id int auto_increment primary key,
unique key(airport_code)
)
as
select
  Origin as `airport_code`,
  OriginCityName as `CityName`,
  OriginState as `State`,
  OriginStateFips as `StateFips`,
  OriginStateName as `StateName`,
  OriginWac as `Wac`
FROM ontime_stage
UNION
select
  Dest as `airport_code`,
  DestCityName as `CityName`,
  DestState as `State`,
  DestStateFips as `StateFips`,
  DestStateName as `StateName`,
  DestWac as `Wac`
FROM ontime_stage;

After extracting flight/airline and date information in a similar fashion, a final table `ontime_fact` is created by joining the newly constructed dimension tables to the staging table, omitting the dimension columns from the projection and replacing them with the dimension keys:

-- Build the fact table: join the staging table to each dimension, keeping the
-- surrogate keys plus the remaining measures (column list abbreviated).
create table ontime_fact as
select dim_date.date_id,
       origin.airport_id as origin_airport_id,
       dest.airport_id as dest_airport_id,
       dim_flight.flight_id,
       ontime_stage.TailNum, ...
from ontime_stage
join dim_date using(FlightDate)
join dim_airport origin on ontime_stage.Origin = origin.airport_code
join dim_airport dest on ontime_stage.Dest = dest.airport_code
join dim_flight using (UniqueCarrier,AirlineID,Carrier,FlightNum);
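As noted above, the date and flight dimensions were extracted the same way as dim_airport. As a hedged illustration rather than the post’s exact DDL, the date dimension might have been built like this, keeping the calendar attributes the test queries later filter and group on:

-- Illustrative sketch of the date dimension, following the dim_airport pattern.
create table dim_date(
date_id int auto_increment primary key,
unique key(FlightDate)
)
as
select distinct
  FlightDate,
  Year,
  Quarter,
  Month,
  DayofMonth,
  DayOfWeek
FROM ontime_stage;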

The data set contains ontime flight information for 23 years, which can be confirmed by examining the contents of the date dimension:

mysql> select count(*),
min(FlightDate),
max(FlightDate)
from dim_date\G
*************************** 1. row ***************************
       count(*): 8401
min(FlightDate): 1988-01-01
max(FlightDate): 2010-12-31
1 row in set (0.00 sec)

The airport dimension is a puppet dimension (more commonly called a role-playing dimension): it serves as both the origin and the destination dimension, being referenced by origin_airport_id and dest_airport_id in the fact table, respectively. There are nearly 400 major airports included in the data set.

mysql> select count(*) from dim_airport;
+----------+
| count(*) |
+----------+
|      396 |
+----------+
1 row in set (0.00 sec)

The final dimension is the flight dimension, which contains the flight numbers and air carrier hierarchies. Only the largest air carriers are required to report ontime information to the U.S. Department of Transportation, so there are only 29 air carriers in the table:

mysql> select count(*),
count(distinct UniqueCarrier)
from dim_flight\G
*************************** 1. row ***************************
                     count(*): 58625
count(distinct UniqueCarrier): 29
1 row in set (0.02 sec)

The fact table contains about 135 million flights in total, several million per year:

mysql> select count(*) from ontime_one.ontime_fact;
+-----------+
| count(*)  |
+-----------+
| 135125787 |
+-----------+
1 row in set (0.00 sec)

This should be made fully clear by the following schema diagram:

[Diagram of the ontime dimensional schema: a star schema with ontime_fact at the center, joined to dim_date, dim_flight and dim_airport.]

Test environment

For this benchmark, the test environment was a single commodity database server with 6 cores (+6 HT) and 24GB of memory, running Fedora 14. Oracle VirtualBox OSE was used to create six virtual machines, also running Fedora 14, each granted 4GB of memory. A battery-backed RAID10 array of 7200rpm SATA disks provided the underlying storage for the virtual machines.

Baseline:
The MySQL command line client was used to execute a script file containing the 11 queries; the same SQL script was used in the Shard-Query tests. For the baseline, the results and response times were captured with the \T command. The queries were executed on a single database schema containing all of the data. Before loading there was approximately 23GB of data; after loading, ICE compressed this to about 2GB. The test virtual machine was assigned 12 cores for this test.

Scale-up:
Shard-Query was given the following configuration file, which lists only one server. A single schema (ontime_one) contained all of the data. The test virtual machine was assigned 12 cores for this test; it was the same VM as in the baseline test, and it was rebooted between tests. A SQL script was fed to the run_query.php script and the output was captured with the ‘tee’ command.

$ cat one.ini
[default]
user=mysql
db=ontime_one
password=
port=5029
column=date_id
mapper=hash
gearman=127.0.0.1:7000
inlist=*
between=*

[shard1]
host=192.168.56.102

Scale-out:
In addition to adding parallelism via scale-up, Shard-Query can improve the performance of almost all queries by spreading them over more than one physical server. This is called "scaling out," and it allows Shard-Query to vastly improve the performance of queries which have to examine a large amount of data. Shard-Query includes a loader (loader.php) which can either split a data set into multiple files (one per shard, for later loading) or load files directly, in parallel, to multiple hosts.

Shard-Query will execute queries in parallel over all of these machines. With enough machines, massive parallelism is possible and very large data sets may be processed. As each machine examines only a small subset of the data, performance can be improved significantly:

$ cat shards.ini
[default]
user=mysql
db=ontime
password=
port=5029
column=date_id
mapper=hash
gearman=127.0.0.1:7000
inlist=*
between=*

[shard1]
host=192.168.56.101
db=ontime

[shard2]
host=192.168.56.102
db=ontime

[shard3]
host=192.168.56.103
db=ontime

[shard4]
host=192.168.56.105
db=ontime

[shard5]
host=192.168.56.106
db=ontime

[shard6]
host=192.168.56.104
db=ontime

In this configuration, each shard has about 335MB-350MB of data (23GB of raw data, compressed to about 2GB in total, then spread over six servers). Due to ICE limitations, the data was split before loading. The splitting/loading process will be described in another post.
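As a purely conceptual illustration of what column=date_id with mapper=hash means (the real loader is deferred to the follow-up post), each row’s target shard can be thought of as a hash of the shard key taken modulo the number of shards:

-- Conceptual only: route rows to one of six shards by hashing the shard column.
SELECT date_id,
       MOD(CRC32(date_id), 6) + 1 AS target_shard
  FROM dim_date
 LIMIT 5;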

Complex queries

Shard-Query was tested with the simple single-table version of this data set in a previous blog post. Previous testing was limited to a subset of Vadim’s test queries (see that post). As this new test schema is normalized, Vadim’s test queries must be modified to reflect the more complex schema structure. For this benchmark each of the original queries has been rewritten to conform to the new schema, and two new test queries have been added. To model real-world complexity, each of the test queries features at least one join, and many of the filter conditions are placed on attributes in the dimension tables. It will be very interesting to test these queries on various engines and databases.

Following is a list of the queries, along with a table recording the response times for each query. The queries should be self-explanatory.

Performance, at a glance

You will notice that Shard-Query is faster in nearly every case. Query performance improves significantly by scaling out, even more so than by scaling up, because additional parallelism is added beyond what scale-up can exploit. Scale-up can improve query performance when there are enough resources to support the added parallelism and when one of the following is used in the query: BETWEEN or IN clauses, subqueries in the FROM clause, or UNION/UNION ALL clauses. If none of those features are used, then parallelism can’t be added; Q9 is an example of such a query.
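To make the BETWEEN/IN case concrete, here is a conceptual (not literal) version of the rewrite that makes Q1 parallelizable: the range on Year is split into one branch per year, each branch can run on a different core or shard, and the partial counts are re-aggregated. Shard-Query does this kind of decomposition automatically; the SQL below only illustrates the idea.

-- Conceptual illustration of how a BETWEEN range can be decomposed for parallelism.
SELECT DayOfWeek, SUM(c) AS c
  FROM (
        SELECT DayOfWeek, count(*) AS c
          from ontime_fact
          JOIN dim_date using (date_id)
         WHERE Year = 2000
         GROUP BY DayOfWeek
        UNION ALL
        -- ... one branch per year, 2001 through 2007 ...
        SELECT DayOfWeek, count(*) AS c
          from ontime_fact
          JOIN dim_date using (date_id)
         WHERE Year = 2008
         GROUP BY DayOfWeek
       ) t
 GROUP BY DayOfWeek
 ORDER BY c DESC;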

Query details:

-- Q1
SELECT DayOfWeek, count(*) AS c
  from ontime_fact
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2000 AND 2008
 GROUP BY DayOfWeek
 ORDER BY c DESC;

-- Q2
SELECT DayOfWeek, count(*) AS c
  from ontime_fact
  JOIN dim_date using (date_id)
 WHERE DepDelay>10 AND Year BETWEEN 2000 AND 2008
 GROUP BY DayOfWeek
 ORDER BY c DESC;

-- Q3
SELECT CityName as Origin, count(*) AS c
  from ontime_fact
  JOIN dim_date using (date_id)
  JOIN dim_airport origin
    ON origin_airport_id = origin.airport_id
 WHERE DepDelay>10
   AND Year BETWEEN 2000 AND 2008
 GROUP BY 1
 ORDER BY c
 LIMIT 10;


The next queries show how performance improves when Shard-Query adds parallelism for “subqueries in the FROM clause.” There are benefits with both “scale-up” and “scale-out”, but once again, the “scale-out” results are the most striking.

-- Q4
SELECT Carrier, count(*) as c
  from ontime_fact
  JOIN dim_date using (date_id)
  join dim_flight using(flight_id)
 WHERE DepDelay>10
   AND Year=2007
 GROUP BY Carrier
 ORDER BY c DESC;

-- Q5
SELECT t.Carrier, c, c2, c*1000/c2 as c3
FROM
     (SELECT Carrier, count(*) AS c
        from ontime_fact
        join dim_date using(date_id)
        join dim_flight using(flight_id)
       WHERE DepDelay>10
         AND Year=2007
       GROUP BY Carrier) t
JOIN (SELECT Carrier, count(*) AS c2
        from ontime_fact
        join dim_date using(date_id)
        join dim_flight using(flight_id)
       WHERE Year=2007
       GROUP BY Carrier) t2
  ON (t.Carrier=t2.Carrier)
ORDER BY c3 DESC;

-- Q6
SELECT t.Year, c1 / c2 as ratio
FROM
     (select Year, count(*)*1000 as c1
        from ontime_fact
        join dim_date using (date_id)
       WHERE DepDelay>10
       GROUP BY Year) t
JOIN (select Year, count(*) as c2
        from ontime_fact
        join dim_date using (date_id)
       GROUP BY Year) t2
  ON (t.Year=t2.Year);

-- Q7
SELECT t.Year, c1 / c2 as ratio
  FROM (select Year, count(Year)*1000 as c1
          from ontime_fact
          join dim_date using (date_id)
         WHERE DepDelay>10
         GROUP BY Year) t
  JOIN (select Year, count(*) as c2
          from ontime_fact
          join dim_date using (date_id)
         GROUP BY Year) t2
    ON (t.Year=t2.Year);


The performance of the following queries depends on the size of the date range:

-- Q8.0
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2001 and 2001
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.1
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2001 and 2005
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.2
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2001 and 2011
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.3
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 1990 and 2011
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.4
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 1980 and 2011
 GROUP BY dest.CityName
 ORDER BY 2 DESC;


Finally, Shard-Query performance continues to improve when grouping and filtering are used. Again, notice Q9: it doesn’t use any features which Shard-Query can exploit to add parallelism. Thus, in the scale-up configuration it does not perform any better than the baseline, and actually performed a little worse. Since scale-out splits the data between servers, it performs about 6x better, as the degree of parallelism is controlled by the number of shards.

-- Q9
select Year ,count(Year) as c1
  from ontime_fact
  JOIN dim_date using (date_id)
 group by Year;

-- Q10
SELECT Carrier, dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
  JOIN dim_flight using (flight_id)
 WHERE Year BETWEEN 2009 and 2011
 GROUP BY Carrier,dest.CityName
 ORDER BY 3 DESC;

-- Q11
SELECT Year, Carrier, dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
  JOIN dim_flight using (flight_id)
 WHERE Year BETWEEN 2000 and 2003
   AND Carrier = 'AA'
 GROUP BY Year, Carrier,dest.CityName
 ORDER BY 4 DESC;


Conclusion

The divide and conquer approach is very useful when working with large quantities of data. Shard-Query can be used with existing data sets easily, improving the performance of queries significantly if they use common query features like BETWEEN or IN. It is also possible to spread your data over multiple machines, scaling out to improve query response times significantly.

These queries are a great test of Shard-Query features. Shard-Query is currently approaching RC status. If you decide to test it and encounter issues, please file a bug on the bug tracker. You can get Shard-Query (currently in development-release form, as a checkout from SVN) here: Shard-Query Google Code project

Full disclosure

Justin Swanhart, the author of this article, is also the creator and maintainer of Shard-Query. The author has previously worked in cooperation with Infobright, including on benchmarking. These particular tests were performed independently of Infobright, without their knowledge or approval. Infobright was, however, given the chance to review this document before publication, as a courtesy. All findings are represented truthfully, transparently, and without any intended bias.
