May 20, 2021

How to ensure data quality in the era of Big Data

A little over a decade has passed since The Economist warned us that we would soon be drowning in data. The modern data stack has emerged as a proposed life-jacket for this data flood — spearheaded by Silicon Valley startups such as Snowflake, Databricks and Confluent.

Today, any entrepreneur can sign up for BigQuery or Snowflake and have a data solution that can scale with their business in a matter of hours. The emergence of cheap, flexible and scalable data storage solutions was largely a response to changing needs spurred by the massive explosion of data.

Currently, the world produces 2.5 quintillion bytes of data daily (there are 18 zeros in a quintillion). The explosion of data continues in the roaring ‘20s, both in terms of generation and storage — the amount of stored data is expected to continue to double at least every four years. However, one integral part of modern data infrastructure still lacks solutions suitable for the Big Data era and its challenges: Monitoring of data quality and data validation.

Let me go through how we got here and the challenges ahead for data quality.

The value vs. volume dilemma of Big Data

In 2005, Tim O’Reilly published his groundbreaking article “What is Web 2.0?”, truly setting off the Big Data race. The same year, Roger Mougalas from O’Reilly introduced the term “Big Data” in its modern context, referring to a large set of data that is virtually impossible to manage and process using traditional BI tools.

Back in 2005, one of the two biggest challenges with data was managing large volumes of it, as data infrastructure tooling was expensive and inflexible, and the cloud market was still in its infancy (AWS didn’t publicly launch until 2006). The other was speed: as Tristan Handy from Fishtown Analytics (the company behind dbt) notes, before Redshift launched in 2012, performing relatively straightforward analyses could be incredibly time-consuming even with medium-sized data sets. An entire data tooling ecosystem has since been created to mitigate these two problems.

The emergence of the modern data stack (example logos and categories). Image Credits: Validio

Scaling relational databases and data warehouse appliances used to be a real challenge. Only 10 years ago, a company that wanted to understand customer behavior had to buy and rack servers before its engineers and data scientists could work on generating insights. Data and its surrounding infrastructure were expensive, so only the biggest companies could afford large-scale data ingestion and storage.

The challenge before us is to ensure that the large volumes of Big Data are of sufficiently high quality before they’re used.

Then came a (Red)shift. In October 2012, AWS presented the first viable solution to the scale challenge with Redshift — a cloud-native, massively parallel processing (MPP) database that anyone could use for the monthly price of a pair of sneakers ($100) — about 1,000x cheaper than the previous “local-server” setup. With a price drop of this magnitude, the floodgates opened and every company, big or small, could now store and process massive amounts of data and unlock new opportunities.

As Jamin Ball from Altimeter Capital summarizes, Redshift was a big deal because it was the first cloud-native OLAP warehouse and reduced the cost of owning an OLAP database by orders of magnitude. The speed of processing analytical queries also increased dramatically. And later on (Snowflake pioneered this), they separated computing and storage, which, in overly simplified terms, meant customers could scale their storage and computing resources independently.

What did this all mean? An explosion of data collection and storage.

Sep 03, 2020

Avo raises $3M for its analytics governance platform

Avo, a startup that helps businesses better manage their data quality across teams, today announced that it has raised a $3 million seed round led by GGV Capital, with participation from Heavybit, Y Combinator and others.

The company’s founder, Stefania Olafsdóttir, who is currently based in Iceland, was previously the head of data science at QuizUp, which at some point had 100 million users around the world. “I had the opportunity to build up the Data Science Division, and that meant the cultural aspect of helping people ask and answer the right questions — and get them curious about data — but it also meant the technical part of setting up the infrastructure and tools and pipelines, so people can get the right answers when they need it,” she told me. “We were early adopters of self-serve product analytics and culture — and we struggled immensely with data reliability and data trust.”

Image Credits: Avo

As companies collect more data across products and teams, the process tends to become unwieldy and different teams end up using different methods (or just simply different tags), which creates inefficiencies and issues across the data pipeline.
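As a hypothetical illustration of how that divergence bites (the event names and property keys below are invented, not taken from Avo or any customer), consider three teams logging the same sign-up action under different names and shapes:

```python
# Invented example: one user action, tracked three different ways by
# three different teams.
events_from_web = [{"event": "SignUpCompleted", "plan": "pro"}]
events_from_ios = [{"event": "sign_up_complete", "planName": "pro"}]
events_from_android = [{"event": "user_signed_up", "plan_type": "PRO"}]

# A downstream query that only knows one variant silently undercounts.
all_events = events_from_web + events_from_ios + events_from_android
signups = [e for e in all_events if e["event"] == "SignUpCompleted"]
print(f"{len(signups)} of {len(all_events)} sign-ups counted")  # 1 of 3
```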

“At first, that unreliable data just slowed down decision making, because people were just like, didn’t understand the data and needed to ask questions,” Olafsdóttir said about her time at QuizUp. “But then it caused us to actually launch bad product updates based on incorrect data.” Over time, that problem only became more apparent.

“Once organizations realize how big this issue is — that they’re effectively flying blind because of unreliable data, while their competition might be like taking the lead on the market — the default is to patch together a bunch of clunky processes and tools that partially increase the level of liability,” she said. And that clunky process typically involves a product manager and a spreadsheet today.

At its core, the Avo team set out to build a better process around this, and after a few detours and other product ideas, Olafsdóttir and her co-founders regrouped to focus on exactly this problem during their time in the Y Combinator program.

Avo gives developers, data scientists and product managers a shared workspace to develop and optimize their data pipelines. “Good product analytics is the product of collaboration between these cross-functional groups of stakeholders,” Olafsdóttir argues, and the goal of Avo is to give these groups a platform for their analytics planning and governance — and to set company-wide standards for how they create their analytics events.

Once that is done, Avo provides developers with typesafe analytics code and debuggers that allow them to take those snippets and add them to their code within minutes. For some companies, this new process can help them go from spending 10 hours on fixing a specific analytics issue to an hour or less.
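To give a rough sense of what “typesafe analytics code” means in practice, here is a hypothetical Python sketch; the CheckoutCompleted event, its fields and the send_event() helper are invented for illustration and are not Avo’s actual generated code.

```python
from dataclasses import dataclass, asdict

# Hypothetical sketch: a generated tracking function whose signature encodes
# the event schema the team agreed on, so a wrong type or a misspelled
# property fails at the call site (or under a type checker) instead of
# surfacing weeks later in a dashboard.

@dataclass(frozen=True)
class CheckoutCompleted:  # invented event, not an Avo schema
    order_id: str
    revenue_usd: float
    item_count: int

def send_event(name: str, properties: dict) -> None:
    # Stand-in for whatever analytics SDK actually ships the event.
    print(f"sending {name}: {properties}")

def track_checkout_completed(order_id: str, revenue_usd: float, item_count: int) -> None:
    event = CheckoutCompleted(order_id, revenue_usd, item_count)
    send_event("checkout_completed", asdict(event))

track_checkout_completed(order_id="A-1001", revenue_usd=49.90, item_count=3)
```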

Most companies, the team argues, know — deep down — that they can’t fully trust their data. But they also often don’t know how to fix this problem. To help them with this, Avo also today released its Inspector product. This tool processes event streams for a company, visualizes them and then highlights potential errors. These could be type mismatches, missing properties or other discrepancies. In many ways, that’s obviously a great sales tool for a service that aims to avoid exactly these problems.
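The kind of checks such a tool runs can be sketched in a few lines of Python; the schema and events below are invented and say nothing about how Avo Inspector is actually implemented.

```python
# Invented schema and events, purely to show what "type mismatches and
# missing properties" look like in an event stream.
expected_schema = {"user_id": str, "revenue_usd": float, "country": str}

incoming_events = [
    {"user_id": "u1", "revenue_usd": 12.5, "country": "IS"},
    {"user_id": "u2", "revenue_usd": "12.5", "country": "IS"},  # type mismatch
    {"user_id": "u3", "country": "DE"},                         # missing property
]

for i, event in enumerate(incoming_events):
    for prop, expected_type in expected_schema.items():
        if prop not in event:
            print(f"event {i}: missing property '{prop}'")
        elif not isinstance(event[prop], expected_type):
            found = type(event[prop]).__name__
            print(f"event {i}: '{prop}' is {found}, expected {expected_type.__name__}")
```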

One of Avo’s early customers is Rappi, the Latin American delivery service. “This year we scaled to meet the demand of 100,000 new customers digitizing their deliveries and curbside pickups. The problem with every new software release was that we’d break analytics. It represented 25% of our Jira tickets,” said Rappi’s head of Engineering, Damian Sima. “With Avo we create analytics schemas upfront, identify analytics issues fast, add consistency over time and ensure data reliability as we help customers serve the 12+ million monthly users their businesses attract.”

Like most startups at this stage, Avo plans to use the new funding to build out its team and continue to develop its product.

“The next trillion-dollar software market will be driven from the ground up, with developers deciding the tools they use to create digital transformation across every industry. Avo offers engineers ease of implementation while still retaining schemas and analytics governance for product leaders,” said GGV Capital Managing Partner Glenn Solomon. “Our investment in Avo is an investment in software developers as the new kingmakers and product leaders as the new oracles.”

May 27, 2020

Toro snags $4M seed investment to monitor data quality

Toro’s founders started at Uber helping monitor the data quality in the company’s vast data catalogs, and they wanted to put that experience to work for a more general audience. Today, the company announced a $4 million seed round.

The round was co-led by Costanoa Ventures and Point72 Ventures, with help from a number of individual investors.

Company co-founder and CEO Kyle Kirwan says the startup wanted to bring to data the kind of automated monitoring we have in application performance monitoring products. Instead of getting an alert when the application is performing poorly, you would get an alert that there is an issue with the data.

“We’re building a monitoring platform that helps data teams find problems in their data content before that gets into dashboards and machine learning models and other places where problems in the data could cause a lot of damage,” Kirwan told TechCrunch.

When it comes to data, there are specific kinds of issues a product like Toro would be looking for. It might be a figure that falls outside a specific dollar range, which could be indicative of fraud, or simply data that was labeled differently from previous conventions in a way that could break a model.

The founders learned the lessons they used to build Toro while working on the data team at Uber. They had helped build tools there to find these kinds of problems, but in a way that was highly specific to Uber. When they started Toro, they needed to build a more general-purpose tool.

The product works by understanding what it’s looking at in terms of data, and what the normal thresholds are for a particular type of data. Anything that falls outside of the threshold for a particular data point would trigger an alert, and the data team would need to go to work to fix the problem.
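In spirit, that amounts to learning a normal band for a metric from its history and alerting when a new value lands outside it. The sketch below is a generic illustration with invented numbers (a three-sigma band over daily totals), not Toro’s implementation.

```python
import statistics

# Invented history of a monitored metric, e.g. a table's daily revenue total.
historical_daily_totals = [10_250.0, 9_980.0, 10_400.0, 10_120.0, 9_875.0]

mean = statistics.mean(historical_daily_totals)
stdev = statistics.stdev(historical_daily_totals)
lower, upper = mean - 3 * stdev, mean + 3 * stdev  # assumed "normal" band

def check_batch(label: str, value: float) -> None:
    # Flag anything outside the learned range for the data team to investigate.
    if lower <= value <= upper:
        print(f"ok: {label} = {value:,.2f}")
    else:
        print(f"ALERT: {label} = {value:,.2f} outside [{lower:,.2f}, {upper:,.2f}]")

check_batch("today_total_usd", 10_300.0)  # within the historical range
check_batch("today_total_usd", 2_100.0)   # would trigger an alert
```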

Casey Aylward, vice president at Costanoa Ventures, likes the pedigree of this team and the problem it’s trying to solve. “Despite its importance, data quality has remained a challenge for many enterprise companies,” she said in a statement. She added, “[The co-founders’] deep experience building several of Uber’s internal data tools makes them uniquely qualified to build the best solution.”

The company has been at this for just over a year and has been keeping it lean with four employees, including the two co-founders, but they do have plans to add a couple of data scientists in the coming year as they continue to build out the product.
