Aug 19, 2021

Companies betting on data must value people as much as AI

The Pareto principle, also known as the 80-20 rule, asserts that 80% of consequences come from 20% of causes, leaving the remaining causes with far less impact.

Those working with data may have heard a different rendition of the 80-20 rule: A data scientist spends 80% of their time at work cleaning up messy data as opposed to doing actual analysis or generating insights. Imagine a 30-minute drive expanded to two-and-a-half hours by traffic jams, and you’ll get the picture.

While most data scientists spend more than 20% of their time at work on actual analysis, they still have to waste countless hours turning a trove of messy data into a tidy dataset ready for analysis. This process can include removing duplicate data, making sure all entries are formatted correctly and doing other preparatory work.

On average, this workflow stage takes up about 45% of the total time, a recent Anaconda survey found. An earlier poll by CrowdFlower put the estimate at 60%, and many other surveys cite figures in this range.

None of this is to say data preparation is not important. “Garbage in, garbage out” is a well-known rule in computer science circles, and it applies to data science, too. In the best-case scenario, the script will simply return an error, warning that it cannot calculate the average spending per client because the entry for customer #1527 is formatted as text rather than as a number. In the worst case, the company will act on insights that have little to do with reality.
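
To make that concrete, here is a minimal pandas sketch of the kind of preparatory work described above: dropping duplicate rows, coercing a text-formatted entry to a number and setting aside whatever cannot be parsed. The column names and the figure for customer #1527 are invented for illustration.

```python
import pandas as pd

# Hypothetical spending records; customer 1527's amount arrived as text.
df = pd.DataFrame({
    "customer_id": [1525, 1526, 1526, 1527],
    "spend": [120.0, 85.5, 85.5, "ninety-nine"],
})

# Calling df["spend"].mean() at this point would fail, because the column
# mixes numbers and text. Typical preparation: drop exact duplicates,
# coerce the column to numeric and flag rows that could not be parsed.
df = df.drop_duplicates()
df["spend"] = pd.to_numeric(df["spend"], errors="coerce")
needs_review = df[df["spend"].isna()]   # e.g. the entry for customer 1527
average_spend = df["spend"].mean()      # computed over the clean entries only

print(average_spend, len(needs_review))
```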

The real question is whether re-formatting the data for customer #1527 is the best use of a well-paid expert’s time. The average data scientist is paid between $95,000 and $120,000 per year, according to various estimates. Having an employee at that pay grade focus on mind-numbing, non-expert tasks wastes both their time and the company’s money. Besides, real-world data has a lifespan, and if a dataset for a time-sensitive project takes too long to collect and process, it can be outdated before any analysis is done.

What’s more, companies’ quests for data often waste the time of non-data-focused personnel, too, as employees are asked to help fetch or produce data instead of working on their regular responsibilities. Often, more than half of the data a company collects is never used at all, which means the time of everyone involved in collecting it was spent producing nothing but operational delay and the associated losses.

The data that has been collected, on the other hand, is often only used by a designated data science team that is too overworked to go through everything that is available.

All for data, and data for all

The issues outlined here all stem from the fact that, save for data pioneers like Google and Facebook, companies are still wrapping their heads around how to re-imagine themselves for the data-driven era. Data is pulled into huge databases, data scientists are left with a lot of cleaning to do, and the others, whose time was spent helping fetch the data, rarely benefit from it.

The truth is, we are still early in the data transformation. The success of tech giants that put data at the core of their business models lit a spark that is only now starting to catch. And the fact that results are mixed so far is a sign that companies have yet to master thinking with data.

Data holds much value, and businesses are well aware of it, as shown by the appetite for AI experts at non-tech companies. They just have to go about it the right way, and one of the key tasks in this respect is to focus on people as much as on AI.

Data can enhance the operations of virtually any component within the organizational structure of any business. As tempting as it may be to think of a future where there is a machine learning model for every business process, we do not need to tread that far right now. The goal for any company looking to tap data today comes down to getting it from point A to point B. Point A is the part in the workflow where data is being collected, and point B is the person who needs this data for decision-making.

Importantly, point B does not have to be a data scientist. It could be a manager trying to figure out the optimal workflow design, an engineer looking for flaws in a manufacturing process or a UI designer doing A/B testing on a specific feature. All of these people must have the data they need at hand all the time, ready to be processed for insights.

People can thrive with data just as well as models, especially if the company invests in them and makes sure to equip them with basic analysis skills. In this approach, accessibility must be the name of the game.

Skeptics may claim that big data is nothing but an overused corporate buzzword, but advanced analytics capabilities can enhance the bottom line of any company, as long as the effort comes with a clear plan and appropriate expectations. The first step is to focus on making data accessible and easy to use, not on hauling in as much data as possible.

In other words, an all-around data culture is just as important for an enterprise as the data infrastructure.

Aug 19, 2021

Insight Partners leads $30M round into Metabase, developing enterprise business intelligence tools

Open-source business intelligence company Metabase announced Thursday a $30 million Series B round led by Insight Partners.

Existing investors Expa and NEA joined in on the round, which gives the San Francisco-based company a total of $42.5 million in funding since it was founded in 2015. Metabase previously raised $8 million in Series A funding back in 2019, led by NEA.

Metabase was developed within venture studio Expa and spun out as an easy way for people to interact with data sets, co-founder and CEO Sameer Al-Sakran told TechCrunch.

“When someone wants access to data, they may not know what to measure or how to use it, all they know is they have the data,” Al-Sakran said. “We provide a self-service access layer where they can ask a question, Metabase scans the data and they can use the results to build models, create a dashboard and even slice the data in ways they choose without having an analyst build out the database.”

He notes that not much has changed in the business intelligence realm since Tableau came out more than 15 years ago, and that computers can do more for the end user, particularly in anticipating what the user is going to do. Increasingly, open source is how software and information want to be consumed, especially by people who just want to pull the data themselves, he added.

George Mathew, managing director of Insight Partners, believes we are seeing the third generation of business intelligence tools emerge: first centralized enterprise architectures like SAP, then self-service tools like Tableau and Looker, and now companies like Metabase that can get users to discovery and insights quickly.

“The third generation is here and they are leading the charge to insights and value,” Mathew added. “In addition, the world has moved to the cloud, and BI tools need to move there, too. This generation of open source is a better and greater example of all three of those.”

To date, Metabase has been downloaded 98 million times and used by more than 30,000 companies across 200 countries. The company pursued another round of funding after building out a commercial offering, Metabase Enterprise, that is doing well, Al-Sakran said.

The new funding round enables the company to build out a sales team and continue product development on both Metabase Enterprise and Metabase Cloud. Because Metabase is often someone’s first business intelligence tool, Al-Sakran is also doubling down on resources to help educate customers on how to ask questions and learn from their data.

“Open source has changed from floppy disks to projects on the cloud, and we think end users have the right to see what they are running,” Al-Sakran said. “We are continuing to create new features and improve performance and overall experience in efforts to create the BI system of the future.”

Jul 15, 2021

CockroachDB, the database that just won’t die

There is an art to engineering, and sometimes engineering can transform art. For Spencer Kimball and Peter Mattis, those two worlds collided when, as college students at Berkeley, they created the widely successful open-source graphics program GIMP.

That project was so successful that when the two joined Google in 2002, Sergey Brin and Larry Page personally stopped by to tell the new hires how much they liked it and explained how they used the program to create the first Google logo.

Cockroach Labs was started by developers and stays true to its roots to this day.

In terms of good fortune in the corporate hierarchy, when you get this type of recognition in a company such as Google, there’s only one way you can go — up. They went from rising stars to stars at Google, becoming the go-to guys on the Infrastructure Team. They could easily have looked forward to a lifetime of lucrative employment.

But Kimball, Mattis and another Google employee, Ben Darnell, wanted more — a company of their own. To realize their ambitions, they created Cockroach Labs, the business entity behind their ambitious open-source database CockroachDB. Can some of the smartest former engineers in Google’s arsenal upend the world of databases in a market spotted with the gravesites of storage dreams past? That’s what we are here to find out.

Berkeley software distribution

Mattis and Kimball were roommates at Berkeley majoring in computer science in the early-to-mid-1990s. In addition to their usual studies, they also became involved with the eXperimental Computing Facility (XCF), an organization of undergraduates who have a keen, almost obsessive interest in CS.

Jul 15, 2021

How engineers fought the CAP theorem in the global war on latency

CockroachDB was intended to be a global database from the beginning. The founders of Cockroach Labs wanted to ensure that data written in one location would be viewable immediately in another location 10,000 miles away. The use case was simple, but the work needed to make it happen was herculean.

The company is betting the farm that it can solve one of the largest challenges for web-scale applications. The approach it’s taking is clever, but it’s a bit complicated, particularly for the non-technical reader. Given its history and engineering talent, the company appears to be pulling it off and making a big impact on the database market, which makes its technology well worth understanding. In short, there’s value in digging into the details.

Using CockroachDB’s multiregion feature to segment data according to geographic proximity fulfills Cockroach Labs’ primary directive: To get data as close to the user as possible.

In part 1 of this EC-1, I provided a general overview and a look at the origins of Cockroach Labs. In this installment, I’m going to cover the technical details of the technology with an eye to the non-technical reader. I’m going to describe the CockroachDB technology through three questions:

  1. What makes reading and writing data over a global geography so hard?
  2. How does CockroachDB address the problem?
  3. What does it all mean for those using CockroachDB?

What makes reading and writing data over a global geography so hard?

Spencer Kimball, CEO and co-founder of Cockroach Labs, describes the situation this way:

There’s lots of other stuff you need to consider when building global applications, particularly around data management. Take, for example, the question and answer website Quora. Let’s say you live in Australia. You have an account and you store the particulars of your Quora user identity on a database partition in Australia.

But when you post a question, you actually don’t want that data to just be posted in Australia. You want that data to be posted everywhere so that all the answers to all the questions are the same for everybody, anywhere. You don’t want to have a situation where you answer a question in Sydney and then you can see it in Hong Kong, but you can’t see it in the EU. When that’s the case, you end up getting different answers depending where you are. That’s a huge problem.

Reading and writing data over a global geography is challenging for pretty much the same reason that it’s faster to get a pizza delivered from across the street than from across the city. The essential constraints of time and space apply. Whether it’s digital data or a pepperoni pizza, the further away you are from the source, the longer stuff takes to get to you.
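
For a sense of what the multiregion feature mentioned above looks like in practice, here is a rough sketch of configuring data locality in CockroachDB from Python (CockroachDB speaks the PostgreSQL wire protocol, so psycopg2 works). The connection string, database, table and region names are hypothetical, and the exact statements should be checked against the CockroachDB documentation.

```python
import psycopg2  # CockroachDB is compatible with the PostgreSQL wire protocol

# Hypothetical connection string and names; regions must already be defined
# by the localities the cluster's nodes were started with.
conn = psycopg2.connect("postgresql://user:pass@localhost:26257/quora_like")
conn.autocommit = True

with conn.cursor() as cur:
    # Give the database a home region and add the other regions it spans.
    cur.execute('ALTER DATABASE quora_like SET PRIMARY REGION "australia-southeast1"')
    cur.execute('ALTER DATABASE quora_like ADD REGION "europe-west1"')
    cur.execute('ALTER DATABASE quora_like ADD REGION "us-east1"')

    # User identity stays close to the user who owns it...
    cur.execute("ALTER TABLE user_profiles SET LOCALITY REGIONAL BY ROW")
    # ...while posts are readable with low latency from every region.
    cur.execute("ALTER TABLE posts SET LOCALITY GLOBAL")
```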

Jul 15, 2021

Scaling CockroachDB in the red ocean of relational databases

Most database startups avoid building relational databases, since that market is dominated by a few goliaths. Oracle, MySQL and Microsoft SQL Server have embedded themselves into the technical fabric of large- and medium-size companies going back decades. These established companies have a lot of market share and a lot of money to quash the competition.

So rather than trying to compete in the relational database market, over the past decade, many database startups focused on alternative architectures such as document-centric databases (like MongoDB), key-value stores (like Redis) and graph databases (like Neo4J). But Cockroach Labs went against conventional wisdom with CockroachDB: It intentionally competed in the relational database market with its relational database product.

While it did face an uphill battle to penetrate the market, Cockroach Labs saw a surprising benefit: It didn’t have to invent a market. All it needed to do was grab a share of a market that also happened to be growing rapidly.

Cockroach Labs has a bright future, compelling technology, a lot of money in the bank and an experienced, technically astute executive team.

In previous parts of this EC-1, I looked at the origins of CockroachDB, presented an in-depth technical description of its product as well as an analysis of the company’s developer relations and cloud service, CockroachCloud. In this final installment, we’ll look at the future of the company, the competitive landscape within the relational database market, its ability to retain talent as it looks toward a potential IPO or acquisition, and the risks it faces.

CockroachDB’s success is not guaranteed. It has to overcome significant hurdles to secure a profitable place for itself among a set of well-established database technologies that are owned by companies with very deep pockets.

It’s not impossible, though. We’ll first look at MongoDB as an example of how a company can break through the barriers for database startups competing with incumbents.

When life gives you Mongos, make MongoDB

Dev Ittycheria, MongoDB CEO, rings the Nasdaq Stock Market Opening Bell. Image Credits: Nasdaq, Inc

MongoDB is a good example of the risks that come with trying to invent a new database market. The company started out as a purely document-centric database at a time when that approach was the exception rather than the rule.

Web developers like document-centric databases because they address a number of common use cases in their work. For example, a document-centric database works well for storing comments to a blog post or a customer’s entire order history and profile.
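
As a rough illustration of that fit, the sketch below uses pymongo to store a blog post and its comments as a single document, so one read returns everything needed to render the page. The connection string, collection and field names are all hypothetical.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical local instance
posts = client.blog.posts

# One document holds the post and its embedded comments, so a single read
# fetches everything needed to render the page.
posts.insert_one({
    "title": "Why document stores fit web workloads",
    "author": "jdoe",
    "body": "Document databases keep related data together...",
    "comments": [
        {"user": "alice", "text": "Great overview!"},
        {"user": "bob", "text": "What about transactions?"},
    ],
})

post = posts.find_one({"title": "Why document stores fit web workloads"})
```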

Jul 1, 2021

To guard against data loss and misuse, the cybersecurity conversation must evolve

Data breaches have become a part of life. They impact hospitals, universities, government agencies, charitable organizations and commercial enterprises. In healthcare alone, 2020 saw 640 breaches, exposing 30 million personal records, a 25% increase over 2019 that equates to roughly two breaches per day, according to the U.S. Department of Health and Human Services. On a global basis, 2.3 billion records were breached in February 2021.

It’s painfully clear that existing data loss prevention (DLP) tools are struggling to deal with the data sprawl, ubiquitous cloud services, device diversity and human behaviors that constitute our virtual world.

Conventional DLP solutions are built on a castle-and-moat framework in which data centers and cloud platforms are the castles holding sensitive data. They’re surrounded by networks, endpoint devices and human beings that serve as moats, defining the defensive security perimeters of every organization. Conventional solutions assign sensitivity ratings to individual data assets and monitor these perimeters to detect the unauthorized movement of sensitive data.
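
Stripped to its essentials, that castle-and-moat model amounts to something like the toy sketch below: a sensitivity label per asset and a rule that fires when a labeled asset is about to leave the perimeter. It illustrates the general pattern, not any particular vendor’s product, and every name in it is made up.

```python
# Toy sketch of the castle-and-moat DLP model described above.
# Asset names, labels and domains are hypothetical.
SENSITIVITY = {
    "customer_pii.csv": "restricted",
    "q3_board_deck.pdf": "confidential",
    "press_release.docx": "public",
}

TRUSTED_DOMAINS = {"corp.example.com"}  # inside the "moat"

def flag_transfer(asset: str, destination_domain: str) -> bool:
    """Return True if moving this asset outside the perimeter should be blocked."""
    label = SENSITIVITY.get(asset, "unclassified")
    leaving_perimeter = destination_domain not in TRUSTED_DOMAINS
    return leaving_perimeter and label in {"restricted", "confidential"}

print(flag_transfer("customer_pii.csv", "personal-mail.example.net"))  # True
print(flag_transfer("press_release.docx", "news.example.org"))         # False
```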

Unfortunately, these historical security boundaries are becoming increasingly ambiguous and somewhat irrelevant as bots, APIs and collaboration tools become the primary conduits for sharing and exchanging data.

In reality, data loss is only half the problem confronting a modern enterprise. Corporations are routinely exposed to financial, legal and ethical risks associated with the mishandling or misuse of sensitive information within the corporation itself. The risks associated with the misuse of personally identifiable information have been widely publicized.

However, risks of similar or greater severity can result from the mishandling of intellectual property, material nonpublic information, or any type of data that was obtained through a formal agreement that placed explicit restrictions on its use.

Conventional DLP frameworks are incapable of addressing these challenges. We believe they need to be replaced by a new data misuse protection (DMP) framework that safeguards data from unauthorized or inappropriate use within a corporate environment in addition to its outright theft or inadvertent loss. DMP solutions will provide data assets with more sophisticated self-defense mechanisms instead of relying on the surveillance of traditional security perimeters.

Jun 2, 2021

Stemma launches with $4.8M seed to build managed data catalogue

As companies increasingly rely on data to run their businesses, having accurate sources of data becomes paramount. Stemma, a new early-stage startup, has come up with a solution: a managed data catalogue that acts as an organization’s source of truth.

Today the company announced a $4.8 million seed investment led by Sequoia with assorted individual tech luminaries also participating. The product is also available for the first time today.

Company co-founder and CEO Mark Grover says the product is actually built on top of the open-source Amundsen data catalogue project that he helped launch at Lyft to manage its massive data requirements. The problem was that, with so much data, employees had to kludge together systems to confirm the data’s validity. Ultimately, manual processes like asking someone in Slack or maintaining a wiki failed under the weight of trying to keep up with the volume and velocity of the data.

“I saw this problem firsthand at Lyft, which led me to create the open-source Amundsen project with a team of talented engineers,” Grover said. That project is used by 750 people at Lyft every week, and since it was open-sourced, 35 companies, including Brex, Snap and Asana, have adopted it.

What Stemma offers is a managed version of Amundsen that adds functionality like using intelligence to show data that’s meaningful to the person who is searching in the catalogue. It also can add metadata automatically to data as it’s added to the catalogue, creating documentation about the data on the fly, among other features.

The company launched last fall when Grover and co-founder and CTO Dorian Johnson decided to join forces and create a commercial product on top of Amundsen. Grover points out that Lyft was supportive of the move.

Today the company has five employees in addition to the founders and plans to add several more this year. As it hires, Grover says he is cognizant of diversity and inclusion. “I think it’s super important that we continue to invest in diversity, and the two ways that I think are the most meaningful for us right now is to have early employees that are from diverse groups, and that is the case within the first five,” he said. Beyond that, he says that as the company grows he wants to improve the ratio, while also looking at diversity in investors, board members and executives.

The company, which launched during COVID, is entirely remote right now and plans to remain that way for at least the short term. As it grows, the company will look at ways to build camaraderie, like organizing a regular cadence of employee offsite events.

Apr 30, 2021

Analytics as a service: Why more enterprises should consider outsourcing

With an increasing number of enterprise systems, growing teams, the expanding reach of the web and multiple digital initiatives, companies of all sizes are creating loads of data every day. This data holds valuable business insights and immense opportunity, but its sheer volume has made it impossible for companies to derive actionable insights from it consistently.

According to Verified Market Research, the analytics-as-a-service (AaaS) market is expected to grow to $101.29 billion by 2026. Organizations that have not started on their analytics journey or are spending scarce data engineer resources to resolve issues with analytics implementations are not identifying actionable data insights. Through AaaS, managed services providers (MSPs) can help organizations get started on their analytics journey immediately without extravagant capital investment.

MSPs can take ownership of the company’s immediate data analytics needs, resolve ongoing challenges and integrate new data sources to manage dashboard visualizations, reporting and predictive modeling — enabling companies to make data-driven decisions every day.

AaaS can come bundled with multiple business-intelligence-related services, primarily (1) data warehouse services; (2) visualization and reporting services; and (3) predictive analytics, artificial intelligence (AI) and machine learning (ML) services. When a company partners with an MSP for analytics as a service, it can tap into business intelligence easily, instantly and at a lower cost of ownership than doing it in-house. This empowers the enterprise to focus on delivering better customer experiences, make decisions unencumbered and build data-driven strategies.

In today’s world, where customers value experiences over transactions, AaaS helps businesses dig deeper into the customer psyche and tap insights to build long-term winning strategies. It also enables enterprises to forecast business trends from their data and allows employees at every level to make informed decisions.

Apr 16, 2021

Data scientists: Bring the narrative to the forefront

By 2025, 463 exabytes of data will be created each day, according to some estimates. (For perspective, one exabyte of storage could hold 50,000 years of DVD-quality video.) It’s now easier than ever to translate physical and digital actions into data, and businesses of all types have raced to amass as much data as possible in order to gain a competitive edge.
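
That comparison holds up as a back-of-the-envelope calculation, assuming a single-layer DVD stores about 4.7 GB and roughly two hours of video:

```python
# Rough sanity check of "one exabyte ≈ 50,000 years of DVD-quality video".
# Assumes ~4.7 GB per single-layer DVD holding ~2 hours of video.
dvd_bytes = 4.7e9
dvd_hours = 2
exabyte = 1e18

hours_of_video = exabyte / dvd_bytes * dvd_hours
years_of_video = hours_of_video / (24 * 365)
print(round(years_of_video))  # roughly 48,600 years, on the order of 50,000
```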

However, in our collective infatuation with data (and obtaining more of it), what’s often overlooked is the role that storytelling plays in extracting real value from data.

The reality is that data by itself is insufficient to really influence human behavior. Whether the goal is to improve a business’ bottom line or convince people to stay home amid a pandemic, it’s the narrative that compels action, rather than the numbers alone. As more data is collected and analyzed, communication and storytelling will become even more integral in the data science discipline because of their role in separating the signal from the noise.

Data alone doesn’t spur innovation — rather, it’s data-driven storytelling that helps uncover hidden trends, powers personalization, and streamlines processes.

Yet this can be an area where data scientists struggle. In Anaconda’s 2020 State of Data Science survey of more than 2,300 data scientists, nearly a quarter of respondents said that their data science or machine learning (ML) teams lacked communication skills. This may be one reason why roughly 40% of respondents said they were able to effectively demonstrate business impact “only sometimes” or “almost never.”

The best data practitioners must be as skilled in storytelling as they are in coding and deploying models — and yes, this extends beyond creating visualizations to accompany reports. Here are some recommendations for how data scientists can situate their results within larger contextual narratives.

Make the abstract more tangible

Ever-growing datasets help machine learning models better understand the scope of a problem space, but more data does not necessarily help with human comprehension. Even for the most left-brain of thinkers, it’s not in our nature to understand large abstract numbers or things like marginal improvements in accuracy. This is why it’s important to include points of reference in your storytelling that make data tangible.

For example, throughout the pandemic, we’ve been bombarded with countless statistics around case counts, death rates, positivity rates, and more. While all of this data is important, tools like interactive maps and conversations around reproduction numbers are more effective than massive data dumps in terms of providing context, conveying risk, and, consequently, helping change behaviors as needed. In working with numbers, data practitioners have a responsibility to provide the necessary structure so that the data can be understood by the intended audience.

Apr 13, 2021

Meroxa raises $15M Series A for its real-time data platform

Meroxa, a startup that makes it easier for businesses to build the data pipelines to power both their analytics and operational workflows, today announced that it has raised a $15 million Series A funding round led by Drive Capital. Existing investors Root, Amplify and Hustle Fund also participated in this round, which together with the company’s previously undisclosed $4.2 million seed round now brings total funding in the company to $19.2 million.

The promise of Meroxa is that businesses can use a single platform for their various data needs and won’t need a team of experts to build their infrastructure and then manage it. At its core, Meroxa provides a single software-as-a-service solution that connects relational databases to data warehouses and then helps businesses operationalize that data.

“The interesting thing is that we are focusing squarely on relational and NoSQL databases into data warehouse,” Meroxa co-founder and CEO DeVaris Brown told me. “Honestly, people come to us as a real-time FiveTran or real-time data warehouse sink. Because, you know, the industry has moved to this [extract, load, transform] format. But the beautiful part about us is, because we do change data capture, we get that granular data as it happens.” And businesses want this very granular data to be reflected inside of their data warehouses, Brown noted, but he also stressed that Meroxa can expose this stream of data as an API endpoint or point it to a Webhook.

The company is able to do this because its core architecture is somewhat different from that of other data pipeline and integration services that, at first glance, seem to offer a similar solution. Customers can use the service not only to connect different tools to their data warehouse but also to build real-time tools on top of these data streams.

“We aren’t a point-to-point solution,” Meroxa co-founder and CTO Ali Hamidi explained. “When you set up the connection, you aren’t taking data from Postgres and only putting it into Snowflake. What’s really happening is that it’s going into our intermediate stream. Once it’s in that stream, you can then start hanging off connectors and say, ‘Okay, well, I also want to peek into the stream, I want to transfer my data, I want to filter out some things, I want to put it into S3.’ ”

With this flexibility, Hamidi noted, a lot of the company’s customers start with a pretty standard use case and then quickly expand into other areas as well.
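
Hamidi’s description of an intermediate stream with connectors hanging off it corresponds roughly to the fan-out pattern sketched below. This is a generic illustration of the idea, not Meroxa’s actual API, and every name in it is hypothetical.

```python
from typing import Callable, Dict, List

# Generic fan-out over a change-data-capture stream: every row-level change
# enters one intermediate stream, and any number of sinks hang off it.
Event = Dict
Sink = Callable[[Event], None]

class ChangeStream:
    def __init__(self) -> None:
        self.sinks: List[Sink] = []

    def attach(self, sink: Sink) -> None:
        """Hang another connector off the stream (warehouse, S3, webhook, ...)."""
        self.sinks.append(sink)

    def publish(self, event: Event) -> None:
        """Deliver a change event to every attached sink."""
        for sink in self.sinks:
            sink(event)

stream = ChangeStream()
stream.attach(lambda e: print("-> warehouse:", e))     # e.g. a Snowflake loader
stream.attach(lambda e: print("-> object store:", e))  # e.g. an S3 archiver
stream.publish({"table": "orders", "op": "insert", "id": 42})
```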

Brown and Hamidi met during their time at Heroku, where Brown was a director of product management and Hamidi a lead software engineer. But while Heroku made it very easy for developers to publish their web apps, there wasn’t anything comparable in the highly fragmented database space. The team acknowledges that there are a lot of tools that aim to solve these data problems, but few of them focus on the user experience.

“When we talk to customers now, it’s still very much an unsolved problem,” Hamidi said. “It seems kind of insane to me that this is such a common thing and there is no ‘oh, of course you use this tool because it addresses all my problems.’ And so the angle that we’re taking is that we see user experience not as a nice-to-have, it’s really an enabler, it is something that enables a software engineer or someone who isn’t a data engineer with 10 years of experience in wrangling Kafka and Postgres and all these things. […] That’s a transformative kind of change.”

It’s worth noting that Meroxa uses a lot of open-source tools, but the company has also committed to open-sourcing everything in its data plane. “This has multiple wins for us, but one of the biggest incentives is in terms of the customer, we’re really committed to having our agenda aligned. Because if we don’t do well, we don’t serve the customer. If we do a crappy job, they can just keep all of those components and run it themselves,” Hamidi explained.

Today, Meroxa, which the team founded in early 2020, has more than 24 employees (and is 100% remote). “I really think we’re building one of the most talented and most inclusive teams possible,” Brown told me. “Inclusion and diversity are very, very high on our radar. Our team is 50% black and brown. Over 40% are women. Our management team is 90% underrepresented. So not only are we building a great product, we’re building a great company, we’re building a great business.”  
