Why Daimler moved its big data platform to the cloud

Like virtually every big enterprise company, a few years ago, the German auto giant Daimler decided to invest in its own on-premises data centers. And while those aren’t going away anytime soon, the company today announced that it has successfully moved its on-premises big data platform to Microsoft’s Azure cloud. This new platform, which the company calls eXtollo, is Daimler’s first major service to run outside of its own data centers, though it’ll probably not be the last.

As Daimler’s head of its corporate center of excellence for advanced analytics and big data Guido Vetter told me, the company started getting interested in big data about five years ago. “We invested in technology — the classical way, on-premise — and got a couple of people on it. And we were investigating what we could do with data because data is transforming our whole business as well,” he said.

By 2016, the size of the organization had grown to the point where a more formal structure was needed to enable the company to handle its data at a global scale. At the time, the buzz phrase was “data lakes” and the company started building its own in order to build out its analytics capacities.

Electric lineup, Daimler AG

“Sooner or later, we hit the limits as it’s not our core business to run these big environments,” Vetter said. “Flexibility and scalability are what you need for AI and advanced analytics and our whole operations are not set up for that. Our backend operations are set up for keeping a plant running and keeping everything safe and secure.” But in this new world of enterprise IT, companies need to be able to be flexible and experiment — and, if necessary, throw out failed experiments quickly.

So about a year and a half ago, Vetter’s team started the eXtollo project to bring all the company’s activities around advanced analytics, big data and artificial intelligence into the Azure Cloud, and just over two weeks ago, the team shut down its last on-premises servers after slowly turning on its solutions in Microsoft’s data centers in Europe, the U.S. and Asia. All in all, the actual transition between the on-premises data centers and the Azure cloud took about nine months. That may not seem fast, but for an enterprise project like this, that’s about as fast as it gets (and for a while, it fed all new data into both its on-premises data lake and Azure).

If you work for a startup, then all of this probably doesn’t seem like a big deal, but for a more traditional enterprise like Daimler, even just giving up control over the physical hardware where your data resides was a major culture change and something that took quite a bit of convincing. In the end, the solution came down to encryption.

“We needed the means to secure the data in the Microsoft data center with our own means that ensure that only we have access to the raw data and work with the data,” explained Vetter. In the end, the company decided to use the Azure Key Vault to manage and rotate its encryption keys. Indeed, Vetter noted that knowing that the company had full control over its own data was what allowed this project to move forward.

Vetter tells me the company obviously looked at Microsoft’s competitors as well, but he noted that his team didn’t find a compelling offer from other vendors in terms of functionality and the security features that it needed.

Today, Daimler’s big data unit uses tools like HD Insights and Azure Databricks, which covers more than 90 percents of the company’s current use cases. In the future, Vetter also wants to make it easier for less experienced users to use self-service tools to launch AI and analytics services.

While cost is often a factor that counts against the cloud, because renting server capacity isn’t cheap, Vetter argues that this move will actually save the company money and that storage costs, especially, are going to be cheaper in the cloud than in its on-premises data center (and chances are that Daimler, given its size and prestige as a customer, isn’t exactly paying the same rack rate that others are paying for the Azure services).

As with so many big data AI projects, predictions are the focus of much of what Daimler is doing. That may mean looking at a car’s data and error code and helping the technician diagnose an issue or doing predictive maintenance on a commercial vehicle. Interestingly, the company isn’t currently bringing to the cloud any of its own IoT data from its plants. That’s all managed in the company’s on-premises data centers because it wants to avoid the risk of having to shut down a plant because its tools lost the connection to a data center, for example.


Microsoft Azure sets its sights on more analytics workloads

Enterprises now amass huge amounts of data, both from their own tools and applications, as well as from the SaaS applications they use. For a long time, that data was basically exhaust. Maybe it was stored for a while to fulfill some legal requirements, but then it was discarded. Now, data is what drives machine learning models, and the more data you have, the better. It’s maybe no surprise, then, that the big cloud vendors started investing in data warehouses and lakes early on. But that’s just a first step. After that, you also need the analytics tools to make all of this data useful.

Today, it’s Microsoft turn to shine the spotlight on its data analytics services. The actual news here is pretty straightforward. Two of these are services that are moving into general availability: the second generation of Azure Data Lake Storage for big data analytics workloads and Azure Data Explorer, a managed service that makes easier ad-hoc analysis of massive data volumes. Microsoft is also previewing a new feature in Azure Data Factory, its graphical no-code service for building data transformation. Data Factory now features the ability to map data flows.

Those individual news pieces are interesting if you are a user or are considering Azure for your big data workloads, but what’s maybe more important here is that Microsoft is trying to offer a comprehensive set of tools for managing and storing this data — and then using it for building analytics and AI services.

(Photo credit:Josh Edelson/AFP/Getty Images)

“AI is a top priority for every company around the globe,” Julia White, Microsoft’s corporate VP for Azure, told me. “And as we are working with our customers on AI, it becomes clear that their analytics often aren’t good enough for building an AI platform.” These companies are generating plenty of data, which then has to be pulled into analytics systems. She stressed that she couldn’t remember a customer conversation in recent months that didn’t focus on AI. “There is urgency to get to the AI dream,” White said, but the growth and variety of data presents a major challenge for many enterprises. “They thought this was a technology that was separate from their core systems. Now it’s expected for both customer-facing and line-of-business applications.”

Data Lake Storage helps with managing this variety of data since it can handle both structured and unstructured data (and is optimized for the Spark and Hadoop analytics engines). The service can ingest any kind of data — yet Microsoft still promises that it will be very fast. “The world of analytics tended to be defined by having to decide upfront and then building rigid structures around it to get the performance you wanted,” explained White. Data Lake Storage, on the other hand, wants to offer the best of both worlds.

Likewise, White argued that while many enterprises used to keep these services on their on-premises servers, many of them are still appliance-based. But she believes the cloud has now reached the point where the price/performance calculations are in its favor. It took a while to get to this point, though, and to convince enterprises. White noted that for the longest time, enterprises that looked at their analytics projects thought $300 million projects took forever, tied up lots of people and were frankly a bit scary. “But also, what we had to offer in the cloud hasn’t been amazing until some of the recent work,” she said. “We’ve been on a journey — as well as the other cloud vendors — and the price performance is now compelling.” And it sure helps that if enterprises want to meet their AI goals, they’ll now have to tackle these workloads, too.


Big companies are not becoming data-driven fast enough

I remember watching MIT professor Andrew McAfee years ago telling stories about the importance of data over gut feeling, whether it was predicting successful wines or making sound business decisions. We have been hearing about big data and data-driven decision making for so long, you would think it has become hardened into our largest organizations by now. As it turns out, new research by NewVantage Partners finds that most large companies are having problems implementing an organization-wide, data-driven strategy.

McAfee was fond of saying that before the data deluge we have today, the way most large organizations made decisions was via the HiPPO — the highest paid person’s opinion. Then he would chide the audience that this was not the proper way to run your business. Data, not gut feelings, even those based on experience, should drive important organizational decisions.

While companies haven’t failed to recognize McAfee’s advice, the NVP report suggests they are having problems implementing data-driven decision making across organizations. There are plenty of technological solutions out there today to help them, from startups all the way to the largest enterprise vendors, but the data (see, you always need to go back to the data) suggests that it’s not a technology problem, it’s a people problem.

Executives can have farsighted vision that their organizations need to be data-driven. They can acquire all of the latest solutions to bring data to the forefront, but unless they combine that with a broad cultural shift and a deep understanding of how to use that data inside business processes, they will continue to struggle.

The study’s authors, Randy Bean and Thomas H. Davenport, wrote about the people problem in their study’s executive summary. “We hear little about initiatives devoted to changing human attitudes and behaviors around data. Unless the focus shifts to these types of activities, we are likely to see the same problem areas in the future that we’ve observed year after year in this survey.”

The survey found that 72 percent of respondents have failed in this regard, reporting they haven’t been able to create a data-driven culture, whatever that means to individual respondents. Meanwhile, 69 percent reported they had failed to create a data-driven organization, although it would seem that these two metrics would be closely aligned.

Perhaps most discouraging of all is that the data is trending the wrong way. Over the last several years, the report’s authors say that those organizations calling themselves data-driven has actually dropped each year from 37.1 percent in 2017 to 32.4 percent in 2018 to 31.0 percent in the latest survey.

This matters on so many levels, but consider that as companies shift to artificial intelligence and machine learning, these technologies rely on abundant amounts of data to work effectively. What’s more, every organization, regardless of its size, is generating vast amounts of data, simply as part of being a digital business in the 21st century. They need to find a way to control this data to make better decisions and understand their customers better. It’s essential.

There is so much talk about innovation and disruption, and understanding and affecting company culture, but so much of all this is linked. You need to be more agile. You need to be more digital. You need to be transformational. You need to be all of these things — and data is at the center of all of it.

Data has been called the new oil often enough to be cliché, but these results reveal that the lesson is failing to get through. Companies need to be data-driven now, this instant. This isn’t something to be working toward at this point. This is something you need to be doing, unless your ultimate goal is to become irrelevant.


Databricks raises $250M at a $2.75B valuation for its analytics platform

Databricks, the company founded by the original team behind the Apache Spark big data analytics engine, today announced that it has raised a $250 million Series E round led by Andreessen Horowitz. Coatue Management, Green Bay Ventures, Microsoft and NEA, also participated in this round, which brings the company’s total funding to $498.5 million. Microsoft’s involvement here is probably a bit of a surprise, but it’s worth noting that it also worked with Databricks on the launch of Azure Databricks as a first-party service on the platform, something that’s still a rarity in the Azure cloud.

As Databricks also today announced, its annual recurring revenue now exceeds $100 million. The company didn’t share whether it’s cash flow-positive at this point, but Databricks CEO and co-founder Ali Ghodsi shared that the company’s valuation is now $2.75 billion.

Current customers, which the company says number around 2,000, include the likes of Nielsen, Hotels.com, Overstock, Bechtel, Shell and HP.

“What Ali and the Databricks team have built is truly phenomenal,” Green Bay Ventures co-founder Anthony Schiller told me. “Their success is a testament to product innovation at the highest level. Databricks is without question best-in-class and their impact on the industry proves it. We were thrilled to participate in this round.”

While Databricks is obviously known for its contributions to Apache Spark, the company itself monetizes that work by offering its Unified Analytics platform on top of it. This platform allows enterprises to build their data pipelines across data storage systems and prepare data sets for data scientists and engineers. To do this, Databricks offers shared notebooks and tools for building, managing and monitoring data pipelines, and then uses that data to build machine learning models, for example. Indeed, training and deploying these models is one of the company’s focus areas these days, which makes sense, given that this is one of the main use cases for big data, after all.

On top of that, Databricks also offers a fully managed service for hosting all of these tools.

“Databricks is the clear winner in the big data platform race,” said Ben Horowitz, co-founder and general partner at Andreessen Horowitz, in today’s announcement. “In addition, they have created a new category atop their world-beating Apache Spark platform called Unified Analytics that is growing even faster. As a result, we are thrilled to invest in this round.”

Ghodsi told me that Horowitz was also instrumental in getting the company to re-focus on growth. The company was already growing fast, of course, but Horowitz asked him why Databricks wasn’t growing faster. Unsurprisingly, given that it’s an enterprise company, that means aggressively hiring a larger sales force — and that’s costly. Hence the company’s need to raise at this point.

As Ghodsi told me, one of the areas the company wants to focus on is the Asia Pacific region, where overall cloud usage is growing fast. The other area the company is focusing on is support for more verticals like mass media and entertainment, federal agencies and fintech firms, which also comes with its own cost, given that the experts there don’t come cheap.

Ghodsi likes to call this “boring AI,” since it’s not as exciting as self-driving cars. In his view, though, the enterprise companies that don’t start using machine learning now will inevitably be left behind in the long run. “If you don’t get there, there’ll be no place for you in the next 20 years,” he said.

Engineering, of course, will also get a chunk of this new funding, with an emphasis on relatively new products like MLFlow and Delta, two tools Databricks recently developed and that make it easier to manage the life cycle of machine learning models and build the necessary data pipelines to feed them.


Has the fight over privacy changed at all in 2019?

Few issues divide the tech community quite like privacy. Much of Silicon Valley’s wealth has been built on data-driven advertising platforms, and yet, there remain constant concerns about the invasiveness of those platforms.

Such concerns have intensified in just the last few weeks as France’s privacy regulator placed a record fine on Google under Europe’s General Data Protection Regulation (GDPR) rules which the company now plans to appeal. Yet with global platform usage and service sales continuing to tick up, we asked a panel of eight privacy experts: “Has anything fundamentally changed around privacy in tech in 2019? What is the state of privacy and has the outlook changed?” 

This week’s participants include:

TechCrunch is experimenting with new content forms. Consider this a recurring venue for debate, where leading experts – with a diverse range of vantage points and opinions – provide us with thoughts on some of the biggest issues currently in tech, startups and venture. If you have any feedback, please reach out: Arman.Tabatabai@techcrunch.com.

Thoughts & Responses:

Albert Gidari

Albert Gidari is the Consulting Director of Privacy at the Stanford Center for Internet and Society. He was a partner for over 20 years at Perkins Coie LLP, achieving a top-ranking in privacy law by Chambers, before retiring to consult with CIS on its privacy program. He negotiated the first-ever “privacy by design” consent decree with the Federal Trade Commission. A recognized expert on electronic surveillance law, he brought the first public lawsuit before the Foreign Intelligence Surveillance Court, seeking the right of providers to disclose the volume of national security demands received and the number of affected user accounts, ultimately resulting in greater public disclosure of such requests.

There is no doubt that the privacy environment changed in 2018 with the passage of California’s Consumer Privacy Act (CCPA), implementation of the European Union’s General Data Protection Regulation (GDPR), and new privacy laws enacted around the globe.

“While privacy regulation seeks to make tech companies betters stewards of the data they collect and their practices more transparent, in the end, it is a deception to think that users will have more “privacy.””

For one thing, large tech companies have grown huge privacy compliance organizations to meet their new regulatory obligations. For another, the major platforms now are lobbying for passage of a federal privacy law in the U.S. This is not surprising after a year of privacy miscues, breaches and negative privacy news. But does all of this mean a fundamental change is in store for privacy? I think not.

The fundamental model sustaining the Internet is based upon the exchange of user data for free service. As long as advertising dollars drive the growth of the Internet, regulation simply will tinker around the edges, setting sideboards to dictate the terms of the exchange. The tech companies may be more accountable for how they handle data and to whom they disclose it, but the fact is that data will continue to be collected from all manner of people, places and things.

Indeed, if the past year has shown anything it is that two rules are fundamental: (1) everything that can be connected to the Internet will be connected; and (2) everything that can be collected, will be collected, analyzed, used and monetized. It is inexorable.

While privacy regulation seeks to make tech companies betters stewards of the data they collect and their practices more transparent, in the end, it is a deception to think that users will have more “privacy.” No one even knows what “more privacy” means. If it means that users will have more control over the data they share, that is laudable but not achievable in a world where people have no idea how many times or with whom they have shared their information already. Can you name all the places over your lifetime where you provided your SSN and other identifying information? And given that the largest data collector (and likely least secure) is government, what does control really mean?

All this is not to say that privacy regulation is futile. But it is to recognize that nothing proposed today will result in a fundamental shift in privacy policy or provide a panacea of consumer protection. Better privacy hygiene and more accountability on the part of tech companies is a good thing, but it doesn’t solve the privacy paradox that those same users who want more privacy broadly share their information with others who are less trustworthy on social media (ask Jeff Bezos), or that the government hoovers up data at rate that makes tech companies look like pikers (visit a smart city near you).

Many years ago, I used to practice environmental law. I watched companies strive to comply with new laws intended to control pollution by creating compliance infrastructures and teams aimed at preventing, detecting and deterring violations. Today, I see the same thing at the large tech companies – hundreds of employees have been hired to do “privacy” compliance. The language is the same too: cradle to grave privacy documentation of data flows for a product or service; audits and assessments of privacy practices; data mapping; sustainable privacy practices. In short, privacy has become corporatized and industrialized.

True, we have cleaner air and cleaner water as a result of environmental law, but we also have made it lawful and built businesses around acceptable levels of pollution. Companies still lawfully dump arsenic in the water and belch volatile organic compounds in the air. And we still get environmental catastrophes. So don’t expect today’s “Clean Privacy Law” to eliminate data breaches or profiling or abuses.

The privacy world is complicated and few people truly understand the number and variety of companies involved in data collection and processing, and none of them are in Congress. The power to fundamentally change the privacy equation is in the hands of the people who use the technology (or choose not to) and in the hands of those who design it, and maybe that’s where it should be.

Gabriel Weinberg

Gabriel Weinberg is the Founder and CEO of privacy-focused search engine DuckDuckGo.

Coming into 2019, interest in privacy solutions is truly mainstream. There are signs of this everywhere (media, politics, books, etc.) and also in DuckDuckGo’s growth, which has never been faster. With solid majorities now seeking out private alternatives and other ways to be tracked less online, we expect governments to continue to step up their regulatory scrutiny and for privacy companies like DuckDuckGo to continue to help more people take back their privacy.

“Consumers don’t necessarily feel they have anything to hide – but they just don’t want corporations to profit off their personal information, or be manipulated, or unfairly treated through misuse of that information.”

We’re also seeing companies take action beyond mere regulatory compliance, reflecting this new majority will of the people and its tangible effect on the market. Just this month we’ve seen Apple’s Tim Cook call for stronger privacy regulation and the New York Times report strong ad revenue in Europe after stopping the use of ad exchanges and behavioral targeting.

At its core, this groundswell is driven by the negative effects that stem from the surveillance business model. The percentage of people who have noticed ads following them around the Internet, or who have had their data exposed in a breach, or who have had a family member or friend experience some kind of credit card fraud or identity theft issue, reached a boiling point in 2018. On top of that, people learned of the extent to which the big platforms like Google and Facebook that collect the most data are used to propagate misinformation, discrimination, and polarization. Consumers don’t necessarily feel they have anything to hide – but they just don’t want corporations to profit off their personal information, or be manipulated, or unfairly treated through misuse of that information. Fortunately, there are alternatives to the surveillance business model and more companies are setting a new standard of trust online by showcasing alternative models.

Melika Carroll

Melika Carroll is Senior Vice President, Global Government Affairs at Internet Association, which represents over 45 of the world’s leading internet companies, including Google, Facebook, Amazon, Twitter, Uber, Airbnb and others.

We support a modern, national privacy law that provides people meaningful control over the data they provide to companies so they can make the most informed choices about how that data is used, seen, and shared.

“Any national privacy framework should provide the same protections for people’s data across industries, regardless of whether it is gathered offline or online.”

Internet companies believe all Americans should have the ability to access, correct, delete, and download the data they provide to companies.

Americans will benefit most from a federal approach to privacy – as opposed to a patchwork of state laws – that protects their privacy regardless of where they live. If someone in New York is video chatting with their grandmother in Florida, they should both benefit from the same privacy protections.

It’s also important to consider that all companies – both online and offline – use and collect data. Any national privacy framework should provide the same protections for people’s data across industries, regardless of whether it is gathered offline or online.

Two other important pieces of any federal privacy law include user expectations and the context in which data is shared with third parties. Expectations may vary based on a person’s relationship with a company, the service they expect to receive, and the sensitivity of the data they’re sharing. For example, you expect a car rental company to be able to track the location of the rented vehicle that doesn’t get returned. You don’t expect the car rental company to track your real-time location and sell that data to the highest bidder. Additionally, the same piece of data can have different sensitivities depending on the context in which it’s used or shared. For example, your name on a business card may not be as sensitive as your name on the sign in sheet at an addiction support group meeting.

This is a unique time in Washington as there is bipartisan support in both chambers of Congress as well as in the administration for a federal privacy law. Our industry is committed to working with policymakers and other stakeholders to find an American approach to privacy that protects individuals’ privacy and allows companies to innovate and develop products people love.

Johnny Ryan

Dr. Johnny Ryan FRHistS is Chief Policy & Industry Relations Officer at Brave. His previous roles include Head of Ecosystem at PageFair, and Chief Innovation Officer of The Irish Times. He has a PhD from the University of Cambridge, and is a Fellow of the Royal Historical Society.

Tech companies will probably have to adapt to two privacy trends.

“As lawmakers and regulators in Europe and in the United States start to think of “purpose specification” as a tool for anti-trust enforcement, tech giants should beware.”

First, the GDPR is emerging as a de facto international standard.

In the coming years, the application of GDPR-like laws for commercial use of consumers’ personal data in the EU, Britain (post-EU), Japan, India, Brazil, South Korea, Malaysia, Argentina, and China will bring more than half of global GDP under a similar standard.

Whether this emerging standard helps or harms United States firms will be determined by whether the United States enacts and actively enforces robust federal privacy laws. Unless there is a federal GDPR-like law in the United States, there may be a degree of friction and the potential of isolation for United States companies.

However, there is an opportunity in this trend. The United States can assume the global lead by doing two things. First, enact a federal law that borrows from the GDPR, including a comprehensive definition of “personal data”, and robust “purpose specification”. Second, invest in world-leading regulation that pursues test cases, and defines practical standards. Cutting edge enforcement of common principles-based standards is de facto leadership.

Second, privacy and antitrust law are moving closer to each other, and might squeeze big tech companies very tightly indeed.

Big tech companies “cross-use” user data from one part of their business to prop up others. The result is that a company can leverage all the personal information accumulated from its users in one line of business, and for one purpose, to dominate other lines of business too.

This is likely to have anti-competitive effects. Rather than competing on the merits, the company can enjoy the unfair advantage of massive network effects even though it may be starting from scratch in a new line of business. This stifles competition and hurts innovation and consumer choice.

Antitrust authorities in other jurisdictions have addressed this. In 2015, the Belgian National Lottery was fined for re-using personal information acquired through its monopoly for a different, and incompatible, line of business.

As lawmakers and regulators in Europe and in the United States start to think of “purpose specification” as a tool for anti-trust enforcement, tech giants should beware.

John Miller

John Miller is the VP for Global Policy and Law at the Information Technology Industry Council (ITI), a D.C. based advocate group for the high tech sector.  Miller leads ITI’s work on cybersecurity, privacy, surveillance, and other technology and digital policy issues.

Data has long been the lifeblood of innovation. And protecting that data remains a priority for individuals, companies and governments alike. However, as times change and innovation progresses at a rapid rate, it’s clear the laws protecting consumers’ data and privacy must evolve as well.

“Data has long been the lifeblood of innovation. And protecting that data remains a priority for individuals, companies and governments alike.”

As the global regulatory landscape shifts, there is now widespread agreement among business, government, and consumers that we must modernize our privacy laws, and create an approach to protecting consumer privacy that works in today’s data-driven reality, while still delivering the innovations consumers and businesses demand.

More and more, lawmakers and stakeholders acknowledge that an effective privacy regime provides meaningful privacy protections for consumers regardless of where they live. Approaches, like the framework ITI released last fall, must offer an interoperable solution that can serve as a model for governments worldwide, providing an alternative to a patchwork of laws that could create confusion and uncertainty over what protections individuals have.

Companies are also increasingly aware of the critical role they play in protecting privacy. Looking ahead, the tech industry will continue to develop mechanisms to hold us accountable, including recommendations that any privacy law mandate companies identify, monitor, and document uses of known personal data, while ensuring the existence of meaningful enforcement mechanisms.

Nuala O’Connor

Nuala O’Connor is president and CEO of the Center for Democracy & Technology, a global nonprofit committed to the advancement of digital human rights and civil liberties, including privacy, freedom of expression, and human agency. O’Connor has served in a number of presidentially appointed positions, including as the first statutorily mandated chief privacy officer in U.S. federal government when she served at the U.S. Department of Homeland Security. O’Connor has held senior corporate leadership positions on privacy, data, and customer trust at Amazon, General Electric, and DoubleClick. She has practiced at several global law firms including Sidley Austin and Venable. She is an advocate for the use of data and internet-enabled technologies to improve equity and amplify marginalized voices.

For too long, Americans’ digital privacy has varied widely, depending on the technologies and services we use, the companies that provide those services, and our capacity to navigate confusing notices and settings.

“Americans deserve comprehensive protections for personal information – protections that can’t be signed, or check-boxed, away.”

We are burdened with trying to make informed choices that align with our personal privacy preferences on hundreds of devices and thousands of apps, and reading and parsing as many different policies and settings. No individual has the time nor capacity to manage their privacy in this way, nor is it a good use of time in our increasingly busy lives. These notices and choices and checkboxes have become privacy theater, but not privacy reality.

In 2019, the legal landscape for data privacy is changing, and so is the public perception of how companies handle data. As more information comes to light about the effects of companies’ data practices and myriad stewardship missteps, Americans are surprised and shocked about what they’re learning. They’re increasingly paying attention, and questioning why they are still overburdened and unprotected. And with intensifying scrutiny by the media, as well as state and local lawmakers, companies are recognizing the need for a clear and nationally consistent set of rules.

Personal privacy is the cornerstone of the digital future people want. Americans deserve comprehensive protections for personal information – protections that can’t be signed, or check-boxed, away. The Center for Democracy & Technology wants to help craft those legal principles to solidify Americans’ digital privacy rights for the first time.

Chris Baker

Chris Baker is Senior Vice President and General Manager of EMEA at Box.

Last year saw data privacy hit the headlines as businesses and consumers alike were forced to navigate the implementation of GDPR. But it’s far from over.

“…customers will have trust in a business when they are given more control over how their data is used and processed”

2019 will be the year that the rest of the world catches up to the legislative example set by Europe, as similar data regulations come to the forefront. Organizations must ensure they are compliant with regional data privacy regulations, and more GDPR-like policies will start to have an impact. This can present a headache when it comes to data management, especially if you’re operating internationally. However, customers will have trust in a business when they are given more control over how their data is used and processed, and customers can rest assured knowing that no matter where they are in the world, businesses must meet the highest bar possible when it comes to data security.

Starting with the U.S., 2019 will see larger corporations opt-in to GDPR to support global business practices. At the same time, local data regulators will lift large sections of the EU legislative framework and implement these rules in their own countries. 2018 was the year of GDPR in Europe, and 2019 be the year of GDPR globally.

Christopher Wolf

Christopher Wolf is the Founder and Chair of the Future of Privacy Forum think tank, and is senior counsel at Hogan Lovells focusing on internet law, privacy and data protection policy.

With the EU GDPR in effect since last May (setting a standard other nations are emulating),

“Regardless of the outcome of the debate over a new federal privacy law, the issue of the privacy and protection of personal data is unlikely to recede.”

with the adoption of a highly-regulatory and broadly-applicable state privacy law in California last Summer (and similar laws adopted or proposed in other states), and with intense focus on the data collection and sharing practices of large tech companies, the time may have come where Congress will adopt a comprehensive federal privacy law. Complicating the adoption of a federal law will be the issue of preemption of state laws and what to do with the highly-developed sectoral laws like HIPPA and Gramm-Leach-Bliley. Also to be determined is the expansion of FTC regulatory powers. Regardless of the outcome of the debate over a new federal privacy law, the issue of the privacy and protection of personal data is unlikely to recede.


Dataiku raises $101 million for its collaborative data science platform

Dataiku wants to turn buzzwords into an actual service. The company has been focused on data tools for many years, before everybody started talking about big data, data science and machine learning.

And the company just raised $101 million in a round led by Iconiq Capital, with Alven Capital, Battery Ventures, Dawn Capital and FirstMark Capital also participating.

If you’re generating a lot of data, Dataiku helps you find a meaning behind data sets. First, you import your data by connecting Dataiku to your storage system. The platform supports dozens of database formats and sources — Hadoop, NoSQL, images, you name it.

You can then use Dataiku to visualize your data, clean your data set, run some algorithms on your data in order to build a machine learning model, deploy it and more. Dataiku has a visual coding tool, or you can use your own code.

But Dataiku isn’t just a tool for data scientists. Even if you’re a business analyst, you can visualize and extract data from Dataiku directly. And because of its software-as-a-service approach, your entire team of data scientists and data analysts can collaborate on Dataiku.

Clients use it to track churn, detect fraud, forecast demand, optimize lifetime values and more. Customers include General Electric, Sephora, Unilever, KUKA, FOX and BNP Paribas.

With today’s funding round, the company plans to double its staff. The company currently works with 200 people in New York, Paris and London. It plans to open offices in Singapore and Sydney, as well.


AtScale lands $50 million investment led by Morgan Stanley

AtScale, the startup that helps companies move massive amounts of data into business intelligence and analytics tools, announced a $50 million Series D round today.

Morgan Stanley led the round, with previous investors Storm Ventures and Atlantic Bridge joining in. New investor Wells Fargo also participated. The funding comes almost exactly a year after the company announced its $25 million Series C. Today’s funding brings the total amount raised to $120 million.

Bringing on an institutional investor like Morgan Stanley is often a signal that the company has reached the stage where it is at least beginning to think about the possibility of going public at some point in the future. AtScale CEO Chris Lynch acknowledged such a connection without making any broad commitment (as you would expect). “We are not close to being IPO-ready, but that was a future consideration in selecting Morgan Stanley,” Lynch told TechCrunch.

What the company does is help take big data and move it into tools where customers can make better use of it. AtScale co-founder Dave Mariani used to be at Yahoo where he helped pioneer the use of big data in the 2009/2010 timeframe. Unfortunately, systems at the time couldn’t deal with the volume of data — and that is still a problem, one that AtScale says it is designed to solve. “We take a bunch of data silos and put a semantic layer across the data platforms and expose them in a consistent way,” Mariani told TechCrunch last year at the time of the Series C round. This allows a company to get a big picture view of their data, rather than consuming it in smaller chunks.

AtScale reported a banner year, bringing on 50 new customers across their target verticals of retail, financial services, advertising and digital sales. These include Rakuten, Dell Technologies, TD Bank and Toyota. What’s more, the company stretched out this year, taking advantage of the last funding round to expand more into international markets in Europe and Asia.

The company was founded in 2013 and is based in San Mateo, California.


Fivetran announces $15M Series A to build automated data pipelines

Fivetran, a startup that builds automated data pipelines between data repositories and cloud data warehouses and analytics tools, announced a $15 million Series A investment led by Matrix Partners.

Fivetran helps move data from source repositories like Salesforce and NetSuite to data warehouses like Snowflake or analytics tools like Looker. Company CEO and co-founder George Fraser says the automation is the key differentiator here between his company and competitors like Informatica and SnapLogic.

“What makes Fivetran different is that it’s an automated data pipeline to basically connect all your sources. You can access your data warehouse, and all of the data just appears and gets kept updated automatically,” Fraser explained. While he acknowledges that there is a great deal of complexity behind the scenes to drive that automation, he stresses that his company is hiding that complexity from the customer.

The company launched out of Y Combinator in 2012, and other than $4 million in seed funding along the way, it has relied solely on revenue up until now. That’s a rather refreshing approach to running an enterprise startup, which typically requires piles of cash to build out sales and marketing organizations to compete with the big guys they are trying to unseat.

One of the key reasons they’ve been able to take this approach has been the company’s partner strategy. Having the ability to get data into another company’s solution with a minimum of fuss and expense has attracted data-hungry applications. In addition to the previously mentioned Snowflake and Looker, the company counts Google BigQuery, Microsoft Azure, Amazon Redshift, Tableau, Periscope Data, Salesforce, NetSuite and PostgreSQL as partners.

Ilya Sukhar, general partner at Matrix Partners, who will be joining the Fivetran board under the terms of deal sees a lot of potential here. “We’ve gone from companies talking about the move to the cloud to preparing to execute their plans, and the most sophisticated are making Fivetran, along with cloud data warehouses and modern analysis tools, the backbone of their analytical infrastructure,” Sukhar said in a statement.

They currently have 100 employees spread out across four offices in Oakland, Denver, Bangalore and Dublin. They boast 500 customers using their product including Square, WeWork, Vice Media and Lime Scooters, among others.


One Billion Tables in MySQL 8.0 with ZFS

one billion tables MySQL

The short version

I created > one billion InnoDB tables in MySQL 8.0 (tables, not rows) just for fun. Here is the proof:

$ mysql -A
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1425329
Server version: 8.0.12 MySQL Community Server - GPL
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> select count(*) from information_schema.tables;
| count(*)   |
| 1011570298 |
1 row in set (6 hours 57 min 6.31 sec)

Yes, it took 6 hours and 57 minutes to count them all!

Why does anyone need one billion tables?

In my previous blog post, I created and tested MySQL 8.0 with 40 million tables (that was a real case study). The One Billion Tables project is not a real world scenario, however. I was challenged by Billion Tables Project (BTP) in PostgreSQL, and decided to repeat it with MySQL, creating 1 billion InnoDB tables.

As an aside: I think MySQL 8.0 is the first MySQL version where creating 1 billion InnoDB tables is even practically possible.

Challenges with one billion InnoDB tables

Disk space

The first and one of the most important challenges is disk space. InnoDB allocates data pages on disk when creating .ibd files. Without disk level compression we need > 25Tb of disk. The good news: we have ZFS which provides transparent disk compression. Here’s how the disk utilization looks:

Actual data (apparent-size):

# du -sh --apparent-size /mysqldata/
26T     /mysqldata/

Compressed data:

# du -sh /mysqldata/
2.4T    /mysqldata/

Compression ratio:

# zfs get compression,compressratio
mysqldata/mysql/data             compressratio         7.14x                      -
mysqldata/mysql/data             compression           gzip                       inherited from mysqldata/mysql

(Looks like the compression ratio reported is not 100% correct, we expect ~10x compression ratio.)

Too many tiny files

This is usually the big issue with databases that create a file per table. With MySQL 8.0 we can create a shared tablespace and “assign” a table to it. I created a tablespace per database, and created 1000 tables in each database.

The result:

mysql> select count(*) from information_schema.schemata;
| count(*) |
|  1011575 |
1 row in set (1.31 sec)

Creating tables

Another big challenge is how to create tables fast enough so it will not take months. I have used three approaches:

  1. Disabled all possible consistency checks in MySQL, and decreased the innodb page size to 4K (these config options are NOT for production use)
  2. Created tables in parallel: as the mutex contention bug in MySQL 8.0 has been fixed, creating tables in parallel works fine.
  3. Use local NVMe cards on top of an AWS ec2 i3.8xlarge instance

my.cnf config file (I repeat: do not use this in production):

default-authentication-plugin = mysql_native_password
log-error = /mysqldata/mysql/log/error.log
innodb_log_group_home_dir = /mysqldata/mysql/log/
innodb_doublewrite = 0
innodb_stats_persistent = 0
tablespace_definition_cache = 524288
schema_definition_cache = 524288
table_definition_cache = 524288

ZFS pool:

# zpool status
  pool: mysqldata
 state: ONLINE
  scan: scrub repaired 0B in 1h49m with 0 errors on Sun Oct 14 02:13:17 2018
        NAME        STATE     READ WRITE CKSUM
        mysqldata   ONLINE       0     0     0
          nvme0n1   ONLINE       0     0     0
          nvme1n1   ONLINE       0     0     0
          nvme2n1   ONLINE       0     0     0
          nvme3n1   ONLINE       0     0     0
errors: No known data errors

A simple “deploy” script to create tables in parallel (includes the sysbench table structure):

function do_db {
        db_exist=$(mysql -A -s -Nbe "select 1 from information_schema.schemata where schema_name = '$db'")
        if [ "$db_exist" == "1" ]; then echo "Already exists: $db"; return 0; fi;
        tbspace="create database $db; use $db; CREATE TABLESPACE $db ADD DATAFILE '$db.ibd' engine=InnoDB";
        #echo "Tablespace $db.ibd created!"
        for i in {1..1000}
                table="CREATE TABLE sbtest$i ( id int(10) unsigned NOT NULL AUTO_INCREMENT, k int(10) unsigned NOT NULL DEFAULT '0', c varchar(120) NOT NULL DEFAULT '', pad varchar(60) NOT NULL DEFAULT '', PRIMARY KEY (id), KEY k_1 (k) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 tablespace $db;"
                tables="$tables; $table;"
        echo "$tbspace;$tables" | mysql
echo "starting..."
c=$(mysql -A -s -Nbe "select max(cast(SUBSTRING_INDEX(schema_name, '_', -1) as unsigned)) from information_schema.schemata where schema_name like 'sbtest_%'")
for m in {1..100000}
        echo "m=$m"
        for i in {1..30}
                let c=$c+1
                echo $c
                do_db &

How fast did we create tables? Here are some stats:

# mysqladmin -i 10 -r ex|grep Com_create_table
| Com_create_table                                      | 6497                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| Com_create_table                                      | 6449

So we created ~650 tables per second. The average, above, is per 10 seconds.

Counting the tables

It took > 6 hours to do “count(*) from information_schema.tables”! Here is why:

  1. MySQL 8.0 uses a new data dictionary (this is great as it avoids creating 1 billion frm files). Everything is stored in this file:
    # ls -lah /mysqldata/mysql/data/mysql.ibd
    -rw-r----- 1 mysql mysql 6.1T Oct 18 15:02 /mysqldata/mysql/data/mysql.ibd
  2. The information_schema.tables is actually a view:
mysql> show create table information_schema.tables\G
*************************** 1. row ***************************
                View: TABLES
         Create View: CREATE ALGORITHM=UNDEFINED DEFINER=`mysql.infoschema`@`localhost` SQL SECURITY DEFINER VIEW `information_schema`.`TABLES` AS select `cat`.`name` AS `TABLE_CATALOG`,`sch`.`name` AS `TABLE_SCHEMA`,`tbl`.`name` AS `TABLE_NAME`,`tbl`.`type` AS `TABLE_TYPE`,if((`tbl`.`type` = 'BASE TABLE'),`tbl`.`engine`,NULL) AS `ENGINE`,if((`tbl`.`type` = 'VIEW'),NULL,10) AS `VERSION`,`tbl`.`row_format` AS `ROW_FORMAT`,internal_table_rows(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`table_rows`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `TABLE_ROWS`,internal_avg_row_length(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`avg_row_length`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `AVG_ROW_LENGTH`,internal_data_length(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`data_length`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `DATA_LENGTH`,internal_max_data_length(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`max_data_length`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `MAX_DATA_LENGTH`,internal_index_length(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`index_length`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `INDEX_LENGTH`,internal_data_free(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`data_free`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `DATA_FREE`,internal_auto_increment(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`auto_increment`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0),`tbl`.`se_private_data`) AS `AUTO_INCREMENT`,`tbl`.`created` AS `CREATE_TIME`,internal_update_time(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(cast(`stat`.`update_time` as unsigned),0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `UPDATE_TIME`,internal_check_time(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(cast(`stat`.`check_time` as unsigned),0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `CHECK_TIME`,`col`.`name` AS `TABLE_COLLATION`,internal_checksum(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`checksum`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `CHECKSUM`,if((`tbl`.`type` = 'VIEW'),NULL,get_dd_create_options(`tbl`.`options`,if((ifnull(`tbl`.`partition_expression`,'NOT_PART_TBL') = 'NOT_PART_TBL'),0,1))) AS `CREATE_OPTIONS`,internal_get_comment_or_error(`sch`.`name`,`tbl`.`name`,`tbl`.`type`,`tbl`.`options`,`tbl`.`comment`) AS `TABLE_COMMENT` from (((((`mysql`.`tables` `tbl` join `mysql`.`schemata` `sch` on((`tbl`.`schema_id` = `sch`.`id`))) join `mysql`.`catalogs` `cat` on((`cat`.`id` = `sch`.`catalog_id`))) left join `mysql`.`collations` `col` on((`tbl`.`collation_id` = `col`.`id`))) left join `mysql`.`tablespaces` `ts` on((`tbl`.`tablespace_id` = `ts`.`id`))) left join `mysql`.`table_stats` `stat` on(((`tbl`.`name` = `stat`.`table_name`) and (`sch`.`name` = `stat`.`schema_name`)))) where (can_access_table(`sch`.`name`,`tbl`.`name`) and is_visible_dd_object(`tbl`.`hidden`))
character_set_client: utf8
collation_connection: utf8_general_ci

and the explain plan looks like this:

mysql> explain select count(*) from information_schema.tables \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: cat
   partitions: NULL
         type: index
possible_keys: PRIMARY
          key: name
      key_len: 194
          ref: NULL
         rows: 1
     filtered: 100.00
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: tbl
   partitions: NULL
         type: ALL
possible_keys: schema_id
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1023387060
     filtered: 100.00
        Extra: Using where; Using join buffer (Block Nested Loop)
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: sch
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY,catalog_id
          key: PRIMARY
      key_len: 8
          ref: mysql.tbl.schema_id
         rows: 1
     filtered: 11.11
        Extra: Using where
*************************** 4. row ***************************
           id: 1
  select_type: SIMPLE
        table: stat
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 388
          ref: mysql.sch.name,mysql.tbl.name
         rows: 1
     filtered: 100.00
        Extra: Using index
*************************** 5. row ***************************
           id: 1
  select_type: SIMPLE
        table: ts
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: mysql.tbl.tablespace_id
         rows: 1
     filtered: 100.00
        Extra: Using index
*************************** 6. row ***************************
           id: 1
  select_type: SIMPLE
        table: col
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: mysql.tbl.collation_id
         rows: 1
     filtered: 100.00
        Extra: Using index


  1. I have created more than 1 billion real InnoDB tables with indexes in MySQL 8.0, just for fun, and it worked. It took ~2 weeks to create.
  2. Probably MySQL 8.0 is the first version where it is even practically possible to create billion InnoDB tables
  3. ZFS compression together with NVMe cards makes it reasonably cheap to do, for example, by using i3.4xlarge or i3.8xlarge instances on AWS.

one billion tables MySQL


Nvidia launches Rapids to help bring GPU acceleration to data analytics

Nvidia, together with partners like IBM, HPE, Oracle, Databricks and others, is launching a new open-source platform for data science and machine learning today. Rapids, as the company is calling it, is all about making it easier for large businesses to use the power of GPUs to quickly analyze massive amounts of data and then use that to build machine learning models.

“Businesses are increasingly data-driven,” Nvidia’s VP of Accelerated Computing Ian Buck told me. “They sense the market and the environment and the behavior and operations of their business through the data they’ve collected. We’ve just come through a decade of big data and the output of that data is using analytics and AI. But most it is still using traditional machine learning to recognize complex patterns, detect changes and make predictions that directly impact their bottom line.”

The idea behind Rapids then is to work with the existing popular open-source libraries and platforms that data scientists use today and accelerate them using GPUs. Rapids integrates with these libraries to provide accelerated analytics, machine learning and — in the future — visualization.

Rapids is based on Python, Buck noted; it has interfaces that are similar to Pandas and Scikit, two very popular machine learning and data analysis libraries, and it’s based on Apache Arrow for in-memory database processing. It can scale from a single GPU to multiple notes and IBM notes that the platform can achieve improvements of up to 50x for some specific use cases when compared to running the same algorithms on CPUs (though that’s not all that surprising, given what we’ve seen from other GPU-accelerated workloads in the past).

Buck noted that Rapids is the result of a multi-year effort to develop a rich enough set of libraries and algorithms, get them running well on GPUs and build the relationships with the open-source projects involved.

“It’s designed to accelerate data science end-to-end,” Buck explained. “From the data prep to machine learning and for those who want to take the next step, deep learning. Through Arrow, Spark users can easily move data into the Rapids platform for acceleration.”

Indeed, Spark is surely going to be one of the major use cases here, so it’s no wonder that Databricks, the company founded by the team behind Spark, is one of the early partners.

“We have multiple ongoing projects to integrate Spark better with native accelerators, including Apache Arrow support and GPU scheduling with Project Hydrogen,” said Spark founder Matei Zaharia in today’s announcement. “We believe that RAPIDS is an exciting new opportunity to scale our customers’ data science and AI workloads.”

Nvidia is also working with Anaconda, BlazingDB, PyData, Quansight and scikit-learn, as well as Wes McKinney, the head of Ursa Labs and the creator of Apache Arrow and Pandas.

Another partner is IBM, which plans to bring Rapids support to many of its services and platforms, including its PowerAI tools for running data science and AI workloads on GPU-accelerated Power9 servers, IBM Watson Studio and Watson Machine Learning and the IBM Cloud with its GPU-enabled machines. “At IBM, we’re very interested in anything that enables higher performance, better business outcomes for data science and machine learning — and we think Nvidia has something very unique here,” Rob Thomas, the GM of IBM Analytics told me.

“The main benefit to the community is that through an entirely free and open-source set of libraries that are directly compatible with the existing algorithms and subroutines that their used to — they now get access to GPU-accelerated versions of them,” Buck said. He also stressed that Rapids isn’t trying to compete with existing machine learning solutions. “Part of the reason why Rapids is open source is so that you can easily incorporate those machine learning subroutines into their software and get the benefits of it.”

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com