Jun 12, 2018

Sumo Logic brings data analysis to containers

Sumo Logic has long held the goal to help customers understand their data wherever it lives. As we move into the era of containers, that goal becomes more challenging because containers by their nature are ephemeral. The company announced a product enhancement today designed to instrument containerized applications in spite of that.

They are debuting these new features at DockerCon, Docker’s customer conference taking place this week in San Francisco.

Sumo’s CEO Ramin Sayer says containers have begun to take hold over the last 12-18 months with Docker and Kubernetes emerging as tools of choice. Given their popularity, Sumo wants to be able to work with them. “[Docker and Kubernetes] are by far the most standard things that have developed in any new shop, or any existing shop that wants to build a brand new modern app or wants to lift and shift an app from on prem [to the cloud], or have the ability to migrate workloads from Vendor A platform to Vendor B,” he said.

He’s not wrong of course. Containers and Kubernetes have been taking off in a big way over the last 18 months and developers and operations alike have struggled to instrument these apps to understand how they behave.

“But as that standardization of adoption of that technology has come about, it makes it easier for us to understand how to instrument, collect, analyze, and more importantly, start to provide industry benchmarks,” Sayer explained.

They do this by avoiding the use of agents. Regardless of how you run your application, whether in a VM or a container, Sumo is able to capture the data and give you feedback you might otherwise have trouble retrieving.

Screen shot: Sumo Logic (cropped)

The company has built in native support for Kubernetes and Amazon Elastic Container Service for Kubernetes (Amazon EKS). It also supports the open source tool Prometheus favored by Kubernetes users to extract metrics and metadata. The goal of the Sumo tool is to help customers fix issues faster and reduce downtime.
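
For context, this is roughly what the Prometheus side of that pipeline looks like: a minimal sketch, assuming the prometheus_client Python package, of a containerized app exposing a counter metric that any Prometheus-compatible scraper can then collect. The metric name and port are illustrative, and how Sumo ingests the scraped data is not shown here.

  # Minimal sketch: an app exposing Prometheus metrics for scraping.
  # Assumes the prometheus_client package (pip install prometheus_client).
  import time
  from prometheus_client import start_http_server, Counter

  # Hypothetical metric; real apps export whatever they care about.
  REQUESTS = Counter("myapp_requests", "Requests handled by the app")

  if __name__ == "__main__":
      start_http_server(8000)   # serves /metrics for Prometheus-compatible collectors
      while True:
          REQUESTS.inc()        # stand-in for real work
          time.sleep(1)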

As they work with this technology, they can begin to understand norms and pass that information onto customers. “We can guide them and give them best practices and tips, not just on what they’ve done, but how they compare to other users on Sumo,” he said.

Sumo Logic was founded in 2010 and has raised $230 million, according to data on Crunchbase. Its most recent round was a $70 million Series F led by Sapphire Ventures last June.

Jun 7, 2018

Devo scores $25 million and cool new name

Logtrust is now known as Devo in one of the cooler name changes I’ve seen in a long time. Whether they intended to pay homage to the late 70s band is not clear, but investors probably didn’t care, as they gave the data operations startup a bushel of money today.

The company now known as Devo announced a $25 million Series C round led by Insight Venture Partners with participation from Kibo Ventures. Today’s investment brings the total raised to $71 million.

The company changed its name because it was about much more than logs, according to CEO Walter Scott. It offers a cloud service that allows customers to stream massive amounts of data — think terabytes or even petabytes — relieving the need to worry about all of the scaling and hardware requirements processing this amount of data would require. That could be logs from web servers, security data from firewalls or transactions taking place on backend systems, to name a few examples.

The data can live on prem if required, but the processing always gets done in the cloud to provide for the scaling needs. Scott says this is about giving companies the ability to process and understand massive amounts of data that was previously only within reach of web-scale companies like Google, Facebook or Amazon.

But it involves more than simply collecting the data. “It’s the combination of us being able to collect all of that data together with running analytics on top of it all in a unified platform, then allowing a very broad spectrum of the business [to make use of it],” Scott explained.

Devo dashboard. Photo: Devo

Devo sees Sumo Logic, Elastic and Splunk as its primary competitors in this space, but like many startups they often battle companies trying to build their own systems as well, a difficult approach for any company to take when you are dealing with this amount of data.

The company, which was founded in Spain, is now based in Cambridge, Massachusetts, and has close to 100 employees. Scott says he has the budget to double that by the end of the year, although he’s not sure they will be able to hire that many people that rapidly.

May 31, 2018

Don’t Drown in your Data Lake

A data lake is “…a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms…”[1]. Many companies find value in using a data lake, but aren’t always aware that they need to plan for it properly and maintain it in order to prevent issues.

The idea of a data lake rose from the need to store data in a raw format that is accessible to a variety of applications and authorized users. Hadoop is often used to query the data, and the necessary structures for querying are created through the query tool (schema on read) rather than as part of the data design (schema on write). There are other tools available for analysis, and many cloud providers are actively developing additional options for creating and managing your data lake. The cloud is often viewed as an ideal place for your data lake since it is inherently elastic and can expand to meet the needs of your data.
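
As a toy illustration of the schema-on-read idea (plain Python rather than Hadoop, with entirely hypothetical data): raw events land in the lake in their natural format, and each query decides how to interpret them only at read time.

  import json

  # Schema on write would force these events into a predefined table up front.
  # In a data lake they are kept in their natural format (here, JSON lines).
  raw_events = [
      '{"user": "alice", "action": "login", "ts": "2018-05-01T12:00:00"}',
      '{"user": "bob", "action": "purchase", "amount": 42.5}',
  ]

  # Schema on read: the structure is imposed by the query, not by the storage.
  def actions_per_user(lines):
      counts = {}
      for line in lines:
          event = json.loads(line)
          user = event.get("user", "unknown")
          counts[user] = counts.get(user, 0) + 1
      return counts

  print(actions_per_user(raw_events))   # {'alice': 1, 'bob': 1}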

Data Lake or Data Swamp?

One of the key components of a functioning data lake is the continual inflow and outflow of data. Some data must be kept indefinitely, but some can be archived or deleted after a defined period of time. Failure to remove stale data can result in a data swamp, where out-of-date data takes up valuable and costly space and may cause queries to take longer to complete. This is one of the first issues that companies encounter in maintaining their data lake. Often, people view the data lake as a “final resting place” for data, but it really should be used for data that is accessed often, or at least occasionally.

A natural spring-fed lake can turn into a swamp due to a variety of factors. If fresh water is not allowed to flow into the lake, it can stagnate, allowing plants and animals that the lake previously could not support to take hold. Conversely, if water cannot exit the lake, the borders will be breached and the surrounding land will be inundated. Both of these conditions can turn a once pristine lake into a fetid and undesirable swamp. If data is no longer being added to your data lake, the results will become dated and eventually unreliable. Also, if data is always being added to the lake but is not accessed on a regular basis, this can lead to unrestricted growth of your data lake, with no real plan for how the data will be used. The lake can become a “cold storage” facility that is likely more expensive than true archival storage.

If bad or undesirable items, like old cars or garbage, are thrown into a lake, this can damage the ecosystem, causing unwanted reactions. In a data lake, this is akin to simply throwing data into the data lake with no real rules or rationale. While the data is saved, it may not be useful and can cause negative consequences across the whole environment since it is consuming space and may slow response times. Even though a basic concept of a data lake is that the data does not need to conform to a predefined structure, like you would see with a relational database, it is important that some rules and guidelines exist regarding the type and quality of data that is included in the lake. In the absence of some guidelines, it becomes difficult to access the relevant data for your needs. Proper definition and tagging of content help to ensure that the correct data is accessible and available when needed.
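
A minimal sketch of what such tagging might look like at ingest time (the field names and retention classes below are assumptions, not a prescribed standard): attaching a small amount of metadata makes the object findable and governable later.

  from datetime import datetime, timezone

  # Hypothetical ingest helper: wrap raw content with minimal governance metadata.
  def ingest(raw_bytes, source, owner, tags, retention="standard"):
      return {
          "data": raw_bytes,
          "metadata": {
              "source": source,
              "owner": owner,
              "tags": sorted(set(tags)),
              "retention": retention,            # e.g. "standard", "archive-after-90d"
              "ingested_at": datetime.now(timezone.utc).isoformat(),
          },
      }

  record = ingest(b'{"temp_c": 71.2}', source="plant-3/switch-7",
                  owner="ops", tags=["sensor", "temperature"])
  print(record["metadata"]["tags"])   # ['sensor', 'temperature']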

Unrestricted Growth Consequences

Many people have a junk drawer somewhere in their house; a drawer that is filled with old receipts, used tickets, theater programs, and the like. Some of this may be stored for sentimental reasons, but a lot of it is put into this drawer since it was a convenient dropping place for things. Similarly, if we look to the data lake as the “junk drawer” for our company, it is guaranteed to be bigger and more expensive than it truly needs to be.

It is important that the data stored in your data lake has a current or expected purpose. While you may not have a current use for some data, it can be helpful to keep it around in case a need arises. An example of this is in the area of machine learning: providing more ancillary data enables better decisions, since it gives a deeper view into the decision process. Therefore, maintaining some data that may not have a specific and current need can be helpful. However, there are cases where maintaining a huge volume of data can be counterproductive. Consider temperature information delivered from a switch. If the temperature reaches a specific threshold, the switch should be shut down. Reporting on the temperature in an immediate and timely manner is important to make an informed decision, but stable temperature data from days, weeks, or months ago can be summarized and stored in a more efficient manner. The granular details can then be purged from the lake.
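
Here is a sketch of that summarize-then-purge step, assuming nothing about any particular tooling: granular readings are rolled up into daily aggregates, and only the compact summary is retained.

  import statistics
  from collections import defaultdict

  # Hypothetical granular readings: (day, temperature) pairs from one switch.
  readings = [("2018-05-01", 40.1), ("2018-05-01", 40.3),
              ("2018-05-02", 41.0), ("2018-05-02", 55.7), ("2018-05-02", 40.9)]

  def summarize_by_day(rows):
      by_day = defaultdict(list)
      for day, temp in rows:
          by_day[day].append(temp)
      return {day: {"min": min(t), "max": max(t),
                    "mean": round(statistics.mean(t), 2), "samples": len(t)}
              for day, t in by_day.items()}

  daily_summary = summarize_by_day(readings)   # keep the compact summary...
  readings.clear()                             # ...and purge the granular rows from the lake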

So, where is the balance? If you keep all the data, it can make your data lake unwieldy and costly. If you only keep data that has a specific current purpose, you may be impairing your future plans. Obviously, the key is to monitor your access and use of the data frequently, and purge or archive some of the data that is not being regularly used.

Uncontrolled Access Concerns

Since much of the data in your data lake is company confidential, it is imperative that access to that data be controlled. The fact that the data in the lake is stored in its raw format means that it is more difficult to control access. The structures of a relational database provide some of the basis for access control, allowing us to limit who has access to specific queries, tables, fields, schemas, databases, and other objects. In the absence of these structures, controlling access requires more finesse. Determining who has access to what parts of the data in the lake must be handled, as well as isolating the data within your own network environment. Many of these restrictions may already be in place in your current environment, but they should be reviewed before being relied on fully, since the data lake may store information that was previously unavailable to some users. Access should be regularly reviewed to identify potential rogue activities. Encryption options also exist to further secure the data from unwanted access, and file system security can be used to limit access. All of these components must be considered, implemented, and reviewed to ensure that the data is secure.
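
As a minimal sketch of the kind of coarse-grained control that has to be layered on when the database’s own structures aren’t there to help (the roles and path prefixes are invented for illustration; a real deployment would also lean on the object store’s access policies, encryption and network isolation):

  # Hypothetical allow-list mapping roles to the lake prefixes they may read.
  ACCESS_POLICY = {
      "analyst":  ["clickstream/", "sales/"],
      "security": ["firewall/", "clickstream/"],
  }

  def can_read(role, object_path):
      return any(object_path.startswith(prefix)
                 for prefix in ACCESS_POLICY.get(role, []))

  def read_object(role, object_path):
      if not can_read(role, object_path):
          raise PermissionError(f"{role} may not read {object_path}")
      with open(object_path, "rb") as f:   # in practice: object store API + encryption at rest
          return f.read()

  print(can_read("analyst", "sales/2018-05.csv"))   # True
  print(can_read("analyst", "firewall/log.gz"))     # False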

User Considerations

In a relational database, the data structure inherently determines some of the consistencies and format of the data. This enables users to easily query the data and be assured that they are returning valid results. The lack of such structures in the data lake means that users must be more highly skilled at data manipulation. Having users with less skill accessing the data is possible, but it may not provide the best results. A data scientist is better positioned to access and query the complete data set. Obviously, users with a higher skill set are rare and cost more to hire, but the return may be worth it in the long run.

So What Do I Do Now?

This is an area where there are no hard and fast rules. Each company must develop and implement processes and procedures that make sense for their individual needs. Only with a plan for monitoring inputs, outputs, access patterns, and the like are you able to make a solid determination for your company’s needs. Percona can help to determine a plan for reporting usage, assess security settings, and more. As you are using the data in your data lake, we can also provide guidance regarding tools used to access the data.

[1] Wikipedia, May 22, 2018

Apr 27, 2018

This Week In Data with Colin Charles 37: Percona Live 2018 Wrap Up

Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Percona Live Santa Clara 2018 is now over! All things considered, I think it went off quite well; if you have any comments/complaints/etc., please don’t hesitate to drop me a line. I believe a survey will be going out as to where you’d like to see the conference in 2019 – yes, it is no longer going to be at the Santa Clara Convention Centre.

I was pleasantly surprised that several people came up to me saying they read this column and enjoy it. Thank you!

The whole conference was abuzz with MySQL 8.0 GA chatter. Many seemed to enjoy the PostgreSQL focus too, now that Percona announced PostgreSQL support.

Congratulations as well to the MySQL Community Awards 2018 winners.

Releases

Link List

Upcoming appearances

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

Apr 24, 2018

Etleap scores $1.5 million seed to transform how we ingest data

Etleap is a play on words for a common set of data practices: extract, transform and load. The startup is trying to place these activities in a modern context, automating what they can and in general speeding up what has been a tedious and highly technical practice. Today, they announced a $1.5 million seed round.

Investors include First Round Capital, SV Angel, Liquid2, BoxGroup and other unnamed investors. The startup launched five years ago as a Y Combinator company. It spent a good 2.5 years building out the product, says CEO and founder Christian Romming. They haven’t required additional funding until now because they have been working with actual customers. Those include Okta, PagerDuty and Mode, among others.

Romming started out at adtech startup VigLink and while there he encountered a problem that was hard to solve. “Our analysts and scientists were frustrated. Integration of the data sources wasn’t always a priority and when something broke, they couldn’t get it fixed until a developer looked at it.” That lack of control slowed things down and made it hard to keep the data warehouse up-to-date.

He saw an opportunity in solving that problem and started Etleap. While there were (and continue to be) legacy solutions like Informatica, Talend and Microsoft SQL Server Integration Services, he said that when he studied these at a deeply technical level, he found they required a great deal of help to implement. He wanted to simplify ETL as much as possible, putting data integration into the hands of much less technical end users, rather than relying on IT and consultants.

One of the problems with traditional ETL is that the data analysts who make use of the data tend to get involved very late after the tools have already been chosen, and Romming says his company wants to change that. “They get to consume whatever IT has created for them. You end up with a bread line where analysts are at the mercy of IT to get their jobs done. That’s one of the things we are trying to solve. We don’t think there should be any engineering at all to set up an ETL pipeline,” he said.
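
For readers new to the term, here is a deliberately tiny sketch of what an extract-transform-load step does (plain Python with made-up fields; it says nothing about how Etleap itself is built).

  import csv, io, json

  # Extract: pull raw rows out of a source system (here, a CSV export).
  def extract(csv_text):
      return list(csv.DictReader(io.StringIO(csv_text)))

  # Transform: clean and reshape the rows into the warehouse's expected schema.
  def transform(rows):
      return [{"user_id": int(r["id"]),
               "email": r["email"].strip().lower(),
               "signup_date": r["created"][:10]} for r in rows]

  # Load: write the records to the destination (stubbed here as JSON lines).
  def load(records, out):
      for rec in records:
          out.write(json.dumps(rec) + "\n")

  source = "id,email,created\n7, Alice@Example.com ,2018-04-01T09:30:00Z\n"
  warehouse = io.StringIO()
  load(transform(extract(source)), warehouse)
  print(warehouse.getvalue())   # {"user_id": 7, "email": "alice@example.com", "signup_date": "2018-04-01"}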

Etleap is delivered as managed SaaS or you can run it within your company’s AWS accounts. Regardless of the method, it handles all of the managing, monitoring and operations for the customer.

Romming emphasizes that the product is really built for cloud data warehouses. For now, they are concentrating on the AWS ecosystem, but have plans to expand beyond that down the road. “We want to help more enterprise companies make better use of their data, while modernizing data warehousing infrastructure and making use of cloud data warehouses,” he explained.

The company currently has 15 employees, but Romming plans to at least double that in the next 12-18 months, mostly increasing the engineering team to help further build out the product and create more connectors.

Apr 23, 2018

This Week In Data with Colin Charles 36: Percona Live 2018

Percona Live Santa Clara 2018! Last week’s column may have somehow not made it to Planet MySQL, so please don’t miss the good links at: This Week in Data with Colin Charles 35: Percona Live 18 final countdown and a roundup of recent news.

Back to Percona Live – I expect people are still going to be registering, right down to the wire! I highly recommend you also register for the community dinner. They routinely sell out and people tend to complain about not being able to join in the fun, so reserve your spot early. Please also be present on Monday, which is not just tutorial day; the welcoming reception that evening will also feature the most excellent community awards. In addition, if you don’t find a tutorial that interests you (or didn’t get a ticket that included tutorials!), why not check out the China Track, something new and unique that showcases the technology coming out of China.

The biggest news this week? On Thursday, April 19, 2018, MySQL 8.0 became Generally Available with the 8.0.11 release. The release notes are a must read, as is the upgrade guide (this time around, you really want to read it!). Some more digestible links: What’s New in MySQL 8.0? (Generally Available), MySQL 8.0: New Features in Replication, MySQL 8.0 – Announcing GA of the MySQL Document Store. As a bonus, the Hacker News thread is also well worth a read. Don’t forget that all the connectors also got a nice version bump.

The PostgreSQL website has been redesigned – check out PostgreSQL.org.

More open source databases are always a good thing, and it’s great to see Apple open sourcing FoundationDB. Being corporate-backed open source, I have great hopes for what the project can become. The requisite Hacker News thread is also well worth a read.

Releases

  • PostgreSQL 10.3, 9.6.8, 9.5.12, 9.4.17, AND 9.3.22 released
  • MariaDB 10.3.6 is another release candidate, with more changes for sql_mode=oracle, changes to the INFORMATION_SCHEMA tables around system versioning, and more. Particularly interesting is the contributor list, which totals 34 contributors. Five come from the MariaDB Foundation (including Monty), which is about 14%; 17 come from the MariaDB Corporation (including Monty again), which is 50%; two come from Tempesta, one from IBM, six from Codership (over 17%!), and four are independent. So nearly 62% of contributions come from the Corporation/Foundation in total (counting Monty once).
  • SysbenchRocks, a repository of Sysbench benchmarks, libraries and extensions.

Link List

Upcoming appearances

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

Apr 10, 2018

Splunk turns data processing chops to Industrial IoT

Splunk has always been known as a company that can sift through oodles of log or security data and help customers surface the important bits. Today, it announced it was going to try to apply that same skill set to Industrial Internet of Things data.

IIoT data is found in manufacturing settings, typically coming from sensors on the factory floor that give engineers and plant managers data about the health and well-being of the machines running in the facility. Up until now, that data hasn’t had a modern place to live. Traditionally, companies pull it into Excel and try to slice and dice it to find the issues.

Splunk wants to change that with Splunk Industrial Asset Intelligence (IAI). The latest product pulls data from a variety of sources where it can be presented to management and engineers with the information they need to see along with critical alerts.

The new product takes advantage of some existing Splunk tools and is built on top of Splunk Enterprise, but instead of processing data coming from IT systems, it looks at Industrial Control Systems (ICS), sensors, SCADA (supervisory control and data acquisition) systems and applications, pulling all that data together and presenting it to the key constituencies in a dashboard.

It is not a simple matter, however, to set up these dashboards, pull the data from the various data sources, some of which may be modern and some quite old, and figure out what’s important for a particular customer. Splunk says it has turned to systems integrators to help with that part of the implementation.

Splunk understands data, but it also recognizes that working in the manufacturing sector is new territory for it, so they are looking to SIs with expertise in manufacturing to help them work with the unique requirements of this group. But it’s still data, says Ammar Maraqa, Splunk SVP of Business Operations and Strategy and General Manager of IoT Markets.

“If you step back at the end of the day, Splunk is able to ingest and correlate heterogeneous sets of data to provide a view into what’s happening in their environments,” Maraqa said.

With today’s announcement, Splunk Industrial Asset Intelligence exits Beta for a limited release. It should be generally available sometime in the Fall.

Mar 30, 2018

IoT devices could be next customer data frontier

At the Adobe Summit this week in Las Vegas, the company introduced what could be the ultimate customer experience construct, a customer experience system of record that pulls in information, not just from Adobe tools, but wherever it lives. In many ways it marked a new period in the notion of customer experience management, putting it front and center of the marketing strategy.

Adobe was not alone, of course. Salesforce, with its three-headed monster, the sales, marketing and service clouds, was also thinking of a similar idea. In fact, they spent $6.5 billion last week to buy MuleSoft to act as a data integration layer to access customer information from across the enterprise software stack, whether on prem, in the cloud, or inside or outside of Salesforce. And they announced the Salesforce Integration Cloud this week to make use of their newest company.

As data collection takes center stage, we actually could be on the edge of yet another data revolution, one that could be more profound than even the web and mobile were before it. That is…the Internet of Things.

Here comes IoT

There are three main pieces to that IoT revolution at the moment from a consumer perspective. First of all, there is the smart speaker like the Amazon Echo or Google Home. These provide a way for humans to interact verbally with machines, a notion that is only now possible through the marriage of all this data, sheer (and cheap) compute power and the AI algorithms that fuel all of it.

Next, we have the idea of a connected car, one separate from the self-driving car. Much like the smart speaker, humans can interact with the car to find directions and recommendations, and that leaves a data trail in its wake. Finally, we have sensors like iBeacons sitting in stores, providing retailers with a world of information about a customer’s journey through the store — what they like or don’t like, what they pick up, what they try on and so forth.

There are very likely a host of other categories too, and all of this information is data that needs to be processed and understood just like any other signals coming from customers, but it also has unique characteristics around the volume and velocity of this data — it is truly big data with all of the issues inherent in processing that amount of data.

That means it needs to be ingested, digested and incorporated into that central customer record-keeping system to drive the content and experiences you need to create to keep your customers happy — or so the marketing software companies tell us, at least. (We also need to consider the privacy implications of such a record, but that is the subject for another article.)

Building a better relationship

Regardless of the vendor, all of this is about understanding the customer better to provide a central data-gathering system with the hope of giving people exactly what they want. We are no longer a generic mass of consumers. We are instead individuals with different needs, desires and requirements, and the best way to please us, they say, is to understand us so well that the brand can deliver the perfect experience at exactly the right moment.

Photo: Ron Miller

That involves listening to the digital signals we give off without even thinking about it. We carry mobile, connected computers in our pockets and they send out a variety of information about our whereabouts and what we are doing. Social media acts as a broadcast system that brands can tap into to better understand us (or so the story goes).

Part of what Adobe, Salesforce and others can deliver is a way to gather that information, pull it together into this uber record-keeping system and apply a level of machine learning and intelligence to help further the brand’s ultimate goals of serving a customer of one and delivering an efficient (and perhaps even pleasurable) experience.

Getting on board

At an Adobe Summit session this week on IoT (which I moderated), the audience was polled a couple of times. In one show of hands, they were asked how many owned a smart speaker and about three quarters indicated they owned at least one, but when asked how many were developing applications for these same devices only a handful of hands went up. This was in a room full of marketers, mind you.

Photo: Ron Miller

That suggests that there is a disconnect between usage and tools to take advantage of them. The same could be said for the other IoT data sources, the car and sensor tech, or any other connected consumer device. Just as we created a set of tools to capture and understand the data coming from mobile apps and the web, we need to create the same thing for all of these IoT sources.

That means coming up with creative ways to take advantage of another interaction (and data collection) point. This is an entirely new frontier with all of the opportunity involved in that, and that suggests startups and established companies alike need to be thinking about solutions to help companies do just that.

Mar 27, 2018

Pure Storage teams with Nvidia on GPU-fueled Flash storage solution for AI

As companies gather increasing amounts of data, they face a choice over where the bottleneck sits: in the storage component or in the backend compute system. Some companies have attacked the problem by using GPUs to streamline the backend compute, or Flash storage to speed up storage. Pure Storage wants to give customers the best of both worlds.

Today it announced Airi, a complete data storage solution for AI workloads in a box.

Under the hood, Airi starts with a Pure Storage FlashBlade, a storage solution that Pure created specifically with AI and machine learning processing in mind. Nvidia contributes the raw compute power with four NVIDIA DGX-1 supercomputers, delivering four petaFLOPS of performance with NVIDIA Tesla V100 GPUs. Arista provides the networking hardware to make it all work together with Arista 100GbE switches. The software glue layer comes from the NVIDIA GPU Cloud deep learning stack and the Pure Storage AIRI Scaling Toolkit.

Photo: Pure Storage

One interesting aspect of this deal is that the FlashBlade product operates as a separate product inside the Pure Storage organization. They have put together a team of engineers with AI and data pipeline expertise, focused on finding ways to move beyond the traditional storage market and figure out where the market is going.

This approach certainly does that, but the question is do companies want to chase the on-prem hardware approach or take this kind of data to the cloud. Pure would argue that the data gravity of AI workloads would make this difficult to achieve with a cloud solution, but we are seeing increasingly large amounts of data moving to the cloud with the cloud vendors providing tools for data scientists to process that data.

If companies choose to go the hardware route over the cloud, each vendor in this equation — whether Nvidia, Pure Storage or Arista — should benefit from a multi-vendor sale. The idea ultimately is to provide customers with a one-stop solution they can install quickly inside a data center if that’s the approach they want to take.

Mar 9, 2018

InfoSum’s first product touts decentralized big data insights

Nick Halstead’s new startup, InfoSum, is launching its first product today — moving one step closer to his founding vision of a data platform that can help businesses and organizations unlock insights from big data silos without compromising user privacy, data security or data protection law. So a pretty high bar then.

If the underlying tech lives up to the promises being made for it, the timing for this business looks very good indeed, with the European Union’s new General Data Protection Regulation (GDPR) mere months away from applying across the region — ushering in a new regime of eye-wateringly large penalties to incentivize data handling best practice.

InfoSum bills its approach to collaboration around personal data as fully GDPR compliant — because it says it doesn’t rely on sharing the actual raw data with any third parties.

Rather, a mathematical model is used to make a statistical comparison, and the platform delivers aggregated — but still, says Halstead — useful insights. Though he says the regulatory angle is fortuitous, rather than the full inspiration for the product.

“Two years ago, I saw that the world definitely needed a different way to think about working on knowledge about people,” he tells TechCrunch. “Both for privacy [reasons] — there isn’t a week where we don’t see some kind of data breach… they happen all the time — but also privacy isn’t enough by itself. There has to be a commercial reason to change things.”

The commercial imperative he reckons he’s spied is around how “unmanageable” big data can become when it’s pooled for collaborative purposes.

Datasets invariably need a lot of cleaning up to make different databases align and overlap. And the process of cleaning and structuring data so it can be usefully compared can run to multiple weeks. Yet that effort has to be put in before you really know if it will be worth your while doing so.

That snag of time + effort is a major barrier preventing even large companies from doing more interesting things with their data holdings, argues Halstead.

So InfoSum’s first product — called Link — is intended to give businesses a glimpse of the “art of the possible”, as he puts it — in just a couple of hours, rather than the “nine, ten weeks” he says it might otherwise take them.

“I set myself a challenge… could I get through the barriers that companies have around privacy, security, and the commercial risks when they handle consumer data. And, more importantly, when they need to work with third parties or need to work across their corporation where they’ve got numbers of consumer data and they want to be able to look at that data and look at the combined knowledge across those.

“That’s really where I came up with this idea of non-movement of data. And that’s the core principle of what’s behind InfoSum… I can connect knowledge across two data sets, as if they’ve been pooled.”

Halstead says that the problem with the traditional data pooling route — so copying and sharing raw data with all sorts of partners (or even internally, thereby expanding the risk vector surface area) — is that it’s risky. The myriad data breaches that regularly make headlines nowadays are a testament to that.

But that’s not the only commercial consideration in play, as he points out that raw data which has been shared is immediately less valuable — because it can’t be sold again.

“If I give you a data set in its raw form, I can’t sell that to you again — you can take it away, you can slice it and dice it as many ways as you want. You won’t need to come back to me for another three or four years for that same data,” he argues. “From a commercial point of view [what we’re doing] makes the data more valuable. In that data is never actually having to be handed over to the other party.”

Not blockchain for privacy

Decentralization, as a technology approach, is also of course having a major moment right now — thanks to blockchain hype. But InfoSum is definitely not blockchain. Which is a good thing. No sensible person should be trying to put personal data on a blockchain.

“The reality is that all the companies that say they’re doing blockchain for privacy aren’t using blockchain for the privacy part, they’re just using it for a trust model, or recording the transactions that occur,” says Halstead, discussing why blockchain is terrible for privacy.

“Because you can’t use the blockchain and say it’s GDPR compliant or privacy safe. Because the whole transparency part of it and the fact that it’s immutable. You can’t have an immutable database where you can’t then delete users from it. It just doesn’t work.”

Instead he describes InfoSum’s technology as “blockchain-esque” — because “everyone stays holding their data”. “The trust is then that because everyone holds their data, no one needs to give their data to everyone else. But you can still crucially, through our technology, combine the knowledge across those different data sets.”

So what exactly is InfoSum doing to the raw personal data to make it “privacy safe”? Halstead claims it goes “beyond hashing” or encrypting it. “Our solution goes beyond that — there is no way to re-identify any of our data because it’s not ever represented in that way,” he says, further claiming: “It is absolutely 100 per cent data isolation, and we are the only company doing this in this way.

“There are solutions out there where traditional models are pooling it but with encryption on top of it. But again if the encryption gets broken the data is still ending up being in a single silo.”

InfoSum’s approach is based on mathematically modeling users, using a “one way model”, and using that to make statistical comparisons and serve up aggregated insights.

“You can’t read things out of it, you can only test things against it,” he says of how it’s transforming the data. “So it’s only useful if you actually knew who those users were beforehand — which obviously you’re not going to. And you wouldn’t be able to do that unless you had access to our underlying code-base. Everyone else either uses encryption or hashing or a combination of both of those.”

This one-way modeling technique is in the process of being patented — so Halstead says he can’t discuss the “fine details” — but he does mention a long-standing technique for optimizing database communications, called Bloom filters, saying those sorts of “principles” underpin InfoSum’s approach.

Although he also says it’s using those kind of techniques differently. Here’s how InfoSum’s website describes this process (which it calls Quantum):

InfoSum Quantum irreversibly anonymises data and creates a mathematical model that enables isolated datasets to be statistically compared. Identities are matched at an individual level and results are collated at an aggregate level – without bringing the datasets together.
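
For a rough feel of the principle Halstead is gesturing at (not InfoSum’s patented method), here is a sketch using a plain Bloom filter: party A shares only a one-way bit array, and party B learns only an approximate, aggregate overlap count, never the identifiers themselves.

  import hashlib

  class BloomFilter:
      """Minimal Bloom filter: a one-way, fixed-size bit array."""
      def __init__(self, size_bits=1 << 20, num_hashes=7):
          self.size, self.num_hashes = size_bits, num_hashes
          self.bits = bytearray(size_bits // 8)

      def _positions(self, item):
          for i in range(self.num_hashes):
              digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
              yield int.from_bytes(digest[:8], "big") % self.size

      def add(self, item):
          for pos in self._positions(item):
              self.bits[pos // 8] |= 1 << (pos % 8)

      def probably_contains(self, item):
          return all((self.bits[pos // 8] >> (pos % 8)) & 1 for pos in self._positions(item))

  # Party A shares only its filter, never the raw identifiers.
  filter_a = BloomFilter()
  for customer_id in ["alice", "bob", "carol"]:
      filter_a.add(customer_id)

  # Party B tests its own identifiers and reports only an aggregate count.
  overlap = sum(filter_a.probably_contains(c) for c in ["bob", "dave", "erin"])
  print(overlap)   # 1 here; Bloom filters can return false positives, so this is an estimate

The filter can be tested against but not read out, which matches the property Halstead describes, though a production system would need careful sizing, noise handling and permissioning well beyond this toy.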

On the surface, the approach shares a similar structure to Facebook’s Custom Audiences Product, where advertisers’ customer lists are locally hashed and then uploaded to Facebook for matching against its own list of hashed customer IDs — with any matches used to create a custom audience for ad targeting purposes.
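
The general hashed-matching pattern (illustrative only, not Facebook’s exact scheme) looks something like this: each side hashes its identifiers locally, and only the hashes are compared.

  import hashlib

  def sha256_hex(value):
      return hashlib.sha256(value.strip().lower().encode()).hexdigest()

  advertiser_emails = ["alice@example.com", "bob@example.com", "carol@example.com"]
  platform_emails   = ["bob@example.com", "dave@example.com"]

  # Each side hashes locally; only the hashes cross the wire.
  advertiser_hashes = {sha256_hex(e) for e in advertiser_emails}
  platform_hashes   = {sha256_hex(e) for e in platform_emails}

  matched = advertiser_hashes & platform_hashes
  print(len(matched))   # 1

Because identifiers like email addresses are low-entropy, plain hashes of this kind can often be reversed by brute force, which is one reason Halstead argues for going “beyond hashing.”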

Though Halstead argues InfoSum’s platform offers more for even this kind of audience building marketing scenario, because its users can use “much more valuable knowledge” to model on — knowledge they would not comfortably share with Facebook “because of the commercial risks of handing over that first person valuable data”.

“For instance if you had an attribute that defined which were your most valuable customers, you would be very unlikely to share that valuable knowledge — yet if you could safely then it would be one of the most potent indicators to model upon,” he suggests.

He also argues that InfoSum users will be able to achieve greater marketing insights via collaborations with other users of the platform vs being a customer of Facebook Custom Audiences — because Facebook simply “does not open up its knowledge”.

“You send them your customer lists, but they don’t then let you have the data they have,” he adds. “InfoSum for many DMPs [data management platforms] will allow them to collaborate with customers so the whole purchasing of marketing can be much more transparent.”

He also emphasizes that marketing is just one of the use-cases InfoSum’s platform can address.

Decentralized bunkers of data

One important clarification: InfoSum customers’ data does get moved — but it’s moved into a “private isolated bunker” of their choosing, rather than being uploaded to a third party.

“The easiest one to use is where we basically create you a 100 per cent isolated instance in Amazon [Web Services],” says Halstead. “We’ve worked with Amazon on this so that we’ve used a whole number of techniques so that once we create this for you, you put your data into it — we don’t have access to it. And when you connect it to the other part we use this data modeling so that no data then moves between them.”

“The ‘bunker’ is… an isolated instance,” he adds, elaborating on how communications with these bunkers are secured. “It has its own firewall, a private VPN, and of course uses standard SSL security. And once you have finished normalising the data it is turned into a form in which all PII [personally identifiable information] is deleted.

“And of course like any other security related company we have had independent security companies penetration test our solution and look at our architecture design.”

Other key pieces of InfoSum’s technology are around data integration and identity mapping — aimed at tackling the (inevitable) problem of data in different databases/datasets being stored in different formats. Which again is one of the commercial reasons why big data silos often stay just that: Silos.

Halstead gave TechCrunch a demo showing how the platform ingests and connects data, with users able to use “simple steps” to teach the system what is meant by data types stored in different formats — such as that ‘f’ means the same as ‘female’ for gender category purposes — to smooth the data mapping and “try to get it as clean as possible”.
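
Conceptually, the rules a user teaches the system boil down to value mappings of this sort (the rule table below is invented purely for illustration):

  # Hypothetical mapping rules a user might teach the system interactively.
  GENDER_RULES = {"f": "female", "female": "female", "woman": "female",
                  "m": "male", "male": "male", "man": "male"}

  def normalize_gender(raw):
      return GENDER_RULES.get(str(raw).strip().lower(), "unknown")

  rows = [{"gender": "F"}, {"gender": " male "}, {"gender": "n/a"}]
  cleaned = [dict(r, gender=normalize_gender(r["gender"])) for r in rows]
  print(cleaned)   # [{'gender': 'female'}, {'gender': 'male'}, {'gender': 'unknown'}]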

Once that step has been completed, the user (or collaborating users) are able to get a view on how well linked their data sets are — and thus to glimpse “the start of the art of the possible”.

In practice this means they can choose to run different reports atop their linked datasets — such as if they want to enrich their data holdings by linking their own users across different products to gain new insights, such as for internal research purposes.

Or, where there’s two InfoSum users linking different data sets, they could use it for propensity modeling or lookalike modeling of customers, says Halstead. So, for example, a company could link models of their users with models of the users of a third party that holds richer data on its users to identify potential new customer types to target marketing at.

“Because I’ve asked to look at the overlap I can literally say I only know the gender of these people but I would also like to know what their income is,” he says, fleshing out another possible usage scenario. “You can’t drill into this, you can’t do really deep analytics — that’s what we’ll be launching later. But Link allows you to get this idea of what would it look like if I combine our datasets.

“The key here is it’s opening up a whole load of industries where sensitivity around doing this — and where, even in industries that share a lot of data already but where GDPR is going to be a massive barrier to it in the future.”

Halstead says he expects big demand from the marketing industry which is of course having to scramble to rework its processes to ensure they don’t fall foul of GDPR.

“Within marketing there is going to be a whole load of new challenges for companies where they were currently enhancing their databases, buying up large raw datasets and bringing their data into their own CRM. That world’s gone once we’ve got GDPR.

“Our model is safer, faster, and actually still really lets people do all the things they did before but while protecting the customers.”

But it’s not just marketing exciting him. Halstead believes InfoSum’s approach to lifting insights from personal data could be very widely applicable — arguing, for example, that it’s only a minority of use-cases, such as credit risk and fraud within banking, where companies actually need to look at data at an individual level.

One area he says he’s “very passionate” about InfoSum’s potential is in the healthcare space.

“We believe that this model isn’t just about helping marketing and helping a whole load of others — healthcare especially for us I think is going to be huge. Because [this affords] the ability to do research against health data where health data is never been actually shared,” he says.

“In the UK especially we’ve had a number of massive false starts where companies have, for very good reasons, wanted to be able to look at health records and combine data — which can turn into vital research to help people. But actually their way of doing it has been about giving out large datasets. And that’s just not acceptable.”

He even suggests the platform could be used for training AIs within the isolated bunkers — flagging a developer interface that will be launching after Link which will let users query the data as a traditional SQL query.

Though he says he sees most initial healthcare-related demand coming from analytics that need “one or two additional attributes” — such as, for example, comparing health records of people with diabetes with activity tracker data to look at outcomes for different activity levels.

“You don’t need to drill down into individuals to know that the research capabilities could give you incredible results to understand behavior,” he adds. “When you do medical research you need bodies of data to be able to prove things so the fact that we can only work at an aggregate level is not, I don’t think, any barrier to being able to do the kind of health research required.”

Another area he believes could really benefit is M&A — saying InfoSum’s platform could offer companies a way to understand how their user bases overlap before they sign on the line. (It is also of course handling and thus simplifying the legal side of multiple entities collaborating over data sets.)

“There hasn’t been the technology to allow them to look at whether there’s an overlap before,” he claims. “It puts the power in the hands of the buyer to be able to say we’d like to be able to look at what your user base looks like in comparison to ours.

“The problem right now is you could do that manually but if they then backed out there’s all kinds of legal problems because I’ve had to hand the raw data over… so no one does it. So we’re going to change the M&A market for allowing people to discover whether I should acquire someone before they go through to the data room process.”

While Link is something of a taster of what InfoSum’s platform aims to ultimately offer (with this first product priced low but not freemium), the SaaS business it’s intending to get into is data matchmaking — whereby, once it has a pipeline of users, it can start to suggest links that might be interesting for its customers to explore.

“There is no point in us reinventing the wheel of being the best visualization company because there’s plenty that have done that,” he says. “So we are working on data connectors for all of the most popular BI tools that plug in to then visualize the actual data.

“The long term vision for us moves more into being more of an introductory service — i.e. one we’ve got 100 companies in this how do we help those companies work out what other companies that they should be working with.”

“We’ve got some very good systems for — in a fully anonymized way — helping you understand what the intersection is from your data to all of the other datasets, obviously with their permission if they want us to calculate that for them,” he adds.

“The way our investors looked at this, this is the big opportunity going forward. There is no limit, in a decentralized world… imagine 1,000 bunkers around the world in these different corporates who all can start to collaborate. And that’s our ultimate goal — that all of them are still holding onto their own knowledge, 100% privacy safe, but then they have that opportunity to work with each other, which they don’t right now.”

Engineering around privacy risks?

But does he not see any risks to privacy of enabling the linking of so many separate datasets — even with limits in place to avoid individuals being directly outed as connected across different services?

“However many data sets there are the only thing it can reveal extra is whether every extra data has an extra bit of knowledge,” he responds on that. “And every party has the ability to define  what bit of data they would then want to be open to others to then work on.

“There are obviously sensitivities around certain combinations of attributes, around religion, gender and things like that. Where we already have a very clever permission system where the owners can define what combinations are acceptable and what aren’t.”

“My experience of working with all the social networks has meant — I hope — that we are ahead of the game of thinking about those,” he adds, saying that the matchmaking stage is also six months out at this point.

“I don’t see any down sides to it, as long as the controls are there to be able to limit it. It’s not like it’s going to be a sudden free for all. It’s an introductory service, rather than an open platform so everyone can see everything else.”

The permission system is clearly going to be important. But InfoSum does essentially appear to be heading down the platform route of offloading responsibility for ethical considerations — in its case around dataset linkages — to its customers.

Which does open the door to problematic data linkages down the line, and all sorts of unintended dots being joined.

Say, for example, a health clinic decides to match people with particular medical conditions to users of different dating apps — and the relative proportions of HIV rates across straight and gay dating apps in the local area gets published. What unintended consequences might spring from that linkage being made?

Other equally problematic linkages aren’t hard to imagine. And we’ve seen the appetite businesses have for making creepy observations about their users public.

“Combining two sets of aggregate data meaningfully is not easy,” says Eerke Boiten, professor of cyber security at De Montfort University, discussing InfoSum’s approach. “If they can make this all work out in a way that makes sense, preserves privacy, and is GDPR compliant, then they deserve a patent I suppose.”

On data linkages, Boiten points to the problems Facebook has had with racial profiling as illustrative of the potential pitfalls.

He also says there may also be GDPR-specific risks around customer profiling enabled by the platform. In an edge case scenario, for example, where two overlapped datasets are linked and found to have a 100% user match, that would mean people’s personal data had been processed by default — so that processing would have required a legal basis to be in place beforehand.

And there may be wider legal risks around profiling too. If, for example, linkages are used to deny services or vary pricing to certain types or blocks of customers, is that legal or ethical?

“From a company’s perspective, if it already has either consent or a legitimate purpose (under GDPR) to use customer data for analytical/statistical purposes then it can use our products,” says InfoSum’s COO Danvers Baillieu, on data processing consent. “Where a company has an issue using InfoSum as a sub-processor, then… we can set up the system differently so that we simply supply the software and they run it on their own machines (so we are not a data processor) –- but this is not yet available in Link.”

Baillieu also notes that the bin sizes InfoSum’s platform aggregates individuals into are configurable in its first product. “The default bin size is 10, and the absolute minimum is three,” he adds.

“The other key point around disclosure control is that our system never needs to publish the raw data table. All the famous breaches from Netflix onwards are because datasets have been pseudonymised badly and researchers have been able to run analysis across the visible fields and then figure out who the individuals are — this is simply not possible with our system as this data is never revealed.”
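
A minimal sketch of that kind of disclosure control, assuming nothing about InfoSum’s internals: results are only released for bins that meet the minimum size.

  from collections import Counter

  MIN_BIN = 3  # suppress any aggregate computed from fewer than three individuals

  def aggregate(records, key, min_bin=MIN_BIN):
      counts = Counter(r[key] for r in records)
      return {value: n for value, n in counts.items() if n >= min_bin}

  records = [{"gender": "female", "income_band": "50-75k"}] * 4 + \
            [{"gender": "male", "income_band": "25-50k"}] * 2
  print(aggregate(records, "income_band"))   # {'50-75k': 4}; the bin of 2 is suppressed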

‘Fully GDPR compliant’ is certainly a big claim — and one that is going to have a lot of slings and arrows thrown at it as data gets ingested by InfoSum’s platform.

It’s also fair to say that a whole library of books could be written about technology’s unintended consequences.

Indeed, InfoSum’s own website credits Halstead as the inventor of the embedded retweet button, noting the technology is “something that is now ubiquitous on almost every website in the world”.

Those ubiquitous social plugins are also of course a core part of the infrastructure used to track Internet users wherever and almost everywhere they browse. So does he have any regrets about the invention, given how that bit of innovation has ended up being so devastating for digital privacy?

“When I invented it, the driving force for the retweet button was only really as a single number to count engagement. It was never to do with tracking. Our version of the retweet button never had any trackers in it,” he responds on that. “It was the number that drove our algorithms for delivering news in a very transparent way.

“I don’t need to add my voice to all the US pundits on the regrets of the beast that’s been unleashed. All of us feel that desire to unhook from some of these networks now because they aren’t being healthy for us in certain ways. And I certainly feel that what we’re now doing for improving the world of data is going to be good for everyone.”

When we first covered the UK-based startup it was going under the name CognitiveLogic — a placeholder name, as at that point, just three weeks in, Halstead was still figuring out exactly how to take his idea to market.

The founder of DataSift has not had difficulties raising funding for his new venture. There was an initial $3M from Upfront Ventures and IA Ventures, with the seed topped up by a further $5M last year, with new investors including Saul Klein (formerly Index Ventures) and Mike Chalfen of Mosaic Ventures. Halstead says he’ll be looking to raise “a very large Series A” over the summer.

In the meantime, he says he has a “very long list” of hundreds of customers wanting to get their hands on the platform to kick its tires. “The last three months has been a whirlwind of me going back to all of the major brands, all of the big data companies; there is no large corporate that doesn’t have these kinds of challenges,” he adds.

“I saw a very big client this morning… they’re a large multinational, they’ve got three major brands where the three customer sets had never been joined together. So they don’t even know what the overlap of those brands are at the moment. So even giving them that insight would be massively valuable to them.”
