InfoSum’s first product touts decentralized big data insights

Nick Halstead’s new startup, InfoSum, is launching its first product today — moving one step closer to his founding vision of a data platform that can help businesses and organizations unlock insights from big data silos without compromising user privacy, data security or data protection law. So a pretty high bar then.

If the underlying tech lives up to the promises being made for it, the timing for this business looks very good indeed, with the European Union’s new General Data Protection Regulation (GDPR) mere months away from applying across the region — ushering in a new regime of eye-wateringly large penalties to incentivize data handling best practice.

InfoSum bills its approach to collaboration around personal data as fully GDPR compliant — because it says it doesn’t rely on sharing the actual raw data with any third parties.

Rather, a mathematical model is used to make a statistical comparison, and the platform delivers insights that are aggregated but still, says Halstead, useful. Though he says the regulatory angle is fortuitous, rather than the full inspiration for the product.

“Two years ago, I saw that the world definitely needed a different way to think about working on knowledge about people,” he tells TechCrunch. “Both for privacy [reasons] — there isn’t a week where we don’t see some kind of data breach… they happen all the time — but also privacy isn’t enough by itself. There has to be a commercial reason to change things.”

The commercial imperative he reckons he’s spied is around how “unmanageable” big data can become when it’s pooled for collaborative purposes.

Datasets invariably need a lot of cleaning up to make different databases align and overlap. And the process of cleaning and structuring data so it can be usefully compared can run to multiple weeks. Yet that effort has to be put in before you really know whether it will be worth your while.

That snag of time + effort is a major barrier preventing even large companies from doing more interesting things with their data holdings, argues Halstead.

So InfoSum’s first product — called Link — is intended to give businesses a glimpse of the “art of the possible”, as he puts it — in just a couple of hours, rather than the “nine, ten weeks” he says it might otherwise take them.

“I set myself a challenge… could I get through the barriers that companies have around privacy, security, and the commercial risks when they handle consumer data? And, more importantly, when they need to work with third parties or need to work across their corporation where they’ve got numbers of consumer data sets and they want to be able to look at that data and look at the combined knowledge across those.

“That’s really where I came up with this idea of non-movement of data. And that’s the core principle of what’s behind InfoSum… I can connect knowledge across two data sets, as if they’ve been pooled.”

Halstead says that the problem with the traditional data pooling route — so copying and sharing raw data with all sorts of partners (or even internally, thereby expanding the risk vector surface area) — is that it’s risky. The myriad data breaches that regularly make headlines nowadays are a testament to that.

But that’s not the only commercial consideration in play, as he points out that raw data which has been shared is immediately less valuable — because it can’t be sold again.

“If I give you a data set in its raw form, I can’t sell that to you again — you can take it away, you can slice it and dice it as many ways as you want. You won’t need to come back to me for another three or four years for that same data,” he argues. “From a commercial point of view [what we’re doing] makes the data more valuable. In that the data never actually has to be handed over to the other party.”

Not blockchain for privacy

Decentralization, as a technology approach, is also of course having a major moment right now — thanks to blockchain hype. But InfoSum is definitely not blockchain. Which is a good thing. No sensible person should be trying to put personal data on a blockchain.

“The reality is that all the companies that say they’re doing blockchain for privacy aren’t using blockchain for the privacy part, they’re just using it for a trust model, or recording the transactions that occur,” says Halstead, discussing why blockchain is terrible for privacy.

“Because you can’t use the blockchain and say it’s GDPR compliant or privacy safe. Because the whole transparency part of it and the fact that it’s immutable. You can’t have an immutable database where you can’t then delete users from it. It just doesn’t work.”

Instead he describes InfoSum’s technology as “blockchain-esque” — because “everyone stays holding their data”. “The trust is then that because everyone holds their data, no one needs to give their data to everyone else. But you can still crucially, through our technology, combine the knowledge across those different data sets.”

So what exactly is InfoSum doing to the raw personal data to make it “privacy safe”? Halstead claims it goes “beyond hashing” or encrypting it. “Our solution goes beyond that — there is no way to re-identify any of our data because it’s not ever represented in that way,” he says, further claiming: “It is absolutely 100 per cent data isolation, and we are the only company doing this in this way.

“There are solutions out there where traditional models are pooling it but with encryption on top of it. But again if the encryption gets broken the data is still ending up being in a single silo.”

InfoSum’s approach is based on mathematically modeling users, using a “one way model”, and using that to make statistical comparisons and serve up aggregated insights.

“You can’t read things out of it, you can only test things against it,” he says of how it’s transforming the data. “So it’s only useful if you actually knew who those users were beforehand — which obviously you’re not going to. And you wouldn’t be able to do that unless you had access to our underlying code-base. Everyone else either uses encryption or hashing or a combination of both of those.”

This one-way modeling technique is in the process of being patented, so Halstead says he can’t discuss the “fine details”. But he does mention a long-standing technique for optimizing database communications, called Bloom filters, saying those sorts of “principles” underpin InfoSum’s approach.

Although he also says it’s using those kinds of techniques differently. Here’s how InfoSum’s website describes this process (which it calls Quantum):

InfoSum Quantum irreversibly anonymises data and creates a mathematical model that enables isolated datasets to be statistically compared. Identities are matched at an individual level and results are collated at an aggregate level – without bringing the datasets together.
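Halstead won’t detail the one-way model, but a classic Bloom filter illustrates the general principle he references: values can be tested against the structure, yet the structure cannot be read back out into the raw values. A minimal sketch, illustrative only and not InfoSum’s actual technique:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: supports inserts and membership tests,
    but the bit array cannot be reversed back into the raw values."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, value):
        # Derive several bit positions from salted SHA-256 digests
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        # True means "possibly present" (false positives are possible);
        # False means "definitely absent" -- you can only test, not read out
        return all(self.bits[pos] for pos in self._positions(value))

# Party A models its customer list; Party B can test overlap
# without ever seeing A's raw records (emails here are made up)
model = BloomFilter()
for email in ["alice@example.com", "bob@example.com"]:
    model.add(email)

print(model.might_contain("alice@example.com"))    # True
print(model.might_contain("mallory@example.com"))  # almost certainly False
```

The key property for the “non-movement of data” idea: only the bit array would ever leave the building, and there is no operation that enumerates what went into it.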

On the surface, the approach shares a similar structure to Facebook’s Custom Audiences product, where advertisers’ customer lists are locally hashed and then uploaded to Facebook for matching against its own list of hashed customer IDs — with any matches used to create a custom audience for ad targeting purposes.
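For comparison, that hashed-list matching workflow can be sketched in a few lines (hypothetical emails; SHA-256 stands in for whatever digest a real platform uses):

```python
import hashlib

def hash_customer_list(emails):
    # Each party hashes locally; only the digests leave the building
    return {hashlib.sha256(e.strip().lower().encode()).hexdigest() for e in emails}

advertiser = hash_customer_list(["Alice@example.com", "bob@example.com"])
platform = hash_customer_list(["alice@example.com", "carol@example.com"])

# Matching happens on digests, never on raw emails
matched = advertiser & platform
print(len(matched))  # 1 -- alice matched without exchanging raw addresses
```

Note the weakness Halstead alludes to with his “beyond hashing” claim: because the space of plausible email addresses is small, such hashes can often be reversed with a dictionary attack, which is why hashing alone is not considered true anonymization.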

Though Halstead argues InfoSum’s platform offers more for even this kind of audience building marketing scenario, because its users can use “much more valuable knowledge” to model on — knowledge they would not comfortably share with Facebook “because of the commercial risks of handing over that first person valuable data”.

“For instance if you had an attribute that defined which were your most valuable customers, you would be very unlikely to share that valuable knowledge — yet if you could safely then it would be one of the most potent indicators to model upon,” he suggests.

He also argues that InfoSum users will be able to achieve greater marketing insights via collaborations with other users of the platform vs being a customer of Facebook Custom Audiences — because Facebook simply “does not open up its knowledge”.

“You send them your customer lists, but they don’t then let you have the data they have,” he adds. “InfoSum for many DMPs [data management platforms] will allow them to collaborate with customers so the whole purchasing of marketing can be much more transparent.”

He also emphasizes that marketing is just one of the use-cases InfoSum’s platform can address.

Decentralized bunkers of data

One important clarification: InfoSum customers’ data does get moved — but it’s moved into a “private isolated bunker” of their choosing, rather than being uploaded to a third party.

“The easiest one to use is where we basically create you a 100 per cent isolated instance in Amazon [Web Services],” says Halstead. “We’ve worked with Amazon on this so that we’ve used a whole number of techniques so that once we create this for you, you put your data into it — we don’t have access to it. And when you connect it to the other part we use this data modeling so that no data then moves between them.”

“The ‘bunker’ is… an isolated instance,” he adds, elaborating on how communications with these bunkers are secured. “It has its own firewall, a private VPN, and of course uses standard SSL security. And once you have finished normalising the data it is turned into a form in which all PII [personally identifiable information] is deleted.

“And of course like any other security related company we have had independent security companies penetration test our solution and look at our architecture design.”

Other key pieces of InfoSum’s technology are around data integration and identity mapping — aimed at tackling the (inevitable) problem of data in different databases/datasets being stored in different formats. Which again is one of the commercial reasons why big data silos often stay just that: silos.

Halstead gave TechCrunch a demo showing how the platform ingests and connects data, with users able to use “simple steps” to teach the system what is meant by data types stored in different formats — such as that ‘f’ means the same as ‘female’ for gender category purposes — to smooth the data mapping and “try to get it as clean as possible”.

Once that step has been completed, the user (or collaborating users) can get a view on how well linked their data sets are — and thus glimpse “the start of the art of the possible”.

In practice this means they can choose to run different reports atop their linked datasets — for instance, enriching their data holdings by linking their own users across different products to gain new insights for internal research purposes.

Or, where there are two InfoSum users linking different data sets, they could use it for propensity modeling or lookalike modeling of customers, says Halstead. So, for example, a company could link models of its users with models of the users of a third party that holds richer data on its users, to identify potential new customer types to target marketing at.

“Because I’ve asked to look at the overlap I can literally say I only know the gender of these people but I would also like to know what their income is,” he says, fleshing out another possible usage scenario. “You can’t drill into this, you can’t do really deep analytics — that’s what we’ll be launching later. But Link allows you to get this idea of what would it look like if I combine our datasets.

“The key here is it’s opening up a whole load of industries where sensitivity around doing this — and where, even in industries that share a lot of data already but where GDPR is going to be a massive barrier to it in the future.”
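InfoSum hasn’t published how Link’s matching works internally, but the gender/income example Halstead gives can be sketched as an aggregate-only report. This is purely illustrative: the records are made up, and a plain shared `uid` stands in for the one-way identity matching the platform actually performs:

```python
from collections import Counter

# Hypothetical records held by two parties; in practice the identity
# match would happen via one-way models, not a shared plain key
party_a = {"u1": {"gender": "f"}, "u2": {"gender": "m"}, "u3": {"gender": "f"}}
party_b = {"u1": {"income": "40-60k"}, "u3": {"income": "60-80k"}, "u9": {"income": "20-40k"}}

def aggregate_overlap(a, b, attr_a, attr_b):
    # Only aggregate counts cross the boundary, never row-level data
    counts = Counter()
    for uid in a.keys() & b.keys():
        counts[(a[uid][attr_a], b[uid][attr_b])] += 1
    return dict(counts)

# One count per (gender, income) combination found in the overlap,
# e.g. {('f', '40-60k'): 1, ('f', '60-80k'): 1}
print(aggregate_overlap(party_a, party_b, "gender", "income"))
```

Each side learns only the joint distribution over the overlap — not which individual users matched, which is the property that distinguishes this from conventional data pooling.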

Halstead says he expects big demand from the marketing industry, which is of course having to scramble to rework its processes to ensure they don’t fall foul of GDPR.

“Within marketing there is going to be a whole load of new challenges for companies where they were currently enhancing their databases, buying up large raw datasets and bringing their data into their own CRM. That world’s gone once we’ve got GDPR.

“Our model is safer, faster, and actually still really lets people do all the things they did before but while protecting the customers.”

But it’s not just marketing exciting him. Halstead believes InfoSum’s approach to lifting insights from personal data could be very widely applicable — arguing, for example, that it’s only a minority of use-cases, such as credit risk and fraud within banking, where companies actually need to look at data at an individual level.

One area where he says he’s “very passionate” about InfoSum’s potential is healthcare.

“We believe that this model isn’t just about helping marketing and helping a whole load of others — healthcare especially for us I think is going to be huge. Because [this affords] the ability to do research against health data where the health data has never actually been shared,” he says.

“In the UK especially we’ve had a number of massive false starts where companies have, for very good reasons, wanted to be able to look at health records and combine data — which can turn into vital research to help people. But actually their way of doing it has been about giving out large datasets. And that’s just not acceptable.”

He even suggests the platform could be used for training AIs within the isolated bunkers — flagging a developer interface that will be launching after Link, which will let users query the data via traditional SQL queries.

Though he says he sees most initial healthcare-related demand coming from analytics that need “one or two additional attributes” — such as, for example, comparing health records of people with diabetes with activity tracker data to look at outcomes for different activity levels.

“You don’t need to drill down into individuals to know that the research capabilities could give you incredible results to understand behavior,” he adds. “When you do medical research you need bodies of data to be able to prove things so the fact that we can only work at an aggregate level is not, I don’t think, any barrier to being able to do the kind of health research required.”

Another area he believes could really benefit is M&A — saying InfoSum’s platform could offer companies a way to understand how their user bases overlap before they sign on the line. (It is also of course handling and thus simplifying the legal side of multiple entities collaborating over data sets.)

“There hasn’t been the technology to allow them to look at whether there’s an overlap before,” he claims. “It puts the power in the hands of the buyer to be able to say we’d like to be able to look at what your user base looks like in comparison to ours.

“The problem right now is you could do that manually but if they then backed out there’s all kinds of legal problems because I’ve had to hand the raw data over… so no one does it. So we’re going to change the M&A market for allowing people to discover whether I should acquire someone before they go through to the data room process.”

While Link is something of a taster of what InfoSum’s platform aims to ultimately offer (with this first product priced low but not freemium), the SaaS business it’s intending to get into is data matchmaking — whereby, once it has a pipeline of users, it can start to suggest links that might be interesting for its customers to explore.

“There is no point in us reinventing the wheel of being the best visualization company because there’s plenty that have done that,” he says. “So we are working on data connectors for all of the most popular BI tools that plug in to then visualize the actual data.

“The long term vision for us moves more into being more of an introductory service — i.e. once we’ve got 100 companies in this, how do we help those companies work out what other companies they should be working with.”

“We’ve got some very good systems for — in a fully anonymized way — helping you understand what the intersection is from your data to all of the other datasets, obviously with their permission if they want us to calculate that for them,” he adds.

“The way our investors looked at this, this is the big opportunity going forward. There is no limit, in a decentralized world… imagine 1,000 bunkers around the world in these different corporates who all can start to collaborate. And that’s our ultimate goal — that all of them are still holding onto their own knowledge, 100% privacy safe, but then they have that opportunity to work with each other, which they don’t right now.”

Engineering around privacy risks?

But does he not see any risks to privacy of enabling the linking of so many separate datasets — even with limits in place to avoid individuals being directly outed as connected across different services?

“However many data sets there are, the only thing it can reveal extra is whether every extra data set has an extra bit of knowledge,” he responds on that. “And every party has the ability to define what bit of data they would then want to be open to others to then work on.

“There are obviously sensitivities around certain combinations of attributes, around religion, gender and things like that. Where we already have a very clever permission system where the owners can define what combinations are acceptable and what aren’t.”

“My experience of working with all the social networks has meant — I hope — that we are ahead of the game of thinking about those,” he adds, saying that the matchmaking stage is also six months out at this point.

“I don’t see any downsides to it, as long as the controls are there to be able to limit it. It’s not like it’s going to be a sudden free for all. It’s an introductory service, rather than an open platform so everyone can see everything else.”

The permission system is clearly going to be important. But InfoSum does essentially appear to be heading down the platform route of offloading responsibility for ethical considerations — in its case around dataset linkages — to its customers.

Which does open the door to problematic data linkages down the line, and all sorts of unintended dots being joined.

Say, for example, a health clinic decides to match people with particular medical conditions to users of different dating apps — and the relative proportions of HIV rates across straight and gay dating apps in the local area gets published. What unintended consequences might spring from that linkage being made?

Other equally problematic linkages aren’t hard to imagine. And we’ve seen the appetite businesses have for making creepy observations about their users public.

“Combining two sets of aggregate data meaningfully is not easy,” says Eerke Boiten, professor of cyber security at De Montfort University, discussing InfoSum’s approach. “If they can make this all work out in a way that makes sense, preserves privacy, and is GDPR compliant, then they deserve a patent I suppose.”

On data linkages, Boiten points to the problems Facebook has had with racial profiling as illustrative of the potential pitfalls.

He also says there may be GDPR-specific risks around customer profiling enabled by the platform. In an edge case scenario, for example, where two overlapped datasets are linked and found to have a 100% user match, that would mean people’s personal data had been processed by default — so that processing would have required a legal basis to be in place beforehand.

And there may be wider legal risks around profiling too. If, for example, linkages are used to deny services or vary pricing to certain types or blocks of customers, is that legal or ethical?

“From a company’s perspective, if it already has either consent or a legitimate purpose (under GDPR) to use customer data for analytical/statistical purposes then it can use our products,” says InfoSum’s COO Danvers Baillieu, on data processing consent. “Where a company has an issue using InfoSum as a sub-processor, then… we can set up the system differently so that we simply supply the software and they run it on their own machines (so we are not a data processor), but this is not yet available in Link.”

Baillieu also notes that the bin sizes InfoSum’s platform aggregates individuals into are configurable in its first product. “The default bin size is 10, and the absolute minimum is three,” he adds.
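The bin-size control Baillieu describes is a standard disclosure-control technique: any aggregate bucket with fewer members than the minimum is withheld, so results never single out a handful of individuals. A minimal sketch with hypothetical report data:

```python
def suppress_small_bins(counts, min_bin=10):
    """Drop any aggregate bucket smaller than the configured bin size,
    so a result can never describe fewer than min_bin individuals."""
    return {k: v for k, v in counts.items() if v >= min_bin}

report = {"18-25": 340, "26-35": 512, "66+": 4}

# With the default bin size of 10, the tiny '66+' bucket is withheld
print(suppress_small_bins(report))               # {'18-25': 340, '26-35': 512}

# With the absolute minimum of 3, it survives
print(suppress_small_bins(report, min_bin=3))    # {'18-25': 340, '26-35': 512, '66+': 4}
```

The trade-off is standard too: a larger minimum bin gives stronger protection against singling individuals out, at the cost of discarding more of the tail of the distribution.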

“The other key point around disclosure control is that our system never needs to publish the raw data table. All the famous breaches from Netflix onwards are because datasets have been pseudonymised badly and researchers have been able to run analysis across the visible fields and then figure out who the individuals are — this is simply not possible with our system as this data is never revealed.”

‘Fully GDPR compliant’ is certainly a big claim — and one that is going to have a lot of slings and arrows thrown at it as data gets ingested by InfoSum’s platform.

It’s also fair to say that a whole library of books could be written about technology’s unintended consequences.

Indeed, InfoSum’s own website credits Halstead as the inventor of the embedded retweet button, noting the technology is “something that is now ubiquitous on almost every website in the world”.

Those ubiquitous social plugins are also of course a core part of the infrastructure used to track Internet users almost everywhere they browse. So does he have any regrets about the invention, given how that bit of innovation has ended up being so devastating for digital privacy?

“When I invented it, the driving force for the retweet button was only really as a single number to count engagement. It was never to do with tracking. Our version of the retweet button never had any trackers in it,” he responds on that. “It was the number that drove our algorithms for delivering news in a very transparent way.

“I don’t need to add my voice to all the US pundits of the regrets of the beast that’s been unleashed. All of us feel that desire to unhook from some of these networks now because they aren’t being healthy for us in certain ways. And I certainly feel that what we’re now doing for improving the world of data is going to be good for everyone.”

When we first covered the UK-based startup it was going under the name CognitiveLogic — a placeholder name, as three weeks in, Halstead says, he was still figuring out exactly how to take his idea to market.

The founder of DataSift has not had difficulties raising funding for his new venture. There was an initial $3M from Upfront Ventures and IA Ventures, with the seed topped up by a further $5M last year, with new investors including Saul Klein (formerly Index Ventures) and Mike Chalfen of Mosaic Ventures. Halstead says he’ll be looking to raise “a very large Series A” over the summer.

In the meanwhile he says he has a “very long list” of hundreds of customers wanting to get their hands on the platform to kick its tires. “The last three months have been a whirlwind of me going back to all of the major brands, all of the big data companies — there’s no large corporate that doesn’t have these kinds of challenges,” he adds.

“I saw a very big client this morning… they’re a large multinational, they’ve got three major brands where the three customer sets had never been joined together. So they don’t even know what the overlap of those brands are at the moment. So even giving them that insight would be massively valuable to them.”




Archiving MySQL Tables in ClickHouse

In this blog post, I will talk about archiving MySQL tables in ClickHouse for storage and analytics.

Why Archive?

Hard drives are cheap nowadays, but storing lots of data in MySQL is not practical and can cause all sorts of performance bottlenecks. To name just a few issues:

  1. The larger the table and index, the slower the performance of all operations (both writes and reads)
  2. Backup and restore for terabytes of data is more challenging, and if we need to have redundancy (replication slave, clustering, etc.) we will have to store all the data N times

The answer is archiving old data. Archiving does not necessarily mean that the data will be permanently removed. Instead, the archived data can be placed into long-term storage (e.g., AWS S3) or loaded into a special purpose database that is optimized for storage (with compression) and reporting. The data is then still available if needed.

Actually, there are multiple use cases:

  • Sometimes the data just needs to be stored (i.e., for regulatory purposes) but does not have to be readily available (it’s not “customer facing” data)
  • The data might be useful for debugging or investigation (i.e., application or access logs)
  • In some cases, the data needs to be available for the customer (i.e., historical reports or bank transactions for the last six years)

In all of those cases, we can move the older data away from MySQL and load it into a “big data” solution. Even if the data needs to be available, we can still move it from the main MySQL server to another system. In this blog post, I will look at archiving MySQL tables in ClickHouse for long-term storage and real-time queries.

How To Archive?

Let’s say we have a 650G table that stores the history of all transactions, and we want to start archiving it. How can we approach this?

First, we will need to split this table into “old” and “new”. I assume that the table is not partitioned (partitioned tables are much easier to deal with). For example, if we have data from 2008 (ten years’ worth) but only need to store data from the last two months in the main MySQL environment, then deleting the old data would be challenging. So instead of deleting 99% of the data from a huge table, we can create a new table and load the newer data into that. Then rename (swap) the tables. The process might look like this:

  1. CREATE TABLE transactions_new LIKE transactions
  2. INSERT INTO transactions_new SELECT * FROM transactions WHERE trx_date > now() - interval 2 month
  3. RENAME TABLE transactions TO transactions_old, transactions_new TO transactions

Second, we need to move the transactions_old into ClickHouse. This is straightforward — we can pipe data from MySQL to ClickHouse directly. To demonstrate I will use the Wikipedia:Statistics project (a real log of all requests to Wikipedia pages).

Create a table in ClickHouse:

CREATE TABLE wikistats
(
    id bigint,
    dt DateTime,
    project String,
    subproject String,
    path String,
    hits UInt64,
    size UInt64
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(dt)
ORDER BY dt

0 rows in set. Elapsed: 0.010 sec.

Please note that I’m using the new ClickHouse custom partitioning. It does not require that you create a separate date column, so the table in MySQL can map to the same table structure in ClickHouse.

Now I can “pipe” data directly from MySQL to ClickHouse:

mysql --quick -h localhost wikistats -NBe \
"SELECT concat(id,',\"',dt,'\",\"',project,'\",\"',subproject,'\",\"',path,'\",',hits,',',size) FROM wikistats" | \
clickhouse-client -d wikistats --query="INSERT INTO wikistats FORMAT CSV"

Third, we need to set up a constant archiving process so that the data is removed from MySQL and transferred to ClickHouse. To do that we can use the “pt-archiver” tool (part of Percona Toolkit). In this case, we can first archive to a file and then load that file into ClickHouse. Here is an example:

Remove data from MySQL and load to a file (tsv):

pt-archiver --source h=localhost,D=wikistats,t=wikistats,i=dt --where "dt <= '2018-01-01 0:00:00'"  --file load_to_clickhouse.txt --bulk-delete --limit 100000 --progress=100000
TIME                ELAPSED   COUNT
2018-01-25T18:19:59       0       0
2018-01-25T18:20:08       8  100000
2018-01-25T18:20:17      18  200000
2018-01-25T18:20:26      27  300000
2018-01-25T18:20:36      36  400000
2018-01-25T18:20:45      45  500000
2018-01-25T18:20:54      54  600000
2018-01-25T18:21:03      64  700000
2018-01-25T18:21:13      73  800000
2018-01-25T18:21:23      83  900000
2018-01-25T18:21:32      93 1000000
2018-01-25T18:21:42     102 1100000

Load the file to ClickHouse:

cat load_to_clickhouse.txt | clickhouse-client -d wikistats --query="INSERT INTO wikistats FORMAT TSV"

The newer version of pt-archiver can use a CSV format as well:

pt-archiver --source h=localhost,D=wikitest,t=wikistats,i=dt --where "dt <= '2018-01-01 0:00:00'"  --file load_to_clickhouse.csv --output-format csv --bulk-delete --limit 10000 --progress=10000

How Much Faster Is It?

Actually, queries are much faster in ClickHouse. Even queries that are based on index scans can be much slower in MySQL than in ClickHouse.

For example, in MySQL just counting the number of rows for one year can take 34 seconds (index scan):

mysql> select count(*) from wikistats where dt between '2017-01-01 00:00:00' and '2017-12-31 00:00:00';
+-----------+
| count(*)  |
+-----------+
| 103161991 |
+-----------+
1 row in set (34.82 sec)

mysql> explain select count(*) from wikistats where dt between '2017-01-01 00:00:00' and '2017-12-31 00:00:00'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: wikistats
   partitions: NULL
         type: range
possible_keys: dt
          key: dt
      key_len: 6
          ref: NULL
         rows: 227206802
     filtered: 100.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

In ClickHouse, it only takes 0.062 sec:

:) select count(*) from wikistats where dt between  toDateTime('2017-01-01 00:00:00') and  toDateTime('2017-12-31 00:00:00');
SELECT count(*)
FROM wikistats
WHERE (dt >= toDateTime('2017-01-01 00:00:00')) AND (dt <= toDateTime('2017-12-31 00:00:00'))
┌─count()───┐
│ 103161991 │
└───────────┘
1 rows in set. Elapsed: 0.062 sec. Processed 103.16 million rows, 412.65 MB (1.67 billion rows/s., 6.68 GB/s.)

Size on Disk

In my previous blog post comparing ClickHouse to Apache Spark and MariaDB, I also compared disk size. Usually, we can expect a 5x to 10x decrease in disk size in ClickHouse due to compression. Wikipedia:Statistics, for example, contains actual URIs, which can be quite large due to the article name/search phrase. These compress very well. If we use only integers, or use MD5/SHA1 hashes instead of storing actual URIs, we can expect a much smaller compression ratio (i.e., 3x). Even with 3x compression, it is still pretty good for long-term storage.
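The effect described above is easy to reproduce with a toy experiment: zlib compresses repetitive, structured text (like URI columns) far better than high-entropy strings (like hash columns). The data below is synthetic and illustrative only:

```python
import hashlib
import zlib

# Synthetic "URI" column: repetitive, structured text compresses very well
uris = "\n".join(f"/wiki/Special:Statistics?page={i % 100}" for i in range(10000)).encode()

# The same number of rows as MD5 hex digests: high entropy, compresses poorly
hashes = "\n".join(hashlib.md5(str(i).encode()).hexdigest() for i in range(10000)).encode()

for name, data in [("uri column", uris), ("md5 column", hashes)]:
    ratio = len(data) / len(zlib.compress(data))
    print(f"{name}: {ratio:.1f}x compression")
```

The hex digests compress only by roughly the hex-to-binary factor, while the URI-like column collapses dramatically, which is why real page paths compress so much better than hashed identifiers.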


Conclusion

As the data in MySQL keeps growing, the performance of all queries will keep decreasing. Typically, queries that originally took milliseconds can start taking seconds (or more), and it requires a lot of changes (to the code, MySQL configuration, etc.) to make them fast again.

The main goal of archiving the data is to increase performance (“make MySQL fast again”), decrease costs and improve ease of maintenance (backup/restore, cloning the replication slave, etc.). Archiving to ClickHouse allows you to preserve old data and make it available for reports.


Unravel Data raises $15M Series B for its big data performance monitoring platform

 Big data systems tend to be large, complex and often hard to troubleshoot. In the world of databases, web and mobile stacks, application performance management services like AppDynamics and New Relic help ops teams keep tabs on their system. In the big data world, Unravel Data is one of the few APM players to focus solely on the complete big data stack from ingestion to analysis. Read More


Clairvoyant launches Kogni to help companies track their most sensitive data

As we inch ever closer to GDPR in May, companies doing business in Europe need to start getting a grip on the sensitive private data they have. The trouble is that as companies move their data into data lakes — massive big data stores — it becomes more difficult to find data in a particular category. Clairvoyant, an Arizona company, is releasing a tool called Kogni that could help.


Collibra snags $58M Series D led by Iconiq and Battery Ventures to simplify data governance

 Collibra, a company that wants to help firms understand data governance, announced a $58 million Series D funding round today led by Iconiq Capital and Battery Ventures.
All of the investors involved in this round were coming back for another dip in the well. In addition to Iconiq and Battery Ventures, early Collibra investors Dawn Capital, Index Ventures and Newion Investments also participated.


AWS announces two new EC2 instance types

At the re:Invent customer conference in Las Vegas today, AWS announced two new instance types designed for specific kinds of applications. The first is a generalized EC2 instance designed for developers who are trying to get a feel for the kinds of resources their application might require. These new M5 EC2 instances offer a set of typical resource allocations with optimized compute, memory…


Microsoft makes Databricks a first-party service on Azure

Databricks has made a name for itself as one of the most popular commercial services around the Apache Spark data analytics platform (which, not coincidentally, was started by the founders of Databricks). Now it’s coming to Microsoft’s Azure platform in the form of a preview of the imaginatively named “Azure Databricks.”


Percona Live Open Source Database Conference 2018 Call for Papers Is Now Open!

Announcing the opening of the call for papers for the Percona Live Open Source Database Conference 2018 in Santa Clara, CA. It will be open from now until December 22, 2017.

Our theme is “Championing Open Source Databases,” with topics covering MySQL, MongoDB and other open source databases, including PostgreSQL, time series databases and RocksDB. Session tracks include Developers, Operations and Business/Case Studies.

We’re looking forward to your submissions! We want proposals that cover the many aspects and current trends of using open source databases, including design practices, application development, performance optimization, HA and clustering, cloud, containers and new technologies, as well as new and interesting ways to monitor and manage database environments.

Describe the technical and business values of moving to or using open source databases. How did you convince your company to make the move? Was there tangible ROI? Share your case studies, best practices and technical knowledge with an engaged audience of open source peers.

Possible topics include:

  • Application development. How are you building applications using open source databases to power the data layers? What languages, frameworks and data models help you to build applications that your customers love? Are you using MySQL, MongoDB, PostgreSQL, time series or other databases?  
  • Database performance. What database issues have you encountered while meeting new application and new workload demands? How did they affect the user experience? How did you address them? Are you using WiredTiger or a new storage engine like RocksDB? Have you moved to an in-memory engine? Let us know about the solutions you have found to make sure your applications can get data to users and customers.
  • DBaaS and PaaS. Are you using a Database as a Service (DBaaS) in the public cloud, or have you rolled out your own? Are you on AWS, Google Cloud, Microsoft Azure or RackSpace/ObjectRocket? Are you using a database in a Platform as a Service (PaaS) environment? Tell us how it’s going.
  • High availability. Are your applications a crucial part of your business model? Do they need to be available at all times, no matter what? What database challenges have you come across that impacted uptime, and how did you create a high availability environment to address them?
  • Scalability. Has scaling your business affected database performance, user experience or the bottom line? How are you addressing the database environment workload as your business scales? Let us know what technologies you used to solve issues.
  • Distributed databases. Are you moving toward a distributed model? Why? What is your plan for replication and sharding?
  • Observability and monitoring. How do we design open source database deployment with observability in mind? Are you using Elasticsearch or some other analysis tool? What tools are you using to monitor data? Grafana? Prometheus? Percona Monitoring and Management? How do you visualize application performance trends for maximum impact?
  • Container solutions. Do you use Docker, Kubernetes or other containers in your database environment? What are the best practices for using open source databases with containers and orchestration? Has it worked out for you? Did you run into challenges and how did you solve them?
  • Security. What security and compliance challenges are you facing and how are you solving them?
  • Migrating to open source databases. Did you recently migrate applications from proprietary to open source databases? How did it work out? What challenges did you face, and what obstacles did you overcome? What were the rewards?
  • What the future holds. What do you see as the “next big thing”? What new and exciting features have just been released? What’s in your next release? What new technologies will affect the database landscape? AI? Machine learning? Blockchain databases? Let us know what you see coming.

The Percona Live Open Source Database Conference 2018 Call for Papers is open until December 22, 2017. We invite you to submit your speaking proposal for breakout, tutorial or lightning talk sessions. Share your open source database experiences with peers and professionals in the open source community by presenting a:

  • Breakout Session. Broadly cover a technology area using specific examples. Sessions should be either 25 minutes or 50 minutes in length (including Q&A).
  • Tutorial Session. Present a technical session that aims for a level between a training class and a conference breakout session. Encourage attendees to bring and use laptops for working on detailed and hands-on presentations. Tutorials will be three or six hours in length (including Q&A).
  • Lightning Talk. Give a five-minute presentation focusing on one key point that interests the open source community: technical, lighthearted or entertaining talks on new ideas, a successful project, a cautionary story, a quick tip or demonstration.

Speaking at Percona Live is a great way to build your personal and company brands. If selected, you will receive a complimentary full conference pass!

Submit your talks now.

Tips for Submitting to Percona Live

Include presentation details, but be concise. Clearly state:

  • Purpose of the talk (problem, solution, action format, etc.)
  • Covered technologies
  • Target audience
  • Audience takeaway

Keep proposals free of sales pitches. The Committee is looking for case studies and in-depth technical talks, not ones that sound like a commercial.

Be original! Make your presentation stand out by submitting a proposal that focuses on real-world scenarios, relevant examples, and knowledge transfer.

Submit your proposals as soon as you can – the call for papers is open until December 22, 2017.


ActionIQ nabs $30M led by A16Z to bring big data targeting to marketers

The trend of using big data analytics to glean more targeted insights for your business continues to be democratized, with an increasing number of startups hitting the market to help those who are neither data scientists nor engineers take advantage of these kinds of tools. In the latest development, a startup called ActionIQ — a marketing activation platform that gives marketers better…
