Apr 23, 2018

Percona Live 2018 Featured Talk: Data Integrity at Scale with Alexis Guajardo


Welcome to another interview blog for the rapidly approaching Percona Live 2018. Each post in this series highlights a Percona Live 2018 featured talk at the conference and gives a short preview of what attendees can expect to learn from the presenter.

This blog post highlights Alexis Guajardo, Senior Software Engineer at Google. His talk is titled Data Integrity at Scale. Keeping data safe is the top responsibility of anyone running a database. In this session, he dives into Cloud SQL’s storage architecture to demonstrate how they check data down to the disk level:

Percona: Who are you, and how did you get into databases? What was your path to your current responsibilities?

Alexis: I am a Software Engineer on the Cloud SQL team with Google Cloud. I got into databases by using FileMaker. However, the world of database technology has changed many times over since then.

Percona: Your session is titled “Data Integrity at Scale”. Has the importance of data integrity increased over time? Why?

Alexis: Data integrity has always been vital to databases and data in general. The most common method is using checksum validation to ensure data integrity. The challenge that we faced at Cloud SQL on Google Cloud was how to do this for two very popular open source database solutions, and how to do it at scale. The story for MySQL was a bit more straightforward, because of innochecksum. PostgreSQL required our team to create a utility, which is open sourced. The complicated aspect of data corruption is that sometimes it is dormant and discovered at a most inopportune time. What we have instituted are frequent checks for corruption of the entire data set, so if there is a software bug or other issue, we can mitigate it as soon as possible.
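To make the checksum idea concrete, here is a minimal sketch of page-level verification, written in Python for illustration. It is not Cloud SQL's implementation, innochecksum, or the PostgreSQL utility mentioned above; the page size, file path and use of CRC32 are assumptions chosen only to show the principle of recording checksums and re-checking them on a schedule.

import binascii

PAGE_SIZE = 16 * 1024  # assumed fixed page size for this illustration

def page_checksums(path):
    """Yield (page_number, crc32) for every fixed-size page in a data file."""
    with open(path, "rb") as f:
        page_no = 0
        while True:
            page = f.read(PAGE_SIZE)
            if not page:
                break
            yield page_no, binascii.crc32(page) & 0xFFFFFFFF
            page_no += 1

def verify(path, known_good):
    """Return the pages whose freshly computed checksum no longer matches."""
    return [(no, crc) for no, crc in page_checksums(path)
            if known_good.get(no) != crc]

# Record a baseline after a clean shutdown, then re-verify periodically:
# baseline = dict(page_checksums("/var/lib/mysql/mydb/mytable.ibd"))
# print(verify("/var/lib/mysql/mydb/mytable.ibd", baseline))  # [] means no corruption found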

Percona: How does scaling affect the ability to maintain data integrity?

Alexis: There is a benefit to working on a team that provides a public cloud. Since Google Cloud is not bound by most restrictions that an individual or company would be, we can allocate resources to do data integrity verifications without restriction. If I were to implement a similar system at a smaller company, most likely there would be cost and resource restrictions. However, data integrity is a feature that Google Cloud provides.

Percona: What are three things a DBA should know about ensuring data integrity?

Alexis: I think that the three things can be simplified down to three words: verify your backups.

Even if someone does not use Cloud SQL, it is vital to take backups, maintain them and verify them. Having terabytes of backups, but without verification, leaves open the possibility that a software bug or hardware issue somehow corrupted a backup.

Percona: Why should people attend your talk? What do you hope people will take away from it? 

Alexis: I would say the main reason to attend my talk is to discover more about Cloud SQL. As a DBA or developer, having a managed database-as-a-service solution takes away a lot of the minutiae. But there are still the tasks of improving queries and creating applications. However, having reliable and verified backups is vital. With the addition of high availability and the ability to scale up easily, Cloud SQL’s managed database solution makes life much easier.

Percona: What are you looking forward to at Percona Live (besides your talk)?

Alexis: The many talks about Vitess look very interesting. It is also an open source Google technology, and seeing its adoption by many companies and how they have benefited from its use will be interesting.

Want to find out more about this Percona Live 2018 featured talk, and data integrity at scale? Register for Percona Live 2018 and see Alexis’ talk, Data Integrity at Scale. Register now to get the best price! Use the discount code SeeMeSpeakPL18 for 10% off.

Percona Live Open Source Database Conference 2018 is the premier open source event for the data performance ecosystem. It is the place to be for the open source community. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Open Source Database Conference will be April 23-25, 2018 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.


Apr 21, 2018

In the NYC enterprise startup scene, security is job one

While most people probably would not think of New York as a hotbed for enterprise startups of any kind, it is actually quite active. When you stop to consider that the world’s biggest banks and financial services companies are located there, it would certainly make sense for security startups to concentrate on such a huge potential market — and it turns out, that’s the case.

According to Crunchbase, there are dozens of security startups based in the city, covering everything from biometrics and messaging security to identity, security scoring and graph-based analysis tools. There are also established companies like Symphony, which originally launched in the city (although it is now on the west coast) and has raised almost $300 million. It was formed by a consortium of the world’s biggest financial services companies back in 2014 to create a secure unified messaging platform.

There is a reason such a broad-based ecosystem is based in a single place. The companies that want to discuss these kinds of solutions aren’t based in Silicon Valley. This isn’t typically a case of startups selling to other startups. These are startups that have been established in New York because that’s where their primary customers are most likely to be.

In this article, we look at a few promising early-stage security startups based in Manhattan.

Hypr: Decentralizing identity

Hypr is looking at decentralizing identity with the goal of making it much more difficult to steal credentials. As company co-founder and CEO George Avetisov puts it, the idea is to get rid of the credentials honeypot sitting on the servers at most large organizations and move the identity processing to the device.

Hypr lets organizations remove stored credentials from the logon process. Photo: Hypr

“The goal of these companies in moving to decentralized authentication is to isolate account breaches to one person,” Avetisov explained. When you get rid of that centralized store, and move identity to the devices, you no longer have to worry about an Equifax scenario because the only thing hackers can get is the credentials on a single device — and that’s not typically worth the time and effort.

At its core, Hypr is an SDK. Developers can tap into the technology in their mobile app or website to force the authorization to the device. This could be using the fingerprint sensor on a phone or a security key like a Yubikey. Secondary authentication could include taking a picture. Over time, customers can delete the centralized storage as they shift to the Hypr method.

The company has raised $15 million and has 35 employees based in New York City.

Uplevel Security: Making connections with graph data

Uplevel’s founder Liz Maida began her career at Akamai where she learned about the value of large data sets and correlating that data to events to help customers understand what was going on behind the scenes. She took those lessons with her when she launched Uplevel Security in 2014. She had a vision of using a graph database to help analysts with differing skill sets understand the underlying connections between events.

“Let’s build a system that allows for correlation between machine intelligence and human intelligence,” she said. If the analyst agrees or disagrees, that information gets fed back into the graph, and the system learns over time the security events that most concern a given organization.

“What is exciting about [our approach] is you get a new alert and build a mini graph, then merge that into the historical data, and based on the network topology, you can start to decide if it’s malicious or not,” she said.

Photo: Uplevel

The company hopes that by providing a graphical view of the security data, it can help all levels of security analysts figure out the nature of the problem, select a proper course of action, and further build the understanding and connections for future similar events.

Maida said they took their time creating all aspects of the product: making the front end attractive, making the underlying graph database and machine learning algorithms as useful as possible, and allowing companies to get up and running quickly. Making it “self serve” was a priority, partly because they wanted customers digging in quickly and partly because, with only 10 people, they didn’t have the staff to do a lot of hand-holding.

Security Scorecard: Offering a way to measure security

The founders of Security Scorecard met while working at the NYC e-commerce site Gilt. For a time, e-commerce and adtech ruled the startup scene in New York, but more recently enterprise startups have really started to come into their own. Part of the reason for that is many people started at these foundational startups, and when they started their own companies, they were looking to solve the kinds of enterprise problems they had encountered along the way. In the case of Security Scorecard, the problem was how a CISO could reasonably measure the security of a company they were buying services from.

Photo: Security Scorecard

“Companies were doing business with third-party partners. If one of those companies gets hacked, you lose. How do you vet the security of companies you do business with?” company co-founder and CEO Aleksandr Yampolskiy asked when they were forming the company.

They created a scoring system based on publicly available information, which wouldn’t require the companies being evaluated to participate. Armed with this data, they could apply a letter grade from A to F. As a former CISO at Gilt, Yampolskiy had felt this pain point personally. They knew some companies did undertake serious vetting, but it was usually via a questionnaire.

Security Scorecard was offering a way to capture security signals in an automated way and see at a glance just how well their vendors were doing. It doesn’t stop with the simple letter grade though, allowing you to dig into the company’s strengths and weaknesses and see how they compare to other companies in their peer groups and how they have performed over time.

It also gives customers the ability to see how they compare to peers in their own industry, and to use the number to brag about their security position or, conversely, to ask for more budget to improve it.

The company launched in 2013 and has raised over $62 million, according to Crunchbase. Today, they have 130 employees and 400 enterprise customers.

If you’re an enterprise security startup, you need to be where the biggest companies in the world do business. That’s New York City, and that’s precisely why these three companies, and dozens of others, have chosen to call it home.

Apr 18, 2018

Stripe debuts Radar anti-fraud AI tools for big businesses, says it has halted $4B in fraud to date

Cybersecurity continues to be a growing focus and problem in the digital world, and now Stripe is launching a new paid product that it hopes will help its customers better battle one of the bigger side-effects of data breaches: online payment fraud. Today, Stripe is announcing Radar for Fraud Teams, an expansion of its free AI-based Radar service that runs alongside Stripe’s core payments API to help identify and block fraudulent transactions.

And there are further efforts that Stripe is planning in coming months. Michael Manapat, Stripe’s engineering manager for Radar and machine learning, said the company is going to soon launch a private beta of a “dynamic authentication” product that will bring in two-factor authentication. This is on top of Stripe’s first forays into using biometric factors in payments, made via partners like Apple and Google. With these and others, fingerprints and other physical attributes have become increasingly popular ways to identify mobile and other users.

The initial iteration of Radar launched in October 2016, and since then, Manapat tells me that it has prevented $4 billion in fraud for its “hundreds of thousands” of customers.

Considering the wider scope of how much e-commerce is affected by fraud — one study estimates $57.8 billion in e-commerce fraud across eight major verticals in a one-year period between 2016 and 2017 — this is a decent dent, but there is a lot more work to be done. And Stripe, which has seen four out of every five payment card numbers globally on account of the ubiquity of its payments API, is in a strong position to tackle it.

The new paid product comes alongside an update to the core, free product that Stripe is dubbing Radar 2.0, which Stripe claims will have more advanced machine learning built into it and can therefore up its fraud detection by some 25 percent over the previous version.

New features for the whole product (free and paid) will include being able to detect when a proxy VPN is being used (which fraudsters might use to appear like they are in one country when they are actually in another) and ingesting billions of data points to train its model, which is now being updated on a daily basis automatically — itself an improvement on the slower and more manual system that Manapat said Stripe has been using for the past couple of years.

Meanwhile, the paid product is an interesting development.

At the time of the original launch, Stripe co-founder John Collison hinted that the company would be considering a paid product down the line. Stripe has said multiple times that it’s in no rush to go public — a statement that a spokesperson reiterated this week — but it’s notable that a paid tier is a sign of how Stripe is slowly building up more monetization and revenue generation.

Stripe is valued at around $9.2 billion as of its last big round in 2016. Most recently, it raised $150 million in that November 2016 round. A $44 million figure from March of this year, noted in PitchBook, was actually stock issued in connection with its quiet acquisition of point-of-sale payments startup Index that month — incidentally another interesting move for Stripe to expand its position and placement in the payments ecosystem. Stripe has raised around $450 million in total.

The Teams product, aimed at businesses that are big enough to have dedicated fraud detection staff, will be priced at an additional $0.02 per transaction, on top of Stripe’s basic transaction fees of a 2.9 percent commission plus 30 cents per successful card charge in the U.S. (fees vary in other markets).
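For a concrete sense of what that pricing means per transaction, here is a tiny worked example using the U.S. figures quoted above (a sketch only; actual fees vary by market and by account):

def fee_with_radar_teams(amount_usd, pct=0.029, fixed=0.30, radar_teams=0.02):
    """Cost of a successful U.S. card charge using the rates quoted in this article."""
    return amount_usd * pct + fixed + radar_teams

# A $100 charge: 2.9% ($2.90) + $0.30 + $0.02 = $3.22
print(round(fee_with_radar_teams(100.00), 2))  # 3.22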

The chief advantage of taking the paid product will be that teams will be able to customise how Radar works with their own transactions.

This will include a more complete set of data for teams that review transactions, and a more granular set of tools to determine where and when sales are reviewed, for example based on usage patterns or the size of the transaction. There is already a set of flags that work to note when a card is used in frequent succession across disparate geographies; but Manapat said that newer details, such as analysing the speed at which payment details are entered and purchases are made, will now also factor into how it flags transactions for review.

Similarly, teams will be able to determine the value at which a transaction needs to be flagged. This is the online equivalent of how certain in-person purchases require you to enter a PIN or provide a signature to seal the deal, while others waive that requirement. (And it’s interesting to see that some e-commerce operations are potentially allowing some dodgy sales to happen simply to keep up the user experience for the majority of legitimate transactions.)

Users of the paid product will also now be able to use Radar to help with their overall management of how they handle fraud. This will include being able to keep lists of attributes, names and numbers that are scrutinised, and to check against them with analytics also created by Stripe to help identify trending issues and to plan anti-fraud activities going forward.

Updated with further detail about Stripe’s funding.

Apr 05, 2018

Google Cloud gives developers more insights into their networks

Google Cloud is launching a new feature today that will give its users a new way to monitor and optimize how their data flows between their servers in the Google Cloud and other Google Services, on-premises deployments and virtually any other internet endpoint. As the name implies, VPC Flow Logs are meant for businesses that already use Google’s Virtual Private Cloud features to isolate their resources from other users.

VPC Flow Logs monitors and logs all the network flows (both UDP and TCP) that are sent from and received by the virtual machines inside a VPC, including traffic between Google Cloud regions. All of that data can be exported to Stackdriver Logging or BigQuery, if you want to keep it in the Google Cloud, or you can use Cloud Pub/Sub to export it to other real-time analytics or security platforms. The data updates every five seconds and Google promises that using this service has no impact on the performance of your deployed applications.
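As a sketch of the Pub/Sub export path described above, the snippet below pulls exported log entries from a subscription using the google-cloud-pubsub client. The project and subscription names are placeholders, and the exact fields inside each entry depend on how the export sink is configured; treat this as an outline rather than a reference implementation.

import json
from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

PROJECT_ID = "my-project"             # placeholder
SUBSCRIPTION = "vpc-flow-logs-sub"    # placeholder: subscription on the export topic

def handle(message):
    # Each Pub/Sub message carries one exported log entry as JSON.
    entry = json.loads(message.data.decode("utf-8"))
    print(entry.get("jsonPayload", entry))  # inspect the flow record
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION)
streaming_pull = subscriber.subscribe(path, callback=handle)
streaming_pull.result()  # block and keep streaming messages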

As the company notes in today’s announcement, this will allow network operators to get far more insight into the details of how the Google network performs and to troubleshoot issues if they arise. In addition, it will allow them to optimize their network usage and costs by giving them more information about their global traffic.

All of this data is also quite useful for performing forensics when it looks like somebody may have gotten into your network, too. If that’s your main use case, though, you probably want to export your data to a specialized security information and event management (SIEM) platform from vendors like Splunk or ArcSight.

Apr 02, 2018

MongoDB Data at Rest Encryption Using eCryptFS

In this post, we’ll look at MongoDB data at rest encryption using eCryptFS, and how to deploy a MongoDB server using encrypted data files.

When dealing with data, a good security policy should enforce the use of non-trivial passwords, the use of encrypted connections and, hopefully, encrypted files on the disks.

Only the MongoDB Enterprise edition has an “engine encryption” feature. The Community edition and Percona Server for MongoDB don’t (yet). This is why I’m going to introduce a useful way to achieve data encryption at rest for MongoDB, using a simple but effective tool: eCryptFS.

eCryptFS is an enterprise-class stacked cryptographic filesystem for Linux. You can use it to encrypt partitions, or even any folder that doesn’t use a partition of its own, no matter the underlying filesystem or partition type. For more information about this tool, visit the official website: http://ecryptfs.org/.

I’m using Ubuntu 16.04 and I have Percona Server for MongoDB already installed on the system. The data directory (dbpath) is in /var/lib/mongodb.

Preparation of the encrypted directory

First, let’s stop mongod if it’s running:

sudo service mongod stop

Install eCryptFS:

sudo apt-get install ecryptfs-utils

Create two new directories:

sudo mkdir /datastore
sudo mkdir /var/lib/mongo-encrypted

We’ll use the /datastore directory as the folder where we copy all of the mongod files, and have them automatically encrypted. It’s also useful to test later that everything is working correctly. The folder /var/lib/mongo-encrypted is the mount point we’ll use as the new data directory for mongod.

Mount the encrypted directory

Now it’s time to use eCryptFS to mount the /datastore folder and define it as encrypted. Launch the command as follows, choose a passphrase, and respond to all the questions with the default proposed value. In a real case, choose the answers that best fit your environment, and a complex passphrase:

root@psmdb1:~# sudo mount -t ecryptfs /datastore /var/lib/mongo-encrypted
Passphrase:
Select cipher:
1) aes: blocksize = 16; min keysize = 16; max keysize = 32
2) blowfish: blocksize = 8; min keysize = 16; max keysize = 56
3) des3_ede: blocksize = 8; min keysize = 24; max keysize = 24
4) twofish: blocksize = 16; min keysize = 16; max keysize = 32
5) cast6: blocksize = 16; min keysize = 16; max keysize = 32
6) cast5: blocksize = 8; min keysize = 5; max keysize = 16
Selection [aes]:
Select key bytes:
1) 16
2) 32
3) 24
Selection [16]:
Enable plaintext passthrough (y/n) [n]:
Enable filename encryption (y/n) [n]:
Attempting to mount with the following options:
 ecryptfs_unlink_sigs
 ecryptfs_key_bytes=16
 ecryptfs_cipher=aes
 ecryptfs_sig=f946e4b85fd84010
Mounted eCryptfs

If you see Mounted eCryptfs as the last line, everything went well. Now you have the folder /datastore encrypted. Any file you create or copy into this folder is automatically encrypted by eCryptFS. Also, you have mounted the encrypted folder into the path /var/lib/mongo-encrypted.

For the sake of security, you can verify with the mount command that the directory is correctly mounted. You should see something similar to the following:

root@psmdb1:~# sudo mount | grep crypt
/datastore on /var/lib/mongo-encrypted type ecryptfs (rw,relatime,ecryptfs_sig=f946e4b85fd84010,ecryptfs_cipher=aes,ecryptfs_key_bytes=16,ecryptfs_unlink_sigs)

Copy mongo files

sudo cp -r /var/lib/mongodb/* /var/lib/mongo-encrypted

This copies all the files from the existing MongoDB data directory into the new path.

Since we are working as root (or we used sudo -s at the beginning), we need to change the ownership of the files to the mongod user, the default user for the database server. Otherwise, mongod won’t start:

sudo chown -R mongod:mongod /var/lib/mongo-encrypted/

Modify mongo configuration

Before starting mongod, we have to change the configuration in /etc/mongod.conf to instruct the server to use the new folder. Change the line with dbpath as follows and save the file:

dbpath=/var/lib/mongo-encrypted

Launch mongod and verify

So, it’s time to start mongod, connect with the mongo shell and verify that it’s working as usual:

root@psmdb1:~# sudo service mongod start

The server works correctly and is unaware of the encrypted files, because eCryptFS itself takes care of encryption and decryption at a lower level. There’s a small price to pay in terms of performance, as in every system that uses encryption, but we won’t worry about that since our first goal is security. In any case, eCryptFS has a fairly small footprint.
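If you prefer to script this check rather than type commands into the mongo shell, a quick ping from Python works too. This is a minimal sketch assuming pymongo is installed and mongod is listening on the default port:

from pymongo import MongoClient  # pip install pymongo

client = MongoClient("localhost", 27017)
# The ping command round-trips to the server; an exception here usually means
# mongod did not start correctly with the new encrypted dbpath.
print(client.admin.command("ping"))   # {'ok': 1.0}
print(client.list_database_names())   # the existing databases should still be visible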

Now, let’s verify the files directly.

Since the encrypted folder is mounted and automatically managed by eCryptFS, we can see the content of the files. Let’s have a look:

root@psmdb1:~# cat /var/lib/mongo-encrypted/mongod.lock
6965

But if we look at the same file in /datastore, we see only garbled characters:

root@psmdb1:~# cat /datastore/mongod.lock
?0???k?"3DUfw`?Pp?Ku?????b?_CONSOLE?F?_?@??[?'?b??^??fZ?7

As expected.

Make encrypted dbpath persistent

Finally, if you want to automatically mount the encrypted directory at startup, add the following line into /etc/fstab:

/datastore /var/lib/mongo-encrypted ecryptfs defaults 0 0

Create the file .ecryptfsrc in the /root directory with the following lines:

key=passphrase:passphrase_passwd_file=/root/passphrase.txt
ecryptfs_sig=f946e4b85fd84010
ecryptfs_cipher=aes
ecryptfs_key_bytes=16
ecryptfs_passthrough=n
ecryptfs_enable_filename_crypto=n

You can find the value of the variable ecryptfs_sig in the file /root/.ecryptfs/sig-cache.txt.

Create the file /root/passphrase.txt containing your secret passphrase. The format is as follows:

passphrase_passwd=mypassphrase

Now you can reboot the system and have the encrypted directory mounted at startup.

Tip: it is not a good idea to keep a plain text file containing your passphrase on the server. For a better level of security, you can place this file on a USB key (for example) that you mount at startup, or you can use some sort of wallet tool to protect your passphrase.

Conclusion

Security is more and more a “must have” that customers are requesting of anyone managing their data. This how-to guide shows that achieving data at rest encryption for MongoDB is not so complicated.


Mar 16, 2018

Cloud security startup Zscaler opens at $27.50, a pop of 72% on Nasdaq, raising $192M in its IPO

The first post-billion-dollar big tech IPO of the year has opened with a bang. Zscaler, a security startup that confidentially filed for an IPO last year, started trading this morning as ZS on Nasdaq at a price of $27.50/share. This was a pop of 71.9 percent over its IPO price of $16, and speaks to a bullish moment for security startups and potentially for public listings by tech companies in general.

That could bode well for Dropbox, Spotify and others that are planning or considering public listings in the coming weeks and months.

As of 3:45PM Eastern time, the stock has gone significantly higher and has just reached a peak of $30.61 as it approaches the end of its first day of trading. We’ll continue to monitor the price as the day continues to see how the stock does, and also hear from the company itself.

Initially, Zscaler had expected to sell 10 million shares at a range of between $10 and $12 per share, but interest led the company to expand that to 12 million shares at a $13-$15 range, which then moved up to $16. Zscaler last night raised $192 million, giving it a valuation of over $1.9 billion — a sign of strong interest in the investor community that it’s now hoping will follow through in its debut and beyond.

Zscaler is a specialist in an area called software-defined perimeter (SDP) services, which allow enterprises and other organizations to better control how they allow employees to access apps and specific services within their IT networks: the idea is that rather than giving access to the full network, employees are authenticated just for the apps that they specifically need for their work.

SDP architectures have become increasingly popular in recent years as a way of better mitigating security threats in networks where employees are using a variety of devices, including their own private mobile phones, to access data and apps in corporate networks and in the cloud — both of which have become routes for malicious hackers to breach systems.

SDP services are being adopted by the likes of Google, and are being built by a number of other tech companies, both those that are looking to provide more value-added services around existing cloud or other IT offerings, and those that are already playing in the area of security, including Cisco, Check Point Software, EMC, Fortinet, Intel, Juniper Networks, Palo Alto Networks, Symantec (which has been involved in IP lawsuits with Zscaler) and more — which speaks to both the opportunity and the challenge in the market for Zscaler. Estimates of the value of the market range from $7.8 billion to $11 billion by 2023.

Mar 14, 2018

TypingDNA launches Chrome extension that verifies your identity based on typing

TypingDNA has a new approach to verifying your identity based on how you type.

The startup, which is part of the current class at Techstars NYC, is pitching this as an alternative to two-factor authentication — namely, the security feature that sends unique codes to a separate device (usually your phone) to make sure someone else isn’t logging in with your password.

The problem with two-factor? TypingDNA’s Raul Popa put it simply: “It’s a bad user experience … Nobody wants to use a different device.” (I know that TechCrunch writers have had two-factor issues of their own, like when they’re trying to log in on an airplane and can’t connect their phone.)

So TypingDNA allows users to verify their identity without having to whip out their phone. Instead, they just enter their name and password into a window, then TypingDNA will analyze their typing and confirm that it’s really them.


The startup’s business model revolves around working with partners to incorporate the technology, but it’s also launching a free Chrome extension that works as an alternative to two-factor authentication on a wide range of services, including Amazon Web Services, Coinbase and Gmail.

Popa said TypingDNA measures two key aspects of your typing: How long it takes you to reach a key and how long you keep the key pressed down. Apparently these patterns are unique; Popa showed me that the system could tell the difference between his typing and mine, and you can test it out for yourself on the TypingDNA website.
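Those two measurements are often called dwell time (how long a key stays pressed) and flight time (the gap between releasing one key and pressing the next). The sketch below shows how such features could be computed from raw key events; it is purely illustrative and is not TypingDNA’s actual model, whose details are proprietary.

from dataclasses import dataclass

@dataclass
class KeyEvent:
    key: str
    down_ms: float  # timestamp of keydown, in milliseconds
    up_ms: float    # timestamp of keyup, in milliseconds

def typing_features(events):
    """Return per-key dwell times and between-key flight times."""
    dwell = [e.up_ms - e.down_ms for e in events]
    flight = [b.down_ms - a.up_ms for a, b in zip(events, events[1:])]
    return dwell, flight

sample = [KeyEvent("p", 0, 95), KeyEvent("a", 180, 260), KeyEvent("s", 340, 430)]
print(typing_features(sample))  # ([95, 80, 90], [85, 80])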

He also said that the company can adjust the strictness of the system, getting the rate of false positives as low as 0.1 percent. In the case of the Chrome authenticator, Popa said, “We minimize the false acceptance rate” — so you might get rejected if you’re typing in an unusual position, or if there’s some other reason you’re typing slower or faster than usual. But in that case, the authenticator will just ask you to try again.

And again, you can use the Chrome extension on a variety of sites. Most two-factor options allow you to confirm a device using a QR code, which TypingDNA can grab. The two-factor codes are then sent to the TypingDNA extension (the codes are stored locally on your computer, not on the company’s servers), and they’re revealed once you’ve verified your identity with the aforementioned typing.

You can visit TypingDNA to learn more and download the extension.

Mar 13, 2018

Don’t Get Hit with a Database Disaster: Database Security Compliance


In this post, we discuss database security compliance, what you should be looking at and where to get more information.

As Percona’s Chief Customer Officer, I get the opportunity to talk with a lot of customers. Hearing first-hand about both the problems their technical teams face and the business challenges their companies experience is incredibly valuable for understanding what the market is facing in general. Not every problem you see has a purely technical solution, and not every good technical solution solves the core business problem.

Matt Yonkovit, Percona CCO

As database technology advances and data continues to be the lifeblood of most modern applications, DBAs will have a say in business-level strategic planning more than ever. This coincides with the advances in technology and automation that make many classic manual “DBA” jobs and tasks obsolete. Traditional DBAs are evolving into a blend of system architect, data strategist and master database architect. I want to talk about the business problems that not only the C-suite cares about, but that DBAs as a whole need to care about in the near future.

Let’s start with one topic everyone should have near the top of their list: security.

We did a recent survey of our customers, and their biggest concern right now is security and compliance.

Not long ago, most DBAs I knew dismissed this topic as “someone else’s problem” (I remember being told that the database is only as secure as the network, so fix the network!). Long gone are the days when network security was enough. Even the DBAs who did worry about security only did so within the limited scope of what the database system could provide out of the box. Again, not enough.

So let me run an experiment:

Raise your hand if your company has some bigger security initiative this year. 

I’m betting a lot of you raised your hand!

Security is not new to the enterprise. It’s been a priority for years now. However, it did not receive a hyper-focus in the open source database space until the last three years or so. Why? There have been a number of high-profile database security breaches in the last year, all highlighting the need for better database security. This series of serious data breaches has exposed how fragile some companies’ security protocols are. If that was not enough, new government regulations and laws have made data protection non-optional. This means you have to take the security of your database seriously, or there could be fines and penalties.

Government regulations are nothing new, but the breadth and depth of these are growing and are opening up a whole new challenge for database systems and administrators. GDPR was signed into law two years ago (you can read more here: https://en.wikipedia.org/wiki/General_Data_Protection_Regulation and https://www.dataiq.co.uk/blog/summary-eu-general-data-protection-regulation) and is scheduled to take effect on May 25, 2018. This has many businesses scrambling not only to understand the impact, but to figure out how they need to comply. These regulations redefine simple things, like what constitutes “personal data” (for instance, your anonymous buying preferences or location history, even without your name).

New requirements also mean some areas get a bit more complicated as they approach the gray area of definition. For instance, GDPR guarantees the right to be forgotten. What does this mean? In theory, it means end-users can request that all their personal information is removed from your systems as if they did not exist. Seems simple, but in reality, you can go as far down the rabbit hole as you want. Does your application support this already? What about legacy applications? Even if the apps can handle it, does this mean previously taken database backups have to forget you as well? There is a lot to process for sure.

So what are the things you can do?

  1. Educate yourself and understand expectations, even if you weren’t involved in compliance discussions before.
  2. Start working on incremental improvements to your data security now. This is especially true in the areas where you have some control, without massive changes to the application. Encryption at rest is a great place to start if you don’t have it.
  3. Start talking with others in the organization about how to identify and protect personal information.
  4. Look to increase security by default by getting involved in new applications early in the design phase.

The good news is you are not alone in tackling this challenge. Every company must address it. Because of this focus on security, we felt strongly about ensuring we had a security track at Percona Live 2018 this year. These talks from Fastly, Facebook, Percona, and others provide information on how companies around the globe are tackling these security issues. In true open source fashion, we are better when we learn and grow from one another.

What are the Percona Live 2018 security talks?

We have a ton of great security content this year at Percona Live, across a bunch of technologies and open source software. Some of the more interesting Percona Live 2018 security talks are:

Want to attend Percona Live 2018 security talks? Register for Percona Live 2018. Register now to get the best price! Use the discount code SeeMeSpeakPL18 for 10% off.


Mar 09, 2018

This Week in Data with Colin Charles 31: Meltdown/Spectre Performance Regressions and Percona Live 2018


Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Have you been following the Meltdown/Spectre performance regressions? Some of the best blog posts have been coming from Brendan Gregg, who’s keynoting at Percona Live this year. We’ve also got Scott Simpson from Upwork giving a keynote about how and why they use MongoDB. This is in addition to all the other fun talks we have, so please register now. Don’t forget to also book your hotel room!

Even though the Percona Live conference now covers much more than just MySQL, it’s worth noting that the MySQL Community Awards 2018: Call for Nominations! is happening now. You have until Friday, March 15, 2018, to make a nomination. Winners get into the Hall of Fame. Yes, I am also on the committee to make selections.

Another highlight: Open-sourcing a 10x reduction in Apache Cassandra tail latency by Dikang Gu of Instagram (Facebook). This is again thanks to RocksDB. Check out Rocksandra, and don’t forget to register for Percona Live to see the talk: Cassandra on RocksDB.

This week, I spent some time at Percona headquarters in Raleigh, North Carolina. The building is pictured well from the outside in Google Maps. I thought it might be fun to show you a few photos (the office is huge, with quite a handful of people working there, despite the fact that Percona is largely remote).

Peter at Percona Headquarters
Percona awards and bookshelf, featuring some very antique encyclopedias.

 

Peter at Percona Headquarters 2
Peter Zaitsev, Percona CEO, outside his office (no, it is not an open office plan – everyone has rooms, including visitors like myself).

 

We’re all at SCALE16x now – so come see our talks (Peter Zaitsev and I are both speaking), and we have a booth where you can say hello to Rick Golba, Marc Sherwood and Dave Avery.

Releases

Link List

Upcoming appearances

  • SCALE16x – Pasadena, California, USA – March 8-11 2018
  • FOSSASIA 2018 – Singapore – March 22-25 2018

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

Mar 09, 2018

InfoSum’s first product touts decentralized big data insights

Nick Halstead’s new startup, InfoSum, is launching its first product today — moving one step closer to his founding vision of a data platform that can help businesses and organizations unlock insights from big data silos without compromising user privacy, data security or data protection law. So a pretty high bar then.

If the underlying tech lives up to the promises being made for it, the timing for this business looks very good indeed, with the European Union’s new General Data Protection Regulation (GDPR) mere months away from applying across the region — ushering in a new regime of eye-wateringly large penalties to incentivize data handling best practice.

InfoSum bills its approach to collaboration around personal data as fully GDPR compliant — because it says it doesn’t rely on sharing the actual raw data with any third parties.

Rather, a mathematical model is used to make a statistical comparison, and the platform delivers aggregated — but still, says Halstead — useful insights. Though he says the regulatory angle is fortuitous, rather than the full inspiration for the product.

“Two years ago, I saw that the world definitely needed a different way to think about working on knowledge about people,” he tells TechCrunch. “Both for privacy [reasons] — there isn’t a week where we don’t see some kind of data breach… they happen all the time — but also privacy isn’t enough by itself. There has to be a commercial reason to change things.”

The commercial imperative he reckons he’s spied is around how “unmanageable” big data can become when it’s pooled for collaborative purposes.

Datasets invariably need a lot of cleaning up to make different databases align and overlap. And the process of cleaning and structuring data so it can be usefully compared can run to multiple weeks. Yet that effort has to be put in before you really know if it will be worth your while doing so.

That snag of time + effort is a major barrier preventing even large companies from doing more interesting things with their data holdings, argues Halstead.

So InfoSum’s first product — called Link — is intended to give businesses a glimpse of the “art of the possible”, as he puts it — in just a couple of hours, rather than the “nine, ten weeks” he says it might otherwise take them.

“I set myself a challenge… could I get through the barriers that companies have around privacy, security, and the commercial risks when they handle consumer data. And, more importantly, when they need to work with third parties or need to work across their corporation where they’ve got numbers of consumer data and they want to be able to look at that data and look at the combined knowledge across those.

“That’s really where I came up with this idea of non-movement of data. And that’s the core principle of what’s behind InfoSum… I can connect knowledge across two data sets, as if they’ve been pooled.”

Halstead says that the problem with the traditional data pooling route — so copying and sharing raw data with all sorts of partners (or even internally, thereby expanding the risk vector surface area) — is that it’s risky. The myriad data breaches that regularly make headlines nowadays are a testament to that.

But that’s not the only commercial consideration in play, as he points out that raw data which has been shared is immediately less valuable — because it can’t be sold again.

“If I give you a data set in its raw form, I can’t sell that to you again — you can take it away, you can slice it and dice it as many ways as you want. You won’t need to come back to me for another three or four years for that same data,” he argues. “From a commercial point of view [what we’re doing] makes the data more valuable. In that data is never actually having to be handed over to the other party.”

Not blockchain for privacy

Decentralization, as a technology approach, is also of course having a major moment right now — thanks to blockchain hype. But InfoSum is definitely not blockchain. Which is a good thing. No sensible person should be trying to put personal data on a blockchain.

“The reality is that all the companies that say they’re doing blockchain for privacy aren’t using blockchain for the privacy part, they’re just using it for a trust model, or recording the transactions that occur,” says Halstead, discussing why blockchain is terrible for privacy.

“Because you can’t use the blockchain and say it’s GDPR compliant or privacy safe. Because the whole transparency part of it and the fact that it’s immutable. You can’t have an immutable database where you can’t then delete users from it. It just doesn’t work.”

Instead he describes InfoSum’s technology as “blockchain-esque” — because “everyone stays holding their data”. “The trust is then that because everyone holds their data, no one needs to give their data to everyone else. But you can still crucially, through our technology, combine the knowledge across those different data sets.”

So what exactly is InfoSum doing to the raw personal data to make it “privacy safe”? Halstead claims it goes “beyond hashing” or encrypting it. “Our solution goes beyond that — there is no way to re-identify any of our data because it’s not ever represented in that way,” he says, further claiming: “It is absolutely 100 per cent data isolation, and we are the only company doing this in this way.

“There are solutions out there where traditional models are pooling it but with encryption on top of it. But again if the encryption gets broken the data is still ending up being in a single silo.”

InfoSum’s approach is based on mathematically modeling users, using a “one way model”, and using that to make statistical comparisons and serve up aggregated insights.

“You can’t read things out of it, you can only test things against it,” he says of how it’s transforming the data. “So it’s only useful if you actually knew who those users were beforehand — which obviously you’re not going to. And you wouldn’t be able to do that unless you had access to our underlying code-base. Everyone else either uses encryption or hashing or a combination of both of those.”

This one-way modeling technique is in the process of being patented — so Halstead says he can’t discuss the “fine details” — but he does mention a long-standing technique for optimizing database communications, called Bloom filters, saying those sorts of “principles” underpin InfoSum’s approach.
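For readers unfamiliar with the term, a Bloom filter is a compact bit array that supports membership tests ("was this identifier added?") without storing the identifiers themselves, at the cost of a tunable false-positive rate. The sketch below shows the basic principle only; InfoSum’s one-way model is patented and its details are not public, so this should not be read as its implementation.

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several bit positions from independent hashes of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Two parties can exchange filters instead of raw identifiers and test for overlap.
mine = BloomFilter()
for email in ("alice@example.com", "bob@example.com"):
    mine.add(email)
print("bob@example.com" in mine)    # True
print("carol@example.com" in mine)  # False, with high probability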

Although he also says it’s using those kind of techniques differently. Here’s how InfoSum’s website describes this process (which it calls Quantum):

InfoSum Quantum irreversibly anonymises data and creates a mathematical model that enables isolated datasets to be statistically compared. Identities are matched at an individual level and results are collated at an aggregate level – without bringing the datasets together.

On the surface, the approach shares a similar structure to Facebook’s Custom Audiences Product, where advertisers’ customer lists are locally hashed and then uploaded to Facebook for matching against its own list of hashed customer IDs — with any matches used to create a custom audience for ad targeting purposes.

Though Halstead argues InfoSum’s platform offers more for even this kind of audience building marketing scenario, because its users can use “much more valuable knowledge” to model on — knowledge they would not comfortably share with Facebook “because of the commercial risks of handing over that first person valuable data”.

“For instance if you had an attribute that defined which were your most valuable customers, you would be very unlikely to share that valuable knowledge — yet if you could safely then it would be one of the most potent indicators to model upon,” he suggests.

He also argues that InfoSum users will be able to achieve greater marketing insights via collaborations with other users of the platform vs being a customer of Facebook Custom Audiences — because Facebook simply “does not open up its knowledge”.

“You send them your customer lists, but they don’t then let you have the data they have,” he adds. “InfoSum for many DMPs [data management platforms] will allow them to collaborate with customers so the whole purchasing of marketing can be much more transparent.”

He also emphasizes that marketing is just one of the use-cases InfoSum’s platform can address.

Decentralized bunkers of data

One important clarification: InfoSum customers’ data does get moved — but it’s moved into a “private isolated bunker” of their choosing, rather than being uploaded to a third party.

“The easiest one to use is where we basically create you a 100 per cent isolated instance in Amazon [Web Services],” says Halstead. “We’ve worked with Amazon on this so that we’ve used a whole number of techniques so that once we create this for you, you put your data into it — we don’t have access to it. And when you connect it to the other part we use this data modeling so that no data then moves between them.”

“The ‘bunker’ is… an isolated instance,” he adds, elaborating on how communications with these bunkers are secured. “It has its own firewall, a private VPN, and of course uses standard SSL security. And once you have finished normalising the data it is turned into a form in which all PII [personally identifiable information] is deleted.

“And of course like any other security related company we have had independent security companies penetration test our solution and look at our architecture design.”

Other key pieces of InfoSum’s technology are around data integration and identity mapping — aimed at tackling the (inevitable) problem of data in different databases/datasets being stored in different formats. Which again is one of the commercial reasons why big data silos often stay just that: Silos.

Halstead gave TechCrunch a demo showing how the platform ingests and connects data, with users able to use “simple steps” to teach the system what is meant by data types stored in different formats — such as that ‘f’ means the same as ‘female’ for gender category purposes — to smooth the data mapping and “try to get it as clean as possible”.

Once that step has been completed, the user (or collaborating users) are able to get a view on how well linked their data sets are — and thus to glimpse “the start of the art of the possible”.

In practice this means they can choose to run different reports atop their linked datasets — such as if they want to enrich their data holdings by linking their own users across different products to gain new insights, such as for internal research purposes.

Or, where there’s two InfoSum users linking different data sets, they could use it for propensity modeling or lookalike modeling of customers, says Halstead. So, for example, a company could link models of their users with models of the users of a third party that holds richer data on its users to identify potential new customer types to target marketing at.

“Because I’ve asked to look at the overlap I can literally say I only know the gender of these people but I would also like to know what their income is,” he says, fleshing out another possible usage scenario. “You can’t drill into this, you can’t do really deep analytics — that’s what we’ll be launching later. But Link allows you to get this idea of what would it look like if I combine our datasets.

“The key here is it’s opening up a whole load of industries where sensitivity around doing this — and where, even in industries that share a lot of data already but where GDPR is going to be a massive barrier to it in the future.”

Halstead says he expects big demand from the marketing industry which is of course having to scramble to rework its processes to ensure they don’t fall foul of GDPR.

“Within marketing there is going to be a whole load of new challenges for companies where they were currently enhancing their databases, buying up large raw datasets and bringing their data into their own CRM. That world’s gone once we’ve got GDPR.

“Our model is safer, faster, and actually still really lets people do all the things they did before but while protecting the customers.”

But it’s not just marketing exciting him. Halstead believes InfoSum’s approach to lifting insights from personal data could be very widely applicable — arguing, for example, that it’s only a minority of use-cases, such as credit risk and fraud within banking, where companies actually need to look at data at an individual level.

One area he says he’s “very passionate” about InfoSum’s potential is in the healthcare space.

“We believe that this model isn’t just about helping marketing and helping a whole load of others — healthcare especially for us I think is going to be huge. Because [this affords] the ability to do research against health data where the health data has never actually been shared,” he says.

“In the UK especially we’ve had a number of massive false starts where companies have, for very good reasons, wanted to be able to look at health records and combine data — which can turn into vital research to help people. But actually their way of doing it has been about giving out large datasets. And that’s just not acceptable.”

He even suggests the platform could be used for training AIs within the isolated bunkers — flagging a developer interface that will be launching after Link which will let users query the data as a traditional SQL query.

Though he says he sees most initial healthcare-related demand coming from analytics that need “one or two additional attributes” — such as, for example, comparing health records of people with diabetes with activity tracker data to look at outcomes for different activity levels.

“You don’t need to drill down into individuals to know that the research capabilities could give you incredible results to understand behavior,” he adds. “When you do medical research you need bodies of data to be able to prove things so the fact that we can only work at an aggregate level is not, I don’t think, any barrier to being able to do the kind of health research required.”

Another area he believes could really benefit is M&A — saying InfoSum’s platform could offer companies a way to understand how their user bases overlap before they sign on the line. (It is also of course handling and thus simplifying the legal side of multiple entities collaborating over data sets.)

“There hasn’t been the technology to allow them to look at whether there’s an overlap before,” he claims. “It puts the power in the hands of the buyer to be able to say we’d like to be able to look at what your user base looks like in comparison to ours.

“The problem right now is you could do that manually but if they then backed out there’s all kinds of legal problems because I’ve had to hand the raw data over… so no one does it. So we’re going to change the M&A market for allowing people to discover whether I should acquire someone before they go through to the data room process.”

While Link is something of a taster of what InfoSum’s platform aims to ultimately offer (with this first product priced low but not freemium), the SaaS business it’s intending to get into is data matchmaking — whereby, once it has a pipeline of users, it can start to suggest links that might be interesting for its customers to explore.

“There is no point in us reinventing the wheel of being the best visualization company because there’s plenty that have done that,” he says. “So we are working on data connectors for all of the most popular BI tools that plug in to then visualize the actual data.

“The long term vision for us moves more into being more of an introductory service — i.e. once we’ve got 100 companies in this, how do we help those companies work out what other companies they should be working with.”

“We’ve got some very good systems for — in a fully anonymized way — helping you understand what the intersection is from your data to all of the other datasets, obviously with their permission if they want us to calculate that for them,” he adds.

“The way our investors looked at this, this is the big opportunity going forward. There is no limit, in a decentralized world… imagine 1,000 bunkers around the world in these different corporates who all can start to collaborate. And that’s our ultimate goal — that all of them are still holding onto their own knowledge, 100% privacy safe, but then they have that opportunity to work with each other, which they don’t right now.”

Engineering around privacy risks?

But does he not see any risks to privacy of enabling the linking of so many separate datasets — even with limits in place to avoid individuals being directly outed as connected across different services?

“However many data sets there are, the only thing it can reveal extra is whether every extra data set has an extra bit of knowledge,” he responds on that. “And every party has the ability to define what bit of data they would then want to be open to others to then work on.

“There are obviously sensitivities around certain combinations of attributes, around religion, gender and things like that. Where we already have a very clever permission system where the owners can define what combinations are acceptable and what aren’t.”

“My experience of working with all the social networks has meant — I hope — that we are ahead of the game of thinking about those,” he adds, saying that the matchmaking stage is also six months out at this point.

“I don’t see any down sides to it, as long as the controls are there to be able to limit it. It’s not like it’s going to be a sudden free for all. It’s an introductory service, rather than an open platform so everyone can see everything else.”

The permission system is clearly going to be important. But InfoSum does essentially appear to be heading down the platform route of offloading responsibility for ethical considerations — in its case around dataset linkages — to its customers.

Which does open the door to problematic data linkages down the line, and all sorts of unintended dots being joined.

Say, for example, a health clinic decides to match people with particular medical conditions to users of different dating apps — and the relative proportions of HIV rates across straight and gay dating apps in the local area gets published. What unintended consequences might spring from that linkage being made?

Other equally problematic linkages aren’t hard to imagine. And we’ve seen the appetite businesses have for making creepy observations about their users public.

“Combining two sets of aggregate data meaningfully is not easy,” says Eerke Boiten, professor of cyber security at De Montfort University, discussing InfoSum’s approach. “If they can make this all work out in a way that makes sense, preserves privacy, and is GDPR compliant, then they deserve a patent I suppose.”

On data linkages, Boiten points to the problems Facebook has had with racial profiling as illustrative of the potential pitfalls.

He says there may also be GDPR-specific risks around customer profiling enabled by the platform. In an edge case scenario, for example, where two overlapped datasets are linked and found to have a 100% user match, that would mean people’s personal data had been processed by default — so that processing would have required a legal basis to be in place beforehand.

And there may be wider legal risks around profiling too. If, for example, linkages are used to deny services or vary pricing to certain types or blocks of customers, is that legal or ethical?

“From a company’s perspective, if it already has either consent or a legitimate purpose (under GDPR) to use customer data for analytical/statistical purposes then it can use our products,” says InfoSum’s COO Danvers Baillieu, on data processing consent. “Where a company has an issue using InfoSum as a sub-processor, then… we can set up the system differently so that we simply supply the software and they run it on their own machines (so we are not a data processor) –- but this is not yet available in Link.”

Baillieu also notes that the bin sizes InfoSum’s platform aggregates individuals into are configurable in its first product. “The default bin size is 10, and the absolute minimum is three,” he adds.
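The bin size described here is a standard disclosure-control threshold: any group smaller than the minimum is suppressed rather than reported. A minimal sketch of the idea (not InfoSum’s implementation) looks like this:

from collections import Counter

def aggregate_with_threshold(records, key, min_bin=10):
    """Count records per group, suppressing any group smaller than min_bin."""
    counts = Counter(r[key] for r in records)
    return {group: n for group, n in counts.items() if n >= min_bin}

people = [{"income_band": "40-60k"}] * 25 + [{"income_band": "100k+"}] * 4
print(aggregate_with_threshold(people, "income_band"))  # {'40-60k': 25}; the group of 4 is suppressed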

“The other key point around disclosure control is that our system never needs to publish the raw data table. All the famous breaches from Netflix onwards are because datasets have been pseudonymised badly and researchers have been able to run analysis across the visible fields and then figure out who the individuals are — this is simply not possible with our system as this data is never revealed.”

‘Fully GDPR compliant’ is certainly a big claim — and one that is going to have a lot of slings and arrows thrown at it as data gets ingested by InfoSum’s platform.

It’s also fair to say that a whole library of books could be written about technology’s unintended consequences.

Indeed, InfoSum’s own website credits Halstead as the inventor of the embedded retweet button, noting the technology is “something that is now ubiquitous on almost every website in the world”.

Those ubiquitous social plugins are also of course a core part of the infrastructure used to track Internet users wherever and almost everywhere they browse. So does he have any regrets about the invention, given how that bit of innovation has ended up being so devastating for digital privacy?

“When I invented it, the driving force for the retweet button was only really as a single number to count engagement. It was never to do with tracking. Our version of the retweet button never had any trackers in it,” he responds on that. “It was the number that drove our algorithms for delivering news in a very transparent way.

“I don’t need to add my voice to all the US pundits of the regrets of the beast that’s been unleashed. All of us feel that desire to unhook from some of these networks now because they aren’t being healthy for us in certain ways. And I certainly feel that what we’re not doing for improving the world of data is going to be good for everyone.”

When we first covered the UK-based startup it was going under the name CognitiveLogic — a placeholder name, as three weeks in Halstead says he was still figuring out exactly how to take his idea to market.

The founder of DataSift has not had difficulties raising funding for his new venture. There was an initial $3M from Upfront Ventures and IA Ventures, with the seed topped up by a further $5M last year, with new investors including Saul Klein (formerly Index Ventures) and Mike Chalfen of Mosaic Ventures. Halstead says he’ll be looking to raise “a very large Series A” over the summer.

In the meanwhile, he says he has a “very long list” of hundreds of customers wanting to get their hands on the platform to kick its tires. “The last three months have been a whirlwind of me going back to all of the major brands, all of the big data companies; there is no large corporate that doesn’t have these kinds of challenges,” he adds.

“I saw a very big client this morning… they’re a large multinational, they’ve got three major brands where the three customer sets had never been joined together. So they don’t even know what the overlap of those brands are at the moment. So even giving them that insight would be massively valuable to them.”
