Sep 26, 2017

Microsoft Excel is about to get a lot smarter

 Microsoft Excel users rejoice — your favorite spreadsheet is about to get a lot smarter. Thanks to the help of machine learning and a better connection to the outside world, Excel will soon be able to understand more about your inputs and then pull additional information from the internet as necessary. Read More

Sep 14, 2017

Looker’s latest looks to simplify data integrations

Looker is holding its Join user conference this week, and these affairs often involve a new release to show off to customers. Today, the company released Looker 5, which it says will make it easier for employees to make use of data in their work lives. Company CEO Frank Bien believes there is a growing group of people who need quick access to data to do their jobs. “At a high level,… Read More

May 16, 2017

Nexla launches data operations platform with $3.5 million investment

 Nexla, a competitor in the TechCrunch Disrupt Battlefield this week in New York City, has more on its plate than simply impressing the judges. It also chose to launch at the event and, while it was at it, announced $3.5 million in funding led by Blumberg Capital with participation from Storm Ventures, Engineering Capital and Correlation Ventures. Read More

May 02, 2017

Orbital Insight closes $50M Series C led by Sequoia

 Orbital Insight, a geospatial analytics startup, announced it had completed raising a $50 million Series C round of financing from Sequoia. The fresh capital brings the company’s total fundraising to $78.7 million. Read More

Apr 25, 2017

Immuta adds accountability and control for project-based data science

Fresh off $8 million in Series A financing, Immuta is releasing the second version of its data science governance platform. With the democratization of machine learning come new risks for businesses that have too many workers manipulating data sets and models without oversight. The Immuta platform helps companies maintain an understanding of how digital assets are applied and shared across… Read More

Mar 30, 2017

Performance Evaluation of SST Data Transfer: With Encryption (Part 2)


In this blog post, we’ll look at the performance of SST data transfer using encryption.

In my previous post, we reviewed SST data transfer in an unsecured environment. Now let’s take a closer look at a setup with encrypted network connections between the donor and joiner nodes.

The base setup is the same as the previous time:

  • Database server: Percona XtraDB Cluster 5.7 on donor node
  • Database: sysbench database – 100 tables, 4M rows each (total ~122GB)
  • Network: donor/joiner hosts are connected with dedicated 10Gbit LAN
  • Hardware: donor/joiner hosts – boxes with 28 Cores+HT/RAM 256GB/Samsung SSD 850/Ubuntu 16.04

The setup details for the encryption aspects in our testing:

  • Cryptography libraries: openssl-1.0.2, openssl-1.1.0, libgcrypt-1.6.5 (for xbstream encryption)
  • CPU hardware acceleration for AES – AES-NI: enabled/disabled
  • Cipher suites: aes (default), aes128, aes256, chacha20 (openssl-1.1.0)

Several notes regarding the above aspects:

  • Cryptography libraries. Almost every Linux distribution currently ships openssl-1.0.2, the previous stable version of the OpenSSL library. The latest stable version (1.1.0) has various performance/scalability fixes, as well as support for new ciphers that may notably improve throughput. However, it’s problematic to upgrade from 1.0.2 to 1.1.0, or even just to find openssl-1.1.0 packages for existing distributions, because replacing OpenSSL triggers an update/upgrade of a significant number of packages. So in order to use openssl-1.1.0, you will most likely need to build it from source. The same applies to socat – it will require some effort to build socat with openssl-1.1.0.
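    As a rough sketch, such a source build might look like the following (version numbers and install paths here are illustrative assumptions, not exact instructions):

    # build openssl-1.1.0 into a private prefix
    curl -O https://www.openssl.org/source/openssl-1.1.0f.tar.gz
    tar xzf openssl-1.1.0f.tar.gz && cd openssl-1.1.0f
    ./config --prefix=/opt/openssl-1.1.0 shared
    make && make install
    # build socat against that prefix
    cd ../socat-1.7.3.2
    CPPFLAGS="-I/opt/openssl-1.1.0/include" LDFLAGS="-L/opt/openssl-1.1.0/lib" ./configure
    make
    # point the runtime linker at the private libs when running it
    LD_LIBRARY_PATH=/opt/openssl-1.1.0/lib ./socat -V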
  • AES-NI. The Advanced Encryption Standard New Instructions (AES-NI) are an extension to the x86 instruction set from Intel and AMD. The purpose of AES-NI is to improve the performance of encryption and decryption operations that use the Advanced Encryption Standard (AES), such as the AES128/AES256 ciphers. If your CPU supports AES-NI, there should be a BIOS option that allows you to enable/disable the feature. In Linux, you can check /proc/cpuinfo for the presence of the “aes” flag. If it’s present, then AES-NI is available and exposed to the OS. There is a way to check what acceleration ratio you can expect from it:
    # AES_NI disabled with OPENSSL_ia32cap
    OPENSSL_ia32cap="~0x200000200000000" openssl speed -elapsed -evp aes-128-gcm
    ...
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-128-gcm      57535.13k    65924.18k   164094.81k   175759.36k   178757.63k
    # AES_NI enabled
    openssl speed -elapsed -evp aes-128-gcm
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-128-gcm     254276.67k   620945.00k   826301.78k   906044.07k   923740.84k

    Our interest is in the very last column: ~178MB/s (without AES-NI) vs. ~923MB/s (with AES-NI).
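    A quick way to verify whether the “aes” flag is present on a given host:

    # prints "aes" once if AES-NI is exposed to the OS
    grep -m1 -wo aes /proc/cpuinfo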

  • Ciphers. In our testing of network encryption with socat + openssl 1.0.2/1.1.0, we used the following cipher suites:
    DEFAULT – used if you don’t specify a cipher/cipher string for the OpenSSL connection
    AES128 – suite with aes128 ciphers only
    AES256 – suite with aes256 ciphers only
    Additionally, for openssl-1.1.0, there is an extra cipher suite:
    CHACHA20 – cipher suites using the ChaCha20 algorithm
    In the case of xtrabackup, where internal encryption is based on libgcrypt, we use the AES128/AES256 ciphers from this library.
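    To see exactly which suites a cipher string such as AES128 expands to for your OpenSSL build, you can query the library directly:

    # list the cipher suites selected by the AES128 string
    openssl ciphers -v 'AES128'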
  • SST methods. Streaming database files from the donor to the joiner with the rsync protocol over an OpenSSL-encrypted connection:
    (donor) rsync | socat+ssl       socat+ssl | rsync (daemon mode) (joiner)

    The current approach of wsrep_sst_rsync.sh doesn’t allow you to use the rsync SST method with SSL. However, there is a project that tries to address the lack of SSL support for the rsync method. The idea is to create a secure connection with socat, and then use that connection as a tunnel for rsync between the joiner and donor hosts. In my testing, I used a similar approach.

    Also note that in the chart below there are results for two variants of rsync: “rsync” (the current approach) and “rsync_improved” (the improved one). I explained the difference between them in my previous post.
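    For reference, a minimal sketch of such a tunnel (host names, port and certificate paths are illustrative assumptions, not the exact commands from that project):

    # joiner: run an rsync daemon, terminate TLS and forward it to the daemon
    rsync --daemon --config=/etc/rsyncd-sst.conf
    socat OPENSSL-LISTEN:4444,reuseaddr,fork,cert=/etc/ssl/sst.pem,cafile=/etc/ssl/sst-ca.pem TCP:localhost:873 &
    # donor: expose a local port that tunnels to the joiner over TLS, then rsync through it
    socat TCP-LISTEN:873,reuseaddr,fork OPENSSL:joiner:4444,cert=/etc/ssl/sst.pem,cafile=/etc/ssl/sst-ca.pem &
    rsync -a /var/lib/mysql/ rsync://localhost/sst_module   # sst_module defined in rsyncd-sst.conf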

  • Backup data on the donor side and stream it to the joiner in xbstream format over an OpenSSL-encrypted connection

    (donor) xtrabackup | socat+ssl  socat+ssl | xbstream (joiner)

    In my testing for streaming over encrypted connections, I used the --parallel=4 option for xtrabackup. In my previous post, I showed that this is an important factor in getting the best time. There is also a way to pass the name of the cipher that socat will use for the OpenSSL connection to the wsrep_sst_xtrabackup-v2.sh script, via the sockopt option. For instance:

    [sst]
    inno-backup-opts="--parallel=4"
    sockopt=",cipher=AES128"
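    Outside of the SST scripts, the equivalent manual pipeline looks roughly like this (host name, port and certificate paths are illustrative assumptions):

    # joiner: accept the encrypted stream and unpack it into the datadir
    socat OPENSSL-LISTEN:4444,reuseaddr,cert=/etc/ssl/sst.pem,cafile=/etc/ssl/sst-ca.pem,cipher=AES128 - | xbstream -x -C /var/lib/mysql
    # donor: stream the backup through an OpenSSL-encrypted connection
    xtrabackup --backup --stream=xbstream --parallel=4 --target-dir=/tmp \
      | socat - OPENSSL:joiner:4444,cert=/etc/ssl/sst.pem,cafile=/etc/ssl/sst-ca.pem,cipher=AES128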
  • Back up data on the donor side, encrypt it internally (with libgcrypt), stream it to the joiner in xbstream format, and afterwards decrypt the files on the joiner

    (donor) xtrabackup | socat   socat | xbstream ; xtrabackup decrypt (joiner)

    The xtrabackup tool has a feature to encrypt data while performing a backup. That encryption is based on the libgcrypt library, and it’s possible to use the AES128 or AES256 ciphers. For encryption, it’s necessary to generate a key and then provide it to xtrabackup, which performs the encryption on the fly. There is a way to specify the number of threads that will encrypt data, along with the chunk size, to tune the encryption process.

    The current version of xtrabackup supports an efficient way to read, compress and encrypt data in parallel, and then write/stream it. On the receiving side, however, we can’t decompress/decrypt the stream on the fly. The stream first has to be received and written to disk with the xbstream tool, and only after that can you use xtrabackup with the --decrypt/--decompress modes to unpack the data. The inability to process data on the fly, and the need to save the stream to disk for later processing, has a notable impact on the stream time from the donor to the joiner. We have a plan to fix that issue, so that encryption+compression+streaming of data with xtrabackup happens without the necessity of writing the stream to disk on the receiver side.

    For my testing, in the case of xtrabackup with internal encryption, I didn’t use SSL encryption for socat.
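    A condensed sketch of that flow (key location and chunk size are illustrative assumptions):

    # generate a key once and copy it to both hosts
    openssl rand -base64 24 > /etc/xtrabackup.key
    # donor: encrypt with libgcrypt while backing up; the transport itself stays plain TCP
    xtrabackup --backup --stream=xbstream --parallel=4 \
      --encrypt=AES256 --encrypt-key-file=/etc/xtrabackup.key \
      --encrypt-threads=4 --encrypt-chunk-size=1M \
      --target-dir=/tmp | socat - TCP:joiner:4444
    # joiner: write the stream to disk first, then decrypt as a separate step
    socat TCP-LISTEN:4444,reuseaddr - | xbstream -x -C /var/lib/mysql
    xtrabackup --decrypt=AES256 --encrypt-key-file=/etc/xtrabackup.key --target-dir=/var/lib/mysql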

Results:

[Chart: SST data transfer times for each method and cipher, with and without AES-NI]

Observations:

  • Transferring data with rsync is very inefficient: the improved version is 2-2.5 times faster. Also note that in the “no-aes-ni” case, the rsync_improved method has the best time for the default/aes128/aes256 ciphers. The reason is that we perform both the data transfer in parallel (we spawn an rsync process for each file) and the encryption/decryption in parallel (socat forks extra processes for each stream). This approach allows us to compensate for the absence of hardware acceleration by using several CPU cores. In all other cases, we use only one CPU for streaming of data and encryption/decryption.
  • xtrabackup (with hardware-optimized crc32) shows the best time in all cases, except for the default/aes128/aes256 ciphers in “no-aes-ni” mode (where rsync_improved showed the best time). However, I would like to remind you that SST with rsync is a blocking operation: during the data transfer, the donor node becomes READ-ONLY. xtrabackup, on the other hand, uses backup locks and allows any operations on the donor node during SST.
  • On boxes without hardware acceleration (“no-aes-ni” mode), the chacha20 cipher allows you to perform the data transfer 2-3 times faster. It’s a very good replacement for the “aes” ciphers on such boxes. The problem with that cipher, however, is that it is only available in openssl-1.1.0; in order to use it, you will need a custom build of OpenSSL and socat on many distros.
  • Regarding xtrabackup with internal encryption (xtrabackup_enc): reading/encrypting and streaming the data is quite fast, especially with the latest libgcrypt library (1.7.x). The problem is decryption. As I explained above, right now we need to receive the stream and save the encrypted data to storage first, and then perform the extra step of reading/decrypting and saving the data back. That extra part consumes 2/3 of the total time. Improving the xbstream tool to perform stream decryption/decompression on the fly would yield very good results.

Testing Details

For the purposes of this testing, I’ve created a script, sst_bench.sh, that covers all the methods used in this post. You can use it to measure all of the above SST methods in your environment. In order to run the script, you have to adjust several environment variables at the beginning of the script: joiner ip, datadir locations on the joiner and donor hosts, etc. After that, put the script on the “donor” and “joiner” hosts and run it as follows:

#joiner_host>
sst_bench.sh --mode=joiner --sst-mode=<tar|xbackup|rsync> --cipher=<DEFAULT|AES128|AES256|CHACHA20> --ssl=<0|1> --aesni=<0|1>
#donor_host>
sst_bench.sh --mode=donor --sst-mode=<tar|xbackup|rsync|rsync_improved> --cipher=<DEFAULT|AES128|AES256|CHACHA20> --ssl=<0|1> --aesni=<0|1>

Mar 20, 2017

Prophet: Forecasting our Metrics (or Predicting the Future)


In this blog post, we’ll look at how Prophet can forecast metrics.

Facebook recently released a forecasting tool called Prophet. Prophet can forecast a particular metric in which we have an interest. It works by fitting time-series data to get a prediction of how that metric will look in the future.

For example, it could be used to:

  • Predict how much HTTP traffic we will get, and scale accordingly when needed
  • See if a particular feature of our application will be successful, or if its usage will decline
  • Get an approximate date when our database server’s resources will be exhausted
  • Forecast new customer sign-ups and size the staff accordingly
  • See what next year’s Black Friday or Cyber Monday will look like, and if we have the resources to handle them
  • Predict how many animals will enter a shelter in the coming years, as I did in a personal project I will show here

At its core, it uses a Generalized Additive Model, which is basically the merging of two models. First, a generalized linear model that, in the case of Prophet, can be a linear or logistic regression (depending on what we choose). Second, an additive model applied to that regression. The final graph represents the combination of those two: the smoothed regression area of the variable to predict. For more technical details of how it works, check out Prophet’s paper.

Most of the previous points can be summarized in a single concept: capacity planning. Let’s see how it works.

Usage Example

Prophet provides both a Python and an R library. The following example uses the Python one. You can install it using:

pip install prophet

Prophet expects the metric to come in a particular structure: a Pandas DataFrame with two columns, ds and y:

   ds          y
0  2013-10-01  34
1  2013-10-02  43
2  2013-10-03  20
3  2013-10-04  12
4  2013-10-05  46

The data I am going to use here is from the Kaggle competition Shelter Animal Outcomes. The idea is to find out how Austin Animal Center‘s workload will evolve in the future, by trying to predict the number of animal outcomes per day for the next three years. I am using this dataset because it has enough data, shows a very simple trend, and is a non-technical metric (no previous knowledge of the topic is needed). The same method can be applied to most service or business metrics you may have.
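As a sketch, building the ds/y series from the competition’s train.csv might look like this (the column names are my assumptions about that dataset):

import pandas as pd

# count shelter outcomes per day and shape them the way Prophet expects
df = pd.read_csv("train.csv", parse_dates=["DateTime"])
series = (df.groupby(df["DateTime"].dt.date).size()
            .reset_index(name="y")
            .rename(columns={"DateTime": "ds"}))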

At this point, we have the metric stored in a local variable, called “series” in this particular example. Now we only need to fit it into our model:

from prophet import Prophet  # the package was previously distributed as fbprophet

m = Prophet()
m.fit(series)

and define how far into the future we want to predict (three years in this case):

future = m.make_future_dataframe(periods=365*3)

Now, compute the forecast and plot it:

import matplotlib.pyplot as plt

forecast = m.predict(future)
m.plot(forecast)
plt.title("Outcomes forecast per Year",fontsize=20)
plt.xlabel("Year",fontsize=20)
plt.ylabel("Number of outcomes",fontsize=20)
plt.show()

[Figure: Outcomes forecast per year]

The graph shows a smoothed regression surface. We can see that the provided data covers from late 2013 to early 2016; from that point on, the values are predictions.

We can already find some interesting information. Our data shows a large increase during the summer months, and the model predicts it to continue in the future. But this representation also has some problems: as we can see, there are at least three outliers with values > 65. The fastest way to deal with outliers is to just remove them.

series[series["y"]>65]

   ds          y
0  2014-07-12  129
1  2015-07-18   97
2  2015-07-19   81

series.drop(series[series["y"]>65].index,inplace=True)

Now the graph looks much better. Let’s also add a horizontal line to help see the trend.
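A minimal sketch of how this plot could be produced after dropping the outliers (using the historical daily mean as the reference line is my own choice, not something prescribed by Prophet):

# re-fit on the cleaned data and redraw the forecast
m = Prophet()
m.fit(series)
forecast = m.predict(m.make_future_dataframe(periods=365*3))
m.plot(forecast)
plt.axhline(y=series["y"].mean(), color="red", linestyle="--")
plt.show()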

[Figure: forecast after removing the outliers, with a horizontal trend line]

Based on that forecast, Austin Animal Center should expect an increase over the next few years, but not a large one. The year-over-year growth therefore won’t cause problems in the near future, but there could come a moment when the shelter reaches its maximum capacity.

Recommendations

  • If we want to forecast a metric, we recommend having at least one year of data to fit the model. With less data, we could miss some seasonal effects – in our model above, for example, the large increase in work during the summer months.
  • In some cases, you might only want information about particular holidays (for example, Black Friday or Christmas). In that case, it is possible to create a model for those particular days. The documentation explains how to do this, but in summary: you need to create a new Pandas DataFrame that includes all previous Black Friday dates, plus the future ones you want to predict. Then, create the model as before, but specify that you are interested in a holiday effect (see the sketch after this list):
    m = Prophet(holidays=holidays)
  • We recommend using daily data. The graph can show strange results if we ask for daily forecasts from non-daily data. If the metric contains monthly information, freq='M' can be used (as shown in the documentation).
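For illustration, a minimal sketch of such a holidays DataFrame (the dates are hypothetical; lower_window/upper_window extend the effect around each date):

import pandas as pd

# past Black Fridays plus the future ones we want to predict
holidays = pd.DataFrame({
    "holiday": "black_friday",
    "ds": pd.to_datetime(["2014-11-28", "2015-11-27", "2016-11-25", "2017-11-24"]),
    "lower_window": 0,   # the effect does not start before the date
    "upper_window": 1,   # and extends one day after it
})
m = Prophet(holidays=holidays)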

Conclusion

When we want to predict the future of a particular metric, we can use Prophet to make that forecast, and then plan for it based on the information we get from the model. It can be used on very different types of problems, and it is very easy to use. Do you want to know how loaded your database will be in the future? Ask Prophet!

Jan 17, 2017

TellusLabs wants to help us better understand our planet

If you’ve spent time following companies like Orbital Insight and Descartes Labs, you might assume the geospatial analytics race has been won. But TellusLabs thinks, on the contrary, that the table hasn’t even started to cool. Armed with $3 million in new seed funding from IA Ventures and an investor group including Hyperplane VC, FounderCollective and Project11, the… Read More

Jan 11, 2017

Industrial drone platform Kespry brings in new CEO

Commercial drone maker Kespry, which launched in 2013, announced it was bringing in industry vet George Mathew as CEO today.
Mathew, who most recently was president and COO at data analytics startup Alteryx, has also had stints as GM of Business Intelligence at SAP and director of Technical Account Management at Salesforce. It’s fair to say that Kespry was interested in his data… Read More

Dec 07, 2016

SnapLogic snaps up another $40 million

SnapLogic solves a big problem for companies. It helps them connect legacy data sources to the cloud or to an in-house data lake. Today, it announced a $40 million round, almost exactly a year after announcing a $37.5 million round. The round was led by European private equity firm Vitruvian Partners with participation from previous investors Andreessen Horowitz, Capital One, Ignition… Read More
