Alibaba unveils Hanguang 800, an AI inference chip it says significantly increases the speed of machine learning tasks

Alibaba Group introduced its first AI inference chip today, a neural processing unit called Hanguang 800 that it says makes performing machine learning tasks dramatically faster and more energy efficient. The chip, announced today during Alibaba Cloud’s annual Apsara Computing Conference in Hangzhou, is already being used to power features on Alibaba’s e-commerce sites, including product search and personalized recommendations. It will be made available to Alibaba Cloud customers later.

As an example of what the chip can do, Alibaba said it usually takes Taobao an hour to categorize the one billion product images that are uploaded to the e-commerce platform each day by merchants and prepare them for search and personalized recommendations. Using Hanguang 800, Taobao was able to complete the task in only five minutes.

Alibaba is already using Hanguang 800 in many of its business operations that require machine processing. In addition to product search and recommendations, this includes automatic translation on its e-commerce sites, advertising and intelligent customer service.

Though Alibaba hasn’t revealed when the chip will be available to its cloud customers, it may help Chinese companies reduce their dependence on U.S. technology as the trade war makes business partnerships between Chinese and American tech companies more difficult. It can also help Alibaba Cloud grow in markets outside of China. Within China, Alibaba Cloud is the market leader, but in the Asia-Pacific region it still ranks behind Amazon, Microsoft and Google, according to the Synergy Research Group.

Hanguang 800 was created by T-Head, the unit that leads the development of chips for cloud and edge computing within Alibaba DAMO Academy, the global research and development initiative in which Alibaba is investing more than $15 billion. T-Head developed the chip’s hardware and algorithms designed for business apps, including Alibaba’s retail and logistics apps.

In a statement, Alibaba Group CTO and president of Alibaba Cloud Intelligence Jeff Zhang (pictured above) said, “The launch of Hanguang 800 is an important step in our pursuit of next-generation technologies, boosting computing capabilities that will drive both our current and emerging businesses while improving energy-efficiency.”

He added, “In the near future, we plan to empower our clients by providing access through our cloud business to the advanced computing that is made possible by the chip, anytime and anywhere.”

T-Head’s other launches include the XuanTie 910, introduced earlier this year, an IoT processor based on RISC-V, the open-source instruction set architecture that began as a project at UC Berkeley. XuanTie 910 was created for heavy-duty IoT applications, including edge servers, networking, gateways and autonomous vehicles.

Alibaba DAMO Academy collaborates with universities around the world, including UC Berkeley and Tel Aviv University. Researchers in the program focus on machine learning, network security, visual computing and natural language processing, with the goal of serving two billion customers and creating 100 million jobs by 2035.


Google and Twitter are using AMD’s new EPYC Rome processors in their data centers

Google and Twitter are among the companies now using EPYC Rome processors, AMD announced today during a launch event for the 7nm chips. The release of EPYC Rome marks a major step in AMD’s processor war with Intel, which said last month that its own 7nm chips won’t be available until 2021 (though its 10nm Ice Lake node is expected to ship this year).

Intel is still the biggest data center processor maker by far, however, and also counts Google and Twitter among its customers. But AMD’s latest releases and its strategy of undercutting competitors with lower pricing have quickly transformed it into a formidable rival.

Google has used other AMD chips before, including in its “Millionth Server,” built in 2008, and says it is now the first company to use second-generation EPYC chips in its data centers. Later this year, Google will also make available to Google Cloud customers virtual machines that run on the chips.

In a press statement, Bart Sano, Google vice president of engineering, said, “AMD 2nd Gen EPYC processors will help us continue to do what we do best in our datacenters: innovate. Its scalable compute, memory and I/O performance will expand our ability to drive innovation forward in our infrastructure and will give Google Cloud customers the flexibility to choose the best VM for their workloads.”

Twitter plans to begin using EPYC Rome in its data center infrastructure later this year. Its senior director of engineering, Jennifer Fraser, said the chips will reduce the energy consumption of its data centers. “Using the AMD EPYC 7702 processor, we can scale out our compute clusters with more cores in less space using less power, which translates to 25% lower [total cost of ownership] for Twitter.”

In a comparison test between 2-socket Intel Xeon 6242 and AMD EPYC 7702P processors, AMD claimed that its chips were able to reduce total cost of ownership by up to 50% across “numerous workloads.” The flagship of the EPYC Rome line, the 64-core, 128-thread 7742, has a 2.25GHz base frequency, a 225W default TDP and 256MB of total cache, and starts at $6,950.


Fungible raises $200 million led by SoftBank Vision Fund to help companies handle increasingly massive amounts of data

Fungible, a startup that wants to help data centers cope with the increasingly massive amounts of data produced by new technologies, has raised a $200 million Series C led by SoftBank Vision Fund, with participation from Norwest Venture Partners and its existing investors. As part of the round, SoftBank Investment Advisers senior managing partner Deep Nishar will join Fungible’s board of directors.

Founded in 2015, Fungible now counts about 200 employees and has raised more than $300 million in total funding. Its other investors include Battery Ventures, Mayfield Fund, Redline Capital and Walden Riverwood Ventures. Its new capital will be used to speed up product development. The company’s founders, CEO Pradeep Sindhu and Bertrand Serlet, say Fungible will release more information later this year about when its data processing units will be available and their on-boarding process, which they say will not require clients to change their existing applications, networking or server design.

Sindhu previously founded Juniper Networks, where he held roles including chief scientist and CEO. Serlet was senior vice president of software engineering at Apple before leaving in 2011 to found Upthere, a storage startup that was acquired by Western Digital in 2017. Sindhu and Serlet describe Fungible’s objective as pivoting data centers from a “compute-centric” model to a data-centric one. While the company is often asked if it considers Intel and Nvidia competitors, the founders say Fungible’s Data Processing Units (DPUs) complement technology from other chip makers, including central and graphics processing units.

Sindhu describes Fungible’s DPUs as a new building block in data center infrastructure, allowing data centers to handle larger amounts of data more efficiently and potentially enabling new kinds of applications. The DPUs are fully programmable and connect with standard IPs over Ethernet local area networks and local buses, like PCI Express, that in turn connect to CPUs, GPUs and storage. Placed between the two, the DPUs act like a “super-charged data traffic controller,” performing computations offloaded by the CPUs and GPUs, as well as converting the IP connection into high-speed data center fabric.

This better prepares data centers for the enormous amounts of data generated by new technologies, including self-driving cars, and industries such as personalized healthcare, financial services, cloud gaming, agriculture, call centers and manufacturing, says Sindhu.

In a press statement, Nishar said “As the global data explosion and AI revolution unfold, global computing, storage and networking infrastructure are undergoing a fundamental transformation. Fungible’s products enable data centers to leverage their existing hardware infrastructure and benefit from these new technology paradigms. We look forward to partnering with the company’s visionary and accomplished management team as they power the next generation of data centers.”


MongoDB on ARM Processors


ARM processors have been around for a while. Back in mid-2015/2016 there were a couple of attempts by the community to port MongoDB to this architecture. At the time, the main storage engine was MMAPv1 and most of the available ARM boards were 32-bit. Overall the port worked, but having MongoDB running on a Raspberry Pi was more of a hack than a real setup, and the public cloud providers didn’t yet offer machines running these processors.

ARM processors are power-efficient and, for this reason, they are used in smartphones, smart devices and now even laptops. It was just a matter of time before they became available in the cloud as well. Now that AWS is offering ARM-based instances, you might be thinking: “Hmmm, these instances include the same number of cores and amount of memory as the traditional x86-based offerings, but cost a fraction of the price!”

But do they perform alike?

In this blog, we selected three different AWS instances to compare: one powered by an ARM processor, a second backed by a traditional x86_64 Intel processor with the same number of cores and memory as the ARM instance, and finally another Intel-backed instance that costs roughly the same as the ARM instance but carries half as many cores. We acknowledge these processors are not supposed to be “equivalent,” and we do not intend to go deep into CPU architecture in this blog. Our goal is purely to check how the ARM-backed instance fares in comparison to the Intel-backed ones.

These are the instances we will consider in this blog post.


We will use the Yahoo Cloud Serving Benchmark (YCSB), running on a dedicated instance (c5d.4xlarge), to simulate load in three distinct tests:

  1. a load of 1 billion documents into one collection having only the primary key index (which we’ll call Inserts);
  2. a workload comprised exclusively of reads (Reads);
  3. a workload comprised of a mix of 75% reads (including 5% scans) and 25% updates (Reads/Updates).

We will run each test with a varying number of concurrent threads (32, 64, and 128), repeating each set three times and keeping only the second-best result.

All instances will run the same MongoDB version (4.0.3, installed from a tarball and running with default settings) and operating system, Ubuntu 16.04. We chose this setup because MongoDB’s download offering includes an ARM build for Ubuntu-based machines.

All the instances will be configured with:

  • 100 GB EBS with 5000 PIOPS and 20 GB EBS boot device
  • Data volume formatted with XFS, 4k blocks
  • Default swappiness and disk scheduler
  • Default kernel parameters
  • Enhanced CloudWatch monitoring configured
  • Free monitoring tier enabled

Preparing the environment

We start with the setup of YCSB, the benchmark software we will use for the test. The first task was to spin up a powerful machine (c5d.4xlarge) to run it and then prepare the environment.

YCSB requires Java, Maven, Python, and pymongo, none of which come installed by default on our Linux version (Ubuntu Server x86). Here are the steps we used to configure our environment:

Installing Java

sudo apt-get install default-jdk

Installing Maven

sudo tar xzf apache-maven-*-bin.tar.gz -C /usr/local
cd /usr/local
sudo ln -s apache-maven-* maven
sudo vi /etc/profile.d/

Add the following to that file:

export M2_HOME=/usr/local/maven
export PATH=${M2_HOME}/bin:${PATH}

Installing Python 2.7

sudo apt-get install python2.7

Installing pip to resolve the pymongo dependency

sudo apt-get install python-pip

Installing pymongo (driver)

sudo pip install pymongo

Installing YCSB

curl -O --location
tar xfvz ycsb-0.5.0.tar.gz
cd ycsb-0.5.0

YCSB comes with different workloads, and also allows for the customization of a workload to match our own requirements. If you want to learn more about the workloads, have a look at the YCSB documentation.

First, we will edit the workloads/workloada file to perform 1 billion inserts (for our first test) while also preparing it to later perform only reads (for our second test):
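We haven’t reproduced our exact file here, but the edit looks along these lines, using the standard YCSB core workload properties (the property names come from the YCSB core workload; the values shown are illustrative):

```properties
# workloads/workloada (sketch): load 1 billion records,
# then a run phase consisting of 100% reads
recordcount=1000000000
operationcount=1000000000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=1.0
updateproportion=0
scanproportion=0
insertproportion=0
requestdistribution=zipfian
```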


We will then change the workloads/workloadb file so as to provide a mixed workload for our third test. We also set it to perform 1 billion operations, but we break these down into 70% plain read queries and 25% updates, with a 5% scan ratio, while also placing a cap on the maximum number of scanned documents (2000) in an effort to emulate real traffic – workloads are not perfect, right?
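Again, we haven’t pasted our exact file, but a sketch of that edit using the standard YCSB core workload properties looks like this (the values are illustrative; the read/update/scan proportions must sum to 1.0):

```properties
# workloads/workloadb (sketch): mixed run phase with a cap on scan length
recordcount=1000000000
operationcount=1000000000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readproportion=0.70
updateproportion=0.25
scanproportion=0.05
insertproportion=0
maxscanlength=2000
requestdistribution=zipfian
```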


With that, we have the environment configured for testing.

Running the tests

With all instances configured and ready, we run the stress test against our MongoDB servers using the following command:

./bin/ycsb [load/run] mongodb -s -P workloads/workload[a/b] -threads [32/64/128] \
 -p mongodb.url=mongodb://<host>:27017/ycsb0000[0-9]

The parameters between brackets varied according to the instance and operation being executed:

  • [load/run]: load means insert the data, while run means perform actions (reads/updates)
  • workload[a/b]: references the different workload files we’ve used
  • [32/64/128]: the number of concurrent threads used for the test
  • ycsb0000[0-9]: the database name we’ve used for the tests (for reference only)


Without further ado, the table below summarizes the results for our tests:




Performance cost

Considering throughput alone – and in the context of these tests, particularly the last one – you may get more performance for the same cost. That’s certainly not always the case, as our results above also demonstrate. And, as usual, it depends on how much performance you need – a question that is even more pertinent in the cloud. With that in mind, we took another look at our data through the “performance cost” lens.

As we saw above, the c5.4xlarge instance performed better than the other two instances for a little over 50% more in cost. Did it deliver 50% more performance as well? Sometimes it did even more than that, but not always. We used the following formula to extrapolate the OPS (operations per second) data we got from our tests into OPH (operations per hour), so we could then calculate how much bang (operations) for the buck (US$1) each instance was able to provide:

transactions/hour/US$1 = (OPS * 3600) / instance cost per hour

This is, of course, an artificial metric that aims to correlate performance and cost. For this reason, instead of plotting the raw values, we have normalized the results using the best-performing instance as the baseline (100%):
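For illustration, here is how that calculation and normalization could be sketched in a few lines of Python. Note that the instance labels, hourly prices, and OPS figures below are placeholders for the sake of the example, not our measured results:

```python
# Convert measured throughput (OPS) into operations per hour per US$1,
# then normalize against the best performer (= 100%).
# NOTE: prices and OPS values are illustrative placeholders.
instances = {
    "arm":       {"ops": 18000, "usd_per_hour": 0.408},
    "x86_same":  {"ops": 20000, "usd_per_hour": 0.768},
    "x86_cheap": {"ops": 11000, "usd_per_hour": 0.400},
}

def oph_per_dollar(ops, usd_per_hour):
    """transactions/hour/US$1 = (OPS * 3600) / instance cost per hour"""
    return (ops * 3600) / usd_per_hour

scores = {name: oph_per_dollar(i["ops"], i["usd_per_hour"])
          for name, i in instances.items()}
best = max(scores.values())
normalized = {name: round(100 * s / best, 1) for name, s in scores.items()}
print(normalized)
```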



The intent behind this metric was only to demonstrate another way to evaluate how much we’re getting for what we’re paying. Of course, you need a clear understanding of your own requirements in order to make a balanced decision.

Parting thoughts

We hope this post awakens your curiosity not only about how MongoDB may perform on ARM-based servers, but also about how you can run your own tests with the YCSB benchmark. Feel free to reach out to us through the comments section below if you have any suggestions, questions, or other observations about the work we presented here.


Does the Meltdown Fix Affect Performance for MySQL on Bare Metal?


In this blog post, we’ll look at whether the Meltdown fix affects performance for MySQL on bare metal servers.

Since the news about the Meltdown bug broke, there have been a lot of reports on the performance hit from the proposed fixes. We have looked at how the fix affects MySQL (Percona Server for MySQL) under a sysbench workload.

In this case, we used bare metal boxes with the following specifications:

  • Two-socket Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz (in total 56 entries in /proc/cpuinfo)
  • Ubuntu 16.04
  • Memory: 256GB
  • Storage: Samsung SM863 1.9TB SATA SSD
  • Percona Server for MySQL 5.7.20
  • Kernel (vulnerable) 4.13.0-21
  • Kernel (with Meltdown fix) 4.13.0-25

Please note, the current kernel for Ubuntu 16.04 contains only a Meltdown fix, and not one for Spectre.

We performed the validation with sysbench. The database size is 100GB, in a sysbench workload with 100 tables of 4 million rows each, using a Pareto distribution.

We used both a Unix socket connection and a TCP host connection to measure the possible overhead from the TCP network layer. We also performed both read-write and read-only benchmarks.

The results for various numbers of threads are below:



  • Nokpti: kernel without KPTI patch (4.13.0-21)
  • Pti: kernel with KPTI patch (4.13.0-25), with PTI enabled
  • Nopti: kernel with KPTI patch (4.13.0-25), with PTI disabled


testname bp socket threads pti nopti nokpti nopti_pct pti_pct
1 OLTP_RO in-memory tcp_socket 1 709.93 718.47 699.50 -2.64 -1.47
4 OLTP_RO in-memory tcp_socket 8 5473.05 5500.08 5483.40 -0.30 0.19
3 OLTP_RO in-memory tcp_socket 64 21716.18 22036.98 21548.46 -2.22 -0.77
2 OLTP_RO in-memory tcp_socket 128 21606.02 22010.36 21548.62 -2.10 -0.27
5 OLTP_RO in-memory unix_socket 1 750.41 759.33 776.88 2.31 3.53
8 OLTP_RO in-memory unix_socket 8 5851.80 5896.86 5986.89 1.53 2.31
7 OLTP_RO in-memory unix_socket 64 23052.10 23552.26 23191.48 -1.53 0.60
6 OLTP_RO in-memory unix_socket 128 23215.38 23602.64 23146.42 -1.93 -0.30
9 OLTP_RO io-bound tcp_socket 1 364.03 369.68 370.51 0.22 1.78
12 OLTP_RO io-bound tcp_socket 8 3205.05 3225.21 3210.63 -0.45 0.17
11 OLTP_RO io-bound tcp_socket 64 15324.66 15456.44 15364.25 -0.60 0.26
10 OLTP_RO io-bound tcp_socket 128 17705.29 18007.45 17748.70 -1.44 0.25
13 OLTP_RO io-bound unix_socket 1 421.74 430.10 432.88 0.65 2.64
16 OLTP_RO io-bound unix_socket 8 3322.19 3367.46 3367.34 -0.00 1.36
15 OLTP_RO io-bound unix_socket 64 15977.28 16186.59 16248.42 0.38 1.70
14 OLTP_RO io-bound unix_socket 128 18729.10 19111.55 18962.02 -0.78 1.24
17 OLTP_RW in-memory tcp_socket 1 490.76 495.21 489.49 -1.16 -0.26
20 OLTP_RW in-memory tcp_socket 8 3445.66 3459.16 3414.36 -1.30 -0.91
19 OLTP_RW in-memory tcp_socket 64 11165.77 11167.44 10861.44 -2.74 -2.73
18 OLTP_RW in-memory tcp_socket 128 12176.96 12226.17 12204.85 -0.17 0.23
21 OLTP_RW in-memory unix_socket 1 530.08 534.98 540.27 0.99 1.92
24 OLTP_RW in-memory unix_socket 8 3734.93 3757.98 3772.17 0.38 1.00
23 OLTP_RW in-memory unix_socket 64 12042.27 12160.86 12138.01 -0.19 0.80
22 OLTP_RW in-memory unix_socket 128 12930.34 12939.02 12844.78 -0.73 -0.66
25 OLTP_RW io-bound tcp_socket 1 268.08 270.51 270.71 0.07 0.98
28 OLTP_RW io-bound tcp_socket 8 1585.39 1589.30 1557.58 -2.00 -1.75
27 OLTP_RW io-bound tcp_socket 64 4828.30 4782.42 4620.57 -3.38 -4.30
26 OLTP_RW io-bound tcp_socket 128 5158.66 5172.82 5321.03 2.87 3.15
29 OLTP_RW io-bound unix_socket 1 280.54 282.06 282.35 0.10 0.65
32 OLTP_RW io-bound unix_socket 8 1582.69 1584.52 1601.26 1.06 1.17
31 OLTP_RW io-bound unix_socket 64 4519.45 4485.72 4515.28 0.66 -0.09
30 OLTP_RW io-bound unix_socket 128 5524.28 5460.03 5275.53 -3.38 -4.50


As you can see, there is very little difference between the runs (within a 3-4% range), which fits within the variance of the test.

Similar experiments were done on different servers and workloads, and there too we saw a negligible difference that fits within the measurement variance.

Overhead analysis

To understand why we do not see much effect in MySQL (InnoDB workloads), let’s take a look at where we would expect to see overhead from the proposed fix.

The main overhead is expected from system calls, so let’s test syscall execution on the kernel before and after the fix (thanks to Alexey Kopytov for the idea of how to test this with sysbench).

We will use the following script syscall.lua:

ffi = require("ffi")
ffi.cdef[[long syscall(long, long, long, long);]]
function event()
 for i = 1, 10000 do
 ffi.C.syscall(0, 0, 0, 0)
 end
end

Basically, we measure the time for executing 10000 system calls (this will be one event).

To run the benchmark:

sysbench syscall.lua --time=60 --report-interval=1 run


And the results are as follows:

  • On the kernel without the fix (4.13.0-21): 455 events/sec
  • On the kernel with the fix (4.13.0-26): 250 events/sec

This means that the time to execute 10000 system calls increased from 2.197ms to 4ms.

While this increase looks significant, it does not have much effect on MySQL (with the InnoDB engine), where most system calls are made for IO or network communication.

We can assume that executing 10000 IO events on fast storage takes about 1000ms, so adding an extra ~2ms of system call time corresponds to roughly 0.2% of overhead (which is practically invisible in MySQL workloads).
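As a quick sanity check on that arithmetic, using the numbers measured above:

```python
# Each sysbench "event" in our test is 10000 system calls.
events_before = 455.0  # events/sec on the kernel without the fix
events_after = 250.0   # events/sec on the kernel with the fix

ms_before = 1000.0 / events_before  # ~2.197 ms per 10000 syscalls
ms_after = 1000.0 / events_after    # 4.0 ms per 10000 syscalls
extra_ms = ms_after - ms_before     # ~1.8 ms of added syscall cost

# If 10000 IO events take ~1000 ms on fast storage, the added
# syscall time is a tiny fraction of the total:
overhead_pct = 100.0 * extra_ms / 1000.0  # well under 1%
```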

I expect the effect would be far more visible if we worked with MyISAM tables cached in OS memory: MyISAM relies on the OS page cache rather than an in-process buffer pool, so even reads of in-memory data go through system calls, where the extra overhead would show.


From our results, we do not see a measurable effect from the KPTI patches (which mitigate the Meltdown vulnerability) running on bare metal servers with Ubuntu 16.04 and the 4.13 kernel series.

Reference commands and configs:

sysbench oltp_read_only.lua   {--mysql-socket=/tmp/mysql.sock|--mysql-host=} --mysql-user=root
--mysql-db=sbtest100t4M --rand-type=pareto  --tables=100  --table-size=4000000 --num-threads=$threads --report-interval=1
--max-time=180 --max-requests=0  run


sysbench oltp_read_write.lua   {--mysql-socket=/tmp/mysql.sock|--mysql-host=} --mysql-user=root
--mysql-db=sbtest100t4M --rand-type=pareto  --tables=100  --table-size=4000000 --num-threads=$threads --report-interval=1
--max-time=180 --max-requests=0  run

Percona Server 5.7.20-19

numactl --physcpubind=all --interleave=all   /usr/bin/env LD_PRELOAD=/data/opt/alexey.s/ ./bin/mysqld
--defaults-file=/data/opt/alexey.s/my-perf57.cnf --basedir=. --datadir=/data/sam/sbtest100t4M   --user=root  --innodb_flush_log_at_trx_commit=1
--innodb-buffer-pool-size=150GB --innodb-log-file-size=10G --innodb-buffer-pool-instances=8  --innodb-io-capacity-max=20000
--innodb-io-capacity=10000 --loose-innodb-page-cleaners=8 --ssl=0

The my.cnf file:

innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = true
innodb_log_buffer_size = 128M
innodb_log_file_size = 10G
innodb_log_files_in_group = 2
