Mar
02
2021
--

Microsoft Azure expands its NoSQL portfolio with Managed Instances for Apache Cassandra

At its Ignite conference today, Microsoft announced the launch of Azure Managed Instance for Apache Cassandra, its latest NoSQL database offering and a competitor to Cassandra-centric companies like DataStax. Microsoft describes the new service as a ‘semi-managed’ offering that will help companies bring more of their Cassandra-based workloads into its cloud.

“Customers can easily take on-prem Cassandra workloads and add limitless cloud scale while maintaining full compatibility with the latest version of Apache Cassandra,” Microsoft explains in its press materials. “Their deployments gain improved performance and availability, while benefiting from Azure’s security and compliance capabilities.”

Like its counterpart, Azure SQL Managed Instance, the idea here is to give users access to a scalable, cloud-based database service. To use Cassandra in Azure before, businesses had to either move to Cosmos DB, its highly scalable database service that supports the Cassandra, MongoDB, SQL and Gremlin APIs, or manage their own fleet of virtual machines or on-premises infrastructure.

Cassandra was originally developed at Facebook and then open-sourced in 2008. A year later, it joined the Apache Foundation, and today it’s used widely across the industry, with companies like Apple and Netflix betting on it for some of their core services. AWS launched a managed Cassandra-compatible service at its re:Invent conference in 2019 (it’s called Amazon Keyspaces today), while Microsoft launched the Cassandra API for Cosmos DB in September 2018. With today’s announcement, though, the company can now offer a full range of Cassandra-based services for enterprises that want to move these workloads to its cloud.




Feb
17
2021
--

Microsoft’s Dapr open-source project to help developers build cloud-native apps hits 1.0

Dapr, the Microsoft-incubated open-source project that aims to make it easier for developers to build event-driven, distributed cloud-native applications, hit its 1.0 milestone today, signifying the project’s readiness for production use cases. Microsoft launched the Distributed Application Runtime (that’s what “Dapr” stands for) back in October 2019. Since then, the project has released 14 updates and the community has launched integrations with virtually all major cloud providers, including Azure, AWS, Alibaba and Google Cloud.

The goal for Dapr, Microsoft Azure CTO Mark Russinovich told me, was to democratize cloud-native development for enterprise developers.

“When we go look at what enterprise developers are being asked to do — they’ve traditionally been doing client, server, web plus database-type applications,” he noted. “But now, we’re asking them to containerize and to create microservices that scale out and have no-downtime updates — and they’ve got to integrate with all these cloud services. And many enterprises are, on top of that, asking them to make apps that are portable across on-premises environments as well as cloud environments or even be able to move between clouds. So just tons of complexity has been thrown at them that’s not specific to or not relevant to the business problems they’re trying to solve.”

And a lot of the development involves re-inventing the wheel to make their applications reliably talk to various other services. The idea behind Dapr is to give developers a single runtime that, out of the box, provides the tools that developers need to build event-driven microservices. Among other things, Dapr provides various building blocks for things like service-to-service communications, state management, pub/sub and secrets management.
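
For a flavor of what these building blocks look like in practice, here is a minimal sketch against the Dapr sidecar’s HTTP API, assuming a sidecar on the default port 3500 and components named statestore and pubsub (both names are configurable, not fixed):

# Save a key/value pair through the state management building block.
curl -X POST http://localhost:3500/v1.0/state/statestore \
  -H "Content-Type: application/json" \
  -d '[{ "key": "order-42", "value": { "status": "paid" } }]'

# Read it back.
curl http://localhost:3500/v1.0/state/statestore/order-42

# Publish an event through the pub/sub building block.
curl -X POST http://localhost:3500/v1.0/publish/pubsub/orders \
  -H "Content-Type: application/json" \
  -d '{ "orderId": 42 }'

Because the application only ever talks to its local sidecar, the same calls work regardless of whether the component behind statestore is Redis, Cosmos DB or DynamoDB.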


“The goal with Dapr was: let’s take care of all of the mundane work of writing one of these cloud-native distributed, highly available, scalable, secure cloud services, away from the developers so they can focus on their code. And actually, we took lessons from serverless, from Functions-as-a-Service where with, for example Azure Functions, it’s event-driven, they focus on their business logic and then things like the bindings that come with Azure Functions take care of connecting with other services,” Russinovich said.

He also noted that another goal here was to do away with language-specific models and to create a programming model that can be leveraged from any language. Enterprises, after all, tend to use multiple languages in their existing code, and a lot of them are now looking at how to best modernize their existing applications — without throwing out all of their current code.

As Russinovich noted, the project now has more than 700 contributors outside of Microsoft (though the core committers are largely from Microsoft) and a number of businesses started using it in production before the 1.0 release. One of the larger cloud providers that is already using it is Alibaba. “Alibaba Cloud has really fallen in love with Dapr and is leveraging it heavily,” he said. Other organizations that have contributed to Dapr include HashiCorp and early users like ZEISS, Ignition Group and New Relic.

And while it may seem a bit odd for a cloud provider to be happy that its competitors are using its innovations already, Russinovich noted that this was exactly the plan and that the team hopes to bring Dapr into a foundation soon.

“We’ve been on a path to open governance for several months and the goal is to get this into a foundation. […] The goal is opening this up. It’s not a Microsoft thing. It’s an industry thing,” he said — but he wasn’t quite ready to say to which foundation the team is talking.

 

Feb
17
2021
--

TigerGraph raises $105M Series C for its enterprise graph database

TigerGraph, a well-funded enterprise startup that provides a graph database and analytics platform, today announced that it has raised a $105 million Series C funding round. The round was led by Tiger Global and brings the company’s total funding to over $170 million.

“TigerGraph is leading the paradigm shift in connecting and analyzing data via scalable and native graph technology with pre-connected entities versus the traditional way of joining large tables with rows and columns,” said TigerGraph founder and CEO, Yu Xu. “This funding will allow us to expand our offering and bring it to many more markets, enabling more customers to realize the benefits of graph analytics and AI.”

Current TigerGraph customers include the likes of Amgen, Citrix, Intuit, Jaguar Land Rover and UnitedHealth Group. Using a SQL-like query language (GSQL), these customers can use the company’s services to store and quickly query their graph databases. At the core of its offerings is the TigerGraphDB database and analytics platform, but the company also offers a hosted service, TigerGraph Cloud, with pay-as-you-go pricing, hosted either on AWS or Azure. With GraphStudio, the company also offers a graphical UI for creating data models and visually analyzing them.

The promise for the company’s database services is that they can scale to tens of terabytes of data with billions of edges. Its customers use the technology for a wide variety of use cases, including fraud detection, customer 360, IoT, AI and machine learning.

Like so many other companies in this space, TigerGraph is benefiting from a tailwind thanks to the fact that many enterprises have accelerated their digital transformation projects during the pandemic.

“Over the last 12 months with the COVID-19 pandemic, companies have embraced digital transformation at a faster pace driving an urgent need to find new insights about their customers, products, services, and suppliers,” the company explains in today’s announcement. “Graph technology connects these domains from the relational databases, offering the opportunity to shrink development cycles for data preparation, improve data quality, identify new insights such as similarity patterns to deliver the next best action recommendation.”

Feb
09
2021
--

Is overseeing cloud operations the new career path to CEO?

When Amazon announced last week that founder and CEO Jeff Bezos planned to step back from overseeing operations and shift into an executive chairman role, it also revealed that AWS CEO Andy Jassy, head of the company’s profitable cloud division, would replace him.

As Bessemer partner Byron Deeter pointed out on Twitter, Jassy’s promotion was similar to Satya Nadella’s ascent at Microsoft: in 2014, he moved from executive VP in charge of Azure to the chief exec’s office. Similarly, Arvind Krishna, who was promoted to replace Ginni Rometty as IBM CEO last year, also was formerly head of the company’s cloud business.

Could Nadella’s successful rise serve as a blueprint for Amazon as it makes a similar transition? While there are major differences in the missions of these companies, it’s inevitable that we will compare these two executives based on their former jobs. It’s true that they have an awful lot in common, but there are some stark differences, too.

Replacing a legend

For starters, Jassy is taking over for someone who founded one of the world’s biggest corporations. Nadella replaced Steve Ballmer, who had taken over for the company’s face, Bill Gates. Holger Mueller, an analyst at Constellation Research, says this notable difference could have a huge impact on Jassy, with his founder boss still looking over his shoulder.

“There’s a lot of similarity in the two situations, but Satya was a little removed from the founder Gates. Bezos will always hover and be there, whereas Gates (and Ballmer) had retired for good. [ … ] It was clear [they] would not be coming back. [ … ] For Jassy, the owner could [conceivably] come back anytime,” Mueller said.

But Andrew Bartels, an analyst at Forrester Research, says it’s not a coincidence that both leaders were plucked from the cloud divisions of their respective companies, even if it was seven years apart.

“In both cases, these hyperscale business units of Microsoft and Amazon were the fastest-growing and best-performing units of the companies. [ … ] In both cases, cloud infrastructure was seen as a platform on top of which and around which other cloud offerings could be developed,” Bartels said. The companies both believe that the leaders of these two growth engines were best suited to lead the company into the future.

Feb
02
2021
--

What Andy Jassy’s promotion to Amazon CEO could mean for AWS

Blockbuster news struck late this afternoon when Amazon announced that Jeff Bezos would be stepping back as CEO of Amazon, the company he built from a business in his garage to worldwide behemoth. As he takes on the role of executive chairman, his replacement will be none other than AWS CEO Andy Jassy.

With Jassy moving into his new role at the company, the immediate question is who replaces him to run AWS. Let the games begin. Among the names being tossed about in the rumor mill are Peter DeSantis, vice president of global infrastructure at AWS and Matt Garman, who is vice president of sales and marketing. Both are members of Bezos’ elite executive team known as the S-team and either would make sense as Jassy’s successor. Nobody knows for sure though, and it could be any number of people inside the organization, or even someone from outside. Amazon was not ready to comment on a successor yet with the hand-off still months away.

Holger Mueller, a senior analyst at Constellation Research, says that Jassy is being rewarded for doing a stellar job raising AWS from a tiny side business to one on a $50 billion run rate. “On the finance side it makes sense to appoint an executive who intimately knows Amazon’s most profitable business, that operates in more competitive markets. [Appointing Jassy] ensures that the new Amazon CEO does not break the ‘golden goose’,” Mueller told me.

Alex Smith, VP of channels, who covers the cloud infrastructure market at analyst firm Canalys, says the writing has been on the wall that a transition was in the works. “This move has been coming for some time. Jassy is the second most public-facing figure at Amazon and has led one of its most successful business units. Bezos can go out on a high and focus on his many other ventures,” Smith said.

Smith adds that this move should enhance AWS’s place in the organization. “I think this is more of an AWS gain, in terms of its increasing strategic importance to Amazon going forward, rather than loss in terms of losing Andy as direct lead. I expect he’ll remain close to that organization.”

Ed Anderson, a Gartner analyst also sees Jassy as the obvious choice to take over for Bezos. “Amazon is a company driven by technology innovation, something Andy has been doing at AWS for many years now. Also, it’s worth noting that Andy Jassy has an impressive track record of building and running a very large business. Under Andy’s leadership, AWS has grown to be one of the biggest technology companies in the world and one of the most impactful in defining what the future of computing will be,” Anderson said.

In the company earnings report released today, AWS came in at $12.74 billion for the quarter, up 28% YoY from $9.6 billion a year ago. That puts the company on an elite $50 billion run rate. No other cloud infrastructure vendor, even the mighty Microsoft, is even close in this category. Microsoft stands at around 20% market share compared to AWS’s approximately 33% market share.

It’s unclear what impact the executive shuffle will have on the company at large or AWS in particular. In some ways it feels like when Larry Ellison stepped down as CEO of Oracle in 2014 to take on the exact same executive chairman role. While Safra Catz and Mark Hurd took over as co-CEOs in that situation, Ellison has remained intimately involved with the company he helped found. It’s reasonable to assume that Bezos will do the same.

With Jassy, the company is getting a man who has risen through the ranks since joining the company in 1997 after getting an undergraduate degree and an MBA from Harvard. In 2002 he became VP/technical assistant, working directly under Bezos. It was in this role that he began to see the need for a set of common web services for Amazon developers to use. This idea grew into AWS and Jassy became a VP at the fledgling division working his way up until he was appointed CEO in 2016.

Jan
29
2021
--

Subscription-based pricing is dead: Smart SaaS companies are shifting to usage-based models

Software buying has evolved. The days of executives choosing software for their employees based on IT compatibility or KPIs are gone. Employees now tell their boss what to buy. This is why we’re seeing more and more SaaS companies — Datadog, Twilio, AWS, Snowflake and Stripe, to name a few — find success with a usage-based pricing model.


The usage-based model allows a customer to start at a low cost, minimizing the friction of getting started, while still preserving the ability to monetize that customer over time because the price is directly tied to the value the customer receives. By not limiting the number of users who can access the software, customers are able to find new use cases — which leads to more long-term success and higher lifetime value.

While software isn’t going 100% usage-based overnight, looking at some of the megatrends in software — automation, AI and APIs — the value of a product normally doesn’t scale with more logins. Usage-based pricing will be the key to successful monetization in the future. Here are four top tips to help companies scale to $100+ million ARR with this model.

1. Land-and-expand is real

Usage-based pricing is in all layers of the tech stack. Though it was pioneered in the infrastructure layer (think: AWS and Azure), it’s becoming increasingly popular for API-based products and application software — across infrastructure, middleware and applications.



Some fear that investors will hate usage-based pricing because customers aren’t locked into a subscription. But, investors actually see it as a sign that customers are seeing value from a product and there’s no shelf-ware.

In fact, investors are increasingly rewarding usage-based companies in the market. Usage-based companies are trading at a 50% revenue multiple premium over their peers.

Investors especially love how the usage-based pricing model pairs with the land-and-expand business model. And of the IPOs over the last three years, seven of the nine companies with the best net dollar retention have a usage-based model. Snowflake in particular is off the charts, with 158% net dollar retention.

Jan
28
2021
--

Load Balancing ProxySQL in AWS


There are several ways to deploy ProxySQL between your applications and the database servers. A common approach is to have a floating virtual IP (VIP) managed by keepalived as the application endpoint. The proxies have to be strategically provisioned to improve the resiliency of the solution (different hardware, network segments, etc.).

When we consider cloud environments, spreading instances across many availability zones (AZ) is considered a best practice, but that presents a problem regarding VIP handling.

By definition, VPC subnets have to be created in a specific AZ, and subnet IP ranges can’t overlap with one another. An IP address cannot simply be moved to an instance in a different AZ, as it would end up in a subnet that doesn’t include it.

So in order to use the VIP method, we would need to keep all our proxies in a single AZ. This is clearly not the best idea. In addition, the regular VIP method doesn’t work anyway, because broadcast is not allowed in AWS.

Let’s instead see how to overcome this by putting the ProxySQL instances behind a Network Load Balancer (NLB).

Creating a Load Balancer

1. Create an NLB, specifying the subnets where you launched the ProxySQL instances:

aws elbv2 create-load-balancer \
--name proxysql-lb \
--type network \
--scheme internal \
--subnets subnet-03fd9799aedda2a1d subnet-0c9c99a5902d8760f

With the above command, the LB internal endpoints will automatically pick an available IP address on each subnet. Alternatively, if you want to specify the IP addresses yourself, you can run the following:

aws elbv2 create-load-balancer \
--name proxysql-lb \
--type network \
--scheme internal \
--subnet-mappings Subnet-Id=subnet-03fd9799aedda2a1d,PrivateIPv4Address=10.1.1.2 Subnet-Id=subnet-0c9c99a5902d8760f,PrivateIPv4Address=10.1.2.2

The output of the above includes the Amazon Resource Name (ARN) of the load balancer, with the following format:

arn:aws:elasticloadbalancing:us-east-1:686800432451:loadbalancer/net/ivan-proxysql-lb/980f7598e7c43506

Let’s save the value on a variable for later use:

export LB_ARN=<paste the value from above>
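
If you prefer not to copy and paste, the same value can also be captured directly with a --query filter (a small convenience, using the load balancer name chosen above):

LB_ARN=$(aws elbv2 describe-load-balancers \
--names proxysql-lb \
--query 'LoadBalancers[0].LoadBalancerArn' \
--output text)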

Adding the ProxySQL Targets

2. Create a target group, specifying the same VPC that you used for your ProxySQL instances:

aws elbv2 create-target-group \
--name proxysql-targets \
--protocol TCP \
--port 6033 \
--target-type instance \
--health-check-port 6032 \
--health-check-interval-seconds 10 \
--vpc-id vpc-018cc1c34d4d709d5

The output should include the ARN of the target group with this format:

arn:aws:elasticloadbalancing:us-east-1:686800432451:targetgroup/proxysql-targets/d997e5efc62db322

We can store the value for later use:

export TG_ARN=<paste the value from above>

3. Register your ProxySQL instances with the target group:

aws elbv2 register-targets \
--target-group-arn $TG_ARN \
--targets Id=i-02d9e450af1b00524

aws elbv2 register-targets \
--target-group-arn $TG_ARN \
--targets Id=i-05d9f450af1b00521

Creating the LB Listener

4. Create a listener for your load balancer with a default rule to forward requests to your target group:

aws elbv2 create-listener \
--load-balancer-arn $LB_ARN \
--protocol TCP \
--port 3306 \
--default-actions Type=forward,TargetGroupArn=$TG_ARN

The output contains the ARN of the listener, with the following format:

arn:aws:elasticloadbalancing:us-east-1:686800432451:listener/net/ivan-proxysql-lb/980f7598e7c43506/0d0c68ddde71b83f

5. You can verify the health of the registered targets using the following command:

aws elbv2 describe-target-health --target-group-arn $TG_ARN

Be aware that it takes a few minutes for the targets to become healthy.

Testing Access

6. Now let’s get the DNS name of the load balancer:

LB_DNS=$(aws elbv2 describe-load-balancers --load-balancer-arns $LB_ARN --query 'LoadBalancers[0].DNSName' --output text)

7. Test access to the load balancer itself:

curl -v $LB_DNS:3306

8. Finally, test the connection to the database through the load balancer:

mysql -u percona -p -h "$LB_DNS"

Final Considerations

For this example, I am using a simple TCP connection to ProxySQL’s admin port as the health check. Another option would be to expose a separate HTTP service that queries ProxySQL to handle more complex health check logic.
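
As a rough sketch of that approach (the admin credentials, ports and the use of socat are assumptions here, not part of the original setup), a tiny script can translate a ProxySQL admin query into an HTTP status that the NLB health check can probe:

#!/bin/bash
# proxysql-health.sh - reply 200 only if ProxySQL reports at least one ONLINE backend.
# Admin credentials, port and query are assumptions; adapt them to your deployment.
ONLINE=$(mysql -h127.0.0.1 -P6032 -uadmin -padmin -NBe \
  "SELECT COUNT(*) FROM runtime_mysql_servers WHERE status='ONLINE';" 2>/dev/null)
if [ "${ONLINE:-0}" -gt 0 ]; then
  printf 'HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nOK'
else
  printf 'HTTP/1.1 503 Service Unavailable\r\nContent-Length: 4\r\n\r\nFAIL'
fi

The script could then be exposed with something like socat TCP-LISTEN:8080,reuseaddr,fork EXEC:/usr/local/bin/proxysql-health.sh, and the target group’s health check switched to HTTP on port 8080.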

It is also important to mention the difference between target-type:instance and target-type:ip for the target group. In the latter, if you check the client connections on the Proxy side (stats_mysql_processlist table) you will see they all come from the load balancer address instead of the actual client. Hence it is more desirable to use instance, to see the real client IP.

Jan
22
2021
--

PostgreSQL on ARM-based AWS EC2 Instances: Is It Any Good?


The expected growth of ARM processors in data centers has been a hot topic of discussion for quite some time, and we were curious to see how they perform with PostgreSQL. The general availability of ARM-based servers for testing and evaluation was a major obstacle, and the ice was broken when AWS announced its ARM-based processor offering in its cloud in 2018. We didn’t see much excitement immediately, though, as many considered it more “experimental” hardware. We were also cautious about recommending it for critical use and never put enough effort into evaluating it. But when the second generation of Graviton2-based instances was announced in May 2020, we wanted to take it seriously. We decided to take an independent look at the price/performance of the new instances from the standpoint of running PostgreSQL.

Important: Note that while it’s tempting to call this comparison of PostgreSQL on x86 vs arm, that would not be correct. These tests compare PostgreSQL on two virtual cloud instances, and that includes way more moving parts than just a CPU. We’re primarily focusing on the price-performance of two particular AWS EC2 instances based on two different architectures.

Test Setup

For this test, we picked two similar instances. One is the older m5d.8xlarge, and the other is a new Graviton2-based m6gd.8xlarge. Both instances come with local “ephemeral” storage that we’ll be using here. Using very fast local drives should help expose differences in other parts of the system and avoid testing cloud storage. The instances are not perfectly identical, as you’ll see below, but are close enough to be considered the same grade. We used an Ubuntu 20.04 AMI and PostgreSQL 13.1 from the pgdg repo. We performed tests with small (in-memory) and large (io-bound) database sizes.

Instances

Specifications and on-demand pricing of the instances are as per the AWS pricing information for Linux in the Northern Virginia region. With the currently listed prices, m6gd.8xlarge is 20% cheaper (equivalently, m5d.8xlarge costs 25% more).

Graviton2 (arm) Instance

Instance : m6gd.8xlarge
Virtual CPUs : 32
RAM : 128 GiB
Storage : 1 x 1900 GB NVMe SSD
Price : $1.4464 per hour

Regular (x86) Instance

Instance : m5d.8xlarge
Virtual CPUs : 32
RAM : 128 GiB
Storage : 2 x 600 GB NVMe SSD (1.2 TB total)
Price : $1.808 per hour

OS and PostgreSQL setup

We selected Ubuntu 20.04.1 LTS AMIs for the instances and didn’t change anything on the OS side. On the m5d.8xlarge instance, two local NVMe drives were unified in a single raid0 device. PostgreSQL was installed using .deb packages available from the PGDG repository.
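
For reference, unifying the two ephemeral NVMe drives into a raid0 device can be done along these lines (a sketch only; device names, filesystem and mount point are assumptions and will differ per instance):

# Assemble the two local NVMe drives into a striped array and mount it for the data directory.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
sudo mkfs.ext4 /dev/md0
sudo mount /dev/md0 /var/lib/postgresql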

The PostgreSQL version string confirms the OS architecture:

postgres=# select version();
                                                                version                                                                 
----------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 13.1 (Ubuntu 13.1-1.pgdg20.04+1) on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
(1 row)

Note: aarch64 stands for the 64-bit ARM architecture.

The following PostgreSQL configuration was used for testing.

max_connections = '200'
shared_buffers = '32GB'
checkpoint_timeout = '1h'
max_wal_size = '96GB'
checkpoint_completion_target = '0.9'
archive_mode = 'on'
archive_command = '/bin/true'
random_page_cost = '1.0'
effective_cache_size = '80GB'
maintenance_work_mem = '2GB'
autovacuum_vacuum_scale_factor = '0.4'
bgwriter_lru_maxpages = '1000'
bgwriter_lru_multiplier = '10.0'
wal_compression = 'ON'
log_checkpoints = 'ON'
log_autovacuum_min_duration = '0'

pgbench Tests

First, a preliminary round of tests is done using pgbench, the micro-benchmarking tool available with PostgreSQL. This allows us to test different combinations of client and job counts, like:

pgbench -c 16 -j 16 -T 600 -r

Where 16 client connections and 16 pgbench jobs feeding the client connections are used.

Read-Write Without Checksum

The default load that pgbench creates is a TPC-B-like read-write load. We used the same on a PostgreSQL instance which doesn’t have checksums enabled.

We could see a 19% performance gain on ARM.

x86 (tps) 28878
ARM (tps) 34409

Read-Write With Checksum

We were curious whether the checksum calculation has any impact on performance due to the architecture difference when PostgreSQL-level checksums are enabled. From PostgreSQL 12 onwards, checksums can be enabled on an existing cluster using the pg_checksums utility as follows:

pg_checksums -e -D $PGDATA

x86 (tps) 29402
ARM (tps) 34701

To our surprise, the results were marginally better! Since the difference is only around 1.7%, we consider it noise. At the very least, we feel it is OK to conclude that enabling checksums doesn’t cause any noticeable performance degradation on these modern processors.

Read-Only Without Checksum

Read-only loads are expected to be CPU-centric. Since we selected a database size that fully fits into memory, we could eliminate IO related overheads.
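
For reference, a select-only run of this kind can be launched with pgbench’s built-in -S script, for example:

pgbench -S -c 16 -j 16 -T 600 -r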

x86 (tps) 221436.05
ARM (tps) 288867.44

The results showed a 30% gain in tps for the ARM instance over the x86 instance.

Read-Only With Checksum

We wanted to check whether we could observe any tps change with checksums enabled when the load becomes purely CPU-centric.

x86 (tps) 221144.3858
ARM (tps) 279753.1082

The results were very close to the previous ones, with a 26.5% gain for ARM.

In the pgbench tests, we observed that as the load becomes more CPU-centric, the difference in performance increases. We couldn’t observe any performance degradation with checksums enabled.

Note on checksums

PostgreSQL calculates and writes checksums for pages when they are written out and verifies them when they are read into the buffer pool. In addition, hint bits are always logged when checksums are enabled, increasing the WAL IO pressure. To correctly validate the overall checksum overhead, we would need longer and larger tests, similar to what we did with sysbench-tpcc.

Testing With sysbench-tpcc

We decided to perform more detailed tests using sysbench-tpcc. We were mainly interested in the case where the database fits into memory. On a side note, while PostgreSQL on the ARM server showed no issues, sysbench itself was much more finicky on ARM than on x86.

Each round of testing consisted of a few steps (a rough sketch of one round follows the list):

  1. Restore the data directory of the necessary scale (10/200).
  2. Run a 10-minute warmup test with the same parameters as the large test.
  3. Checkpoint on the PG side.
  4. Run the actual test.
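
Sketched as a script, one round looks roughly like this (paths, scale, durations and the exact sysbench-tpcc options are placeholders and depend on your environment):

# Rough sketch of one test round; adjust paths and options to your setup.
sudo systemctl stop postgresql
rsync -a --delete /backups/tpcc-scale10/ "$PGDATA"/                    # 1. restore the prepared data directory
sudo systemctl start postgresql
./tpcc.lua --db-driver=pgsql --tables=10 --scale=10 --threads=16 --time=600 run   # 2. 10-minute warmup run
psql -c "CHECKPOINT;"                                                  # 3. checkpoint on the PG side
./tpcc.lua --db-driver=pgsql --tables=10 --scale=10 --threads=16 --time=3600 run  # 4. the actual test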

In-memory, 16 threads:


With this moderate load, the ARM instance shows around 15.5% better performance than the x86 instance. Here and below, the percentage difference is based on the mean tps value.

You might be wondering why there is a sudden drop in performance towards the end of the test. It is related to checkpointing with full_page_writes. Even though we used a pareto distribution for the in-memory testing, a considerable number of pages is going to be written out after each checkpoint. In this case, the instance showing more performance triggered WAL-based checkpoints earlier than its counterpart. These dips are present across all the tests performed.

In-memory, 32 threads:


When concurrency increased to 32, the difference in performance reduced to nearly 8%.

In-memory, 64 threads:


Pushing the instances close to their saturation point (remember, both are 32-vCPU instances), we see the difference reducing further, to 4.5%.

In-memory, 128 threads:


When both instances are past their saturation point, the difference in performance becomes negligible, although it’s still there at 1.4%. Additionally, we could observe a 6-7% drop in throughput (tps) for ARM and a 4% drop for x86 when concurrency increased from 64 to 128 on these 32-vCPU machines.

Not everything we measured is favorable to the Graviton2-based instance. In the IO-bound tests (~200 GB dataset, 200 warehouses, uniform distribution), we saw less difference between the two instances, and at 64 and 128 threads, the regular m5d instance performed better. You can see this in the combined plots below.

A possible reason for this, especially the significant meltdown at 128 threads for m6gd.8xlarge, is that it lacks the second drive that m5d.8xlarge has. There is no perfectly comparable pair of instances available currently, so we consider this a fair comparison; each instance type has an advantage. More testing and profiling is necessary to correctly identify the cause, as we expected the local drives to affect the tests only negligibly. IO-bound testing with EBS could potentially be performed to remove the local drives from the equation.

More details of the test setup, results of the tests, scripts used, and data generated during the testing are available from this GitHub repo.

Summary

There were not many cases where the ARM instance was slower than the x86 instance in the tests we performed. The test results were consistent throughout the testing of the last couple of days. While the ARM-based instance is about 20% cheaper (the x86 instance costs 25% more), it was able to show a 15-20% performance gain in most of the tests over the corresponding x86-based instance. So ARM-based instances give conclusively better price/performance in all aspects. We should expect more and more cloud providers to offer ARM-based instances in the future. Please let us know if you wish to see any different type of benchmark tests.


Jan
20
2021
--

Drain Kubernetes Nodes… Wisely


What is Node Draining?

Anyone who ever worked with containers knows how ephemeral they are. In Kubernetes, not only can containers and pods be replaced, but the nodes as well. Nodes in Kubernetes are VMs, servers, and other entities with computational power where pods and containers run.

Node draining is the mechanism that allows users to gracefully move all containers from one node to other nodes. There are multiple use cases:

  • Server maintenance
  • Autoscaling of the k8s cluster – nodes are added and removed dynamically
  • Preemptable or spot instances that can be terminated at any time

Why Drain?

Kubernetes can automatically detect node failure and reschedule the pods to other nodes. The only problem here is the time between the node going down and the pod being rescheduled. Here’s how it goes without draining:

  1. Node goes down – someone pressed the power button on the server.

  2. kube-controller-manager, the service which runs on the masters, cannot get the NodeStatus from the kubelet on the node. By default it tries to get the status every 5 seconds, which is controlled by the --node-monitor-period parameter of the controller.

  3. Another important parameter of the kube-controller-manager is --node-monitor-grace-period, which defaults to 40s. It controls how fast the node will be marked as NotReady by the master.

  4. So after ~40 seconds, kubectl get nodes shows one of the nodes as NotReady, but the pods are still there and shown as running. This leads us to --pod-eviction-timeout, which is 5 minutes by default (!). It means that only after the node has been marked NotReady for 5 minutes does Kubernetes start to evict the pods.


So if someone shuts down the server, it takes almost six minutes (with default settings) before Kubernetes starts to reschedule the pods to other nodes. This timing is also valid for managed k8s clusters, like GKE.

These defaults might seem too high, but this is done to prevent frequent pod flapping, which might impact your application and infrastructure in a far more negative way.

Okay, Draining How?

As mentioned before, draining is the graceful method to move the pods to another node. Let’s see how draining works and what pitfalls there are.

Basics

The kubectl drain {NODE_NAME} command alone most likely will not work. There are at least two flags that need to be set explicitly (the combined command is shown after the list):

  • --ignore-daemonsets – it is not possible to evict pods that run under a DaemonSet. This flag tells the drain to ignore these pods.

  • --delete-emptydir-data – an acknowledgment of the fact that data in EmptyDir ephemeral storage will be gone once the pods are evicted.
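
Putting it together, a typical invocation looks like this:

kubectl drain {NODE_NAME} --ignore-daemonsets --delete-emptydir-data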

Once the drain command is executed the following happens:

  1. The node is cordoned. It means that no new pods can be placed on this node. In the Kubernetes world, it is a taint node.kubernetes.io/unschedulable:NoSchedule placed on the node, which most pods do not tolerate (DaemonSet pods being the notable exception).

  2. Pods, except the ones that belong to DaemonSets, are evicted and hopefully scheduled on another node.

Pods are evicted and now the server can be powered off. Wrong.

DaemonSets

If your application or service uses a DaemonSet primitive, those pods were not drained from the node. That means they can still perform their function and even receive traffic from the load balancer or the service.

The best way to ensure this does not happen is to delete the node from Kubernetes itself:

  1. Stop the kubelet on the node.

  2. Delete the node from the cluster with kubectl delete node {NODE_NAME}.

If the kubelet is not stopped, the node will appear again after the deletion.

Pods are evicted, node is deleted, and now the server can be powered off. Wrong again.

Load Balancer

Here is quite a standard setup:


The external load balancer sends the traffic to all Kubernetes nodes. Kube-proxy and Container Network Interface internals are dealing with routing the traffic to the correct pod.

There are various ways to configure the load balancer, but as you can see, it might still be sending traffic to the node. Make sure that the node is removed from the load balancer before powering it off. For example, the AWS node termination handler does not remove the node from the load balancer, which causes a short packet loss in the event of node termination.

Conclusion

Microservices and Kubernetes shifted the paradigm of systems availability. SRE teams are focused on resilience more than on stability. Nodes, containers, and load balancers can fail, and teams are ready for it. Kubernetes is an orchestration and automation tool that helps a lot here, but there are still pitfalls that must be taken care of to meet SLAs.

Jan
13
2021
--

Running Kubernetes on the Edge


What is Edge?

Edge is a buzzword that, behind the curtain, means moving private or public clouds closer to the end devices. End devices, such as Internet of Things devices (from a doorbell to a VoIP station), are becoming more complex and require more computational power. There is constant growth in connected devices, and by the end of 2025 there will be 41.6 billion of them, generating 69.4 zettabytes of data.

Latency, data-processing speed, or security concerns often do not allow computation to happen in the cloud. Instead, businesses rely on edge computing or micro clouds, which can run closer to the end devices. All of this constitutes the Edge.

How Kubernetes Helps Here

Containers are portable and quickly becoming a de facto standard to ship software. Kubernetes is a container orchestrator with robust built-in scaling capabilities. This gives the perfect toolset for businesses to shape their Edge computing with ease and without changing existing processes.

The cloud-native landscape has various small Kubernetes distributions that were designed and built for the Edge: k3s, microk8s, minikube, k0s, and the newly released EKS Distro. They are lightweight, can be deployed with a few commands, and are fully conformant. Projects like KubeEdge bring even more simplicity and standardization into the Kubernetes ecosystem on the Edge.
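
As an example of how lightweight these distributions are, a single-node k3s cluster can typically be bootstrapped with the widely documented one-line installer (shown here as a sketch; review any script before piping it into a shell):

curl -sfL https://get.k3s.io | sh -
sudo k3s kubectl get nodes   # the node should report Ready within a minute or so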

Running Kubernetes on the Edge also poses the challenge of managing hundreds or thousands of clusters. Google Anthos, Azure Arc, and VMware Tanzu allow you to run your clusters anywhere and manage them through a single interface with ease.

Topologies

We are going to review various topologies that Kubernetes provides for the Edge to bring computation and software closer to the end devices.

The end device is a Kubernetes cluster


Some devices run complex software and require multiple components to operate – web servers, databases, built-in data-processing, etc. Using packages is an option, but compared to containers and automated orchestration, it is slow and sometimes turns the upgrade process into a nightmare. In such cases, it is possible to run a Kubernetes cluster on each end device and manage software and infrastructure components using well-known primitives.

The drawback of this solution is the overhead that comes from running etcd and the master components on every device.

The end device is a node


In this case, you can manage each end device through a single Kubernetes control plane. Deploying software to support and run your phones, printers or any other devices can be done through standard Kubernetes primitives.

Micro-clouds


This topology is all about moving computational power closer to the end devices by creating micro-clouds on the Edge. A micro-cloud is formed by Kubernetes nodes on a server farm on the customer’s premises. Running your AI/ML (like Kubeflow) or any other resource-heavy application in your own micro-cloud is done with Kubernetes and its primitives.

How Percona Addresses Edge Challenges

We at Percona continue to invest in the Kubernetes ecosystem and expand our partnership with the community. Our Kubernetes Operators for Percona XtraDB Cluster and MongoDB are open source and enable anyone to run production-ready MySQL and MongoDB databases on the Edge.

Check out how easy it is to deploy our operators on Minikube or EKS Distro (which is similar to microk8s). We are working on further simplifying Day 2 operations, and in future blog posts you will see how to deploy and manage databases on multiple Kubernetes clusters with KubeApps.
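
For a sense of what that deployment looks like, the quick-start flow for the Percona XtraDB Cluster operator is roughly the following (a sketch based on the operator’s public repository; file names and exact steps may differ between versions, so follow the current documentation):

git clone https://github.com/percona/percona-xtradb-cluster-operator
cd percona-xtradb-cluster-operator
kubectl apply -f deploy/bundle.yaml   # CRDs, RBAC and the operator deployment
kubectl apply -f deploy/cr.yaml       # a sample Percona XtraDB Cluster custom resource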
