Migration of a MongoDB Replica Set to a Sharded Cluster

Migration of a MongoDB Replica Set to a Sharded Cluster

Migration of a MongoDB Replica Set to a Sharded ClusterIn this blog post, we will discuss how can we migrate from a replica set to sharded cluster. 

Before moving to migration let me briefly explain Replication and Sharding and why do we need to shard a replica Set.

Replication: It creates additional copies of data and allows for automatic failover to another node in case Primary went down. It also helps to scale our reads if the application is fine to read data that may not be the latest.

Sharding: It allows horizontal scaling of data writes by allowing data partition in multiple servers by using a shard key. Here, we should understand that a shard key is very important to distribute the data evenly across multiple servers. 

Why Do We Need a Sharded Cluster?

We need sharding due to the below reasons:

  1. By adding shards, we can reduce the number of operations each shard manages. 
  2. It increases the Read/Write capacity by distributing the Reads/Writes across multiple servers. 
  3. It also gives high availability as we deploy the replicas for the shards, config servers, and multiple MongoS.

Sharded cluster will include two more components which are Config Servers and Query routers i.e. MongoS.

Config Servers: It keeps metadata for the sharded cluster. The metadata comprises a list of chunks on each shard and the ranges that define the chunks. The metadata indicates the state of all the data and its components within the cluster. 

Query Routers(MongoS): It caches metadata and uses it to route the read or write operations to the respective shards. It also updates the cache when there are any metadata changes for the sharded cluster like Splitting of chunks or shard addition etc. 

Note: Before starting the migration process it’s recommended that you perform a full backup (if you don’t have one already).

The Procedure of Migration:

  1. Initiate at least a three-member replica set for the Config Server ( another member can be included as a hidden node for the backup purpose).
  2. Perform necessary OS, H/W, and disk-level tuning as per the existing Replica set.
  3. Setup the appropriate clusterRole for the Config servers in the mongod config file.
  4. Create at least two more nodes for the Query routers ( MongoS )
  5. Set appropriate configDB parameters in the mongos config file.
  6. Repeat step 2 from above to tune as per the existing replica set.
  7. Apply proper SELinux policies on all the newly configured nodes of Config server and MongoS.
  8. Add clusterRole parameter into existing replica set nodes in a rolling fashion.
  9. Copy all the users from the replica set to any MongoS.
  10. Connect to any MongoS and add the existing replica set as Shard. 

Note: Do not enable sharding on any database until the shard key is finalized. If it’s finalized then we can enable the sharding.

Detailed Migration Plan:

Here, we are assuming that a Replica set has three nodes (1 primary, and 2 secondaries)

  1. Create three servers to initiate a 3-member replica set for the Config Servers.

Perform necessary OS, H/W, and disk-level tuning. To know more about it, please visit our blog on Tuning Linux for MongoDB.

  1. Install the same version of Percona Server for MongoDB as the existing replica set from here.
  2. In the config file of the config server mongod, add the parameter clusterRole: configsvr and port: 27019  to start it as config server on port 27019.
  3. If SELinux policy is enabled then set the necessary SELinux policy for dbPath, keyFile, and logs as below.
sudo semanage fcontext -a -t mongod_var_lib_t '/dbPath/mongod.*'

sudo chcon -Rv -u system_u -t mongod_var_lib_t '/dbPath/mongod'

sudo restorecon -R -v '/dbPath/mongod'

sudo semanage fcontext -a -t mongod_log_t '/logPath/log.*'

sudo chcon -Rv -u system_u -t mongod_log_t '/logPath/log'

sudo restorecon -R -v '/logPath/log'

sudo semanage port -a -t mongod_port_t -p tcp 27019

Start all the Config server mongod instances and connect to any one of them. Create a temporary user on it and initiate the replica set.

> use admin

> rs.initiate()

> db.createUser( { user: "tempUser", pwd: "<password>", roles:[{role: "root" , db:"admin"}]})

Create a role anyResource with action anyAction as well and assign it to “tempUser“.

>db.getSiblingDB("admin").createRole({ "role": "pbmAnyAction",

      "privileges": [

         { "resource": { "anyResource": true },

           "actions": [ "anyAction" ]



      "roles": []



>db.grantRolesToUser( "tempUser", [{role: "pbmAnyAction", db: "admin"}]  )

> rs.add("config_host[2-3]:27019")

Now our Config server replica set is ready, let’s move to deploying Query routers i.e. MongoS.

  1. Create two instances for the MongoS and tune the OS, H/W, and disk. To do it follow our blog Tuning Linux for MongoDB or point 1 from the above Detailed migration.
  2. In mongos config file, adjust the configDB parameter and include only non-hidden nodes of Config servers ( In this blog post, we have not mentioned starting hidden config servers).
  3. Apply SELinux policies if it’s enabled, then follow step 4 and keep the same keyFile and start the MongoS on port 27017.
  4. Add the below parameter in mongod.conf on the Replica set nodes. Make sure the services are restarted in a rolling fashion i.e. start with the Secondaries then step down the existing Primary and restart it with port 27018.
clusterRole: shardsvr

Login to any MongoS and authenticate using “tempUser” and add the existing replica set as a shard.

> sh.addShard( "replicaSetName/<URI of the replica set>") //Provide URI of the replica set

Verify it with:

> sh.status() or db.getSiblingDB("config")['shards'].find()

Connect to the Primary of the replica set and copy all the users and roles. To authenticate/authorize mention the replica set user.

> var mongos = new Mongo("mongodb://put MongoS URI string here/admin?authSource=admin") //Provide the URI of the MongoS with tempUser for authentication/authorization.

>db.getSiblingDB("admin").system.roles.find().forEach(function(d) {


>db.getSiblingDB("admin").system.users.find().forEach(function(d) { mongos.getDB('admin').getCollection('system.users').insert(d)});

  1.  Connect to any MongoS and verify copied users on it. 
  2.  Shard the database if shardKey is finalized (In this post, we are not sharing this information as it’s related to migration of Replica set to Sharded cluster only).

Shard the database:


Shard the collection with hash-based shard key:

>sh.shardCollection("<db>.<coll1>", { <shard key field> : "hashed" } )

Shard the collection with range based shard key:

sh.shardCollection("<db>.<coll1>", { <shard key field> : 1, ... } )


Migration of a MongoDB replica set to a sharded cluster is very important to scale horizontally, increase the read/write operations, and also reduce the operations each shard manages.

We encourage you to try our products like Percona Server for MongoDB, Percona Backup for MongoDB, or Percona Operator for MongoDB. You can also visit our site to know “Why MongoDB Runs Better with Percona”.


MongoDB – Converting Replica Set to Standalone

Converting Replica Set to Standalone MongoDB

Converting Replica Set to Standalone MongoDBOne of the reasons for using MongoDB is its simplicity in scaling its structure,  either as Replicaset(scale-up) or as Sharded Cluster(scale-out).

Although a bit tricky in Sharded environments, the reverse process is also feasible.

From the possibilities, you can:

So far, so good, but if for some reason that mongo project which started as a Replica Set does not require such structure and now a Standalone server meets the requirements; 

  • Can you shrink from Replica Set to a Standalone server? 
    • If possible, what is the procedure to achieve this configuration?

This blog post will walk through the necessary steps to convert a Replica Set into a Standalone server.


Before starting the process, it is important to mention that it’s not advised to run Standalone configuration on production servers as your entire availability will only rely on a single point.

Another observation is this activity requires downtime, that’s because you will need to remove the replication parameter, which can not be at runtime.

Last note; to avoid any data change during the procedure, it’s recommended to stop writing from the application. You will need to adjust the connection string at the end of the process for the new configuration.

How To:

The testing environment is a standard Replica Set configuration with three nodes, as seen at the following replication status.

		"_id" : 0,
		"name" : "node1:27017",
		"stateStr" : "PRIMARY"
		"_id" : 1,
		"name" : "node2:27017",
		"stateStr" : "SECONDARY",
		"_id" : 2,
		"name" : "node3:27017",
		"stateStr" : "SECONDARY",

  • The PRIMARY is our reliable data source; thus, it will be the Standalone server at the end of the process.

It is not needed to trigger any election process unless you have a preferred node to be the Standalone. If that’s your case and that node is not the PRIMARY yet, please make sure to make PRIMARY before starting the procedure.

That said, let’s get our hands dirty!

1) Removing the secondaries:

  • From the PRIMARY node, let’s remove node 2 and 3:
rs0:PRIMARY> rs.remove("node2:27017")
	"ok" : 1,
	"$clusterTime" : {
		"clusterTime" : Timestamp(1640109346, 1),
		"signature" : {
			"hash" : BinData(0,"663Y5u3GkZkQ0Ci7EU8IYUldVIM="),
			"keyId" : NumberLong("7044208418021703684")
	"operationTime" : Timestamp(1640109346, 1)

rs0:PRIMARY> rs.remove("node3:27017")
	"ok" : 1,
	"$clusterTime" : {
		"clusterTime" : Timestamp(1640109380, 1),
		"signature" : {
			"hash" : BinData(0,"RReAzLZmHWfzVcdnLoGJ/uz04Vo="),
			"keyId" : NumberLong("7044208418021703684")
	"operationTime" : Timestamp(1640109380, 1)

Please note: Although my Replica Set comprises three nodes, you must let only the PRIMARY on replication at this step. If you have more nodes, the process is the same.

Then, the expected status should be:

rs0:PRIMARY> rs.status().members
		"_id" : 0,
		"name" : "node1:27017",
		"health" : 1,
		"state" : 1,
		"stateStr" : "PRIMARY",
		"uptime" : 1988,
		"optime" : {
			"ts" : Timestamp(1640109560, 1),
			"t" : NumberLong(1)
		"optimeDate" : ISODate("2021-12-21T17:59:20Z"),
		"syncSourceHost" : "",
		"syncSourceId" : -1,
		"infoMessage" : "",
		"electionTime" : Timestamp(1640107580, 2),
		"electionDate" : ISODate("2021-12-21T17:26:20Z"),
		"configVersion" : 5,
		"configTerm" : 1,
		"self" : true,
		"lastHeartbeatMessage" : ""


2) Changing the configuration:

  • Even though node2 and node3 are no longer part of the replication, let’s stop the mongod process. Keeping their data safe, as it can be used in a recovery scenario if necessary:
node2-shell> systemctl stop mongod
node3-shell> systemctl stop mongod

  • Then, you can remove the replication parameter from PRIMARY and restart it, but first! In the configuration file:
    • Please remove or comment the lines:
#  replSetName: "rs0"

If you start MongoDB via the command line, please remove the option  --replSet.

  • Once changed, you can restart the mongod on PRIMARY:
node1-shell> systemctl restart mongod

3) Removing replication objects:

  • Once connected to former PRIMARY, now Standalone, MongoDB will warn you with the following message:
2021-12-21T18:23:52.056+00:00: Document(s) exist in 'system.replset', but started without --replSet. Database contents may appear inconsistent with the writes that were visible when this node was running as part of a replica set. Restart with --replSet unless you are doing maintenance and no other clients are connected. The TTL collection monitor will not start because of this. For more info see http://dochub.mongodb.org/core/ttlcollections


That is because the node was working under a Replica Set configuration, and although it’s a Standalone now, there is the old structure existing.

For further detail, that structure resides under the local database, which holds all necessary data for replication, and for the database itself:

mongo> use local
switched to db local

mongo> show collections


To remove that warning, you should remove the old Replica Set object on local.system.replset.

local.system.replset holds the replica set’s configuration object as its single document. To view the object’s configuration information, issue rs.conf() from mongosh. You can also query this collection directly.

However, it’s important to mention the local.system.replset is a system object, and there is no built-in role that you can use to remove objects –  Unless you create a custom role that grants anyAction on anyResource to the user, Or you grant the internal role __system to a user.


Please note, both of that options gives access to every resource in the system and is intended for internal use. Do not use this permissions, other than in exceptional circumstances.  

For this process, we will create a new user, temporary, with __system privilege in which we can drop later after the action.

3. a) Creating the temporary user:

mongo> db.getSiblingDB("admin").createUser({user: "dba_system", "pwd": "sekret","roles" : [{ "db" : "admin", "role" : "__system" }]});

Confirmation output:

Successfully added user:{

3. b) Removing the object:
  1. First; Connect with the created user.
  2. Then:
mongo> use local;
mongo> db.system.replset.remove({})

Confirmation output:

WriteResult({ "nRemoved" : 1 })


3. c) Dropping the temporary user:
  1. Disconnect from that system user.
  2. Then:
mongo> db.dropUser("dba_system")


4) Restart the service:

shell> systemctl restart mongod


Once reconnected, you will not see that warning again, and the server is ready!


5) Connection advice:

  • Please make sure to adjust the connection string from the app and clients, passing only the Standalone server.



The conversion process, in general, is simple and should not face problems. But is essential to highlight that we strongly recommend you test and run this procedure in a lower environment. And if you plan to use it on production, please bear in mind the said before.

And if you have any questions, feel free to contact us in the comment section below!


See you!



Scaling CockroachDB in the red ocean of relational databases

Most database startups avoid building relational databases, since that market is dominated by a few goliaths. Oracle, MySQL and Microsoft SQL Server have embedded themselves into the technical fabric of large- and medium-size companies going back decades. These established companies have a lot of market share and a lot of money to quash the competition.

So rather than trying to compete in the relational database market, over the past decade, many database startups focused on alternative architectures such as document-centric databases (like MongoDB), key-value stores (like Redis) and graph databases (like Neo4J). But Cockroach Labs went against conventional wisdom with CockroachDB: It intentionally competed in the relational database market with its relational database product.

While it did face an uphill battle to penetrate the market, Cockroach Labs saw a surprising benefit: It didn’t have to invent a market. All it needed to do was grab a share of a market that also happened to be growing rapidly.

Cockroach Labs has a bright future, compelling technology, a lot of money in the bank and has an experienced, technically astute executive team.

In previous parts of this EC-1, I looked at the origins of CockroachDB, presented an in-depth technical description of its product as well as an analysis of the company’s developer relations and cloud service, CockroachCloud. In this final installment, we’ll look at the future of the company, the competitive landscape within the relational database market, its ability to retain talent as it looks toward a potential IPO or acquisition, and the risks it faces.

CockroachDB’s success is not guaranteed. It has to overcome significant hurdles to secure a profitable place for itself among a set of well-established database technologies that are owned by companies with very deep pockets.

It’s not impossible, though. We’ll first look at MongoDB as an example of how a company can break through the barriers for database startups competing with incumbents.

When life gives you Mongos, make MongoDB

Dev Ittycheria, MongoDB CEO, rings the Nasdaq Stock Market Opening Bell. Image Credits: Nasdaq, Inc

MongoDB is a good example of the risks that come with trying to invent a new database market. The company started out as a purely document-centric database at a time when that approach was the exception rather than the rule.

Web developers like document-centric databases because they address a number of common use cases in their work. For example, a document-centric database works well for storing comments to a blog post or a customer’s entire order history and profile.


What does Red Hat’s sale to IBM tell us about Couchbase’s valuation?

The IPO rush of 2021 continued this week with a fresh filing from NoSQL provider Couchbase. The company raised hundreds of millions while private, making its impending debut an important moment for a number of private investors, including venture capitalists.

According to PitchBook data, Couchbase was last valued at a post-money valuation of $580 million when it raised $105 million in May 2020. The company — despite its expansive fundraising history — is not a unicorn heading into its debut to the best of our knowledge.

We’d like to uncover whether it will be one when it prices and starts to trade, so we dug into Couchbase’s business model and its financial performance, hoping to better understand the company and its market comps.

The Couchbase S-1

The Couchbase S-1 filing details a company that sells database tech. More specifically, Couchbase offers customers database technology that includes what NoSQL can offer (“schema flexibility,” in the company’s phrasing), as well as the ability to ask questions of their data with SQL queries.

Couchbase’s software can be deployed on clouds, including public clouds, in hybrid environments, and even on-prem setups. The company sells to large companies, attracting 541 customers by the end of its fiscal 2021 that generated $107.8 million in annual recurring revenue, or ARR, by the close of last year.

Couchbase breaks its revenue into two main buckets. The first, subscription, includes software license income and what the company calls “support and other” revenues, which it defines as “post-contract support,” or PCS, which is a package of offerings, including “support, bug fixes and the right to receive unspecified software updates and upgrades” for the length of the contract.

The company’s second revenue bucket is services, which is self-explanatory and lower-margin than its subscription products.


Meroxa raises $15M Series A for its real-time data platform

Meroxa, a startup that makes it easier for businesses to build the data pipelines to power both their analytics and operational workflows, today announced that it has raised a $15 million Series A funding round led by Drive Capital. Existing investors Root, Amplify and Hustle Fund also participated in this round, which together with the company’s previously undisclosed $4.2 million seed round now brings total funding in the company to $19.2 million.

The promise of Meroxa is that businesses can use a single platform for their various data needs and won’t need a team of experts to build their infrastructure and then manage it. At its core, Meroxa provides a single software-as-a-service solution that connects relational databases to data warehouses and then helps businesses operationalize that data.

Image Credits: Meroxa

“The interesting thing is that we are focusing squarely on relational and NoSQL databases into data warehouse,” Meroxa co-founder and CEO DeVaris Brown told me. “Honestly, people come to us as a real-time FiveTran or real-time data warehouse sink. Because, you know, the industry has moved to this [extract, load, transform] format. But the beautiful part about us is, because we do change data capture, we get that granular data as it happens.” And businesses want this very granular data to be reflected inside of their data warehouses, Brown noted, but he also stressed that Meroxa can expose this stream of data as an API endpoint or point it to a Webhook.

The company is able to do this because its core architecture is somewhat different from other data pipeline and integration services that, at first glance, seem to offer a similar solution. Because of this, users can use the service to connect different tools to their data warehouse but also build real-time tools on top of these data streams.

Image Credits: Meroxa

“We aren’t a point-to-point solution,” Meroxa co-founder and CTO Ali Hamidi explained. “When you set up the connection, you aren’t taking data from Postgres and only putting it into Snowflake. What’s really happening is that it’s going into our intermediate stream. Once it’s in that stream, you can then start hanging off connectors and say, ‘Okay, well, I also want to peek into the stream, I want to transfer my data, I want to filter out some things, I want to put it into S3.’ ”

Because of this, users can use the service to connect different tools to their data warehouse but also build real-time tools to utilize the real-time data stream. With this flexibility, Hamidi noted, a lot of the company’s customers start with a pretty standard use case and then quickly expand into other areas as well.

Brown and Hamidi met during their time at Heroku, where Brown was a director of product management and Hamidi a lead software engineer. But while Heroku made it very easy for developers to publish their web apps, there wasn’t anything comparable in the highly fragmented database space. The team acknowledges that there are a lot of tools that aim to solve these data problems, but few of them focus on the user experience.

Image Credits: Meroxa

“When we talk to customers now, it’s still very much an unsolved problem,” Hamidi said. “It seems kind of insane to me that this is such a common thing and there is no ‘oh, of course you use this tool because it addresses all my problems.’ And so the angle that we’re taking is that we see user experience not as a nice-to-have, it’s really an enabler, it is something that enables a software engineer or someone who isn’t a data engineer with 10 years of experience in wrangling Kafka and Postgres and all these things. […] That’s a transformative kind of change.”

It’s worth noting that Meroxa uses a lot of open-source tools but the company has also committed to open-sourcing everything in its data plane as well. “This has multiple wins for us, but one of the biggest incentives is in terms of the customer, we’re really committed to having our agenda aligned. Because if we don’t do well, we don’t serve the customer. If we do a crappy job, they can just keep all of those components and run it themselves,” Hamidi explained.

Today, Meroxa, which the team founded in early 2020, has more than 24 employees (and is 100% remote). “I really think we’re building one of the most talented and most inclusive teams possible,” Brown told me. “Inclusion and diversity are very, very high on our radar. Our team is 50% black and brown. Over 40% are women. Our management team is 90% underrepresented. So not only are we building a great product, we’re building a great company, we’re building a great business.”  


Microsoft Azure expands its NoSQL portfolio with Managed Instances for Apache Cassandra

At its Ignite conference today, Microsoft announced the launch of Azure Managed Instance for Apache Cassandra, its latest NoSQL database offering and a competitor to Cassandra-centric companies like Datastax. Microsoft describes the new service as a ‘semi-managed offering that will help companies bring more of their Cassandra-based workloads into its cloud.

“Customers can easily take on-prem Cassandra workloads and add limitless cloud scale while maintaining full compatibility with the latest version of Apache Cassandra,” Microsoft explains in its press materials. “Their deployments gain improved performance and availability, while benefiting from Azure’s security and compliance capabilities.”

Like its counterpart, Azure SQL Manages Instance, the idea here is to give users access to a scalable, cloud-based database service. To use Cassandra in Azure before, businesses had to either move to Cosmos DB, its highly scalable database service which supports the Cassandra, MongoDB, SQL and Gremlin APIs, or manage their own fleet of virtual machines or on-premises infrastructure.

Cassandra was originally developed at Facebook and then open-sourced in 2008. A year later, it joined the Apache Foundation and today it’s used widely across the industry, with companies like Apple and Netflix betting on it for some of their core services, for example. AWS launched a managed Cassandra-compatible service at its re:Invent conference in 2019 (it’s called Amazon Keyspaces today), Microsoft launched the Cassandra API for Cosmos DB in September 2018. With today’s announcement, though, the company can now offer a full range of Cassandra-based servicer for enterprises that want to move these workloads to its cloud.

Early Stage is the premiere ‘how-to’ event for startup entrepreneurs and investors. You’ll hear firsthand how some of the most successful founders and VCs build their businesses, raise money and manage their portfolios. We’ll cover every aspect of company-building: Fundraising, recruiting, sales, legal, PR, marketing and brand building. Each session also has audience participation built-in — there’s ample time included in each for audience questions and discussion.


Datastax acquires Kesque as it gets into data streaming

Datastax, the company best known for commercializing the open-source Apache Cassandra database, is moving beyond databases. As the company announced today, it has acquired Kesque, a cloud messaging service.

The Kesque team built its service on top of the Apache Pulsar messaging and streaming project. Datastax has now taken that team’s knowledge in this area and, combined with its own expertise, is launching its own Pulsar-based streaming platform by the name of Datastax Luna Streaming, which is now generally available.

This move comes right as Datastax is also now, for the first time, announcing that it is cash-flow positive and profitable, as the company’s chief product officer, Ed Anuff, told me. “We are at over $150 million in [annual recurring revenue]. We are cash-flow positive and we are profitable,” he told me. This marks the first time the company is publically announcing this data. In addition, the company also today revealed that about 20 percent of its annual contract value is now for DataStax Astra, its managed multi-cloud Cassandra service and that the number of self-service Asta subscribers has more than doubled from Q3 to Q4.

The launch of Luna Streaming now gives the 10-year-old company a new area to expand into — and one that has some obvious adjacencies with its existing product portfolio.

“We looked at how a lot of developers are building on top of Cassandra,” Anuff, who joined Datastax after leaving Google Cloud last year, said. “What they’re doing is, they’re addressing what people call ‘data-in-motion’ use cases. They have huge amounts of data that are coming in, huge amounts of data that are going out — and they’re typically looking at doing something with streaming in conjunction with that. As we’ve gone in and asked, “What’s next for Datastax?,’ streaming is going to be a big part of that.”

Given Datastax’s open-source roots, it’s no surprise the team decided to build its service on another open-source project and acquire an open-source company to help it do so. Anuff noted that while there has been a lot of hype around streaming and Apache Kafka, a cloud-native solution like Pulsar seemed like the better solution for the company. Pulsar was originally developed at Yahoo! (which, full disclosure, belongs to the same Verizon Media Group family as TechCrunch) and even before acquiring Kesque, Datastax already used Pulsar to build its Astra platform. Other Pulsar users include Yahoo, Tencent, Nutanix and Splunk.

“What we saw was that when you go and look at doing streaming in a scale-out way, that Kafka isn’t the only approach. We looked at it, and we liked the Pulsar architecture, we like what’s going on, we like the community — and remember, we’re a company that grew up in the Apache open-source community — we said, ‘okay, we think that it’s got all the right underpinnings, let’s go and get involved in that,” Anuff said. And in the process of doing so, the team came across Kesque founder Chris Bartholomew and eventually decided to acquire his company.

The new Luna Streaming offering will be what Datastax calls a “subscription to success with Apache Pulsar.’ It will include a free, production-ready distribution of Pulsar and an optional, SLA-backed subscription tier with enterprise support.

Unsurprisingly, Datastax also plans to remain active in the Pulsar community. The team is already making code contributions, but Anuff also stressed that Datastax is helping out with scalability testing. “This is one of the things that we learned in our participation in the Apache Cassandra project,” Anuff said. “A lot of what these projects need is folks coming in doing testing, helping with deployments, supporting users. Our goal is to be a great participant in the community.”


VESoft raises $8M to meet China’s growing need for graph databases

Sherman Ye founded VESoft in 2018 when he saw a growing demand for graph databases in China. Its predecessors, like Neo4j and TigerGraph, had already been growing aggressively in the West for a few years, while China was just getting to know the technology that leverages graph structures to store data sets and depict their relationships, such as those used for social media analysis, e-commerce recommendations and financial risk management.

VESoft is ready for further growth after closing an $8 million funding round led by Redpoint China Ventures, an investment firm launched by Silicon Valley-based Redpoint Ventures in 2005. Existing investor Matrix Partners China also participated in the Series pre-A round. The new capital will allow the startup to develop products and expand to markets in North America, Europe and other parts of Asia.

The 30-people team is comprised of former employees from Alibaba, Facebook, Huawei and IBM. It’s based in Hangzhou, a scenic city known for its rich history and housing Alibaba and its financial affiliate Ant Financial, where Ye previously worked as a senior engineer after his four-year stint with Facebook in California. From 2017 to 2018, the entrepreneur noticed that Ant Financial’s customers were increasingly interested in adopting graph databases as an alternative to relational databases, a model that had been popular since the 80s and normally organizes data into tables.

“While relational databases are capable of achieving many functions carried out by graph databases… they deteriorate in performance as the quantity of data grows,” Ye told TechCrunch during an interview. “We didn’t use to have so much data.”

Information explosion is one reason why Chinese companies are turning to graph databases, which can handle millions of transactions to discover patterns within scattered data. The technology’s rise is also a response to new forms of online businesses that depend more on relationships.

“Take recommendations for example. The old model recommends content based purely on user profiles, but the problem of relying on personal browsing history is it fails to recommend new things. That was fine for a long time as the Chinese [internet] market was big enough to accommodate many players. But as the industry becomes saturated and crowded… companies need to ponder how to retain existing users, lengthen their time spent, and win users from rivals.”

The key lies in serving people content and products they find appealing. Graph databases come in handy, suggested Ye, when services try to predict users’ interest or behavior as the model uncovers what their friends or people within their social circles like. “That’s a lot more effective than feeding them what’s trending.”

Neo4j compares relational and graph databases (Link)

The company has made its software open source, which the founder believed can help cultivate a community of graph database users and educate the market in China. It will also allow VESoft to reach more engineers in the English-speaking world who are well-acquainted with the open-source culture.

“There is no such thing as being ‘international’ or ‘domestic’ for a technology-driven company. There are no boundaries between countries in the open-source world,” reckoned Ye.

When it comes to generating income, the startup plans to launch a paid version for enterprises, which will come with customized plug-ins and host services.

The Nebula Graph, the brand of VESoft’s database product, is now serving 20 enterprise clients from areas across social media, e-commerce and finance, including big names like food delivery giant Meituan, popular social commerce app Xiaohongshu and e-commerce leader JD.com. A number of overseas companies are also trialing Nebula.

The time is ripe for enterprise-facing startups with a technological moat in China as the market for consumers has been divided by incumbents like Tencent and Alibaba. This makes fundraising relatively easy for VESoft. The founder is confident that Chinese companies are rapidly catching up with their Western counterparts in the space, for the gargantuan amount of data and the myriad of ways data is used in the country “will propel the technology forward.”


Microsoft partners with Redis Labs to improve its Azure Cache for Redis

For a few years now, Microsoft has offered Azure Cache for Redis, a fully managed caching solution built on top of the open-source Redis project. Today, it is expanding this service by adding Redis Enterprise, Redis Lab’s commercial offering, to its platform. It’s doing so in partnership with Redis Labs and while Microsoft will offer some basic support for the service, Redis Labs will handle most of the software support itself.

Julia Liuson, Microsoft’s corporate VP of its developer tools division, told me that the company wants to be seen as a partner to open-source companies like Redis Labs, which was among the first companies to change its license to prevent cloud vendors from commercializing and repackaging their free code without contributing back to the community. Last year, Redis Labs partnered with Google Cloud to bring its own fully managed service to its platform and so maybe it’s no surprise that we are now seeing Microsoft make a similar move.

Liuson tells me that with this new tier for Azure Cache for Redis, users will get a single bill and native Azure management, as well as the option to deploy natively on SSD flash storage. The native Azure integration should also make it easier for developers on Azure to integrate Redis Enterprise into their applications.

It’s also worth noting that Microsoft will support Redis Labs’ own Redis modules, including RediSearch, a Redis-powered search engine, as well as RedisBloom and RedisTimeSeries, which provide support for new datatypes in Redis.

“For years, developers have utilized the speed and throughput of Redis to produce unbeatable responsiveness and scale in their applications,” says Liuson. “We’ve seen tremendous adoption of Azure Cache for Redis, our managed solution built on open source Redis, as Azure customers have leveraged Redis performance as a distributed cache, session store, and message broker. The incorporation of the Redis Labs Redis Enterprise technology extends the range of use cases in which developers can utilize Redis, while providing enhanced operational resiliency and security.”


Datastax acquires The Last Pickle

Data management company Datastax, one of the largest contributors to the Apache Cassandra project, today announced that it has acquired The Last Pickle (and no, I don’t know what’s up with that name either), a New Zealand-based Cassandra consulting and services firm that’s behind a number of popular open-source tools for the distributed NoSQL database.

As Datastax Chief Strategy Officer Sam Ramji, who you may remember from his recent tenure at Apigee, the Cloud Foundry Foundation, Google and Autodesk, told me, The Last Pickle is one of the premier Apache Cassandra consulting and services companies. The team there has been building Cassandra-based open source solutions for the likes of Spotify, T Mobile and AT&T since it was founded back in 2012. And while The Last Pickle is based in New Zealand, the company has engineers all over the world that do the heavy lifting and help these companies successfully implement the Cassandra database technology.

It’s worth mentioning that Last Pickle CEO Aaron Morton first discovered Cassandra when he worked for WETA Digital on the special effects for Avatar, where the team used Cassandra to allow the VFX artists to store their data.

“There’s two parts to what they do,” Ramji explained. “One is the very visible consulting, which has led them to become world experts in the operation of Cassandra. So as we automate Cassandra and as we improve the operability of the project with enterprises, their embodied wisdom about how to operate and scale Apache Cassandra is as good as it gets — the best in the world.” And The Last Pickle’s experience in building systems with tens of thousands of nodes — and the challenges that its customers face — is something Datastax can then offer to its customers as well.

And Datastax, of course, also plans to productize The Last Pickle’s open-source tools like the automated repair tool Reaper and the Medusa backup and restore system.

As both Ramji and Datastax VP of Engineering Josh McKenzie stressed, Cassandra has seen a lot of commercial development in recent years, with the likes of AWS now offering a managed Cassandra service, for example, but there wasn’t all that much hype around the project anymore. But they argue that’s a good thing. Now that it is over ten years old, Cassandra has been battle-hardened. For the last ten years, Ramji argues, the industry tried to figure out what the de factor standard for scale-out computing should be. By 2019, it became clear that Kubernetes was the answer to that.

“This next decade is about what is the de facto standard for scale-out data? We think that’s got certain affordances, certain structural needs and we think that the decades that Cassandra has spent getting harden puts it in a position to be data for that wave.”

McKenzie also noted that Cassandra provides users with a number of built-in features like support for mutiple data centers and geo-replication, rolling updates and live scaling, as well as wide support across programming languages, give it a number of advantages over competing databases.

“It’s easy to forget how much Cassandra gives you for free just based on its architecture,” he said. “Losing the power in an entire datacenter, upgrading the version of the database, hardware failing every day? No problem. The cluster is 100 percent always still up and available. The tooling and expertise of The Last Pickle really help bring all this distributed and resilient power into the hands of the masses.”

The two companies did not disclose the price of the acquisition.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com