May
02
2019
--

Couchbase’s mobile database gets built-in ML and enhanced synchronization features

Couchbase, the company behind the eponymous NoSQL database, announced a major update to its mobile database today that brings some machine learning smarts, as well as improved synchronization features and enhanced stats and logging support, to the software.

“We’ve led the innovation and data management at the edge since the release of our mobile database five years ago,” Couchbase’s VP of Engineering Wayne Carter told me. “And we’re excited that others are doing that now. We feel that it’s very, very important for businesses to be able to utilize these emerging technologies that do sit on the edge to drive their businesses forward, and both making their employees more effective and their customer experience better.”

The latter part is what drove a lot of today’s updates, Carter noted. He also believes that the database is the right place to do some machine learning. So with this release, the company is adding predictive queries to its mobile database. This new API allows mobile apps to take pre-trained machine learning models and run predictive queries against the data that is stored locally. This would allow a retailer to create a tool that can use a phone’s camera to figure out what part a customer is looking for.

To support these predictive queries, Couchbase mobile is also getting support for predictive indexes. “Predictive indexes allow you to create an index on prediction, enabling correlation of real-time predictions with application data in milliseconds,” Carter said. In many ways, that’s also the unique value proposition for bringing machine learning into the database. “What you really need to do is you need to utilize the unique values of a database to be able to deliver the answer to those real-time questions within milliseconds,” explained Carter.

The other major new feature in this release is delta synchronization, which allows businesses to push far smaller updates to the databases on their employees’ mobile devices. That’s because they only have to receive the information that changed instead of a full updated database. Carter says this was a highly requested feature, but until now, the company always had to prioritize work on other components of Couchbase.

This is an especially useful feature for the company’s retail customers, a vertical where it has been quite successful. These users need to keep their catalogs up to data and quite a few of them supply their employees with mobile devices to help shoppers. Rumor has it that Apple, too, is a Couchbase user.

The update also includes a few new features that will be more of interest to operators, including advanced stats reporting and enhanced logging support.

Apr
09
2019
--

Google Cloud challenges AWS with new open-source integrations

Google today announced that it has partnered with a number of top open-source data management and analytics companies to integrate their products into its Google Cloud Platform and offer them as managed services operated by its partners. The partners here are Confluent, DataStax, Elastic, InfluxData, MongoDB, Neo4j and Redis Labs.

The idea here, Google says, is to provide users with a seamless user experience and the ability to easily leverage these open-source technologies in Google’s cloud. But there is a lot more at play here, even though Google never quite says so. That’s because Google’s move here is clearly meant to contrast its approach to open-source ecosystems with Amazon’s. It’s no secret that Amazon’s AWS cloud computing platform has a reputation for taking some of the best open-source projects and then forking those and packaging them up under its own brand, often without giving back to the original project. There are some signs that this is changing, but a number of companies have recently taken action and changed their open-source licenses to explicitly prevent this from happening.

That’s where things get interesting, because those companies include Confluent, Elastic, MongoDB, Neo4j and Redis Labs — and those are all partnering with Google on this new project, though it’s worth noting that InfluxData is not taking this new licensing approach and that while DataStax uses lots of open-source technologies, its focus is very much on its enterprise edition.

“As you are aware, there has been a lot of debate in the industry about the best way of delivering these open-source technologies as services in the cloud,” Manvinder Singh, the head of infrastructure partnerships at Google Cloud, said in a press briefing. “Given Google’s DNA and the belief that we have in the open-source model, which is demonstrated by projects like Kubernetes, TensorFlow, Go and so forth, we believe the right way to solve this it to work closely together with companies that have invested their resources in developing these open-source technologies.”

So while AWS takes these projects and then makes them its own, Google has decided to partner with these companies. While Google and its partners declined to comment on the financial arrangements behind these deals, chances are we’re talking about some degree of profit-sharing here.

“Each of the major cloud players is trying to differentiate what it brings to the table for customers, and while we have a strong partnership with Microsoft and Amazon, it’s nice to see that Google has chosen to deepen its partnership with Atlas instead of launching an imitation service,” Sahir Azam, the senior VP of Cloud Products at MongoDB told me. “MongoDB and GCP have been working closely together for years, dating back to the development of Atlas on GCP in early 2017. Over the past two years running Atlas on GCP, our joint teams have developed a strong working relationship and support model for supporting our customers’ mission critical applications.”

As for the actual functionality, the core principle here is that Google will deeply integrate these services into its Cloud Console; for example, similar to what Microsoft did with Databricks on Azure. These will be managed services and Google Cloud will handle the invoicing and the billings will count toward a user’s Google Cloud spending commitments. Support will also run through Google, so users can use a single service to manage and log tickets across all of these services.

Redis Labs CEO and co-founder Ofer Bengal echoed this. “Through this partnership, Redis Labs and Google Cloud are bringing these innovations to enterprise customers, while giving them the choice of where to run their workloads in the cloud, he said. “Customers now have the flexibility to develop applications with Redis Enterprise using the fully integrated managed services on GCP. This will include the ability to manage Redis Enterprise from the GCP console, provisioning, billing, support, and other deep integrations with GCP.”

Jan
31
2019
--

Google’s Cloud Firestore NoSQL database hits general availability

Google today announced that Cloud Firestore, its serverless NoSQL document database for mobile, web and IoT apps, is now generally available. In addition, Google is also introducing a few new features and bringing the service to 10 new regions.

With this launch, Google is giving developers the option to run their databases in a single region. During the beta, developers had to use multi-region instances, and, while that obviously has some advantages with regard to resilience, it’s also more expensive and not every app needs to run in multiple regions.

“Some people don’t need the added reliability and durability of a multi-region application,” Google product manager Dan McGrath told me. “So for them, having a more cost-effective regional instance is very attractive, as well as data locality and being able to place a Cloud Firestore database as close as possible to their user base.”

The new regional instance pricing is up to 50 percent cheaper than the current multi-cloud instance prices. Which solution you pick does influence the SLA guarantee Google gives you, though. While the regional instances are still replicated within multiple zones inside the region, all of the data is still within a limited geographic area. Hence, Google promises 99.999 percent availability for multi-region instances and 99.99 percent availability for regional instances.

And talking about regions, Cloud Firestore is now available in 10 new regions around the world. Firestore launched with a single location when it launched and added two more during the beta. With this, Firestore is now available in 13 locations (including the North America and Europe multi-region offerings). McGrath tells me Google is still in the planning stage for deciding the next phase of locations, but he stressed that the current set provides pretty good coverage across the globe.

Also new in this release is deeper integration with Stackdriver, the Google Cloud monitoring service, which can now monitor read, write and delete operations in near-real time. McGrath also noted that Google plans to add the ability to query documents across collections and increment database values without needing a transaction.

It’s worth noting that while Cloud Firestore falls under the Google Firebase brand, which typically focuses on mobile developers, Firestore offers all of the usual client-side libraries for Compute Engine or Kubernetes Engine applications, too.

“If you’re looking for a more traditional NoSQL document database, then Cloud Firestore gives you a great solution that has all the benefits of not needing to manage the database at all,” McGrath said. “And then, through the Firebase SDK, you can use it as a more comprehensive back-end as a service that takes care of things like authentication for you.”

One of the advantages of Firestore is that it has extensive offline support, which makes it ideal for mobile developers but also IoT solutions. Maybe it’s no surprise, then, that Google is positioning it as a tool for both Google Cloud and Firebase users.

Jan
24
2019
--

Microsoft acquires Citus Data

Microsoft today announced that it has acquired Citus Data, a company that focused on making PostgreSQL databases faster and more scalable. Citus’ open-source PostgreSQL extension essentially turns the application into a distributed database and, while there has been a lot of hype around the NoSQL movement and document stores, relational databases — and especially PostgreSQL — are still a growing market, in part because of tools from companies like Citus that overcome some of their earlier limitations.

Unsurprisingly, Microsoft plans to work with the Citus Data team to “accelerate the delivery of key, enterprise-ready features from Azure to PostgreSQL and enable critical PostgreSQL workloads to run on Azure with confidence.” The Citus co-founders echo this in their own statement, noting that “as part of Microsoft, we will stay focused on building an amazing database on top of PostgreSQL that gives our users the game-changing scale, performance, and resilience they need. We will continue to drive innovation in this space.”

PostgreSQL is obviously an open-source tool, and while the fact that Microsoft is now a major open-source contributor doesn’t come as a surprise anymore, it’s worth noting that the company stresses that it will continue to work with the PostgreSQL community. In an email, a Microsoft spokesperson also noted that “the acquisition is a proof point in the company’s commitment to open source and accelerating Azure PostgreSQL performance and scale.”

Current Citus customers include the likes of real-time analytics service Chartbeat, email security service Agari and PushOwl, though the company notes that it also counts a number of Fortune 100 companies among its users (they tend to stay anonymous). The company offers both a database as a service, an on-premises enterprise version and the free open-source edition. For the time being, it seems like that’s not changing, though over time I would suspect that Microsoft will transition users of the hosted service to Azure.

The price of the acquisition was not disclosed. Citus Data, which was founded in 2010 and graduated from the Y Combinator program, previously raised more than $13 million from the likes of Khosla Ventures, SV Angel and Data Collective.

Sep
24
2018
--

Microsoft updates its planet-scale Cosmos DB database service

Cosmos DB is undoubtedly one of the most interesting products in Microsoft’s Azure portfolio. It’s a fully managed, globally distributed multi-model database that offers throughput guarantees, a number of different consistency models and high read and write availability guarantees. Now that’s a mouthful, but basically, it means that developers can build a truly global product, write database updates to Cosmos DB and rest assured that every other user across the world will see those updates within 20 milliseconds or so. And to write their applications, they can pretend that Cosmos DB is a SQL- or MongoDB-compatible database, for example.

CosmosDB officially launched in May 2017, though in many ways it’s an evolution of Microsoft’s existing Document DB product, which was far less flexible. Today, a lot of Microsoft’s own products run on CosmosDB, including the Azure Portal itself, as well as Skype, Office 365 and Xbox.

Today, Microsoft is extending Cosmos DB with the launch of its multi-master replication feature into general availability, as well as support for the Cassandra API, giving developers yet another option to bring existing products to CosmosDB, which in this case are those written for Cassandra.

Microsoft now also promises 99.999 percent read and write availability. Previously, it’s read availability promise was 99.99 percent. And while that may not seem like a big difference, it does show that after more of a year of operating Cosmos DB with customers, Microsoft now feels more confident that it’s a highly stable system. In addition, Microsoft is also updating its write latency SLA and now promises less than 10 milliseconds at the 99th percentile.

“If you have write-heavy workloads, spanning multiple geos, and you need this near real-time ingest of your data, this becomes extremely attractive for IoT, web, mobile gaming scenarios,” Microsoft CosmosDB architect and product manager Rimma Nehme told me. She also stressed that she believes Microsoft’s SLA definitions are far more stringent than those of its competitors.

The highlight of the update, though, is multi-master replication. “We believe that we’re really the first operational database out there in the marketplace that runs on such a scale and will enable globally scalable multi-master available to the customers,” Nehme said. “The underlying protocols were designed to be multi-master from the very beginning.”

Why is this such a big deal? With this, developers can designate every region they run Cosmos DB in as a master in its own right, making for a far more scalable system in terms of being able to write updates to the database. There’s no need to first write to a single master node, which may be far away, and then have that node push the update to every other region. Instead, applications can write to the nearest region, and Cosmos DB handles everything from there. If there are conflicts, the user can decide how those should be resolved based on their own needs.

Nehme noted that all of this still plays well with CosmosDB’s existing set of consistency models. If you don’t spend your days thinking about database consistency models, then this may sound arcane, but there’s a whole area of computer science that focuses on little else but how to best handle a scenario where two users virtually simultaneously try to change the same cell in a distributed database.

Unlike other databases, Cosmos DB allows for a variety of consistency models, ranging from strong to eventual, with three intermediary models. And it actually turns out that most CosmosDB users opt for one of those intermediary models.

Interestingly, when I talked to Leslie Lamport, the Turing award winner who developed some of the fundamental concepts behind these consistency models (and the popular LaTeX document preparation system), he wasn’t all that sure that the developers are making the right choice. “I don’t know whether they really understand the consequences or whether their customers are going to be in for some surprises,” he told me. “If they’re smart, they are getting just the amount of consistency that they need. If they’re not smart, it means they’re trying to gain some efficiency and their users might not be happy about that.” He noted that when you give up strong consistency, it’s often hard to understand what exactly is happening.

But strong consistency comes with its drawbacks, too, which leads to higher latency. “For strong consistency there are a certain number of roundtrip message delays that you can’t avoid,” Lamport noted.

The CosmosDB team isn’t just building on some of the fundamental work Lamport did around databases, but it’s also making extensive use of TLA+, the formal specification language Lamport developed in the late 90s. Microsoft, as well as Amazon and others, are now training their engineers to use TLA+ to describe their algorithms mathematically before they implement them in whatever language they prefer.

“Because [CosmosDB is] a massively complicated system, there is no way to ensure the correctness of it because we are humans, and trying to hold all of these failure conditions and the complexity in any one person’s — one engineer’s — head, is impossible,” Microsoft Technical Follow Dharma Shukla noted. “TLA+ is huge in terms of getting the design done correctly, specified and validated using the TLA+ tools even before a single line of code is written. You cover all of those hundreds of thousands of edge cases that can potentially lead to data loss or availability loss, or race conditions that you had never thought about, but that two or three years ago after you have deployed the code can lead to some data corruption for customers. That would be disastrous.”

“Programming languages have a very precise goal, which is to be able to write code. And the thing that I’ve been saying over and over again is that programming is more than just coding,” Lamport added. “It’s not just coding, that’s the easy part of programming. The hard part of programming is getting the algorithms right.”

Lamport also noted that he deliberately chose to make TLA+ look like mathematics, not like another programming languages. “It really forces people to think above the code level,” Lamport noted and added that engineers often tell him that it changes the way they think.

As for those companies that don’t use TLA+ or a similar methodology, Lamport says he’s worried. “I’m really comforted that [Microsoft] is using TLA+ because I don’t see how anyone could do it without using that kind of mathematical thinking — and I worry about what the other systems that we wind up using built by other organizations — I worry about how reliable they are.”

more Microsoft Ignite 2018 coverage

Apr
21
2018
--

Timescale is leading the next wave of NYC database tech

Data is the lifeblood of the modern corporation, yet acquiring, storing, processing, and analyzing it remains a remarkably challenging and expensive project. Every time data infrastructure finally catches up with the streams of information pouring in, another source and more demanding decision-making makes the existing technology obsolete.

Few cities rely on data the same way as New York City, nor has any other city so shaped the technology that underpins our data infrastructure. Back in the 1960s, banks and accounting firms helped to drive much of the original computation industry with their massive finance applications. Today, that industry has been supplanted by finance and advertising, both of which need to make microsecond decisions based on petabyte datasets and complex statistical models.

Unsurprisingly, the city’s hunger for data has led to waves of database companies finding their home in the city.

As web applications became increasingly popular in the mid-aughts, SQL databases came under increasing strain to scale, while also proving to be inflexible in terms of their data schemas for the fast-moving startups they served. That problem spawned Manhattan-based MongoDB, whose flexible “NoSQL” schemas and horizontal scaling capabilities made it the default choice for a generation of startups. The company would go on to raise $311 million according to Crunchbase, and debuted late last year on NASDAQ, trading today with a market cap of $2 billion.

At the same time that the NoSQL movement was hitting its stride, academic researchers and entrepreneurs were exploring how to evolve SQL to scale like its NoSQL competitors, while retaining the kinds of features (joining tables, transactions) that make SQL so convenient for developers.

One leading company in this next generation of database tech is New York-based Cockroach Labs, which was founded in 2015 by a trio of former Square, Viewfinder, and Google engineers. The company has gone on to raise more than $50 million according to Crunchbase from a luminary list of investors including Peter Fenton at Benchmark, Mike Volpi at Index, and Satish Dharmaraj at Redpoint, along with GV and Sequoia.

While web applications have their own peculiar data needs, the rise of the internet of things (IoT) created a whole new set of data challenges. How can streams of data from potentially millions of devices be stored in an easily analyzable manner? How could companies build real-time systems to respond to that data?

Mike Freedman and Ajay Kulkarni saw that problem increasingly manifesting itself in 2015. The two had been roommates at MIT in the late 90s, and then went on separate paths into academia and industry respectively. Freedman went to Stanford for a PhD in computer science, and nearly joined the spinout of Nicira, which sold to VMware in 2012 for $1.26 billion. Kulkarni joked that “Mike made the financially wise decision of not joining them,” and Freedman eventually went to Princeton as an assistant professor, and was awarded tenure in 2013. Kulkarni founded and worked at a variety of startups including GroupMe, as well as receiving an MBA from MIT.

The two had startup dreams, and tried building an IoT platform. As they started building it though, they realized they would need a real-time database to process the data streams coming in from devices. “There are a lot of time series databases, [so] let’s grab one off the shelf, and then we evaluated a few,” Kulkarni explained. They realized what they needed was a hybrid of SQL and NoSQL, and nothing they could find offered the feature set they required to power their platform. That challenge became the problem to be solved, and Timescale was born.

In many ways, Timescale is how you build a database in 2018. Rather than starting de novo, the team decided to build on top of Postgres, a popular open-source SQL database. “By building on top of Postgres, we became the more reliable option,” Kulkarni said of their thinking. In addition, the company opted to make the database fully open source. “In this day and age, in order to get wide adoption, you have to be an open source database company,” he said.

Since the project’s first public git commit on October 18, 2016, the company’s database has received nearly 4,500 stars on Github, and it has raised $16.1 million from Benchmark and NEA .

Far more important though are their customers, who are definitely not the typical tech startup roster and include companies from oil and gas, mining, and telecommunications. “You don’t think of them as early adopters, but they have a need, and because we built it on top of Postgres, it integrates into an ecosystem that they know,” Freedman explained. Kulkarni continued, “And the problem they have is that they have all of this time series data, and it isn’t sitting in the corner, it is integrated with their core service.”

New York has been a strong home for the two founders. Freedman continues to be a professor at Princeton, where he has built a pipeline of potential grads for the company. More widely, Kulkarni said, “Some of the most experienced people in databases are in the financial industry, and that’s here.” That’s evident in one of their investors, hedge fund Two Sigma. “Two Sigma had been the only venture firm that we talked to that already had built out their own time series database,” Kulkarni noted.

The two also benefit from paying customers. “I think the Bay Area is great for open source adoption, but a lot of Bay Area companies, they develop their own database tech, or they use an open source project and never pay for it,” Kulkarni said. Being in New York has meant closer collaboration with customers, and ultimately more revenues.

Open source plus revenues. It’s the database way, and the next wave of innovation in the NYC enterprise infrastructure ecosystem.

Feb
15
2018
--

MongoDB gets support for multi-document ACID transactions

 MongoDB is finally getting support for multi-document ACID (atomicity, consistency, isolation, durability) transactions. That’s something the MongoDB community has been asking for for years and MongoDB Inc, the company behind the project, is now about to make this a reality. As the company will announce at an event later today, support for ACID transactions will launch when it ships… Read More

Feb
06
2018
--

Rubrik acquires Datos IO to expand into NoSQL database management support

 Rubrik, the enterprise startup that provides data backup and recovery services across cloud and on-premise environments, is putting some of the funding that it raised last year at a $1.3 billion valuation to use. Rubrik has acquired NoSQL data backup specialist Datos IO, the company announced today, in what appears to be Rubrik’s first acquisition. The financial terms of the deal are… Read More

Sep
07
2017
--

Percona Live Europe Featured Talks: NoSQL Best Practices for PostgreSQL with Dmitry Dolgov

Colin Charles

Percona Live EuropeWelcome to another post our series of interview blogs for the upcoming Percona Live Europe 2017 in Dublin. This series highlights a number of talks that will be at the conference and gives a short preview of what attendees can expect to learn from the presenter.

This blog post is with Dmitry Dolgov, a software engineer at Zalando SE. His talk is titled NoSQL Best Practices for PostgreSQL. One of PostgreSQL’s most attractive features is the Jsonb data type. It allows efficient work with semi-structured data without sacrificing strong consistency and ability to use all the power of proven relational technology.  In our conversation, we discussed how to use this NoSQL feature in PostgreSQL:

Percona: How did you get into databases? What do you love about it?  

Dmitry: I grew extremely interested in databases not so long ago, mostly due to the influence of Oleg Bartunov, who is a longtime contributor to PostgreSQL. Initially, I just implemented one patch for the Jsonb data type that was eventually included in the core. After that I couldn’t stop. So I still try to help the PostgreSQL community as much as I can.

What I love is just that: PostgreSQL has an awesome community. And I mean it, there are a lot of people that are excited about databases and possess valuable expertise in this area. My most vivid memory so far about the community was someone asking a question in the hackers mailing list that got answered within minutes – even before I started to type my own reply.

Percona: How can NoSQL Jsonb data type get used effectively with PostgreSQL?

Dmitry: The trick is that you don’t need to do anything supernatural for that. Jsonb is already effective enough right out of the box. But as always there are some limitations, implementation details and tricks (which I’ll show in my talk).

Percona: What do you want attendees to take away from your session? Why should they attend?

Dmitry: The biggest idea behind this talk is to show that we live in interesting times. It’s not that easy to stick with only one data model/data storage. And to mitigate this issue, most modern databases are trying to provide more that one approach. We have to evaluate them each carefully.

Or you can attend if you expect a holy war of PostgreSQL vs. MongoDB vs. MySQL vs. whatever else. But you won’t see anything like that, because we’re all grown up people. ?

Percona: What are you most looking forward to at Percona Live Europe 2017?

Dmitry: I look forward to meeting a lot of interesting people to collaborate with, and to share my own experiences.

Want to find out more about Dmitry and PostgreSQL and the Jsonb data type? Register for Percona Live Europe 2017, and see his talk NoSQL Best Practices for PostgreSQL. Register now to get the best price! Use discount code SeeMeSpeakPLE17 to get 10% off your registration.

Percona Live Open Source Database Conference Europe 2017 in Dublin is the premier European open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, MariaDB, MongoDB, time series database, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Open Source Database Conference Europe will be September 25-27, 2017 at the Radisson Blu Royal Hotel, Dublin.

Mar
08
2017
--

Migrating MongoDB Away from MMAPv1

MMAPv1

MMAPv1This is another post in the series of blogs on the Percona Server for MongoDB 3.4 bundle release. In this blog post, we’ll discuss moving away from the MMAPv1 storage engine.

Introduction

WIth the MongoDB v3.0 release in February of 2015, the long-awaited ability to choose storage engines became a reality. As of version 3.0, you could choose two engines in MongoDB Community Server and, if you use Percona Server for MongoDB, you could choose from four. Here’s a table for ease of consumption:

Here’s a table for easy consumption:

Storage Engine Percona Server for MongoDB MongoDB Community Server MongoDB Enterprise Server (licensed)
MMAPv1

?

?

?

WiredTiger

?

?

?

MongoRocks

?

In-memory

?

?

Encrypted

?

 

Why change engines?

With increased possibilities comes an increase in the decision-making process difficult (a concept that gets reinforced every time I take my mother out a restaurant with a large menu – ordering is never quick). In all seriousness, the introduction of the storage engine API to MongoDB is possibly the single greatest feature MongoDB, Inc has released to-date.

One of the biggest gripes from the pre-v3.0 days was MongoDB’s lack of scale. This was mostly due to the MMAPv1 storage engine, which suffered from a very primitive locking scheme. If you would like a illustration of the problem, think of the world’s biggest supermarket with one checkout line – you might be able to fit in lots of shoppers, but they’re not going to accomplish their goal quickly. So, the ability to increase performance and concurrency with a simple switch is huge! Additionally, modern storage engines support compression. This should reduce your space utilization when switching by at least 50%.

All the way up to MongoDB v3.2, the default storage engine was MMAPv1. If you didn’t make a conscious decision about what storage engine to choose when you started using MongoDB, there is a good chance that MMAPv1 is what you’re on. If you’d like to find out for sure what engine you’re using, simply run the command below. The output will be the name of the storage engine. As you can see, I was running the MMAPv1 storage engine on this machine. Now that we understand where we’re at, let’s get into where we can be in the future.

db.serverStatus().storageEngine.name
mmapv1

Public Service Announcement

Before we get into what storage engine(s) to evaluate, we need to talk about testing. In my experience, a majority of the MySQL and MongoDB community members are rolling out changes to production without planning or testing. If you’re in the same boat, you’re in very good company (or at least in a great deal of company). However, you should stop this practice. It’s basic “sample size” in statistics – when engaged in risk-laden behavior, the optimal time to stop increasing the sample size is prior to the probability of failure reaching “1”. In other words, start your testing and planning process today!

At Percona, we recommend that you thoroughly test any database changes in a testing or development environment before you decide to roll them into production. Additionally, prior to rolling the changes into production (with a well thought out plan, of course), you’ll need to have a roll-back plan in case of unintended consequences. Luckily, with MongoDB’s built-in replication and election protocols, both are fairly easy. The key here is to plan. This is doubly true if you are undertaking a major version upgrade, or are jumping over major versions. With major version upgrades (or version jumps) comes the increased likelihood of a change in database behavior as it relates to your application’s response time (or even stability).

What should I think about?

In the table above, we listed the pre-packaged storage engine options that are available to us and other distributions. We also took a look at why you should consider moving off of MMAPv1 in the preceding section. To be clear, in my opinion a vast majority of MongoDB users that are on MMAPv1 can benefit from a switch. Which engine to switch to is the pressing question. Your first decision should be to evaluate whether or not your workload fits into the sweet spot for MMAPv1 by reading the section below. If that section doesn’t describe your application, then the additional sections should help you narrow down your choices.

Now, let’s take a look at what workloads match up with what storage engines.

MMAPv1

Believe it or not, there are some use cases where MMAPv1 is likely to give you as good (or better) performance as any other engine. If you’re not worried about the size of your database on disk, then you may not want to bother changing engines. Users that are likely to see no benefit from changing have read-heavy (or 100%) read applications. Also, certain update-heavy use cases, where you’re updating small amounts of data or performing $set operations, are likely to be faster on MMAPv1.

WiredTiger

WiredTiger is a the new default storage engine for MongoDB. It is a good option for general workloads that are currently running on MMAPv1. WiredTiger will give you good performance for most workloads and will reduce your storage utilization with compression that’s enabled by default. If you have a write-heavy workload, or are approaching high I/O utilization (>55%) with plans for it to rise, then you might benefit from a migration to WiredTiger.

MongoRocks (RocksDB from Facebook)

This is Facebook’s baby, which was forged in the fires of the former Parse business unit. MongoRocks, which uses LSM indexing, is advertised as “designed to work with fast storage.” Don’t let this claim fool you. For workloads that are heavy on writes, highly concurrent or approaching disk bound, MongoRocks could give you great benefits. In terms of compression, MongoRocks has the ability to efficiently handle deeper compression algorithms, which should further decrease your storage requirements.

In-Memory

The in-memory engine, whether we’re speaking about the MongoDB or Percona implementation, should be used for workloads where extreme low latency is the most important requirement. The types of applications that I’m talking about are usually low-latency, “real-time” apps — like decision making or user session tracking. The in-memory engine is not persistent, so it operates strictly out of the cache allocated to MongoDB. Consequently, the data may (and likely will) be lost if the server crashes.

Encrypted

This is for applications in highly secure environments where on-disk encryption is necessary for compliance. This engine will protect the MongoDB data in the case that a disk or server is stolen. On the flip side, this engine will not protect you from a hacker that has access to the server (MongoDB shell), or can intercept your application traffic. Another way to achieve the same level of encryption for compliance is using volume level encryption like LUKS. An additional benefit of volume level encryption, since it works outside the database, is re-use on all compliant servers (not just MongoDB).

Getting to your new engine

Switching to the new engine is actually pretty easy, especially if you’re running a replica set. One important caveat is that unlike MySQL, the storage engine can only be defined per mongod process (not per database or collection). This means that it’s an all or nothing operation on a single MongoDB process. You’ll need to reload your data on that server. This is necessary because the data files from one engine are not compatible with another engine. Thus reloading the data to transform from one engine format to another is necessary. Here are the high-level steps (assuming you’re running a replica set):

  1. Make sure you’re not in your production environment
  2. Backup your data (it can’t hurt)
  3. Remove a replica set member
  4. Rename (or delete) the old data directory. The member will re-sync with the replica set
    • Make sure you have enough disk space if you’re going to keep a copy of the old data directory
  5. Update the mongo.conf file to use a new storage engine. Here’s an example for RocksDB from our documentation:
    storage:
     engine: rocksdb
     rocksdb:
       cacheSizeGB: 4
       compression: snappy
  6. Start the MongoDB process again
  7. Join the member to the replica set (initial sync will happen)
  8. When the updated member is all caught up, pick another member and repeat the process.
  9. Continue until the primary is the only server left. At this point, you should step down the primary, but hold off switching storage engines until you are certain that the new storage engine meets your needs.

The Wrap Up

At this point I’ve explained how you can understand your options, where you can gain additional performance and what engines to evaluate. Please don’t forget to test your application with the new setup before launching into production. Please drop a comment below if you found this helpful or, on the other hand, if there’s something that would make it more useful to you. Chances are, if you’d find something helpful, the rest of the community will as well.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com