Dec 06, 2017

MongoDB 3.6 Community Is Here!


By now you surely know that MongoDB 3.6 Community became generally available on Dec 5, 2017. Of course, this is great news: it has some big-ticket items that we are all excited about! But I also want to share my general thoughts on this release.

It is always a good idea for your internal teams to study and evaluate new versions. This is crucial for understanding if and when you should start using them. After deciding to adopt a release, there is the further question of whether you want your developers using the new features, or whether those features are not yet mature enough to rely on.

So what is in MongoDB 3.6 Community? Check it out:

  • Sessions
  • Change Streams
  • Retryable Writes
  • Security Improvement
  • Major love for Arrays in Aggregation
  • A better balancer
  • JSON Schema Validation
  • Better Time management
  • Compass is Community
  • Major WiredTiger internal overhaul

As you can see, this is an extensive list. And there are 1,400+ implemented Jira tickets on the server alone (not even counting the WiredTiger project).

To that end, I thought we should break my review into a few areas. We will have blog posts out soon covering these areas in more depth. This blog is more about my thoughts on the topics above.

Expected blogs (we will link to them as they become available):

  • Change Streams –  Nov 11 2017
  • Sessions
  • Transactions and new functions
  • Aggregation improvements
  • Security Controls to use ASAP
  • Other changes from diagnostics to Validation

Today let’s quickly recap the above areas.

Sessions

We will have a blog on this (it has some history). This change has been long awaited by anyone who used MongoDB before 2.4. There were connection changes in that release that made things complicated for load balancers, due to the inability to “re-attach” to the same session. If you were not careful in 2.4+, you could easily put a load balancer in front of MongoDB and see very odd behavior: anything from broken queries to invisibly incomplete getMores (big queries).

Sessions aim to change this. Now the client drivers know about the internal session to the database used for reads and writes. Better yet, MongoDB tracks these sessions, so even if an election occurs, when your driver fails over so does the session. For anyone whose applications handled failovers badly, this is an amazing improvement. Some of the other new features that make 3.6 a solid release depend on this new behavior.
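To make this concrete, here is a minimal sketch of using an explicit session from the 3.6 mongo shell (the "test" database and "orders" collection are placeholder names, not anything tied to the release itself):

    // Start an explicit session from the 3.6 mongo shell.
    var session = db.getMongo().startSession({ causalConsistency: true });
    var sessionDb = session.getDatabase("test");

    // Reads and writes issued through sessionDb are tied to this session,
    // so the shell/driver can re-associate them after a failover.
    sessionDb.orders.insertOne({ sku: "abc", qty: 1 });
    printjson(sessionDb.orders.findOne({ sku: "abc" }));

    session.endSession();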

Does this mean this solution is perfect and works everywhere? No! Like other newer features we have seen, it leaves MMAPv1 out in the cold, since that engine cannot support, without major rework, logic that is native to WiredTiger. Talking with engine engineers, it's clear that some of the logic behind the underlying database snapshots and rollbacks added here can cause issues (which we will talk about more in the transactions blog).

Change streams

This is one of the most talked-about (but most specialized) features, and I can see its appeal for a very specific use case, though it is rather limited. However, I am getting ahead of myself! Let's talk about what it is first and where it came from.

Before this feature, people streamed data out of MySQL and MongoDB into Elastic and Hadoop stacks. I mention MySQL because it was the model for the initial method used with MongoDB. The tools would read the MySQL binlogs – typically saved off somewhere – and then apply those operations to some other technology. When people went to do the same thing with MongoDB, there was a big issue: if writes are not sent to a majority of the nodes, they can be rolled back. In fact, such rollbacks were not uncommon. The default was w:1 (meaning only the primary needed to have the write), which resulted in data existing in Elastic that had been removed from MongoDB. Hopefully, everyone can see the issue here, and why a better solution was needed than just reading the oplog.

Enter $changeStream, which in the shell has a helper called .watch(). This is a method that uses a multi-node consistent read to ensure the data is on a majority of nodes before the command returns it in a tailable cursor. For this use case, this is amazing, as it gives the data-replicating tool much more assurance that the data is not going to vanish.
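A minimal sketch of what this looks like in the 3.6 shell (the "inventory" collection is a placeholder example):

    // Open a change stream on a single collection.
    var watchCursor = db.inventory.watch();

    // Poll the tailable cursor and print each change event as it arrives.
    while (!watchCursor.isExhausted()) {
        if (watchCursor.hasNext()) {
            printjson(watchCursor.next());
        }
    }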

$changeStream is not without limits: if we have 10k collections and we want to watch them all, that is 10k separate operations and cursors. This puts a good deal of strain on the system, so MongoDB Inc. suggests you do this on no more than 1,000 namespaces at a time to prevent overhead issues.

Sadly, it is not currently possible to take a mongod-wide snapshot to support this under the hood, as the snapshot is implemented per namespace directly inside WiredTiger's engine. So for anyone with a very large collection count, this will not be your silver bullet yet. It also means streams between collections and databases are not guaranteed to be in sync. This could be an issue even for someone with a smaller number of namespaces who expects that. Please don't get me wrong: it's a step in the right direction, but it falls short.

I had very high hopes for this to simplify backups in MongoDB. Percona Lab's GitHub has a tool called MongoDB-Consistent-Backup, which tails multiple oplogs to get a consistent sharded backup without the need to pay for MongoDB's backup service or use the complicated design that is Ops Manager (when you host it yourself). Due to the inability to do a system-wide change stream, this type of tool still needs to use the oplog, and if you are not using w:majority it could trigger a failure if you have an election or if a rollback occurs. Still, I have hopes this is something that can be improved in the future to make things better for everyone.
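For reference, a minimal sketch of issuing a write with a majority write concern from the shell (the collection and document are placeholder examples):

    // Require acknowledgment from a majority of replica set members
    // before the write is considered successful.
    db.orders.insertOne(
        { item: "widget", qty: 5 },
        { writeConcern: { w: "majority", wtimeout: 5000 } }
    );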

Retryable writes

Unlike change streams, this feature is much more helpful to the general MongoDB audience, and I am very excited about it. If you have not watched this video, please do so right now! Samantha does a good job explaining the issue and the solution in detail. For now, just know there has been a problem where, if a write hit an issue (network, app shutdown, DB shutdown, election), you had no way to know whether the write succeeded or failed. This unknown situation was terrible for developers, who would not know if they needed to run the command again or not. This is especially true if you have an ordering system and you're trying to remove stock from your inventory. Sessions, as discussed before, allow the client to reconnect after a broken connection and keep getting results, so it knows what happened or didn't. To me, this is the second-best feature of 3.6; only security is more important to me personally.
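As a rough illustration, retryable writes are switched on from the driver side via the connection string (a minimal sketch; the host names, replica set name, and database here are placeholders):

    # 3.6-compatible drivers accept retryWrites in the connection string.
    mongodb://host1:27017,host2:27017,host3:27017/test?replicaSet=rs0&retryWrites=true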

Security improvement

Speaking of security, there is one change that the security community wanted (which I don't think is that big of a deal). For years now, the MongoDB packages for all OSs (and even the Percona Server for MongoDB packages) have by default limited the bindIP setting to localhost. This was to prevent unintended situations where a database was left open to the public. With 3.6, the binaries themselves also default to this. So, yes, it will help some. But I would (and have) argued that when you install a database from binaries or source, you are taking more ownership of its setup compared to using Docker, Yum or Apt.
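For reference, a minimal sketch of the relevant mongod.conf (YAML) setting; the second address is a placeholder for whatever interface you actually intend to expose:

    net:
      port: 27017
      bindIp: 127.0.0.1,10.0.0.12   # localhost plus one explicit interface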

The other major move forward, however, is something I have been waiting for since 2010. Seriously, I am not joking! It offers the ability to limit users to specific CIDR or IP address ranges. Please note MySQL has had this since at least 1998. I can’t recall if it’s just always been there, so let’s say two decades.

This is also known as authenticationRestrictions, and it's an array you can put into the user document when creating a user. The manual describes it as:

The authentication restrictions the server enforces on the created user. Specifies a list of IP addresses and CIDR ranges from which the user is allowed to connect to the server or from which the server can accept users.

I cannot overstate how huge this is, and MongoDB Inc. did a fantastic job on it. Not only does it support the classic client-address matching, it supports an array of these rules with matching on the server IP/host as well. This means supporting multiple IP segments with a single user is very easy. By extension, I could see a future where you could even limit some actions by range – allowing dev/load-test to drop collections, while production apps would not be allowed to. While they should have separate users, I regularly see clients who have one password everywhere. That extension would save them from unplanned outages due to the human error of connecting to the wrong thing.
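A minimal sketch of creating such a user from the 3.6 shell (the user name, password, roles, and addresses are placeholder examples):

    // The user may only authenticate from the listed client ranges,
    // and only against a server listening on the listed address.
    db.getSiblingDB("admin").createUser({
        user: "appUser",
        pwd: "changeMe",
        roles: [ { role: "readWrite", db: "orders" } ],
        authenticationRestrictions: [
            {
                clientSource: [ "10.0.0.0/24", "192.168.1.10" ],
                serverAddress: [ "10.0.0.5" ]
            }
        ]
    });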

We will have a whole blog talking about these changes, their importance and using them in practice. Yes, security is that important!

Major love for arrays and more in Aggregation

This one is a bit easier to summarize. Arrays and dates in particular get some much-needed aggregation love. I could list all the new operators here, but I feel that is better served by a follow-up blog where we talk about each operator and how to get the most out of it. I will say my favorite new option is hint support in aggregation. Finally, I can try to steer the query plan a bit if the optimizer is making bad decisions, which sometimes happens in any technology.
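A minimal sketch of passing a hint to an aggregation in the 3.6 shell (the collection, pipeline stages, and index name are placeholder examples):

    // Ask the planner to use a specific index for the initial $match stage.
    db.orders.aggregate(
        [
            { $match: { status: "shipped" } },
            { $group: { _id: "$customerId", total: { $sum: "$amount" } } }
        ],
        { hint: "status_1" }
    );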

A better balancer

Like many other areas, there was a good deal that went into balancer improvements. However, there are a few key things that continue the work of 3.4’s parallel balancing improvements.

Some of it makes a good deal of sense for anyone in a support role, such as FTDC now also existing on mongos nodes. If you do not know what this is: basically, MongoDB collects some core metrics and state data and puts it into binary files in dbPath for engineers at companies like Percona and MongoDB Inc. to review. That is not to say you can't use this data as well; think of it as a package of performance information for when a bug happens. Other diagnostic improvements include moveChunk now providing data in its output about what happened when it ran. Previously, you could only get this data from the config.changeLog or config.actionLog collections on the config servers. Obviously, more and more people are learning MongoDB's internals, and this information should be made more available to them.

Having talked about diagnostic items, let's move more into the operations wheelhouse. The single biggest frustration with sharding and replica sets is their sensitivity to time variations, which cause odd issues even when using ntpd. To that point, as of 3.6 there is now a logical clock in MongoDB. For the geekier in the crowd, this was implemented using a Lamport clock (there is a great video on them). For the less geeky, think of it as a cluster-wide clock that prevents some of the typical issues related to varying system times. In fact, if you look closely at the oplog record format in 3.6, there is a new wt field for tracking this. Having done that, the team at MongoDB Inc. considered what else was an issue. During events like config server elections, metadata refreshes did not retry enough times and could cause a mongos to stop working or fail. Those days are gone! Now it will check three times as often, for a total of ten attempts, before giving up. This makes the system much more resilient.

A final change that is still somewhat invisible to the user, but helps make dropping collections more stable, is that they removed the issue MongoDB had with dropping and recreating sharded collections. Your namespaces look as they always have. Behind the scenes, however, they have UUIDs attached, so that if foo.bar is dropped and recreated, the new collection has a different UUID. This allows for less-blocking drops. Additionally, it prevents confusion in a distributed system about whether we are talking about the current or the old collection.

JSON Schema validation

Some non-developers might not know much about something called JSON Schema. It allows you to set rules on schema design more efficiently and strictly than MongoDB’s internal rules. With 3.6, you can use this directly. Read more about it here. Some key points:

  • Describes your existing data format
  • Clear, human- and machine-readable documentation
  • Complete structural validation, useful for:
    • Automated testing
    • Validating client-submitted data
You can even make it reject documents when certain fields are missing. MySQL DBAs might ask why this is a big deal: you could always have a DBA define a schema in an RDBMS, and the point of MongoDB was to be flexible. That's a fair and correct view. However, the big point is that you can apply this in production, not just in development. This gives developers the freedom to move quickly, but provides operational teams with control methods to catch mistakes or bugs before production is adversely affected. Taking a step back, it's all about bridging the gap between control and freedom, to ensure both camps are happy and able to do their jobs efficiently.
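A minimal sketch of attaching a JSON Schema validator to a collection in the 3.6 shell (the collection name, fields, and rules are placeholder examples):

    // Documents must contain a string "name" and a non-negative int "qty";
    // anything else in the document remains flexible.
    db.createCollection("products", {
        validator: {
            $jsonSchema: {
                bsonType: "object",
                required: [ "name", "qty" ],
                properties: {
                    name: { bsonType: "string" },
                    qty:  { bsonType: "int", minimum: 0 }
                }
            }
        },
        validationAction: "error"   // reject documents that fail validation
    });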

Compass is Community

If you have never used Compass, you might think this isn't that great: you could just use tools like RoboMongo. You absolutely could, but Compass can do visualization as well as CRUD operations, and it's a very fluid experience that everyone should know is available. This is especially true for QA teams who want to review how often certain types of data are present, or for a business analyst who needs to understand in two seconds what your demographics are.

Major WiredTiger internal overhaul

There is so much here that I recommend any engineer-minded person take a look at this deck, presented by one of the great minds behind WiredTiger. It does a fantastic job explaining all the reasons behind some of the 3.2 and 3.4 scaling issues MongoDB had on WiredTiger. Of particular interest is why it tended to have worse and worse performance as you added more collections and indexes. It then goes into how they fixed those issues:

  • Made evictions smarter, as they are not uniform across collections
  • Improved assumptions around the handle cache
  • Made checkpoints better in all ways
  • Aggressively cleaned up old handles

I hope this provides a peek into the good, bad, and ugly in MongoDB 3.6 Community! Please check back as we publish more in-depth blogs on how these features work in practice, and how to best use them.

Oct 31, 2016

Webinar Wednesday November 2: MongoDB Backups, All Grown Up!


Please join us on Wednesday, November 2, 2016 at 10:00 am PDT / 1:00 pm EDT (UTC-7) for the MongoDB Backups, All Grown Up! webinar, featuring David Murphy, Percona's MongoDB Practice Manager.
 

It has been a long road to stable and dependable backups in the MongoDB space. This webinar covers the current types of backups and their limitations when it comes to sharding. From there we will move into why you can’t be consistent with a single node, and how you can take sharded or unsharded consistent backups. 

The webinar also covers more about the “mongodb_consistent_backup.py” tool, and the features it offers to the open source community: how to use it, what it looks like and why it’s our preferred backup methodology.

Click here to register for the webinar MongoDB Backups, All Grown Up!

David Murphy, MongoDB Practice Manager

David joined Percona in October 2015 as Practice Manager for MongoDB. Prior to that, he joined the ObjectRocket by Rackspace team as the Lead DBA in September 2013. With the growth involved with any recently acquired startup, David's role covered a wide range of duties, from evangelism, research, runbook development, knowledge-base design and consulting to technical account management, mentoring and much more.

Prior to the world of MongoDB, David was a MySQL and NoSQL architect at Electronic Arts. There, he worked on some of the largest titles in the world, like FIFA, SimCity, and Battlefield, with responsibilities spanning tuning, design, and technology choices. David maintains an active interest in database speaking and exploring new technologies.

Oct 19, 2015

Percona Welcomes David Murphy as Practice Manager for MongoDB

We are excited this week to welcome David Murphy to Percona as our Practice Manager for MongoDB®. A veteran of ObjectRocket, Electronic Arts, and Rackspace, David is a welcome addition to our team and will provide vision, direction and best practices for all things MongoDB. David also has an extensive background in MySQL and is a well-known open source advocate and contributor.

At ObjectRocket, David was the Lead DBA and MongoDB Master. He was the first DBA hired by the company; he built the DBA team, developed training plans for them, and provided support and guidance to other groups within the company on the use of new technologies. During this time, he helped build out systems for online MongoDB migrations, and consulted with clients on database design and shard key analysis. As expected of any enterprise-grade operations team, he oversaw database testing and readiness, MongoDB/NoSQL/MySQL escalation support, and version and bug tracking. Additionally, he has contributed to the MongoDB source code and is known for advocating for operational stability and improvements, even when that did not align with MongoDB Inc.'s desired messaging. Throughout, he gave talks to assist others in the real-world running of MongoDB, released scripts to aid others in operational areas, and gave tutorials on complex topics such as sharding and its after-effects.

At Electronic Arts, David was the subject-matter expert for major sports titles like FIFA, which accounted for 27 percent of the company's net revenue. From there, he took on the additional responsibility of planning the company's entry into the NoSQL data space as DB Architect – NoSQL/MySQL. Among many duties, he handled major sports titles, support-service architecture, and proof-of-concept planning for new technologies, including MongoDB, MySQL Cluster, and Clustrix, to help determine their feasibility and logistical fit within EA projects. As a trusted expert in the company, he was regularly sent along with another architect to "parachute" into major issues and bring them to a successful outcome.

At Rackspace, David was a Cloud Sites – Linux Systems Operations Administrator, where he maintained, deployed and managed hundreds of Linux, MySQL, PHP, and storage clusters. David also held administrator and engineering positions at Lead Geniuses, ByDomino, McAfee and The Planet.

A true master of the art and science of MongoDB, David will help Percona become a disruptive force in the MongoDB space by bringing an operational-centric view to everything we do. We look forward to some very exciting months and years ahead.


Aug 26, 2015

ObjectRocket’s David Murphy talks about MongoDB, Percona Live Amsterdam

Say hello to David Murphy, lead DBA and MongoDB Master at ObjectRocket (a Rackspace company). David works on sharding, tool building, very large-scale issues and high-performance MongoDB architecture. Prior to ObjectRocket he was a MySQL/NoSQL architect at Electronic Arts. David enjoys large-scale operational tool building and high-performance OS and database tuning. He is also a core code contributor to MongoDB. He’ll be speaking next month at Percona Live Amsterdam, which runs Sept. 21-23. Enter promo code “BlogInterview” at registration to save €20!


Tom: David, your 3-hour tutorial is titled “Mongo Sharding from the trench: A Veterans field guide.” Did your experience working with vast amounts of data at Rackspace give you a unique perspective that now puts you in a position to help people just getting started? Can you give a couple of examples?

David: I think this is something I grew into organically, from the days of supporting cPanel-type MySQL instances through to today. I have worked in a few verticals, from hosting to advertising to gaming, finally ending up in platform services. Those roles gave me a wealth of knowledge about how customers need systems to work, and the number and range of workloads we see at Rackspace reinforces this.


ObjectRocket’s David Murphy

Many times the unique perspective comes with scale, such as someone growing a single node into the multi-terabyte range. When they go to "shard", they can find that a process that is normally very light and unnoticeable in most Mongo sharding can severely lock the metadata for an extended time. In other cases, the "balancer" might not be able to keep up with the amount of work being asked of it.

Toward the smaller end of the spectrum, having seen so many workloads from big to small, I can see similar thought processes and trends. Having worked with so many of these workloads, and honestly having learned along with the evolution of Mongo, helps me explain to clients the good, the bad and the hairy. Many times discussions come down to people not using connection pooling, using non-indexed sorting, or relying on complex operators such as $in, $nin, and more. In these cases, I can talk to people about the trade-offs of using these concepts and when they will become bigger issues for them. My goal is to give them enough knowledge to help determine when it is correct to use development resources to fix an issue, and when it is manageable and that development time could be better spent elsewhere.

 

Tom: The title of your tutorial also sounds like the perfect title for a book. Do you have any plans for one?

David: What an excellent question! I have thought about this, and it is a goal if I can find the time to do it. A working title might be “Mongo from the trenches: Surviving the minefield to get ahead”. I think the book might be broken into three sections: “When should you (or shouldn't you) use Mongo”, “Schema and operators in the NoSQL world”, and “Sharding”. I would structure it this way because each could be a great mini-book on its own, and the community really could use a level of depth similar to the MySQL 5.0 certification guides. I liked those books because they helped someone understand all the bits of what to consider in schema design, and how it affects the application as much as the database hosts. Then, in the second, more administration-geared half, they took those same schema and design choices and helped you manage them with confidence.

In the end, Mongo is a good product that works well for most people. As it matures, we need more discussion on topics such as what you should monitor, how you should predict issues, and how valuable regular audits are, especially in an ecosystem where it's easy to spin something up, launch it, and move on to the next project.

 

Tom: When and why would you recommend using MongoDB instead of MySQL?

David: I am glad I mentioned this is worthy of a book already, as it is such a complex topic and one that gets me very excited.

I feel there is a bit of misinformation on both sides of this field. Many experts in the MySQL camp know that when someone says they can't get more than 1,000 TPS out of MySQL, nine times out of ten it is a design issue, not a technology issue. The Mongo crowd loves this, and thanks to the inherent sharding nature of Mongo, they can sidestep these types of issues. Conversely, in the Mongo camp you will hear how bad the SQL standard is; however, transactions aside, the same types of operations exist in both MySQL and Mongo. There are some interesting powers in Mongo aggregation. However, SQL is more powerful, and just as complex as some of the map-reduce jobs and aggregations I have written.

As to your question, MySQL will always win with regard to repeatable reads within a transaction. There is some talk of limited transactions in Mongo; however, these will likely not become global and cluster-wide anytime soon, if ever. I don't trust floats in Mongo for financials; it's not that Mongo doesn't do them, but rather that JavaScript-style floats are what you get. Sometimes you need to store data as a 64-bit integer and do the math in the app to treat it as a high-precision value. MySQL, on the other hand, has excellent support for precision.

Another area is simply the history of Mongo and MySQL. Until WiredTiger and RocksDB, Mongo was very similar to MyISAM from a locking behavior and support perspective. With the advent of the new storage engines, we will see major leaps forward in the types of workflows you will want in Mongo. The writer-lock issue is gone, and locking between the two systems is becoming more and more similar, making the decision between them much harder.

The news is not all one-sided, however. Subdocument and array support in Mongo is amazing; there are so many things I can do in Mongo that I could not do even with bitwise SET/ENUM operators. So if you need that type of system, or you want to create a semi-denormalized form of a view in the database, Mongo can do this with ease and on the fly. MySQL, on the other hand, would take careful planning and need whole tables updated. In this regard, I feel more people could use Mongo and its ability to have a versioned document schema, allowing more incremental changes to documents. New code releases can let the application read old versions and “upgrade” them to the latest form, removing a whole flurry of maintenance-related pains that RDBMSs have, to the frustration of developers who just want to launch the new product.

The last thing I would say here is that you need not choose: why not use both? Mongo can be very powerful for keeping a semi-denormalized version of the data that is nimble enough to allow fast application or system updates and features, leaving MySQL for the very specific workloads that need precision, are simple, and are not expected to have schema changes. I am a huge fan of keeping the transactional portions in MySQL and the rest in Mongo, allowing you to scale the bulk of your data needs up and down quickly, and to change more slowly the parts that need to be 100% consistent all of the time, with no room for eventual consistency.

 

Tom: What other session(s) are you most looking forward to, besides your own, at Percona Live Amsterdam?

David: There are a few that are near and dear to me.

“Turtles all the way down: tuning Linux for database workloads” looks like a great one. It reflects a view I have always had: DBAs should be DBAs, sysadmins, and storage people rolled into one. That way they can understand the impacts of the application down to the blocks the database reads.

“TokuDB internals” is another one. I have used TokuDB in MySQL and in Mongo to some degree, but it has never had in-depth documentation. A topic like that is a great way to fill in the gaps for experienced and new people alike.

“Database Reliability Engineering” looks like a great talk from a great speaker.

As an InnoDB geek, I like the idea around “Understanding InnoDB locks: case studies.”

I see a huge amount of potential for MaxScale, so if anyone else is curious, “Anatomy of a Proxy Server: MaxScale Internals” should be good for read/write splits and split-writing type cases.

Finally, one of my favorite people is Charity, as she is always so energetic and can get to the heart of the matter. If you are not going to “Upgrade your database: without losing your data, your perf or your mind”, you are missing out!

 

Tom: Thanks for speaking with me, David! Is there anything else you’d like to add: either about Rackspace or Percona Live Amsterdam?

David: In regards to Rackspace, I urge everyone to check out the Data Services group. We handle everything from Redis to Hadoop, with the goal of augmenting your teams or providing experts to help keep your uptime as high as possible. With options ranging from dedicated hosts to platform-type services, there is something that helps everyone. Rackspace is not just a cloud company but a real support company, providing amazing hardware to use, or support for hardware in other locations, and it is growing rapidly.

As for Percona Live Amsterdam, everyone should come: the group of speakers is simply amazing, and I for one am excited by so many of the topics because they are all so compelling. Beyond that, you will find it hard to find another gathering of database experts with multiple technologies under their belts who truly believe in picking the right technology for the right use case.

