Aug 16, 2018

MongoDB: how to use the JSON Schema Validator


The flexibility of MongoDB as a schemaless database is one of its strengths. In early versions, it was left to application developers to ensure that any necessary data validation is implemented. With the introduction of JSON Schema Validator there are new techniques to enforce data integrity for MongoDB. In this article, we use examples to show you how to use the JSON Schema Validator to introduce validation checks at the database level—and consider the pros and cons of doing so.

Why validate?

MongoDB is a schemaless database. This means that we don’t have to define a fixed schema for a collection. We just need to insert a JSON document into a collection and that’s all. Documents in the same collection can have completely different sets of fields, and even the same field can have different types in different documents: a field that is a string in some documents can be a number in others.

This schemaless design gives MongoDB great flexibility and the capability to adapt to the changing needs of applications. Indeed, this flexibility is one of the main reasons to use MongoDB. Relational databases are not so flexible: you always need to define a schema up front, and later adding columns, creating tables, or changing the existing architecture to respond to the needs of the application can be a very hard task.

The real world can often be messy and MongoDB can really help, but in most cases the real world requires some kind of backbone architecture too. In real applications built on MongoDB there is always some kind of “fixed schema” or “validation rules” in collections and in documents. Is it possible to have in a collection two documents that represent two completely different things?

Well, it’s technically possible, but in most cases it doesn’t make sense for the application. Most of the arguments for enforcing a schema on the data are well known: schemas maintain structure, giving a clear idea of what’s going into the database, reducing preventable bugs, and allowing for cleaner code. Schemas are a form of self-documenting code, as they describe exactly what type of data something should be, and they let you know what checks will be performed. It’s good to be flexible, but behind the scenes we need some firm rules.

So, what we need is a balance between flexibility and schema validation. In real-world applications, we need to define a sort of “backbone schema” for our data while retaining the flexibility to manage specific particularities. In the past, developers implemented schema validation in their applications, but starting with version 3.6, MongoDB supports the JSON Schema Validator. We can rely on it to define a fixed schema and validation rules directly in the database and free the application from that burden.

Let’s have a look at how it works.

JSON Schema Validator

In fact, document validation was already introduced in version 3.2, but the JSON Schema Validator introduced in the 3.6 release is by far the better and friendlier way to manage validation in MongoDB.
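For contrast, here is a minimal sketch of the older 3.2-style validation, which relies on ordinary query operators rather than $jsonSchema (the collection name and rules here are illustrative, not taken from the original post):

db.createCollection( "people_old", {
   validator: {
      // plain query operators: readable for simple rules, clumsy for complex ones
      $and: [
         { name: { $exists: true, $type: "string" } },
         { email: { $regex: /^.+@.+$/ } }
      ]
   }
})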

What we need to do is define the rules using the $jsonSchema operator in the db.createCollection command. The $jsonSchema operator requires a JSON document where we specify all the rules to be applied to each inserted or updated document: for example, which fields are required, what type each field must be, what range of values is permitted, what pattern a specific field must match, and so on.

Let’s have a look at the following example, where we create a people collection and define validation rules with the JSON Schema Validator.

db.createCollection( "people" , {
   validator: { $jsonSchema: {
      bsonType: "object",
      required: [ "name", "surname", "email" ],
      properties: {
         name: {
            bsonType: "string",
            description: "required and must be a string" },
         surname: {
            bsonType: "string",
            description: "required and must be a string" },
         email: {
            bsonType: "string",
            pattern: "^.+@.+$",
            description: "required and must be a valid email address" },
         year_of_birth: {
            bsonType: "int",
            minimum: 1900,
            maximum: 2018,
            description: "the value must be in the range 1900-2018" },
         gender: {
            enum: [ "M", "F" ],
            description: "can be only M or F" }
      }
   }
}})

Based on what we have defined, only 3 fields are strictly required in every document of the collection: name, surname, and email. In particular, the email field must match a specific pattern to ensure the content is a valid address. (Note: to properly validate an email address you need a more complex regular expression; here we use a simpler version that just checks for the presence of the @ symbol.) The other fields are not required, but if someone inserts them, the values must satisfy the validation rules we have defined.
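If you want something stricter than the simple @ check, the email property could use a longer pattern. The following regular expression is an illustrative assumption, still far from full RFC compliance:

         email: {
            bsonType: "string",
            // illustrative stricter pattern: local part, @, domain, dot, TLD
            pattern: "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$",
            description: "required and must look like a plausible email address" },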

Let’s insert some example documents to test that everything is working as expected.

Insert a document with one of the required fields missing:

MongoDB > db.people.insert( { name : "John", surname : "Smith" } )
WriteResult({
   "nInserted" : 0,
   "writeError" : {
      "code" : 121,
      "errmsg" : "Document failed validation"
   }
})

Insert a document with all the required fields but with an invalid email address:

MongoDB > db.people.insert( { name : "John", surname : "Smith", email : "john.smith.gmail.com" } )
WriteResult({
   "nInserted" : 0,
   "writeError" : {
      "code" : 121,
      "errmsg" : "Document failed validation"
   }
})

Finally, insert a valid document:

MongoDB > db.people.insert( { name : "John", surname : "Smith", email : "john.smith@gmail.com" } )
WriteResult({ "nInserted" : 1 })

Now let’s do more inserts that include the other, optional fields.

MongoDB > db.people.insert( { name : "Bruce", surname : "Dickinson", email : "bruce@gmail.com", year_of_birth : NumberInt(1958), gender : "M" } )
WriteResult({ "nInserted" : 1 })
MongoDB > db.people.insert( { name : "Corrado", surname : "Pandiani", email : "corrado.pandiani@percona.com", year_of_birth : NumberInt(1971), gender : "M" } )
WriteResult({ "nInserted" : 1 })
MongoDB > db.people.insert( { name : "Marie", surname : "Adamson", email : "marie@gmail.com", year_of_birth : NumberInt(1992), gender : "F" } )
WriteResult({ "nInserted" : 1 })

The records were inserted correctly because all the rules on the required fields, and on the optional fields, were satisfied. Now let’s look at cases where the year_of_birth or gender fields are not correct.

MongoDB > db.people.insert( { name : "Tom", surname : "Tom", email : "tom@gmail.com", year_of_birth : NumberInt(1980), gender : "X" } )
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
MongoDB > db.people.insert( { name : "Luise", surname : "Luise", email : "tom@gmail.com", year_of_birth : NumberInt(1899), gender : "F" } )
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})

In the first insert, gender is X, but the only valid values are M or F. In the second insert, the year of birth is outside the permitted range.

Let’s try now to insert documents with arbitrary extra fields that are not in the JSON Schema Validator.

MongoDB > db.people.insert( { name : "Tom", surname : "Tom", email : "tom@gmail.com", year_of_birth : NumberInt(2000), gender : "M", shirt_size : "XL", preferred_band : "Coldplay" } )
WriteResult({ "nInserted" : 1 })
MongoDB > db.people.insert( { name : "Luise", surname : "Luise", email : "tom@gmail.com", gender : "F", shirt_size : "M", preferred_band : "Maroon Five" } )
WriteResult({ "nInserted" : 1 })
MongoDB > db.people.find().pretty()
{
	"_id" : ObjectId("5b6b12e0f213dc83a7f5b5e8"),
	"name" : "John",
	"surname" : "Smith",
	"email" : "john.smith@gmail.com"
}
{
	"_id" : ObjectId("5b6b130ff213dc83a7f5b5e9"),
	"name" : "Bruce",
	"surname" : "Dickinson",
	"email" : "bruce@gmail.com",
	"year_of_birth" : 1958,
	"gender" : "M"
}
{
	"_id" : ObjectId("5b6b1328f213dc83a7f5b5ea"),
	"name" : "Corrado",
	"surname" : "Pandiani",
	"email" : "corrado.pandiani@percona.com",
	"year_of_birth" : 1971,
	"gender" : "M"
}
{
	"_id" : ObjectId("5b6b1356f213dc83a7f5b5ed"),
	"name" : "Marie",
	"surname" : "Adamson",
	"email" : "marie@gmail.com",
	"year_of_birth" : 1992,
	"gender" : "F"
}
{
	"_id" : ObjectId("5b6b1455f213dc83a7f5b5f0"),
	"name" : "Tom",
	"surname" : "Tom",
	"email" : "tom@gmail.com",
	"year_of_birth" : 2000,
	"gender" : "M",
	"shirt_size" : "XL",
	"preferred_band" : "Coldplay"
}
{
	"_id" : ObjectId("5b6b1476f213dc83a7f5b5f1"),
	"name" : "Luise",
	"surname" : "Luise",
	"email" : "tom@gmail.com",
	"gender" : "F",
	"shirt_size" : "M",
	"preferred_band" : "Maroon Five"
}

As we can see, we have the flexibility to add new fields with no restrictions on the permitted values.

Having a really fixed schema

The behavior we have seen so far, permitting the addition of extra fields that are not in the validation rules, is the default. If we want to be more restrictive and have a truly fixed schema for the collection, we need to add the additionalProperties: false parameter in the createCollection command.

In the following example, we create a validator that permits only the listed fields: name, age, and the _id that MongoDB adds automatically. No extra fields are permitted.

db.createCollection( "people2" , {
   validator: {
     $jsonSchema: {
        bsonType: "object",
        additionalProperties: false,
        required: [ "name", "age" ],
        properties: {
           _id : {
              bsonType: "objectId" },
           name: {
              bsonType: "string",
              description: "required and must be a string" },
           age: {
              bsonType: "int",
              minimum: 0,
              maximum: 100,
              description: "required and must be in the range 0-100" }
        }
     }
}})

Note a couple of differences:

  • additionalProperties: false rejects any field that is not listed in properties; note that it does not by itself make the listed fields mandatory, which is why the schema still declares name and age in the required array
  • we need to list even the _id field explicitly: MongoDB adds it automatically to every document, so omitting it from properties would cause every insert to fail

As you can see in the following test, we are no longer allowed to add extra fields: every document must contain exactly the name and age fields (plus the automatically generated _id).

MongoDB > db.people2.insert( {name : "George", age: NumberInt(30)} )
WriteResult({ "nInserted" : 1 })
MongoDB > db.people2.insert( {name : "Maria", age: NumberInt(35), surname: "Peterson"} )
WriteResult({
   "nInserted" : 0,
   "writeError" : {
      "code" : 121,
      "errmsg" : "Document failed validation"
   }
})
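Since the schema also declares name and age in the required array, omitting one of them fails as well:

MongoDB > db.people2.insert( {name : "George"} )
WriteResult({
   "nInserted" : 0,
   "writeError" : {
      "code" : 121,
      "errmsg" : "Document failed validation"
   }
})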

In this case we have given up flexibility, and flexibility is the main benefit of having a NoSQL database like MongoDB.

Well, it’s up to you whether to use it or not: it depends on the nature and goals of your application. I wouldn’t recommend it in most cases.

Add validation to existing collections

We have seen so far how to create a new collection with validation rules, but what about existing collections? How can we add rules to them?

This is quite trivial: the syntax to use for $jsonSchema remains the same; we just need to use the collMod command instead of createCollection. The following example shows how to create validation rules on an existing collection.

First we create a simple new collection, people3, and insert some documents.

MongoDB > db.people3.insert( {name: "Corrado", surname: "Pandiani", year_of_birth: NumberLong(1971)} )
WriteResult({ "nInserted" : 1 })
MongoDB > db.people3.insert( {name: "Tom", surname: "Cruise", year_of_birth: NumberLong(1961), gender: "M"} )
WriteResult({ "nInserted" : 1 })
MongoDB > db.people3.insert( {name: "Kevin", surname: "Bacon", year_of_birth: NumberLong(1964), gender: "M", shirt_size: "L"} )
WriteResult({ "nInserted" : 1 })

Let’s create the validator.

MongoDB > db.runCommand( { collMod: "people3",
   validator: {
      $jsonSchema : {
         bsonType: "object",
         required: [ "name", "surname", "gender" ],
         properties: {
            name: {
               bsonType: "string",
               description: "required and must be a string" },
            surname: {
               bsonType: "string",
               description: "required and must be a string" },
            gender: {
               enum: [ "M", "F" ],
               description: "required and must be M or F" }
         }
       }
},
validationLevel: "moderate",
validationAction: "warn"
})

The two new options, validationLevel and validationAction, are important in this case.

validationLevel can have the following values:

  • “off”: validation is not applied
  • “strict”: the default value. Validation applies to all inserts and updates
  • “moderate”: validation applies to inserts and to updates of existing documents that already satisfy the rules. Updates of existing documents that do not satisfy them are not checked

When creating validation rules on existing collections, the “moderate” value is the safest option.

validationAction can have the following values:

  • “error”: the default value. A document must pass the validation in order to be written
  • “warn”: a document that doesn’t pass the validation is written anyway, but a warning message is logged by mongod

When adding validation rules to an existing collection, the safest option is “warn”.
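As a quick check of this configuration (the new person’s data is just an example), an insert that violates the rules, missing the required gender field, now succeeds, and mongod writes a warning to its log instead of rejecting the document:

MongoDB > db.people3.insert( {name: "Sandra", surname: "Oh"} )
WriteResult({ "nInserted" : 1 })

Had we kept the default “error” action together with the “moderate” level, this insert would have been rejected, while updates to documents that were already invalid when the validator was added (such as the first people3 document, which has no gender field) would still be allowed.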

These two options can also be used with createCollection. We didn’t use them there because the default values are good in most cases.

How to investigate a collection definition

In case we want to see how a collection was defined, and, in particular, what the validator rules are, the command db.getCollectionInfos() can be used. The following example shows how to investigate the “schema” we have created for the people collection.

MongoDB > db.getCollectionInfos( {name: "people"} )
[
  {
    "name" : "people",
    "type" : "collection",
    "options" : {
      "validator" : {
        "$jsonSchema" : {
          "bsonType" : "object",
          "required" : [
            "name",
            "surname",
            "email"
          ],
          "properties" : {
            "name" : {
              "bsonType" : "string",
              "description" : "required and must be a string"
            },
            "surname" : {
              "bsonType" : "string",
              "description" : "required and must be a string"
            },
            "email" : {
              "bsonType" : "string",
              "pattern" : "^.+@.+$",
              "description" : "required and must be a valid email address"
            },
            "year_of_birth" : {
              "bsonType" : "int",
              "minimum" : 1900,
              "maximum" : 2018,
              "description" : "the value must be in the range 1900-2018"
            },
            "gender" : {
              "enum" : [
                "M",
                "F"
              ],
              "description" : "can be only M or F"
            }
          }
        }
      }
    },
    "info" : {
      "readOnly" : false,
      "uuid" : UUID("5b98c6f0-2c9e-4c10-a3f8-6c1e7eafd2b4")
    },
    "idIndex" : {
      "v" : 2,
      "key" : {
        "_id" : 1
      },
      "name" : "_id_",
      "ns" : "test.people"
    }
  }
]

Limitations and restrictions

Validators cannot be defined for collections in the following databases: admin, local, config.

Validators cannot be defined for system.* collections.

A limitation of the current implementation of the JSON Schema Validator is that the error messages don’t help you understand which of the rules the document failed. You have to work this out manually by running some tests, and that’s not so easy when dealing with complex documents. More specific error strings, ideally taken from the validator definition, would be very useful when debugging application errors and warnings. This is definitely something that should be improved in future releases.

While waiting for improvements, someone has developed a wrapper for the mongo client that produces more descriptive error strings. You can have a look at https://www.npmjs.com/package/mongo-schemer. You can test it and use it, but pay attention to its caveat: “Running in prod is not recommended due to the overhead of validating documents against the schema”.

Conclusions

Doing schema validation in the application remains, in general, a best practice, but the JSON Schema Validator is a good tool for enforcing validation directly in the database.

So even though it needs some improvements, the JSON Schema feature is good enough for most common cases. We suggest testing it and using it when you really need to create a backbone structure for your data.


Jun 13, 2016

Webinar Thursday, June 16: MongoDB Schema Design


Please join Jon Tobin, Director of Solutions Engineering at Percona, on Thursday, June 16, 2016 at 10:00 am PDT (UTC-7) for a webinar on “MongoDB® Schema Design.”

Jon will discuss the most common misconception when evaluating the use of MongoDB: that it is “schemaless.” THIS IS NOT TRUE. MongoDB has a document structure, and thus a schema. While the structure is much more dynamic than that of most relational database models, the choices that you make can and will pay themselves forward (or haunt you forever).

In this webinar, we’ll cover what a document is, how documents can be structured, and what structures work (and don’t work) for particular use cases. We will also touch on design decisions and how they affect the ability of the cluster to scale in the future. Some of the topics that will be covered are:

  • Document Structure
  • Embedding vs Referencing
  • Normalization vs De-Normalization
  • Atomicity
  • MongoDB Sharding

Register here.

Jon Tobin, Director of Solutions Engineering

When not saving kittens from sequoias or helping the elderly across busy intersections, Jon Tobin is Percona’s Director of Solutions Engineering. He has spent over 15 years in the IT industry. For the last 6 years, Jon has been helping innovative IT companies assess and address customers’ business needs through well-designed solutions.

 

Aug 01, 2013

Schema Design in MongoDB vs Schema Design in MySQL

For people used to relational databases, using NoSQL solutions such as MongoDB brings interesting challenges. One of them is schema design: while in the relational world, normalization is a good way to start, how should we design our collections when creating a new MongoDB application?

Let’s see with a simple example how we would create a data structure for MySQL (or any relational database) and for MongoDB. We will assume in this post that we want to store people’s information (their name) and the details of their passport (country and validity date).

Relational Design

In the relational world, the basic idea is to try to stick to the 3rd normal form and create two tables (I’ll omit indexes and foreign keys for clarity – MongoDB supports indexes but not foreign keys):

mysql> select * from people;
+----+------------+
| id | name       |
+----+------------+
|  1 | Stephane   |
|  2 | John       |
|  3 | Michael    |
|  4 | Cinderella |
+----+------------+
mysql> select * from passports;
+----+-----------+---------+-------------+
| id | people_id | country | valid_until |
+----+-----------+---------+-------------+
|  4 |         1 | FR      | 2020-01-01  |
|  5 |         2 | US      | 2020-01-01  |
|  6 |         3 | RU      | 2020-01-01  |
+----+-----------+---------+-------------+

One of the good things about such a design is that it’s equally easy to run any query (as long as we don’t consider joins as something difficult to use):

  • Do you want the number of people?
    SELECT count(*) FROM people
  • Do you want to know the validity date of Stephane’s passport?
    SELECT valid_until from passports ps join people pl ON ps.people_id = pl.id WHERE name = 'Stephane'
  • Do you want to know which people do not have a passport?
    SELECT name FROM people pl LEFT JOIN passports ps ON ps.people_id = pl.id WHERE ps.id IS NULL
  • etc

MongoDB design

Now how should we design our collections in MongoDB to make querying easy?

Using the 3rd normal form is of course possible, but it would probably be inefficient, as all joins would have to be done in the application. So out of the 3 queries above, only query #1 could be run easily.
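To make this concrete, here is a sketch of what query #2 would look like against a hypothetical normalized layout (the people_norm and passports_norm collection names are illustrative, not from the original example), with the “join” done by the application in two round trips:

> db.people_norm.insert( { _id: 1, name: "Stephane" } )
> db.passports_norm.insert( { people_id: 1, country: "FR", valid_until: ISODate("2020-01-01") } )
> // first round trip: fetch the person; second round trip: fetch the passport
> var person = db.people_norm.findOne( { name: "Stephane" } )
> db.passports_norm.findOne( { people_id: person._id } ).valid_until
ISODate("2020-01-01T00:00:00Z")

So which other designs could we have?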

A first option would be to store everything in the same collection:

> db.people_all.find().pretty()
{
	"_id" : ObjectId("51f7be1cd6189a56c399d3bf"),
	"name" : "Stephane",
	"country" : "FR",
	"valid_until" : ISODate("2019-12-31T23:00:00Z")
}
{
	"_id" : ObjectId("51f7be3fd6189a56c399d3c0"),
	"name" : "John",
	"country" : "US",
	"valid_until" : ISODate("2019-12-31T23:00:00Z")
}
{
	"_id" : ObjectId("51f7be4dd6189a56c399d3c1"),
	"name" : "Michael",
	"country" : "RU",
	"valid_until" : ISODate("2019-12-31T23:00:00Z")
}
{ "_id" : ObjectId("51f7be5cd6189a56c399d3c2"), "name" : "Cinderella" }

By the way, we can see here that MongoDB is schemaless: there is no problem in storing documents that do not have the same structure.

The drawback is that it is no longer clear which attributes belong to the passport, so if you want to get all passport information for Michael, you will need to correctly understand the whole data structure.

A second option would be to embed passport information inside people information – MongoDB supports rich documents:

> db.people_embed.find().pretty()
{
	"_id" : ObjectId("51f7c0048ded44d5ebb83774"),
	"name" : "Stephane",
	"passport" : {
		"country" : "FR",
		"valid_until" : ISODate("2019-12-31T23:00:00Z")
	}
}
{
	"_id" : ObjectId("51f7c70e8ded44d5ebb83775"),
	"name" : "John",
	"passport" : {
		"country" : "US",
		"valid_until" : ISODate("2019-12-31T23:00:00Z")
	}
}
{
	"_id" : ObjectId("51f7c71b8ded44d5ebb83776"),
	"name" : "Michael",
	"passport" : {
		"country" : "RU",
		"valid_until" : ISODate("2019-12-31T23:00:00Z")
	}
}
{ "_id" : ObjectId("51f7c7258ded44d5ebb83777"), "name" : "Cinderella" }

Or we could embed the other way around (however this looks a bit dubious, as some people, like Cinderella in our example, may not have a passport):

> db.passports_embed.find().pretty()
{
	"_id" : ObjectId("51f7c7e58ded44d5ebb8377b"),
	"country" : "FR",
	"valid_until" : ISODate("2019-12-31T23:00:00Z"),
	"person" : {
		"name" : "Stephane"
	}
}
{
	"_id" : ObjectId("51f7c7ec8ded44d5ebb8377c"),
	"country" : "US",
	"valid_until" : ISODate("2019-12-31T23:00:00Z"),
	"person" : {
		"name" : "John"
	}
}
{
	"_id" : ObjectId("51f7c7fa8ded44d5ebb8377d"),
	"country" : "RU",
	"valid_until" : ISODate("2019-12-31T23:00:00Z"),
	"person" : {
		"name" : "Michael"
	}
}
{
	"_id" : ObjectId("51f7c8058ded44d5ebb8377e"),
	"person" : {
		"name" : "Cinderella"
	}
}

That’s a lot of options! How can we choose? Here is where you should be aware of a fundamental difference between MongoDB and relational databases when it comes to schema design:

Collections inside MongoDB should be designed with the most frequent access patterns of the application in mind, while in the relational world, you can forget how data will be accessed if your tables are normalized.

So…

  • If you read people information 99% of the time, having 2 separate collections can be a good solution: it avoids keeping in memory data that is almost never used (passport information), and when you need all the information for a given person, it may be acceptable to do the join in the application.
  • Same thing if you want to display the name of people on one screen and the passport information on another screen.
  • But if you want to display all information for a given person, storing everything in the same collection (with embedding or with a flat structure) is likely to be the best solution, as the queries sketched below show.
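For that last case, here are a couple of sketches against the people_embed collection shown earlier:

> // all information for a given person: a single document read, no join needed
> db.people_embed.findOne( { name: "Stephane" } )
> // or just the passport sub-document, using a projection
> db.people_embed.findOne( { name: "Stephane" }, { passport: 1, _id: 0 } )
{ "passport" : { "country" : "FR", "valid_until" : ISODate("2019-12-31T23:00:00Z") } }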

Conclusion

We saw in this post one of the fundamental differences between MySQL and MongoDB when it comes to creating the right data structure for an application: with MongoDB, you need to know the data access pattern of the application. This should not be neglected as creating a wrong schema design is a recipe for disaster: queries will be difficult to write and to optimize, they will be slow and they will sometimes need to be replaced by custom code. All that can lead to low performance and frustration.

The next question is: which way is better? And of course, there is no definite answer: MongoDB fans will say that by making all access patterns equal, normalization makes them all equally bad, while normalization fans will say that a normalized schema provides good performance for most applications and that you can always denormalize to help a few queries run faster.

