Aug
17
2018
--

Replication from Percona Server for MySQL to PostgreSQL using pg_chameleon

postgres mysql replication using pg_chameleon

Replication is one of the well-known features that allows us to build an identical copy of a database. It is supported in almost every RDBMS. The advantages of replication may be huge, especially HA (High Availability) and load balancing. But what if we need to build replication between 2 heterogeneous databases like MySQL and PostgreSQL? Can we continuously replicate changes from a MySQL database to a PostgreSQL database? The answer to this question is pg_chameleon.

For replicating continuous changes, pg_chameleon uses the mysql-replication library to pull the row images from MySQL, which are transformed into a jsonb object. A pl/pgsql function in postgres decodes the jsonb and replays the changes into the postgres database. In order to set up this type of replication, your mysql binlog_format must be “ROW”.

A few points you should know before setting up this tool (a quick way to verify them is shown right after this list):

  1. Tables that need to be replicated must have a primary key.
  2. Works for PostgreSQL versions > 9.5 and MySQL > 5.5
  3. binlog_format must be ROW in order to set up this replication.
  4. Python version must be > 3.3
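
A minimal way to confirm these prerequisites on your own servers is sketched below. It assumes the mysql, psql and python3 client binaries are in your PATH and that you can log in to both databases; adjust the connection options to your environment.

$ mysql -uroot -p -e "SHOW VARIABLES LIKE 'binlog_format'"
$ mysql -uroot -p -e "SELECT VERSION()"
$ psql -c "SELECT version()"
$ python3 --version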

When you initialize the replication, pg_chameleon pulls the data from MySQL using the CSV format in slices, to prevent memory overload. This data is flushed to postgres using the COPY command. If COPY fails, it tries INSERT, which may be slow. If INSERT fails, then the row is discarded.

To replicate changes from mysql, pg_chameleon mimics the behavior of a mysql slave. It creates the schema in postgres, performs the initial data load, connects to the MySQL replication protocol and stores the row images into a table in postgres. The respective functions in postgres then decode those rows and apply the changes. This is similar to storing relay logs in postgres tables and applying them to a postgres schema. You do not have to create a postgres schema using any DDLs. This tool automatically does that for the tables configured for replication. If you need to specifically convert any types, you can specify this in the configuration file.
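
Once replication is running (Step 6 below), you can get a feel for this relay-log-like mechanism by peeking at the relay tables pg_chameleon maintains in the sch_chameleon schema. A small sketch, assuming the db_replica database used later in this post:

$ psql -d db_replica -c "\dt sch_chameleon.t_*"
$ psql -d db_replica -c "SELECT * FROM sch_chameleon.t_replica_batch"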

The following is just an exercise that you can experiment with and implement if it completely satisfies your requirement. We performed these tests on CentOS Linux release 7.4.

Prepare the environment

Set up Percona Server for MySQL

Install MySQL 5.7 and add the appropriate parameters for replication.

In this exercise, I have installed Percona Server for MySQL 5.7 using YUM repo.

yum install http://www.percona.com/downloads/percona-release/redhat/0.1-6/percona-release-0.1-6.noarch.rpm
yum install Percona-Server-server-57
echo "mysql ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
usermod -s /bin/bash mysql
sudo su - mysql

pg_chameleon requires the following parameters to be set in your my.cnf file (the parameter file of your MySQL server). You may add them to /etc/my.cnf

binlog_format= ROW
binlog_row_image=FULL
log-bin = mysql-bin
server-id = 1

Now start your MySQL server after adding the above parameters to your my.cnf file.

$ service mysql start

Fetch the temporary root password from mysqld.log, and reset the root password using mysqladmin.

$ grep "temporary" /var/log/mysqld.log
$ mysqladmin -u root -p password 'Secret123!'

Now, connect to your MySQL instance and create sample schema/tables. I have also created an emp table for validation.

$ wget http://downloads.mysql.com/docs/sakila-db.tar.gz
$ tar -xzf sakila-db.tar.gz
$ mysql -uroot -pSecret123! < sakila-db/sakila-schema.sql
$ mysql -uroot -pSecret123! < sakila-db/sakila-data.sql
$ mysql -uroot -pSecret123! sakila -e "create table emp (id int PRIMARY KEY, first_name varchar(20), last_name varchar(20))"

Create a user for configuring replication using pg_chameleon and give appropriate privileges to the user using the following steps.

$ mysql -uroot -p
create user 'usr_replica'@'%' identified by 'Secret123!';
GRANT ALL ON sakila.* TO 'usr_replica'@'%';
GRANT RELOAD, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'usr_replica'@'%';
FLUSH PRIVILEGES;

While creating the user in your mysql server (‘usr_replica’@’%’), you may wish to replace % with the appropriate IP or hostname of the server on which pg_chameleon is running.
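
For example, a variant of the grants above that restricts the replication user to a hypothetical pg_chameleon host at 192.168.56.10 (this IP is only an illustration; substitute your own) would look like this:

create user 'usr_replica'@'192.168.56.10' identified by 'Secret123!';
GRANT ALL ON sakila.* TO 'usr_replica'@'192.168.56.10';
GRANT RELOAD, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'usr_replica'@'192.168.56.10';
FLUSH PRIVILEGES;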

Set up PostgreSQL

Install PostgreSQL and start the database instance.

You may use the following steps to install PostgreSQL 10.x

yum install https://yum.postgresql.org/10/redhat/rhel-7.4-x86_64/pgdg-centos10-10-2.noarch.rpm
yum install postgresql10*
su - postgres
$ /usr/pgsql-10/bin/initdb
$ /usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data start

As seen in the following log, create a user in PostgreSQL that pg_chameleon can use to write the changed data to PostgreSQL, and create the target database.

postgres=# CREATE USER usr_replica WITH ENCRYPTED PASSWORD 'secret';
CREATE ROLE
postgres=# CREATE DATABASE db_replica WITH OWNER usr_replica;
CREATE DATABASE

Steps to install and setup replication using pg_chameleon

Step 1: In this exercise, I installed Python 3.6 and pg_chameleon 2.0.8 using the following steps. You may skip the python install steps if you already have the desired python release. We can create a virtual environment if the OS does not include Python 3.x by default.

yum install gcc openssl-devel bzip2-devel wget
cd /usr/src
wget https://www.python.org/ftp/python/3.6.6/Python-3.6.6.tgz
tar xzf Python-3.6.6.tgz
cd Python-3.6.6
./configure --enable-optimizations
make altinstall
python3.6 -m venv venv
source venv/bin/activate
pip install pip --upgrade
pip install pg_chameleon

Step 2: This tool requires a configuration file to store the source/target server details, and a directory to store the logs. Use the following command to let pg_chameleon create the configuration file template and the respective directories for you.

$ chameleon set_configuration_files

The above command produces the following output, which shows that it created some directories and a sample configuration file under the home directory of the user running the command.

creating directory /var/lib/pgsql/.pg_chameleon
creating directory /var/lib/pgsql/.pg_chameleon/configuration/
creating directory /var/lib/pgsql/.pg_chameleon/logs/
creating directory /var/lib/pgsql/.pg_chameleon/pid/
copying configuration example in /var/lib/pgsql/.pg_chameleon/configuration//config-example.yml

Copy the sample configuration file to another file, let’s say, default.yml

$ cd .pg_chameleon/configuration/
$ cp config-example.yml default.yml

Here is how my default.yml file looks after adding all the required parameters. In this file, we can optionally specify the data type conversions, the tables to be skipped from replication, and the DML events that need to be skipped for a selected list of tables.

---
#global settings
pid_dir: '~/.pg_chameleon/pid/'
log_dir: '~/.pg_chameleon/logs/'
log_dest: file
log_level: info
log_days_keep: 10
rollbar_key: ''
rollbar_env: ''
# type_override allows the user to override the default type conversion into a different one.
type_override:
  "tinyint(1)":
    override_to: boolean
    override_tables:
      - "*"
#postgres  destination connection
pg_conn:
  host: "localhost"
  port: "5432"
  user: "usr_replica"
  password: "secret"
  database: "db_replica"
  charset: "utf8"
sources:
  mysql:
    db_conn:
      host: "localhost"
      port: "3306"
      user: "usr_replica"
      password: "Secret123!"
      charset: 'utf8'
      connect_timeout: 10
    schema_mappings:
      sakila: sch_sakila
    limit_tables:
#      - delphis_mediterranea.foo
    skip_tables:
#      - delphis_mediterranea.bar
    grant_select_to:
      - usr_readonly
    lock_timeout: "120s"
    my_server_id: 100
    replica_batch_size: 10000
    replay_max_rows: 10000
    batch_retention: '1 day'
    copy_max_memory: "300M"
    copy_mode: 'file'
    out_dir: /tmp
    sleep_loop: 1
    on_error_replay: continue
    on_error_read: continue
    auto_maintenance: "disabled"
    gtid_enable: No
    type: mysql
    skip_events:
      insert:
#        - delphis_mediterranea.foo #skips inserts on the table delphis_mediterranea.foo
      delete:
#        - delphis_mediterranea #skips deletes on schema delphis_mediterranea
      update:

Step 3: Initialize the replica using this command:

$ chameleon create_replica_schema --debug

The above command creates a schema and nine tables in the PostgreSQL database that you specified in the .pg_chameleon/configuration/default.yml file. These tables are needed to manage replication from source to destination. The same can be observed in the following log.

db_replica=# \dn
List of schemas
Name | Owner
---------------+-------------
public | postgres
sch_chameleon | target_user
(2 rows)
db_replica=# \dt sch_chameleon.t_*
List of relations
Schema | Name | Type | Owner
---------------+------------------+-------+-------------
sch_chameleon | t_batch_events | table | target_user
sch_chameleon | t_discarded_rows | table | target_user
sch_chameleon | t_error_log | table | target_user
sch_chameleon | t_last_received | table | target_user
sch_chameleon | t_last_replayed | table | target_user
sch_chameleon | t_log_replica | table | target_user
sch_chameleon | t_replica_batch | table | target_user
sch_chameleon | t_replica_tables | table | target_user
sch_chameleon | t_sources | table | target_user
(9 rows)

Step 4: Add the source details to pg_chameleon using the following command. Provide the name of the source as specified in the configuration file. In this example, the source name is mysql and the target is the postgres database defined under pg_conn.

$ chameleon add_source --config default --source mysql --debug

Once you run the above command, you should see that the source details are added to the t_sources table.

db_replica=# select * from sch_chameleon.t_sources;
-[ RECORD 1 ]-------+----------------------------------------------
i_id_source | 1
t_source | mysql
jsb_schema_mappings | {"sakila": "sch_sakila"}
enm_status | ready
t_binlog_name |
i_binlog_position |
b_consistent | t
b_paused | f
b_maintenance | f
ts_last_maintenance |
enm_source_type | mysql
v_log_table | {t_log_replica_mysql_1,t_log_replica_mysql_2}
$ chameleon show_status --config default
Source id Source name Type Status Consistent Read lag Last read Replay lag Last replay
----------- ------------- ------ -------- ------------ ---------- ----------- ------------ -------------
1 mysql mysql ready Yes N/A N/A

Step 5: Initialize the replica/slave using the following command. Specify the source from which you are replicating the changes to the PostgreSQL database.

$ chameleon init_replica --config default --source mysql --debug

Initialization involves the following tasks on the MySQL server (source); the equivalent manual statements are sketched right after this list.

1. Flush the tables with read lock
2. Get the master’s coordinates
3. Copy the data
4. Release the locks
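
For reference, these steps correspond roughly to the following manual MySQL statements (a sketch only; pg_chameleon runs its own equivalents internally):

$ mysql -uroot -p
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;
-- step 3: the data copy happens here
UNLOCK TABLES;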

The init_replica command creates the target schema in your postgres database automatically.
In the default.yml file, we mentioned the following schema_mappings.

schema_mappings:
sakila: sch_sakila

So, it has now created the new schema sch_sakila in the target database db_replica.

db_replica=# \dn
List of schemas
Name | Owner
---------------+-------------
public | postgres
sch_chameleon | usr_replica
sch_sakila | usr_replica
(3 rows)

Step 6: Now, start replication using the following command.

$ chameleon start_replica --config default --source mysql

Step 7: Check replication status and any errors using the following commands.

$ chameleon show_status --config default
$ chameleon show_errors

This is how the status looks:

$ chameleon show_status --source mysql
Source id Source name Type Status Consistent Read lag Last read Replay lag Last replay
----------- ------------- ------ -------- ------------ ---------- ----------- ------------ -------------
1 mysql mysql running No N/A N/A
== Schema mappings ==
Origin schema Destination schema
--------------- --------------------
sakila sch_sakila
== Replica status ==
--------------------- ---
Tables not replicated 0
Tables replicated 17
All tables 17
Last maintenance N/A
Next maintenance N/A
Replayed rows
Replayed DDL
Skipped rows

Now, you should see that the changes are continuously getting replicated from MySQL to PostgreSQL.

Step 8: To validate, you may insert a record into the table in MySQL that we created for the purpose of validation and check that it is replicated to postgres.

$ mysql -u root -pSecret123! -e "INSERT INTO sakila.emp VALUES (1,'avinash','vallarapu')"
mysql: [Warning] Using a password on the command line interface can be insecure.
$ psql -d db_replica -c "select * from sch_sakila.emp"
 id | first_name | last_name
----+------------+-----------
  1 | avinash    | vallarapu
(1 row)

In the above log, we see that the record that was inserted to the MySQL table was replicated to the PostgreSQL table.
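
You can run a similar check for updates and deletes. A minimal sketch (output not shown), assuming the same emp table:

$ mysql -u root -pSecret123! -e "UPDATE sakila.emp SET last_name = 'vallarapu2' WHERE id = 1"
$ psql -d db_replica -c "select * from sch_sakila.emp"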

You may also add multiple sources for replication to PostgreSQL (target), as sketched below.
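
A rough sketch of what this might look like, assuming a second MySQL instance on a hypothetical host mysql2.example.com exposed under the source name mysql_2 (all of these names are illustrative; the second source follows the same structure and options as the mysql source shown earlier, with some keys omitted here for brevity):

sources:
  # ... the existing mysql source stays as it is ...
  mysql_2:
    db_conn:
      host: "mysql2.example.com"
      port: "3306"
      user: "usr_replica"
      password: "Secret123!"
      charset: 'utf8'
    schema_mappings:
      world: sch_world
    my_server_id: 101
    type: mysql

$ chameleon add_source --config default --source mysql_2 --debug
$ chameleon init_replica --config default --source mysql_2 --debug
$ chameleon start_replica --config default --source mysql_2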

Reference: http://www.pgchameleon.org/documents/

Please refer to the above documentation to find out about the many more options that are available with pg_chameleon.


Aug
17
2018
--

Klarity uses AI to strip drudgery from contract review

Klarity, a member of the Y Combinator 2018 Summer class, wants to automate much of the contract review process by applying artificial intelligence, specifically natural language processing.

Company co-founder and CEO Andrew Antos has experienced the pain of contract reviews first hand. After graduating from Harvard Law, he landed a job spending 16 hours a day reviewing contract language, a process he called mind-numbing. He figured there had to be a way to put technology to bear on the problem and Klarity was born.

“A lot of companies are employing internal or external lawyers because their customers, vendors or suppliers are sending them a contract to sign,” Antos explained. “They have to get somebody to read it, understand it and figure out whether it’s something that they can sign or if it requires specific changes.”

You may think that this kind of work would be difficult to automate, but Antos said that contracts have fairly standard language and most companies use ‘playbooks.’ “Think of the playbook as a checklist for NDAs, sales agreements and vendor agreements — what they are looking for and specific preferences on what they agree to or what needs to be changed,” Antos explained.

Klarity is a subscription cloud service that checks contracts in Microsoft Word documents using NLP. It makes suggestions when it sees something that doesn’t match up with the playbook checklist. The product then generates a document, and a human lawyer reviews and signs off on the suggested changes, reducing the review time from an hour or more to 10 or 15 minutes.

Screenshot: Klarity

They launched the first iteration of the product last year and have 14 companies using it with 4 paying customers so far including one of the world’s largest private equity funds. These companies signed on because they have to process huge numbers of contracts. Klarity is helping them save time and money, while applying their preferences in a consistent fashion, something that a human reviewer can have trouble doing.

He acknowledges the solution could be taking away work from human lawyers, something they think about quite a bit. Ultimately though, they believe that contract review is so tedious that automating it frees up lawyers for work that requires a greater level of intellectual rigor and creativity.

Antos met his co-founder and CTO, Nischal Nadhamuni, at an MIT entrepreneurship class in 2016 and the two became fast friends. In fact, he says that they pretty much decided to start a company the first day. “We spent 3 hours walking around Cambridge and decided to work together to solve this real problem people are having.”

They applied to Y Combinator two other times before being accepted in this summer’s cohort. The third time was the charm. He says the primary value of being in YC is the community and friendships they have formed and the help they have had in refining their approach.

“It’s like having a constant mirror that helps you realize any mistakes or any suboptimal things in your business on a high speed basis,” he said.

Aug
17
2018
--

This Week in Data with Colin Charles 49: MongoDB Conference Opportunities and Serverless Aurora MySQL

Colin Charles

Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Beyond the MongoDB content that will be at Percona Live Europe 2018, there is also a bit of an agenda for MongoDB Europe 2018, happening on November 8 in London—a day after Percona Live in Frankfurt. I expect you’ll see a diverse set of MongoDB content at Percona Live.

The Percona Live Europe Call for Papers closes TODAY! (Friday August 17, 2018)

From Amazon, there have been some good MySQL changes. You now have access to time delayed replication as a strategy for your High Availability and disaster recovery. This works with versions 5.7.22, 5.6.40 and later. It is worth noting that this isn’t documented as working for MariaDB (yet?). It arrived in MariaDB Server in 10.2.3.

Another MySQL change from Amazon? Aurora Serverless MySQL is now generally available. You can build and run applications without thinking about instances: previously, the database function was not all that focused on serverless. This on-demand auto-scaling serverless Aurora should be fun to use. Only Aurora MySQL 5.6 is supported at the moment and also, be aware that this is not available in all regions yet (e.g. Singapore).

Releases

  • pgmetrics is described as an open-source, zero-dependency, single-binary tool that can collect a lot of information and statistics from a running PostgreSQL server and display it in easy-to-read text format or export it as JSON for scripting.
  • PostgreSQL 10.5, 9.6.10, 9.5.14, 9.4.19, 9.3.24, and 11 Beta 3 have been released; they fix two security vulnerabilities, which may inspire an upgrade.

Link List

Industry Updates

  • Martin Arrieta (LinkedIn) is now a Site Reliability Engineer at Fastly. Formerly of Pythian and Percona.
  • Ivan Zoratti (LinkedIn) is now Director of Product Management at Neo4j. He was previously on founding teams, was the CTO of MariaDB Corporation (then SkySQL), and is a long time MySQL veteran.

Upcoming Appearances

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

 


Aug
16
2018
--

Work-Bench enterprise report predicts end of SaaS could be coming

Work-Bench, a New York City venture capital firm that spends a lot of time around Fortune 1000 companies, has put together The Work-Bench Enterprise Almanac: 2018 Edition, which you could think of as a State of the Enterprise report. It’s somewhat like Mary Meeker’s Internet Trends report, but with a focus on the tools and technologies that will be having a major impact on the enterprise in the coming year.

Perhaps the biggest take-away from the report is that the end of SaaS as we’ve known it could be coming if modern tools make it easier for companies to build software themselves. More on this later.

While the report writers state that their findings are based at least partly on anecdotal evidence, it is clearly an educated set of observations and predictions related to the company’s work with enterprise startups and the large companies they tend to target.

As they wrote in their Medium post launching the report, “Our primary aim is to help founders see the forest from the trees. For Fortune 1000 executives and other players in the ecosystem, it will help cut through the noise and marketing hype to see what really matters.” Whether that’s the case will be in the eye of the reader, but it’s a comprehensive attempt to document the state of the enterprise as they see it, and there are not too many who have done that.

The big picture

The report points out the broader landscape in which enterprise companies — startups and established players alike — are operating today. You have traditional tech companies like Cisco and HP, the mega cloud companies like Amazon, Microsoft and Google, the Growth Guard with companies like Snowflake, DataDog and Sumo Logic and the New Guard, those early stage enterprise companies gunning for the more established players.

 

As the report states, the mega cloud players are having a huge impact on the industry by providing the infrastructure services for startups to launch and grow without worrying about building their own data centers or scaling to meet increasing demand as a company develops.

The mega clouders also scoop up a fair number of startups. Yet they don’t devote quite the level of revenue to M&A as you might think based on how acquisitive the likes of Salesforce, Microsoft and Oracle have tended to be over the years. In fact, in spite of all the action and multi-billion deals we’ve seen, Work-Bench sees room for even more.

It’s worth pointing out that Work-Bench predicts Salesforce itself could become a target for mega cloud M&A action. They are predicting that either Amazon or Microsoft could buy the CRM giant. We saw such speculation several years ago and it turned out that Salesforce was too rich for even these companies’ blood. While they may have more cash to spend, the price has probably only gone up as Salesforce acquires more and more companies and its revenue has surpassed $10 billion.

About those mega trends

The report dives into 4 main areas of coverage, none of which are likely to surprise you if you read about the enterprise regularly in this or other publications:

  • Machine Learning
  • Cloud
  • Security
  • SaaS

All of these are really interconnected: SaaS is part of the cloud, all of them need security, and all will be (if they aren’t already) taking advantage of machine learning. Work-Bench is not seeing it in such simple terms, of course, diving into each area in detail.

The biggest take-away is perhaps that infrastructure could end up devouring SaaS in the long run. Software as a Service grew out of a couple of earlier trends, the first being the rise of the Web as a way to deliver software, then the rise of mobile to move it beyond the desktop. The cloud-mobile connection is well documented and allowed companies like Uber and Airbnb, as just a couple of examples, to flourish by providing scalable infrastructure and a computer in our pockets to access their services whenever we needed them. These companies could never have existed without the combination of cloud-based infrastructure and mobile devices.

End of SaaS dominance?

But today, Work-Bench is saying that we are seeing some other trends that could be tipping the scales back to infrastructure. That includes containers and microservices, serverless, Database as a Service and React for building front ends. Work-Bench argues that if every company is truly a software company, these tools could make it easier for companies to build these kind of services cheaply and easily, and possibly bypass the SaaS vendors.

What’s more, they suggest that if these companies are doing mass customization to these services, then it might make more sense to build instead of buy, at least on one level. In the past, we have seen what happens when companies try to take these kinds of massive software projects on themselves and it hardly ever ended well. They were usually bulky, difficult to update and put the companies behind the curve competitively. Whether simplifying the entire developer tool kit would change that remains to be seen.

They don’t necessarily see companies running wholesale away from SaaS just yet to do this, but they do wonder if developers could push this trend inside of organizations as more tools appear on the landscape to make it easier to build your own.

The remainder of the report goes in depth into each of these trends, and this article just has scratched the surface of the information you’ll find there. The entire report is embedded below.

Aug
16
2018
--

Cisco’s $2.35 billion Duo acquisition front and center at earnings call

When Cisco bought Ann Arbor, Michigan security company Duo for a whopping $2.35 billion earlier this month, it showed the growing value of security and security startups in the view of traditional tech companies like Cisco.

In yesterday’s earnings report, even before the ink had dried on the Duo acquisition contract, Cisco was reporting that its security business grew 12 percent year over year to $627 million. Given those numbers, the acquisition was top of mind in CEO Chuck Robbins’ comments to analysts.

“We recently announced our intent to acquire Duo Security to extend our intent-based networking portfolio into multi-cloud environments. Duo’s SaaS delivered solution will expand our cloud security capabilities to help enable any user on any device to securely connect to any application on any network,” he told analysts.

Indeed, security is going to continue to take center stage moving forward. “Security continues to be our customers number one concern and it is a top priority for us. Our strategy is to simplify and increase security efficacy through an architectural approach with products that work together and share analytics and actionable threat intelligence,” Robbins said.

That fits neatly with the Duo acquisition, whose guiding philosophy has been to simplify security. It is perhaps best known for its two-factor authentication tool. Often companies send a text with a code number to your phone after you change a password to prove it’s you, but even that method has proven vulnerable to attack.

What Duo does is send a message through its app to your phone asking if you are trying to sign on. You can approve if it’s you or deny if it’s not, and if you can’t get the message for some reason you can call instead to get approval. It can also verify the health of the app before granting access to a user. It’s a fairly painless and secure way to implement two-factor authentication, while making sure employees keep their software up-to-date.

Duo Approve/Deny tool in action on smartphone.

While Cisco’s security revenue accounted for a fraction of the company’s overall $12.8 billion for the quarter, the company clearly sees security as an area that could continue to grow.

Cisco hasn’t been shy about using its substantial cash holdings to expand in areas like security beyond pure networking hardware to provide a more diverse recurring revenue stream. The company currently has over $54 billion in cash on hand, according to Y Charts.

Cisco spent a fair amount of money on Duo, which according to reports has $100 million in annual recurring revenue, a number that is expected to continue to grow substantially. It had raised over $121 million in venture investment since inception. In its last funding round in September 2017, the company raised $70 million on a valuation of $1.19 billion.

The acquisition price ended up more than doubling that valuation. That could be because it’s a security company with recurring revenue, and Cisco clearly wanted it badly as another piece in its security solutions portfolio, one it hopes can help keep pushing that security revenue needle ever higher.

Aug
16
2018
--

MongoDB: how to use the JSON Schema Validator

JSON Schema Validator for MongoDB

The flexibility of MongoDB as a schemaless database is one of its strengths. In early versions, it was left to application developers to ensure that any necessary data validation is implemented. With the introduction of JSON Schema Validator there are new techniques to enforce data integrity for MongoDB. In this article, we use examples to show you how to use the JSON Schema Validator to introduce validation checks at the database level—and consider the pros and cons of doing so.

Why validate?

MongoDB is a schemaless database. This means that we don’t have to define a fixed schema for a collection. We just need to insert a JSON document into a collection and that’s all. Documents in the same collection can have a completely different set of fields, and even the same fields can have different types on different documents. The same object can be a string in some documents and can be a number in other documents.

The schemaless feature has given MongoDB great flexibility and the capability to adapt the database to the changing needs of applications. Let’s say that this flexibility is one of the main reasons to use MongoDB. Relational databases are not so flexible: you always need to define a schema first. Then, when you need to add new columns, create new tables or change the existing architecture to respond to the needs of the application, it can be a very hard task.

The real world can often be messy and MongoDB can really help, but in most cases the real world requires some kind of backbone architecture too. In real applications built on MongoDB there is always some kind of “fixed schema” or “validation rules” in collections and in documents. It’s possible to have in a collection two documents that represent two completely different things.

Well, it’s technically possible, but it doesn’t make sense in most cases for the application. Most of the arguments for enforcing a schema on the data are well known: schemas maintain structure, giving a clear idea of what’s going into the database, reducing preventable bugs and allowing for cleaner code. Schemas are a form of self-documenting code, as they describe exactly what type of data something should be, and they let you know what checks will be performed. It’s good to be flexible, but behind the scenes we need some strong regulations.

So, what we need to do is to find a balance between flexibility and schema validation. In real world applications, we need to define a sort of “backbone schema” for our data and retain the possibility to be flexible to manage specific particularities. In the past developers implemented schema validation in their applications, but starting from version 3.6, MongoDB supports the JSON Schema Validator. We can rely on it to define a fixed schema and validation rules directly into the database and free the applications to take care of it.

Let’s have a look at how it works.

JSON Schema Validator

In fact, a “Validation Schema” was already introduced in 3.2, but the new “JSON Schema Validator” introduced in the 3.6 release is by far the better and friendlier way to manage validation in MongoDB.

What we need to do is to define the rules using the operator $jsonSchema in the db.createCollection command. The $jsonSchema operator requires a JSON document where we specify all the rules to be applied on each inserted or updated document: for example what are the required fields, what type the fields must be, what are the ranges of the values, what pattern a specific field must have, and so on.

Let’s have a look at the following example where we create a collection people defining validation rules with JSON Schema Validator.

db.createCollection( "people" , {
   validator: { $jsonSchema: {
      bsonType: "object",
      required: [ "name", "surname", "email" ],
      properties: {
         name: {
            bsonType: "string",
            description: "required and must be a string" },
         surname: {
            bsonType: "string",
            description: "required and must be a string" },
         email: {
            bsonType: "string",
            pattern: "^.+\@.+$",
            description: "required and must be a valid email address" },
         year_of_birth: {
            bsonType: "int",
            minimum: 1900,
            maximum: 2018,
            description: "the value must be in the range 1900-2018" },
         gender: {
            enum: [ "M", "F" ],
            description: "can be only M or F" }
      }
   }
}})

Based on what we have defined, only 3 fields are strictly required in every document of the collection: name, surname, and email. In particular, the email field must have a specific pattern to be sure the content is a valid address. (Note: to validate an email address you need a more complex regular expression; here we use a simpler version just to check that there is an @ symbol.) Other fields are not required, but if someone inserts them, we have defined a validation rule.

Let’s try to do some example inserting documents to test if everything is working as expected.

Insert a document with one of the required fields missing:

MongoDB > db.people.insert( { name : "John", surname : "Smith" } )
    WriteResult({
      "nInserted" : 0,
      "writeError" : {
      "code" : 121,
      "errmsg" : "Document failed validation"
   }
})

Insert a document with all the required fields but with an invalid email address

MongoDB > db.people.insert( { name : "John", surname : "Smith", email : "john.smith.gmail.com" } )
   WriteResult({
      "nInserted" : 0,
      "writeError" : {
      "code" : 121,
      "errmsg" : "Document failed validation"
   }
})

Finally, insert a valid document

MongoDB > db.people.insert( { name : "John", surname : "Smith", email : "john.smith@gmail.com" } )
WriteResult({ "nInserted" : 1 })

Let’s now try some more inserts that include the other fields.

MongoDB > db.people.insert( { name : "Bruce", surname : "Dickinson", email : "bruce@gmail.com", year_of_birth : NumberInt(1958), gender : "M" } )
WriteResult({ "nInserted" : 1 })
MongoDB > db.people.insert( { name : "Corrado", surname : "Pandiani", email : "corrado.pandiani@percona.com", year_of_birth : NumberInt(1971), gender : "M" } )
WriteResult({ "nInserted" : 1 })
MongoDB > db.people.insert( { name : "Marie", surname : "Adamson", email : "marie@gmail.com", year_of_birth : NumberInt(1992), gender : "F" } )
WriteResult({ "nInserted" : 1 })

The records were inserted correctly because all the rules on the required fields, and on the other not required fields, were satisfied. Let’s see now a case where the year_of_birth or gender fields are not correct.

MongoDB > db.people.insert( { name : "Tom", surname : "Tom", email : "tom@gmail.com", year_of_birth : NumberInt(1980), gender : "X" } )
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
MongoDB > db.people.insert( { name : "Luise", surname : "Luise", email : "tom@gmail.com", year_of_birth : NumberInt(1899), gender : "F" } )
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})

In the first query gender is X, but the valid values are only M or F. In the second query year of birth is outside the permitted range.

Let’s try now to insert documents with arbitrary extra fields that are not in the JSON Schema Validator.

MongoDB > db.people.insert( { name : "Tom", surname : "Tom", email : "tom@gmail.com", year_of_birth : NumberInt(2000), gender : "M", shirt_size : "XL", preferred_band : "Coldplay" } )
WriteResult({ "nInserted" : 1 })
MongoDB > db.people.insert( { name : "Luise", surname : "Luise", email : "tom@gmail.com", gender : "F", shirt_size : "M", preferred_band : "Maroon Five" } )
WriteResult({ "nInserted" : 1 })
MongoDB > db.people.find().pretty()
{
"_id" : ObjectId("5b6b12e0f213dc83a7f5b5e8"),
"name" : "John",
"surname" : "Smith",
"email" : "john.smith@gmail.com"
}
{
"_id" : ObjectId("5b6b130ff213dc83a7f5b5e9"),
"name" : "Bruce",
"surname" : "Dickinson",
"email" : "bruce@gmail.com",
"year_of_birth" : 1958,
"gender" : "M"
}
{
"_id" : ObjectId("5b6b1328f213dc83a7f5b5ea"),
"name" : "Corrado",
"surname" : "Pandiani",
"email" : "corrado.pandiani@percona.com",
"year_of_birth" : 1971,
"gender" : "M"
}
{
"_id" : ObjectId("5b6b1356f213dc83a7f5b5ed"),
"name" : "Marie",
"surname" : "Adamson",
"email" : "marie@gmail.com",
"year_of_birth" : 1992,
"gender" : "F"
}
{
"_id" : ObjectId("5b6b1455f213dc83a7f5b5f0"),
"name" : "Tom",
"surname" : "Tom",
"email" : "tom@gmail.com",
"year_of_birth" : 2000,
"gender" : "M",
"shirt_size" : "XL",
"preferred_band" : "Coldplay"
}
{
"_id" : ObjectId("5b6b1476f213dc83a7f5b5f1"),
"name" : "Luise",
"surname" : "Luise",
"email" : "tom@gmail.com",
"gender" : "F",
"shirt_size" : "M",
"preferred_band" : "Maroon Five"
}

As we can see, we have the flexibility to add new fields with no restrictions on the permitted values.

Having a really fixed schema

The behavior we have seen so far, permitting the addition of extra fields that are not in the validation rules, is the default. If we would like to be more restrictive and have a really fixed schema for the collection, we need to add the additionalProperties: false parameter in the createCollection command.

In the following example, we create a validator to permit only the required fields. No other extra fields are permitted.

db.createCollection( "people2" , {
   validator: {
     $jsonSchema: {
        bsonType: "object",
        additionalProperties: false,
        properties: {
           _id : {
              bsonType: "objectId" },
           name: {
              bsonType: "string",
              description: "required and must be a string" },
           age: {
              bsonType: "int",
              minimum: 0,
              maximum: 100,
              description: "required and must be in the range 0-100" }
        }
     }
}})

Note a couple of differences:

  • we don’t need to specify the list of required fields; with additionalProperties: false, only the fields listed under properties are permitted
  • we need to list the _id field explicitly as well

As you can notice in the following test, we are no longer allowed to add extra fields. We can only insert documents with the declared fields name and age (plus _id).

MongoDB > db.people2.insert( {name : "George", age: NumberInt(30)} )
WriteResult({ "nInserted" : 1 })
MongoDB > db.people2.insert( {name : "Maria", age: NumberInt(35), surname: "Peterson"} )
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})

In this case we lose flexibility, and flexibility is the main benefit of having a NoSQL database like MongoDB.

Well, it’s up to you to use it or not. It depends on the nature and goals of your application. I wouldn’t recommend it in most cases.

Add validation to existing collections

We have seen so far how to create a new collection with validation rules, but what about existing collections? How can we add rules?

This is quite trivial. The syntax to use in $jsonSchema remains the same, we just need to use the collMod command instead of createCollection. The following example shows how to create validation rules on an existing collection.

First we create a simple new collection people3, inserting some documents.

MongoDB > db.people3.insert( {name: "Corrado", surname: "Pandiani", year_of_birth: NumberLong(1971)} )
WriteResult({ "nInserted" : 1 })
MongoDB > db.people3.insert( {name: "Tom", surname: "Cruise", year_of_birth: NumberLong(1961), gender: "M"} )
WriteResult({ "nInserted" : 1 })
MongoDB > db.people3.insert( {name: "Kevin", surname: "Bacon", year_of_birth: NumberLong(1964), gender: "M", shirt_size: "L"} )
WriteResult({ "nInserted" : 1 })

Let’s create the validator.

MongoDB > db.runCommand( { collMod: "people3",
   validator: {
      $jsonSchema : {
         bsonType: "object",
         required: [ "name", "surname", "gender" ],
         properties: {
            name: {
               bsonType: "string",
               description: "required and must be a string" },
            surname: {
               bsonType: "string",
               description: "required and must be a string" },
            gender: {
               enum: [ "M", "F" ],
               description: "required and must be M or F" }
         }
       }
},
validationLevel: "moderate",
validationAction: "warn"
})

The two new options validationLevel and validationAction are important in this case.

validationLevel can have the following values:

  • “off”: validation is not applied
  • “strict”: it’s the default value. Validation applies to all inserts and updates
  • “moderate”: validation applies to inserts and to updates of existing documents that already satisfy the validation rules. Updates to documents that do not satisfy them are not checked.

When creating validation rules on existing collections, the “moderate” value is the safest option.

validationAction can have the following values:

  • “error”: it’s the default value. The document must pass the validation in order to be written
  • “warn”: a document that doesn’t pass the validation is written but a warning message is logged

When adding validation rules to an existing collection, the safest option is “warn”.

These two options can be applied even with createCollection. We didn’t use them because the default values are good in most of the cases.
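
Once any non-conforming documents have been cleaned up, you may want to tighten the settings back to the defaults. A minimal sketch of how that could look with collMod, assuming the people3 collection from above:

MongoDB > db.runCommand( { collMod: "people3",
   validationLevel: "strict",
   validationAction: "error"
} )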

How to investigate a collection definition

In case we want to see how a collection was defined, and, in particular, what the validator rules are, the command db.getCollectionInfos() can be used. The following example shows how to investigate the “schema” we have created for the people collection.

MongoDB > db.getCollectionInfos( {name: "people"} )
[
  {
    "name" : "people",
    "type" : "collection",
    "options" : {
      "validator" : {
        "$jsonSchema" : {
          "bsonType" : "object",
          "required" : [
            "name",
            "surname",
            "email"
          ],
          "properties" : {
            "name" : {
              "bsonType" : "string",
              "description" : "required and must be a string"
            },
            "surname" : {
              "bsonType" : "string",
              "description" : "required and must be a string"
            },
            "email" : {
              "bsonType" : "string",
              "pattern" : "^.+@.+$",
              "description" : "required and must be a valid email address"
             },
             "year_of_birth" : {
               "bsonType" : "int",
               "minimum" : 1900,
               "maximum" : 2018,
               "description" : "the value must be in the range 1900-2018"
             },
             "gender" : {
               "enum" : [
                 "M",
                 "F"
               ],
             "description" : "can be only M or F"
        }
      }
    }
  }
},
"info" : {
  "readOnly" : false,
  "uuid" : UUID("5b98c6f0-2c9e-4c10-a3f8-6c1e7eafd2b4")
},
"idIndex" : {
  "v" : 2,
  "key" : {
    "_id" : 1
  },
"name" : "_id_",
"ns" : "test.people"
}
}
]

Limitations and restrictions

Validators cannot be defined for collections in the following databases: admin, local, config.

Validators cannot be defined for system.* collections.

A limitation in the current implementation of JSON Schema Validator is that the error messages are not very good at helping you understand which of the rules is not satisfied by the document. You have to work this out manually by doing some tests, and that’s not so easy when dealing with complex documents. Having more specific error strings, hopefully taken from the validator definition, could be very useful when debugging application errors and warnings. This is definitely something that should be improved in future releases.

While waiting for improvements, someone has developed a wrapper for the mongo client to gather more defined error strings. You can have a look at https://www.npmjs.com/package/mongo-schemer. You can test it and use it, but pay attention to the clause “Running in prod is not recommended due to the overhead of validating documents against the schema“.

Conclusions

Doing schema validation in the application remains, in general, a best practice, but JSON Schema Validator is a good tool to enforce validation directly into the database.

Even though it needs some improvements, the JSON Schema feature is good enough for most common cases. We suggest you test it and use it when you really need to create a backbone structure for your data.


Aug
15
2018
--

Webinar Thurs 16/8: Developing an App on MongoDB: Tips and Tricks

Please join Percona’s Sr. Technical Operations Architect Tim Vaillancourt as he presents Developing an App on MongoDB: Tips and Tricks on Thursday, August 16th, 2018, at 10:00 AM PDT (UTC-7) / 1:00 PM EDT (UTC-4).

A lot of developers prefer using MongoDB to other open source databases when developing applications. But why? How do you work with MongoDB to create a well-functioning application?

This webinar will help developers understand what MongoDB does and how it processes requests from applications.

In this webinar, we will cover:

  • Data, Queries and Indexes
  • Using indices efficiently
  • Reducing index and storage size with correct data types
  • The aggregation framework
  • Using the Explain and Operation Profiling features
  • MongoDB features to avoid
  • Using Read and Write Concerns for Integrity
  • Performance
  • Scaling read queries using Read Preference
  • What is MongoDB Sharding?
  • Using Percona Monitoring and Management (PMM) to visualize database usage
  • MongoDB users and built-in roles for application security
  • Using SRV DNS record support

By the end of the lesson, you will know how to avoid common problems with MongoDB in the application stage, instead of fixing them in production.

Register Now


Aug
15
2018
--

RunSafe could eliminate an entire class of infrastructure malware attacks

RunSafe, a McLean, Virginia startup, got started doing research for DARPA on how to defend critical infrastructure. They built a commercial product based on that initial research that they claim eliminates an entire class of attacks. Today, the company released a product called Alkemist that enables customers to install the solution without help from RunSafe.

RunSafe co-founder and CEO Joe Saunders says that the product began with the DoD research and a simple premise: “If you assume hardware in the supply chain is compromised, can you still build trusted software on top of untrusted hardware? And so we came up with techniques that we have since greatly expanded to protect the software from compromise. We eliminate an entire class of attacks and greatly reduce the attack surface for software across critical infrastructure,” he told TechCrunch.

Saunders uses a data center cooling system as an example. If someone were able to control the cooling systems, they could cause the whole data center to overheat in order to shut it down. RunSafe is designed to prevent that from happening whether it’s a data center, a power plant or water works.

The way they do this is by hardening the software binary so malware and exploitations can’t find the tools they need to execute across the infrastructure. In the data center example, that means the attacker could find their way in, and attack a single machine, but couldn’t replicate the attack across multiple machines.

“They’re looking for functions and memory and different things that they can use in their exploitation. What we do is we make it very difficult for the attack tool to find that information, and without the ability to find the memory or the functions, they can’t execute their attack,” he said.

He says that they do this by making every instance “functionally identical but logically unique” by relocating where functions and memory exist at a low level in the software. “When an exploit is looking for memory or function to exploit the software product, it can’t locate them,” Saunders said. And that makes it practically impossible to move across the system, he explained.

He points out this is a far different approach from how most security vendors approach the problem. “Other solutions that are leveraging intrusion detection or monitoring or analytics are detecting when there’s a compromise, but they’re not solving the problem — you still can be breached and the exploit can still execute. We’re eliminating the exploit,” he said.

The company works with hardware manufacturers to install their solution at the factory before they get deployed, and with customers like data center operators to protect their critical infrastructure. Prior to the release of Alkemist, the installation required some hand-holding from RunSafe. With today’s release, the customer can install the product themselves and that could increase their customer base.

RunSafe launched at the end of 2015 and released the first version of the product last year. They currently count a dozen customers and are protecting hundreds of thousands machines across their customer base and expect to cross one million protected machines by the end of the year, according to Saunders.

The company has raised $2.4 million in seed investment.

Aug
15
2018
--

Oracle open sources Graphpipe to standardize machine learning model deployment

Oracle, a company not exactly known for having the best relationship with the open source community, is releasing a new open source tool today called Graphpipe, which is designed to simplify and standardize the deployment of machine learning models.

Graphpipe consists of a set of libraries and tools for following the standard.

Vish Abrams, whose background includes helping develop OpenStack at NASA and later helping launch Nebula, an OpenStack startup in 2011, is leading the project. He says as his team dug into the machine learning workflow, they found a gap. While teams spend lots of energy developing a machine learning model, it’s hard to actually deploy the model for customers to use. That’s where Graphpipe comes in.

He points out that it’s common with newer technologies like machine learning for people to get caught up in the hype. Even though the development process keeps improving, he says that people often don’t think about deployment.

“Graphpipe is what’s grown out of our attempt to really improve deployment stories for machine learning models, and to create an open standard around having a way of doing that to improve the space,” Abrams told TechCrunch.

As Oracle dug into this, they identified three main problems. For starters, there is no standard way to serve APIs, leaving you to use whatever your framework provides. Next, there is no standard deployment mechanism, which leaves developers to build custom ones every time. Finally, they found existing methods leave performance as an afterthought, which in machine learning could be a major problem.

“We created Graphpipe to solve these three challenges. It provides a standard, high-performance protocol for transmitting tensor data over the network, along with simple implementations of clients and servers that make deploying and querying machine learning models from any framework a breeze,” Abrams wrote in a blog post announcing the release of Graphpipe.

The company decided to make this a standard and to open source it to try and move machine learning model deployment forward. “Graphpipe sits on that intersection between solving a business problem and pushing the state of the art forward, and I think, personally, the best way to do that is by having an open source approach. Often, if you’re trying to standardize something without going for the open source bits, what you end up with is a bunch of competing technologies,” he said.

Abrams acknowledged the tension that has existed between Oracle and the open source community over the years, but says they have been working to change the perception recently with contributions to Kubernetes and Oracle FN, their open source Serverless Functions Platform as examples. Ultimately he says, if the technology is interesting enough, people will give it a chance, regardless of who is putting it out there. And of course, once it’s out there, if a community builds around it, they will adapt and change it as open source projects tend to do. Abrams hopes that happens.

“We care more about the standard becoming quite broadly adopted than we do about our particular implementation of it, because that makes it easier for everyone. It’s really up to the community to decide that this is valuable and interesting,” he said.

Graphpipe is available starting today on the Oracle GitHub Graphpipe page.

Aug
15
2018
--

Twistlock snares $33 million Series C investment to secure cloud native environments

As the world shifts to a cloud native approach, the way you secure applications as they get deployed is changing too. Twistlock, a company built from the ground up to secure cloud native environments, announced a $33 million Series C round today led by Iconiq Capital.

Previous investors YL Ventures, TenEleven, Rally Ventures, Polaris Partners and Dell Technologies Capital also participated in the round. The company reports it has received a total of $63 million in venture investment to date.

Twistlock is solving a hard problem around securing containers and serverless, which are by their nature ephemeral. They can live for fractions of a second, making it hard to track problems when they happen. According to company CEO and co-founder Ben Bernstein, his company came out of the gate building a security product designed to protect a cloud-native environment with the understanding that while containers and serverless computing may be ephemeral, they are still exploitable.

“It’s not about how long they live, but about the fact that the way they live is more predictable than a traditional computer, which could be running for a very long time and might have humans actually using it,” Bernstein said.

Screenshot: Twistlock

As companies move to a cloud native environment using Dockerized containers and managing them with Kubernetes and other tools, they create a highly automated system to deal with the deployment volume. While automation simplifies deployment, it can also leave companies vulnerable to a host of issues. For example, if a malicious actor were to get control of the process via a code injection attack, they could cause a lot of problems without anyone knowing about it.

Twistlock is built to help prevent that, while also helping customers recognize when an exploit happens and performing forensic analysis to figure out how it happened.

It’s not a traditional Software as a Service as we’ve come to think of it. Instead, it is a service that gets installed on whatever public or private cloud the customer is using. So far, they count just over 200 customers, including Walgreens and Aetna and a slew of other companies you would definitely recognize but which they couldn’t name publicly.

The company, which was founded in 2015, is based in Portland, Oregon with their R&D arm in Israel. They currently have 80 employees. Bernstein said from a competitive standpoint, the traditional security vendors are having trouble reacting to cloud native, and while he sees some startups working at it, he believes his company has the most mature offering, at least for now.

“We don’t have a lot of competition right now, but as we start progressing we will see more,” he said. He plans to use the money they receive today to help expand their marketing and sales arm to continue growing their customer base, but also engineering to stay ahead of that competition as the cloud-native security market continues to develop.
