Apr
18
2018
--

Restore a MongoDB Logical Backup

In this article, we will explain how to restore a MongoDB logical backup performed via ‘mongodump’ to a mongod instance.

MongoDB logical backup requires the use of the ‘mongorestore‘ tool to perform the restore. This article focuses on that tool and process.

Note: Percona develops a backup tool named Percona-Lab/mongodb-consistent-backup, which is a wrapper for ‘mongodump‘, adding cluster-wide backup consistency. The backups created by mongodb_consistent_backup (in Dump/Mongodump mode) can be restored using the exact same steps as a regular ‘mongodump’ backup – no special steps!

Mongorestore Command Flags

--host/--port (and --user/--password)

Required, even if you’re using the default host/port (localhost:27017). If authorization is enabled, add --user/--password flags also.

--drop

This is almost always required. It causes ‘mongorestore‘ to drop the collection that is being restored before restoring it. Without this flag, the documents from the backup are inserted one at a time and, if they already exist, the restore fails.

--oplogReplay

This is almost always required. Replays the oplog that was dumped by mongodump. It is best to include this flag on replset-based backups unless there is a specific reason not to. You can tell if the backup was from a replset by looking for the file ‘oplog.bson‘ at the base of the dump directory.
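
As a quick check, you can list the file directly (the path below reuses the example backup from the steps later in this article); if it exists, the dump includes an oplog:

    $ ls /opt/mongodb/backup/testbackup/20160809_1306/test1/dump/oplog.bson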

--dir

Required. The path to the mongodump data.

--gzip

Optional. For mongodump >= 3.2, enables inline decompression of the gzipped dump files during the restore. This is required if ‘mongodump‘ used the --gzip flag (look for *.bson.gz files if you’re not sure; if the collection files have no .gz suffix, don’t use --gzip).
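
A quick listing of one of the database subdirectories shows whether the collection dumps are compressed (this reuses the example backup un-tarred in the steps below, where the files are gzipped):

    $ ls /opt/mongodb/backup/testbackup/20160809_1306/test1/dump/wikipedia/
    pages.bson.gz  pages.metadata.json.gz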

--numParallelCollections=<number>

Optional. For mongodump >= 3.2 only, sets the number of collections to insert in parallel. By default four threads are used; if you have a large server and want to restore faster (at the cost of more resource usage), you can increase this number. Note that each thread decompresses BSON if the ‘--gzip‘ flag is used, so consider this when raising this number.
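
For example, a sketch that raises the parallelism to a hypothetical value of 8, reusing the example backup path and connection details from the steps below:

    $ mongorestore --drop --oplogReplay --gzip --numParallelCollections=8 --host localhost --port 27017 --dir /opt/mongodb/backup/testbackup/20160809_1306/test1/dump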

Steps

  1. (Optional) If the backup is archived (mongodb_consistent_backup defaults to creating tar archives), un-archive the backup so that ‘mongorestore‘ can access the .bson/.bson.gz files:
    $ tar -C /opt/mongodb/backup/testbackup/20160809_1306 -xvf /opt/mongodb/backup/testbackup/20160809_1306/test1.tar
    test1/
    test1/dump/
    test1/dump/wikipedia/
    test1/dump/wikipedia/pages.metadata.json.gz
    test1/dump/wikipedia/pages.bson.gz
    test1/dump/oplog.bson

    ** This command un-tars the backup to ‘/opt/mongodb/backup/testbackup/20160809_1306/test1/dump’ **

  2. Check (and then check again!) that you’re restoring the right backup to the right host. When in doubt, it is safer to ask the customer or others.
    1. The Percona ‘mongodb_consistent_backup‘ tool names backup subdirectories by replica set name, so you can ensure you’re restoring the right backup by checking the replica set name of the node you’re restoring to, if it exists.
    2. If you’re restoring to a replica set, you will need to restore to the PRIMARY member, and a majority of the members must be available so that writes are accepted (there are some exceptions if you override the write concern, but that is not advised). A quick shell check for the set name and PRIMARY state is sketched below.
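    A minimal sketch of that check from the mongo shell (the host/port are the same example values used elsewhere in this article; add --username/--password if authorization is enabled):
    $ mongo --host localhost --port 27017 --eval 'printjson(db.isMaster())'

    ** In the output, ‘setName‘ is the replica set name and ‘ismaster: true‘ means you are connected to the PRIMARY. **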
  3. Use ‘mongorestore‘ to restore the data by dropping/restoring each collection (--drop flag) and replaying the oplog changes (--oplogReplay flag), specifying the restore dir explicitly (--dir flag) to the ‘mongorestore‘ command. In this example I also used authorization (--user/--password flags) and un-compression (--gzip flag):
    $ mongorestore --drop --host localhost --port 27017 --user secret --password secret --oplogReplay --gzip --dir /opt/mongodb/backup/testbackup/20160809_1306/test1/dump
    2016-08-09T14:23:04.057+0200    building a list of dbs and collections to restore from /opt/mongodb/backup/testbackup/20160809_1306/test1/dump dir
    2016-08-09T14:23:04.065+0200    reading metadata for wikipedia.pages from /opt/mongodb/backup/testbackup/20160809_1306/test1/dump/wikipedia/pages.metadata.json.gz
    2016-08-09T14:23:04.067+0200    restoring wikipedia.pages from /opt/mongodb/backup/testbackup/20160809_1306/test1/dump/wikipedia/pages.bson.gz
    2016-08-09T14:23:07.058+0200    [#######.................]  wikipedia.pages  63.9 MB/199.0 MB  (32.1%)
    2016-08-09T14:23:10.058+0200    [###############.........]  wikipedia.pages  127.7 MB/199.0 MB  (64.1%)
    2016-08-09T14:23:13.060+0200    [###################.....]  wikipedia.pages  160.4 MB/199.0 MB  (80.6%)
    2016-08-09T14:23:16.059+0200    [#######################.]  wikipedia.pages  191.5 MB/199.0 MB  (96.2%)
    2016-08-09T14:23:19.071+0200    [########################]  wikipedia.pages  223.5 MB/199.0 MB  (112.3%)
    2016-08-09T14:23:22.062+0200    [########################]  wikipedia.pages  255.6 MB/199.0 MB  (128.4%)
    2016-08-09T14:23:25.067+0200    [########################]  wikipedia.pages  271.4 MB/199.0 MB  (136.4%)
    ...
    ...
    2016-08-09T14:24:19.058+0200    [########################]  wikipedia.pages  526.9 MB/199.0 MB  (264.7%)
    2016-08-09T14:24:22.058+0200    [########################]  wikipedia.pages  558.9 MB/199.0 MB  (280.8%)
    2016-08-09T14:24:23.521+0200    [########################]  wikipedia.pages  560.6 MB/199.0 MB  (281.6%)
    2016-08-09T14:24:23.522+0200    restoring indexes for collection wikipedia.pages from metadata
    2016-08-09T14:24:23.528+0200    finished restoring wikipedia.pages (32725 documents)
    2016-08-09T14:24:23.528+0200    replaying oplog
    2016-08-09T14:24:23.597+0200    done
    1. If you encounter problems with ‘mongorestore‘, carefully read the error message or rerun with several ‘-v‘ flags, e.g.: ‘-vvv‘. Once you have an error, attempt to troubleshoot the cause.
  4. Check to see that you saw “replaying oplog” and “done” after the restore (last two lines in the example). If you don’t see this, there is a problem.

As you can see, using this tool to restore a MongoDB logical backup is very simple. However, note that when backing up a sharded cluster, --oplog is not available and mongodump uses the primaries of each shard. As hitting the primaries is typically not advised in production, you might consider Percona-Lab/mongodb-consistent-backup, which ensures cluster-wide consistency and hits secondary nodes, just as mongodump can with replica sets.

If MongoDB and topics like this interest you, please see the document below; we are hiring!

{
  hiring: true,
  role: "Consultant",
  tech: "MongoDB",
  location: "USA",
  moreInfo: "https://www.percona.com/about-percona/careers/mongodb-consultant-usa-based"
}

Apr
18
2018
--

Stripe debuts Radar anti-fraud AI tools for big businesses, says it has halted $4B in fraud to date

Cybersecurity continues to be a growing focus and problem in the digital world, and now Stripe is launching a new paid product that it hopes will help its customers better battle one of the bigger side-effects of data breaches: online payment fraud. Today, Stripe is announcing Radar for Fraud Teams, an expansion of its free AI-based Radar service that runs alongside Stripe’s core payments API to help identify and block fraudulent transactions.

And there are further efforts that Stripe is planning in coming months. Michael Manapat, Stripe’s engineering manager for Radar and machine learning, said the company is going to soon launch a private beta of a “dynamic authentication” that will bring in two-factor authentication. This is on top of Stripe’s first forays into using biometric factors in payments, made via partners like Apple and Google. With these and others, fingerprints and other physical attributes have become increasingly popular ways to identify mobile and other users.

The initial iteration of Radar launched in October 2016, and since then, Manapat tells me that it has prevented $4 billion in fraud for its “hundreds of thousands” of customers.

Considering the wider scope of how much e-commerce is affected by fraud — one study estimates $57.8 billion in e-commerce fraud across eight major verticals in a one-year period between 2016 and 2017 — this is a decent dent, but there is a lot more work to be done. And Stripe’s position of knowing four out of every five payment card numbers globally (on account of the ubiquity of its payments API) gives it a strong position to be able to tackle it.

The new paid product comes alongside an update to the core, free product that Stripe is dubbing Radar 2.0, which Stripe claims will have more advanced machine learning built into it and can therefore up its fraud detection by some 25 percent over the previous version.

New features for the whole product (free and paid) will include being able to detect when a proxy VPN is being used (which fraudsters might use to appear like they are in one country when they are actually in another) and ingesting billions of data points to train its model, which is now being updated on a daily basis automatically — itself an improvement on the slower and more manual system that Manapat said Stripe has been using for the past couple of years.

Meanwhile, the paid product is an interesting development.

At the time of the original launch, Stripe co-founder John Collison hinted that the company would be considering a paid product down the line. Stripe has said multiple times that it’s in no rush to go public (a statement that a spokesperson reiterated this week), but it’s notable that a paid tier is a sign of how Stripe is slowly building up more monetization and revenue generation.

Stripe is valued at around $9.2 billion as of its last big round in 2016. Most recently, it raised $150 million back in that November 2016 round. A $44 million round from March of this year, noted in Pitchbook, was actually related to issuing stock connected to its quiet acquisition of point-of-sale payments startup Index in that month — incidentally another interesting move for Stripe to expand its position and placement in the payments ecosystem. Stripe has raised around $450 million in total.

The Teams product, aimed at businesses that are big enough to have dedicated fraud detection staff, will be priced at an additional $0.02 per transaction, on top of Stripe’s basic transaction fees of a 2.9 percent commission plus 30 cents per successful card charge in the U.S. (fees vary in other markets).

The chief advantage of taking the paid product will be that teams will be able to customise how Radar works with their own transactions.

This will include a more complete set of data for teams that review transactions, and a more granular set of tools to determine where and when sales are reviewed, for example based on usage patterns or the size of the transaction. There is already a set of flags that work to note when a card is used in frequent succession across disparate geographies; but Manapat said that newer details, such as analysing the speed at which payment details are entered and purchases are made, will now also factor into how it flags transactions for review.

Similarly, teams will be able to determine the value at which a transaction needs to be flagged. This is the online equivalent of how certain in-person purchases require you to enter a PIN or provide a signature to seal the deal, while others waive that requirement. (And it’s interesting to see that some e-commerce operations are potentially allowing some dodgy sales to happen simply to keep up the user experience for the majority of legitimate transactions.)

Users of the paid product will also now be able to use Radar to help with their overall management of how they handle fraud. This will include being able to keep lists of attributes, names and numbers that are scrutinised, and to check against them with analytics also created by Stripe to help identify trending issues, and to plan anti-fraud activities going forward.

Updated with further detail about Stripe’s funding.

Apr
18
2018
--

Webinar Thursday, April 19, 2018: Running MongoDB in Production, Part 1

Please join Percona’s Senior Technical Operations Architect, Tim Vaillancourt, as he presents Running MongoDB in Production, Part 1 on Thursday, April 19, 2018, at 10:00 am PDT (UTC-7) / 1:00 pm EDT (UTC-4).

Are you a seasoned MySQL DBA that needs to add MongoDB to your skills? Are you used to managing a small environment that runs well, but want to know what you might not know yet? This webinar helps you with running MongoDB in production environments.

MongoDB works well, but when it has issues, the number one question is “where should I go to solve a problem?”

This tutorial will cover:

Backups
– Logical vs Binary-level backups
– Sharding and Replica-Set Backup strategies
Security
– Filesystem and Network Security
– Operational Security
– External Authentication features of Percona Server for MongoDB
– Securing connections with SSL and MongoDB Authorization
– Encryption at Rest
– New Security features in 3.6
Monitoring
– Monitoring Strategy
– Important metrics to monitor in MongoDB and Linux
– Percona Monitoring and Management

Register for the webinar now.

Part 2 of this series will take place on Thursday, April 26, 2018, at 10:00 am PDT (UTC-7) / 1:00 pm EDT (UTC-4). Register for the second part of this series here.

Timothy Vaillancourt, Senior Technical Operations Architect

Tim joined Percona in 2016 as Sr. Technical Operations Architect for MongoDB, with the goal to make the operations of MongoDB as smooth as possible. With experience operating infrastructures in industries such as government, online marketing/publishing, SaaS and gaming combined with experience tuning systems from the hard disk all the way up to the end-user, Tim has spent time in nearly every area of the modern IT stack with many lessons learned. Tim is based in Amsterdam, NL and enjoys traveling, coding and music.

Prior to Percona, Tim was the Lead MySQL DBA of Electronic Arts’ DICE studios, helping some of the largest games in the world (the “Battlefield” series, the “Mirror’s Edge” series and “Star Wars: Battlefront”) launch and operate smoothly, while also leading the automation of MongoDB deployments for EA systems. Before the role of DBA at EA’s DICE studio, Tim served as a subject matter expert in NoSQL databases, queues and search on the Online Operations team at EA SPORTS. Before moving to the gaming industry, Tim served as a Database/Systems Admin operating a large MySQL-based SaaS infrastructure at AbeBooks/Amazon Inc.

Apr
18
2018
--

Cloud Foundry Foundation looks east as Alibaba joins as a gold member

Cloud Foundry is among the most successful open source projects in the enterprise right now. It’s a cloud-agnostic platform-as-a-service offering that helps businesses develop and run their software more efficiently. In many enterprises, it’s now the standard platform for writing new applications. Indeed, half of the Fortune 500 companies now use it in one form or another.

With the imminent IPO of Pivotal, which helped birth the project and still sits at the core of its ecosystem, Cloud Foundry is about to get its first major moment in the spotlight outside of its core audience. Over the course of the last few years, though, the project and the foundation that manages it have also received the sponsorship of companies like Cisco, IBM, SAP, SUSE, Google, Microsoft, Ford, Volkswagen and Huawei.

Today, China’s Alibaba Group is joining the Cloud Foundry Foundation as a gold member. Compared to AWS, Azure and Google Cloud, the Alibaba Cloud gets relatively little press, but it’s among the largest clouds in the world. Starting today, Cloud Foundry is also available on the Alibaba Cloud, with support for both the Cloud Foundry application and container runtimes.

Cloud Foundry CTO Chip Childers told me that he expects Alibaba to become an active participant in the open source community. He also noted that Cloud Foundry is seeing quite a bit of growth in China — a sentiment that I’ve seen echoed by other large open source projects, including the likes of OpenStack.

Open source is being heavily adopted in China and many companies are now trying to figure out how to best contribute to these kinds of projects. Joining a foundation is an obvious first step. Childers also noted that many traditional enterprises in China are now starting down the path of digital transformation, which is driving the adoption of both open source tools and the cloud in general.

Apr
18
2018
--

Cloud.gov makes Cloud Foundry easier to adopt for government agencies

At the Cloud Foundry Summit in Boston, the team behind the U.S. government’s cloud.gov application platform announced that it is now a certified Cloud Foundry platform that is guaranteed to be compatible with other certified providers, like Huawei, IBM, Pivotal, SAP and — also starting today — SUSE. With this, cloud.gov becomes the first government agency to become Cloud Foundry-certified.

The point behind the certification is to ensure that all of the various platforms that support Cloud Foundry are compatible with each other. In the government context, this means that agencies can easily move their workloads between clouds (assuming they have all the necessary government certifications in place). But what’s maybe even more important is that it also ensures skills portability, which should make hiring and finding contractors easier for these agencies. Given that the open source Cloud Foundry project has seen quite a bit of adoption in the private sector, with half of the Fortune 500 companies using it, that’s often an important factor for deciding which platform to build on.

From the outset, cloud.gov, which was launched by the General Services Administration’s 18F office to improve the U.S. government’s public-facing websites and applications, was built on top of Cloud Foundry. Similar agencies in Australia and the U.K. have made the same decision to standardize on the Cloud Foundry platform. Cloud Foundry launched its certification program a few years ago; last year it added another program for certifying the skills of individual developers.

To be able to run government workloads, a cloud platform has to offer a certain set of security requirements. As Cloud Foundry Foundation CTO Chip Childers told me, the work 18F did to get the FedRAMP authorization for cloud.gov helped bring better controls to the upstream project, too, and he stressed that all of the governments that have adopted the platform have contributed to the overall project.

Apr
18
2018
--

Squarefoot raises $7M to give offices an easier way to find space

While smaller companies are seeing a lot of new options for distributed office space, or can pick up a couple offices in a WeWork, eventually they get big enough and have to find a bigger office — but that can end up as one of the weirdest and most annoying challenges for an early-stage CEO.

Finding that space is a whole other story, outside of just searching on Google and crossing your fingers. It’s why Jonathan Wasserstrum started Squarefoot, which looks to not only create a hub for these vacant offices, but also have the systems in place — including brokers — to help companies eventually land that office space. As they grow, companies eventually have to graduate into larger and larger spots, but there’s a missing sweet spot for mid-stage companies that are looking for space but don’t necessarily have the relationships with those big office brokers just yet, and instead are just looking through a friend of a friend. The company said today that it has raised $7 million in a new financing round led by Rosecliff Ventures, with RRE Ventures, Triangle Peak Partners, Armory Square Ventures, and others participating.

“If you talk to any CEO and you ask what they think about commercial real estate brokers, they’ll say, ‘oh, the guys that send an email every week,’” co-founder Jonathan Wasserstrum said. “The industry has been slow to adopt because the average person who owns the building is fine. They don’t wake up every morning and say this process sucks. But the people who wake up and say the process sucks are looking for space. That was kind of one of the early things that we kind of figured out and focused a lot of attention on aggregating that tenant demand.”

Squarefoot starts off on the buyer side as an aggregation platform that localizes open office space into one spot. While companies used to have to Google search something along the lines of “Chelsea office space” in New York — especially for early-stage companies that are just starting to outgrow their early offices — the goal is to always have Squarefoot come up as a result for that. It already happens thanks to a lot of efforts on the marketing front, but eventually with enough inventory and demand the hope is that building owners will be coming to Squarefoot in the first place. (That you see an ad for Squarefoot as a result for a lot of these searches already is, for example, no accident.)

Squarefoot is also another company that is adopting a sort of hybrid model that includes both a set of tools and algorithms to aggregate together all that space into one spot, but keep consultants and brokers in the mix in order to actually close those deals. It’s a stance that the venture community seems to be increasingly softening on as more and more companies launch with the idea that the biggest deals need to have an actual human on the other end in order to manage that relationship.

“We’re not trying to remove brokers, we have them on staff, we think there’s a much better way to go through the process,” Wasserstrum said. “When I am buying a ticket to Chicago, I’m fine going to Kayak and I don’t need a travel agent. But when I’m the CEO of a company and about to sign a three-year lease that’s a $1.5 million liability, and I’ve never done this before, shouldn’t I want someone to help me out? I do not see in the near future this e-commerce experience for commercial real estate. You don’t put it in your shopping cart.”

And, to be sure, there are a lot of platforms that already focus on the consumer side, like Redfin for home search. But this is a big market, and there already is some activity — it just hasn’t picked up a ton of traction just yet because it is a slog to get everything all in one place. One of the original examples is 42Floors, but even then that company early on faced a lot of troubles trying to get the model working and in 2015 cut its brokerage team. That’s not a group of people Wasserstrum is looking to leave behind, simply because the end goal is to actually get these companies signing leases and not just serving as a search engine.

Apr
18
2018
--

Wonolo picks up $13M to create a way to connect temp workers with companies

AJ Brustein was out spending time with a member of his merchandising team when a nearby store ran out of stock of some goods — but there was no one on staff responsible for that location. Fortunately, the employee he was with had already showed him how to restock the shelves, and he offered to peel off and do it himself.

But that gap in the workforce may have just continued, leading directly to potential lost revenue for companies that sell products in those stores. That’s why Brustein and Yong Kim started Wonolo, a tool to connect companies with temporary workers in order to fill the unexpected demand those companies might face in those same out-of-stock situations. Wonolo employees sign up for the platform, and the companies that partner with the startup have an opportunity to grab the necessary workers they need on a more flexible basis. Wonolo today said it has raised $13 million in a new financing round led by Sequoia Capital, including existing investors PivotNorth and Crunchfund, and new investor Base10. Sequoia Capital’s Jess Lee is joining the company’s board of directors as part of the financing.

“There’s a big opportunity helping people fill in their schedule with shifts,” Brustein said. “We really found there’s this huge untapped market of people who are looking for work who are underemployed. Let’s say Mary is a great worker and has a great job at the Home Depot, but no matter how good she is, she can only get 29 hours of work. It’s hard to manage schedules between different employers that want you to work the same hours. That’s the market we’ve really focused on, the underemployed market, which is a growing unfortunate trend in the U.S. That’s changed a little bit about the types of jobs we have on the platform.”

Wonolo is essentially looking to replace the typical temp agency experience, which helps workers find positions with companies that need a more limited amount of time. Meanwhile, those workers get an opportunity to fill in extra shifts that they might need for additional income on a more flexible schedule. Once a company posts a job to Wonolo, employees will get notified that it’s available and then get a chance to pick up those shifts, and when the job is approved those workers get paid right away.

While the jobs that Wonolo is suited for are more along the lines of merchandising, events staff, or more general labor, the hope is that the service will also expose those employees to a variety of companies who may actually end up wanting to hire them at some point. It allows them to get a good snapshot of all the work that’s available, and theoretically would help offer them an additional step on a career path that could get them to a direct full-time job with any of the companies from which they might end up accepting jobs.

“We thought we could address [the idea of being able to deal with unpredictability] better than temp staffing, and we realized the antidote was flexibility on the worker side,” Brustein said. “We could match them with these jobs that would unpredictably pop up. When we dug into it, we realized flexibility was something that was just completely lacking for workers. We took a very different approach to the way that people will often recruit talent for staffing agencies or their own employees. We are looking at character traits.”

Wonolo was born out of Brustein and Kim’s experience at Coca-Cola, where they had an opportunity to work with a major brand for a number of years. After a while, they got an opportunity to start working on a more entrepreneurial project, and that’s when that whole merchandising scenario played out and prompted them to start working on Wonolo. That part about character traits is an important part for Wonolo, Brustein said — because as long as someone can complete a job, they don’t have to be an absolute expert, as long as they are there ready and good to go.

There are, of course, companies trying to create platforms for temporary workers, like TrueBlue, and Brustein said Wonolo will inevitably have to compete with more local players as it looks to expand. But the hope is that aiming to tap the same kind of flexibility that made Uber so popular for temporary staffers — and potentially that pathway to a big career opportunity — will be one that attracts them to their service.

Apr
17
2018
--

Using Hints to Analyze Queries

In this blog post, we’ll look at using hints to analyze queries.

There are a lot of things that you can do wrong when writing a query, which means that there are a lot of things that you can do to make it better. From my personal experience, there are two things you should review first:

  1. The table join order
  2. Which index is being used

Why only those two? Because many other alternatives are more expensive, and in the end query optimization is a cost-effectiveness analysis. This is why we must start with the simplest fixes. We can control these two things with the hints “straight_join” and “force index”, which allow us to execute the query with the plan that we would like to test.
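
Schematically, the two hints look like this (the table, column and index names here are placeholders, not the tables used in the examples below):

select straight_join t1.col1, t2.col1
from t1 join t2 on t1.id = t2.t1_id;   -- join the tables in the order they are written

select col1
from t1 force index (idx_col1)
where col1 = 10;                       -- only consider the index idx_col1 for t1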

Join Order

In a query where we use multiple tables or subqueries, we have some particular fields that we are going to use to join the tables. Those fields could be the Primary Key of the table, the first part of a secondary index, neither, or both. But before we analyze possible scenarios, table structure or indexes, we need to establish the best order in which to join the tables for that query.

When we talk about join order across several tables, one possible scenario is that a table uses its primary key to join one table and another field to join to other tables. For instance:

select
  table_a.id, table_b.value1, table_c.value1
from
  table_a join
  table_b on table_a.id = table_b.id join
  table_c on table_b.id_c = table_c.id
where
  table_a.value1=10;

We get this explain:

+----+-------------+---------+--------+----------------+---------+---------+------------------------------------+------+-------------+
| id | select_type | table   | type   | possible_keys  | key     | key_len | ref                                | rows | Extra       |
+----+-------------+---------+--------+----------------+---------+---------+------------------------------------+------+-------------+
|  1 | SIMPLE      | table_a | ref    | PRIMARY,value1 | value1  | 5       | const                              |    1 | Using index |
|  1 | SIMPLE      | table_b | eq_ref | PRIMARY        | PRIMARY | 4       | bp_query_optimization.table_a.id   |    1 | Using where |
|  1 | SIMPLE      | table_c | eq_ref | PRIMARY        | PRIMARY | 4       | bp_query_optimization.table_b.id_c |    1 | NULL        |
+----+-------------+---------+--------+----------------+---------+---------+------------------------------------+------+-------------+

It is filtering by value1 on table_a, which joins with table_b with the primary key, and table_c uses the value of id_c which it gets from table_b.

But we can change the table order and use straight_join:

select straight_join
  table_a.id, table_b.value1, table_c.value1
from
  table_c join
  table_b on table_b.id_c = table_c.id join
  table_a on table_a.id = table_b.id
where
  table_a.value1=10;

The query is semantically the same, but now we get this explain:

+----+-------------+---------+--------+----------------+---------+---------+----------------------------------+------+-------------+
| id | select_type | table   | type   | possible_keys  | key     | key_len | ref                              | rows | Extra       |
+----+-------------+---------+--------+----------------+---------+---------+----------------------------------+------+-------------+
|  1 | SIMPLE      | table_c | ALL    | PRIMARY        | NULL    | NULL    | NULL                             |    1 | NULL        |
|  1 | SIMPLE      | table_b | ref    | PRIMARY,id_c   | id_c    | 5       | bp_query_optimization.table_c.id |    1 | NULL        |
|  1 | SIMPLE      | table_a | eq_ref | PRIMARY,value1 | PRIMARY | 4       | bp_query_optimization.table_b.id |    1 | Using where |
+----+-------------+---------+--------+----------------+---------+---------+----------------------------------+------+-------------+

In this case, we are performing a full table scan over table_c, which then joins with table_b using index over id_c to finally join table_a using the primary key.

Sometimes the optimizer chooses the incorrect join order because of bad statistics. I found myself reviewing the first query with the second explain plan, where the only thing that I did to find the query problem was to add “STRAIGHT_JOIN” to the query.

Taking into account that the optimizer could fail on this task, we found a practical way to force it to do what we want (change the join order).

It is also useful to find out when an index is missing. For example:

SELECT costs.id as cost_id, spac_types.id as spac_type_id
FROM
spac_types INNER JOIN
costs_spac_types ON costs_spac_types.spac_type_id = spac_types.id INNER JOIN
costs ON costs.id = costs_spac_types.cost_id
WHERE spac_types.place_id = 131;

The explain plan shows:

+----+-------------+------------------+--------+----------------------------------------------------+----------------------------------------------------+---------+-----------------------------------+-------+-------------+
| id | select_type | table            | type  | possible_keys                                       | key                                                | key_len | ref                               | rows  | Extra       |
+----+-------------+------------------+--------+----------------------------------------------------+----------------------------------------------------+---------+-----------------------------------+-------+-------------+
|  1 | SIMPLE      | costs_spac_types | index  | index_costs_spac_types_on_cost_id_and_spac_type_id | index_costs_spac_types_on_cost_id_and_spac_type_id | 8       | NULL                              | 86408 | Using index |
|  1 | SIMPLE      | spac_types       | eq_ref | PRIMARY,index_spac_types_on_place_id_and_spac_type | PRIMARY                                            | 4       | pms.costs_spac_types.spac_type_id |     1 | Using where |
|  1 | SIMPLE      | costs            | eq_ref | PRIMARY                                            | PRIMARY                                            | 4       | pms.costs_spac_types.cost_id      |     1 | Using index |
+----+-------------+------------------+--------+----------------------------------------------------+----------------------------------------------------+---------+-----------------------------------+-------+-------------+

It is starting with costs_spac_types and then using the clustered index for the next two tables. The explain doesn’t look bad!

However, it was taking longer than this:

SELECT STRAIGHT_JOIN costs.id as cost_id, spac_types.id as spac_type_id
FROM
spac_types INNER JOIN
costs_spac_types ON costs_spac_types.spac_type_id = spac_types.id INNER JOIN
costs ON costs.id = costs_spac_types.cost_id
WHERE spac_types.place_id = 131;

0.17 sec versus 0.09 sec. This is the explain plan:

+----+-------------+------------------+--------+----------------------------------------------------+----------------------------------------------------+---------+------------------------------+-------+-----------------------------------------------------------------+
| id | select_type | table            | type   | possible_keys                                      | key                                                | key_len | ref                          | rows  | Extra                                                           |
+----+-------------+------------------+--------+----------------------------------------------------+----------------------------------------------------+---------+------------------------------+-------+-----------------------------------------------------------------+
|  1 | SIMPLE      | spac_types       | ref    | PRIMARY,index_spac_types_on_place_id_and_spac_type | index_spac_types_on_place_id_and_spac_type         | 4      | const                         |    13 | Using index                                                     |
|  1 | SIMPLE      | costs_spac_types | index  | index_costs_spac_types_on_cost_id_and_spac_type_id | index_costs_spac_types_on_cost_id_and_spac_type_id | 8      | NULL                          | 86408 | Using where; Using index; Using join buffer (Block Nested Loop) |
|  1 | SIMPLE      | costs            | eq_ref | PRIMARY                                            | PRIMARY                                            | 4      | pms.costs_spac_types.cost_id  |     1 | Using index                                                     |
+----+-------------+------------------+--------+----------------------------------------------------+----------------------------------------------------+---------+------------------------------+-------+-----------------------------------------------------------------+

Reviewing the table structure:

CREATE TABLE costs_spac_types (
  id int(11) NOT NULL AUTO_INCREMENT,
  cost_id int(11) NOT NULL,
  spac_type_id int(11) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY index_costs_spac_types_on_cost_id_and_spac_type_id (cost_id,spac_type_id)
) ENGINE=InnoDB AUTO_INCREMENT=172742 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

I saw that the unique index was over cost_id and then spac_type_id. After adding this index:

ALTER TABLE costs_spac_types ADD UNIQUE KEY (spac_type_id,cost_id);

Now, the explain plan without STRAIGHT_JOIN is:

+----+-------------+------------------+--------+-----------------------------------------------------------------+--------------------------------------------+---------+------------------------------+------+-------------+
| id | select_type | table            | type   | possible_keys                                                   | key                                        | key_len | ref                          | rows | Extra       |
+----+-------------+------------------+--------+-----------------------------------------------------------------+--------------------------------------------+---------+------------------------------+------+-------------+
|  1 | SIMPLE      | spac_types       | ref    | PRIMARY,index_spac_types_on_place_id_and_spac_type              | index_spac_types_on_place_id_and_spac_type | 4      | const                         |   13 | Using index |
|  1 | SIMPLE      | costs_spac_types | ref    | index_costs_spac_types_on_cost_id_and_spac_type_id,spac_type_id | spac_type_id                               | 4      | pms.spac_types.id             |   38 | Using index |
|  1 | SIMPLE      | costs            | eq_ref | PRIMARY                                                         | PRIMARY                                    | 4      | pms.costs_spac_types.cost_id  |    1 | Using index |
+----+-------------+------------------+--------+-----------------------------------------------------------------+--------------------------------------------+---------+------------------------------+------+-------------+

Which is much better, as it is scanning fewer rows and the query time is just 0.01 seconds.

Indexes

The optimizer has the choice of using a clustered index, a secondary index, a partial secondary index, or no index at all, which means that it scans the clustered index (a full table scan).

Sometimes the optimizer ignores the use of an index because it thinks reading the rows directly is faster than an index lookup:

mysql> explain select * from table_c where id=1;
+----+-------------+---------+-------+---------------+---------+---------+-------+------+-------+
| id | select_type | table   | type  | possible_keys | key     | key_len | ref   | rows | Extra |
+----+-------------+---------+-------+---------------+---------+---------+-------+------+-------+
|  1 | SIMPLE      | table_c | const | PRIMARY       | PRIMARY | 4       | const |    1 | NULL  |
+----+-------------+---------+-------+---------------+---------+---------+-------+------+-------+
mysql> explain select * from table_c where value1=1;
+----+-------------+---------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table   | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+---------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | table_c | ALL  | NULL          | NULL | NULL    | NULL |    1 | Using where |
+----+-------------+---------+------+---------------+------+---------+------+------+-------------+

In both cases, we are reading directly from the clustered index.

Then, we have secondary indexes that are partially used and/or that are only partially useful for the query. This means that we are going to scan the index and then look up the matching rows in the clustered index. YES! TWO STRUCTURES WILL BE USED! We usually don’t realize any of this, but it is like an extra join between the secondary index and the clustered index.
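
To see this “extra join” in practice, compare selecting every column against the covering-index example that follows (a sketch only, against the example tables used earlier; the actual plan depends on your data and statistics):

explain select * from table_a where value1=1;
-- If the optimizer picks the value1 index, it can find the matching entries there,
-- but because every column is selected it still has to look up the full row in the
-- clustered index for each match, so the Extra column will typically not show "Using index".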

Finally, the covering index, which is simple to identify as “Using index” in the extra column:

mysql> explain select value1 from table_a where value1=1;
+----+-------------+---------+------+---------------+--------+---------+-------+------+-------------+
| id | select_type | table   | type | possible_keys | key    | key_len | ref   | rows | Extra       |
+----+-------------+---------+------+---------------+--------+---------+-------+------+-------------+
|  1 | SIMPLE      | table_a | ref  | value1        | value1 | 5       | const |    1 | Using index |
+----+-------------+---------+------+---------------+--------+---------+-------+------+-------------+

Index Analysis

As I told you before, this is a cost-effectiveness analysis from the point of view of query performance. Most of the time it is faster to use a covering index than a plain secondary index, which in turn is faster than scanning the clustered index. However, covering indexes are usually more expensive for writes, as you need more fields to cover the query needs. So we often end up using a secondary index that also goes through the clustered index. If the number of rows is not large and the query selects most of them, however, it could be even faster to perform a full table scan. Another thing to take into account is that the number of indexes affects the write rate.

Let’s do an analysis. This is a common query:

mysql> explain select * from table_index_analisis_1 t1, table_index_analisis_2 t2 where t1.id = t2.value1;
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------+------+-------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref                             | rows | Extra       |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------+------+-------------+
|  1 | SIMPLE      | t2    | ALL    | NULL          | NULL    | NULL    | NULL                            |   64 | Using where |
|  1 | SIMPLE      | t1    | eq_ref | PRIMARY       | PRIMARY | 4       | bp_query_optimization.t2.value1 |    1 | NULL        |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------+------+-------------+

It is using all the fields of each table.

This is more restrictive:

mysql> explain select t1.id, t1.value1, t1.value2, t2.value2 from table_index_analisis_1 t1, table_index_analisis_2 t2 where t1.id = t2.value1;
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------+------+-------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref                             | rows | Extra       |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------+------+-------------+
|  1 | SIMPLE      | t2    | ALL    | NULL          | NULL    | NULL    | NULL                            |   64 | Using where |
|  1 | SIMPLE      | t1    | eq_ref | PRIMARY       | PRIMARY | 4       | bp_query_optimization.t2.value1 |    1 | NULL        |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------+------+-------------+

But it is performing a full table scan over t2, and then is using t2.value1 to lookup on t1 using the clustered index.

Let’s add an index on table_index_analisis_2 over value1:

mysql> alter table table_index_analisis_2 add key (value1);
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

The explain shows that it is not being used, not even when we force it:

mysql> explain select * from table_index_analisis_1 t1, table_index_analisis_2 t2 where t1.id = t2.value1;
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------+------+-------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref                             | rows | Extra       |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------+------+-------------+
|  1 | SIMPLE      | t2    | ALL    | value1        | NULL    | NULL    | NULL                            |   64 | Using where |
|  1 | SIMPLE      | t1    | eq_ref | PRIMARY       | PRIMARY | 4       | bp_query_optimization.t2.value1 |    1 | NULL        |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------+------+-------------+
mysql> explain select * from table_index_analisis_1 t1, table_index_analisis_2 t2 force key (value1) where t1.id = t2.value1;
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------+------+-------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref                             | rows | Extra       |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------+------+-------------+
|  1 | SIMPLE      | t2    | ALL    | value1        | NULL    | NULL    | NULL                            |   64 | Using where |
|  1 | SIMPLE      | t1    | eq_ref | PRIMARY       | PRIMARY | 4       | bp_query_optimization.t2.value1 |    1 | NULL        |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------+------+-------------+

This is because the optimizer considers performing a full table scan better than using a part of the index.

Now we are going to add an index over value1 and value2:

mysql> alter table table_index_analisis_2 add key (value1,value2);
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> explain select t1.id, t1.value1, t1.value2, t2.value2 from table_index_analisis_1 t1, table_index_analisis_2 t2 where t1.id = t2.value1;
+----+-------------+-------+--------+-----------------+----------+---------+---------------------------------+------+--------------------------+
| id | select_type | table | type   | possible_keys   | key      | key_len | ref                             | rows | Extra                    |
+----+-------------+-------+--------+-----------------+----------+---------+---------------------------------+------+--------------------------+
|  1 | SIMPLE      | t2    | index  | value1,value1_2 | value1_2 | 10      | NULL                            |   64 | Using where; Using index |
|  1 | SIMPLE      | t1    | eq_ref | PRIMARY         | PRIMARY  | 4       | bp_query_optimization.t2.value1 |    1 | NULL                     |
+----+-------------+-------+--------+-----------------+----------+---------+---------------------------------+------+--------------------------+

We can see that now it is using the index, and the Extra column says “Using index”, which means that it does not need to access the clustered index.

Finally, we are going to add an index over table_index_analisis_1, designed so that it can best be used by this query:

mysql> alter table table_index_analisis_1 add key (id,value1,value2);
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> explain select t1.id, t1.value1, t1.value2, t2.value2 from table_index_analisis_1 t1, table_index_analisis_2 t2 where t1.id = t2.value1;
+----+-------------+-------+--------+-----------------+----------+---------+---------------------------------+------+--------------------------+
| id | select_type | table | type   | possible_keys   | key      | key_len | ref                             | rows | Extra                    |
+----+-------------+-------+--------+-----------------+----------+---------+---------------------------------+------+--------------------------+
|  1 | SIMPLE      | t2    | index  | value1,value1_2 | value1_2 | 10      | NULL                            |   64 | Using where; Using index |
|  1 | SIMPLE      | t1    | eq_ref | PRIMARY,id      | PRIMARY  | 4       | bp_query_optimization.t2.value1 |    1 | NULL                     |
+----+-------------+-------+--------+-----------------+----------+---------+---------------------------------+------+--------------------------+
2 rows in set (0.00 sec)

However, it is not selected by the optimizer. That is why we need to force it:

mysql> explain select t1.id, t1.value1, t1.value2, t2.value2 from table_index_analisis_1 t1 force index(id), table_index_analisis_2 t2 where t1.id = t2.value1;
+----+-------------+-------+-------+-----------------+----------+---------+---------------------------------+------+--------------------------+
| id | select_type | table | type  | possible_keys   | key      | key_len | ref                             | rows | Extra                    |
+----+-------------+-------+-------+-----------------+----------+---------+---------------------------------+------+--------------------------+
|  1 | SIMPLE      | t2    | index | value1,value1_2 | value1_2 | 10      | NULL                            |   64 | Using where; Using index |
|  1 | SIMPLE      | t1    | ref   | id              | id       | 4       | bp_query_optimization.t2.value1 |    1 | Using index              |
+----+-------------+-------+-------+-----------------+----------+---------+---------------------------------+------+--------------------------+
2 rows in set (0.00 sec)

Now, we are using just the secondary indexes for both tables.

Conclusions

There are many more things we could review when analyzing queries, like the handlers used, the table design, etc. However, in my opinion, it is useful to focus on these two at the beginning of the analysis.

I would also like to point out that using hints is not a long-term solution! Hints should be used only in the analysis phase.

Apr
17
2018
--

Enterprise AI will make the leap — who will reap the benefits?

This year, artificial intelligence will further elevate the enterprise by transforming the way we work, securing digital assets, increasing collaboration and ushering in a new era of AI-powered innovation. Enterprise AI is rapidly moving beyond hype and into reality, and is primed to become one of the most consequential technological segments. Although startups have already realized AI’s power in redefining industries, enterprise executives are still in the process of understanding how it will transform their business and reshape their teams across all departments.

Throughout the past year, early-adopting businesses of all sizes and industries began to reap benefits. Applications with AI-powered capabilities introduced opportunities to change the way the enterprise engaged customers, segmented markets, assessed sales leads and engaged influencers. Enterprises are on the verge of taking this a step further because of the amount of knowledge and tools leveraging the potential of AI within their entire organization.

“New breakthroughs in AI, enabled by new hardware architectures, will create new intelligent business models for enterprises,” says Nigel Toon, co-founder and CEO at U.K.-based Graphcore. “Companies that can build an initial knowledge model and launch an initial intelligent service or product, then use this first product to capture new data and improve the knowledge model on a continuing basis, will quickly create clear class-leading products and services that competitors will struggle to keep up with.”

The category is evolving, and large companies are finding distinct ways to innovate. They can uniquely tap into decades of industry experience to develop horizontal AI, built for specific industries like healthcare, financial services, automotive, retail and more. These implementations, though, require deep industry expertise and industry-specific design, training, monitoring, security and implementation to meet the high-stakes IT requirements of global organizations.

“In 2018, AI is entering the enterprise. I believe we will see many enterprises adopt AI technology, but the (few) leaders will be those that can align AI with their strategic business goals,” says Ronny Fehling, associate director of Gamma Artificial Intelligence at BCG.

2018: AI will start separating the winners from the losers

Early industry successes (and failures) proved AI’s inevitability, but also the reality that wide-scale adoption would come through incremental progress only. This year, we’ll see AI move from influencing product or business functions to an organization-wide AI strategy. Expect the winners to move fast and remain nimble to keep implementing off-the-shelf and proprietary AI.

Hans-Christian Boos, CEO and founder of Germany-based Arago, adds: “2018 will be a make or break year for enterprise and the established economy in general. I believe AI is the only viable path for innovation, new business models and digital disruption in companies from the industrial era. General AI can enable these enterprises to finally make use of the only advantage they have in the battle against new business models and giants from the Silicon Valley, or rather giants from the new age of knowledge based business models.”

The AI talent challenge

A boon in enterprise AI will also mean a further shortage of talent. Industries like telecommunications, financial services and manufacturing will feel the talent squeeze the most. The companies that win the AI talent war will gain exponential advantages, given the category’s rapid growth.

Hence, enterprises will try to attract talent by offering a powerful vision, a track record of product success, a bench of early client implementations and the potential to impact the masses. It’s about developing high-functioning and reliable solutions that become a new foundation for clients.

Developers and data scientists, however, are only the beginning. Winning enterprises must adapt their organizational structures to attract a new generation of product managers, sales, marketing, communications and other delivery teams that understand AI. This requires an informed, passionate and forward-thinking group of professionals that will help customers understand the future of work and customer engagement powered by AI.

AI adoption and employee training

Digital transformation, powered in large part by new AI capabilities, requires enterprises to understand how to extract data and utilize data-driven intelligence. Data is one of the greatest assets and essentials in maximizing the value in an AI application, yet data is often underutilized and misunderstood. Executives must establish teams and hold individuals across departments accountable for the successful and ongoing implementation of digital tools that extract full value from available internal and external data.

This transformation into an AI-native organization requires it to hire, train and re-skill employees at all levels, and to provide the resources for individuals to adopt AI-powered disciplines that enhance their performance. Most of the workforce, from top to bottom, should be encouraged to rethink and evolve their roles by incorporating new digital tools, often enabled by AI itself.

Expect AI and other digital technologies to become more prevalent in all business disciplines, not only at the application layer, as Vishal Chatrath, co-founder and CEO of U.K.-based Prowler.io emphasises. “Decision-making in enterprise is dominated by expert-systems that are born obsolete. The AI tools available till now that rely on deep-neural nets which are great for classification problems (identifying cats, dogs, words etc.) are not really fit for purpose for decision-making in large, complex and dynamic environments, because they are very data inefficient (needs millions of data points) and effectively act like black-boxes. 2018 will see Enterprise AI move beyond classification to decision-making.”

What’s next

However, the spotlight will shine on data governance as businesses adjust entire departments and workflows around data. In turn, data management and integrity will be an essential component of success as consumers and enterprises gain greater awareness about how companies use customers’ data. This opens a large field of opportunities, but also will require transparency in how companies are using, sharing and building applications on top of customer data to ensure trust.

“Every single industry will be enhanced with AI in the coming years. In the last years there was a lot of foundation work on gathering standardized data and now we can start to use some of the advanced AI techniques to bring huge efficiency and quality gains to enterprise companies,” says Rasmus Rothe, co-founder and CTO of Germany-based research lab and venture builder Merantix. “Enterprises should therefore thoroughly analyze their business units to understand how AI can help them to improve. Partnering with external AI experts instead of trying to build everything yourself is often more capital efficient and also leads to better results.”

The shift toward AI-native enterprises is in a defining phase. The pie of the AI-enabled market will continue to grow and everyone has an opportunity to take a slice. Enterprises need to quickly leverage their assets and extract the value of their data as AI algorithms themselves will become the most valuable part when data has become a commodity. The question is, who will move first, and who will have the biggest appetite.

Apr
17
2018
--

Google Cloud releases Dialogflow Enterprise Edition for building chat apps

Building conversational interfaces is a hot new area for developers. Chatbots can be a way to reduce friction in websites and apps and to give customers quick answers to commonly asked questions in a conversational framework. Today, Google announced it was making Dialogflow Enterprise Edition generally available. It had previously been in beta.

This technology came to Google via the API.AI acquisition in 2016. Google wisely decided to change the name of the tool along the way, giving it a moniker that more closely matched what it actually does. The company reports that hundreds of thousands of developers are using the tool already to build conversational interfaces.

This isn’t just an all-Google tool, though. It works across voice interface platforms, including Google Assistant, Amazon Alexa and Facebook Messenger, giving developers a tool to develop their chat apps once and use them across several devices without having to change the underlying code in a significant way.

What’s more, with today’s release the company is providing increased functionality and making it easier to transition to the enterprise edition at the same time.

“Starting today, you can combine batch operations that would have required multiple API calls into a single API call, reducing lines of code and shortening development time. Dialogflow API V2 is also now the default for all new agents, integrating with Google Cloud Speech-to-Text, enabling agent management via API, supporting gRPC, and providing an easy transition to Enterprise Edition with no code migration,” Dan Aharon, Google’s product manager for Cloud AI, wrote in a company blog post announcing the tool.
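
For reference, a single detectIntent request against the Dialogflow V2 REST API looks roughly like the sketch below; the project ID, session ID, query text and the gcloud-based access token are placeholders/assumptions for illustration, not details from the announcement:

    $ curl -X POST \
        -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
        -H "Content-Type: application/json" \
        -d '{"queryInput": {"text": {"text": "What time do you open?", "languageCode": "en"}}}' \
        "https://dialogflow.googleapis.com/v2/projects/my-project/agent/sessions/123456:detectIntent"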

The company showed off a few new customers using Dialogflow to build chat interfaces for their customers, including KLM Royal Dutch Airlines, Domino’s and Ticketmaster.

The new tool, which is available today, supports more than 30 languages and as a generally available enterprise product comes with a support package and service level agreement (SLA).
