Apr
30
2019
--

Oculus announces a VR subscription service for enterprises

Oculus is getting serious about monetizing VR for enterprise.

The company has previously sold specific business versions of its headsets, but now it’s adding a pricey annual device-management subscription.

Oculus Go for business starts at $599 (64 GB) and the enterprise Oculus Quest starts at $999 (128 GB). These fees include the first year of enterprise device management and support, which goes for $180 per year per device.

Here’s what that fee gets you:

This includes a dedicated software suite offering device setup and management tools, enterprise-grade service and support, and a new user experience customized for business use cases.

The new Oculus for Business launches in the fall.

Apr
30
2019
--

Upcoming Webinar Wed 5/1: Horizontally scale MySQL with TiDB while avoiding sharding

Join Percona CEO Peter Zaitsev and PingCAP Senior Product and Community Manager Morgan Tocker as they present How to horizontally scale MySQL with TiDB while avoiding sharding issues on Wednesday, May 1st, 2019, at 11:00 AM PDT (UTC-7) / 2:00 PM EDT (UTC-4).

Register Now

In this joint webinar, PingCAP will provide an introduction and overview of TiDB, tailored for those with a strong background in MySQL. PingCAP will use MySQL as an example to explain various implementation details of TiDB and translate terminology to MySQL/InnoDB terms.

Percona will then discuss sharding issues within MySQL and how TiDB resolves many of those issues through horizontal scalability.

In this webinar, the following topics will be covered:

– TiDB and the problems it resolves
– Issues with MySQL sharding, including cautionary use cases
– Benchmark results from different versions of TiDB
– MySQL compatibility with TiDB
– Summary, and question and answer

Register for this webinar on how to scale MySQL with TiDB.

Apr
30
2019
--

Facebook Messenger will get desktop apps, co-watching, emoji status

To win chat, Facebook Messenger must be as accessible as SMS, yet more entertaining than Snapchat. Today, Messenger pushes on both fronts with a series of announcements at Facebook’s F8 conference. Those include that it will launch Mac and PC desktop apps, a faster and smaller mobile app, simultaneous video co-watching and a revamped Friends tab, where friends can use an emoji to tell you what they’re up to or down for.

Facebook is also beefing up its tools for the 40 million active businesses and 300,000 businesses on Messenger, up from 200,000 businesses a year ago. Merchants will be able to let users book appointments at salons and masseuses, collect information with new lead generation chatbot templates and provide customer service to verified customers through authenticated m.me links. Facebook hopes this will boost the app beyond the 20 billion messages sent between people and businesses each month, which is up 10X from December 2017.

“We believe you can build practically any utility on top of messaging,” says Facebook’s head of Messenger Stan Chudnovsky. But he stresses that “All of the engineering behind it has been redone” to make it more reliable, and to comply with CEO Mark Zuckerberg’s directive to unite the backends of Messenger, WhatsApp and Instagram Direct. “Of course, if we didn’t have to do all that, we’d be able to invest more in utilities. But we feel that utilities will be less functional if we don’t do that work. They need to go hand-in-hand together. Utilities will be more powerful, more functional and more desired if built on top of a system that’s interoperable and end-to-end encrypted.”

Here’s a look at the major Messenger announcements and why they’re important:

Messenger Desktop – A stripped-down version of Messenger focused on chat, audio and video calls will debut later this year. Chudnovsky says it will remove the need to juggle and resize browser tabs by giving you an always-accessible version of Messenger that can replace some of the unofficial knock-offs. Especially as Messenger focuses more on businesses, giving them a dedicated desktop interface could convince them to invest more in lead generation and customer service through Messenger.

Facebook Messenger’s upcoming desktop app

Project Lightspeed – Messenger is reengineering its app to cut 70 MB off its download size so people with low-storage phones don’t have to delete as many photos to install it. In testing, the app can cold start in merely 1.3 seconds, which Chudnovsky says is just 25 percent of where Messenger and many other apps are today. While Facebook already offers Messenger Lite for the developing world, making the main app faster for everyone else could help Messenger swoop in and steal users from the status quo of SMS. The Lightspeed update will roll out later this year.

Video Co-Watching – TechCrunch reported in November that Messenger was building a Facebook Watch Party-style experience that would let users pick videos to watch at the same time as a friend, with reaction cams of their faces shown below the video. Now in testing before rolling out later this year, users can pick any Facebook video, invite one or multiple friends and laugh together. Unique capabilities like this could make Messenger more entertaining between utilitarian chat threads and appeal to a younger audience Facebook is at risk of losing.

Watch Videos Together on Messenger

Business Tools – After a rough start to its chatbot program a few years ago, where bots couldn’t figure out users’ open-ended responses, Chudnovsky says the platform is now picking up steam with 300,000 developers on board. One option that’s worked especially well is lead-generation templates, which teach bots to ask people standardized questions to collect contact info or business intent, so Messenger is adding more of those templates with completion reminders and seamless hand-off to a live agent.

To let users interact with appointment-based businesses through a platform they’re already familiar with, Messenger launched a beta program for barbers, dentists and more that will soon open to let any business handle appointment booking through the app. And with new authenticated m.me links, a business can take a logged-in user on their website and pass them to Messenger while still knowing their order history and other info. Getting more businesses hooked on Messenger customer service could be very lucrative down the line.

Appointment booking on Messenger

Close Friends and Emoji Status – Perhaps the most interesting update to Messenger, though, is its upcoming effort to help you make offline plans. Messenger is in the early stages of rebuilding its Friends tab into “Close Friends,” which will host big previews of friends’ Stories, photos shared in your chats, and let people overlay an emoji on their profile pic to show friends what they’re doing. We first reported this “Your Emoji” status update feature was being built a year ago, but it quietly cropped up in the video for Messenger Close Friends. This iteration lets you add an emoji like a home, barbell, low battery or beer mug, plus a short text description, to let friends know you’re back from work, at the gym, might not respond or are interested in getting a drink. These will show up atop the Close Friends tab as well as on location-sharing maps and more once this eventually rolls out.

Messenger’s upcoming Close Friends tab with Your Emoji status

Facebook Messenger is the best poised app to solve the loneliness problem. We often end up by ourselves because we’re not sure which of our friends are free to hang out, and we’re embarrassed to look desperate by constantly reaching out. But with emoji status, Messenger users could quietly signal their intentions without seeming needy. This “what are you doing offline” feature could be a whole social network of its own, as apps like Down To Lunch have tried. But with 1.3 billion users and built-in chat, Messenger has the ubiquity and utility to turn a hope into a hangout.

Click below to check out all of TechCrunch’s Facebook conference coverage from today:

Apr
30
2019
--

Docker looks to partners and packages to ease container implementation

Docker appears to be searching for ways to simplify the core value proposition of the company — creating, deploying and managing containers. While most would agree it has revolutionized software development, like many technology solutions, it takes a certain level of expertise and staffing to pull off. At DockerCon, the company’s customer conference taking place this week in San Francisco, Docker announced several ways it could help customers with the tough parts of implementing a containerized solution.

For starters, the company announced a beta of Docker Enterprise 3.0 this morning. That update is all about making life simpler for developers. As companies move to containerized environments, it’s a challenge for all but the largest organizations like Google, Amazon and Facebook, all of whom have massive resource requirements and correspondingly large engineering teams.

Most companies don’t have that luxury though, and Docker recognizes if it wants to bring containerization to a larger number of customers, it has to create packages and programs that make it easier to implement.

Docker Enterprise 3.0 is a step toward providing a solution that lets developers concentrate on the development aspects, while working with templates and other tools to simplify the deployment and management side of things.

The company sees customers struggling with implementation and how to configure and build a containerized workflow, so it is working with systems integrators to help smooth out the difficult parts. Today, the company announced Docker Enterprise as a Service, with the goal of helping companies through the process of setting up and managing a containerized environment, using the Docker stack and adjacent tooling like Kubernetes.

The service provider will take care of operational details like managing upgrades, rolling out patches, doing backups and undertaking capacity planning — all of those operational tasks that require a high level of knowledge around enterprise container stacks.

Capgemini will be the first go-to-market partner. “Capgemini has a combination of automation, technology tools, as well as services on the back end that can manage the installation, provisioning and management of the enterprise platform itself in cases where customers don’t want to do that, and they want to pay someone to do that for them,” Scott Johnston, chief product officer at Docker, told TechCrunch.

The company has released tools in the past to help customers move legacy applications into containers without a lot of fuss. Today, the company announced a solution bundle called Accelerate Greenfield, a set of tools designed to help customers get up and running as container-first development companies.

“This is for those organizations that may be a little further along. They’ve gone all-in on containers committing to taking a container-first approach to new application development,” Johnston explained. He says this could be cloud native microservices or even a LAMP stack application, but the point is that they want to put everything in containers on a container platform.

Accelerate Greenfield is designed to do that. “They get the benefits where they know that from the developer to the production end point, it’s secure. They have a single way to define it all the way through the life cycle. They can make sure that it’s moving quickly, and they have that portability built into the container format, so they can deploy [wherever they wish],” he said.

These programs and products are all about providing a level of hand-holding, either by playing a direct consultative role, working with a systems integrator or providing a set of tools and technologies to walk the customer through the containerization life cycle. Whether they provide the level of help customers require is something we will learn over time as these programs mature.

Apr
30
2019
--

Docker updates focus on simplifying containerization for developers

Over the last five years, Docker has become synonymous with software containers, but that doesn’t mean every developer understands the technical details of building, managing and deploying them. At DockerCon this week, the company’s customer conference taking place in San Francisco, it announced new tools that have been designed to make it easier for developers, who might not be Docker experts, to work with containers.

As the technology has matured, the company has seen the market broaden, but in order to take advantage of that, it needs to provide a set of tools that make it easier to work with. “We’ve found that customers typically have a small cadre of Docker experts, but there are hundreds, if not thousands, of developers who also want to use Docker. And we reasoned, how can we help them get productive very, very quickly, without them having to become Docker experts,” Scott Johnston, chief product officer at Docker, told TechCrunch.

To that end, it announced a beta of Docker Enterprise 3.0, which includes several key components. For starters, Docker Desktop Enterprise lets IT set up a Docker environment with the kind of security and deployment templates that make sense for each customer. The developers can then pick the templates that make sense for their implementations, while conforming with compliance and governance rules in the company.

“These templates already have IT-approved container images, and have IT-approved configuration settings. And what that means is that IT can provide these templates through these visual tools that allow developers to move fast and choose the ones they want without having to go back for approval,” Johnston explained.

The idea is to let the developers concentrate on building applications, and the templates provide all the Docker tooling pre-built and ready to go, so they don’t have to worry about all of that.

Another piece of this is Docker Applications, which allows developers to build complex containerized applications as a single package and deploy them to any infrastructure they wish — on-prem or in the cloud. Five years ago, when Docker really got started with containers, they were a simpler idea, often involving just a single container, but as developers broke down those larger applications into microservices, it created a new level of difficulty, especially for operations teams that had to deploy these increasingly large sets of application containers.

“Operations can now programmatically change the parameters for the containers, depending on the environments, without having to go in and change the application. So you can imagine that ability lowers the friction of having to manage all these files in the first place,” he said.

The final piece of that is the orchestration layer, and the popular way to handle that today is with Kubernetes. Docker has created its own flavor of Kubernetes, based on the open-source tool. Johnston says, as with the other two pieces, the goal here is to take a powerful tool like Kubernetes and reduce the overall complexity associated with running it, while making it fully compatible with a Docker environment.

For that, Docker announced Docker Kubernetes Service (DKS), which has been designed with Docker users in mind, including support for Docker Compose, a scripting tool that has been popular with Docker users. While you are free to use any flavor of Kubernetes you wish, Docker is offering DKS as a Docker-friendly version for developers.

All of these components have one thing in common besides being part of Docker Enterprise 3.0. They are trying to reduce the complexity associated with deploying and managing containers and to abstract away the most difficult parts, so that developers can concentrate on developing without having to worry about connecting to the technical underpinnings of building and deploying containers. At the same time, Docker is trying to make it easier for the operations team to manage it all. That is the goal, at least. In the end, DevOps teams will be the final judges of how well Docker has done, once these tools become generally available later this year.

The Docker Enterprise 3.0 beta will be available later this quarter.

Apr
30
2019
--

UiPath nabs $568M at a $7B valuation to bring robotic process automation to the front office

Companies are on the hunt for ways to reduce the time and money it costs their employees to perform repetitive tasks, so today a startup that has built a business to capitalize on this is announcing a huge round of funding to double down on the opportunity.

UiPath — a robotic process automation startup originally founded in Romania that uses artificial intelligence and sophisticated scripts to build software to run these tasks — today confirmed that it has closed a Series D round of $568 million at a post-money valuation of $7 billion.

From what we understand, the startup is “close to profitability” and is going to keep growing as a private company. Then, an IPO within the next 12-24 months is the “medium term” plan.

“We are at the tipping point. Business leaders everywhere are augmenting their workforces with software robots, rapidly accelerating the digital transformation of their entire business and freeing employees to spend time on more impactful work,” said Daniel Dines, UiPath co-founder and CEO, in a statement. “UiPath is leading this workforce revolution, driven by our core determination to democratize RPA and deliver on our vision of a robot helping every person.”

This latest round of funding is being led by Coatue, with participation from Dragoneer, Wellington, Sands Capital, and funds and accounts advised by T. Rowe Price Associates, Accel, Alphabet’s CapitalG, Sequoia, IVP and Madrona Venture Group.

CFO Marie Myers said in an interview in London that the plan will be to use this funding to expand UiPath’s focus into more front-office and customer-facing areas, such as customer support and sales.

“We want to move into automation into new levels,” she said. “We’re advancing quickly into AI and the cloud, with plans to launch a new AI product in the second half of the year that we believe will demystify it for our users.” The product, she added, will be focused around “drag and drop” architecture and will work both for attended and unattended bots — that is, those that work as assistants to humans, and those that work completely on their own. “Robotics has moved out of the back office and into the front office, and the time is right to move into intelligent automation.”

Today’s news confirms Kate’s report from last month noting that the round was in progress: in the end, the amount UiPath raised was higher than the target amount we’d heard ($400 million), with the valuation on the more “conservative” side (we’d said the valuation would be higher than $7 billion).

“Conservative” is a relative term here. The company has been on a funding tear in the last year, raising $418 million ($153 million at Series A and $265 million at Series B) in the space of 12 months, and seeing its valuation go from a modest $110 million in April 2017 to $7 billion today, just two years later.

Up to now, UiPath has focused on internal and back-office tasks in areas like accounting, human resources paperwork, and claims processing — a booming business that has seen UiPath expand its annual run rate to more than $200 million (versus $150 million six months ago) and its customer base to more than 400,000 people.

Customers today include American Fidelity, BankUnited, CWT (formerly known as Carlson Wagonlit Travel), Duracell, Google, Japan Exchange Group (JPX), LogMeIn, McDonalds, NHS Shared Business Services, Nippon Life Insurance Company, NTT Communications, Orange, Ricoh Company, Ltd., Rogers Communications, Shinsei Bank, Quest Diagnostics, Uber, the US Navy, Voya Financial, Virgin Media, and World Fuel Services.

Moving into more front-office tasks is an ambitious but not surprising leap for UiPath. Looking at that customer list, it’s notable that many of these organizations have customer-facing operations, often with their own sets of repetitive processes that are ripe for improving by tapping into the many facets of AI — from computer vision to natural language processing and voice recognition, through to machine learning — alongside other technology.

It also begs the question of what UiPath might look to tackle next. Having customer-facing tools and services is one short leap from building consumer services, an area where the likes of Amazon, Google, Apple and Microsoft are all pushing hard with devices and personal assistant services. (That would indeed open up the competitive landscape quite a lot for UiPath, beyond the list of RPA companies like AutomationAnywhere, Kofax and Blue Prism who are its competitors today.)

Robotics has been given a somewhat bad rap in the world of work. Critics worry that robots are “taking over all the jobs”, removing humans and their own need to be industrious from the equation; and in the worst-case scenarios, the work of a robot lacks the nuance and sophistication you get from the human touch.

UiPath and the bigger area of RPA are interesting in this regard. The aim (the stated aim, at least) isn’t to replace people, but to take tasks out of their hands to make it easier for them to focus on the non-repetitive work that “robots” — and in the case of UiPath, software scripts and robots — cannot do.

Indeed, that “future of work” angle is precisely what has attracted investors.

“UiPath is enabling the critical capabilities necessary to advance how companies perform and how employees better spend their time,” said Greg Dunham, vice president at T. Rowe Price Associates, Inc., in a statement. “The industry has achieved rapid growth in such a short time, with UiPath at the head of it, largely due to the fact that RPA is becoming recognized as the paradigm shift needed to drive digital transformation through virtually every single industry in the world.”

As we’ve written before, the company has been a big hit with investors because of the rapid traction it has seen with enterprise customers.

There is an interesting side story to the funding that speaks to that traction: Myers, the CFO, came to UiPath by way of one of those engagements. She had been a senior finance executive with HP tasked with figuring out how to make some of its accounting more efficient. She issued an RFP for the work, and the only company she thought really addressed the task with a truly tech-first solution, at a very competitive price, was an unlikely startup out of Romania, which turned out to be UiPath. She became one of the company’s first customers, and eventually Dines offered her a job to help build his company to the next level, which she leaped to take.

“UiPath is improving business performance, efficiency and operation in a way we’ve never seen before,” said Philippe Laffont, founder of Coatue Management, in a statement. “The Company’s rapid growth over the last two years is a testament to the fact that UiPath is transforming how companies manage their resources. RPA presents an enormous opportunity for companies around the world who are embracing artificial intelligence, driving a new era of productivity, efficiency and workplace satisfaction.” 

Apr
29
2019
--

A Close Look at the Index Include Clause

Some databases—namely Microsoft SQL Server, IBM Db2, and also PostgreSQL since release 11—offer an include clause in the create index statement. The introduction of this feature to PostgreSQL is the trigger for this long-overdue explanation of the include clause.

Before going into the details, let’s start with a short recap on how (non-clustered) B-tree indexes work and what the all-mighty index-only scan is.

Recap: B-tree Indexes

To understand the include clause, you must first understand that using an index affects up to three layers of data structures:

  • The B-tree

  • The doubly linked list at the leaf node level of the B-tree

  • The table

The first two structures together form an index so they could be combined into a single item, i.e. the “B-tree index”. I prefer to keep them separate as they serve different needs and have a different impact on performance. Moreover, explaining the include clause requires making this distinction.

In the general case, the database software starts traversing the B-tree to find the first matching entry at the leaf node level (1). It then follows the doubly linked list until it has found all matching entries (2) and finally it fetches each of those matching entries from the table (3). Actually, the last two steps can be interleaved, but that is not relevant for understanding the general concept.

The following formulas give you a rough idea of how many read operations each of these steps needs. The sum of these three components is the total effort of an index access.

  • The B-tree: log100(<rows in table>), often less than 5

  • The doubly linked list: <rows read from index> / 100

  • The table: <rows read from table>

When loading a few rows, the B-tree makes the greatest contribution to the overall effort. As soon as you need to fetch just a handful of rows from the table, this step takes the lead. In either case—few or many rows—the doubly linked list is usually a minor factor because it stores rows with similar values next to each other so that a single read operation can fetch 100 or even more rows. The formula reflects this by the respective divisor.
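
To put rough, purely illustrative numbers on this: reading 500 matching rows out of a table of 1,000,000 rows costs about log100(1,000,000) = 3 read operations for the B-tree, 500 / 100 = 5 for the doubly linked list, and up to 500 for the table. In this made-up example the table access clearly dominates, as it does whenever more than a handful of rows must be fetched.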


Note

If you are thinking “That’s why we have clustered indexes”, please read my article: Unreasonable Defaults: Primary Key as Clustering Key.


The most generic idea about optimization is to do less work to achieve the same goal. When it comes to index access, this means that the database software omits accessing a data structure if it doesn’t need any data from it.

You can read more about the inner workings of B-tree indexes in Chapter 1, “Anatomy of an SQL Index”, of SQL Performance Explained.

Recap: Index-Only Scan

The index-only scan does exactly that: it omits the table access if the required data is available in the doubly linked list of the index.

Consider the following index and query I borrowed from Index-Only Scan: Avoiding Table Access.

CREATE INDEX idx
    ON sales
     ( subsidiary_id, eur_value )

SELECT SUM(eur_value)
  FROM sales
 WHERE subsidiary_id = ?

At first glance, you may wonder why the column eur_value is in the index definition at all—it is not mentioned in the where clause.


B-tree Indexes Help Many Clauses

It is a common misconception that indexes only help the where clause.

B-tree indexes can also help the order by, group by, select and other clauses. It is just the B-tree part of an index—not the doubly linked list—that cannot be used by other clauses.


The crucial point in this example is that the B-tree index happens to have all required columns—the database software doesn’t need to access the table itself. This is what we refer to as an index-only scan.

Applying the formulas above, the performance benefit of this is very small if only a few rows satisfy the where clause. On the other hand, if the where clause accepts many rows, e.g. millions, the number of read operations is essentially reduced by a factor of 100.


Note

It is not uncommon that an index-only scan improves performance by one or two orders of magnitude.


The example above uses the fact that the doubly-linked list—the leaf nodes of the B-tree—contains the eur_value column. Although the other nodes of the B-tree store that column too, this query has no use for the information in these nodes.

The Include Clause

The include clause allows us to make a distinction between columns we would like to have in the entire index (key columns) and columns we only need in the leaf nodes (include columns). That means it allows us to remove columns from the non-leaf nodes if we don’t need them there.

Using the include clause, we could refine the index for this query:

CREATE INDEX idx
    ON sales ( subsidiary_id )
     INCLUDE ( eur_value )

The query can still use this index for an index-only scan, thus yielding essentially the same performance.

Besides the obvious structural difference, there is also a more subtle one: the order of the leaf node entries does not take the include columns into account. The index is solely ordered by its key columns. This has two consequences: include columns cannot be used to prevent sorting nor are they considered for uniqueness (see below).
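
To make the first consequence concrete, consider a hypothetical query that wants the values of one subsidiary in order:

SELECT eur_value
  FROM sales
 WHERE subsidiary_id = ?
 ORDER BY eur_value

With the original two-column key (subsidiary_id, eur_value), the matching leaf node entries are already stored in eur_value order; with eur_value only in the include clause, the database still has to sort them.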


“Covering Index”

The term “covering index” is sometimes used in the context of index-only scans or include clauses. As this term is often used with a different meaning, I generally avoid it.

What matters is whether a given index can support a given query by means of an index-only scan. Whether or not that index has an include clause or contains all table columns is not relevant.


Compared to the original index definition, the new definition with the include clause has some advantages:

  • The tree might have fewer levels (<~40%)

    As the tree nodes above the doubly linked list do not contain the include columns, the database can store more branches in each block so that the tree might have fewer levels.

  • The index is slightly smaller (<~3%)

    As the non-leaf nodes of the tree don’t contain the include columns, the overall size of the index is slightly smaller. However, the leaf node level of the index needs the most space anyway, so the potential savings in the remaining nodes are very small.

  • It documents its purpose

    This is definitely the most underestimated benefit of the include clause: the reason why the column is in the index is documented in the index definition itself.

Let me elaborate on the last item.

When extending an existing index, it is very important to know exactly why the index is defined the way it is. The freedom you have to change the index without breaking any other queries is a direct result of this knowledge.

The following query demonstrates this:

SELECT *
  FROM sales
 WHERE subsidiary_id = ?
 ORDER BY ts DESC
 FETCH FIRST 1 ROW ONLY

As before, for a given subsidiary this query fetches the most recent sales entry (ts is for time stamp).

To optimize this query, it would be great to have an index that starts with the key columns (subsidiary_id, ts). With this index, the database software can directly navigate to the latest entry for that subsidiary and return it right away. There is no need to read and sort all of the entries for that subsidiary because the doubly linked list is sorted according to the index key, i.e. the last entry for any given subsidiary must have the greatest ts value for that subsidiary. With this approach, the query is essentially as fast as a primary key lookup. See Indexing Order By and Querying Top-N Rows for more details about this technique.
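
A new index tailored to just this query could look like the following sketch (the index name is made up for illustration):

CREATE INDEX idx_subsidiary_ts
    ON sales ( subsidiary_id, ts )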


On my Own Behalf

I make my living from training, other SQL related services and selling my book. Learn more at https://winand.at/.


Before adding a new index for this query, we should check if there is an existing index that can be changed (extended) to support this trick. This is generally a good practice because extending an existing index has a smaller impact on the maintenance overhead than adding a new index. However, when changing an existing index, we need to make sure that we do not make that index less useful for other queries.

If we look at the original index definition, we encounter a problem:

CREATE INDEX idx
    ON sales
     ( subsidiary_id, eur_value )

To make this index support the order by clause of the above query, we would need to insert the ts column between the two existing columns:

CREATE INDEX idx
    ON sales
     ( subsidiary_id, ts, eur_value )

However, that might render this index less useful for queries that need the eur_value column in the second position, e.g. if it appears in the where or order by clause. Changing this index therefore carries a considerable risk of breaking other queries, unless we know for certain that no such queries exist. If we don’t know, it is often best to keep the index as it is and create another one for the new query.

The picture changes completely if we look at the index with the include clause.

CREATE INDEX idx
    ON sales ( subsidiary_id )
     INCLUDE ( eur_value )

As the eur_value column is in the include clause, it is not in the non-leaf nodes and thus neither useful for navigating the tree nor for ordering. Adding a new column to the end of the key part is relatively safe.

CREATE INDEX idx
    ON sales ( subsidiary_id, ts )
     INCLUDE ( eur_value )

Even though there is still a small risk of negative impacts for other queries, it is usually worth taking that risk.

From the perspective of index evolution, it is thus very helpful to put columns into the include clause if this is all you need. Columns that are just added to enable an index-only scan are the prime candidates for this.

Filtering on Include Columns

Until now we have focused on how the include clause can enable index-only scans. Let’s also look at another case where it is beneficial to have an extra column in the index.

SELECT *
  FROM sales
 WHERE subsidiary_id = ?
   AND notes LIKE '%search term%'

I’ve made the search term a literal value to show the leading and trailing wildcards—of course you would use a bind parameter in your code.

Now, let’s think about the right index for this query. Obviously, the subsidiary_id needs to be in the first position. If we take the previous index from above, it already satisfies this requirement:

CREATE INDEX idx
    ON sales ( subsidiary_id, ts )
     INCLUDE ( eur_value )

The database software can use that index with the three-step procedure as described at the beginning: (1) it will use the B-tree to find the first index entry for the given subsidiary; (2) it will follow the doubly linked list to find all sales for that subsidiary; (3) it will fetch all related sales from the table, remove those for which the like pattern on the notes column doesn’t match and return the remaining rows.

The problem is the last step of this procedure: the table access loads rows without knowing if they will make it into the final result. Quite often, the table access is the biggest contributor to the total effort of running a query. Loading data that is not even selected is a huge performance no-no.


Important

Avoid loading data that doesn’t affect the result of the query.


The challenge with this particular query is that it uses an in-fix like pattern. Normal B-tree indexes don’t support searching such patterns. However, B-tree indexes still support filtering on such patterns. Note the emphasis: searching vs. filtering.

In other words, if the notes column was present in the doubly linked list, the database software could apply the like pattern before fetching that row from the table (not PostgreSQL, see below). This prevents the table access if the like pattern doesn’t match. If the table has more columns, there is still a table access to fetch those columns for the rows that satisfy the where clause—due to the select *.

CREATE INDEX idx
    ON sales ( subsidiary_id, ts )
     INCLUDE ( eur_value, notes )

If there are more columns in the table, the index does not enable an index-only scan. Nonetheless, it can bring the performance close to that of an index-only scan if the portion of rows that match the like pattern is very low. In the opposite case—if all rows match the pattern—the performance is a little bit worse due to the increased index size. However, the breakeven is easy to reach: for an overall performance improvement, it is often enough that the like filter removes a small percentage of the rows. Your mileage will vary depending on the size of the involved columns.

Unique Indexes with Include Clause

Last but not least, there is an entirely different aspect of the include clause: unique indexes with an include clause only consider the key columns for uniqueness.

That allows us to create unique indexes that have additional columns in the leaf nodes, e.g. for an index-only scan.

CREATE UNIQUE INDEX …
    ON … ( id )
 INCLUDE ( payload )

This index protects against duplicate values in the id column, yet it supports an index-only scan for the next query.

SELECT payload
  FROM …
 WHERE id = ?

Note that the include clause is not strictly required for this behavior: databases that make a proper distinction between unique constraints and unique indexes just need an index with the unique key columns as the leftmost columns—additional columns are fine.

For the Oracle Database, the corresponding syntax is this:

CREATE INDEX …
    ON … ( id, payload )
ALTER TABLE … ADD UNIQUE ( id )
      USING INDEX …

Compatibility

Availability of INCLUDE

PostgreSQL: No Filtering Before Visibility Check

The PostgreSQL database has a limitation when it comes to applying filters on the index level. The short story is that it doesn’t do it, except in a few cases. Even worse, some of those cases only work when the respective data is stored in the key part of the index, not in the include clause. That means moving columns to the include clause may negatively affect performance, even if the above described logic still applies.

The long story starts with the fact that PostgreSQL keeps old row versions in the table until they become invisible to all transactions and the vacuum process removes them at some later point in time. To know whether a row version is visible (to a given transaction) or not, each table has two extra attributes that indicate when a row version was created and deleted: xmin and xmax. The row is only visible if the current transaction falls within the xmin/xmax range.

Unfortunately, the xmin/xmax values are not stored in indexes.

That means that whenever PostgreSQL is looking at an index entry, it cannot tell whether or not that entry is visible to the current transaction. It could be a deleted entry or an entry that has not yet been committed. The canonical way to find out is to look into the table and check the xmin/xmax values.

A consequence is that there is no such thing as an index-only scan in PostgreSQL. No matter how many columns you put into an index, PostgreSQL will always need to check the visibility, which is not available in the index.

Yet there is an Index Only Scan operation in PostgreSQL—but that still needs to check the visibility of each row version by accessing data outside the index. Instead of going to the table, the Index Only Scan first checks the so-called visibility map. This visibility map is very dense, so the number of read operations is (hopefully) less than fetching xmin/xmax from the table. However, the visibility map does not always give a definite answer: it either states that the row is known to be visible, or that the visibility is not known. In the latter case, the Index Only Scan still needs to fetch xmin/xmax from the table (shown as “Heap Fetches” in explain analyze).

After this short visibility digression, we can return to filtering on the index level.

SQL allows arbitrarily complex expressions in the where clause. These expressions might also cause runtime errors such as “division by zero”. If PostgreSQL evaluated such expressions before confirming the visibility of the respective entry, even invisible rows could cause such errors. To prevent this, PostgreSQL generally checks the visibility before evaluating such expressions.

There is one exception to this general rule. As the visibility cannot be checked while searching an index, operators that can be used for searching must always be safe to use. These are the operators that are defined in the respective operator class. If a simple comparison filter uses an operation from such an operator class, PostgreSQL can apply that filter before checking the visibility because it knows that these operators are safe to use. The crux is that only key columns have an operator class associated with them. Columns in the include clause don’t—filters based on them are not applied before their visibility is confirmed. This is my understanding from a thread on the PostgreSQL hackers mailing list.

For a demonstration, take the previous index and query:

CREATE INDEX idx
    ON sales ( subsidiary_id, ts )
     INCLUDE ( eur_value, notes )

SELECT *
  FROM sales
 WHERE subsidiary_id = ?
   AND notes LIKE '%search term%'

The execution plan—edited for brevity—could look like this:

               QUERY PLAN
----------------------------------------------
Index Scan using idx on sales (actual rows=16)
  Index Cond: (subsidiary_id = 1)
  Filter: (notes ~~ '%search term%')
  Rows Removed by Filter: 240
  Buffers: shared hit=54

The like filter is shown in Filter, not in Index Cond. That means it was applied at table level. Also, the number of shared hits is rather high for fetching 16 rows.

In a Bitmap Index/Heap Scan the phenomenon becomes more obvious.

                  QUERY PLAN
-----------------------------------------------
Bitmap Heap Scan on sales (actual rows=16)
  Recheck Cond: (subsidiary_id = 1)
  Filter: (notes ~~ '%search term%')
  Rows Removed by Filter: 240
  Heap Blocks: exact=52
  Buffers: shared hit=54
  -> Bitmap Index Scan on idx (actual rows=256)
       Index Cond: (subsidiary_id = 1)
       Buffers: shared hit=2

The Bitmap Index Scan does not mention the like filter at all. Instead it returns 256 rows—way more than the 16 that satisfy the where clause.

Note that this is not a particularity of the include column in this case. Moving the include columns into the index key gives the same result.

CREATE INDEX idx
    ON sales ( subsidiary_id, ts, eur_value, notes )

                  QUERY PLAN
-----------------------------------------------
Bitmap Heap Scan on sales (actual rows=16)
  Recheck Cond: (subsidiary_id = 1)
  Filter: (notes ~~ '%search term%')
  Rows Removed by Filter: 240
  Heap Blocks: exact=52
  Buffers: shared hit=54
  -> Bitmap Index Scan on idx (actual rows=256)
       Index Cond: (subsidiary_id = 1)
       Buffers: shared hit=2

This is because the like operator is not part of the operator class so it is not considered to be safe.

If you use an operation from the operator class, e.g. equals, the execution plan changes.

SELECT *
  FROM sales
 WHERE subsidiary_id = ?
   AND notes = 'search term'

The Bitmap Index Scan now applies all conditions from the where clause and only passes the remaining 16 rows on to the Bitmap Heap Scan.

                 QUERY PLAN
----------------------------------------------
Bitmap Heap Scan on sales (actual rows=16)
  Recheck Cond: (subsidiary_id = 1
             AND notes = 'search term')
  Heap Blocks: exact=16
  Buffers: shared hit=18
  -> Bitmap Index Scan on idx (actual rows=16)
       Index Cond: (subsidiary_id = 1
                AND notes = 'search term')
       Buffers: shared hit=2

Note that this requires the respective column to be a key column. If you move the notes column back to the include clause, it has no associated operator class so the equals operator is not considered safe anymore. Consequently, PostgreSQL postpones applying this filter to the table access until after the visibility is checked.

                 QUERY PLAN
-----------------------------------------------
Bitmap Heap Scan on sales (actual rows=16)
  Recheck Cond: (id = 1)
  Filter: (notes = 'search term')
  Rows Removed by Filter: 240
  Heap Blocks: exact=52
  Buffers: shared hit=54
  -> Bitmap Index Scan on idx (actual rows=256)
       Index Cond: (id = 1)
       Buffers: shared hit=2

“A Close Look at the Index Include Clause” by Markus Winand was originally published at Use The Index, Luke!.

Apr
29
2019
--

Canonical’s Mark Shuttleworth on dueling open-source foundations

At the Open Infrastructure Summit, which was previously known as the OpenStack Summit, Canonical founder Mark Shuttleworth used his keynote to talk about the state of open-source foundations — and what often feels like the increasing competition between them. “I know for a fact that nobody asked to replace dueling vendors with dueling foundations,” he said. “Nobody asked for that.”

He then put a point on this, saying, “what’s the difference between a vendor that only promotes the ideas that are in its own interest and a foundation that does the same thing? Or worse, a foundation that will only represent projects that it’s paid to represent.”

Somewhat uncharacteristically, Shuttleworth didn’t say which foundations he was talking about, but since there are really only two foundations that fit the bill here, it’s pretty clear that he was talking about the OpenStack Foundation and the Linux Foundation — and maybe more precisely the Cloud Native Computing Foundation, the home of the incredibly popular Kubernetes project.

It turns out, that’s only part of his misgivings about the current state of open-source foundations, though. I sat down with Shuttleworth after his keynote to discuss his comments, as well as Canonical’s announcements around open infrastructure.

One thing that’s worth noting at the outset is that the OpenStack Foundation is using this event to highlight the fact that it has now brought in more new open infrastructure projects outside of the core OpenStack software, with two of them graduating from their pilot phase. Shuttleworth, who has made big bets on OpenStack in the past and is seeing a lot of interest from customers, is not a fan. Canonical, it’s worth noting, is also a major sponsor of the OpenStack Foundation. He, however, believes the foundation should focus on the core OpenStack project.

“We’re busy deploying 27 OpenStack clouds — that’s more than double the run rate last year,” he said. “OpenStack is important. It’s very complicated and hard. And a lot of our focus has been on making it simpler and cleaner, despite the efforts of those around us in this community. But I believe in it. I think that if you need large-scale, multi-tenant virtualization infrastructure, it’s the best game in town. But it has problems. It needs focus. I’m super committed to that. And I worry about people losing their focus because something newer and shinier has shown up.”

To clarify that, I asked him if he essentially believes that the OpenStack Foundation is making a mistake by trying to be all things infrastructure. “Yes, absolutely,” he said. “At the end of the day, I think there are some projects that this community is famous for. They need focus, they need attention, right? It’s very hard to argue that they will get focus and attention when you’re launching a ton of other things that nobody’s ever heard of, right? Why are you launching those things? Who is behind those decisions? Is it a money question as well? Those are all fair questions to ask.”

He doesn’t believe all of the blame should fall on the Foundation leadership, though. “I think these guys are trying really hard. I think the common characterization that it was hapless isn’t helpful and isn’t accurate. We’re trying to figure stuff out.” Shuttleworth indeed doesn’t believe the leadership is hapless, something he stressed, but he clearly isn’t all that happy with the current path the OpenStack Foundation is on either.

The Foundation, of course, doesn’t agree. As OpenStack Foundation COO Mark Collier told me, the organization remains as committed to OpenStack as ever. “The Foundation, the board, the community, the staff — we’ve never been more committed to OpenStack,” he said. “If you look at the state of OpenStack, it’s one of the top-three most active open-source projects in the world right now […] There’s no wavering in our commitment to OpenStack.” He also noted that the other projects that are now part of the foundation are the kind of software that is helpful to OpenStack users. “These are efforts which are good for OpenStack,” he said. In addition, he stressed that the process of opening up the Foundation has been going on for more than two years, with the vast majority of the community (roughly 97 percent) voting in favor.

OpenStack board member Allison Randal echoed this. “Over the past few years, and a long series of strategic conversations, we realized that OpenStack doesn’t exist in a vacuum. OpenStack’s success depends on the success of a whole network of other open-source projects, including Linux distributions and dependencies like Python and hypervisors, but also on the success of other open infrastructure projects which our users are deploying together. The OpenStack community has learned a few things about successful open collaboration over the years, and we hope that sharing those lessons and offering a little support can help other open infrastructure projects succeed too. The rising tide of open source lifts all boats.”

As for open-source foundations in general, he also doesn’t believe that it’s a good thing to have numerous foundations competing over projects. He argues that we’re still trying to figure out the role of open-source foundations and that we’re currently in a slightly awkward position because we’re still trying to determine how best to organize them. “Open source in society is really interesting. And how we organize that in society is really interesting,” he said. “How we lead that, how we organize that is really interesting and there will be steps forward and steps backward. Foundations tweeting angrily at each other is not very presidential.”

He also challenged the notion that if you just put a project into a foundation, “everything gets better.” That’s too simplistic, he argues, because so much depends on the leadership of the foundation and how they define being open. “When you see foundations as nonprofit entities effectively arguing over who controls the more important toys, I don’t think that’s serving users.”

When I asked him whether he thinks some foundations are doing a better job than others, he essentially declined to comment. But he did say that he thinks the Linux Foundation is doing a good job with Linux, in large part because it employs Linus Torvalds. “I think the technical leadership of a complex project that serves the needs of many organizations is best served that way and something that the OpenStack Foundation could learn from the Linux Foundation. I’d be much happier with my membership fees actually paying for thoughtful, independent leadership of the complexity of OpenStack rather than the sort of bizarre bun fights and stuffed ballots that we see today. For all the kumbaya, it flatly doesn’t work.” He believes that projects should have independent leaders who can make long-term plans. “Linus’ finger is a damn useful tool and it’s hard when everybody tries to get reelected. It’s easy to get outraged at Linus, but he’s doing a fucking good job, right?”

OpenStack, he believes, often lacks that kind of decisiveness because it tries to please everybody and attract more sponsors. “That’s perhaps the root cause,” he said, and it leads to too much “behind-the-scenes puppet mastering.”

In addition to our talk about foundations, Shuttleworth also noted that he believes the company is still on the path to an IPO. He’s obviously not committing to a time frame, but after a year of resetting in 2018, he argues that Canonical’s business is looking up. “We want to be north of $200 million in revenue and a decent growth rate and the right set of stories around the data center, around public cloud and IoT.” First, though, Canonical will do a growth equity round.

Apr
29
2019
--

ZFS For MongoDB Backups

We have successfully used ZFS for MySQL® backups, and MongoDB® is no different. Normally, backups will be taken from a hidden secondary, either with mongodump, a WiredTiger hot backup, or filesystem snapshots. In the case of the latter, instead of LVM2 we will use ZFS, and discuss other potential benefits.

Preparation for initial snapshot

Before taking a ZFS snapshot, it is important to use db.fsyncLock(). This allows a consistent on-disk copy of the data by blocking writes. It gives the server the time it needs to commit the journal to disk before the snapshot is taken.

My MongoDB instance below is running on a ZFS volume, and we will take an initial snapshot.

revin@mongodb:~$ sudo zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
zfs-mongo        596M  9.04G    24K  /zfs-mongo
zfs-mongo/data   592M  9.04G   592M  /zfs-mongo/data
revin@mongodb:~$ mongo --port 28020 --eval 'db.serverCmdLineOpts().parsed.storage' --quiet
{
    "dbPath" : "/zfs-mongo/data/m40",
    "journal" : {
        "enabled" : true
    },
    "wiredTiger" : {
        "engineConfig" : {
            "cacheSizeGB" : 0.25
        }
    }
}
revin@mongodb:~$ mongo --port 28020 --eval 'db.fsyncLock()' --quiet
{
    "info" : "now locked against writes, use db.fsyncUnlock() to unlock",
    "lockCount" : NumberLong(1),
...
}
revin@mongodb:~$ sleep 0.6
revin@mongodb:~$ sudo zfs snapshot zfs-mongo/data@full
revin@mongodb:~$ mongo --port 28020 --eval 'db.fsyncUnlock()' --quiet
{
    "info" : "fsyncUnlock completed",
    "lockCount" : NumberLong(0),
...
}

Notice the addition of sleep before the snapshot in the commands above. This is to ensure that even with the maximum storage.journal.commitIntervalMs of 500ms we allow enough time to commit the data to disk. This is simply an extra layer of guarantee and may not be necessary if you have a very low journal commit interval.

revin@mongodb:~$ sudo zfs list -t all
NAME                  USED  AVAIL  REFER  MOUNTPOINT
zfs-mongo             596M  9.04G    24K  /zfs-mongo
zfs-mongo/data        592M  9.04G   592M  /zfs-mongo/data
zfs-mongo/data@full   192K      -   592M  -

Now I have a snapshot…

At this point, I have a snapshot I can use for a number of purposes.

  • Replicate a full and delta snapshot to a remote storage or region with tools like zrepl. This allows for an extra layer of redundancy and disaster recovery.
  • Use the snapshots to rebuild, replace or create new secondary nodes or refresh test/development servers regularly.
  • Use the snapshots to do point-in-time recovery. ZFS snapshots are relatively cost-free, so it is possible to take snapshots even at five-minute intervals! This is actually my favorite use case and feature.

Let’s say we take snapshots every five minutes. If a collection was accidentally dropped or even just a few documents were deleted, we can mount the last snapshot taken before this event. If the event was discovered in less than five minutes (perhaps that’s unrealistic), we only need to replay less than five minutes of oplog!
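
As a sketch of how that five-minute cadence could be automated, here is a hypothetical helper script (the file name, port, dataset and snapshot naming scheme are assumptions; it simply wraps the lock/sleep/snapshot/unlock sequence shown above and assumes sufficient privileges to run zfs, e.g. from root’s crontab):

#!/bin/sh
# snapshot-mongo.sh (hypothetical): lock writes, wait out the journal
# commit interval, take a ZFS snapshot, then unlock.
# Example cron entry to run it every five minutes:
#   */5 * * * * /usr/local/bin/snapshot-mongo.sh
PORT=28020
DATASET=zfs-mongo/data

mongo --port "$PORT" --eval 'db.fsyncLock()' --quiet
sleep 0.6
zfs snapshot "$DATASET@auto-$(date +%Y%m%d%H%M)" || echo "zfs snapshot failed" >&2
# Unlock in any case so the node does not stay blocked for writes.
mongo --port "$PORT" --eval 'db.fsyncUnlock()' --quiet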

Point-in-Time-Recovery

To start a PITR, first clone the snapshot. Cloning the snapshot like below will automatically mount it. We can then start a temporary mongod instance with this mounted directory.

revin@mongodb:~$ sudo zfs clone zfs-mongo/data@full zfs-mongo/data-clone
revin@mongodb:~$ sudo zfs list -t all
NAME                   USED  AVAIL  REFER  MOUNTPOINT
zfs-mongo              606M  9.04G    24K  /zfs-mongo
zfs-mongo/data         600M  9.04G   592M  /zfs-mongo/data
zfs-mongo/data@full   8.46M      -   592M  -
zfs-mongo/data-clone     1K  9.04G   592M  /zfs-mongo/data-clone
revin@mongodb:~$ ./mongodb-linux-x86_64-4.0.8/bin/mongod \
	--dbpath /zfs-mongo/data-clone/m40 \
	--port 28021 --oplogSize 200 --wiredTigerCacheSizeGB 0.25

Once mongod has started, I would like to find out the last oplog event it has completed.

revin@mongodb:~$ mongo --port 28021 local --quiet \
>     --eval 'db.oplog.rs.find({},{ts: 1}).sort({ts: -1}).limit(1)'
{ "ts" : Timestamp(1555356271, 1) }

We can use this timestamp to dump the oplog from the current production and use it to replay on our temporary instance.

revin@mongodb:~$ mkdir ~/mongodump28020
revin@mongodb:~$ cd ~/mongodump28020
revin@mongodb:~/mongodump28020$ mongodump --port 28020 -d local -c oplog.rs \
>     --query '{ts: {$gt: Timestamp(1555356271, 1)}}'
2019-04-16T23:57:50.708+0000	writing local.oplog.rs to
2019-04-16T23:57:52.723+0000	done dumping local.oplog.rs (186444 documents)

Assuming our bad incident occurred 30 seconds from the time this snapshot was taken, we can apply the oplog dump with mongorestore. Be aware, you’d have to identify this from your own oplog.

revin@mongodb:~/mongodump28020$ mv dump/local/oplog.rs.bson dump/oplog.bson
revin@mongodb:~/mongodump28020$ rm -rf dump/local
revin@mongodb:~/mongodump28020$ mongo --port 28021 percona --quiet --eval 'db.session.count()'
79767
revin@mongodb:~/mongodump28020$ mongorestore --port 28021 --dir=dump/ --oplogReplay \
>     --oplogLimit 1555356302 -vvv

Note that the oplogLimit above is 31 seconds past the snapshot’s timestamp. Since we want to apply the next 30 seconds of operations from the time the snapshot was taken, and mongorestore only replays oplog entries below the specified limit, oplogLimit takes a value just past the last timestamp we want applied.

2019-04-17T00:06:46.410+0000	using --dir flag instead of arguments
2019-04-17T00:06:46.412+0000	checking options
2019-04-17T00:06:46.413+0000		dumping with object check disabled
2019-04-17T00:06:46.414+0000	will listen for SIGTERM, SIGINT, and SIGKILL
2019-04-17T00:06:46.418+0000	connected to node type: standalone
2019-04-17T00:06:46.418+0000	standalone server: setting write concern w to 1
2019-04-17T00:06:46.419+0000	using write concern: w='1', j=false, fsync=false, wtimeout=0
2019-04-17T00:06:46.420+0000	mongorestore target is a directory, not a file
2019-04-17T00:06:46.421+0000	preparing collections to restore from
2019-04-17T00:06:46.421+0000	using dump as dump root directory
2019-04-17T00:06:46.421+0000	found oplog.bson file to replay
2019-04-17T00:06:46.421+0000	enqueued collection '.oplog'
2019-04-17T00:06:46.421+0000	finalizing intent manager with multi-database longest task first prioritizer
2019-04-17T00:06:46.421+0000	restoring up to 4 collections in parallel
...
2019-04-17T00:06:46.421+0000	replaying oplog
2019-04-17T00:06:46.446+0000	timestamp 6680204450717499393 is not below limit of 6680204450717499392; ending oplog restoration
2019-04-17T00:06:46.446+0000	applied 45 ops
2019-04-17T00:06:46.446+0000	done

After applying 45 oplog events, we can see that additional documents have been added to the percona.session collection.

revin@mongodb:~/mongodump28020$ mongo --port 28021 percona --quiet --eval 'db.session.count()'
79792

Conclusion

Because snapshots are immediately available, and because ZFS supports deltas, it is well suited to large datasets that other backup tools would otherwise take hours to back up.


Apr
29
2019
--

Mirantis makes configuring on-premises clouds easier

Mirantis, the company you may still remember as one of the biggest players in the early days of OpenStack, launched an interesting new hosted SaaS service today that makes it easier for enterprises to build and deploy their on-premises clouds. The new Mirantis Model Designer, which is available for free, lets operators easily customize their clouds — starting with OpenStack clouds next month and Kubernetes clusters in the coming months — and build the configurations to deploy them.

Typically, doing so involves writing lots of YAML files by hand, something that’s error-prone and that few developers love. Yet that’s exactly what’s at the core of the infrastructure-as-code model. Model Designer, on the other hand, takes what Mirantis learned from its highly popular Fuel installer for OpenStack and takes it a step further. The Model Designer, which Mirantis co-founder and CMO Boris Renski demoed for me ahead of today’s announcement, presents users with a GUI that walks them through the configuration steps. What’s smart here is that every step has a difficulty level (modeled after Doom’s levels, ranging from “I’m too young to die” to “ultraviolence” — though it’s missing Doom’s “nightmare” setting), which you can choose based on how much you want to customize the setting.

Model Designer is an opinionated tool, but it does give users quite a bit of freedom, too. Once the configuration step is done, Mirantis actually takes the settings and runs them through its Jenkins automation server to validate the configuration. As Renski pointed out, that step can’t take into account all of the idiosyncrasies of every platform, but it can ensure that the files are correct. After this, the tool provides the user with the configuration files, and actually deploying the OpenStack cloud is then simply a matter of taking the files, together with the core binaries that Mirantis makes available for download, to the on-premises cloud and executing a command-line script. Ideally, that’s all there is to the process. At this point, Mirantis’ DriveTrain tools take over and provision the cloud. For upgrades, users simply have to repeat the process.

Mirantis’ monetization strategy is to offer support, which ranges from basic support to fully managing a customer’s cloud. Model Designer is yet another way for the company to make more users aware of itself and then offer them support as they start using more of the company’s tools.
