Jul 31, 2023

Talking Drupal #409 – Data Lakes

Today we are talking about Data Lakes with Melissa Bent & April Sides.

For show notes visit: www.talkingDrupal.com/409

Topics

  • What is a data lake
  • Does it have to be NoSQL
  • How do organizations use data lakes
  • How does Red Hat use the data lake
  • How do you maintain it
  • How do you make changes to the data lake
  • Who manages Mongo
  • How big does it have to be to be considered a data lake
  • Why not Solr
  • What Drupal modules
  • Communication of changes
  • Gotchas?

Resources

Guests

  • Melissa Bent – linkedin.com/in/melissabent – merauluka
  • April Sides – weekbeforenext

Hosts

  • Nic Laflin – www.nLighteneddevelopment.com – nicxvan
  • John Picozzi – www.epam.com – johnpicozzi
  • Tim Plunkett – timplunkett

MOTW Correspondent

Martin Anderson-Clutz – @mandclu

Tagify

  • Brief description:
    • Provides an entity reference widget that’s more user friendly: visually styled as tags (without showing the reference ID), drag to sort, and more
  • Brief history
    • How old: created in Mar 2023
  • Versions available:
    • 1.0.19, which works with Drupal 8.8+, 9, and 10
  • Maintainership
    • Actively maintained, latest release in the past week
  • Number of open issues:
    • 4, one of which is a bug
  • Usage stats:
    • 177 sites
  • Maintainer(s):
    • gxleano (David Galeano), who I got to meet in person at Drupal Dev Days
  • Module features and usage
    • Tagify is a popular JS library, so this module is a Drupal integration for that
    • Features in the module include deactivating labels when the field’s max number of entries has been reached, allowing the creation of new tags when the field has been configured to allow that, and so on
    • Will automatically disallow duplicate tags
    • Includes a User List submodule specifically for user reference fields, which also shows the user’s profile pic in the tag
    • Project page has animated GIFs that demonstrate how many of these features work
    • A module I started using on my own blog; nice and simple UX. I could see drag-to-sort being really useful, for example if you wanted the first term reference to be used in a pathauto pattern
Jul 31, 2023

Dynamic SQL Workaround in MySQL: Prepared Statements

Dynamic SQL

Dynamic SQL is a desirable feature that allows developers to construct and execute SQL statements dynamically at runtime. While MySQL lacks built-in support for dynamic SQL, this article presents a workaround using prepared statements. We will explore leveraging prepared statements to achieve dynamic query execution, parameterized queries, and dynamic table and column selection.

Understanding prepared statements

Prepared statements refer to the ability to construct SQL statements dynamically at runtime rather than writing them statically in the code. This provides flexibility in manipulating query components, such as table names, column names, conditions, and sorting. The EXECUTE and PREPARE statements are key components for executing dynamic SQL in MySQL.

Example usage: Let’s consider a simple example where we want to construct a dynamic SELECT statement based on a user-defined table name and value:

-- Store the table name and the lookup value in user variables
SET @table_name := 't1';
SET @value := '123';
-- Identifiers (like the table name) must be concatenated into the SQL string;
-- only data values can be bound through ? placeholders.
-- COLUMN is a reserved word in MySQL, so a placeholder column name col1 is used here.
SET @sql_query := CONCAT('SELECT * FROM ', @table_name, ' WHERE col1 = ?');

PREPARE dynamic_statement FROM @sql_query;
EXECUTE dynamic_statement USING @value;
DEALLOCATE PREPARE dynamic_statement;

In this example, we use the CONCAT function to construct the dynamic SQL statement. The table name and value are stored in variables and concatenated into the SQL string.
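The same pattern extends to dynamic column selection and sorting. Below is a minimal sketch (the sort column col1 is hypothetical) that builds the ORDER BY clause at runtime; identifiers cannot be passed through ? placeholders, so they must be validated and concatenated into the string:

SET @sort_column := 'col1';
SET @sql_query := CONCAT('SELECT * FROM t1 ORDER BY ', @sort_column, ' DESC LIMIT 10');

PREPARE dynamic_statement FROM @sql_query;
EXECUTE dynamic_statement;
DEALLOCATE PREPARE dynamic_statement;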

Benefits and features

  • Prepared statements can be used both as a standalone SQL statement and inside stored procedures, providing flexibility in different contexts.
  • Support for various SQL statements: Many SQL statements can be executed as prepared statements, including DROP DATABASE, TRUNCATE TABLE, FLUSH TABLES, and KILL. This allows for dynamic execution of diverse operations.
  • Usage of Stored Procedure Variables: Stored procedure variables can be incorporated into the dynamic expression, enabling dynamic SQL based on runtime values.

Let’s look at another scenario:

Killing queries for a specific user:

DELIMITER $$
CREATE PROCEDURE kill_all_for_user(user_connection_id INT)
BEGIN
  SET @sql_statement := CONCAT('KILL ', user_connection_id);
  PREPARE dynamic_statement FROM @sql_statement;
  EXECUTE dynamic_statement;
  DEALLOCATE PREPARE dynamic_statement;
END $$
DELIMITER ;

In this case, the prepared statement dynamically constructs a KILL statement to terminate the connection, and any running query, for the given user connection ID.
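A hypothetical usage: look up the connection ID for the user in question, then pass it to the procedure (12345 is a placeholder ID taken from the first query):

SELECT id, user FROM information_schema.processlist WHERE user = 'app_user';
CALL kill_all_for_user(12345);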

Conclusion

You might use prepared statements to make dynamic queries, but dynamic queries can definitely make debugging more challenging. You should consider implementing some additional testing and error handling to help mitigate this issue. That could help you catch any issues with the dynamic queries early on in the development process.
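As a minimal sketch of that extra error handling (the procedure name is hypothetical), an EXIT handler can catch failures from the dynamically built statement and report them in a controlled way rather than letting the call fail opaquely:

DELIMITER $$
CREATE PROCEDURE run_dynamic(IN sql_text TEXT)
BEGIN
  DECLARE EXIT HANDLER FOR SQLEXCEPTION
  BEGIN
    -- Log or re-signal with a clearer message here
    SELECT CONCAT('Dynamic statement failed: ', sql_text) AS error_message;
  END;
  -- PREPARE only accepts a user variable or string literal, so copy the parameter first
  SET @dyn_sql := sql_text;
  PREPARE dynamic_statement FROM @dyn_sql;
  EXECUTE dynamic_statement;
  DEALLOCATE PREPARE dynamic_statement;
END $$
DELIMITER ;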

Percona Monitoring and Management is a best-of-breed open source database monitoring solution. It helps you reduce complexity, optimize performance, and improve the security of your business-critical database environments, no matter where they are located or deployed.

 

Download Percona Monitoring and Management Today

Jul 31, 2023

Should You Keep Your Business Logic In Your Database?


Open source database architects usually do not implement business logic in their databases. This is in stark contrast to many commercial databases, where this is a common practice. In the first case, all the heuristics are kept at the application layer, and the database has little or no effect on data quality. In the second case, the rules required by management are handled by the database itself rather than relying on the software developers to be fully cognizant of those rules.

Part of this stance in the open source world comes from the historical lack of tooling for business logic in the MySQL universe. Constraint checks that actually do what they are supposed to do are a relatively recent addition to MySQL, found only in 8.0.16 or later, which is another reason to upgrade. Previous MySQL versions checked the syntax but did not perform the required work for a constraint check. But now that the tooling exists, it may be time to reexamine the traditional stance. PostgreSQL has long featured the ability to do what you will see in the following examples, and MongoDB also has some capabilities in this area. So the capabilities are there if you are inclined to move your business logic to your database layer.

Keeping bad data out of a database is much less expensive in time and money than correcting it after it is snugly placed in your tables. You may already use some of the following tools to filter your data. But you can apply more rigor in assuring your data is pristine before it becomes a row or a document.

Some examples? There are many ways to protect your data, but we will start with ENUMs and VIEWs, as they have been around in MySQL for a very long time.

ENUMs

An ENUM is a string object with a value chosen from a list of permitted values that are enumerated explicitly in the column specification at table creation time. – MySQL Manual

The concept of the ENUM is great. You set up predefined values that you want to include in your tables, kick out data that does not match those predefined values, and the data is compliant with what you need. MongoDB, MySQL, and PostgreSQL all accommodate ENUMs. In the following example, an ENUM is created that will take only three values and reject other values, supposedly.

SQL > CREATE TABLE enum_example (id int, is_good ENUM ('yes','no','maybe'));
Query OK, 0 rows affected (0.0188 sec)
SQL > INSERT INTO enum_example (id, is_good) values (1,'yes');
Query OK, 1 row affected (0.0066 sec)
SQL > INSERT INTO enum_example (id, is_good) values (2,'no');
Query OK, 1 row affected (0.0034 sec)
SQL > INSERT INTO enum_example (id, is_good) values (3,'unknown');
ERROR: 1265: Data truncated for column 'is_good' at row 1
SQL > show warnings;
+-------+------+----------------------------------------------+
| Level | Code | Message |
+-------+------+----------------------------------------------+
| Error | 1265 | Data truncated for column 'is_good' at row 1 |
+-------+------+----------------------------------------------+
1 row in set (0.0009 sec)

In the above example, all is good when we input the predefined values, and, as expected, the row with non-compliant data is rejected outright. But what happens when we try everybody’s favorite value of NULL?

SQL > INSERT INTO enum_example (id, is_good) values (3,NULL);
Query OK, 1 row affected (0.0040 sec)
SQL > SELECT * FROM enum_example;
+----+---------+
| id | is_good |
+----+---------+
| 1 | yes |
| 2 | no |
| 3 | NULL |
+----+---------+
3 rows in set (0.0027 sec)

Dang! NULL is not one of our predefined good values, and yet there it sits, fat and happy, in our database table. This is not what we want, either logically or as data. The NULL problem can be avoided by declaring the ENUM column as NOT NULL (strict SQL mode on its own will not keep NULLs out of a nullable column).
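For example, a quick sketch using a second table so the earlier example stays intact:

SQL > CREATE TABLE enum_example_nn (id int, is_good ENUM ('yes','no','maybe') NOT NULL);
SQL > INSERT INTO enum_example_nn (id, is_good) values (4,NULL);
ERROR: 1048: Column 'is_good' cannot be null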

Another reason ENUMs are not extremely popular is that they are tricky to sort, because ENUMs are sorted by their index number. In the above example, ‘yes’ would sort first, ‘no’ second, and ‘maybe’ third. That can be tricky if that order is not good for you or if you need to sort the values alphabetically.

SQL > SELECT id, is_good FROM enum_example ORDER BY is_good;
+----+---------+
| id | is_good |
+----+---------+
| 3 | NULL |
| 1 | yes |
| 2 | no |
+----+---------+
3 rows in set (0.0012 sec)

And NULL again does what NULLs do and gets sorted first. The concept of ENUMs is pretty spiffy, but the implementation can cause headaches. This does not mean that you should not use them, but you need to be aware of their shortcomings and make accommodations for them.
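If you do need alphabetical ordering, one accommodation is to cast the ENUM to a string in the ORDER BY clause, which sorts by the value instead of the index number (NULLs will still sort first):

SQL > SELECT id, is_good FROM enum_example ORDER BY CAST(is_good AS CHAR);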

VIEWS

You may not think of views as part of the realm of business logic, but they are valuable for obfuscating data from prying eyes. Views can be treated like tables, and you can define them so that the permissions of the person using the view are much lower than those of the creator of the view. The original table has restricted access for approved users only, and the view can be established so that the caller has indirect access to some of that table.

In the following example, we have a table with a secret column. The various permissions to restrict access are omitted for clarity. This table is created so that only selected accounts can see that secret column.

SQL > CREATE TABLE secret_stuff (id int, secret int);
Query OK, 0 rows affected (0.0412 sec)
SQL > INSERT INTO secret_stuff (id, secret) VALUES (1,1),(2,2);
Query OK, 2 rows affected (0.0062 sec)

Records: 2 Duplicates: 0 Warnings: 0
SQL > SELECT * FROM secret_stuff;
+----+--------+
| id | secret |
+----+--------+
| 1 | 1 |
| 2 | 2 |
+----+--------+
2 rows in set (0.0011 sec)

A view is then created. The user of this view has no direct access to the secret column but can get to other less protected data in the table.

SQL > CREATE VIEW see_secret AS SELECT id, 'XYZ' FROM secret_stuff;
Query OK, 0 rows affected (0.0084 sec)
SQL > select * from see_secret;
+----+-----+
| id | XYZ |
+----+-----+
| 1 | XYZ |
| 2 | XYZ |
+----+-----+
2 rows in set (0.0045 sec)

The secret column in the table is protected: users of the view cannot see its name, let alone access its data.
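A minimal sketch of the permission side (the account name and the mydb schema are placeholders): grant the restricted user access to the view only, never to the underlying table. Because MySQL views default to SQL SECURITY DEFINER, the view runs with the privileges of its creator.

SQL > CREATE USER 'report_user'@'%' IDENTIFIED BY 'a_strong_password';
SQL > GRANT SELECT ON mydb.see_secret TO 'report_user'@'%';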

VIEWS with Data Masking in Percona Server for MySQL

A cool feature of Percona Server for MySQL is the ability to mask data. This way, part of the secret column is kept away from the user of the view, such as the last four numbers of a phone number or some identification number.

More details can be found in Data Masking With Percona Server for MySQL – An Enterprise Feature at a Community Price.
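As a rough sketch of the idea, assuming the data masking plugin is installed and using its mask_inner() function (the customers table and phone column are hypothetical), a view could hide everything but the last four digits of a phone number:

SQL > CREATE VIEW masked_customers AS
      SELECT id, mask_inner(phone, 0, 4) AS phone FROM customers;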

Check constraints

MySQL 8.0.16 finally delivered working check constraints. In earlier versions, they were parsed but ignored. That caused much gnashing of teeth by MySQL DBAs, and fans of other databases would justifiably point out this problem while making ‘toy database’ jokes. But for the past several years, check constraints have been available, if under-utilized.

There are two types of constraints: one for columns and one for tables. In the next example, a constraint is established on column b to ensure it has a value greater than one. Note that the constraint is named b_gt_1 and that a constraint name can be used only once per schema. It is highly recommended that you name your constraints to make them easier to identify when troubles arise.

SQL > CREATE TABLE cc (a INT, 
                       b INT CONSTRAINT b_gt_1 CHECK (b > 1));
Query OK, 0 rows affected (0.0189 sec)
SQL > INSERT INTO cc (a,b) VALUES (1,2);
Query OK, 1 row affected (0.0064 sec)
SQL > INSERT INTO cc (a,b) VALUES (2,0);
ERROR: 3819: Check constraint 'b_gt_1' is violated.

In the next example, the last constraint defined is on the table, and two columns are compared.

SQL > CREATE TABLE cc2 (a INT, 
                        b INT CONSTRAINT b_gt_2 CHECK (b > 2), 
                        check (a > b));
Query OK, 0 rows affected (0.0216 sec)
SQL > INSERT INTO cc2 VALUES (1,2);
ERROR: 3819: Check constraint 'b_gt_2' is violated.
SQL > INSERT INTO cc2 VALUES (1,3);
ERROR: 3819: Check constraint 'cc2_chk_1' is violated.

Notice in the above that when the table constraint is violated, the server has assigned it the name ‘cc2_chk_1’. When such a constraint fails and the error message pops up, you are forced to go hunting for the unnamed constraint behind that generated name, in this case the first one on the table. Multiple unnamed constraints can be a pain to track down.

Column and table constraints can be combined and made as complicated as you are willing to make them.
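For instance, here is a sketch (the table and columns are hypothetical) that mixes a named column constraint with a named table-level constraint expressing a simple business rule:

SQL > CREATE TABLE orders_cc (order_total DECIMAL(10,2) CONSTRAINT total_not_negative CHECK (order_total >= 0),
                              order_date DATE,
                              ship_date DATE,
                              CONSTRAINT ship_after_order CHECK (ship_date IS NULL OR ship_date >= order_date));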

Triggers

The use of triggers can fix a lot of problems. They can be executed before or after a row is inserted, updated, or deleted, which allows, for example, the affected data to be saved in another table. Why is that important? It allows you to create events around changes in the data. Maybe when a customer deletes their account, you want to add their email address to the ‘do not bulk email’ table, record the timestamp of their leaving in another table, and check on any orders of theirs still being processed.

The following example is an audit trail of when customer representatives change on a customer trouble ticket account. For some reason, management wants to see how often the current customer representative changes. For reasons of clarity, timestamp columns are omitted in the tables below.

The first table is for the trouble tickets.

SQL > CREATE TABLE ticket (id int, customer int, rep_current char(10));
Query OK, 0 rows affected (0.0205 sec)

The next table is the log for changes in those trouble tickets.

SQL > create table ticket_log (id int, customer int, representative char(10));
Query OK, 0 rows affected (0.0156 sec)

Now a trigger is set up so that changes in the ticket table get stored in the ticket_log table.

SQL> DELIMITER $$
SQL > CREATE TRIGGER ticket_change AFTER UPDATE ON ticket FOR EACH ROW
-> BEGIN
-> INSERT INTO ticket_log VALUES (OLD.id, OLD.customer, NEW.rep_current);
-> END $$
Query OK, 0 rows affected (0.0088 sec)
SQL> DELIMITER ;

SQL > UPDATE ticket SET rep_current='Mary' WHERE id=10;
Query OK, 1 row affected (0.0034 sec)

Rows matched: 1 Changed: 1 Warnings: 0
SQL > select * from ticket_log;
+----+----------+----------------+
| id | customer | representative |
+----+----------+----------------+
| 10 | 234 | Mary |
+----+----------+----------------+
1 row in set (0.0009 sec)

Depending on your database, you can have multiple triggers on your table, specify the order of their execution, and define the account that uses the trigger when you need extra data security. Examples of logical checks that could be implemented at this level are customer credit limits, minimum order policies, and stock-on-hand reorder quantity checks that let a business keep humming along.
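As a sketch of such a check (the orders and customers tables and their columns are hypothetical), a BEFORE INSERT trigger can reject an order that would push a customer past their credit limit:

SQL > DELIMITER $$
SQL > CREATE TRIGGER check_credit_limit BEFORE INSERT ON orders FOR EACH ROW
-> BEGIN
-> DECLARE available DECIMAL(10,2);
-> SELECT credit_limit - balance INTO available FROM customers WHERE id = NEW.customer_id;
-> IF NEW.order_total > available THEN
-> SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Order exceeds customer credit limit';
-> END IF;
-> END $$
SQL > DELIMITER ;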

Sadly for MongoDB users, triggers are only available on the Atlas platform.

Stored procedures

PostgreSQL has fantastic support for stored procedures, while MongoDB has a ‘sort of, kind of’ equivalent, and MySQL has minimal support. A stored procedure can be more secure than ad-hoc queries, can sometimes reduce network traffic, and can keep the code centralized on the database server.
Do We Want The Smarts In The Code Or The Database?
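To make that concrete in MySQL terms, here is a minimal sketch (the orders table is hypothetical) of a procedure that keeps an ordering rule, such as the free-shipping threshold mentioned below, next to the data:

SQL > DELIMITER $$
SQL > CREATE PROCEDURE place_order(IN p_customer_id INT, IN p_total DECIMAL(10,2))
-> BEGIN
-> -- Business rule lives with the data: orders of $50 or more ship free
-> INSERT INTO orders (customer_id, order_total, free_shipping)
-> VALUES (p_customer_id, p_total, p_total >= 50.00);
-> END $$
SQL > DELIMITER ;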

Conclusion

As previously mentioned, most open source database practitioners keep the business logic in their code. That is a practical and reasonable approach as long as all those doing the coding understand the rules and properly implement them. But as staff grows, that knowledge may not be passed on properly. When this happens, costly mistakes can be made, laws unintentionally violated, and rows converted to useless gibberish.

Putting the business logic in the database makes sense in many places. If you sell products only to adults, you will want to check the customer’s age. Promotional rules, like a $50 minimum order value for free shipping or a 15% discount when ordering ten or more, are easy to implement. Do you have enough stock on hand to ship immediately, or do you need to provide an estimated delivery date from a supplier’s warehouse? That logic is not spread out over several applications but is kept at the database level as a firewall, an arbitrator, and a sole reference.

In the modern world, where the number of databases is mushrooming and the only thing growing faster is the number of projects using those databases, relying on an ever-expanding codebase to enforce business rules and logic is not practical. Ask yourself whether a change such as an increase in postal shipping rates is best handled by searching a voluminous codebase, updating programs, and deploying new code with a high probability of missing some applications, or by making one change at the database level.

Now is the time to start looking at your systems to see where you can use some of the above techniques to start moving some of your business logic to the database. There will be cases where having your database do the work is the obvious solution and the sole implementation of the business logic.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

 

Try Percona Distribution for MySQL today!

Jul 27, 2023

LTS and Innovation Releases for Percona Server for MySQL


On July 18th, Oracle released its first pair of MySQL LTS and Innovation releases. (For more, check out A Quick Peek at MySQL 8.0.34 and MySQL 8.1.0.) These releases were announced several months ago and gradually detailed closer to the release date. Today, we know from Oracle’s official blog post what to expect, as well as what the cadence and scope of the upcoming releases will be.

So, the next immediate question that comes to mind is: Is Percona going to follow the same release pattern as “upstream” MySQL?

The short answer is “yes.”

We are proud to say that over the last several years, Percona delivered on its promise of providing the MySQL community a drop-in replacement that offers enterprise features without vendor lock-in.

The key word is “drop-in,” which means that over the years, we have made an effort to keep Percona Server for MySQL fully compatible with its upstream counterpart from Oracle. In simple terms, you can replace the binaries of MySQL Community Edition with Percona Server for MySQL, and, provided that you use the same version (major and minor), Percona software will smoothly pick up where the Community Edition left off.

Our goal is to continue on that path. That means you can expect to see the equivalent releases of MySQL following the patterns of LTS and Innovation releases and following the same support lifecycle. You can expect these releases to be available within several weeks of the upstream releases.

That leads to another question: Will other components of Percona Distribution for MySQL follow the same release patterns?

Percona Distribution for MySQL is a compilation of several open source products, the most notable being Percona Server for MySQL, Percona XtraBackup, and Percona XtraDB Cluster. We are determined to make sure every element of Percona Distribution for MySQL is fully compatible with both LTS and Innovation releases.

At the moment, we are unclear whether it will be necessary to issue separate new versions of the packages for each Innovation release. This will greatly depend on the level of backward-compatibility-breaking changes introduced in each release. But at the very least, we will make it clear which LTS and Innovation releases Percona Distribution for MySQL is compatible with.

 

Learn more about Percona Server for MySQL

Jul 27, 2023

A Quick Peek at MySQL 8.0.34 and MySQL 8.1.0


This is a Quick Peek at what Oracle delivered with MySQL 8.0.34 and 8.1.0 on July 17th, 2023. MySQL previously released software quarterly as part of a continuous delivery system where new features were made available. Many people loved access to the new features even if they could be risky, but others wanted stability and bug fixes only. Now there is a long-term support (LTS) version for the more conservative and the risk-averse folk that will have a roughly two-year lifespan between major releases.

The release with the new features will be called the Innovation series, and MySQL 8.1.0 is the first of these releases. And the LTS will be based on 8.0.34. Please note that comments in italics are mine and do not reflect the views of Percona and possibly not anyone else.

Quick reminder

The End of Life for MySQL 8.0 is in April of 2026, and the End of Life for MySQL 5.7 is in October of 2023. Tempus Fugit!

So what is new? 8.0.34 is an evolution of the MySQL server and related software. 8.1.0 is an evolution of 8.0.34, so there is a lot of overlap in the release notes. 8.0.34 release notes and 8.1 release notes.

What is new in ’34?

You can now set the percentage of characters in a new password that have to differ from the previous password with the validate_password.changed_character_percentage variable. I need to experiment with this, as it could be a frustration generator at large scale under the right circumstances.

Deprecations? The mysqlpump client utility is deprecated and will be removed. The sync_relay_log_info, binlog_format, and group_replication_recovery_complete_at system variables are deprecated and will be removed. And lz4_decompress and zlib_decompress are deprecated, too, as part of deprecating mysqlpump. Mysqlpump has a small number of devotees but never seemed to be taken to heart by the larger community.

CURRENT_USER(), USER(), SESSION_USER(), and SYSTEM_USER() can now be used as default values when defining a table. This will be handy for auditing data.
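For example, a sketch of an audit-friendly table (the table itself is hypothetical; note that function defaults need to be wrapped in parentheses as expression defaults):

CREATE TABLE audit_example (
    id INT PRIMARY KEY,
    note VARCHAR(255),
    created_by VARCHAR(288) DEFAULT (CURRENT_USER())  -- 288 accommodates user@host
);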

8.0.34 has 131 bug fixes, and 8.1.0 lists 199 bug fixes. Most of these are really minor, such as updating the versions of bundled libraries.

So what is new in 8.1.0?

8.1.0 is based on 8.0.34, so along with the above changes, you get some additional material. The mysql_native_password authentication plugin is deprecated and subject to removal in a future version of MySQL. While this is a great idea for security reasons, there will be many more posts on websites like Stack Overflow bemoaning that old code and connectors cannot connect to the new version of the database.

8.1 allows EXPLAIN FORMAT=JSON to output its data into a user variable. This could be handy for tracking query tuning and performance auditing.

SQL > EXPLAIN FORMAT=JSON INTO @myvar SELECT * FROM City WHERE District='Texas';
Query OK, 0 rows affected (0.0043 sec)
SQL > SELECT @myvar\G
*************************** 1. row ***************************
@myvar: {
 "query_block": {
  "select_id": 1,
  "cost_info": {
   "query_cost": "428.50"
  },
  "table": {
    "table_name": "City",
    "access_type": "ALL",
    "rows_examined_per_scan": 4035,
    "rows_produced_per_join": 403,
    "filtered": "10.00",
    "cost_info": {
      "read_cost": "388.15",
      "eval_cost": "40.35",
      "prefix_cost": "428.50",
     "data_read_per_join": "97K"
    },
   "used_columns": [
      "ID",
      "Name",
      "CountryCode",
      "District",
      "Population"
    ],
    "attached_condition": "(`world`.`city`.`District` = 'Texas')"
   }
 }
}
1 row in set (0.0008 sec)

Query comments are no longer stripped.

The default for SOURCE_RETRY_COUNT is now 10 for CHANGE REPLICATION SOURCE TO. This means that by default, with SOURCE_CONNECT_RETRY at its default of 60, the replica tries to reconnect at 60-second intervals for 10 minutes before timing out and failing over.

Before the election of a new primary, the group_replication_set_as_primary() function now waits for ongoing DDL statements such as ALTER TABLE when waiting for all transactions to complete. The statements include ALTER TABLE, ANALYZE TABLE, CACHE INDEX, CHECK TABLE, CREATE INDEX, CREATE TABLE, DROP INDEX, LOAD INDEX, OPTIMIZE TABLE, REPAIR TABLE, TRUNCATE TABLE, DROP TABLE, and any open cursors. It seems like a lot of work to do while servers are failing over, but it should leave the data in a better state.

Do you have suspicious group replication servers showing up? A new MEMBER_FAILURE_SUSPICIONS_COUNT column has been added to the Performance Schema replication_group_communication_information table. The contents of this column are formatted as a JSON array whose keys are group members' IDs and whose values are the number of times each group member has been considered suspect. Suspicious servers?

Several new status variables have been added to group replication to improve the diagnosis and troubleshooting of network instabilities, providing statistics about network usage, control messages, and data messages for each group member. This will be very handy.

8.1.0 introduces a number of new messages that are written to the MySQL error log during shutdown to aid in troubleshooting in the event of an excessively long server shutdown, including those listed here:

  • Startup and shutdown log messages for the MySQL server, including when it has been started with --initialize.
  • Log messages showing the start and end of shutdown phases for plugins and for components.
  • Start-of-phase and end-of-phase log messages for connection closing phases.
  • Log messages showing the number and IDs of threads still alive after being forcibly disconnected and potentially causing a wait.

Support is dropped for Enterprise Linux 6 (and associated glibc 2.12 generic), SUSE 12, Debian 10, MacOS 12, Ubuntu 18.04 and 20.04, Windows 10 and Server 2012R2; and 32-bit versions are no longer built.

You can see the parse tree for selects if you build the server with debug mode enabled. This is not for production.

Server startup options can not be NULL. There are exceptions listed in the manual for some options that do not need to be specified.

And using an unquoted identifier beginning with a dollar sign and containing one or more dollar signs (in addition to the first one) now generates a syntax error.

Conclusion

I like the idea of the long-term support edition, as too many have been caught out on some of the tweaks in the quarterly releases. This should add stability to production environments and make life simpler for many.

The announcement of 8.1 was long anticipated, and new features are always interesting and, hopefully, helpful. Seeing 8.0 become bug-fix-only for the next few years until its EOL date seems a little bittersweet.

For those wondering when Percona’s 8.0.34 and 8.1.0 releases will become available, please have a little patience. Our engineers need to test, optimize, add the enterprise features, and then get it all ready for your use.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

 

Try Percona Distribution for MySQL today!

Jul 26, 2023

Ten Recommendations for Running Your MongoDB Environment


MongoDB is a non-relational document database that provides support for JSON-like storage. It provides a flexible data model allowing you to easily store unstructured data. First released in 2009, it is the most used NoSQL database and has been downloaded more than 325 million times.

MongoDB is popular with developers as it is easy to get started with. Over the years, MongoDB has had many features introduced that have turned the database into a robust solution able to store terabytes of data for applications.

As with any database, developers and DBAs working with MongoDB should look at how to optimize the performance of their database, especially nowadays with cloud services, where each byte processed, transmitted, and stored costs money. The ability to get started so quickly with MongoDB means that it is easy to overlook potential problems or miss out on simple performance improvements. In this piece, we’ll look at ten essential techniques you can apply to make the most of MongoDB for your applications.

Tip #1:

Always enable authorization for your production environments

Securing your database from the beginning prevents unauthorized access and security breaches. This is especially important in today’s times of escalated hacking attempts.

The bigger the database, the bigger the damage from a leak. There have been numerous data leaks due to the simple fact that authorization and authentication are disabled by default when deploying MongoDB for the first time. While it is not a performance tip, it is essential to enable authorization and authentication right from the start, as it will save you any potential pain over time due to unauthorized access or data leakage.

When you deploy a new instance of MongoDB, the instance has no user, password, or access control by default. In recent MongoDB versions, the default IP binding changed to 127.0.0.1, and a localhost exception was added, which reduced the potential for database exposure when installing it. 

However, this is still not ideal from a security perspective. The first piece of advice is to create the admin user and restart the instance again with the authorization option enabled. This prevents any unauthorized access to the instance. 

To create the admin user:

> use admin
switched to db admin
> db.createUser({
...   user: "zelmar",
...   pwd: "password",
...   roles : [ "root" ]
... })
Successfully added user: { "user" : "zelmar", "roles" : [ "root" ] }

Then, you need to enable authorization and restart the instance. If you are deploying MongoDB from the command line:

mongod --port 27017 --dbpath /data/db --auth

Or if you are deploying MongoDB using a config file, you need to include:

security:
    authorization: "enabled"
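After the restart, clients must authenticate. For example, a sketch of connecting with mongosh as the admin user created above (you will be prompted for the password):

mongosh --port 27017 -u zelmar -p --authenticationDatabase admin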

 

Tip #2:

Always upgrade to the latest patch set for any given major version

Sometimes companies have to stay in versions that have gone EOL (End-of-Life) due to legacy application or driver dependency issues. It is recommended to upgrade your stack as soon as possible to take advantage of improvements and any security-related software updates.

It is critically important not to use versions that are labeled with “critical issues” or “not recommended for production use” warnings. Those releases have been found to include bugs that can cause serious performance impacts or issues with data integrity, potentially causing data corruption or data loss.

It should seem obvious, but one of the most common issues we see with production instances is developers running a MongoDB version that is actually not suitable for production in the first place. This can be because the version is outdated, such as a retired version that should be updated to a newer iteration containing all the necessary bug fixes. Alternatively, the version may be too new and not yet tested enough for production use.

This can be due to a few different reasons. As developers, we are normally keen to use our tools’ latest and greatest versions. We also want to be consistent over all the stages of development, from initial build and test through to production, as this decreases the number of variables we have to support, the potential for issues, and the cost to manage all those instances.

For some, this can mean using versions that are not signed off for production deployment yet. For others, it can mean sticking with a specific version that is tried and trusted. This is a problem from a troubleshooting perspective when an issue is fixed in a later version of MongoDB that is approved for production but has not been deployed yet. Alternatively, you may forget about that database instance that is ‘just working’ in the background and miss when you need to implement a patch!

In response to this, you should regularly check if your version is suitable for production using the release notes of each version. For example, MongoDB 5.0 provides the following guidance in its release notes: https://www.mongodb.com/docs/upcoming/release-notes/5.0/ 

The guidance here would be to use MongoDB 5.0.11, as this has the required updates in place. If you don’t update to this version, you will run the risk of losing data.

While it might be tempting to stick with one version, keeping up with upgrades is essential to prevent problems in production. You may want to take advantage of newly added features, but you should put these through your test process to see if there are any problems that might affect your overall performance before moving into production too.

Lastly, you must check the MongoDB Software Lifecycle Schedules and anticipate the upgrades of your clusters before the end of life of each version: https://www.mongodb.com/support-policy/lifecycles

End of life versions do not receive patches, bug fixes, or any kind of improvements. This can leave your database instances exposed and vulnerable.

From a performance perspective, getting the right version of MongoDB for your production applications involves being ‘just right’ – not too near the bleeding edge that you encounter bugs or other problems, but also not too far behind that you miss out on vital updates.

Tip #3:

Always use Replica Sets with at least three full data-bearing nodes for your production environments

A replica set is a group of MongoDB processes that maintains the same data on all the nodes used for an application. It provides redundancy and data availability for your data. When you have multiple copies of your data on different database servers —  or even better, in different data centers worldwide — replication provides a high level of fault tolerance in case of a disaster. 

MongoDB replica sets consist of PRIMARY and SECONDARY nodes. The PRIMARY node receives writes from the application server and then automatically replicates or copies the data to all SECONDARY nodes via the built-in replication process, which applies changes from the oplog. Replication provides redundancy, and having multiple copies of the same data increases data availability.

It is best practice to have an odd number of replica set members to maintain voting quorum. The minimum number of replica set members for a replica set is three nodes. A replica set can currently have up to 50 nodes. Only seven of those members can have voting rights. Those votes determine which member becomes PRIMARY in case of problems, failovers, or elections.

The basic replica set would consist of the following:

  • PRIMARY 
  • SECONDARY 
  • SECONDARY

All the nodes of the replica set will work together, as the PRIMARY node will receive the writes from the app server, and then the data will be copied to the secondaries. If something happens to the primary node, the replica set will elect a secondary as the new primary. To make this process work more efficiently and ensure a smooth failover, it is important to have the same hardware configuration on all the nodes of the replica set. Additionally, secondary nodes can be used to improve or direct reads by utilizing options such as secondaryPreferred or nearest to support hedged reads. 
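For reference, a minimal sketch of initiating such a three-node replica set (the replica set name and host names are hypothetical):

rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1.example.com:27017" },
    { _id: 1, host: "mongo2.example.com:27017" },
    { _id: 2, host: "mongo3.example.com:27017" }
  ]
})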

After you deploy a replica set on production, it is important to check the health of the replica and the nodes. MongoDB has two important commands to check this in practice.

rs.status() provides information on the current status of the replica set using data derived from the heartbeat packets sent by the other members of the replica set. It’s very useful as a tool to check the status of all the nodes in a replica set.

rs.printSecondaryReplicationInfo() provides a formatted report of the replica set status. It’s very useful to check if any of the secondaries are behind the PRIMARY on data replication, as this would affect your ability to recover all your data in the event of something going wrong. If secondaries are too far behind the primary, then you can end up losing a lot more data than you are comfortable with.

However, these commands provide point-in-time information rather than continuous monitoring for the health of your replica set. In a real production environment, or if you have many clusters to check, running these commands can become time-consuming and annoying, so we recommend using a monitor system for your clusters, like Percona Monitoring and Management, to monitor them.

Tip #4:

Avoid the use of Query types or Operators that can be expensive

Sometimes the simplest way to search for something in a database is to use a regular expression or $regex operation. A lot of developers choose this option, but it can actually harm your search operations at scale. You should avoid $regex queries, especially when your database is big.

A $regex query consumes a lot of CPU time, and it will normally be extremely slow and inefficient. Creating an index doesn’t help much, and sometimes performance is worse than without an index.

For example, we can use a $regex query on a collection of 10 million documents and use .explain(true) to view how many milliseconds the query takes.

Without an index:

>db.people.find({"name":{$regex: "Zelmar"}}).explain(true)
- -   Output omitted  - -
"executionStats" : {
                "nReturned" : 19851,
                "executionTimeMillis" : 4171,
                "totalKeysExamined" : 0,
                "totalDocsExamined" : 10000000,
- -   Output omitted  - -

And if we created an index on "name":

db.people.find({"name":{$regex: "Zelmar"}}).explain(true)
- -   Output omitted  - -
  "executionStats" : {
                "nReturned" : 19851,
                "executionTimeMillis" : 4283,
                "totalKeysExamined" : 10000000,
                "totalDocsExamined" : 19851,
- -   Output omitted  - -

We can see in this example that the index creation didn’t help to improve the $regex performance.

It’s common to see a new application using $regex operations for search requests. This is because neither the developers nor the DBAs notice any performance issues to begin with: the collections are small, and an application’s early users are few.

However, when the collections become bigger, and more users start to use the application, the $regex operations start to slow down the cluster and become a nightmare for the team. Over time, as your application scales and more users want to carry out search requests, the level of performance can drop significantly.

Rather than using $regex queries, you can use text indexes to support your text search.

This is more efficient than $regex but needs you to prepare in advance by adding text indexes to your data sets. They can include any field whose value is a string or an array of string elements. A collection can only have one text search index, but that index can cover multiple fields.
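For example, a text index for the name field used in the query below could be created like this (a sketch; the $text query that follows assumes such an index exists):

db.people.createIndex({ name: "text" })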

Using the same collection as in the example above, we can test how many milliseconds the same query takes using text search:

> db.people.find({$text:{$search: "Zelmar"}}).explain(true)
- -   Output omitted  - -
"executionStages" : {
                         "nReturned" : 19851,
                        "executionTimeMillisEstimate" : 445,
                        "works" : 19852,
                        "advanced" : 19851,
- -   Output omitted  - -

In practice, the same query with text search took four seconds less than the $regex query. Four seconds in “database time,” let alone online application time, is a lot.

To conclude: if you can solve the query using text search, use it instead of $regex, and reserve $regex for the use cases where it is really necessary. Also, limit the use of other expensive query patterns and operators, such as findAndModify and $nin.

Tip #5:

Think wisely about your index strategy

Putting some thought into your queries at the start can have a massive impact on performance over time. First, you need to understand your application and the kinds of queries that you expect to process as part of your service. Based on this, you can create an index that supports them. 

Indexing can help speed up read queries, but it comes with an extra storage cost, and indexes slow down write operations. Consequently, you will need to think about which fields should be indexed so you can avoid creating too many indexes.

For example, if you are creating a compound index, the Equality, Sort, Range (ESR) rule is a must to follow, and using an index to sort the results improves the speed of the query.
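A sketch of what that looks like in practice (the orders collection and its fields are hypothetical): equality on status, sort on orderDate, range on total.

db.orders.createIndex({ status: 1, orderDate: 1, total: 1 })
db.orders.find({ status: "shipped", total: { $gt: 100 } }).sort({ orderDate: 1 })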

Similarly, you can always check if your queries are really using the indexes that you have created with .explain(). Sometimes we see a collection with indexes created, but the queries either don’t use the indexes or instead use the wrong index entirely. It’s important to create only the indexes that will actually be used for the read queries. Having indexes that will never be used is a waste of storage and will slow down write operations.

When you look at the .explain() output, three main fields are important to observe. For example:

keysExamined:0 
docsExamined:207254 
nreturned:0

In this example, no indexes are being used. That is because the number of keys examined is 0 while the number of documents examined is 207254. Ideally, the query needs to have the ratio nreturned/keysExamined=1. For example:

keysExamined:5 
docsExamined: 0 
nreturned:5

As a final piece of advice, if with .explain() you see a particular query using the wrong index, you can force the query to use a particular index with .hint(). This overrides MongoDB’s default index selection and query optimization process, allowing you to specify the index that is used or to carry out a forward or reverse collection scan.
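For example, reusing the hypothetical index from the earlier sketch (passing { $natural: 1 } or { $natural: -1 } instead forces a forward or reverse collection scan):

db.orders.find({ status: "shipped", total: { $gt: 100 } }).hint({ status: 1, orderDate: 1, total: 1 })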


Tip #6:

Watch for changes in query patterns, application changes, or index usage over time

Every database is unique and particular to its application, and both grow and change over time. Nobody knows how an application will grow or how its queries will change. Whatever assumptions you make, your prediction will inevitably be wrong, so it is essential to check your database and indexes over time.

For instance, you may plan a specific query optimization approach and a particular index, but after a year you realize hardly any queries are using that index, and it is not necessary anymore. Carrying on with this approach will cost you more in storage while not providing any improvement in application performance.

For this reason, it’s necessary to perform query optimizations and look at the indexes for each collection frequently.

MongoDB has some tools to do query optimization, such as the database profiler or the .explain() method; we recommend using them to find which queries are slow, how the indexes are being used by the queries, and where you may need to improve your optimizations.

Alongside removing indexes that are not used efficiently, look out for duplicate indexes that you don’t need to run as well.

We use some scripts to check if there are duplicate indexes or if there are any indexes that are not being used. You can check and use them from our repository: 

https://github.com/percona/support-snippets/tree/master/mongodb/scripts

Similarly, you can look at how many results you want to get from a query, as providing too many results can have a performance and time impact. You can limit the number of query results with .limit(), as sometimes you only need the first five results of a query rather than tens or hundreds of responses.

Another useful approach is to use projections to get only the necessary data. If you need only one field of the document, use projection instead of getting the entire document and then filter on the app side.
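A quick sketch of both ideas together (the city field is hypothetical): return only the name field and only the first five matches.

db.people.find({ city: "Austin" }, { name: 1, _id: 0 }).limit(5)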

Lastly, if you need to order the results of a query, be sure that you are using an index and take advantage of it to improve your efficiency.

Tip #7:

Don’t run multiple mongoD on the same server

Even if it’s possible to run multiple MongoD on the same server using different processes and ports, we strongly recommend not doing this.

When you run multiple MongoD processes on the same server, it’s very difficult to monitor them and the resources they are consuming (CPU, RAM, Network, etc.). Consequently, if there is any problem, it is extremely difficult to find out what is going on and get to the root cause of that issue.

We have a lot of cases where customers have found a resource problem on the server, but because they are running multiple instances of mongod, troubleshooting becomes far harder: discovering which specific process has the problem takes much longer.

Similarly, sometimes we see that developers have implemented a sharded cluster to scale their application data, but then multiple shards are running on the same server. In these circumstances, the router will send a lot of queries to the same node. This may overload the node, leading to poor performance, which is the opposite of what the sharding strategy is meant to achieve.

The worst-case scenario for these kinds of deployments involves replica sets. Imagine running a replica set for resiliency and availability, and then more than one member of the replica set is running on the same server. This is a recipe for potential disaster and data loss if the physical server that supports those nodes has a problem. Rather than architecting your application for resiliency, you would have made the whole deployment more likely to fail.

If you have a larger sharded environment where you are using many MongoS (query router) processes, it is possible to run multiple MongoS processes on the same nodes. If you choose to do so, it is recommended to run those in separate containers or VMs to avoid “noisy neighbor” resource contention issues and to aid in troubleshooting.

Tip #8:

Employ a reliable and robust backup strategy

So, you have a cluster with replication, but do you want to sleep better? Do backups of your data frequently!

Having backups of your data is a good strategy and allows you to restore the data from an earlier moment if you need to recover it from an unplanned event.

There are different options to backup your data:

Mongodump / Mongorestore: Mongodump reads data from MongoDB and creates a BSON file that Mongorestore can use to populate a MongoDB database. These are efficient tools for backing up small MongoDB deployments. On the plus side, you can select a specific database or collection to back up, and this approach doesn’t require stopping writes on the node. However, this approach doesn’t back up any indexes you have created, so when restoring, you would need to re-create those indexes again. Logical backups are, in general, very slow and time-consuming, so you would have to factor that time into your restore process. Lastly, this approach is not recommended for sharded clusters, which are more complex deployments.
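A rough sketch of the round trip (the database name and paths are hypothetical; host and authentication options are omitted):

mongodump --db=appdb --out=/backups/appdb-dump
mongorestore --db=appdb /backups/appdb-dump/appdb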

Percona Backup for MongoDB is an open source, distributed, and low-impact solution for consistent backups of MongoDB sharded clusters and replica sets. It enables backups for MongoDB servers, replica sets, and sharded clusters. It can support logical, physical, and point-in-time recovery backups and backups to anywhere, including AWS S3, Azure, or filesystem storage types.

However, it does require initial setup and configuration on all the nodes that you would want to protect.

Physical / file system backups: You can create a backup of a MongoDB deployment by making a copy of MongoDB’s underlying data files.

You can use different methods for this type of backup, from manually copying the data files to Logical Volume Management (LVM) snapshots to cloud-based snapshots. These are usually faster than logical backups and can be copied or shared to remote servers. This approach is especially recommended for large datasets, and it is convenient when building a new node in the same cluster.

However, you cannot select a specific database or collection when restoring, and you cannot do incremental backups. 

Lastly, running a dedicated node is recommended for taking the backup as it requires halting writes, affecting application performance.

Any chosen solution must include testing your restore process periodically.

Tip #9:

Know when to shard your replica set and why choosing a shard key is important

Sharding is the most complex architecture you can deploy with MongoDB.

As your database grows, you will need to add more capacity to your server. This can involve adding more RAM, more I/O capacity, or even more powerful CPUs to handle processing. This is called vertical scaling. However, if your database grows so much that it goes beyond the capacity of a single machine, then you may have to split the workload up. Several things might lead to this — for instance, there may not be a physical server large enough to handle the workload, or the server instance would cost so much that it would be unaffordable to run. In these circumstances, you need to start thinking about sharding your database, which is called horizontal scaling.

Horizontal Scaling involves dividing the database over multiple servers and adding additional servers to increase capacity as required. For MongoDB, this process is called sharding, and it relies on a sharding key to manage how workloads are split up across machines.

Choosing a shard key is probably the most difficult task with MongoDB. It’s necessary to think, study the datasets and queries, and plan ahead before choosing the key, because it’s very difficult to change the shard key after the collection has been sharded. For versions of MongoDB before 4.2, assigning a shard key is a one-way process that cannot be undone. MongoDB 4.4 and later can refine a shard key, while MongoDB 5.0 and above can change the shard key with the reshardCollection command.

If you choose a bad shard key, then a large percentage of documents may go to one of the shards and only a few to another. This will make the sharded cluster unbalanced, affecting performance over time. This typically happens when a key that grows monotonically is chosen to shard a collection, as all the documents over a given value would go to one shard rather than being distributed evenly.

Alongside looking at the value used to shard data, you will also need to think about the queries that will take place across the shard. The queries must use the shard key so the MongoS distributes the queries across the sharded cluster. If the query doesn’t use the shard key, then the MongoS will send the query to every shard of the cluster, affecting performance and making the sharding strategy inefficient.
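For reference, a sketch of enabling sharding and sharding a collection on a hashed key (the database, collection, and field names are hypothetical):

sh.enableSharding("appdb")
sh.shardCollection("appdb.orders", { customerId: "hashed" })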


The ability to reshard is something that developers and DBAs have been requesting for a long time. However, be aware that the resharding process is very resource intensive and can have negative performance impacts, especially when dealing with applications that have heavy write patterns.
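If you do reach that point on MongoDB 5.0 or later, resharding is issued through adminCommand; a sketch reusing the hypothetical collection above, with region as the new leading key field:

db.adminCommand({
  reshardCollection: "appdb.orders",
  key: { region: 1, customerId: 1 }
})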

Tip #10:

Don’t throw money at the problem

Last, but not least, it’s very common to see teams throwing money at the problems that they have with their databases.

Adding more RAM, more CPU, and moving to a larger instance or a bigger machine can overcome a performance problem. However, carrying this kind of action out without analyzing the real problem or the bottleneck can lead to more of the same kinds of problems in the future. Instead of immediately reaching for the credit card to solve the problem, sometimes it is better to think laterally and imagine a better solution. In most cases, the answer is not spending more money on resources but looking at optimizing your implementation for better performance at the same level.

While cloud services make it easy to scale up instances, the cost can quickly mount up. Worse, this is an ongoing expense that will carry on over time. By looking first at areas like query optimization and performance, it’s possible to avoid additional spending. Some of the customers we worked with were able to downgrade their EC2 instances, saving their companies a lot of money every month.

As a general recommendation, adopt a cost-saving mindset where you can take your time to analyze the problem and think of a better solution than starting with cloud expansions.

Percona Distribution for MongoDB is a freely available MongoDB database alternative, giving you a single solution that combines the best and most important enterprise components from the open source community, designed and tested to work together.

 

Download Percona Distribution for MongoDB Today!

Jul 25, 2023

20X Faster Backup Preparation With Percona XtraBackup 8.0.33-28!


In this blog post, we will describe the improvements in Percona XtraBackup 8.0.33-28 (PXB), which significantly reduce the time needed to prepare backups before the restore operation. This improvement also significantly reduces the time required for a new node to join Percona XtraDB Cluster (PXC).

Percona XtraDB Cluster uses Percona XtraBackup to perform SST (State Snapshot Transfer) from one node to another. When a new node joins the cluster, SST is performed to transfer the data from the DONOR to the JOINER. The JOINER uses PXB to stream the data directory from the DONOR and must prepare the backup before using it. It was observed that when the DONOR has a huge number of tablespaces (one million), XtraBackup on the JOINER side couldn't complete preparing the data (xtrabackup --prepare).

Percona XtraBackup copies the InnoDB data files. The data is internally inconsistent because the server concurrently modifies the data files while they are being copied. Percona XtraBackup performs crash recovery on the files to make a consistent, usable database again. This is called the ‘prepare’ operation (xtrabackup --prepare).
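For context, a typical backup-and-prepare cycle looks something like this (the target directory is hypothetical; connection options are omitted):

xtrabackup --backup --target-dir=/data/backups/full
xtrabackup --prepare --target-dir=/data/backups/full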

The XtraBackup --prepare operation is done in two phases:

  1. Redo log apply
  2. Undo log apply

In the redo apply phase, the changes from redo log file modifications are applied to a page. This phase has no concept of a row or a transaction. The redo apply phase wouldn’t make the database consistent with respect to a transaction. The changes done by an uncommitted transaction can be flushed or written to the redo log by the server. XtraBackup still applies the modifications recorded in the redo log, and the redo log apply phase does not undo those changes. For that, we have to use undo logs.

In the undo log apply phase (AKA rollback phase), the changes required to undo the transaction are read from undo log pages. They are then applied (for example, writing old values back again) to the pages again and written to disk. After this phase, all uncommitted transactions during the backup are rolled back.

Undo log records are of two types: INSERT Undo log record and UPDATE undo log record. DELETE MARK of records is considered as a subtype of the UPDATE UNDO log record.

The format is as shown below:

(Figure: undo log record format)

When the Server writes these records, it doesn’t write the index/table information along with each record. It just writes a “table_id” as part of UNDO LOG records. The table_id is used to fetch the table schema. The table schema and key fields from the undo log record are used to create an index search tuple (key). This search tuple (key) is used to find the record to perform the undo operation.

So, given a table_id, how do you get the table schema/definition?

After the data dictionary (DD) engine and the DD cache are initialized on a server, storage engines can ask for a table definition. For example, InnoDB asks for a table definition based on the table_id, also known as "se_private_id".

Percona XtraBackup, unlike the server, does not have access to the data dictionary (DD). Initializing a DD engine and cache adds complexity and other server dependencies, so XtraBackup cannot simply behave like a server to access a table object.


Percona XtraBackup initializes the InnoDB engine and requires the InnoDB table object (dict_table_t) for all its purposes (rollback, export, etc.). For this, XtraBackup relies on Serialized Dictionary Information (SDI), a JSON representation of the table. For InnoDB tablespaces, this information is stored within the tablespace itself: from 8.0, the IBD file is "self-describing", and the table schema is available within the IBD file.


Let’s take a look at the example table.

CREATE TABLE test.t1(a INT PRIMARY KEY, b INT);

The CREATE TABLE statement creates a file called t1.ibd in the test directory (for example, <mysql datadir>/test/t1.ibd). t1.ibd contains information about the table structure (columns, their types, how many indexes, the columns in each index, foreign keys, etc.) as SDI. The tool ibd2sdi can extract the SDI from an IBD file:

ibd2sdi data/test/t1.ibd > t1.sdi

In the extracted SDI (t1.sdi), the table name is in the "dd_object:name" field, and the column information is stored in the "dd_object:columns" array.

Old design (until Percona XtraBackup 8.0.33-27):

XtraBackup reads the SDI from *every* IBD file and loads all tables from each IBD into the cache as non-evictable. Loading tables as non-evictable essentially disables the LRU cache: every table remains in memory until XtraBackup exits.

Problems with this approach:

  • Loading tables that are not required for rollback.
  • Unnecessary IO operations from reading SDI pages of tables.
  • Loading unnecessary tables increases the time required to --prepare.
  • Occupies memory and can lead to OOM.
  • Crashes the XtraBackup prepare operation if the backup directory contains a huge number of tables/IBD files.
  • A node joining the PXC cluster requires more memory and takes a long time to join the cluster.

Why did XtraBackup load tables as 'non-evictable'? Can't we just load them as evictable to solve the problem? Suppose a table is evicted and has to be loaded again: how will XtraBackup know which tablespace (IBD) contains the evicted table? It would have to scan every IBD again to find it.

New design (from Percona XtraBackup 8.0.33-28)

To load tables as evictable, a relationship between the table_id and the tablespace (space_id) that contains the table has to be established. This is done by scanning the B-tree pages of the data dictionary tables mysql.indexes and mysql.index_partitions.

Once this table_id→space_id relationship is established, it is used during transaction rollback. In the new design, user tables are loaded only if there is a transaction to roll back on them.
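On a running server, the same table_id→space_id mapping can be inspected through INFORMATION_SCHEMA; the query below is only an illustration, since XtraBackup has no running server and instead derives the mapping by scanning the dictionary B-tree pages directly:

mysql -e "SELECT TABLE_ID, SPACE, NAME
          FROM INFORMATION_SCHEMA.INNODB_TABLES
          WHERE NAME = 'test/t1';"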

The new design is as follows:

Tables from the cache are evicted when the cache size limit is reached or by the background master thread.

Benefits of the new design for xtrabackup --prepare:

  1. Uses less memory
  2. Uses less IO
  3. Faster prepare
  4. Completes successfully even with a huge number of tables.
  5. A node completes the SST process faster and joins the PXC cluster quickly.
  6. A node requires less memory to join the PXC cluster.

Benchmarks

Percona XtraBackup benchmarks

We also ran xtrabackup --prepare on backup directories of other sizes (10K, 50K, 100K, and 250K tables). The performance improvement is as follows:

Conclusion

As you can see, from Percona XtraBackup 8.0.33-28, xtrabackup --prepare is faster and more memory-efficient with the dictionary cache. The improvement depends on the number of tablespace files (IBDs) in the backup directory. The time taken for a new node to join the PXC cluster is also significantly reduced, as the SST process completes faster.

Percona XtraBackup is a 100% open source backup solution for all versions of Percona Server for MySQL and MySQL that performs online non-blocking, tightly compressed, highly secure full backups on transactional systems.

 

Try Percona XtraBackup today

Jul
25
2023
--

Faster Streaming Backups – Introducing Percona XtraBackup FIFO Parallel Stream


When it comes to backups, there are several options for saving backup files. You can choose to save them locally on the same server, stream them to different servers, or store them in object storage. Percona XtraBackup facilitates streaming through the use of an auxiliary tool called xbcloud.

STDOUT Datasink

This diagram displays the process of streaming a backup to object storage utilizing the current STDOUT datasink:

STDOUT Datasink

  • XtraBackup spawns multiple copy threads, each reading a chunk of data from a specific file.
  • Each copy thread writes its chunk of data to a pipe (aka STDOUT).
  • Xbcloud has a reader thread responsible for reading each chunk of data from STDIN. Each chunk is uploaded to object storage with an asynchronous request, and a callback is added to an event handler list.
  • The event handler thread of xbcloud is responsible for checking when the TCP socket has received the response from object storage and executing the callback depending on the result (success or failure).

When testing the speed of XtraBackup streaming, we see a limitation of around 1.8 Gbps. This architecture works fine when uploading data over a WAN (where we will rarely have that much bandwidth). However, streaming backups over the LAN has become more popular, and there 10 Gbps or more is the standard setup.
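For reference, the setup being measured is the classic single pipe from xtrabackup into xbcloud; a minimal sketch (the endpoint, bucket, and backup name are illustrative, and credential options are omitted) looks like this:

xtrabackup --backup --stream=xbstream --target-dir=/tmp \
  | xbcloud put --storage=s3 \
      --s3-endpoint=http://minio.local:9000 \
      --s3-bucket=backups \
      --parallel=10 \
      full_backup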

Looking closely at the architecture above, it becomes clear that even though xbcloud is very fast at reading from STDIN, issuing an asynchronous request to object storage, and delegating the wait for the response to the event handler worker thread, only one thread on the XtraBackup side can write to STDOUT at a time. This is enforced with a mutex so that a chunk of data is written from start to finish; otherwise, data from other copy threads would be interleaved into STDOUT before the current thread finished its chunk.

FIFO Datasink

Percona XtraBackup 8.0.33-28 introduces a new datasink for true parallel streaming. It uses FIFOs (named pipes) to stream data from XtraBackup to xbcloud in parallel. Extending the previous diagram, the new data flow looks like the diagram below:

FIFO Datasink

  • Multiple FIFO files are created to carry the data from XtraBackup to xbcloud.
  • Each XtraBackup copy thread is bound to a FIFO file in a round-robin fashion.
  • Xbcloud uses multiple read threads to read from the FIFO streams.

Usage

We introduced three new parameters:

  • --fifo-streams=#: The number of FIFO files to use for the parallel datafile stream. Setting this parameter to 1 disables FIFO and sends the stream to STDOUT. The default is 1 (STDOUT) for backward compatibility.
  • --fifo-dir=path: The directory in which to create the named pipes.
  • --fifo-timeout=#: How many seconds to wait for the other end to open the stream for reading. The default is 60 seconds.

The current behavior of XtraBackup and xbcloud is unchanged: the default value of --fifo-streams is 1, so the single-line pipe command shown above still works as before.
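When --fifo-streams is greater than 1, xtrabackup and xbcloud run as two cooperating processes that meet on the named pipes instead of sharing a single shell pipeline. The sketch below shows the general shape; the placement of the FIFO options on the xbcloud side is an assumption on my part, so check the xbcloud documentation for the authoritative syntax:

# Producer: write the backup into 8 named pipes under /tmp/fifo
xtrabackup --backup --stream=xbstream \
  --fifo-streams=8 --fifo-dir=/tmp/fifo --fifo-timeout=60 \
  --target-dir=/tmp &

# Consumer: read the same pipes and upload to object storage
# (the FIFO-related options here are assumptions based on the parameters above)
xbcloud put --storage=s3 --s3-bucket=backups \
  --fifo-streams=8 --fifo-dir=/tmp/fifo \
  --parallel=10 full_backup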

Performance

To test the new FIFO datasink, we created 1TB of data spread across multiple tables. The link speed between the source server and the destination server (running MinIO) is ~9.2 Gbps:

The test was performed without any load (no new redo log generated during the backup) to ensure that both tests push the same 1TB of data between the two servers.

The results are shown in the below graph:

  • STDOUT – Utilized 239 MBps (1.8 Gbps) and took 01:25:24 to push 1TB of data.
  • FIFO (8 pipes) – Utilized 1.15 GBps (9.2 Gbps) and took only 00:16:01 to push the same 1TB of data.

Summary

If you utilize streaming inside your local network, you can definitely benefit from faster backups by streaming data in parallel using the new FIFO datasink.

Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server for MySQL and MySQL.

Learn more about Percona XtraBackup

Jul
24
2023
--

Talking Drupal #408 – The Drupal Association

Today we are talking about The Drupal Association with Tim Doyle.

For show notes visit: www.talkingDrupal.com/408

Topics

  • Tell us about yourself Tim
  • Why Drupal
  • What perspective do you bring
  • Strategic plan
  • Leadership
  • Measurements
  • Drupal the project vs Drupal the community
  • Where will the DA be in 5 years
  • Will Drupal be around in 20 years
  • Favorite aspect of the strategic plan

Resources

  • Director of Philanthropy
  • Burnout
  • Pitch-burgh
  • Alex Moreno
  • New Core Product Manager
  • Tenure of ED/CEOs since formation of DA
    • Jacob Redding
      • March 2010 – February 2013
      • ~3 years
    • Holly Ross
      • February 2013 – May 2016
      • 3 years 3 months
    • Megan Sanicki
      • May 2016 – September 2018
      • 2 years 4 months
    • (interim) Tim Lehnen
      • September 2018 – June 2019
      • 9 months
    • Heather Rocker
      • June 2019 – March 2022
      • 2 years 9 months
    • (interim) Angie Sabin
      • March 2022 – October 2022
      • 8 months
    • Tim Doyle
      • October 2022 – Present (July 2023)
      • 9 months
    • As of February 2026 Tim will be longest-tenured

Guests

Tim Doyle – @TimDoyleMPA

Hosts

Nic Laflin – www.nLighteneddevelopment.com @nicxvan John Picozzi – www.epam.com @johnpicozzi Tim Plunkett – @timplunkett

MOTW Correspondent

Martin Anderson-Clutz – @mandclu ActivityPub

  • Brief description:
    • Implements the ActivityPub protocol on your site, so readers can follow and respond to content on Fediverse sites like Mastodon, and in the future maybe Threads?
  • Brief history
    • How old: created in Feb 2019
  • Versions available:
    • 1.0.0-alpha17, works with Drupal 9.4 and 10, release in Mar 2023
  • Actively maintained
  • Minimally maintained
  • Number of open issues:
    • 43 open issues, only 2 bugs but 1 of those fixed in the past week
  • Usage stats:
  • Maintainer(s):
    • Created and maintained by swentel, who also created a module for publishing Drupal content to the Nostr network
  • Module features and usage
    • Effectively allows your site to act as a Fediverse server: Mastodon, Pleroma, Pixelfed, etc.
    • Users who enable ActivityPub for their account effectively get an @user@domain identity in the Fediverse
    • At that point the Drupal UI provides some common features of a social client: notifications about subscribers, and so on
    • Provides plugins to define user actions. Default plugins are ‘Accept’, ‘Follow’, ‘Delete’, ‘Undo’ and ‘Inbox reply’, but a site could add or enable others
    • Based on the documentation it sounds like if you update an entity (e.g. change the title) you can trigger an Update activity, which is in stark contrast to a certain bird platform
    • There’s quite a bit more in the module’s README file if anyone wants to better understand its capabilities
Jul
24
2023
--

Announcing the Availability of Percona Backup for MongoDB 2.2.0


In the previous minor release, Percona Backup for MongoDB (PBM) 2.1.0 introduced the GA version of incremental physical backups, which can greatly impact both recovery time and backup cost (considering storage and data transfer costs). Back then, we already knew that our customers needed to back up:

  • More data
  • Faster
  • Using existing tools

During all the conversations we have had with our community, customers, and prospects, we noticed how popular snapshot tools are. While AWS EBS snapshots were mentioned the most, persistent disk snapshots on GCP, Azure managed disk snapshots, local storage snapshots, and Kubernetes snapshot capabilities also play a crucial part in backup strategies.

To make sure Percona Backup for MongoDB addresses these pains, we took the common denominator of the problems anyone using snapshot tools faces when performing backups and restores, and positioned PBM as the open source, freely available answer to them.

Snapshot backup/restore Technical Preview

I am happy to announce that, as promised during Percona Live 2023, with PBM 2.2.0 we are launching the technical preview of the Percona Backup for MongoDB snapshot CLI.

With this snapshot CLI, you can now build your backup strategy around the snapshot tools at your disposal or include PBM in your existing strategy, even reusing existing snapshots of Percona Server for MongoDB!
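As a rough sketch of how a snapshot-based backup might be driven from the CLI: the subcommand and flag names below reflect my understanding of PBM's external backup workflow and may not match the shipped technical preview exactly, so verify them with pbm help before relying on them.

# Ask PBM to prepare the cluster for an external (snapshot-based) backup
# (command and flag names are assumptions; confirm against your PBM version)
pbm backup --type=external

# ...take the EBS/GCP/Azure/Kubernetes volume snapshot with your own tooling...

# Tell PBM the snapshot is complete so it can finalize the backup metadata
pbm backup-finish <backup_name>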

Now why Technical Preview, you may ask? Well, it is because we believe that open source philosophy is not only about the source but also about design and decision-making. We want to ensure that our Community feels that this design we delivered fits their use cases. We hope for a dialogue with you, our users, to make sure we are fixing your pains in the best way we can. So please do not hesitate and:

  • Provide feedback in the feedback form.
  • Engage on our Community pages (there is even a placeholder topic for this!)
  • Contact me through any channels (we do have our Jira open to the Community, my email is available in the footnote, and I am present on LinkedIn).

Physical backups Point in Time Recovery (PITR)

MongoDB Inc. provides only limited backup/restore capabilities with its Community Edition: the widely adopted mongodump/mongorestore utilities, which Percona Backup for MongoDB also uses for what we call logical backups.

While this type of backup is very convenient for smaller data sets, it has certain limitations with larger ones, mainly around RPO and RTO. RPO is addressed with Point-in-Time Recovery (PITR), which works by default with logical backups; to improve RTO, we introduced physical backups. In short, for larger datasets there is a very distinct speed improvement in recovery.
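For readers new to PBM, PITR is driven by a configuration flag; a minimal sketch of enabling oplog capture and checking that chunks are being saved:

# Enable continuous oplog capture for point-in-time recovery
pbm config --set pitr.enabled=true

# Verify that PITR chunks are being written to the backup storage
pbm status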

The PITR capabilities were originally designed to work with logical backups, and we want the user experience with physical backups to be just as good. In previous versions, PITR already worked with physical backups, but some operational limitations required additional manual steps.

With PBM 2.2.0, we introduce numerous fixes that put these limitations in the past:

  • Previously, the database accepted connections as soon as the full backup was restored; manual changes were required to keep users from connecting before the PITR restore finished so that the result of the restore guaranteed data integrity.
  • Until now, the restore process for physical backups + PITR was not handled in a single command, making the user experience worse than for logical backups + PITR (see the sketch below).
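A minimal sketch of what that single command looks like (the backup name and timestamp are illustrative; as far as I recall, --base-snapshot is optional and PBM can pick a suitable base backup itself):

# Restore the physical base backup and replay the oplog up to the target time
pbm restore --base-snapshot 2023-07-20T10:00:00Z --time="2023-07-20T14:30:00"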

Fixes, bugs, improvements – all is here!

Of course, each release also includes bug fixes and refinements of existing features, and this one is no different. Beyond the typical array of bugs and patches, we have improved how physical restores handle the remapping of replica sets, so you should notice a smoother experience there.

Feedback is also a contribution

Contribution is not only code. Feedback, adoption and usage data, bug reports, success stories, case studies, testimonials — all these are contributions. Engaging with Community in meaningful discussions is also a contribution. All of these help us grow and help us deliver a better product.

We appreciate any contribution. Again, even negative feedback is something we can use to make the product evolve!

What’s next?

I hope that next, we will close the gap between the capabilities of physical and logical backups by:

  • Selective backup/restore
  • Further improvements on physical PITR, if needed

and, of course, deliver GA capabilities of snapshot backup/restore based on your feedback.

There is also a lot of work around Percona Monitoring and Management (PMM) for MongoDB and some new improvements coming for Percona Operator for MongoDB.

 

Learn more about Percona Backup for MongoDB 2.2.0
