MDEV-38907 Optimistic Relay Log Crash Recovery by ParadoxV5 · Pull Request #4803 · MariaDB/server

ParadoxV5 · 2026-03-13T23:37:06Z

This commit adds an automatic recovery procedure to GTID relay log initialization.

This procedure scans the write end of a log to remove any trailing (or corrupted) incomplete event or event group.
Unlike @@relay_log_recovery, this procedure won’t discard the portion of logs that managed to survive the crash.
For exceptional cases that require manual intervention, @@relay_log_recovery is not altered.

Note, because START SLAVE purges the relay log in GTID mode, this feature doesn’t bring immediate value.
However, this commit is a crash-safety preparation for lifting that limitation.
Non-GTID mode can also build upon this procedure to improve its crash safety precautions.

This commit adds an automatic recovery procedure to GTID relay log initialization. This procedure scans the write end of a log to remove any trailing (or corrupted) incomplete event or event group. Unlike `@@relay_log_recovery`, this procedure won’t discard the portion of logs that managed to survive the crash. For exceptional cases that require manual intervention, `@@relay_log_recovery` is not altered. Note, because START SLAVE purges the relay log in GTID mode, this feature doesn’t bring immediate value. However, this commit is a crash-safety preparation for lifting that limitation. Non-GTID mode can also build upon this procedure to improve its crash safety precautions. Reviewed-by: Kristian Nielsen <knielsen@knielsen-hq.org>

ParadoxV5

I wonder, what was the intent with (non-Engine-based) binary logs to leave them crash unsafe without pessimistic configs, in contrast to now that we are improving the crash safety of the relay log with automatic truncation?

ParadoxV5 · 2026-03-13T23:45:20Z

sql/log.cc

 }


+bool do_relaylog_recovery(MYSQL_BIN_LOG *relay_log, const char *log_name)


TC_LOG_BINLOG::recover() has @@gtid_binlog_state restoration and XA recovery entangled, not to mention missed optimizations.
Thus, I decided to roll a separate function to guide me in figuring out how the two recovery procedures could merge code.

ParadoxV5 · 2026-03-14T00:02:19Z

sql/log.cc

@@ -11685,8 +11879,9 @@ MYSQL_BIN_LOG::close_engine()
 */
 void MYSQL_BIN_LOG::clear_inuse_flag_when_closing(File file)


Would it be beneficial for the relay log to make use of the in-use flag, and only run the recovery procedure if the flag is not cleared?

Problems include:

One of the next steps in the MDEV-4698 project is to restore Gtid_IO_Pos (MDEV-38909, which will always require relay log scanning similar to this recovery procedure (unless there’s a .state file in good condition).

Relay logs upgraded from before MDEV-4698 to after will miss this auto recovery.
(Even if not crashed and corrupted, they may contain incomplete groups that the project task MDEV-38906 wants to eliminate.

To check: the SQL thread might have special handling for Format Descriptions that have the in-use flag set.

ParadoxV5 · 2026-03-14T00:13:12Z

sql/rpl_rli.cc

+            and not safe for non-GTID mode until MDEV-39051
+          */
+          !is_relay_log_recovery && mi->master_use_gtid))


I’m deferring the additional work required for non-GTID replication (MDEV-39051) for now.
Let me know if it’s better together, especially since this procedure is extraneous in GTID replication until we lift the START SLAVE limitation.

File-pos replication can reconnect mid-group without complex skip-enqueue logic, so it actually doesn’t need to check event groups and only needs to truncate a corrupted event.
Though I’m concerned that this’ll be yet another difference that hurts interoperability between GTID and legacy modes (if that future is possible).

ParadoxV5 · 2026-03-14T00:17:04Z

sql/log.cc

+      /*
+        A slave will verify the checksum in the SQL thread;
+        the relay log does not care about checksums until then.
+      */
+      event= Log_event::read_log_event(&log, &err, &fdev, false, false)


TODO: fix comment: SQL thread checksum verification can be disabled.

Regardless, should this procedure verify checksums as if it’s the IO thread?

ParadoxV5 · 2026-03-14T00:20:57Z

sql/log.cc

+    fdev.reset_crypto();
+  }
+
+  if (truncate_entry && (err= relay_log->truncate_and_remove_binlogs(


It is easier said than done to skip truncation when truncating at EOF.

It might be more beneficial deeper in the call chain… but then, I wonder if this optimization will become negligible…

ParadoxV5 · 2026-03-14T00:30:08Z

sql/rpl_rli.cc

-      if (rli->relay_log.open(ln, 0, 0, SEQ_READ_APPEND,
-                             (ulong)(rli->max_relay_log_size ? rli->max_relay_log_size :
-                              max_binlog_size), 1, TRUE))
+      if (rli->relay_log.open(ln, nullptr, 0, SEQ_READ_APPEND,


🧐 Is this section unreachable, or merely repeated code?

ParadoxV5 · 2026-03-14T01:17:47Z

sql/log.cc

@@ -4643,29 +4838,28 @@ bool MYSQL_BIN_LOG::open(const char *log_name,
                         ulong max_size_arg,
                         bool null_created_arg,
                         bool need_mutex,


Need what mutex?
do_binlog_recovery() calls functions with need_lock passed with 1, so what does the code documentation mean by “you must have a locks on LOCK_log and LOCK_index”?

Some issues might be existing bugs, where relay log open errors leads to further problems similar to MDEV-24625.

ParadoxV5 requested a review from knielsen March 13, 2026 23:37

ParadoxV5 added MariaDB Corporation Replication Patches involved in replication labels Mar 13, 2026

ParadoxV5 commented Mar 14, 2026

View reviewed changes

[to squash] a coupa fixes for ci

3d3b522

Some issues might be existing bugs, where relay log open errors leads to further problems similar to MDEV-24625.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

MDEV-38907 Optimistic Relay Log Crash Recovery#4803

MDEV-38907 Optimistic Relay Log Crash Recovery#4803
ParadoxV5 wants to merge 2 commits intomainfrom
MDEV-38907

ParadoxV5 commented Mar 13, 2026

Uh oh!

ParadoxV5 left a comment

Uh oh!

ParadoxV5 Mar 13, 2026

Uh oh!

ParadoxV5 Mar 14, 2026

Uh oh!

ParadoxV5 Mar 14, 2026

Uh oh!

ParadoxV5 Mar 14, 2026

Uh oh!

ParadoxV5 Mar 14, 2026

Uh oh!

ParadoxV5 Mar 14, 2026

Uh oh!

ParadoxV5 Mar 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

		}


		bool do_relaylog_recovery(MYSQL_BIN_LOG relay_log, const char log_name)

		@@ -11685,8 +11879,9 @@ MYSQL_BIN_LOG::close_engine()
		*/
		void MYSQL_BIN_LOG::clear_inuse_flag_when_closing(File file)

Uh oh!

Conversation

ParadoxV5 commented Mar 13, 2026

Uh oh!

ParadoxV5 left a comment

Choose a reason for hiding this comment

Uh oh!

ParadoxV5 Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

ParadoxV5 Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

ParadoxV5 Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

ParadoxV5 Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

ParadoxV5 Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

ParadoxV5 Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

ParadoxV5 Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant