SQL: Auto Page Repair in SQL Server ? (Or Not?)

SQL: Auto Page Repair in SQL Server ? (Or Not?)

Database mirroring was added in SQL Server 2005. One of the features added to it in SQL Server 2008 was auto page repair. When SQL Server 2012 was released, Availability Groups also offered auto page repair.

Just how useful is this feature though?

I’ll start by saying that it can’t hurt.

When SQL Server is reading a page on the primary replica and receives an unrecoverable I/O error (typically but not always, an error 823 for a checksum error), it will try to repair the page when:

  • A secondary synchronous replica exists
  • The secondary replica is synchronized

It will try to recover a copy of the page from the secondary replica and rewrite it on the primary. This means that when you execute a query on the primary, and it returns an I/O error, that retrying the same query a short time later might actually work.

In the meantime the page is added to the suspect pages table in msdb, and if you try to query it while the auto page repair is being attempted, you’ll get an error 829 that tells you the page is in recovery.

Did the page get repaired?

Maybe.

Let’s start by saying that if a storage subsystem is returning I/O errors, it’s likely that you have a problem that really needs fixing.

Generally, auto page repair is more like a bandage or a band-aid that’s been applied.

However, storage subsystems are often now self-healing: they recognize that an I/O error occurred in one location, and automatically map that block to another underlying location. So when SQL Server then rewrites the original data, there is a chance that it will have actually fixed the problem.

It’s also worth keeping mind, that as storage systems get larger and larger, we’re starting to test the standard error rates of many underlying I/O devices. One unrecoverable error every so often is actually deemed acceptable. The proportion is really very, very low. But as we start moving larger and larger amounts of data around, we can start to hit these numbers.

I suspect that as storage gets even larger, we will start to be aware of these issues more and more.

What about the secondary replica?

The same logic applies. The secondary can attempt to retrieve a copy of the page from a synchronized primary replica. This could occur when the secondary replica is being read. In mirroring, this could have been if we had made the replica readable via database snapshots, and in Availability Groups, this could just be a readable secondary replica that hits an I/O error on its storage.

So is it a good thing?

Of course. It’s a help, but it’s important to understand that if pages are appearing in the suspect pages table in msdb, you have an underlying problem that needs to be resolved.

2018-01-02