Database Corruption: Detecting Disk Sub-System Errors Early
The very thought of database corruption makes my skin crawl. Unfortunately, it’s something that all Database Administrators will eventually experience and need to correct. I’m not going to discuss how to recover from this lurking problem. Instead, I’m going to give you some simple strategies to ensure you are aware of the issue as soon as possible. Early detection can save you downtime, data loss and even your job.
The Usual Suspect
Generally speaking, the majority of database corruption is caused by hardware errors in the disk sub-system. Our database files are sitting on our SAN, or DAS, and are happily going about their lives without any problem until something goes awry. You may not even know there is a problem until it’s too late and a database is marked suspect.
The most important thing to can do, outside of monitoring for these problems, is to keep your storage system’s firmware and drivers up to date and consistent within the system. Sign-up for notifications with your vendor so you will receive these notices, and when you received them, read the little text file that explains what has changed and what will happen if you don’t update. I can’t tell you how often I’ve read through these and run across verbiage like, “previous version can cause data loss.”
These systems aren’t perfect. Stay on top of your firmware and driver updates or you may be turned down when you make that middle of the night support call and need their help.
Automate the Eights
I’ve got a bag of alerts that I like to implement on each SQL Server that I administer. Three of these alerts are for errors 823, 824 & 825.
Error 823
Sometimes referred to as a hard I/O error, this is alerting SQL Server, and you, that a read or write error has occurred. In the case of a read error, Windows has already attempted a retry four times before sending the message. This error can also be captured by alerting on errors with a severity of 24.
This error will be accompanied by the following entry in the SQL Server error log and Windows event log:
“The operating system returned error 823 to SQL Server during a <<MESSAGE>> at offset <<PHYSICAL OFFSET>> in file <<FILE NAME>>. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online”.
Error 824
Sometimes referred to as a logical consistency error, or soft I/O error, this indicates that Windows was able to read the page from disk but SQL Server has discovered that the page has a problem. This error can be caused when SQL Server determines that there was a checksum error or a torn bits error when it read the page from disk. This is dependent on the page verification option your database is using. This error can also be captured by alerting on errors with a severity of 24.
No matter which occurred, there’s a problem with the page and the following error message will be reported in the SQL Server error log and the Windows event log:
“SQL Server detected a logical consistency-based I/O error: <<ERROR TYPE DESCRIPTION>>. It occurred during a <<Read/Write>> of page <<PAGEID>> in database ID <<DBID>> at offset <<PHYSICAL OFFSET>> in file <<FILE NAME>>. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.”
Error 825
This error also falls in the hard I/O error category. When this error is received, it indicates that Windows requested a read and the read failed at least one time. Oftentimes, this is the first indication that there’s trouble brewing in the disk sub-system. This error can also be captured by alerting on errors with a severity of 10.
You’ll find the following error message in the SQL Server error log and Windows event log:
“A read of the file <<FILE NAME>> at offset <<PHYSICAL OFFSET>> succeeded after failing<<FAILURE COUNT>> time(s) with error: <<MESSAGE>>. Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.”
Conclusion
You’re going to receive one, or more, of these errors during your career as a Database Administrator. No one is immune. Bad things happen to hardware, and most of the time there’s not a lot you can do to prevent it. However, you can be prepared to respond.
If you receive one of these errors, don’t wait around before you investigate. You’re response is critical to the health of the data you’re sworn to protect. These types of problems don’t go away on their own. Don’t make the mistake of thinking that they will.
So, keep your storage system firmware and drivers up to date, implement these alerts and systematically plan your response. These actions will give you the best chance of a speedy recovery when database corruption strikes.
Scott Caldwell, SQL Server DBA