Jan 22

The IO operation at logical block address 0 for Disk 7 was retried.

When working with Multipath IO (MPIO), for example with iSCSI, you may run across this message:

The IO operation at logical block address X for Disk X was retried.

This initially looks like trouble, but it is really closer to a general warning. If you receive many of these warnings in a short period of time, you should address it, because you are probably suffering from performance issues. The warning appears when an IRP sent over MPIO times out. The cause could be the network (latency, connectivity, etc.), but more likely the target is too busy to respond in time. So sporadic warnings during heavy IO can be expected, but recurrent messages mean you need to look into the workload or the networking of your shared storage.
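To make "many warnings in a short period" a little more concrete, here is a minimal Python sketch that scans a list of warning timestamps for bursts. The function name, the 10 minute window, and the threshold of 5 are all assumptions chosen for illustration; they do not come from Microsoft guidance, and you would feed it timestamps exported from your own event log.

```python
from datetime import datetime, timedelta

def find_bursts(timestamps, window=timedelta(minutes=10), threshold=5):
    """Return the first timestamp of each window holding >= threshold events.

    `window` and `threshold` are illustrative defaults, not recommended values.
    """
    events = sorted(timestamps)
    bursts = []
    start = 0
    for end, ts in enumerate(events):
        # Move the left edge forward until the span fits inside the window.
        while ts - events[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            # Record each burst once, keyed by its first event.
            if not bursts or bursts[-1] != events[start]:
                bursts.append(events[start])
    return bursts
```

An empty result corresponds to the sporadic case described above; a non-empty one is a hint to look at the storage workload or the network at those times.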

There is a more extensive explanation here: http://serverfault.com/questions/499269/what-does-the-io-operation-at-logical-block-address-for-disk-was-retried which includes references to Microsoft material:

http://support.microsoft.com/kb/2819485/EN-US (the KB article for this message)

and

http://blogs.msdn.com/b/ntdebugging/archive/2013/04/30/interpreting-event-153-errors.aspx

Below is a copy for your convenience:

No, it does not mean that the data was lost. It simply means that the IRP (IO Request Packet) timed out while the IO system waited for it to complete, and so it was tried again. When a thread begins any IO operation, the IO manager creates an IRP to represent the operation as it passes through the system.

The IRP gets stored in its initial state in a buffer/look-aside list, so that it can be retried if it fails the first time. That provides the atomicity that one would expect from any transactional system so that we can be more confident that you’re not going to get a bunch of corrupted or incomplete data written to your disk.

This event makes perfect sense in the event of an MPIO failure. Say Windows goes to read or write something from SAN storage. The request is dispatched, and at the same instant, I cut one of the cables to the SAN. That request is never going to complete, and so Windows will try the request again, only this time the request will follow the other path.
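The cut-cable scenario can be modelled with a toy Python sketch. This is not how the Windows storage stack is implemented; `send`, `PathDown`, and the path names are hypothetical stand-ins that only illustrate the idea of a request failing on one path and being retried on another.

```python
class PathDown(Exception):
    """Models a request that never completes on a given path (a timeout)."""

def send(path, healthy_paths):
    # Hypothetical stand-in for dispatching a request down one path.
    if path not in healthy_paths:
        raise PathDown(path)
    return f"completed via {path}"

def submit_with_failover(paths, healthy_paths):
    """Try each path in order; return the result and the paths that timed out."""
    failed = []
    for path in paths:
        try:
            return send(path, healthy_paths), failed
        except PathDown:
            # This is roughly where the "was retried" warning would be logged.
            failed.append(path)
    raise RuntimeError("no healthy path left")
```

With paths `["path-A", "path-B"]` and only `path-B` healthy, the request completes on the second attempt and `path-A` is reported as the path that timed out, which mirrors the single warning you would see for a one-off retry.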

These events also occur when the disks are overburdened or just really slow. You might notice these messages coincide with scheduled backups, etc. The disk might just be slow and busy, and some random IRP timed out and had to try again. The IRP could be getting stuck in an interrupt service routine, or a deferred procedure call, or whatever.

I could see having a lot of IO filter drivers in your stack exacerbating this issue as well.

It’s not that this behavior did not occur just like this in previous versions of Windows, it’s just that Microsoft apparently decided to surface these events in Win8/Server 2012.

Edit: You can find the outstanding IRPs of a thread with a kernel debugger. First run kd> !process 0 0 to list all the running processes. Then run kd> !process 8f7d6c4a with the address of the process you are interested in, which will list all the IRPs associated with the threads of that process. Finally, run kd> !irp 1a2b3c4d with an IRP address taken from that output.

Once you list the information about an IRP using the !irp command, you can easily spot which driver last handled the IRP because it will have a > pointing to it in the list. Then to get more information about what that driver was doing with that IRP, do a kd> !devobj 1a2b3c4d5e6f where that is the actual address of the device object.

Then do a kd> dt 0x1a2b3c`3c2b1a _CLASS_PRIVATE_FDO_DATA using the address of the PrivateFdoData structure you got.

Now you’re ready to dump the AllTransferPacketsList data structure you got from PrivateFdoData.

The idea is, you’re tracking down what driver was doing what with the IRP the last time it was seen. If the IRP is AWOL for too long, it’s timed out and retried from the beginning. This can be caused by so many things… even a stray cosmic ray. But the important thing is that the transaction will be retried from the beginning, and it will not be considered complete until the IO manager says it is.

Oh, and there’s also thread-agnostic IO which is a completely different can of worms. 🙂

For further reading on this topic, I highly recommend chapter 8, I/O System, of Windows Internals, 6th edition, by Mark Russinovich, David Solomon, and Alex Ionescu.

Edit: I did finally find the official KB for this error: http://support.microsoft.com/kb/2819485/EN-US

The IO operation should be retried 8 times, once per minute, until Windows gives up.
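Taking that sentence at face value, the retry budget can be written out as a small Python sketch. The figures of 8 retries and 60 seconds come from the sentence above; the function name and the assumption that the first retry happens one full interval after the original request are mine, and the real timeout depends on how the disk is configured.

```python
def retry_schedule(max_retries=8, interval_s=60):
    """Yield (retry_number, seconds_elapsed) until the retry budget is spent.

    Assumes evenly spaced retries, with the first one a full interval
    after the original request was issued.
    """
    for n in range(1, max_retries + 1):
        yield n, n * interval_s
```

Under those assumptions the last entry of `list(retry_schedule())` is the eighth retry at 480 seconds, so a request that never completes would be abandoned after roughly eight minutes.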

Edit: As promised: http://blogs.msdn.com/b/ntdebugging/archive/2013/04/30/interpreting-event-153-errors.aspx
