In the world of storage you often get to work with many different operating systems, applications, hardware, protocols and so much more. It is one of the reasons I love being in the storage industry! There is always so much to learn to keep this hyperactive mind busy.
With a lot of options though often comes a lot of complexity. There is so much to keep track of that it is easy to get lost in it all and feel overwhelmed. Coming from a deep background of working escalations for the past ~10 years at various companies I can attest to this. My hope is that today’s post will help at least one person out there who needs it. 🙂
Where to look when debugging storage events in ESXi:
Whenever an interruption in service occurs in an environment it is important to understand how it happened so it can be avoided in the future. When working with VMware vSphere one of the best places you can look to better understand what happened is the vmkernel log file.
The vmkernel file includes a lot of good data such as device discovery, storage communication errors, networking device communication, NIC and HBA driver / firmware events and whole lot more. It provides in-depth information about potentially impacted LUNs / devices, what the VMware NMP (Native Multi-Path) plugin is reporting, etc. Essentially, this the “go to” place to understand most of the issues within an ESXi host. There are certainly other useful files to dig into but this is always the first place I turn to.
A couple of different examples you may see while troubleshooting your ESXi environment are as follows:
2019-04-29T05:53:04.154Z cpu0:2098396)NMP: nmp_ThrottleLogForDevice:3781: Cmd 0x89 (0x45a2dbdf8b80, 2097360) to dev "naa.624a9370761539373262434d06117892" on path "vmhba64:C10:T0:L253" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0. Act:NONE
2019-02-03T20:13:56.593Z cpu20:65613)ScsiDeviceIO: 2980: Cmd(0x439d46320540) 0x2a, CmdSN 0xffffe0005764fbd0 from world 71532 to dev "naa.624a9370abe95798c1cb40320001103a" failed H:0xc D:0x0 P:0x0 Invalid sense data: 0x10 0xbb 0x1.
While it may seem like gibberish as most everything is reported in hexadecimal there is actually a fair amount of information that can be gleaned from here to help with your debugging efforts. In the interest of time I will not focus on the entire line because VMware has already done a good job with this in their, Interpreting SCSI sense codes in VMware ESXi and ESX KB. If you are unsure what the above information is trying to tell you I highly recommend you read that KB before moving forward in this post.
What is valid and what isn’t?
The part I want to focus on for each line are the “Valid sense data” and “Invalid sense data” sections respectively. This is one thing that confused me for quite some time and I still see confusing a lot of people today.
In short these two lines make a very clear distinction: Whether or not the storage array provided the reason for the I/O failure or not.
As you can imagine “Valid sense data” means that this is a verified response from the storage array to the ESXi host on the underlying reason the I/O request has failed. This, for the most part, takes a lot of the “guess work” out of understanding why the I/O request was unsuccessful or a service disruption was seen.
Let’s take our “Valid sense data” example above. By looking at this line we know that the Pure Storage FlashArray reported that the Compare and Write (0x89) request was unsuccessful due to a MISCOMPARE DURING VERIFY OPERATION. This is good because it provides evidence that the FlashArray did indeed receive the request and responded with specific data points for us to investigate. Does it mean the storage array is to blame? Not necessarily, it simply gives us a “valid” data point we can use to investigate the underlying cause for I/O impact.
On the flip side we have the “Invalid sense data” data point. This means that the ESXi host did not receive the sense data from the storage array and it is (for a lack of a better term) “guessing” what the underlying reason is. We know that the I/O failed but we don’t necessarily get the fully picture to help us understand why.
If we look at our example above we know that a WRITE request (0x2a) failed but we don’t have anything for the sense data we can use because it wasn’t provided by the array. We only know from the host status (H:0xc) that there was a transient issue that led to the failure of the I/O request and the I/O will be retried. What is even more interesting though is if you look at the sense data it did provide it is actually just gibberish. There is no sense code “0x10” and there is no ASC / ASCQ of 0xbb/0x01. So not only is it not helpful but it is completely wrong.
Does this mean that we cannot use any of the information from the line? No, we can use everything before the “Invalid sense data” to better understand what may be happening (as outlined above), we just cannot rely on the provided SCSI sense data.
NOTE: In earlier versions of ESXi “Invalid sense data” was named “Possible sense data”. So these two are synonymous with one another.
So now you know, if you have “Valid sense data” your storage array is trying to lead you down the path of resolution. While “Invalid sense data” means the ESXi host has lost its mind and is trying to lead you down the bumpy path of rocks. 🙂
If you have any questions or clarifying statements please let me know! Best of luck in your troubleshooting efforts!