After being gone from work for an extended time due to several family events that required my attention, I am so happy to be back! I loved the time I had with my family, it was definitely refreshing, but there is something to be said for being back at work and having your day structured! Additionally, after feeling like you have worked your backside off, it is nice to have something tangible to see and know it was accomplished. Parenthood, it seems, is rarely seen in that way! 🙂
Now that I am back I decided to write my first post on an all too common set of questions / scenarios we see from customers, VMware, and even here internally at Pure Storage. I am hoping this will help people troubleshoot in a more effective manner and relieve some of the stress of thinking there is yet another thing to take care of.
When an environment is impacted by some event, the first question that is almost always asked is, “What happened and why?”. Since there is a driving need to better understand what happened, in-depth investigations often begin.
When the investigation is in its infancy there are a lot of unknowns. Because of those unknowns, these investigations often start with a broad exploration of the logs, eliminating items from our “suspicion list”, so to speak. As the debugging continues, new data points will eliminate (or point towards) the potential culprit. More often than not it will be logging that sheds light on these issues, which makes sense, since that is why developers implement logging to begin with.
So if we now take that and imagine an issue within an ESXi environment, where production was impacted due to potential storage and/or network problems, you’ll want to dive headfirst into the vmkernel logs to see what was reported at the time of the incident (at least I know I would)!
As you are looking through the logs you come across the following log line:
2019-08-09T15:13:32.496Z cpu5:2098263)ScsiDeviceIO: 3435: Cmd(0x459ad5915dc0) 0x85, CmdSN 0x1526 from world 2099993 to dev "naa.624a93708a75393becad4e43000113fb" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Since this error indicates that a SCSI command (0x85) failed and we have valid sense data, this is naturally something you would want to look at.
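To show what the line is actually telling us, here is a small sketch that pulls apart the vmkernel `ScsiDeviceIO` message above. The field names and lookup tables are my own illustration (only a couple of status and sense-key values are included), but the decoded values come straight from the SCSI specs: `D:0x2` is a device status of CHECK CONDITION, and sense data `0x5 0x20 0x0` is ILLEGAL REQUEST / INVALID COMMAND OPERATION CODE.

```python
import re

# The exact log line from the post.
LINE = ('2019-08-09T15:13:32.496Z cpu5:2098263)ScsiDeviceIO: 3435: '
        'Cmd(0x459ad5915dc0) 0x85, CmdSN 0x1526 from world 2099993 to dev '
        '"naa.624a93708a75393becad4e43000113fb" failed H:0x0 D:0x2 P:0x0 '
        'Valid sense data: 0x5 0x20 0x0.')

# H: host status, D: device status, P: plugin status, then key/ASC/ASCQ.
PATTERN = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<opcode>0x[0-9a-f]+),.*dev "(?P<dev>[^"]+)" '
    r'failed H:(?P<host>0x[0-9a-f]+) D:(?P<device>0x[0-9a-f]+) '
    r'P:(?P<plugin>0x[0-9a-f]+) Valid sense data: (?P<key>0x[0-9a-f]+) '
    r'(?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)')

# Abbreviated lookup tables, just enough to decode this one line.
DEVICE_STATUS = {0x0: 'GOOD', 0x2: 'CHECK CONDITION', 0x8: 'BUSY'}
SENSE_KEYS = {0x0: 'NO SENSE', 0x5: 'ILLEGAL REQUEST'}

m = PATTERN.search(LINE)
print('opcode :', m.group('opcode'))                          # 0x85
print('device :', DEVICE_STATUS[int(m.group('device'), 16)])  # CHECK CONDITION
print('sense  :', SENSE_KEYS[int(m.group('key'), 16)],
      'ASC', m.group('asc'), 'ASCQ', m.group('ascq'))
```

In other words, the host delivered the command fine (`H:0x0`), but the device itself refused it, and the sense data tells us exactly why, as we will see below.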
What you would discover here, though, is that this message is reported for all Pure Storage volumes. Not only that, it is reported every 30 minutes!
Naturally this would lead to some questions that need answering, right?
- What does a SCSI opcode of 0x85 even mean?
- Why does it report errors every 30 minutes for every Pure Storage volume?
- Is it the reason for our production impact OR will it eventually be?
These are all fair questions and, to be quite honest, were questions I was asking at one point in my career as well. So let’s address them one by one to better understand.
What does a SCSI opcode of 0x85 even mean?
As I have discussed in other posts, SCSI is, in simple terms, a way for a computer to talk with varying types of devices, which in our case would be a disk. Using a series of SCSI opcodes, a computer is able to tell a device what task it would like to accomplish. Since this blog isn’t about that, you can read the Wikipedia article on SCSI commands for a high-level overview if you desire.
With that understanding, and even referencing the Wikipedia link above, we can see that SCSI opcode 0x85 is an ATA PASS-THROUGH request.
Great, an ATA PASS-THROUGH request. That clears it up, right? 😉
If you want a real in-depth understanding you are always welcome to review the official T10 document on ATA PASS-THROUGH; it’s a real page-turner! But if you don’t care that much, and just want a general understanding, know that it is simply a way for applications, OSs, hosts, etc. to issue an ATA command to the underlying disk(s) in order to perform specific tasks or functions. To keep this post simple we will not go over all of those functions and will focus on just the applicable one: S.M.A.R.T. data. If you read the provided link you can better understand what this request is for, as, again, that isn’t the purpose of this post.
The reason S.M.A.R.T. requests are important here is because that is the data the ESXi host is attempting to probe for on each disk attached to it. This is because a LU is seen as a disk by the host and is therefore treated the same whether it is connected via FC, iSCSI, or directly via SATA / ATA / SAS or SCSI.
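For the curious, here is a rough sketch of what that 0x85 wrapper looks like on the wire: a SMART READ DATA command encapsulated in an ATA PASS-THROUGH (16) CDB, following the field layout in the T10 SAT specification. This is purely illustrative, it is not how ESXi builds the command internally.

```python
# Sketch of a SMART READ DATA request wrapped in an ATA PASS-THROUGH (16)
# CDB (opcode 0x85), per the T10 SAT spec. Illustration only.
PIO_DATA_IN = 0x4           # SAT protocol field value: PIO data-in transfer

cdb = bytearray(16)
cdb[0] = 0x85               # OPERATION CODE: ATA PASS-THROUGH (16)
cdb[1] = PIO_DATA_IN << 1   # PROTOCOL lives in bits 4:1 of byte 1
cdb[2] = 0x0E               # T_DIR=1 (from device), BYT_BLOK=1, T_LENGTH=2
cdb[4] = 0xD0               # FEATURES (7:0): SMART READ DATA subcommand
cdb[6] = 0x01               # SECTOR COUNT (7:0): one 512-byte data block
cdb[10] = 0x4F              # LBA mid (7:0): SMART signature byte
cdb[12] = 0xC2              # LBA high (7:0): SMART signature byte
cdb[14] = 0xB0              # COMMAND: ATA SMART

print(' '.join(f'{b:02x}' for b in cdb))
```

The key takeaway is that the SCSI layer only sees opcode 0x85; the actual ATA command (SMART, 0xB0) is buried inside the CDB for the target to translate, which is exactly the part a FlashArray volume has no use for.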
Great, now we know what the command is doing (in this particular case with ESXi), so let’s see why those pesky errors show up every 30 minutes.
Why does it report errors every 30 minutes for every Pure Storage volume?
As previously explained, since ESXi treats these FC / iSCSI attached LUNs the same as it would a local drive, it tries to gather diagnostic information for each device. It has been that way since VMware added support for S.M.A.R.T. data collection in ESXi 5.1.
The problem, though, is that it shouldn’t, and doesn’t need to, worry about that information for a Pure Storage datastore (or really any storage vendor’s), since that is already monitored from the array itself.
Additionally, volumes presented to the ESXi host for use as a datastore or RDM are often backed by RAID groups of some kind that contain multiple drives in the pool. So when this query is sent, which device is going to respond to it? All of them, some of them, one of them? This is exactly why the drives themselves are managed by the backend storage array in the first place: that hardware was designed to handle these kinds of requirements. 🙂
This is why the FlashArray will reject these requests with the following response:
(0x5) ILLEGAL REQUEST - (0x20/0x00) INVALID COMMAND OPERATION CODE
We are effectively telling the ESXi host that we don’t support this request because we are managing these drives.
What about the 30-minute interval? Well, this one is easy to explain: the ESXi host attempts to read S.M.A.R.T. data every 30 minutes. So every time this request is sent and rejected, the errors are reported in the logs.
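You can confirm this cadence yourself by pulling the timestamps of the 0x85 failures for a single device and checking the spacing between them. The timestamps below are illustrative (the first is the one from the log line above; the rest are made up to show the pattern), but the technique works on real vmkernel logs:

```python
from datetime import datetime

# Illustrative timestamps of repeated 0x85 failures for one device.
stamps = ['2019-08-09T15:13:32.496Z',
          '2019-08-09T15:43:32.501Z',
          '2019-08-09T16:13:32.498Z']

times = [datetime.strptime(s, '%Y-%m-%dT%H:%M:%S.%fZ') for s in stamps]
# Minutes between consecutive occurrences.
deltas = [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
print([round(d) for d in deltas])  # [30, 30]
```

On a real host you would grep the 0x85 failures for one naa. identifier out of vmkernel.log first and feed those timestamps in; the roughly 30-minute spacing is the giveaway that this is the periodic S.M.A.R.T. probe, not an event tied to your incident.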
So what about our last question?
Is it the reason for our production impact OR will it eventually be?
The answer here is undoubtedly: NO. This error message is completely benign and does not, and will not, impact your production environments. The only problem with it is that it is very spammy: every 30 minutes you get errors about the 0x85 request failing. So if you have hundreds of LUNs, this is going to fill up your logs a lot faster than if the messages weren’t reported. That is really the only “impact” you will feel from this message.
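To put a number on “spammy”, here is the back-of-envelope arithmetic, assuming one 0x85 failure per volume every 30 minutes (the LUN count is just an example):

```python
# Back-of-envelope log volume from this one harmless message.
luns = 200                       # example: volumes presented to the host
probes_per_day = 24 * 60 // 30   # ESXi probes S.M.A.R.T. every 30 minutes
entries_per_day = luns * probes_per_day
print(entries_per_day)  # 9600 log lines per day, per host
```

That is noise, not signal, and it is worth keeping in mind when sizing log retention or scanning for the real culprit.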
So if / when you see this message, please do not treat it as your underlying cause.
Don’t follow the rabbit just to see how deep the hole really goes.
Several quick things to note here for clarification:
– At one point we (Pure Storage) found that we were not properly informing the ESXi host that we do not support it sending S.M.A.R.T. queries, which is why these were continually sent to the FlashArray. We fixed this issue (Purity 4.7.4+) and the errors disappeared in ESXi 5.x.
With the release of ESXi 6.0, however, these errors returned; it seems something changed within ESXi such that it no longer queries the array on whether or not we support these requests. We are investigating this with VMware and are hoping to get these errors addressed soon.
– Lastly, a little pre-promotion of NVMe-oF with Pure Storage and VMware: if you are utilizing NVMe rather than SCSI, these errors aren’t anything you have to worry about, as it is a different protocol. So make sure to check it out once it becomes available, and don’t worry, lots more to come on that soon! 🙂
As always, please leave comments, questions, suggestions, etc on anything you feel may be worth discussing! Thanks for reading and we’ll be virtually in touch again, soon!