

VMware and SCSI: Why do my Pure Storage datastores report SCSI 0x85 errors every 30 minutes?

August 9, 2019 by jhop

After being away from work for an extended time due to several family events that required my attention, I am so happy to be back! I loved the time I had with my family; it was definitely refreshing, but there is something to be said for being back at work and having your day structured! Additionally, after feeling like you have worked your backside off, it is nice to have something tangible to see and know it was accomplished. Parenthood, it seems, is rarely like that! 🙂

Now that I am back, I decided to write my first post on an all-too-common set of questions and scenarios we see from customers, VMware, and even here internally at Pure Storage. I am hoping this will help people troubleshoot more effectively and relieve some of the stress of thinking there is yet another thing to take care of.

The Scenario:

When an environment is impacted by some event, the first question that is almost always asked is, “What happened and why?”. Since there is a driving need to better understand what happened, an in-depth investigation often begins.

When the investigation is in its infancy there are a lot of unknowns. Because of those unknowns, these investigations often start with a broad exploration of the logs, eliminating items from our “suspicion list”, so to speak. As the debugging continues, new data points will rule out (or point towards) the potential culprit. More often than not it is logging that sheds light on these issues, which makes sense, since that is why developers implement logging in the first place.

So now imagine an issue within an ESXi environment where production was impacted due to potential storage and/or network problems; you’ll want to dive head first into the vmkernel logs to see what was reported at the time of the incident (at least I know I would)!

As you are looking through the logs you come across the following log line:

2019-08-09T15:13:32.496Z cpu5:2098263)ScsiDeviceIO: 3435: Cmd(0x459ad5915dc0) 0x85, CmdSN 0x1526 from world 2099993 to dev "naa.624a93708a75393becad4e43000113fb" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

Since this error indicates that a SCSI command (0x85) has failed and we have Valid Sense Data, this is actually something you would naturally want to look at.
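If you are wading through a large vmkernel.log, it can help to pull the interesting fields out of these lines programmatically. Here is a minimal sketch in Python (the regular expression and field names are my own, for illustration only, not anything VMware ships) that extracts the opcode, device, status bytes, and sense data from a ScsiDeviceIO line like the one above:

import re

# Rough pattern for a vmkernel ScsiDeviceIO failure line (illustrative only;
# the exact format can vary between ESXi releases).
SCSI_LINE = re.compile(
    r'Cmd\(0x[0-9a-f]+\)\s+(?P<opcode>0x[0-9a-f]+),.*?'
    r'to dev "(?P<device>[^"]+)"\s+failed\s+'
    r'H:(?P<host>0x[0-9a-f]+)\s+D:(?P<dev_status>0x[0-9a-f]+)\s+P:(?P<plugin>0x[0-9a-f]+)'
    r'(?:.*?sense data:\s+(?P<sense>0x[0-9a-f]+ 0x[0-9a-f]+ 0x[0-9a-f]+))?',
    re.IGNORECASE,
)

def parse_scsi_error(line: str):
    """Return a dict of the interesting fields from a ScsiDeviceIO log line."""
    match = SCSI_LINE.search(line)
    return match.groupdict() if match else None

line = ('2019-08-09T15:13:32.496Z cpu5:2098263)ScsiDeviceIO: 3435: '
        'Cmd(0x459ad5915dc0) 0x85, CmdSN 0x1526 from world 2099993 to dev '
        '"naa.624a93708a75393becad4e43000113fb" failed H:0x0 D:0x2 P:0x0 '
        'Valid sense data: 0x5 0x20 0x0.')

print(parse_scsi_error(line))
# {'opcode': '0x85', 'device': 'naa.624a93708a75393becad4e43000113fb',
#  'host': '0x0', 'dev_status': '0x2', 'plugin': '0x0', 'sense': '0x5 0x20 0x0'}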

The thing you would discover here, though, is that this message is reported for all Pure Storage volumes; not only that, but it is reported every 30 minutes!

Naturally this leads to some questions that need answering, right?

The Questions:

  • What does a SCSI opcode of 0x85 even mean?
  • Why does it report errors every 30 minutes for every Pure Storage volume?
  • Is it the reason for our production impact OR will it eventually be?

These are all fair questions and, to be quite honest, they were questions I was asking at one point in my career as well. So let’s address them one by one to better understand.

What does a SCSI opcode of 0x85 even mean?

As I have discussed in other posts, SCSI is, in simple terms, a way for a computer to talk with varying types of devices, which in our case would be a disk. Using a series of SCSI opcodes, a computer is able to tell a device what task it would like to accomplish. Since this post isn’t about that, you can read the Wikipedia article on SCSI commands for a high-level overview if you desire.

With that understanding, and even referencing the wiki link above, we can see that SCSI opcode 0x85 is an ATA PASS-THROUGH request.

Great, an ATA PASS-THROUGH request; that clears it up, right? 😉

If you want a real in-depth understanding you are always welcome to review the official T10 document on ATA PASS-THROUGH; it’s a real page turner! But if you don’t care that much and just want a general understanding, then know it is simply a way for applications, OSs, hosts, etc. to issue an ATA command to the underlying disk(s) in order to perform specific tasks or functions. To keep this post simple we will not go over all of those functions and will focus on just the applicable one: S.M.A.R.T. data. If you read the provided link you can better understand what this request is for as, again, that isn’t the purpose of this post.

The reason S.M.A.R.T. requests are important here is that this is the data the ESXi host is attempting to probe for each disk attached to it. A LU is seen as a disk by the host and is therefore treated the same whether it is connected via FC, iSCSI, or directly via SATA / ATA / SAS or SCSI.
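To make that a little more concrete, opcode 0x85 is the 16-byte ATA PASS-THROUGH CDB defined in the T10 SAT specification, and a S.M.A.R.T. read is simply an ATA command (SMART READ DATA, 0xB0) tucked inside it. Below is a rough, illustrative sketch of that byte layout; the exact field encodings live in the SAT document, so treat this as an approximation rather than a reference:

# Illustrative sketch of an ATA PASS-THROUGH (16) CDB carrying a
# SMART READ DATA command. Field encodings are simplified; see the
# T10 SAT specification for the authoritative layout.
cdb = bytes([
    0x85,        # byte 0:   SCSI operation code = ATA PASS-THROUGH (16)
    (4 << 1),    # byte 1:   PROTOCOL = 4 (PIO data-in), EXTEND = 0
    0x0E,        # byte 2:   T_DIR=1 (from device), BYT_BLOK=1, T_LENGTH=2
    0x00, 0xD0,  # bytes 3-4:   FEATURES = 0xD0 (SMART READ DATA subcommand)
    0x00, 0x01,  # bytes 5-6:   SECTOR COUNT = 1 (one 512-byte data block)
    0x00, 0x00,  # bytes 7-8:   LBA low
    0x00, 0x4F,  # bytes 9-10:  LBA mid  = 0x4F (SMART signature value)
    0x00, 0xC2,  # bytes 11-12: LBA high = 0xC2 (SMART signature value)
    0x00,        # byte 13:  DEVICE
    0xB0,        # byte 14:  ATA COMMAND = SMART (0xB0)
    0x00,        # byte 15:  CONTROL
])

assert len(cdb) == 16 and cdb[0] == 0x85
print(cdb.hex(' '))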

Great, now we know what the command is doing (in this particular case with ESXi), so let’s see why those pesky errors show up every 30 minutes.

Why does it report errors every 30 minutes for every Pure Storage volume?

As explained above, since ESXi treats these FC / iSCSI attached LUNs the same as it would a local drive, it tries to gather diagnostic information for each device. It has been that way since VMware added support for S.M.A.R.T. data collection in ESXi 5.1.

The problem, though, is that it shouldn’t / doesn’t need to worry about that information for a Pure Storage datastore (really any storage vendor’s), since that is already monitored by the array itself.

Additionally, volumes presented to the ESXi host for use as a datastore or RDM are often backed by RAID groups of some kind that contain multiple drives in a pool. So when this query is sent, which device is going to respond to it? All of them, some of them, one of them? That is why the drives themselves are managed by the backend storage array to begin with: that is hardware designed to handle these types of requirements. 🙂

This is why the FlashArray will reject these requests with the following response:

(0x5) ILLEGAL REQUEST - (0x20/0x00) INVALID COMMAND OPERATION CODE

We are effectively telling the ESXi host that we don’t support this request, since we are the ones managing these drives.
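If you want to translate the raw “0x5 0x20 0x0” from the vmkernel log into that human-readable response yourself, it is just the sense key plus the ASC/ASCQ pair. A tiny lookup covering only the values relevant here (the full tables live in the SPC standard) might look like this:

# Minimal lookup covering only the values seen in this log line; the full
# tables live in the SPC standard (and on t10.org).
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x20, 0x00): "INVALID COMMAND OPERATION CODE"}

def decode_sense(sense: str) -> str:
    """Decode a 'key asc ascq' triple such as '0x5 0x20 0x0'."""
    key, asc, ascq = (int(x, 16) for x in sense.split())
    return (f"({key:#x}) {SENSE_KEYS.get(key, 'UNKNOWN KEY')} - "
            f"({asc:#04x}/{ascq:#04x}) {ASC_ASCQ.get((asc, ascq), 'UNKNOWN ASC/ASCQ')}")

print(decode_sense("0x5 0x20 0x0"))
# (0x5) ILLEGAL REQUEST - (0x20/0x00) INVALID COMMAND OPERATION CODE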

What about the 30-minute interval? Well, this is easy to explain: the ESXi host attempts to read S.M.A.R.T. data every 30 minutes. So every time this request is sent, the errors will be reported in the logs because we reject that request from the ESXi host.
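An easy way to convince yourself it really is a fixed polling interval, and not something your workload is triggering, is to pull the timestamps of the 0x85 failures for a single device and look at the gaps. A quick, hypothetical sketch (the path and device name below are placeholders for your own environment):

from datetime import datetime

def cadence(log_path: str, device: str):
    """Print the gap between consecutive 0x85 failures for one device."""
    stamps = []
    with open(log_path) as log:
        for line in log:
            if "0x85" in line and device in line and "failed" in line:
                # vmkernel timestamps look like 2019-08-09T15:13:32.496Z
                ts = line.split()[0].rstrip("Z")
                stamps.append(datetime.fromisoformat(ts))
    for earlier, later in zip(stamps, stamps[1:]):
        print(f"{later - earlier} between failures")

# Example (placeholder path and device):
# cadence("/var/run/log/vmkernel.log", "naa.624a93708a75393becad4e43000113fb")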

So what about our last question?

Is it the reason for our production impact OR will it eventually be?

The answer here is undoubtedly: NO. This error message is completely benign and does not, and will not, impact your production environment. The only problem with this error message is that it is very spammy; every 30 minutes you get errors about the 0x85 request failing. So if you have hundreds of LUNs, this is going to fill up your logs a lot faster than if the messages weren’t reported. That is really the only “impact” you will feel from this message.
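Since the messages are benign but noisy, it can also be handy to filter them out while you hunt for the real culprit. A simple, hypothetical filter you could pipe a copy of vmkernel.log through (or adapt to your log tooling of choice):

import sys

# Drop the benign ATA PASS-THROUGH (0x85) failures so real errors stand out.
# Usage (hypothetical): cat vmkernel.log | python3 filter_0x85.py
for line in sys.stdin:
    if "ScsiDeviceIO" in line and " 0x85," in line and "0x5 0x20 0x0" in line:
        continue
    sys.stdout.write(line)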

So if / when you see this message please do not think of it as your underlying cause.

Don’t follow the rabbit just to see how deep the hole really goes.


Wrap Up

Several quick things to note here for clarification:

– At one point we (Pure Storage) found that we were not properly informing the ESXi host that we do not support its S.M.A.R.T. queries, which is why they were continually sent to the FlashArray. We fixed this issue (Purity 4.7.4+) and the errors disappeared in ESXi 5.x.

With the release of ESXi 6.0, however, these errors returned; it seems something changed within ESXi so that it no longer checks whether the array supports these requests before sending them. We are investigating this with VMware and are hoping to get these errors addressed soon.

– Lastly, a little pre-promotion of NVMe-oF with Pure Storage and VMware: if you are utilizing NVMe rather than SCSI, these errors aren’t anything you have to worry about, as it is a different protocol. So make sure to check it out once it becomes available, and don’t worry, lots more to come on that soon! 🙂

As always, please leave comments, questions, suggestions, etc on anything you feel may be worth discussing! Thanks for reading and we’ll be virtually in touch again, soon!


