I recently came across an interesting issue that I feel is worth writing about. Since we are working so hard to deliver a solid vVol implementation for our customers, it is important to share as much knowledge as possible.
This case was rather interesting and fun to work on, and a great reminder to:
– Always review best practices
– Always check the basics
– Don’t make assumptions
Those are things we should always keep in mind when working on issues. Unfortunately though, those reminders are often presented through issues you have worked way harder on than you should have! 🙂
All VMs on vVol datastores reporting unprotected by vSphere HA
A case was recently opened where a customer reported that all of their VMs on their vVol datastore were reporting the following error message:
This virtual machine failed to become vSphere HA protected and HA may not attempt to restart it after a failure.
The error message itself wasn't impacting production: I/O continued to flow on the VMs, power on/off operations succeeded, and VMs could still be modified. It could become production impacting, though, since vSphere HA would not attempt to restart unprotected VMs after a failure. That made finding a solution quickly important, as you never know when problems may strike.
Equally interesting was that VMFS datastores were unaffected; VMs on VMFS were protected by vSphere HA with no problems. This pointed rather clearly to vVols as the area to examine more closely.
NOTE: If you would like to learn more about vSphere HA then the How vSphere HA works document by VMware is a great place to start.
Tell me then, what happened?
While the underlying cause is somewhat "boring," so to speak, it is a great reminder of what I stated previously. It turns out it was a simple configuration issue. The problem was that the error message put me off track for a while, and I was looking too hard for something that wasn't there.
One of the best things about vVols is that you don't really have to manage connectivity. You just need to make sure your protocol endpoint (PE) is properly configured, then let VASA and vSphere do the rest. We try to make things as simple as possible at Pure Storage, so configuring the protocol endpoint is as simple as registering the storage provider with our vSphere Plugin and you are all set!
What happened here seems to be a simple case of "old habits die hard." When vSphere HA was enabled on the vVol datastore, a config vVol was created, just as we would expect. After it is created there are no actions for you as the end user to take; it will be automagically managed by VASA and ESXi.
The problem is that the administrators were used to managing volumes/LUNs in the traditional sense and thought they had to connect the config vVol directly to their hosts.
This was done by running a command on the FlashArray similar to the following:
purevol connect --hgroup VMware-Prod vvol--vSphere-HA-abcd4567-vg/Config-xxxxxxxx
It could also be done in the GUI by simply selecting the volume and associating it with the desired host or host group.
The issue is that this meant the vSphere HA config vVol was no longer properly managed by VASA. As a result, some of the required properties of the vVol are not provided to the ESXi host, and the host reports the following error in the vvold logs:
2019-09-25 18:09:10.640Z warning vvold [Originator@2549 sub=Default] **** invalid soap response start ****
Because the ESXi host didn't receive all of the required data for the vSphere HA config vVol, the VM being powered on (in this instance) reported that it could not be protected by vSphere HA. That is why none of the vVol VMs were protected: every time that call was made, the required information was missing.
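If you want to confirm you are hitting the same symptom, you can look for that warning in the vvold logs. Here is a minimal sketch: `find_soap_warnings` is a hypothetical helper (not a VMware or Pure Storage tool), and the regex simply matches the message format shown in the log line above.

```python
import re

# Hypothetical helper: scan vvold log lines for the invalid-SOAP-response
# warning shown above. The pattern matches the message format from this
# case; adjust it if your log format differs.
SOAP_WARNING = re.compile(r"warning vvold .*invalid soap response")

def find_soap_warnings(log_lines):
    """Return only the vvold log lines containing the SOAP warning."""
    return [line for line in log_lines if SOAP_WARNING.search(line)]

lines = [
    "2019-09-25 18:09:10.640Z warning vvold [Originator@2549 sub=Default] "
    "**** invalid soap response start ****",
    "2019-09-25 18:09:11.001Z info vvold [Originator@2549 sub=Default] ok",
]
print(find_soap_warnings(lines))  # prints only the warning line
```

In a real environment you would feed this the contents of the vvold log file rather than a hard-coded list.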
For the sake of clarity, this is what the configuration looked like for the vSphere HA config vVol.
Name                                          Size  LUN  Host Group   Host
vvol--vSphere-HA-abcd4567-vg/Config-xxxxxxxx  4G    254  VMware-Prod  esxi-01
vvol--vSphere-HA-abcd4567-vg/Config-xxxxxxxx  4G    254  VMware-Prod  esxi-02
The thing I am emphasizing here is that the LUN is simply "254". While that would be a standard LUN for VMFS datastores, RDMs, etc., it isn't for a vVol. Since vVols are connected through your PE, the volume should actually have a sub-LUN ID, and it should look more like this:
Name                                          Size  LUN    Host Group   Host
vvol--vSphere-HA-abcd4567-vg/Config-xxxxxxxx  4G    254:2  VMware-Prod  esxi-01
vvol--vSphere-HA-abcd4567-vg/Config-xxxxxxxx  4G    254:2  VMware-Prod  esxi-02
Here you can see that the LUN is "254:2", with the sub-LUN ID being the "2". This means the vVol is being presented through the PE to the ESXi host as you would expect, and that VASA is properly managing the vVol.
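The distinction between the two tables above boils down to whether the LUN value contains a sub-LUN ID. As a quick illustrative sketch (the function name is my own, not part of any product):

```python
def is_vasa_managed(lun_id: str) -> bool:
    """A vVol presented through the protocol endpoint shows a sub-LUN ID
    such as '254:2'; a plain LUN like '254' indicates a direct host
    connection, which VASA is no longer managing."""
    return ":" in lun_id

print(is_vasa_managed("254"))    # direct connection, the broken state
print(is_vasa_managed("254:2"))  # presented through the PE, as expected
```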
If you would like to know more, read Cody Hosterman's blogs: Introducing vSphere Virtual Volumes on the FlashArray and What is a Config vVol anyway? Both will help strengthen your understanding of these concepts.
How do we fix it then?
The good news is that this is an easy fix! You simply need to disconnect the vSphere HA config vVol via the GUI or by running the following command on the FlashArray CLI:
purevol disconnect --hgroup VMware-Prod vvol--vSphere-HA-abcd4567-vg/Config-xxxxxxxx
Once this is done, the next time an action needs to be performed on the vSphere HA config vVol (such as powering on a VM or creating a VM), VASA will be able to connect to it as expected and the errors will no longer be reported in the environment.
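If you have many volumes to audit, the check and the fix can be combined. The following is a minimal sketch, assuming rows of (volume name, LUN, host group) taken from a connection listing like the tables above; it flags config vVols with a direct connection (no sub-LUN ID) and prints the corresponding disconnect command for review before you run anything.

```python
# Hypothetical audit helper: given (volume, lun, hgroup) rows from a
# FlashArray connection listing, emit a purevol disconnect command for
# each config vVol that is directly connected (plain LUN, no sub-LUN ID).
def disconnect_commands(rows):
    cmds = []
    for volume, lun, hgroup in rows:
        if "/Config-" in volume and ":" not in lun:
            cmds.append(f"purevol disconnect --hgroup {hgroup} {volume}")
    return cmds

rows = [
    ("vvol--vSphere-HA-abcd4567-vg/Config-xxxxxxxx", "254", "VMware-Prod"),
    ("vvol--Data-vg/Data-yyyyyyyy", "254:3", "VMware-Prod"),
]
for cmd in disconnect_commands(rows):
    print(cmd)  # review each command before running it on the array
```

Only the first row is flagged; the second volume is presented through the PE (sub-LUN ID "254:3") and is left alone.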
We do have this documented in our Web Guide: Implementing vSphere Virtual Volumes with the FlashArray under the “vVol binding” section. Part of the challenge though is that it is a rather large document, so it can be hard to see if you aren’t looking for it.
For good measure though I will add the screenshot here:
So if you are using a Pure Storage FlashArray and all of your vVol VMs complain that they are not protected by vSphere HA, it is likely a quick and easy fix and you can be on your way.
Please let me know if you have any comments or questions surrounding this. I am happy to help wherever I can!