Here we are again with Part 2 of our mini blog series on VMware’s Adaptive Queue Depth Algorithm (AQDA) and Storage I/O Control (SIOC) features being used in parallel with Quality of Service (QoS) Rate Limiting on a Pure Storage FlashArray!
Let’s jump right in since everything else has been explained in Part 1!
Articles in this series:
Pure Storage & VMware: Adaptive Queueing, Storage I/O Control and QoS (Part 1)
Today’s blog post is going to focus on whether or not you should use Adaptive Queueing (AQDA) on the Pure Storage FlashArray with QoS Rate Limiting enabled.
NOTE: If you are not interested in testing then you can go straight to the What does it all mean? section below.
Adaptive Queue Depth (AQDA) Testing
To get a complete picture of whether this feature should be enabled, I tested four main configurations so I would have both a good baseline and a solid sample of setups:
- Single ESXi Host with a single VM
- Single ESXi host with multiple VMs
- Multiple ESXi hosts with single VM on each host
- Multiple ESXi hosts with multiple VMs on each host
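As a quick refresher from Part 1, AQDA is toggled per device in ESXi through the queue-full sample size and threshold settings. Below is a minimal sketch of how it was flipped on for these tests; the naa identifier is a placeholder and the 32/4 values are commonly cited examples rather than a Pure-specific recommendation:

```
# Placeholder device identifier -- substitute the device backing your datastore.
DEVICE="naa.624a9370xxxxxxxxxxxxxxxxxxxx0001"

# A non-zero sample size enables adaptive queueing for the device; together the
# two values control how many TASK_SET_FULL / BUSY statuses it takes to throttle
# the queue depth down and how it recovers (32/4 are example values only).
esxcli storage core device set --device $DEVICE \
  --queue-full-sample-size 32 \
  --queue-full-threshold 4

# Confirm the settings (and the device max queue depth of 64 referenced below).
esxcli storage core device list --device $DEVICE
```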
Before we get into the nitty-gritty of everything, there are a few things to explain about the screenshots so they are easier to understand as you read through them:
- The left side graph is focused on latencies at the various storage levels:
Red: Average Disk/Driver Latency (DAVG)
Blue: Average Kernel Latency (KAVG)
Green: Average Guest Latency (GAVG)
Yellow: Average Queue Latency (QAVG)
- The right side graph is focused on the following three statistics:
Red: Megabytes (MB) read per second
Blue: Megabytes (MB) written per second
Green: Device Queue Depth
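For anyone wanting to reproduce graphs like these, all of the statistics above come straight out of esxtop’s disk-device counters. Here is a minimal sketch of capturing a run for later graphing; the sampling interval and iteration count are just examples:

```
# Capture esxtop in batch mode: 2-second samples, 300 iterations (~10 minutes).
# The resulting CSV contains the per-device driver/kernel/guest/queue latencies
# (DAVG/KAVG/GAVG/QAVG), read/write throughput, and the device queue depth,
# and can be pulled into perfmon, Excel, or your graphing tool of choice.
esxtop -b -d 2 -n 300 > /tmp/aqda-test-run.csv
```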
Single ESXi host with a single VM
This was obviously the easiest of all the tests and was really used as a simple baseline of what we can expect from a behavioral perspective.
This VM was running a 50/50 split of reads and writes at a 32k I/O size, pushing 10,000 IOPS to the FlashArray. To begin with we got a standard baseline of what it looks like with no QoS, AQDA, or anything else enabled:
As you can see above, the VM is pushing ~300MB/s and the queue depth is set to 64 for the underlying datastore (right side). The average latencies for the driver and guest are in lockstep at sub-ms values, while the kernel and queue times sit way down at the bottom and don’t come into play right now (left side).
Okay, so what do things look like when we add QoS into the mix? I set the limit at 75MB/s (a value taken from a real-world case that was opened) and watched what happened:
Here again I don’t see anything all too unexpected. The combined read and write workload drops to ~75MB/s, just as it was set on the FlashArray volume. In parallel we notice that the driver and Guest OS latency increase to ~3ms, which again makes sense: we cut the bandwidth available to the volume, but the guest continues trying to send the larger amount. Once again there is nothing much for ESXi to do, so there is no real latency in the queue or kernel statistics.
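For completeness: the 75MB/s cap itself lives on the FlashArray volume, not anywhere in ESXi. The sketch below shows roughly how that looks from the Purity CLI, but treat it as an assumption — the exact flag name has varied between Purity releases and the volume name is a placeholder, so check `purevol setattr --help` on your array first:

```
# ASSUMED syntax -- verify the bandwidth flag for your Purity release first.
# Applies a 75MB/s bandwidth cap to the volume backing the test datastore.
purevol setattr --bandwidth-limit 75M aqda-test-vol
```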
Now we add AQDA into the mix and see what happens with our statistics:
This graph looks much more interesting and is something I am still investigating internally. You will notice that, once AQDA was enabled, there was no real change to any of the latencies (starting at the far left of each graph). Reviewing the FlashArray showed we were limiting the read and write workload, but it did not appear that we were sending back TASK_SET_FULL responses to the ESXi host. I wondered if I simply wasn’t rate limiting it hard enough, so you will see the incremental increases on the left graph, and incremental drops in bandwidth on the right, as I tried various rate limiting values all the way down to 1MB/s. Adaptive Queueing never kicked in (because no TASK_SET_FULL responses were received) and thus was never used.
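As a side note, one quick way I sanity-checked that the array really wasn’t signalling queue-full back to the host was to look for SCSI completions carrying device status 0x28 (TASK SET FULL) in the vmkernel log. A rough sketch, with the device prefix as a placeholder (keep in mind ESXi does not log every completion, so this is a spot check rather than a counter):

```
# Device status 0x28 is TASK SET FULL; ESXi logs affected completions with a
# status string along the lines of "... D:0x28 ...". No hits for the device
# lines up with AQDA never engaging in this single-VM test.
grep "D:0x28" /var/log/vmkernel.log | grep "naa.624a9370" | tail -20
```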
In this case, testing indicates that a single host, single VM configuration doesn’t benefit (or suffer) from enabling AQDA. There aren’t going to be a lot of these configurations out in the wild, so let’s keep going!
Single ESXi host with multiple VMs
This test involved multiple VMs (2-3) running a 50/50 split of reads and writes at a 32k I/O size and pushing 5,000 IOPS to the FlashArray. To begin with we got a standard baseline of what it looks like with no QoS, AQDA, or anything else enabled:
Again, we can see that all latencies are sub-ms with the read/write bandwidth totaling ~300MB/s. Now that we have our baseline, we set QoS Rate Limiting on the FlashArray volume, once again at 75MB/s, the same as the previous test.
This is where things will start to be significantly different from our single VM and single host test.
Here you will note several important things:
- Note how the kernel latency increases here. This is because the FlashArray is returning TASK_SET_FULL responses for the I/O requests that overrun the limit set on the volume. The kernel has to retry these I/O operations, which results in kernel latency building up.
- The queue latency increases here too. This is mainly for the same reason as point one, but also because non-essential requests (i.e., anything that isn’t I/O from a VM) are pushed off in order to ensure data can still be serviced.
- The Guest OS and driver latencies diverge here. This is because Guest latency is a combination of kernel and driver latencies (along with queue latencies, but we won’t get into why they’re different here) and thus your overall guest latency is higher than just the driver.
- The queue depth (on the right graph) remains steady at 64.
Once again, with QoS Rate Limiting enabled there is nothing unexpected here. This looks as I would expect; I simply wanted to point it out before we introduce AQDA.
All right, AQDA has been enabled, so what do we see?! Well, not much is different as far as the kernel, driver, and Guest OS latency levels go. They pretty much look the same.
What we do notice is that the queue depth value is now fluctuating up and down (right graph), depending on how many TASK_SET_FULL or GOOD responses are received from the FlashArray.
In reality, if we compare the two charts there isn’t much of a difference between them, with the exception of the queue depth value bouncing all over the place. The bandwidth and latencies are pretty much the same.
So again, testing indicates that a single host-multi VM configuration doesn’t benefit (or hurt) with the enablement of AQDA.
This configuration may be slightly more likely than the first, but in reality having an ESXi host that houses all of the VMs on a single datastore still isn’t what we would expect in most instances. So let’s try our next one.
Multiple ESXi hosts with single VM on each host
Okay, so we have tested the single-host theories (with varying numbers of VMs); now what happens if we split the workload between two ESXi hosts? Let’s take a look.
This test involved multiple ESXi hosts, each with a single VM running a 50/50 split of reads and writes at a 32k I/O size and pushing 5,000 IOPS to the FlashArray.
As you will see in the graphs above the two ESXi hosts are sending a combined total of ~150MB/s to the FlashArray and latencies are reported at sub-ms.
Next step, enable QoS:
With QoS enabled we see that the latencies (driver, kernel, queue, and guest) all increase. The total Guest OS latency for both ESXi hosts is ~6-7ms, with a bandwidth of ~40MB/s per host, give or take.
Again, this is all expected and you can read one section above for understanding why that is. In an effort to keep this post as brief as possible, I am trying not to repeat myself too much.
Now we introduce AQDA to the mix and review our results:
In my opinion this is where things start to get interesting. Not only because it is a little closer to a real-world scenario, but because for the first time we see AQDA making a difference… both positive and negative.
If we review ESXi Host “A” with QoS enabled and then compare it to ESXi Host “A” with QoS and AQDA enabled together, you will note a couple of things:
- With AQDA enabled the queue time mellows out significantly (yellow) and we have less intense spikes. This is a good thing overall, and I would call it a win.
- If we look closely at the latency with AQDA enabled, you will see that it increases on this host. The main contributor is the increase in kernel latency; note that the driver latency (driven by QoS in this case) is still at ~2.5-3ms. While it is only an increase of ~2-3ms on average, it is still concerning.
- The bandwidth on this ESXi host drops from the 17.5-20MB/s range (for both reads and writes) to mostly sitting between 15-17.5MB/s.
- With AQDA enabled you will see on this host that the queue depth is all over the place… often hitting as low as “1” slot available.
So overall on ESXi Host “A” we have lost bandwidth and increased latency.
Now let’s review ESXi Host “B” with QoS enabled and with QoS + AQDA enabled together:
- With AQDA enabled the queue time significantly mellows out here too and we have less intense spikes.
- If we look closely at the latency with AQDA enabled, you will see that it decreases on this host. The main contributor, again, is the decrease in kernel latency; note that the driver latency remains at ~3-3.5ms. So here we see the opposite: a decrease in latency of ~2-3ms.
- The bandwidth on this ESXi host increases from the 17.5-20MB/s range (for both reads and writes) to 22.5-25MB/s.
- With AQDA enabled you see that this host is doing very little adjustment to the queue depth.
So overall ESXi Host “B” has increased bandwidth and decreased latency.
Testing with single VMs on multiple ESXi hosts indicates that AQDA penalizes one host while helping another. So, as we discussed previously in the “Part 1” post, there appears to be no fairness happening here between the two hosts OR the VMs.
Opinions may vary here, but I would call this test an overall negative result for AQDA. While it made things slightly better on one host, it made things slightly worse on another. So in essence it seems to “balance out” with it enabled, but from my perspective the QoS-only results were consistent; with AQDA enabled it is a roulette spin as to which host will be impacted more.
Multiple ESXi hosts with multiple VMs on each host
Okay, last scenario! Wahoo! This test involves multiple ESXi hosts with multiple VMs on each ESXi host. This is likely the most common configuration out of all the tests thus far. So let’s get going!
We know the drill by now: this test involved multiple ESXi hosts with multiple VMs (2-4) on each, running a 50/50 split of reads and writes at a 32k I/O size and pushing 5,000 IOPS (each) to the FlashArray.
As you will note above, each ESXi host is sending ~300MB/s to the FlashArray with latency spikes that average out to ~3ms overall. In this case the latency is coming from queueing and the driver (thus the orange color) and isn’t something we need to worry about for this specific post.
All right, on to enabling QoS on the volume. This time I set it to 100MB/s (to begin with) and played around with it more as time went on.
Here is a look at what things look like with it enabled:
For the sake of time (and to avoid repeating myself), here again we see the latencies all increase and the overall bandwidth decrease. All of this is expected and nothing out of the ordinary.
This means it is time to enable AQDA on these two ESXi hosts and see how they handle it with multiple VMs:
Whoa! Now that we have multiple VMs on multiple ESXi hosts, and more serious throttling in place, we can really see a dramatic illustration of just how bad things can get! Let’s look at ESXi Host “A” first, like we did last time.
If we review ESXi Host “A” with QoS enabled and then compare it to ESXi Host “A” with QoS and AQDA enabled, you will note a couple of things:
- With AQDA you will see on this host that (for the most part) the overall latency dropped for all categories, down to ~3-4ms, but then it would drastically increase to over 100ms of latency for ~30 seconds at a time.
- The bandwidth, like the latency, is not consistent at all. It hovers around ~90MB/s most of the time and then takes a drastic fall, down to as low as 5MB/s for ~30 seconds at a time.
- The available queue depth slots are rising and falling but not as quickly as we have seen in the previous tests.
So overall on ESXi Host “A” we have increased bandwidth and decreased latency most of the time… but dramatic swings when it changes.
Now let’s review ESXi Host “B” with QoS enabled and with QoS + AQDA enabled together:
- With AQDA you will see on this host that (for the most part) the overall latency increased for all categories but with the Guest and kernel latency averaging ~70 ms. Like ESXi Host “A” it would then drastically decrease to ~10ms of latency for ~30 seconds at a time.
- The bandwidth, like the latency, is not consistent at all. It hovers around ~10MB/s for most of the time and then drastically increases up to ~95MB/s for ~30 seconds at a time.
- The queue depth here is more consistently just hanging around “1” for this host. Thus the large latency and reduced bandwidth.
- The queue time for this host is astronomical at ~300ms on average, with spikes being much higher.
Here we can see that ESXi Host “B” was almost the mirror opposite of ESXi Host “A” and it took a real beating because of it.
Testing with multiple VMs on multiple ESXi hosts indicates that AQDA severely penalizes one host while helping another… but again, it is inconsistent. It will flip-flop back and forth at random times, for random lengths of time, and not give any host or VM the kind of consistent workload there is with just QoS enabled.
What does it all mean?
Okay, for those of you that made it through those testing examples, CONGRATULATIONS! For those of you that skipped straight here for the answer… well, thanks for being here anyway and still wanting to learn. 🙂
Before we go on to the recommendation, it is very important to understand that this is only a small amount of the actual data and testing done with this feature. There was testing with more than two hosts, more than four VMs per host, varying workloads, etc. While I did find slight differences when three or more hosts were involved instead of two, along with the other variations mentioned, most of the testing showed results fairly consistent with what you see above.
Recommendation / Guidelines
At this point in time the recommendation is to keep the Adaptive Queue Depth feature disabled when utilizing QoS Rate Limiting on the Pure Storage FlashArray.
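If you want to double-check where your hosts stand, here is a quick sketch of confirming the feature is off, both per device and via the host-wide advanced options (the device identifier is a placeholder):

```
# Per-device check: a Queue Full Sample Size of 0 means adaptive queueing is
# disabled for that device.
esxcli storage core device list --device naa.624a9370xxxxxxxxxxxxxxxxxxxx0001 | grep -i "queue full"

# Host-wide defaults; a QFullSampleSize of 0 keeps the algorithm off globally.
esxcli system settings advanced list --option /Disk/QFullSampleSize
esxcli system settings advanced list --option /Disk/QFullThreshold
```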
For those that didn’t follow the testing outlined above, the reasons are as follows:
1. In the simpler cases (such as a single ESXi host with one or multiple VMs) AQDA did not contribute in any positive way. Even when it was engaged, the latency and bandwidth remained roughly the same as when just QoS was enabled. The good news is that it also didn’t cause any additional problems. 🙂
2. In the more complex cases (multiple ESXi hosts and multiple VMs) we found that more often than not AQDA would penalize ESXi hosts in the cluster to varying degrees, sometimes severely. Based on testing, there was no real way to tell which hosts or VMs would be impacted more severely than others. Whether the hosts were all sending the same amount of I/O to the underlying volume or not, the results were somewhat inconsistent.
3. Testing revealed (as well as an issue reported by a customer) that there appears to be a bug within AQDA that causes the queue depth to remain at a value of “1”. This is true even if the TASK_SET_FULL responses stop coming. For some reason that value just gets wedged and won’t change back to its original value (in our case, 64).
It was originally believed that this only happened if you tried to disable AQDA while the queue depth was actively changing from one value to another (a kind of deadlock scenario). Further testing however revealed that the queue depth can become wedged even if you don’t attempt to disable it. The only way to resolve this issue is to reboot the ESXi host. This is obviously very impactful to the environment in and of itself and needs to be investigated more.
At this point in time I think it is pretty clear what the recommendation is and why. If for some reason you believe your environment would benefit from this technology (for one reason or another), I would strongly recommend you thoroughly test as many scenarios as possible to ensure it isn’t going to cause more problems in the already latency-bound environment you’re trying to fix. I do believe there are good use cases for this feature (such as a generally overloaded array from another vendor), but with QoS Rate Limiting on the FlashArray it seems as though it is best left disabled.
Feel free to leave any questions or concerns within the comments section! I am always happy to answer any questions someone may have, even if it means I am wrong and get to learn something new! 🙂
Thanks for reading and happy trails!