Over the weekend, I learned the hard way what happens when you deviate from VMware’s HCL.
Many months back, I’d decided on HP EX920 1 TB NVMe drives for my home lab ESXi 6.7 U1 hosts.
For reference, the hardware for each host is as follows:
HP EliteDesk 800 G3 SFF (not on VMware HCL)
Intel Core i5-6500 CPU – 4 cores / 4 threads (on VMware HCL)
32 GB DDR4 SDRAM
Intel X520 dual-port 10 GbE NIC (on VMware HCL)
Intel i350-T4 quad-port NIC (on VMware HCL)
The HP EX920 1 TB NVMe drives were installed into the above ESXi 6.7 U1 hosts a few months back and were running without issue.
Last month, I took two of the three hosts and created a two-node vSAN cluster. That’s where the trouble started.
Shortly after creating the vSAN cluster, the VMware update manager indicated that my vSAN components were out of date and required a critical fix. I happily applied ESXi 6.7 U2 to both hosts. On reboot, my NVMe drives were gone. After a few minutes of internet searching, I landed here:
https://www.virtuallyghetto.com/2019/05/quick-tip-crucial-nvme-ssd-not-recognized-by-esxi-6-7.html
I followed the workaround in the blog, which was simply to take the NVME.V00 driver from an ESXi 6.5 build, copy it to /bootbank and reboot the host. Nice! My NVMe drives reappeared. Shortly after completing this step, I ended up removing my vSAN config, as I needed the hosts to study for some upcoming Citrix cert exams.
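If I were doing that driver swap again, I’d keep a copy of the file I was about to overwrite. Here is a minimal Python sketch of the copy step under that assumption. The function name swap_driver and the backup directory are my own inventions for illustration, not part of the linked workaround, and the driver file name is taken from whatever source file you pass in rather than hard-coded.

```python
import shutil
from pathlib import Path
from typing import Optional


def swap_driver(source: Path, bootbank: Path, backup_dir: Path) -> Optional[Path]:
    """Copy `source` into `bootbank` under the same file name.

    If a file with that name already exists in `bootbank`, it is first
    copied to `backup_dir` (hypothetical location, pick your own).
    Returns the backup path, or None if there was nothing to back up.
    """
    if not source.is_file():
        raise FileNotFoundError(f"replacement driver not found: {source}")

    target = bootbank / source.name
    backup = None

    if target.exists():
        # Keep the driver that shipped with the current build so the
        # swap can be undone later.
        backup_dir.mkdir(parents=True, exist_ok=True)
        backup = backup_dir / source.name
        shutil.copy2(target, backup)

    # Overwrite the current driver with the one from the older build.
    shutil.copy2(source, target)
    return backup
```

You would call it with the path of the extracted 6.5 driver file, Path("/bootbank") and a backup directory of your choosing, then reboot the host as the blog describes. Restoring is the same call with the backup file as the source.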
As of the time of this blog post, the hosts are just running local storage, and I run periodic manual vMotions over 10 GbE (fast!).
Over the weekend, I was running some manual vMotions between the hosts over 10 GbE and noted some very odd behavior:
Attempting to vMotion 3 or more VMs from the HP EX920 NVMe drive, either internally to another SSD in the same host or to another host’s drive over 10 GbE, would result in the HP EX920 going offline approx. 5 minutes after the vMotion job completed.
As I was aware that my NVMe driver was taken from an earlier version of ESXi, I decided to repro the issue on various older versions of ESXi. At home, I can provision these HP hosts very quickly, as each is connected to an IP KVM and I keep spare USB drives. So, I was able to test ESXi 6.5, ESXi 6.7 GA, ESXi 6.7 U1 and U2. Once the base ESXi image was installed, I used host profiles to get up and running. Here’s what I found:
Regardless of the ESXi version/build I chose, the issue was easily reproduced: 3 or more vMotions and the HP EX920 would fall over.
In one of my other ESXi hosts, I had a spare Samsung 970 EVO Plus NVMe unit. I decided to attempt to repro the issue on the same ESXi versions/builds using this drive, and I wasn’t able to! My conclusion: the HP EX920 doesn’t play nice with ESXi 6.5 / 6.7, regardless of which NVMe driver you use. Others have reported similar results here and here.
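For anyone who likes to see the elimination logic written down, here is a small Python sketch of the test matrix above. The observations are just the results described in this post. The helper name explains_outcome is made up for illustration. It checks whether a single factor (drive model or ESXi build) fully determines whether the drive went offline.

```python
ESXI_BUILDS = ["6.5", "6.7 GA", "6.7 U1", "6.7 U2"]

# One row per test run described in the post.
# offline=True means the drive dropped after 3 or more vMotions.
OBSERVATIONS = [
    {"esxi": build, "drive": "HP EX920", "offline": True}
    for build in ESXI_BUILDS
] + [
    {"esxi": build, "drive": "Samsung 970 EVO Plus", "offline": False}
    for build in ESXI_BUILDS
]


def explains_outcome(rows, factor):
    """Return True if each value of `factor` always gives the same outcome."""
    seen = {}
    for row in rows:
        # Remember the first outcome seen for this factor value and
        # compare every later row with the same value against it.
        first_outcome = seen.setdefault(row[factor], row["offline"])
        if first_outcome != row["offline"]:
            return False
    return True


print("drive model explains failures:", explains_outcome(OBSERVATIONS, "drive"))
print("ESXi build explains failures:", explains_outcome(OBSERVATIONS, "esxi"))
```

The drive model comes back True and the ESXi build comes back False, which is the whole argument in two lines: every build fails with one drive and passes with the other, so the build is not the variable that matters.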
- THE GOOD: The issue appears to be isolated to the HP EX920 NVMe drives
- THE BAD: I own three of the affected units
- THE UGLY: See above
Anyway, I purchased a pair of replacements: Samsung 970 EVO Plus 1 TB NVMe drives.
All 3 units have been posted for sale on Kijiji. The next owner will be explicitly warned not to use them in an ESXi host!