This article is about avoiding unnecessary resyncs when an ESXi host in a VSAN Cluster suffers a PSOD. It’s only valid for VSAN Clusters with 4 or more hosts.
The technical justification is to avoid an unnecessary resync of the components that are left in an Absent state following the PSOD.
By default, if an ESXi host suffers a PSOD, it stays powered on but unresponsive, with the PSOD stack trace displayed on the console. It may take a while before anyone notices that the host is unresponsive, and more time still to reboot it.
I assume that the default cluster policy has not been changed from FTT=1 (Number of Failures to Tolerate = 1).
In an HA-enabled VSAN Cluster, should a PSOD happen, the VMs will be automatically restarted on one of the remaining hosts if the VM folder and VM data objects meet the VSAN requirements for accessibility. There are a few situations where VMs can fail to restart, like the ones referred to in the following post: http://blogs.vmware.com/vsphere/tag/psod.
Three different things can happen here:
- The host had running VMs but the VSAN components of those VMs were not located in the Disk Groups of the crashed host.
- The host stored components in its Disk Groups that belong to VMs running on another host, leaving those VMs out of compliance.
- A combination of the first two cases: the host was running the VM and also held one or more components of that VM in its own Disk Groups.
In the first scenario no resync is needed; HA will normally take care of restarting the VM on another host.
In the second scenario HA will kick in almost immediately and VSAN will start the resync after 60 minutes.
And finally, in the third scenario HA should kick in, the VM should be able to power on on another host, and VSAN will start the resync after 60 minutes.
Again, all these scenarios are theoretical and should be tested, for example in a lab environment.
It is important to mention that the resync of components after a PSOD will only happen on VSAN Clusters with at least 4 nodes. In a 3-node VSAN Cluster, should a host fail, VSAN will not reprotect the failed components. With a 3-node VSAN Cluster you get re-protection against magnetic disk and SSD failures, but not against host failures.
The amount of data to be resynced to other hosts can be quite significant and can have a performance impact on the running VMs in the VSAN Cluster. VSAN has an internal scheduler that throttles the resync IO traffic so that recovery from the failure is fast enough while trying not to compromise the performance of the running VMs.
Even with the scheduler in place, we want to guarantee that after a PSOD the host is back before the default timeout of 60 minutes expires, so that no resync operations have to start.
For more information about this setting (ClomRepairDelay) please consult this KB
Also, we want to capture the PSOD dump, in order to diagnose the root cause of the crash.
For that purpose, I strongly recommend setting up a Network Dump Collector. The instructions to set up the Network Dump Collector can be found in this KB:
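The KB remains the authoritative reference. As a rough sketch only, pointing a host at a Network Dump Collector from the ESXi shell usually comes down to three `esxcli system coredump network` calls. The function name `configure_netdump`, the VMkernel interface `vmk0`, the collector address `192.0.2.10` and the port `6500` are placeholders of mine, not values from this article, and the flag names should be verified with `esxcli system coredump network set --help` on your ESXi build.

```shell
# Hypothetical helper (not from the KB): point this host's coredumps
# at a Network Dump Collector.
# Arguments: VMkernel interface, collector IPv4 address, collector port.
configure_netdump() {
    vmk="$1"; server="$2"; port="$3"

    # Tell ESXi which interface to use and where to send the coredump.
    esxcli system coredump network set \
        --interface-name "$vmk" --server-ipv4 "$server" --server-port "$port" || return 1

    # Enable network coredumps.
    esxcli system coredump network set --enable true || return 1

    # Ask the host to verify that the configured collector is reachable.
    esxcli system coredump network check
}

# Example call, to be run on the ESXi host with your own values:
# configure_netdump vmk0 192.0.2.10 6500
```

Wrapping the calls in a function keeps the three steps together and stops at the first one that fails, so a typo in the address does not leave the host half configured without you noticing.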
So now that we know what we want to achieve, it’s just a matter of changing the default ESXi behavior during a PSOD.
Again, the following VMware KB explains how to configure it:
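For orientation, and assuming the setting in question is the `Misc.BlueScreenTimeout` advanced option (my assumption, since the article does not name it), a minimal sketch of changing it from the ESXi shell could look like this. The helper name `set_psod_timeout` and the example value of 600 seconds are purely illustrative; please confirm the option name and units against the KB for your ESXi version.

```shell
# Hypothetical helper: make the host reboot N seconds after a PSOD.
# Assumption: 0 means "no automatic reboot" (stay on the purple screen).
set_psod_timeout() {
    seconds="$1"

    # Reject anything that is not a non-negative integer.
    case "$seconds" in
        ''|*[!0-9]*) echo "timeout must be a non-negative integer" >&2; return 1 ;;
    esac

    # Apply the new timeout.
    esxcli system settings advanced set -o /Misc/BlueScreenTimeout -i "$seconds" || return 1

    # Show the resulting value for confirmation.
    esxcli system settings advanced list -o /Misc/BlueScreenTimeout
}

# Example call, to be run on the ESXi host (600 is an illustrative value):
# set_psod_timeout 600
```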
The tricky part is that we want to leave enough time for the coredump to be transferred over the network to the Network Dump Collector, and at the same time we want the remaining time (60-x) to be enough for the ESXi host to reboot completely.
I will leave it up to you to test this setting and figure out how long the ESXi host takes to send the coredump over the network and how long it takes to reboot.
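Once you have measured both times, the (60-x) reasoning is simple arithmetic. The sketch below follows that framing and assumes the timeout x has to cover the coredump transfer while the rest of the repair delay has to cover the reboot. The function name `psod_timeout_window` and the sample measurements are made up for illustration, and exactly when the countdown starts relative to the dump is something your own tests should confirm.

```shell
# Hypothetical helper: print the smallest and largest acceptable PSOD
# timeout, in seconds.
# Arguments: repair delay in minutes, measured coredump transfer time in
# seconds, measured reboot time in seconds.
psod_timeout_window() {
    delay_min="$1"; dump_secs="$2"; reboot_secs="$3"

    # The timeout must be long enough for the dump to finish...
    min_timeout="$dump_secs"
    # ...and short enough that the reboot still fits in the repair delay.
    max_timeout=$(( delay_min * 60 - reboot_secs ))

    if [ "$max_timeout" -lt "$min_timeout" ]; then
        echo "no valid timeout: dump plus reboot exceed the repair delay" >&2
        return 1
    fi
    echo "$min_timeout $max_timeout"
}

# Made-up measurements: 60-minute delay, 5-minute dump, 10-minute reboot.
psod_timeout_window 60 300 600   # prints "300 3000"
```

In practice you would pick a value comfortably inside that window, leaving some margin for the host to rejoin the cluster after it boots.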
You can try crashing your ESXi host by running the following command in an SSH session:
vsish -e set /reliability/crashMe/Panic 1
Hope this article makes you think about the implications of a PSOD in a VSAN Cluster.