Configure Fault Tolerance in VSAN 6.1

VSAN 6.1 introduced support for SMP-FT VMs, so we can now use Fault Tolerance to protect, for example, our vCenter VM.

Fault Tolerance has some requirements that must be met before it can be enabled.

A full list of the requirements can be found in the vSphere 6.0 documentation:

Let's see how to configure Fault Tolerance on a VSAN 6.1 Cluster:

Configure Host Networking for FT

FT requires compatible hosts and a dedicated 10 Gb network.

I am using a dvSwitch for the setup in this scenario.


Set up vmkernel for FT on every host of the Cluster
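If you prefer the command line, the FT Logging tag can also be applied to an existing VMkernel interface from the ESXi Shell. This is just a sketch: vmk2 is an example interface name, and it assumes the VMkernel port has already been created on the dvSwitch:

```shell
# Tag an existing VMkernel interface for FT Logging (vmk2 is an example)
esxcli network ip interface tag add -i vmk2 -t faultToleranceLogging

# Verify which tags are set on the interface
esxcli network ip interface tag get -i vmk2
```

Repeat this on every host in the cluster.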


Verify that you can vmkping all the FT VMkernel IP addresses that were configured for FT Logging.

Using vmkping, you can test the network connectivity to each of the VSAN nodes through the FT VMkernel uplink.
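For example, a test from one host could look like this (the interface name and IP address are examples from my lab; replace them with your own):

```shell
# Send the ping out of the FT Logging VMkernel interface (vmk2)
# to the FT Logging IP of another host in the cluster
vmkping -I vmk2 192.168.80.12
```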


Testing FT vmkernel interfaces with vmkping

Create a simple VM with 4 vCPUs


Right-click the VM and turn on FT:


Turning on Fault Tolerance on the 4vCPU VM


Select the host that will have the secondary VM


Verify that FT is active


Power On the VM:


And that's it! Your VM is now fully protected by FT, with zero RPO/RTO, should the host running the primary VM fail.

We can now see the two VMs, one running on each host:

[root@ds-lab-vsan14:/var/log] esxcli vm process list
SMP-FT VM - Nelson
World ID: 4569207
Process ID: 0
VMX Cartel ID: 4569169
UUID: 42 32 e3 82 f6 b9 51 00-eb 90 dd 0d d5 83 cb 97
Display Name: SMP-FT VM - Nelson
Config File: /vmfs/volumes/vsan:525145947e3307d9-a9e094a5a7db903d/e8ff2456-0046-be24-8b7d-90b11c2465b3/SMP-FT VM - Nelson.vmx

[root@ds-lab-vsan12:~] esxcli vm process list
SMP-FT VM - Nelson
World ID: 4560087
Process ID: 0
VMX Cartel ID: 4560086
UUID: 42 32 e3 82 f6 b9 51 00-eb 90 dd 0d d5 83 cb 97
Display Name: SMP-FT VM - Nelson
Config File: /vmfs/volumes/vsan:525145947e3307d9-a9e094a5a7db903d/1ee92456-8068-1d50-641d-90b11c2b5454/SMP-FT VM.vmx

Monitor vsan.resync_dashboard From the Bash Shell in the vCenter Appliance


It's been a while since my last post, and this one won't be too long as I'm very busy with VSAN at the moment.

I just found a very interesting way to monitor the progress of vsan.resync_dashboard with a simple bash one-liner that you can execute from within the vCenter Linux appliance. It does not work with the Windows vCenter.

# watch '/usr/bin/rvc -c "vsan.resync_dashboard 1/Datacenter_VSAN/computers/Cluster_VSAN/" -c "quit" root:PaSSW0rd@localhost 2>/dev/null | grep "Total"'


Feel free to test it in your VSAN environments to proactively monitor the progress of resyncs.

Avoiding Resyncs When an ESXi in a VSAN Cluster Suffers a PSOD

This article is about avoiding unnecessary resyncs when an ESXi host in a VSAN Cluster suffers a PSOD. It is only valid for VSAN Clusters with 4 or more hosts.

The technical justification is to avoid an unnecessary resync of the components that are left in an Absent state following the PSOD.

By default, if an ESXi host suffers a PSOD, it will stay up but unresponsive, with the PSOD stack trace displayed on the console. It may take a while before anyone realizes that the host is unresponsive, plus the additional time needed to reboot it.

I assume that the default cluster policy has not been changed from FTT=1.

In an HA-enabled VSAN Cluster, should a PSOD happen, the VMs will be automatically restarted on one of the remaining hosts if the VM folder and VM data objects meet the VSAN accessibility requirements. There are a few situations where VMs can fail to restart, like the ones referred to in the following post:

Three different things can happen here:

  1. The host had running VMs, but the VSAN components of those VMs were not located in the Disk Groups of the crashed host.
  2. The host had components stored in its Disk Groups that belonged to VMs running on other hosts, leaving those VMs out of compliance.
  3. A combination of 1 and 2: the host had the VM running and also had one or more components of that VM in its own Disk Groups.

In the first scenario, no resync is needed; HA will simply take care of restarting the VM somewhere else.

In the second scenario, HA will kick in almost immediately, and the resync will start after 60 minutes.

And finally, in the third scenario, HA should kick in, the VM should power on on another host, and the resync will start after 60 minutes.

Again, all these scenarios are theoretical and should be tested, in a lab environment for example.

It is important to mention that the resync of components after a PSOD will only happen on VSAN Clusters with at least 4 nodes. In a 3-node VSAN Cluster, should a host fail, VSAN will not re-protect the failed components. With a 3-node VSAN Cluster you get re-protection against magnetic disk and SSD failures, but not against host failures.

The amount of data to be resynced to other hosts can be quite significant and can eventually have a performance impact on the VMs running in the VSAN Cluster. VSAN has an internal scheduler that throttles the resync I/O traffic so that it is fast enough to recover from the failure while trying not to compromise the performance of the running VMs.

Even with the scheduler in place, we want to guarantee that after a PSOD the host is back online before the default timeout of 60 minutes expires, so that no resync operations are started.

For more information about this setting (ClomRepairDelay), please consult this KB:

Changing the default repair delay time for a host failure in VMware Virtual SAN(2075456)
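For reference, the setting can also be inspected and changed from the ESXi Shell. This is a sketch based on the KB: the value is in minutes, the change must be applied on every host in the cluster, and clomd has to be restarted afterwards:

```shell
# Check the current repair delay (the default is 60 minutes)
esxcli system settings advanced list -o /VSAN/ClomRepairDelay

# Example only: raise the delay to 90 minutes, then restart the CLOM daemon
esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 90
/etc/init.d/clomd restart
```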

Also, we want to capture the PSOD dump in order to diagnose the root cause of the crash.

For that purpose, I strongly recommend setting up a Network Dump Collector. The instructions to set up the Network Dump Collector can be found in this KB:

ESXi Network Dump Collector in VMware vSphere 5.x (1032051)
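As a quick sketch, the host side of the configuration looks like this from the ESXi Shell (the VMkernel interface, collector IP, and port below are examples; the collector service itself must already be running on the vCenter side):

```shell
# Point the host at the Network Dump Collector
esxcli system coredump network set --interface-name vmk0 \
    --server-ipv4 192.168.1.50 --server-port 6500
esxcli system coredump network set --enable true

# Verify that the host can reach the collector
esxcli system coredump network check
```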

So, now that we know what we want to achieve, it's just a matter of changing the default ESXi behavior during a PSOD.
Again, the following VMware KB explains how to configure it:

Configuring an ESX/ESXi host to restart after becoming unresponsive with a purple diagnostic screen (2042500)

The tricky part is that we want to leave enough time for the coredump to be transferred over the network to the Network Dump Collector, while making sure that the remaining time (60-x) is enough for the ESXi host to fully reboot.
I will leave it up to you to test this setting and figure out how long the ESXi host takes to send the coredump over the network and how long it takes to reboot.

You can try crashing your ESXi host by running the following command in an SSH session:

vsish -e set /reliability/crashMe/Panic 1

Hope this article makes you think about the implications of a PSOD in a VSAN Cluster.

ESXi Scratch Partition on the VSAN Datastore – The Risks

One of the Best Practices for VSAN is to not use the VSAN Datastore for the Scratch Partition or for the Syslog Server.

As per VMware KB:

Creating a persistent scratch location for ESXi 4.x and 5.x (1033696)

Note: It is not supported to configure a scratch location on a VSAN datastore.


So, what is the reason for this?

Imagine that, for some reason, you are forced to remove an ESXi host from the VSAN Cluster. When you type the command:

~ # esxcli vsan cluster leave

You will get this error:

/dev/disks # esxcli vsan cluster leave

Failed to leave the host from VSAN cluster. The command should be retried (Sysinfo error on operation returned status : Failure. Please see the VMkernel log for detailed error information

Vob Stack:

[vob.sysinfo.set.failed]: Sysinfo set operation VSI_MODULE_NODE_umount failed with error status Failure.


Basically, the ESXi host can't leave the cluster because it is impossible to release the configured scratch partition, which is locked on the VSAN Datastore.

This can also happen if the Syslog folder is configured inside the VSAN Datastore.

To solve this, connect directly to your host using the vSphere Client and point the Scratch Partition to another location.
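If you prefer the ESXi Shell, the same change can be sketched with vim-cmd (the datastore name and folder below are examples; the folder must exist, and a reboot is required for the change to take effect):

```shell
# Point the scratch location at a folder on a non-VSAN datastore
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/datastore1/.locker-esx01
```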

Scratch Partition

Syslog Configuration


Follow the best practices and never configure the Scratch Partition or the Syslog on the VSAN Datastore.




How to Redirect Output from the VSAN Ruby Console to a File


I was struggling today to get the output of the Ruby console redirected to a file. I was using the Windows vCenter Ruby Console.

On the Linux vCenter, the regular Linux redirectors work as usual.

So, in order to get this working, you have to modify rvc.bat to look like this:


rvc.bat is located here:


C:\Program Files\VMware\Infrastructure\VirtualCenter Server\support\rvc


..\ruby-1.9.3-p392-i386-mingw32\bin\ruby -Ilib -Igems\backports-3.1.1\lib -Igems\builder-3.2.0\lib -Igems\highline-1.6.15\lib -Igems\nokogiri-1.5.6-x86-mingw32\lib -Igems\rbvmomi-1.7.0\lib -Igems\terminal-table-1.4.5\lib -Igems\trollop-1.16\lib -Igems\zip-2.0.2\lib bin\rvc -c "vsan.support_information 1" -c "quit" administrator:VMware123!@localhost



Once you have modified rvc.bat, just create a shortcut on the Desktop and modify the Target as follows:



RVC Shortcut


Then just click on the shortcut and you will get a text file created at C:\VSAN.log

VSAN Log File



For more information about the Ruby Console check this Blog:



Troubleshooting File Transfer Performance Between VMs – Part 2

So, here we are again to finish this chapter.

After getting the customer on a WebEx session, we managed to go into the BIOS settings of the Dell server and change the System Profile to "Performance".

BIOS Settings

So, we started the ESXi servers and did some testing.

As a reference, before this change, the average speed to copy a 9 GB file from one VM to another was around 20 Mb/s, with some drops during the transfer.

So, here are the results after:

File transfer speed after Power Management change

File Transfer Speed After Power Management Change: 55 Mb/s


Power Management really matters!

Troubleshooting File Transfer Performance Between VMs – Part 1

So, this will be my very first post.

I've been asked to find out why the file transfer performance between two Windows VMs was so poor.

The first step is to draw up a strategy to start ruling out components.

The very first thing to check at a global level is the Power Management of the ESXi hosts.

ESXi hosts offer four ways to control power:

• High Performance: This power policy maximizes performance, using no power management features. It keeps CPUs in the highest P-state at all times. It uses only the top two C-states (running and halted), not any of the deep states (for example, C3 and C6 on the latest Intel processors). High Performance is the default power policy for ESX/ESXi 4.0 and 4.1.
• Balanced: This power policy is designed to reduce host power consumption while having little or no impact on performance. The Balanced policy uses an algorithm that exploits the processor's P-states. Balanced is the default power policy for ESXi 5.
• Low Power: This power policy is designed to more aggressively reduce host power consumption, through the use of deep C-states, at the risk of reduced performance.
• Custom: This power policy starts out the same as Balanced, but it allows individual parameters to be modified.

If the host hardware does not allow the operating system to manage power, only the Not Supported policy is available. (On some systems, only the High Performance policy is available.)
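The active policy can also be checked and changed from the ESXi Shell via the Power.CpuPolicy advanced option. A quick sketch (note that the BIOS may restrict which policies are available to the hypervisor):

```shell
# Read the active host power policy (e.g. Balanced, High Performance)
esxcli system settings advanced list -o /Power/CpuPolicy

# Example only: switch to High Performance
esxcli system settings advanced set -o /Power/CpuPolicy -s "High Performance"
```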

Checking HPM (Host Power Management) MUST always be the very first thing to verify before going deeper into the analysis.

As a reference, in the past I had to troubleshoot a P2V of a VM that was hosting a Java application. Before virtualization, the Tomcat service took around 1 minute and 30 seconds to start. After virtualization, it took 3 minutes and 20 seconds. I verified everything, even tried changing the Java heap size parameters within the application, and in the end I figured out that the ESX host was running on the Balanced HPM policy. After changing the HPM in the BIOS from Balanced to High Performance, the service took 1 minute and 15 seconds to start! Even faster than when the server was physical!

So, lesson learned: for any performance issue, start by checking Host Power Management first!

For more information about this topic check this VMware White Paper

Host Power Management in VMware vSphere® 5