High availability has become one of those functions that many companies take for granted. The ability for a mission-critical virtual machine to re-spawn elsewhere in the event of a host failure is incredibly useful. While there is some downtime associated with the virtual machine restarting and recovering itself, the reaction time is fantastic.
This functionality is accomplished by each host maintaining a “heartbeat” with the other servers in the HA cluster. In the event that the other servers stop receiving the heartbeat signal, the cluster assumes that the server is down and restarts its virtual machines on other, available hosts.
Issues arise when a network is not designed properly, or when a server is somehow isolated from the other servers (perhaps by a specific switch failure or a failure of the management network). Suddenly, there are major issues with multiple copies of the same virtual machine running. Not good at all. It takes just a second of thought to understand how complicated the repercussions of this situation are. However, never fear: VMware has heard your cries and incorporated another level of host failure detection in this round of vSphere versions.
Master / Slave Relationship
Gone are the days of primary and secondary nodes. Rather, all nodes in the HA cluster participate automatically. The following criteria determine which host will be the master in the cluster:
- The host with access to the most datastores in the cluster
- In the event of a tie, the ESXi host with the highest MOID (managed object ID)
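To make that ordering concrete, here is a minimal Python sketch of the election logic. This is a toy model, not VMware's actual FDM agent; the host names, MOIDs, and datastore counts are made up for illustration.

```python
# Toy model of the vSphere 5 HA master election ordering:
# the host seeing the most datastores wins, and the highest
# MOID breaks ties. All values below are hypothetical.
hosts = [
    {"name": "esxi01", "moid": "host-22", "datastores": 6},
    {"name": "esxi02", "moid": "host-31", "datastores": 8},
    {"name": "esxi03", "moid": "host-27", "datastores": 8},
]

def election_key(host):
    # Compare datastore count first, then MOID, mirroring
    # the two criteria listed above.
    return (host["datastores"], host["moid"])

master = max(hosts, key=election_key)
print(f"Elected master: {master['name']} ({master['moid']})")
# -> esxi02 wins: it ties on datastore count but has the higher MOID
```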
Master elections occur when:
- HA functionality is enabled initially
- The master node fails
- The master node enters maintenance mode
- The management network becomes partitioned
  - If the management network is somehow split up (by a failed switch, for example), hosts that cannot see the original master will elect a new one and operate within the same HA environment.
  - Upon resolution of the management network partitioning, the multiple master nodes will consolidate into a single master node.
In the new master/slave relationship, the master node is responsible for monitoring the activities of the slave nodes via the heartbeat. Additionally, it maintains a list of the VMs running on each ESXi host. The slave node, on the other hand, monitors the run state of its local VMs and monitors the health of the master node (see, even the master node needs a little love and attention sometimes… it’s hard work being the master).
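As a rough mental model of that division of labor (again, not VMware's actual FDM implementation; every name and timeout here is invented), it looks something like this:

```python
import time

# Toy illustration of the master/slave split described above.
# Real HA logic lives in the FDM agent on each ESXi host; all
# names and timeouts here are invented for the example.

class SlaveNode:
    def __init__(self, name, local_vms):
        self.name = name
        self.local_vms = local_vms          # run state watched locally
        self.last_heartbeat = time.time()   # refreshed by heartbeats

    def send_heartbeat(self):
        self.last_heartbeat = time.time()

class MasterNode:
    HEARTBEAT_TIMEOUT = 15  # seconds; arbitrary for this sketch

    def __init__(self, slaves):
        self.slaves = slaves
        # The master keeps an inventory of which VMs run on which host.
        self.vm_inventory = {s.name: list(s.local_vms) for s in slaves}

    def check_slaves(self):
        now = time.time()
        for s in self.slaves:
            if now - s.last_heartbeat > self.HEARTBEAT_TIMEOUT:
                print(f"{s.name} stopped heartbeating; restart "
                      f"candidates: {self.vm_inventory[s.name]}")

master = MasterNode([SlaveNode("esxi01", ["vm-app01", "vm-db01"])])
master.check_slaves()  # nothing reported while heartbeats are fresh
```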
Storage Heartbeats
All this talk about heartbeat monitoring is a great segue into a new type of heartbeat… the storage heartbeat.
Previously, heartbeats relied upon the IP network to pass status information around to the other nodes. But we can all come up with ways in which this can fail. If the virtualization environment utilizes Fibre Channel storage, or is architected such that IP storage is on a separate physical network, it is possible for the management network to fail while the VMs continue to run uninterrupted. To the VMs, nothing has happened. They may see a drop in client connectivity, but they can still access their storage. Meanwhile, the other hosts will freak out and start up duplicate copies of those VMs on other ESXi hosts. Not good.
VMware has introduced a new heartbeat type that addresses this issue. Storage heartbeats utilize the datastore level to maintain heartbeat information. So, in the event of a network failure, the master will look to the datastores to determine whether an ESXi host is still active. If so, the VMs remain in the same state. If the ESXi host is no longer actively using the datastores, the master node will start the VMs elsewhere. This is accomplished by the storage heartbeat writing to specific locations on VMFS datastores or to a specific file location on NAS datastores.
The heartbeat datastores are selected automatically when the functionality initializes. The datastores can be changed manually, but altering the default behavior is not suggested. When a new datastore is introduced into the environment, or a change to the environment makes a datastore more or less preferable, vCenter will recalculate the proper datastores for the storage heartbeats. In manual mode, you would need to change the selection yourself.
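If you are curious which datastores your cluster actually picked, the vSphere API exposes this. Below is a sketch using pyVmomi, VMware's Python SDK; the vCenter hostname and credentials are placeholders, and it assumes HA is already enabled on the cluster.

```python
# Sketch: list the datastores a cluster is using for HA storage
# heartbeats. Hostname and credentials below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator",
                  pwd="secret", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
for cluster in view.view:
    # Candidate policy: automatic, manual, or manual-with-preference.
    policy = cluster.configurationEx.dasConfig.hBDatastoreCandidatePolicy
    print(f"Cluster {cluster.name} (policy: {policy})")
    info = cluster.RetrieveDasAdvancedRuntimeInfo()
    if info and info.heartbeatDatastoreInfo:
        for hb in info.heartbeatDatastoreInfo:
            print(f"  heartbeat datastore: {hb.datastore.name} "
                  f"({len(hb.hosts)} hosts)")
view.Destroy()
Disconnect(si)
```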
The storage heartbeat is meant to be a final catch-all and ends up being a great diagnostic feature as well. It goes a long way toward protecting the integrity of your server environment and helps keep your virtual machines from having multiple copies running at the same time.
This is one of those features that rely upon a properly designed network… especially if utilizing IP-based storage.
HA States
A new host property exists that will inform you of each ESXi host's state within the HA cluster:
- NA (HA not configured)
- Election (Master election in process)
- Master (remember, in the event of a network partition, there can be more than one master)
- Connected (Connected to the master – aka “Slave”)
- Network partitioned
- Network isolated
- Dead
- Agent Unreachable
- Initialization Error
- Unconfig Error
These new properties can be useful in ascertaining the HA state of your virtualization infrastructure… especially if you are experiencing an HA failure at the moment.
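In the API, these states surface through each host's runtime information. Here is a short pyVmomi sketch (again with placeholder connection details) that prints every host's HA state:

```python
# Sketch: print the HA (FDM) state of every host. dasHostState is
# unset when HA is not configured. Connection details are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator",
                  pwd="secret", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
for host in view.view:
    das = host.runtime.dasHostState
    # 'state' maps to the list above: master, connectedToMaster,
    # networkPartitionedFromMaster, networkIsolated, election, ...
    print(host.name, das.state if das else "n/a (HA not configured)")
view.Destroy()
Disconnect(si)
```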
Conclusion
The new HA operating states and functions in vSphere 5 provide for a more robust HA environment in your virtual datacenter. The new master/slave election process allows for resiliency during a management network partition. Those hosts that can see each other become a new HA sub-environment until the partitioning has been resolved. The storage heartbeat protects virtual machines in the event of a network partition or an IP connectivity failure.
HA will continue to work with multiple versions of ESXi; however, the functions available are limited by the version each host is running. So, if HA is critical to you and you like what you see, you had better start evaluating vSphere 5 at your earliest convenience and roll it out!
