This could easily be a very big post all about configuring HA correctly, but I don’t want to reinvent the wheel. Instead, I just want to share my experiences with this error:
HA Agent on ESX-HOST in cluster CLUSTER-NAME has an error
Odds are that you will eventually see this one pop up in vCenter for one of your ESX 3.x hosts. If you’re not sure what it means, the translation is basically this: if the host displaying the error fails, the VMs running on it probably won’t be started up automatically on another host. Essentially, HA is broken on that host.
The first thing to try in this situation is to right-click on the host and select Reconfigure for HA. Some of the time this will do the trick. It’s a non-service-impacting fix, so if you have change management processes in place, they probably won’t apply here. Simple.
However, if that does not fix the issue you’re going to have to take some slightly more drastic measures. Here are some options:
- Place the host in Maintenance Mode and reboot it
- Disable and then enable HA
- Place the host in Maintenance Mode, remove it from vCenter, then re-add it and take it back out of Maintenance Mode
- Reinstall ESX
The last option is obviously a little drastic but should work. On the downside, if you have a lot of port groups and networking config to recreate and no automated way to do it, it’s not the quickest job. Just rebooting is worth a try; I have seen it work, but it often doesn’t improve the situation. The second option may also work and won’t affect running VMs. By far the best option is the third one: removing the host from vCenter and then re-adding it. None of the network configuration should be lost, and you should be OK as long as the remaining hosts have enough capacity to run all of your VMs and, more importantly, you don’t violate your configured HA failover capacity. It takes about 10 minutes to do and, as long as you use Maintenance Mode, you shouldn’t have any VMs accidentally unregistered.
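Before pulling a host out of the cluster, it can help to sanity-check the capacity condition mentioned above. The following is only a rough sketch in Python, not VMware’s actual admission control calculation: the function name, the host names and the idea of a fixed number of "VM slots" per host are all assumptions made for illustration. It simply asks whether, with one host removed and the configured number of host failures then happening to the largest remaining hosts, the survivors could still hold every running VM.

```python
def can_remove_host(slots_per_host, host_to_remove, running_vms, failures_to_tolerate):
    """Rough check: is it safe to take host_to_remove out of the cluster?

    slots_per_host       -- dict of host name -> VMs it can run (hypothetical figure)
    host_to_remove       -- the host going into Maintenance Mode
    running_vms          -- total number of powered-on VMs in the cluster
    failures_to_tolerate -- configured HA failover capacity, in hosts
    """
    if host_to_remove not in slots_per_host:
        raise KeyError(f"unknown host: {host_to_remove}")

    # Capacity of every host except the one being removed, largest first.
    remaining = sorted(
        (slots for name, slots in slots_per_host.items() if name != host_to_remove),
        reverse=True,
    )

    # Worst case: the largest remaining hosts are the ones that fail.
    survivors = remaining[failures_to_tolerate:]

    # The surviving hosts must be able to run every VM.
    return sum(survivors) >= running_vms


# Hypothetical four-host cluster, each host able to run 10 VMs.
cluster = {"esx01": 10, "esx02": 10, "esx03": 10, "esx04": 10}
print(can_remove_host(cluster, "esx01", running_vms=18, failures_to_tolerate=1))
```

With these made-up numbers, removing esx01 leaves three hosts; if one of those then fails, the remaining two can still hold 18 VMs, so the check passes. Real clusters need real CPU and memory figures, so treat this as a thinking aid rather than a replacement for what vCenter reports.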
But why does this error occur? HA functionality is based on Legato AAM (now called EMC Legato Full Time (ft) AutoStart), so most likely any problems with HA on ESX will be problems with Legato. The documentation for AAM (if you can find any) may yield some answers, but I’ve never found the time to go looking.
Further HA resources
A few places that you could try looking for further HA information:
- Yellow-Bricks HA Deepdive
- Troubleshooting VMware HA Isolation Response – Scott Lowe
- VMware Communities: Detecting HA Failures