Wednesday, April 1, 2009

VMWare HA Agent had an error

I recently had a problem with one host in a 3-node VMWare ESX 3.5 cluster. For some reason, the HA Agent would not start after I applied outstanding software updates. I kept receiving "VMWare HA Agent had an error" in the alerts after the host came online. I had recently needed to change the IP address of the VMWare ESX 3.5 host, so I figured that was the root cause.

I tried the standard VMWare HA Agent troubleshooting steps, in no particular order:
1. Reconfiguring HA on the affected host - FAILED
2. Removing and Adding the Host - FAILED
3. Removing the vpxa package from the console on the host - FAILED
4. Verifying that the hosts file contained all of the cluster hosts and the VC (VirtualCenter) server, and that resolv.conf had the correct DNS servers and search domain
5. Rebooting the VMWare ESX 3.5 host
6. Various combinations of the above
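
For step 4, a quick check from the service console might look something like the sketch below. This is only an illustration: check_hosts_entries is a hypothetical helper name, and the host names in the usage line are placeholders, not the ones from my cluster.

```shell
# Hypothetical helper: report which of the given names are missing from a hosts file.
# Comment lines in the file are ignored; names are matched as whole words.
check_hosts_entries() {
    hosts_file="$1"
    shift
    missing=0
    for name in "$@"; do
        if grep -v '^[[:space:]]*#' "$hosts_file" | grep -qwF -- "$name"; then
            echo "ok: $name"
        else
            echo "MISSING: $name"
            missing=1
        fi
    done
    # Non-zero exit status if at least one name was not found.
    return "$missing"
}

# Example call (placeholder names):
#   check_hosts_entries /etc/hosts esx01 esx02 esx03 vc01
```

Running cat /etc/resolv.conf alongside it shows the nameserver and search lines to compare against what the other hosts in the cluster use.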

Nothing worked.

I did some more research, and discovered that the HA Agent caches host information in the following file: /etc/opt/vmware/aam/FT_HOSTS

I simply moved FT_HOSTS to a different filename and reconfigured HA on the VMWare ESX Host. A new FT_HOSTS file was created and HA is now up and running.
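
I did not keep the exact command I typed, so treat the following as a rough sketch of the rename rather than a record of it. move_aside is a hypothetical helper name, and the dated .bak suffix is just my assumption about a sensible backup name; only the FT_HOSTS path comes from the host itself.

```shell
# Hypothetical helper: rename a file out of the way, keeping it as a dated backup.
# Prints the backup path on success; returns non-zero if the file does not exist.
move_aside() {
    target="$1"
    backup="${target}.bak.$(date +%Y%m%d%H%M%S)"
    if [ ! -f "$target" ]; then
        echo "not found: $target" >&2
        return 1
    fi
    mv -- "$target" "$backup" && echo "$backup"
}

# On the affected host, as root, this would be called as:
#   move_aside /etc/opt/vmware/aam/FT_HOSTS
```

Renaming rather than deleting keeps the old file around in case it needs to be compared with the regenerated one. After the rename, HA still has to be reconfigured on the host so that a fresh FT_HOSTS is written.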

I hope this saves someone from spending as much time on this problem as I did!

