For a while now, vRA has contained a Health Service that can be used to check and validate the vRA environment. It’s a great tool, and very necessary when migrating from older vRA versions. In this case, it can be used to identify whether any of the VMs and appliances in the environment use an older version of the vRA guest agent (Gugent). Upgrading the agent is a necessary step for a successful migration.
Sometimes the Health tab, which is used to access the Health Service, can be blank. There are many reasons for this, but I encountered a specific one during migrations that was interesting. The vRA documentation has a tip that if the tab is empty, you need to stop the service and start it again. Unfortunately, this did not work for me.
While looking into the logs, an error was discovered in the catalina.out log:
[UTC:2018-09-19 08:28:01,881 Local:2018-09-19 08:28:01,881] vcac: [component="cafe:healthbroker-proxy" priority="ERROR" thread="tomcat-http--48" tenant="ehc" context="yEQTsQB5" parent="" token="yEQTsQB5"] com.vmware.vcac.platform.service.rest.resolver.ApplicationExceptionHandler.handleUnexpectedException:874 - Forbidden
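To pull entries like this out of a busy log, a filter along the following lines can help. This is only a minimal sketch: the function name `healthbroker_errors` is made up for illustration, and the path to catalina.out is passed in as an argument because it may differ between appliances.

```shell
#!/bin/sh
# Hypothetical helper (not part of vRA): print ERROR-priority lines
# logged by the cafe:healthbroker-proxy component in the given log file.
healthbroker_errors() {
    log_file="$1"
    # -F treats each pattern as a fixed string, so the quotes and the
    # colon in the log fields need no escaping.
    grep -F 'component="cafe:healthbroker-proxy"' "$log_file" |
        grep -F 'priority="ERROR"'
}
```

Calling `healthbroker_errors /path/to/catalina.out | tail -n 5` would then show the five most recent matching entries, assuming the log uses the same field layout as the line above.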
The “Forbidden” error is odd, since the user trying to access the Health tab had pretty much all the rights you can have in vRA. Manually assigning Health admin/user rights to the user did not help either. For me, this problem first appeared right after a migration from vRA 7.3 to 7.4. After consulting with VMware Support, I installed the latest vRA hotfix from KB 56618. It did fix the situation, but it turns out the hotfix may not have been the real fix.
Being curious, I installed a new, clean, distributed vRA 7.4 environment. I checked the Health tab after the install and it was fine. Instead of migrating to this vRA again, I installed the hotfix before doing anything else. Surprisingly, after applying the hotfix the Health tab was empty again, with the same error in the log. This should not have happened if the hotfix was actually fixing something. It’s still a good idea to apply the hotfix, as it has plenty of other benefits.
Poking around the vrhb-service, I noticed that even if you shut down the service and use chkconfig to prevent it from starting during boot, the service still comes back up automatically. Having the service in the “off” state in chkconfig should have prevented it from starting after a reboot, but it started anyway. Although nothing in the logs indicated that startup was the problem, this certainly pointed in that direction.
The way I got this fixed was with the following procedure. Disclaimer! This procedure shuts down one vRA node. While this is OK, the environment is in a vulnerable state while the node is down, so be cautious and continue at your own risk.
- Gracefully shut down the secondary vRA node
- On the primary vRA node, shut down vrhb-service
service vrhb-service stop
- Remove the sandbox folders as per KB 56618. Be quick here, before the service restarts automatically
rm -rf /var/lib/vrhb/service-host/sandbox /var/lib/vrhb/vra-test-host/sandbox
- Start the vrhb-service
service vrhb-service start
- Monitor /storage/log/vmware/vrhb/service-host.log and wait for the service to start and begin looking for the secondary node:
[W][2018-09-27T07:17:09.972Z][lambda$checkServiceAvailability$1][SelectOwner for /health/config/auth-bootstrap to https://primary_node_IP:8090/core/node-selectors/healthgroupfailed: java.lang.IllegalStateException: Available nodes: 1, quorum:2]
- At this stage, restart the secondary vRA node and wait to see in the service-host.log that the secondary node is up and there are 2 available nodes
That did it for me: the Health tab returned!
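The log-watching steps in the procedure can be made a little less manual with a small helper. The sketch below assumes the warning always has the `Available nodes: N, quorum:M` shape seen in the single line quoted above; the function name `last_quorum_warning` is hypothetical. It only reports the most recent quorum warning, since that is the only format shown here, and what the log prints once both nodes are available is not covered.

```shell
#!/bin/sh
# Hypothetical helper: print "<available> <quorum>" taken from the most
# recent quorum warning in the given log file. Prints nothing if the
# file contains no such warning.
last_quorum_warning() {
    log_file="$1"
    # -o keeps only the matching fragment of each line.
    grep -o 'Available nodes: [0-9]*, quorum:[0-9]*' "$log_file" |
        tail -n 1 |
        sed 's/Available nodes: \([0-9]*\), quorum:\([0-9]*\)/\1 \2/'
}
```

Running `last_quorum_warning /storage/log/vmware/vrhb/service-host.log` after starting the service should print `1 2` for a warning like the one above, which is the cue to bring the secondary node back.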