Imported VMs Disappear after vRA 7 In-Place Upgrade

Getting ready to upgrade vRA from 6.x to 7.x? If your system contains imported VMs, pay attention.


There’s a known issue with imported VMs when doing an in-place upgrade. VMware has published KB 2150515 on the issue, but the fix has to be applied before you upgrade. As it happens, we noticed the KB after the upgrade was done and things were already in a bad state. If you can, follow the KB. If it’s too late, read on.

The symptoms are clear. All the old imported VMs disappear from the Items list. When you look at the details of these VMs in Managed Machines, you’ll notice that the machines are in Unmanaged state and the tagged blueprint is “system_blueprint_vsphere”. This is a standard system blueprint, and it does not exist in your Design tab. There is a way to fix the situation. There will be some hurdles, but the procedure is clear: first unregister the affected VMs, either with CloudClient or by removing them directly from the vRA databases, and then reimport them using Bulk Import. Very simple, but in practice things can go wrong in many ways.

Start the fix by exporting all the Managed Machines using the Generate CSV File functionality under Bulk Imports. Remember to select the “Include custom properties” option; this step is crucial to get the VMs back in working order under vRA. Next you need a list of the VMs with the wrong blueprint attached. Unfortunately the bulk export doesn’t contain the blueprint information, so you will have to get that either with CloudClient or from the Managed Machines list in vRA. That list can be exported, but in my experience it is very unreliable: some items can be missing, some are duplicated, and so on.


Compare the two lists and make a final list with just the affected VMs. When done, you’ll need to add and remove a few fields to avoid surprises. First of all, add the proper Deployment Name, Blueprint ID and Component Blueprint ID. The Deployment Name you can decide freely, the Blueprint ID you can find by opening your import blueprint, and the Component Blueprint ID is the name of the actual machine inside the blueprint.
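As a sketch, the list merge above can be done with a few lines of Python. The column names (“Machine Name”, “Deployment Name”, and so on) are assumptions here; match them to the headers of your own Generate CSV File export before using anything like this.

```python
# Sketch: keep only the affected VMs from the Bulk Import export and
# stamp in the new deployment/blueprint fields. Column names are
# assumptions -- check them against your actual export headers.
def build_import_rows(export_rows, affected_names, blueprint_id, component_id):
    """export_rows: list of dicts, e.g. rows from csv.DictReader."""
    result = []
    for row in export_rows:
        if row["Machine Name"] not in affected_names:
            continue
        row = dict(row)  # don't mutate the caller's data
        row["Deployment Name"] = row["Machine Name"] + "-deployment"
        row["Blueprint ID"] = blueprint_id
        row["Component Blueprint ID"] = component_id
        result.append(row)
    return result
```

Feed it the rows from a `csv.DictReader` over the export file and the set of affected VM names, then write the result back out with `csv.DictWriter`.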


Next, you will need to add a custom property for the VM operating system to each of the imported VMs in the CSV file, as per the vRA Release Notes. Even if you go with vRA 7.3, this needs to be done. Here’s an example; just add the items to the end of the line. You need a valid entry for the OS, but funnily enough it does not actually have to match the real OS of the VM. The import will fail if the tag is not a valid OS, but that’s all it checks.


You can get a list of suitable OS codes from here. If this entry is missing from the CSV for any machine, the import will fail.
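As a sketch, appending the property to each line could look like the Python below. The property name `VMware.VirtualCenter.OperatingSystem` and the `,Name,Value` column layout are assumptions drawn from the release-notes workaround; verify both against a row produced by your own export before running this over the real file.

```python
# Sketch: append the guest OS custom property to a single CSV data line.
# Property name and column layout are assumptions -- verify against the
# vRA Release Notes and a sample export row.
def add_os_property(line, os_id="windows9Server64Guest"):
    return line.rstrip("\r\n") + ",VMware.VirtualCenter.OperatingSystem," + os_id
```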

The next thing to modify is the machine lifecycles, or “stubs”. A machine goes through different lifecycle states when it is being provisioned, imported or deleted, for example. The documentation will walk you through the steps, but for Bulk Import we are interested in the RegisterMachine and MachineProvisioned stubs. Depending on your setup, these stubs were called when the original VM was imported, so you might need to remove them from the CSV file when you import the machines back. For example, in our case the MachineProvisioned stub triggered a backup job and added the VM to a DRS group. While getting an extra backup is OK, it’s better not to call the stubs unless they are needed. Leave the other stubs, such as Disposing, alone, since they are needed when the VM is eventually deleted from vRA.

The same goes for the new Event Broker functionality. Again, depending on your setup, it might be wise to disable any post-provisioning tasks you might have, to keep the machines as close to their original state as possible. Pre-provisioning tasks won’t be called when importing.

Now it’s time to get rid of those VMs in vRA. CloudClient does a good job of this, but don’t run too many removals at the same time! We had some issues with the “cloudclient forceunregister” command. We looped through over 400 VMs, and while the commands were issued successfully, the unregister operation failed for about 300 VMs. We didn’t spend too much time figuring out why, but it seemed to be some sort of timeout. We had to clean up the unregister mess using the database scripts from VMware KB 2144269. The lesson here: do them in batches. The Bulk Import operation, on the other hand, worked reliably with 250 VMs in one go without a hiccup.
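A batched loop could be sketched like this in Python. The exact CloudClient invocation (`./cloudclient.sh vra machines forceunregister --id <id>`) is an assumption; check the command syntax and flags on your CloudClient version before using it.

```python
import subprocess
import time

# Sketch: run the unregister command in small batches with a pause
# between them, instead of hammering vRA with 400+ removals at once.
# The CloudClient command line below is an assumption.
def unregister_in_batches(vm_ids, batch_size=25, pause_seconds=60,
                          runner=subprocess.run):
    """Return (vm_id, returncode) for each unregister attempt."""
    results = []
    for i in range(0, len(vm_ids), batch_size):
        for vm_id in vm_ids[i:i + batch_size]:
            proc = runner(["./cloudclient.sh", "vra", "machines",
                           "forceunregister", "--id", vm_id])
            results.append((vm_id, proc.returncode))
        if i + batch_size < len(vm_ids):
            time.sleep(pause_seconds)  # let vRA catch its breath
    return results
```

Checking the returned codes after each batch makes it easy to spot and retry the failures instead of cleaning up a 300-VM mess afterwards.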

There were some other, smaller issues. Make sure the owner of the imported VM has access to the business group, and that the import blueprint is assigned to at least one entitlement; both are checked during import. vRealize Business can also cause problems. In our case vRB Cost Collection failed because vRB found multiple entries for the same VM in the vRA database. vRA doesn’t seem to delete anything from the database; it just creates a new ID for a new object. If you delete a VM with CloudClient, the VM remains in the vRA DB with DELETED status. vRB picks this up and raises an error. The problem should be fixed in vRB 7.2.1 and 7.3, but we hit it with 7.2.1 as well. Contact VMware support for a fix in case you need it.

Another “fun” problem is character encoding. When you export and import a CSV file, vRA does some encoding magic and fails to decode all the special characters properly. Check your import CSV file before you import everything back to vRA; you might need to manually replace some “%40” sequences (the URL encoding of “@”, and there are plenty of others) with the correct characters.
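Since these look like percent-encoded sequences, the standard library can decode them. A sketch, with the caveat that blindly decoding every field can also mangle values that legitimately contain a “%”, so run it over a copy and diff the result:

```python
from urllib.parse import unquote

# Sketch: percent-decode an exported CSV line so sequences like "%40"
# become "@" again before re-importing. Review the output manually.
def decode_csv_line(line):
    return unquote(line)
```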

Before running the Bulk Import, export another list of Unmanaged machines to a CSV file. It’s a good idea to compare the Unmanaged list with the list you prepared for Bulk Import. This is a good way to verify that all machines were removed from vRA and that your import list is accurate.
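The comparison is a simple set difference in both directions. A sketch, assuming you have extracted the machine names from both CSV files:

```python
# Sketch: cross-check the Unmanaged export against the prepared import
# list. "not_removed" are VMs you plan to import that are still managed;
# "not_in_import" are unmanaged VMs missing from the import list.
def diff_vm_lists(unmanaged_names, import_names):
    unmanaged, to_import = set(unmanaged_names), set(import_names)
    return {
        "not_removed": sorted(to_import - unmanaged),
        "not_in_import": sorted(unmanaged - to_import),
    }
```

Both result lists should be empty (or explainable) before you run the import.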


The last step is to run the Bulk Import. I recommend selecting the “Ignore managed machines” option to make sure you don’t overwrite any existing machines. After the import is done, all the machines should be visible again in the Items tab and ready to roll!




EHC 4.1.1 Scalability and Maximums

Architecting large-scale cloud solutions with VMware products means running into several maximums and limits when it comes to the scalability of the different components. People tend to look only at vSphere limits, but the cloud also has several other systems with different kinds of limits to consider. In addition to vSphere, there are limits in NSX, vRO, vRA, vROps and the underlying storage. Even some of the management packs for vROps have limitations that can affect large-scale clouds. Taking everything into consideration requires quite a lot of tech manual surfing to gather all the limitations. Let’s inspect a maxed-out EHC 4.1.1 configuration and see where the limitations are.


The above design contains pretty much everything you can throw at a VMware-based cloud design. The design is based on EHC, so some limitations stem from internal design choices, but almost all of them are relevant to any VMware cloud.

Start from the top. vRA 6.x and 7.1 can handle 50 000 VMs; vRA 7.3 goes up to 75 000 VMs, but EHC 4.1.1 uses vRA 7.1. Enough for you? Things are not that rosy, I’m afraid. Yes, you can put 50 000 VMs under vRA management and it will work. It’s the underlying infrastructure that is going to cause you some grey hair. Even vRealize Orchestrator cannot support 50 000 VMs. It can handle either 35 000 VMs in a standalone install, or 30 000 VMs in a 2-node cluster. Cluster mode is the standard with EHC, so for our design 30 000 VMs is the limit. This limit only applies if your VMs are under vRO’s management, for example when they utilize EHC Backup-as-a-Service. You could have VMs outside of vRO, of course, so in theory you could still reach the maximum VM count. An additional vRO server is an option, but for EHC we use only one instance for all orchestration needs; anything beyond that is outside the scope of EHC.

Next, let’s look at the vCenter blocks of our design. A single vCenter can go up to 10 000 powered-on VMs, 15 000 in total. So just slap 5 of those under vRA and you’re good to go, right? Wrong! There are plenty of other limiting factors, like 2048 powered-on VMs per datastore with VMware HA turned on, but also things like 1000 VMs per ESXi host and 64 hosts per cluster. These usually won’t be a problem. With EHC, you can have a maximum of 4 vCenters with full EHC Services and 6 vCenters outside of EHC Services. You can max out vRA, but you can only have EHC Services for 40 000 VMs. When we take vRO into account, the limit drops to 30 000 VMs. You can still have 20 000 VMs outside of these services on other vCenters, no problem.
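The effective ceiling for VMs with full EHC Services is simply the minimum of the stacked limits above. A sketch, using the numbers quoted in this post as defaults:

```python
# Sketch: the effective ceiling for full-EHC-Services VMs is the minimum
# of the vRA limit, the clustered vRO limit, and the per-vCenter limit
# multiplied by the number of full-EHC vCenters.
def ehc_effective_limit(vra_max=50_000, vro_cluster_max=30_000,
                        full_ehc_vcenters=4, powered_on_per_vcenter=10_000):
    return min(vra_max, vro_cluster_max,
               full_ehc_vcenters * powered_on_per_vcenter)
```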


Inside the vCenter block we have other components besides just vCenter. NSX follows the vSphere 6 limits, so it doesn’t cause any issues. NSX Manager is mapped 1:1 with vCenter, so single vCenter limits apply. You can add multiple vCenters to vRA, so overall limit will not be lowered by NSX. In addition to NSX, we have two collectors for monitoring, Log Insight Forwarder and vROps Remote Collector. Both have some limitations, but they don’t affect the 10 000 VM limit for the block.

As always, storage is a big part of infrastructure design. Depending on your underlying array and replication method, you might not reach the full 10 000 VMs per vCenter. For example, vSAN can only have one datastore per cluster. As mentioned before, combined with HA the limit per cluster is 2048 powered-on VMs with older vSAN versions. However, this limit no longer applies to vSAN 6.x: the maximum for a vSAN cluster is now 6400 VMs, and all of them can be powered on. You can also have only 200 VMs per host with vSAN-based solutions, whereas on a normal cluster the limit is 1024. If you use a vSAN-based appliance such as Dell EMC VxRail, the vCenter limit drops to 6400 VMs, since you can only have one cluster and one datastore.


You most likely want to protect your VMs across sites. There are two methods for this with EHC: Continuous Availability (aka VPLEX/vMSC) and Disaster Recovery (aka RP4VM). The first option, EHC CA, doesn’t limit your vCenter maximum; VPLEX follows vCenter limits the same way NSX does. EHC supports 4 vCenters with VPLEX, which brings the total of CA-protected VMs to 40 000. Again, vRO limits your options a bit, to 30 000 VMs, and yes, you can have VMs outside of VPLEX protection in a separate cluster and separate vCenters. You could have 4 vCenters with 30 000 protected VMs in total with VPLEX, and on top of that 20 000 VMs outside of EHC.


For EHC DR, the go-to option is RecoverPoint for VMs. RP4VM does not use VMware SRM, and it has its own limits: the maximum for a vCenter pair is 2048 VMs with RP4VM 4.3. These limits will grow with the upcoming RP4VM 5.1 release later this year. You can have two vCenter pairs in EHC with RP4VM, so the total of protected VMs would be 4096. You can mix replicated and non-replicated VMs in the same cluster, so the overall limit is not affected beyond vRO. We also support physical RecoverPoint appliances with VMware SRM. SRM can support up to 5000 VMs, and you can use SRM in one vCenter pair only. You can have non-replicated clusters alongside replicated ones, so the overall limit can still be high. Combining RP4VM and SRM, you could have up to 7048 protected VMs between 2 vCenters and 2048 protected VMs between 2 other vCenters, for a total of 9096 DR-protected VMs in the system.


In addition to replication, backup is crucial as well, and backup design can have interesting side effects. Avamar doesn’t have a fixed VM limit, since ingesting backup data doesn’t depend much on VM count; the data change rate is what matters. The backup system limit has to be calculated from the backup window, the number of backup proxies and the data change rate. You can have up to 48 proxies associated with an Avamar grid. Each proxy can back up or restore 8 VMs simultaneously, for a total of 384 VMs. This limit is not fixed, but changing it is not recommended. So at any given moment you can back up 384 VMs. If your backup window is 8 hours and one VM takes 10 minutes to back up, your maximum is 18 432 VMs inside the window (assuming all 384 VMs start and finish within each 10-minute slot). There are a lot of assumptions in these calculations, so be careful when designing the backup infrastructure. You can obviously have multiple Avamar grids if needed.
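The back-of-the-envelope sizing above can be written out as a small function, using the same assumptions (every 10-minute slot fully packed with 384 concurrent backups):

```python
# Sketch: 48 proxies x 8 streams = 384 concurrent backups; an 8-hour
# window divided into 10-minute slots gives 48 slots, so 384 * 48 =
# 18 432 VMs fit in the window under these (optimistic) assumptions.
def max_vms_in_window(proxies=48, streams_per_proxy=8,
                      window_minutes=8 * 60, minutes_per_vm=10):
    concurrent = proxies * streams_per_proxy
    slots = window_minutes // minutes_per_vm
    return concurrent * slots
```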


If you thought that was complex, wait until we get to the monitoring block. You wouldn’t think monitoring could be a limiting factor, but you would be wrong. There are some interesting caveats that should at least be known and taken into consideration. Obviously the platform limits are what really count, but monitoring is a huge part of a working cloud environment. Log Insight doesn’t really have VM limitations; it only cares about incoming events (Events Per Second, EPS). There is a calculator out there to help with the sizing. You can connect up to 10 vCenters, 10 Forwarders, 1 vROps and 1 AD, among other things, to a single Log Insight instance. Our design uses Log Insight Forwarders to gather data from the vCenters and ship it to a main cluster.


vROps is another matter. Whereas the vROps cluster can ingest huge numbers of objects (120 000 with a maximum configuration), the Management Packs can become a bottleneck. The vRealize Automation Management Pack can handle 8000 VMs with vRA 7.x, and 1000 VMs with vRA 6.2. That’s quite a lot less than the 50 000 VMs vRA can support, and it would be nice to have all those VMs monitored, right? The NSX Management Pack also has a limitation of just 2000 NSX objects, but the docs state this is only the tested limit and that it will work beyond 2000 VMs and 300 edges. This is probably true for the vRA Management Pack as well, but it is not stated in the docs.

Finally, vRealize Business for Cloud adds another limit to the mix. It can handle up to 20 000 VMs across 4 vCenters. Again, this limits the overall number of VMs in the system if all of them need to be monitored. Unfortunately there is no way to exclude individual VMs in vRA; all of them are monitored by vRB. You can opt to leave some vCenters outside of vRB monitoring. Combining this limit with the others in this post, the total comes down to 20 000 VMs, and even lower if you want them monitored by vROps. There are ways to go beyond the limits, by not monitoring all of the vCenters or by adding more VMs than is supported and taking the risk. The last part is not recommended, of course.

As you can see, the limitations are all around us. You are golden up to 2000 VMs, but after that you really need to think about what you want to accomplish and do some serious sizing. Well, maybe even a bit before that…

EHC 4.1.1

| Component | VM Limitation | vCenter Limitation | Other | Source |
| --- | --- | --- | --- | --- |
| vCenter 6.0 U2 | 10 000 VMs (Powered On); 15 000 VMs (Registered); 8000 VMs per Cluster; 2048 Powered On VMs on a single Datastore with HA | 64 ESXi hosts per Cluster; 500 ESXi hosts per DC; 1000 ESXi hosts | | vSphere 6 Configuration Maximums |
| vRA 7.1 | 50 000 VMs; 75 000 VMs (vRA 7.3) | | 1 vRO instance per tenant (XaaS limitation); EHC: 1 tenant allowed with EHC Services | vRealize Automation Reference Architecture |
| vRO 7.1 | 35 000 VMs; 15 000 VMs per vRO Node in Cluster Mode | 30 vCenters | Single SSO domain | vSphere 6 Configuration Maximums |
| NSX 6.2.6 | vCenter limits | 1 vCenter per 1 NSX Manager | | vSphere 6 Configuration Maximums |
| vROps 6.2.1 | 120 000 Objects (with fully loaded vROps, 16 Large nodes) | 50 vCenter Adapter instances | 50 Remote Collectors | VMware KB 2130551 |
| Log Insight 3.6 | No VM limitations, only Events Per Second matter | 10 vCenters | 10 Forwarders, 1 AD, 2 DNS Servers, 1 vROps | Log Insight Administration Guide; Log Insight Calculator |
| vSAN 6.2 | 200 VMs per Host; 6400 VMs per Cluster; 6400 Powered On VMs | 64 Hosts per Cluster | 1 Datastore per Cluster; 1 Cluster per VxRail system | vSphere 6 Configuration Maximums; vSAN Configuration Limits |
| vRB for Cloud 7.1 | 20 000 VMs | 4 vCenters | | vRealize Automation Administration Guide |
| Avamar 7.3 | No fixed limit; depends on data change rate, backup window and number of Proxies | 15 vCenters | Maximum of 48 Proxies; 8 concurrent backups per Proxy | Avamar 7.3 for VMware User Guide; EMC KB 411536 |
| VPLEX / vMSC 5.5 SP1 P2 | 10 000 Powered On VMs; 15 000 Registered VMs | Follows vCenter limits | | vSphere 6 Configuration Maximums |
| RecoverPoint 4.4 SP1 P1 / SRM 6.1.1 | 5000 VMs | 1 vCenter pair allowed in EHC | Can recover max 2000 VMs simultaneously | VMware KB 2105500 |
| RecoverPoint for VMs 4.3 SP1 P4 | 1024 individually protected VMs; 2048 VMs per vCenter Pair; 4096 VMs across EHC | 2 vCenter Pairs in EHC | 32 ESXi hosts per cluster; recommended max 512 VMs per vSphere cluster with 4 vRPA clusters; if the EHC Auto Pod is protected with RP4VM, 896 CGs left for Tenant workloads | RP4VM Scale and Performance Guide |
| vRA Mgmt Pack 2.2 | 8000 VMs (with vRA 7 / EHC 4.1.x) | | Mgmt pack v2.0+ | vRA Mgmt Pack Release Notes |
| NSX Mgmt Pack 3.5 | 2000 VMs; 300 Edges (will scale beyond) | | Mgmt pack v3.5+ | NSX Mgmt Pack Release Notes |

Latency Rules and Restrictions for EHC/vRA Multi-Site

EHC 4.1.1 can support up to 4 sites with 4 vCenters across those sites with full EHC capabilities. On top of that, we can connect 6 more external vCenters without EHC capabilities (called vRA IaaS-Only endpoints). There are many things to consider when designing a multi-site solution, but one aspect is often omitted: latency. If you have two sites near each other, latency is usually not a problem. When it comes to multiple sites across continents, we need to consider the round-trip times (RTT) between the main EHC Cloud Management Platform and the remote sites very carefully. There are many components that connect over the WAN to the main instance of EHC and vice versa, and some of them are sensitive to high latency. It’s also difficult to find exact information on what kind of latencies are tolerated; often the manuals just state something like “can be deployed in high latency environments”. Let’s try to find some common factors for designing multi-site environments. For a quick glance at the latencies involved, scroll down to the summary table at the end of this post. For a bit more explanation, read on!

There are several different scenarios for connecting remote sites to EHC:

  1. EHC protected between 2 sites with Disaster Recovery connected with up to 2 remote sites/vCenters
  2. EHC protected between 2 sites with Continuous Availability connected with up to 3 remote sites/vCenters
  3. Single Site EHC connected with up to 3 remote sites/vCenters
  4. Single Site EHC connected with up to 3 remote sites/vCenters and up to 6 vRA IaaS-Only Sites

It’s also possible to have a mix of different protection scenarios (e.g. DR+CA+Single Site), but from a latency perspective these 4 scenarios cover all the limitations. The concept of a “site” is intentionally vague, since it can mean many things in different environments. Often 1 site = 1 vCenter, but we don’t limit EHC like that. For simplicity’s sake, let’s assume that for latencies we have 1 site with 1 vCenter. Within a site you would have a local area network, and between sites a wide area network. If you have several vCenters within a site, latency is normally not an issue, since the local network is fast and has low latency.

For the first two scenarios, storage latency comes into play. The EHC components are almost identical between the scenarios, so the differences in latency come from the storage layer. Depending on the replication technology, latency requirements can be very strict. In a pure Disaster Recovery deployment, the storage latency can be up to 200 ms when using RecoverPoint with asynchronous replication. However, if Continuous Availability is used, the requirement drops to under 10 ms! With Continuous Availability, we utilise vSphere Metro Storage Cluster (vMSC) for an active-active implementation of EHC. The underlying storage technology is VPLEX, and depending on the setup, the latency needs to be under 5 ms (cross-connect) or under 10 ms (non-cross-connect).

The last two scenarios seem simple: you just hook another vCenter to EHC as a vRA endpoint and you’re done, right? Unfortunately it goes a bit deeper than that. The diagram below shows the different components needed for a full EHC-capable remote endpoint/site. The things we have to worry about when it comes to latency are the VMware Platform Services Controller (PSC), vRealize Automation Agents, SMI-S Provider, Log Insight Forwarders and vRealize Operations Manager Remote Collectors. All of these components connect back to the main site, and all of them have latency requirements. If NSX is part of the solution, then the NSX Manager in the remote site will also connect to the primary NSX Manager on the main site. Backup has some limitations as well, but backup replication is usually not the limiting factor.


PSC is perhaps the most sensitive component of them all. There are no official hard requirements for PSC, but according to VMware engineers working with PSC, a comfortable limit is under 100 ms within the same SSO domain. If you go over that, the risk of conflicting changes increases too much. This is a very important point, because EHC requires that all the remote vCenters with full EHC capabilities are part of a single EHC SSO domain. It all comes down to vRealize Orchestrator, which provides orchestration services for the whole of EHC: the SRM plugin for vRO requires that all the vCenters connected to it use the same SSO domain for authentication. We also want to keep all of our EHC SSO architectures the same across different implementations, so that future upgrades are easier. Since we rely on vRO for all of our orchestration needs, this becomes a limitation for multi-site. Therefore the latency needs to be under 100 ms when connecting remote sites or vCenters to EHC. Note that this applies to DR scenarios as well: although RecoverPoint can tolerate latencies up to 200 ms, PSC cannot. Since PSC is a crucial part of the solution, it will also define the maximum latency, unless some other component requires a smaller RTT.
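As a rough sanity check before committing to a design, you can approximate the RTT to a remote site by timing a TCP handshake. This is only a sketch: the host and port are placeholders, and a proper ICMP ping or a measurement from the network team is more accurate than a one-off connect.

```python
import socket
import time

# Sketch: approximate one round trip by timing a TCP connect. Compare
# the result against the ~100 ms PSC comfort limit discussed above.
def tcp_rtt_ms(host, port=443, timeout=5.0):
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000.0
```

Run it a handful of times and take the median, since a single handshake can be skewed by transient congestion.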

The Log Insight Forwarders do not have a published latency requirement, but if you deploy them across a very high-latency WAN, the delay can be compensated for by increasing the Worker count. For vROps Remote Collectors, the latency needs to be under 200 ms. The vRA Agents have a vague description of their latency requirements; all that is said is that they “may be distributed to the geography of the endpoint”. I take it that latency is not an issue in any setup. The next component is the SMI-S Provider, which is used with Dell EMC VNX and VMAX to control the storage arrays with Dell EMC ViPR. The SMI-S Provider automates storage provisioning tasks and ViPR orchestrates them. The requirement is less than 150 ms of latency between ViPR and the SMI-S Provider.

The connection between NSX Managers does not have a published latency requirement, but the maximum is set to the Cross-vCenter vMotion latency of under 150 ms in the NSX Cross-vCenter documentation. This makes sense, since you should be able to do a vMotion between sites, and that feature requires the latency to be under 150 ms. The same limit applies to the NSX Controllers, which in a Cross-vCenter setup need to communicate with the remote hosts and the secondary NSX Manager.

You can also use vRA IaaS-Only endpoints with EHC. These endpoints are vCenters without any EHC services available to them (e.g. Backup-as-a-Service). You can either add them to the same EHC SSO domain as the rest of the endpoints, or create a new one. If you decide to go with a disjointed SSO domain, then obviously the PSC latency limit does not apply. In this case the tolerated latency depends purely on what other components are used with the remote endpoint. At minimum, vRA Agents, a vROps Remote Collector and a Log Insight Forwarder should be there, so the maximum latency would be 200 ms.

Lastly, we need to consider backup replication in all of the scenarios. If the EHC solution includes Backup-as-a-Service functionality, we need to replicate the backup data between sites. This can be done with either Avamar replication or Data Domain replication (Avamar is the frontend for both methods). There is no fixed latency requirement for backup replication. It should be under 100 ms, but the products can be configured to tolerate higher latencies. Anything under 100 ms can be handled with the default replication settings; anything higher, and the implementation team needs to tweak the settings.

To make this even more complex, we also have to look at the primary site and which of its components connect to the remote sites. On top of that, there are some external services, mainly Active Directory, that can cause headaches. The primary site has two components that collect information from the remote sites: VMware vRealize Business for Cloud and Dell EMC ViPR SRM. vRB is not an issue, but ViPR SRM requires a separate Collector deployed in the remote site. The latency between the ViPR SRM Backend and the Collector can be up to 300 ms, but between the Collector and the vCenter/storage, only 20 ms is acceptable.

The final thing to look at is the external services. Active Directory can cause significant delays in login times if there is high latency between the domain controller and the remote component. EHC uses Active Directory across the solution for user authentication and component integration, so it is a crucial service. If there is significant latency on the WAN connection, it is recommended to have a local domain controller at the remote site to ensure fast login times.

You might also have configuration management tools in use, such as Puppet. There are no latency limits available for Puppet, but there are customers out there running a multi-Master implementation with a Master of Masters in high-latency environments without issue. You will most likely face issues with other components in the environment before Puppet becomes a problem.

The summary of all the latencies:

| Component | Communicates with | Latency Requirement | Source |
| --- | --- | --- | --- |
| VPLEX Cluster (Remote Site) | VPLEX Cluster (Primary Site) | < 5 ms (cross-connect); < 10 ms (non-cross-connect) | VPLEX 5.5.x Support Matrix |
| PSC (Remote Site) | PSC (Primary Site) | < 100 ms | VMware KB 2113115 |
| Avamar Server (Remote Site) | Avamar Server (Primary Site) | < ~100 ms | Avamar 7.3 and Data Domain System Integration Guide |
| Data Domain (Remote Site) | Data Domain (Primary Site) | < ~100 ms | Avamar 7.3 and Data Domain System Integration Guide |
| SMI-S Provider | ViPR | < 150 ms | ViPR Support Matrix |
| NSX Manager (Secondary) | NSX Manager (Primary) | < 150 ms | NSX-V Multi-site Options and Cross-VC NSX Design Guide |
| vCenter (Remote Site) | vCenter (Primary) | < 150 ms (vMotion) | VMware KB 2106949 |
| vROps Remote Collector | vROps Cluster Master Node | < 200 ms | VMware KB 2130551 |
| vRPA Cluster (Remote Site) | vRPA Cluster (Primary Site) | < 200 ms | RecoverPoint for VMs Scale and Performance Guide |
| RPA Cluster (Remote Site) | RPA Cluster (Primary Site) | < 200 ms | RecoverPoint 4.4 Release Notes |
| ViPR SRM Collector | ViPR SRM Backend | < 300 ms | ViPR SRM 3.7 Performance and Scalability Guidelines |
| vCenter (Remote Site) | vRealize Business for Cloud | Not specified, but latency sensitive | Architecting a VMware vRealize Business Solution |
| vCenter (Remote Site) | vRealize Orchestrator | Not specified, but latency sensitive | Install and Configure VMware vRealize Orchestrator |
| vRA Agent | vRA Manager/Web | Not specified, high latency OK | VMware KB 2134842 |
| Log Insight Forwarder | Log Insight Cluster | Not specified, high latency OK | Log Insight 3.6 Documentation |

Configure Log Insight Forwarder in Enterprise Hybrid Cloud

As part of our Enterprise Hybrid Cloud, we deploy a Log Insight instance to gather the logs from the various components of the solution. Back in the days of EHC 3.5 and older, we used a single Log Insight appliance or cluster, and all the syslog sources were pointed at it. Since EHC 4.0, that design has changed: we now utilize a separate Log Insight Forwarder instance to collect and forward some of the logs. The reason behind this change is the ability of EHC 4.0 and newer to connect several remote sites (or vCenters) to one main instance of EHC. We want to collect logs from the remote sites as well, but it’s not efficient from a networking perspective to ship the logs straight from the components over the WAN to the main Log Insight cluster. Log Insight has a nifty built-in feature called Event Forwarding that can push the local logs to a central location. It’s designed to work over a WAN, so it can optimize network usage and also encrypt the traffic between sites. Pretty cool! There are plenty of other reasons to use forwarding as well.


Getting the Forwarder up and running is a simple process, but it’s not that well documented in the context of an existing Log Insight cluster. The information can be found in the VMware documentation, but it doesn’t really specify the design. First things first: the Log Insight Forwarder is a separate installation of Log Insight. Unlike vRealize Operations, you cannot deploy a “remote collector” instance of Log Insight and add it to the existing cluster; you have to do a full install of Log Insight. It can be a cluster as well, but since we use it simply to collect and push logs to a central location, a single-node installation is fine for our purposes. Follow the normal process of deploying the Log Insight OVA, configuring the network and launching the installation UI. Choose “New Deployment” and configure Log Insight just like you did for the main cluster.

To get the encrypted connection (not mandatory) working between the Forwarder and the main LI cluster, a trust needs to be established between the two installations. For this you need a custom CA-signed certificate on the main cluster, but that should already be in place for the cluster to work properly; self-signed certificates are not supported for the distributed components of EHC. For the connection to work, you need to add the root certificate chain of the main Log Insight cluster to the Forwarder keystore. See the official documentation for additional information.

  • Copy the trusted root certificate chain with scp or Filezilla into a temporary directory on the Forwarder instance. For example: /home
  • SSH to the Forwarder instance and run the following command:
     /usr/java/jre1.8.0_92/bin/keytool -import -alias loginsight -file /home/Root64.cer -keystore cacerts

    The default keystore password is changeit.
    Note: Java versions might vary with time.

  • Restart the vRealize Log Insight Forwarder instance

After the Forwarder instance is up and running, the final step is to add Event Forwarding between the Forwarder and the Cluster. Follow the docs for additional information. Navigate to Administration interface of the Log Insight Forwarder and select Event Forwarding on the left pane. Choose New Destination, fill out the Log Insight Cluster FQDN, check the Use SSL box, make sure you are using Ingestion API and press Test. You can leave the other options to default. Click Save.


I came across a weird bug with the connection test and SSL. I had a clean Log Insight instance with nothing logging to it. I configured all the steps above and hit Test. It came back with an error: “Failed connection with {LI_FQDN}:9543”. Without SSL the connection test was OK. I double-checked everything and the certificates seemed fine. I then forced a Log Insight Agent to make an SSL connection using the same root certificate chain with the appliance. That was successful, so the error seemed quite odd. I came back, hit Test again, and it succeeded! It seems that if the Log Insight appliance doesn’t have any logs to forward, the Test might fail. It’s also possible that this is a certificate-related issue, but I haven’t gotten to the bottom of it yet.

The last step is to configure the necessary agents and collect information from the local components. In the case of EHC, we divide the components according to the cluster where they are deployed. The Forwarder instance is located in the AMP or Core cluster, so we will use that instance for all the AMP component log collection. This way we can deploy additional sites with exactly the same Log Insight setup on each of them.
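On the agent side, pointing a component at the right instance is just a few lines of agent configuration. Here’s a minimal sketch of the relevant liagent.ini section; the FQDN is a placeholder, and the exact file location and options depend on the platform and agent version:

```ini
; Sketch: point a component's Log Insight agent at the local Forwarder over SSL.
; 'li-forwarder.example.com' is a placeholder FQDN, not a value from this setup.
[server]
; the Log Insight instance this component should log to
hostname=li-forwarder.example.com
; use the Ingestion API (cfapi); 9543 is the SSL port, 9000 the non-SSL one
proto=cfapi
port=9543
ssl=yes
```

The same fragment works for the agents that log straight to the main cluster; just swap in the cluster FQDN.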

For EHC, here’s a list of components and the associated Log Insight instance:

Forwarder Instance:

  • VMware vSphere/vCenter
  • VMware Site Recovery Manager
  • VMware ESXi Servers from all the clusters within the site
  • VMware NSX Manager
  • VMware NSX Edges
  • VMware NSX Controllers
  • VMware NSX Distributed Logical Routers
  • VMware vRealize Operations Manager Remote Collector
  • Dell EMC Storage
  • Dell EMC RecoverPoint
  • Dell EMC RecoverPoint for Virtual Machines
  • Dell EMC Avamar
  • Dell EMC SMI-S
  • Core Microsoft SQL Server
  • Core VMware Platform Services Controller 1 & 2
  • VMware vRealize Automation Agents
  • Microsoft Active Directory (if applicable)
  • Cisco UCS

Main Cluster:

  • VMware vRealize Automation (all the components except for Agents)
  • VMware vRealize Orchestrator
  • VMware vRealize Operations Manager (all components except for Remote Collectors)
  • VMware vRealize Business for Cloud
  • Automation Pod Microsoft SQL Server
  • Dell EMC Data Protection Advisor
  • Dell EMC ViPR
  • Automation Pod VMware Platform Services Controller

Done. Time for some serious log inspection!

vCenter Appliance 5.5 update failed with database issues

Ah, the joys of upgrading the home lab. It’s almost guaranteed that something goes wrong, since I don’t really spend much time maintaining my environment. I wanted to update my vCenter Appliance from 5.5 Update 3d to Update 3e. I normally use the built-in update functionality of the vCenter Appliance VAMI page. That has been one of my favourite features, and it has never failed. Well, until now.

The download and update process worked until the final reboot. After that, I noticed that I could not login to vCenter, so I logged back into VAMI. The vCenter Server service was not running. This is the time for a deep breath, because it’s not gonna be pretty. I did try my luck with rebooting the appliance, but of course that didn’t help. In my experience, if the vCenter Server does not start, it’s almost always the database. Log time.


The vpxd.log did have some errors in it, including one related to ODBC. Database, my friend! There was also a block error, and that can’t be good.


The Postgres logs at “/storage/db/vpostgres/pg_log” had some interesting errors as well. The blocks didn’t match, so there had been a write error at some point. Most likely one of my “reboots” (= yanking the power cable) corrupted the database. Luckily, the fix was quite simple.


First things first: back up your vCenter Appliance before doing anything! To access the database, you need to enable bash for the postgres user; this VMware KB walks you through it quite nicely. After that’s done, we need to get rid of the bad block. There’s another KB on how to do this; it only mentions vFabric Postgres, but it works just fine for the vCenter Appliance, since in practice it’s the same database product under the hood. Follow the guide to remove the bad block. To stop and start Postgres, you can either use the command in the KB or run “service vmware-vpostgres stop/start” as the root user. Just note that you need to use the full path for the pg_ctl command, which is “/opt/vmware/vpostgres/1.0/bin/pg_ctl”; it won’t work without the path, and the KB fails to mention that.
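If the KB’s fix boils down, as most bad-block procedures do, to overwriting the damaged 8 kB block with zeros, the core operation looks like the sketch below. It is demonstrated on a scratch file; on the appliance the target would be the relation file under /storage/db/vpostgres named in the error, and the block number comes from the Postgres log (both values here are made up for illustration):

```shell
BLOCK=8192      # default Postgres block size
BAD=2           # damaged block number reported in pg_log (example value)

# Scratch stand-in for a Postgres relation file: 4 blocks filled with 'A'.
yes A | tr -d '\n' | head -c $((BLOCK * 4)) > /tmp/relation_file

# Zero out only the damaged block; conv=notrunc leaves the file size
# and all the other blocks untouched.
dd if=/dev/zero of=/tmp/relation_file bs=$BLOCK seek=$BAD count=1 conv=notrunc 2>/dev/null
```

Stop the database before touching any files and start it again afterwards, using either the KB’s pg_ctl command or “service vmware-vpostgres stop/start”.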

After the fix the C# client worked fine, but I did get a new error when using the Web Client. You can get rid of that by simply clearing the browser cache.


Now I can finally add my repurposed Mac Mini to my cluster!

Pivotal Cloud Foundry on (tiny) vSphere Lab

Wheels are turning. As we move on from the IaaS space to offer a more developer-friendly PaaS solution, it’s time to learn some Pivotal Cloud Foundry! I wanted to implement PCF on my own to see how it functions under the hood, and also to see how it reacts in a, hmm, more challenging infrastructure environment. I’m running a ridiculously small vSphere lab, which is waaaay under the requirements for PCF. Also, I get frequent power outages because I forget that I’m running servers and flick the power switch carelessly ;).

Let’s see where we are. Here are the official minimum requirements for Pivotal Cloud Foundry:

  • Disk space: 1TB
  • Memory: 120GB
  • vCPU cores: 80
  • Overall CPU: 28 GHz

This is what I have:

  • Disk space: 1.8TB free on NAS (7,200 RPM disks) and an internal 100GB SSD on one server
  • Memory: 32GB
  • vCPU cores: 8 (mobile i3 and i5 processors)
  • Overall CPU: 8.18 GHz
  • Single Ethernet port on servers

Oh boy… I don’t really care about the disk space, since I have the capacity and it’s going to be thin provisioned anyway. The memory and vCPUs, however, might turn out to be a real blocker. I don’t even have the full 32GB of memory at my disposal, since my vCenter and DNS servers take 8GB. I’ve done enough PoCs and lab builds in my time to be fairly certain that the actual need is nowhere near the stated requirements, but the question is how much is really needed to make this run. The only way to find out is to install, so here we go.

The three main components of PCF are the Ops Manager, the Ops Manager Director and the Elastic Runtime. The first two are for managing the PCF environment, whereas the Elastic Runtime is the actual Cloud Foundry environment where apps live. The installation outline is quite simple:

  1. Deploy Ops Manager OVA to the vSphere environment
  2. Log into Ops Manager UI and configure the Director
  3. Deploy Director
  4. Upload Elastic Runtime file to Ops Manager
  5. Configure Elastic Runtime
  6. Deploy Elastic Runtime

You can follow the installation docs for detailed steps. I’m not going to go step by step, since they are very well documented by Pivotal. I would suggest creating and using vSphere Resource Pools to make sure your main components get the necessary resources during resource contention. For PCF, these can be specified in the Director configuration under Availability Zones. I created two AZs, one for the ‘Singleton’ jobs (= the Director itself and the NFS server) and one for the Elastic Runtime VMs. AZs are defined in the Director config and assigned in both the Director and Elastic Runtime configs.


I did run into some issues during my installation. My analysis is that it all came down to a lack of resources, and more particularly to servers timing out during requests. I made a couple of tweaks to my environment to make the installation process a success; it failed several times with different errors before I found the right settings. Everything went fine without tweaking until I tried to deploy the Elastic Runtime. During that deployment about 20 VMs are pushed to vSphere, so this is where my problems started.

The first part of the deployment went fine. After the actual push of the VMs, there were some extra steps called Errands, in which a few additional, optional features are implemented. The first one is a Smoke Test, which checks that the Elastic Runtime is OK and apps can be deployed. This failed once for me with a ‘StagingError’; I suspect there was a timeout when uploading the necessary bits for the test application. In fact, I saw the same thing again after the installation had succeeded and I pushed my first app. During Errands, Apps Manager is also installed. Apps Manager is the GUI for Pivotal Cloud Foundry; not absolutely needed, but I wanted to have it as well for the full experience. This part was the toughest to get through: it failed constantly with different errors, for example database and upload errors. This is where I had to start tweaking the system.


During the configuration of the Elastic Runtime, you can modify the resource config for the different VMs. There are some minimum requirements, and the installer will tell you if you configure the settings too low. I changed the Diego Cell size to ‘medium.disk’ to reduce the memory footprint; Diego Cells are the VMs where applications and their containers run, and I don’t need that much application capacity just to test PCF. I also moved my Director VM to the SSD and changed my DRS setting to ‘Partially Automated’. During the failed installation attempts I noticed that the Director started dropping packets at some point, causing random timeout errors, and that DRS started moving VMs during critical installation stages due to resource contention. CPU cycles seemed to be the problem, and I saw very high CPU ready values in vSphere. PCF requires DRS to be operational, but ‘Partially Automated’ is a supported setting, and this way I saved enough CPU cycles to get rid of the timeouts. On top of that, I only have a single Ethernet port on my servers, so all the management and data traffic goes through the same port; vMotion loads the network and can cause further timeouts.


After these modifications the installation went through without problems! You might need to run it several times, since the timeouts are quite random; this won’t happen if you have the necessary resources, mind you. I also hit a timeout a couple of times during app push. Again, this is down to my poor environment. I wouldn’t recommend trying PCF for the first time in a setup like mine; it’s far easier and a better experience to have a go at the PCF public cloud, Pivotal Web Services.



I yanked the power cable to see what happens. Everything came up disappointingly well, nothing to fix 🙂

vRA 7.0 Reinitiate Installation Wizard


Well, there’s actually a CLI command to do the steps below. Just run vcac-vami installation-wizard activate, and it does everything for you. Sounds like a clean approach to me.



vRA 7.0 comes with a nice Installation Wizard to ease the process of getting vRA and the IaaS components running. However, if you fumble the installation by clicking Cancel without really reading what vRA is trying to tell you (I did that), you cannot access the Installation Wizard again. After that it’s a manual installation, and I’m not going to do that anymore. So, let’s fix it.


Log into the vRA appliance using the SSH client of your choice. Navigate to the /etc/vcac folder. There’s a nice little file called vami.ini. The only thing it contains is this setting:


Jackpot! Edit the file with vi, change false to true, save the file and restart vami service:
service vami-lighttp restart
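The edit itself is a one-word change. Here it is sketched on a scratch copy with a made-up key name; the real key is the single setting in /etc/vcac/vami.ini, which you would edit directly on the appliance:

```shell
# 'example.wizard.flag' is a placeholder for the single key in /etc/vcac/vami.ini;
# on the appliance, make the same false -> true change in that file (vi or sed).
printf 'example.wizard.flag=false\n' > /tmp/vami.ini
sed -i 's/=false$/=true/' /tmp/vami.ini
cat /tmp/vami.ini
```

Then restart the VAMI service as shown above.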

Log back in to VAMI at https://fqdn_of_vra:5480, and the Installation Wizard is reinitiated. If you need to close the Wizard and don’t want to go through this hassle again, click Logout in the upper right corner.