Latency Rules and Restrictions for EHC/vRA Multi-Site

EHC 4.1.1 can support up to 4 sites with 4 vCenters across those sites with full EHC capabilities. On top of that, we can connect up to 6 more external vCenters without EHC capabilities (called vRA IaaS-Only endpoints). There are many things to consider when designing a multi-site solution, but one aspect is often overlooked: latency. If you have two sites near each other, latency is usually not a problem. When it comes to multiple sites across continents, however, we need to consider the roundtrip times (RTT) between the main EHC Cloud Management Platform and the remote sites very carefully. Many components connect over the WAN to the main instance of EHC and vice versa, and some of them are sensitive to high latency. It’s also difficult to find exact information on what kind of latencies are tolerated. Often the manuals just state that a component “can be deployed in high latency environments” or something similar. Let’s try to find some common rules for designing multi-site environments. For a quick glance at the latencies involved, scroll down to the summary table at the end of this post. For a bit more explanation, read on!

There are several different scenarios for connecting remote sites to EHC:

  1. EHC protected between 2 sites with Disaster Recovery, connected to up to 2 remote sites/vCenters
  2. EHC protected between 2 sites with Continuous Availability, connected to up to 3 remote sites/vCenters
  3. Single Site EHC connected to up to 3 remote sites/vCenters
  4. Single Site EHC connected to up to 3 remote sites/vCenters and up to 6 vRA IaaS-Only Sites

It’s also possible to have a mix of different protection scenarios (e.g. DR + CA + Single Site), but from a latency perspective, these 4 scenarios cover all the limitations. The concept of a “site” is intentionally vague, since it can mean many things in different environments. Often 1 site = 1 vCenter, but EHC is not limited like that. For simplicity’s sake, let’s assume that for latency purposes we have 1 site with 1 vCenter. Within a site you would have a local area network, and between sites a wide area network. If you have several vCenters within a site, latency is normally not an issue, since the local network is fast and has low latency.

For the first two scenarios, storage latency comes into play. The EHC components are almost identical between the different scenarios, so the differences in latency come from the storage layer. Depending on the replication technology, the latency requirements can be very strict. In a pure Disaster Recovery deployment, the storage latency can be up to 200 ms when using RecoverPoint with asynchronous replication. However, if Continuous Availability is used, the requirement drops to under 10 ms! With Continuous Availability, we utilise vSphere Metro Storage Cluster (vMSC) for an active-active implementation of EHC. The underlying storage technology is VPLEX, and depending on the setup, the latency needs to be under 5 ms (cross-connect) or under 10 ms (non-cross-connect).
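
To make those numbers concrete, here is a minimal sketch (plain Python; the thresholds are the values quoted above and the function name is just an illustration) that maps a measured site-to-site RTT to the protection options that remain on the table:

```python
def feasible_protection(rtt_ms: float) -> list[str]:
    """Map a measured site-to-site RTT (ms) to the storage protection options above."""
    options = []
    if rtt_ms < 5:
        options.append("Continuous Availability (VPLEX, cross-connect)")
    if rtt_ms < 10:
        options.append("Continuous Availability (VPLEX, non-cross-connect)")
    if rtt_ms < 200:
        options.append("Disaster Recovery (RecoverPoint, asynchronous replication)")
    return options

print(feasible_protection(8))   # CA without cross-connect is still possible, DR as well
print(feasible_protection(60))  # only DR with asynchronous RecoverPoint remains
```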

The last two scenarios seem simple: you just hook another vCenter to EHC as a vRA endpoint and you’re done, right? Unfortunately it goes a bit deeper than that. The diagram below shows the different components needed for a full EHC capable remote endpoint/site. The things we have to worry about when it comes to latency are the VMware Platform Services Controller (PSC), vRealize Automation Agents, the SMI-S Provider, Log Insight Forwarders and vRealize Operations Manager Remote Collectors. All of those components connect back to the main site, and all of them have latency requirements. If NSX is part of the solution, then the NSX Manager in the remote site will also connect to the primary NSX Manager on the main site. Backup has some limitations as well, but backup replication is usually not the limiting factor.

[Diagram: EHC_multisite_basic]

PSC is perhaps the most sensitive component of them all. There are no official hard requirements for PSC, but according to VMware engineering working with PSC, a comfortable limit is under 100 ms within the same SSO domain. If you go over that, the risk of conflicting changes increases too much. This is a very important point, because EHC requires that all the remote vCenters with full EHC capabilities are part of a single EHC SSO domain. It all comes down to vRealize Orchestrator, which provides orchestration services for the whole of EHC. The SRM plugin for vRO requires that all the vCenters connected to it use the same SSO domain for authentication. We also want to keep the EHC SSO architecture the same across different implementations, so that future upgrades are easier. Since we rely on vRO for all of our orchestration needs, this becomes a limitation for multi-site. Therefore the latency needs to be under 100 ms when connecting remote sites or vCenters to EHC. Note that this applies to DR scenarios as well: although RecoverPoint can tolerate latencies up to 200 ms, PSC cannot. Since PSC is a crucial part of the solution, it defines the maximum latency, unless some other component requires a smaller RTT.
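
Put another way, the effective limit for a remote site is simply the smallest limit among the components you actually deploy there; PSC usually wins only because nothing stricter is present. A trivial illustration (Python, using the limits discussed in this post):

```python
# The tightest component requirement wins. RTT limits in ms, as discussed in this post.
component_limits_ms = {
    "PSC (same SSO domain)": 100,
    "SMI-S Provider <-> ViPR": 150,
    "NSX Manager <-> NSX Manager": 150,
    "vROps Remote Collector": 200,
    "RecoverPoint (asynchronous)": 200,
}

effective_limit = min(component_limits_ms.values())
print(f"Effective RTT limit for this remote site: < {effective_limit} ms")  # < 100 ms
```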

The Log Insight Forwarders do not have a published latency requirement, but if you deploy them across a very high latency WAN, the delay can be compensated for by increasing the Worker count. For vROps Remote Collectors, the latency needs to be under 200 ms. vRA Agents have only a vague description of their latency requirements; all that is said about them is that they “may be distributed to the geography of the endpoint”. I take this to mean that latency is not an issue in any setup. The next component is the SMI-S Provider. It is used with Dell EMC VNX and VMAX to control the storage arrays with Dell EMC ViPR: the SMI-S Provider automates the storage provisioning tasks and ViPR orchestrates them. There is a requirement of less than 150 ms latency between ViPR and the SMI-S Provider.

The connection between NSX Managers does not have a published latency requirement either, but the NSX Cross-vCenter documentation sets the maximum to the Cross-vCenter vMotion limit of under 150 ms. This makes sense, since you should be able to do a vMotion between sites, and that feature requires the latency to be under 150 ms. The same limit applies to the NSX Controllers: in a Cross-vCenter setup the NSX Controllers need to communicate with the remote hosts and the secondary NSX Manager.

You can also use vRA IaaS-Only endpoints with EHC. These endpoints are vCenters without any EHC services (e.g. Backup-as-a-Service) available to them. You can either add them to the same EHC SSO domain as the rest of the endpoints, or create a new one. If you decide to go with a separate SSO domain, then obviously the PSC latency limit does not apply. In that case the tolerated latency depends purely on what other components are used with the remote endpoint. At a minimum, vRA Agents, a vROps Remote Collector and a Log Insight Forwarder should be there, so the maximum latency would be 200 ms.

Lastly, we need to consider backup replication in all of the scenarios. If the EHC solution includes Backup-as-a-Service functionality, then we need to replicate the backup data between sites. This can be done with either Avamar replication or Data Domain replication (Avamar is the frontend for both replication methods). There is no fixed latency requirement for backup replication. It should be under 100 ms, but the products can be configured to allow higher latencies: anything under 100 ms can be handled with the default replication settings, but anything higher and the implementation team needs to tweak the settings.

To make this even more complex, we also have to look at the primary site and which of its components connect to the remote sites. And on top of that, there are some external services, mainly Active Directory, that can cause headaches. The primary site has two components that collect information from the remote sites: VMware vRealize Business for Cloud and Dell EMC ViPR SRM. vRB is not an issue, but ViPR SRM requires a separate Collector deployed in the remote site. The latency between the ViPR SRM Backend and the Collector can be up to 300 ms, but between the Collector and the vCenter/storage, only 20 ms is acceptable.

The final thing to look at are the external services. Active Directory can cause significant delay in login times if there is a high latency between the domain controller and the remote component. EHC uses Active Directory authentication across the solution for user authentication and component integration, so it is a crucial service. It is recommended to have a local domain controller at the remote site to ensure fast login times if there is significant latency in the WAN connection.

You might also have configuration management tools in use, like Puppet. There are no latency limits available for Puppet, but there are customers out there running a multi-Master implementation with a Master of Masters in a high latency environment without issues. You will most likely face issues with other components in the environment before Puppet becomes a problem.

Here is a summary of all the latencies:

| Component | Communicates with | Latency requirement | Source |
| --- | --- | --- | --- |
| VPLEX Cluster (Remote Site) | VPLEX Cluster (Primary Site) | < 5 ms (cross-connect), < 10 ms (non-cross-connect) | VPLEX 5.5.x Support Matrix |
| PSC (Remote Site) | PSC (Primary Site) | < 100 ms | VMware KB 2113115 |
| Avamar Server (Remote Site) | Avamar Server (Primary Site) | < ~100 ms | Avamar 7.3 and Data Domain System Integration Guide |
| Data Domain (Remote Site) | Data Domain (Primary Site) | < ~100 ms | Avamar 7.3 and Data Domain System Integration Guide |
| SMI-S Provider | ViPR | < 150 ms | ViPR Support Matrix |
| NSX Manager (Secondary) | NSX Manager (Primary) | < 150 ms | NSX-V Multi-site Options and Cross-VC NSX Design Guide |
| vCenter (Remote Site) | vCenter (Primary) | < 150 ms (vMotion) | VMware KB 2106949 |
| vROps Remote Collector | vROps Cluster Master Node | < 200 ms | VMware KB 2130551 |
| vRPA Cluster (Remote Site) | vRPA Cluster (Primary Site) | < 200 ms | RecoverPoint for VMs Scale And Performance Guide |
| RPA Cluster (Remote Site) | RPA Cluster (Primary Site) | < 200 ms | RecoverPoint 4.4 Release Notes |
| ViPR SRM Collector | ViPR SRM Backend | < 300 ms | ViPR SRM 3.7 Performance and Scalability Guidelines |
| vCenter (Remote Site) | vRealize Business for Cloud | Not specified, but latency sensitive | Architecting a VMware vRealize Business Solution |
| vCenter (Remote Site) | vRealize Orchestrator | Not specified, but latency sensitive | Install and Configure VMware vRealize Orchestrator |
| vRA Agent | vRA Manager/Web | Not specified, high latency OK | VMware KB 2134842 |
| Log Insight Forwarder | Log Insight Cluster | Not specified, high latency OK | Log Insight 3.6 Documentation |
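
If you want to validate your own WAN links against these numbers before committing to a design, something like the sketch below can help. It is only an illustration: it shells out to the operating system's ping command (Linux-style summary output assumed) and compares the average RTT against the limits from the table; the hostname is a placeholder you would replace with your own remote-site endpoints.

```python
#!/usr/bin/env python3
"""Rough RTT sanity check against the latency limits in the summary table."""
import re
import subprocess

# Per-component RTT limits in milliseconds, taken from the summary table above.
LIMITS_MS = {
    "PSC": 100,
    "SMI-S Provider": 150,
    "NSX Manager": 150,
    "vROps Remote Collector": 200,
    "ViPR SRM Collector": 300,
}

def avg_rtt_ms(host: str, count: int = 5) -> float:
    """Return the average RTT to host in ms, parsed from the ping summary line."""
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True, check=True).stdout
    # Summary line looks like: rtt min/avg/max/mdev = 0.034/0.052/0.077/0.016 ms
    match = re.search(r"= [\d.]+/([\d.]+)/", out)
    if not match:
        raise RuntimeError(f"could not parse ping output for {host}")
    return float(match.group(1))

if __name__ == "__main__":
    # Placeholder: the remote site's PSC. Add more component/host pairs as needed.
    checks = {"PSC": "psc.remote-site.example.com"}
    for component, host in checks.items():
        rtt = avg_rtt_ms(host)
        limit = LIMITS_MS[component]
        verdict = "OK" if rtt < limit else "TOO SLOW"
        print(f"{component}: {rtt:.1f} ms (limit < {limit} ms) -> {verdict}")
```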