Add or Upgrade Plugins in vCenter Orchestrator Cluster Mode

Configuring vCenter/vRealize Orchestrator in cluster mode can be tricky. There are several sources of information on how to do it, including the official VMware documentation (vCenter Orchestrator 5.5.2 Documentation), so that part is not a big problem. Upgrading the plugins in cluster mode, however, can be challenging. There is a procedure you have to follow if you don't want to end up in a situation where plugins keep disappearing for no good reason.

The VMware documentation really doesn't cover this use case. In normal operation you will probably have to install new plugins and upgrade old ones from time to time. If you have worked with a single-server install of vCO before, this was a simple matter of uploading and installing the plugin. If you do the same with vCO in cluster mode, it will go horribly wrong. During a customer implementation we came up with a procedure that works and keeps vCO operational. There is a short downtime, because the servers need to be rebooted and kept shut down for a short period of time. The whole thing takes about an hour to complete, but it can be done much faster if there is only one plugin to install.

If you can tolerate the whole hour of downtime, I would suggest disabling the load balancer for the duration of the upgrade. If not, it can be partially up during the upgrade, but obviously there is a risk that workflows get suspended during server reboots. We kept it up, but no one was allowed to use the portal.

The key is to keep only one of the vCO nodes active at a time. Basically you shut down vco2, upgrade/install on vco1, then shut down vco1 and upgrade/install on vco2. Follow this procedure to guarantee a successful install:

  1. Snapshot the vCO VMs
  2. Disable the load balancer leg for vCO node vco2
  3. Shut down vCO node vco2
  4. Open the vCO configuration page for vCO node vco1 and install the new/upgraded plugin (reboot if needed)
  5. Shut down vCO node vco1
  6. Disable the load balancer leg for vco1
  7. Power on vCO node vco2
  8. Enable the load balancer leg for vco2
  9. Open the vCO configuration page for vCO node vco2 and install the new/upgraded plugin (reboot if needed)
  10. Shut down vCO node vco2
  11. Disable the load balancer leg for vco2
  12. Power on vCO node vco1
  13. Enable the load balancer leg for vco1
  14. Verify that the new plugin was installed correctly
  15. Power on vCO node vco2
  16. Enable the load balancer leg for vco2
  17. Verify that the cluster has both nodes in RUNNING state (Server Availability tab)
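
If you want to verify from the console that the Orchestrator service on a node is really up after a reboot (or stop it without powering off the whole VM), here is a minimal sketch, assuming the Linux-based vCO appliance; a Windows-based vCO install uses Windows services instead:

# on the vCO appliance console
service vco-server status
# the configuration web interface runs as its own service
service vco-configurator status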

If you already managed to destroy the cluster by trying to install a plugin while both of the nodes were up (it happens 😉 ), recovering is not too difficult. You might want to snapshot the VMs before doing anything. Here’s how we did it:

  1. Snapshot the vCO VMs
  2. Disable the load balancer for both vCO nodes
  3. Open the vCO configuration page for vCO node vco2, disable Cluster Mode and return to Single Server mode
  4. Shut down vCO node vco2
  5. Open the vCO configuration page for vCO node vco1, disable Cluster Mode and return to Single Server mode
  6. Install/fix the necessary plugins on vco1
  7. Shut down vCO node vco1
  8. Power on vCO node vco2
  9. Open the vCO configuration page for vCO node vco2 and install/fix the necessary plugins
  10. Power on vCO node vco1
  11. Enable Cluster Mode on vco1 in the Server Availability tab and specify 2 active nodes
  12. Export the vCO configuration on vco1: go to the General tab and select Export
  13. Download the configuration file from the vco1 appliance to your desktop
  14. Open the vCO configuration page for vCO node vco2, go to the General tab and import the configuration file. Before applying, deselect the check box; we don't want the import to change the network settings!
  15. Go to the Server Availability tab and verify that both nodes are visible and in RUNNING state (we had to restart vco1 for this to happen)

Bypass Traverse Checking in vRealize Automation 6.2

This week we ran into an interesting problem during a Federation Enterprise Hybrid Cloud implementation. We had the solution implemented with VMware vRealize Automation 6.2, and everything was running smoothly. The vRA implementation was a distributed install, so after configuration we moved on to some vRA component failover testing. We succeeded in failing over the primary components to the secondary components on all of the different VMs (vRA appliance, IaaS Web, IaaS Model Manager + IaaS DEM-O, IaaS DEM-Workers and IaaS DEM-Agents), but the failback was not successful. After diving into the component logs, we found a distinctive error in almost all of them:

System.Configuration.ConfigurationErrorsException: Error creating the Web Proxy specified in the 'system.net/defaultProxy' configuration section

This error appeared on the IaaS Model Manager, DEM-O and DEM-Agents; the rest of the components failed back just fine. The symptom was that the VMware vCloud Automation Center Service and the DEM-Orchestrator Service would not start on reboot. We could not restart them manually either, because they would fail and the same error would appear in the logs. The error points to a .NET call that sets a default proxy according to the web.config file found on the Windows host (Windows\Microsoft.NET\Framework\v4.0.30319\Config). We had not modified these files, so the error did not make a lot of sense. A web.config file also exists in some of the vRA folders, so the origin of this error was unclear. It was clear, however, that the vRA code was calling a .NET function during service start, and that call failed due to a proxy error. This led us on a wild goose chase with VMware support for a couple of days. It became clear that the security settings of the Windows image were blocking the services from starting. Since the issue only occurred after rebooting the Windows VMs, GPO seemed the prime suspect. After engaging the customer's Windows/security SME, we found the root of the problem.

Our customer runs a high-security environment, so their GPO settings are very strict. The vRA manual tells you to grant these rights to the IaaS service user:

"Log on as a batch job" and "Log on as a service"

We verified these settings, and everything was according to the vRA requirements. However, the customer SME found out by using Process Explorer (https://technet.microsoft.com/en-gb/sysinternals/bb896653.aspx) that the service user needs one extra local privilege, called Bypass Traverse Checking. Process Explorer actually shows that the user needs a privilege called SeChangeNotifyPrivilege, but that is simply the internal name of the Bypass Traverse Checking user right. More info on that here: http://blogs.technet.com/b/markrussinovich/archive/2005/10/19/the-bypass-traverse-checking-or-is-it-the-change-notify-privilege.aspx. After granting the user this right, all of the services started successfully!
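
A quick way to verify this is to open a command prompt as the IaaS service user and list its privileges; the right itself is granted through Local Security Policy or GPO under User Rights Assignment > Bypass traverse checking. A minimal check (whoami is built into Windows):

rem run as the IaaS service account; SeChangeNotifyPrivilege
rem ("Bypass traverse checking") should appear in the output
whoami /priv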

OpenStack Juno Lab installation on top of VMware Workstation – Problems

Uh-oh, here we go, the problems start pouring in. After installing nova in my lab, I noticed that a few services (nova-cert, nova-conductor, nova-consoleauth, nova-scheduler) failed to start after a reboot. In fact, if you check them immediately after the reboot, they are started, but they fail after a while. In the logs I found this line:

Can't connect to MySQL server on 'controller'

So it seems that our DB is not up and running. Let’s check the db state:

service mysql status

Hmm, it's up! If you restart the services now, everything will work. The problem is that OpenStack does not check whether the DB is actually alive; the services just issue their start commands and move on. If for some reason the DB is not accepting connections when the nova services start, they will not function. On closer inspection of the logs, there is a nice two-second gap between MariaDB still trying to get everything up and the nova services trying to connect. It would be a lot easier if the DB simply didn't start at all; then it would be simple to adjust retries and delays (http://docs.openstack.org/juno/config-reference/content/list-of-compute-config-options.html#config_table_nova_database). Now we need to figure out how to delay the nova services by a couple of seconds or make the operating system try again when the services fail. A script that runs after boot, checks the services and restarts any that are not running would work, but it's not neat. After banging my head against the OpenStack wall, I couldn't find an answer on how to delay the start of the services or make them retry; none of the delay options from the manual worked. Well, a simple script will do the job for now, as sketched below.
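
Here is a minimal sketch of such a watchdog, assuming the controller services listed above and Ubuntu's upstart-style service command; it could be called from /etc/rc.local or a cron @reboot entry:

#!/bin/bash
# Restart any nova service that died because MariaDB was not yet
# accepting connections when the services first started at boot.
SERVICES="nova-cert nova-conductor nova-consoleauth nova-scheduler"

# give the database a moment to finish starting up
sleep 30

for svc in $SERVICES; do
    if ! service "$svc" status | grep -q "start/running"; then
        echo "$(date) $svc not running, restarting" >> /var/log/nova-watchdog.log
        service "$svc" restart
    fi
done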

Another thing. DO NOT MESS UP THE HOSTNAMES! I’ve done that twice now, stupid me, and this is what you get:

[Screenshot: nova service-list output showing duplicate service entries for the old and new hostname]

Nova still does not understand that a host might change its hostname. If an existing host gets a new hostname, it is considered a brand-new host with new services. I had to clean up my services list with nova service-delete <ID>. MAC address check, anyone?
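
The cleanup itself is just a couple of commands; a quick sketch (the ID below is just an example):

# list the services; the entries for the old hostname are left behind in a down state
nova service-list
# delete each stale entry by its ID
nova service-delete 7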

OpenStack Juno Lab installation on top of VMware Workstation – Prep + Nova + Compute1

I had a busy fall changing jobs, so my OpenStack installation project was put aside. I joined EMC's Enterprise Hybrid Cloud team as a Senior Solutions Architect to participate in the development of the product. Currently we have a Federation version of EHC GA'd (using EMC and VMware products to deliver a solid foundation for our customers to build their cloud on), but later on there will be an OpenStack version coming out as well. Because of that, OpenStack is even more relevant to me, although my time right now is committed to VMware products. I can't go into details of the upcoming OpenStack version, but any hands-on knowledge is important. There will be a lot of automation (we are talking about a cloud, after all!), but that does not remove the need to know how to do things manually.

Since the summer, a new release of OpenStack, Juno, has come out, so I decided to ditch Icehouse. OpenStack is being developed at the speed of light, so many of the installation issues of previous versions have been fixed. My previous post is still relevant for prepping the VMs if you decide to run the OpenStack lab deployment as VMs on ESXi. Follow that post to create a template which you can use for the different OpenStack components. Have a look at the Juno installation manual; fewer steps are required for the base machine. Also decide at this point whether you are going with Neutron or nova-network (aka legacy networking). This will affect your network settings for the nodes.

The requirements for a minimal installation with CirrOS are quite low, so we can use a base machine with 2 GB of RAM for all the components (the networking node only needs 512 MB). I also noticed that the current installation manual for Juno takes into account running OpenStack inside VMs, like we are doing here. The need for promiscuous mode support and disabled MAC address filtering has been noted (hurray!). Note that you only need promiscuous mode enabled and MAC address filtering disabled for the external network! You can follow my previous post on how to do this on ESXi: promiscuous mode is disabled by default, so it needs to be changed, while MAC address forging detection and filtering are already disabled, so we can leave those be. For this build, I'm actually using VMware Workstation 9. How you enable promiscuous mode differs depending on whether you are running Linux or Windows as the underlying OS. I'm running Windows 7, so all I need to do is enable promiscuous mode in the vmx files of my VMs. When using Workstation on Windows, promiscuous mode should be enabled by default, but just to make sure and to avoid issues later, let's edit the vmx file and add this line:

ethernet0.noPromisc = "false"

This enables promiscuous mode for eth0. More vmx tweaking can be found here (http://sanbarrow.com/vmx/vmx-network-advanced.html). If you want to be exact, you should only do this for the NICs that are used for external networks. I had so many issues with this on Icehouse that I'm being paranoid and enabling it for all of my NICs. Since this is a lab environment, it doesn't matter that much.
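
For reference, with three virtual NICs the corresponding vmx entries look like this (adjust the ethernetN index to match your VM's adapters):

ethernet0.noPromisc = "false"
ethernet1.noPromisc = "false"
ethernet2.noPromisc = "false"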

If you are running Linux, take a look here:
https://pubs.vmware.com/workstation-9/index.jsp?topic=%2Fcom.vmware.ws.using.doc%2FGUID-089D2595-26C5-433B-9DA4-D2A94C63B7B5.html

After these steps you can continue installing the OpenStack components using the official Juno installation manual for Ubuntu. I won't go into every command, because the manual is quite good. There are a few notes, however, that I would like to share. First of all, OpenStack uses MariaDB nowadays. It won't affect anything, but it was a nice surprise. PostgreSQL is also supported, by the way.

The manual notes that you can enable verbose mode for all of the components. As a learning experience, I strongly recommend that you do so; something WILL go wrong, and chatty logs help with that. On that note, one major issue that I had with the compute node was the hypervisor. KVM requires hardware-assisted virtualization to work. We can enable this for our VMs (https://communities.vmware.com/docs/DOC-8970), but that won't save you. I had huge issues with KVM on Icehouse, and switching to QEMU helped a lot. Things might have progressed since, but for now I'm going with QEMU. After I get my setup to work, I will definitely give KVM another go. If you try KVM, make this change to your vmx file:

vcpu.hotadd = "FALSE"
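
Since I'm going with QEMU for now, the corresponding change is made on the compute node side; per the Juno install guide for Ubuntu, set the libvirt virtualization type in /etc/nova/nova-compute.conf and restart nova-compute afterwards:

# /etc/nova/nova-compute.conf -- use QEMU when hardware-assisted
# virtualization is not (reliably) available inside the VM
[libvirt]
virt_type = qemu

You can check whether hardware virtualization is actually visible inside the VM with egrep -c '(vmx|svm)' /proc/cpuinfo; if it returns zero, QEMU is the safe choice.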

That’s it, let’s start typing some commands!

Automation of RecoverPoint Virtual Appliance installation Part 2: Expect

As I mentioned in my previous post, automatic installation of virtual appliances is not a trivial task. In automation projects we tend to concentrate on basic operational tasks, like automating the creation of a multi-tier vApp, but sometimes we need to bring a new kind of functionality into the environment, like data replication. My previous post showed how we can deploy a virtual appliance from an OVF file using vCenter Orchestrator and ovftool. Now we need to actually implement the appliance. Usually there is a wizard for implementing these appliances, and EMC RecoverPoint is no different. If we can do the configuration from a CLI, then automation is possible; even better, if there is a set of commands that can be run from Bash, configuration is easy. When it comes to RecoverPoint, however, we do have a CLI, but it is strictly wizard based and you don't have access to Bash. One workaround for this problem, short of some serious hacking, is Expect. With this tool we can emulate a user going through the wizard and making choices during the installation. You cannot install Expect on the RecoverPoint appliance itself, so you need a Linux box that is used as a configuration server and jump box. I used the same CentOS Linux VM that has my ovftool installed. The installation of Expect is straightforward:

yum install expect

You can use SSH and Expect together, so you can open an SSH session to the RecoverPoint appliance and run Expect from a remote server. The actual Expect code is easy: you simply wait until a particular string appears on the console, for instance "Enter IP address", and when Expect finds this string, it sends the answer to that question, e.g. "192.168.0.1". We need to step through the configuration process, record the answers we would give and turn that into Expect language. Unfortunately I don't have a VNX system with iSCSI ports in my lab, so I couldn't finish my code, but the principle of the solution works; you just need the IQNs to complete the integration. After that we can use the CLI to configure LUNs for RecoverPoint and start the actual replication of data. When the LUN is protected, we can use vCenter Orchestrator to migrate the selected VMs to the protected LUNs and we are done! The necessary files can be found at the end of this post. Have fun!

The Expect script that I called from a vCO workflow looks like this:

#!/usr/bin/expect -f
# vRPA login information
set USER "boxmgmt"
set PASSWORD "boxmgmt"
set IP "192.168.0.179"
#VNX settings
set VNXSN "CKM00112233444"
set VNXNAME "VNX5500"
set SPA "192.168.0.124"
set SPB "192.168.0.125"
set CS "192.168.0.123"
set ISCSI1VNX "192.168.0.156"
set ISCSI2VNX "192.168.0.157"
set ISCSI3VNX "192.168.0.158"
set ISCSI4VNX "192.168.0.159"
set VNXUSER "sysadmin"
set VNXPASSWORD "sysadmin"
#vRPA LAN/MGMT settings
set LANMASK "255.255.255.0"
set LANGW "192.168.0.1"
set LANVIP "192.168.0.180"
set RPA1LAN "192.168.0.181"
set RPA2LAN "192.168.0.182"
#vRPA WAN settings
set WANMASK "255.255.255.0"
set WANGW "192.168.0.1"
set RPA1WAN "192.168.0.183"
set RPA2WAN "192.168.0.184"
#vRPA iSCSI settings
set ISCSIMASK "255.255.255.0"
set ISCSIGW "192.168.0.1"
set RPA1ISCSI1 "192.168.0.185"
set RPA1ISCSI2 "192.168.0.186"
set RPA2ISCSI1 "192.168.0.187"
set RPA2ISCSI2 "192.168.0.188"
#vRPA General settings
set DNS1 "192.168.0.4"
set DNS2 ""
set NTP "192.168.0.4"
set DOMAINNAME "demo.lab"
set CLUSTERNAME "RP"
set NUMBEROFRPAS "1"
set TIMEZONE "+2:00"
set CITY "26"

# SSH to RecoverPoint Appliance and start the Configuration Wizard
spawn ssh $USER@$IP
expect "Password:"
send "$PASSWORD\r"
expect "Do you want to configure a temporary IP address?"
send "n\r"
expect "Enter your selection"
send "1\r"
expect "Enter your selection"
send "1\r"
expect "Are you installing the first RPA in the cluster"
send "y\r"

# Cluster settings
expect "Press ENTER to move to next page"
send "\r"
expect "Enter cluster name"
send "$CLUSTERNAME\r"
expect "Enter the number of RPAs in the cluster"
send "$NUMBEROFRPAS\r"
expect "Enter time zone"
send "$TIMEZONE\r"
expect "Enter your selection"
send "$CITY\r"
expect "Enter primary DNS"
send "$DNS1\r"
expect "Enter secondary DNS"
send "$DNS2\r"
expect "Enter domain name"
send "$DOMAINNAME\r"
expect "Enter addresses of host names of NTP servers"
send "$NTP\r"
expect "Press ENTER to move to next page"
send "\r"

# LAN
expect "Select network interface IP version"
send "1\r"
expect "Enter default IPv4 gateway"
send "$LANGW\r"
expect "Enter interface mask"
send "$LANMASK\r"
expect "Enter RPA 1 IP address"
send "$RPA1LAN\r"
expect "Press ENTER to move to next page"
send "\r"

#WAN
expect "Select network interface IP version"
send "1\r"
expect "Enter interface mask"
send "$WANMASK\r"
expect "Enter RPA 1 IP address"
send "$RPA1WAN\r"
expect "Press ENTER to move to next page"
send "\r"

#iSCSI port 1, eth2
expect "Do you want the RPA to require CHAP"
send "n\r"
expect "Select network interface IP version"
send "1\r"
expect "Enter interface mask"
send "$ISCSIMASK\r"
expect "Enter RPA 1 IP address"
send "$RPA1ISCSI1\r"

#iSCSI port 2, eth3
expect "Select network interface IP version"
send "1\r"
expect "Enter interface mask"
send "$ISCSIMASK\r"
expect "Enter RPA 1 IP address"
send "$RPA1ISCSI2\r"
expect "Press ENTER to move to next page"
send "\r"

#VNX
expect "Enter a name for the storage array"
send "$VNXNAME\r"
expect "Does the storage array require CHAP"
send "n\r"

# VNX iSCSI port 1
expect "Select network interface IP version"
send "1\r"
expect "Enter IP address"
send "$ISCSI1VNX\r"
expect "Enter the iSCSI port number"
send "3260\r"

# VNX iSCSI port 2
expect "Select network interface IP version"
send "1\r"
expect "Enter IP address"
send "$ISCSI2VNX\r"
expect "Enter the iSCSI port number"
send "3260\r"

expect "Do you want to add another iSCSI storage port"
send "n\r"
expect "Do you want to add another storage iSCSI configuration"
send "n\r"
expect "Press ENTER to move to next page"
send "\r"

expect "Do you want to add a gateway"
send "n\r"
expect "Press ENTER to move to next page"
send "\r"
expect "Press enter to continue"
send "\r"

expect "Do you want to apply these configuration settings now"
send "y\r"
interact

Here are the workflows for ovftool and Expect.
Deploy vRPA with ovftool
Helper workflow for using ovftool
Run Expect on remote host

Prerequisites for the workflows:

  • a Linux VM with SSH enabled
  • ovftool installed
  • expect installed
  • expect script copied to the VM
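
With those in place, the "Run Expect on remote host" workflow essentially just opens an SSH session to the Linux VM and runs the script, along these lines (the path is an example):

# on the jump box: drive the RecoverPoint installation wizard non-interactively
expect /root/vrpa_install.exp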

Hello EMC Enterprise Hybrid Cloud!

My blog has been a bit quiet lately, mostly due to my change of job and scenery. I took the opportunity to join the EMC Enterprise Hybrid Cloud team as a Senior Solutions Architect, to be part of the first deployments and, more importantly, to participate in the development of the product. 2015 will be a big year for EMC EHC; a lot of things will be happening. Currently, EHC is a product that combines EMC and VMware (aka Federation) products to form a solid hybrid cloud foundation for our customers. There is still a lot of debate about where you should run your applications, in the public cloud or in a private one. The answer always depends on the application and the customer. We at EMC strongly believe that the future is hybrid.

EHC is at version 2.5.1, which brings our customers an infrastructure capable of running IT as a Service. EMC has put a lot of effort and time into taking the burden of developing automation and orchestration processes away from the customer. We automate basic operational tasks like application backup and application deployment, and give the customer time to concentrate on the bigger picture. I know it is a cliché, but cloud is a journey. We have been saying that for years, and it is even more true now than it was back in 2010. With EHC we bring you a platform to start that journey strong. There is no way you can have public-cloud-like flexibility and automation overnight in a private cloud environment, no matter how much money you throw at it. The technology is relatively easy; it's the transformation of processes, people and business that is the hard part. We are here to help you with both parts.

Automation of RecoverPoint Virtual Appliance installation Part 1: ovftool

There are a few attributes that define cloud. Two of them, On-Demand Self-Service and Rapid Elasticity, are the most relevant to me. In a pure cloud environment, everything should be automatic and, especially, elastic. I've noticed that having these attributes in a cloud service is standard nowadays, but it's interesting how far they actually reach.

Let's take an example. We have a vSphere environment with some VMs running on top of a storage array. Another array is brought into the environment, and now we want to introduce replication into the equation. No problem, let's just implement the replication technology of our choice and be done with it. Wouldn't it be cool if all of this happened in a cloud fashion, aka automatically? This is something I'm working towards, but there are a lot of steps to automate.

Something that can help in this scenario is EMC ViPR. With this product we can automate the process of protecting a LUN with continuous data protection using EMC RecoverPoint. ViPR automates every task, from creating the LUN on an array and configuring the RecoverPoint settings to mounting the finished LUN to a host or a cluster. The only problem is that ViPR (and, I'm guessing, every orchestration product out there) expects the infrastructure to be implemented and ready to use. In a true cloud environment we might not have everything set up, because things cannot be predicted very well.

The beauty of software is that it's not hardware ;). EMC has made a virtual version of RecoverPoint for VMware environments, and in fact a lot of other products should be virtualized in the near future. When we use a virtual appliance, it can be deployed automatically through orchestration, on demand. This is not something that comes out of the box, though, so there are a couple of issues with it. Firstly, the appliance is in OVF format, and secondly, the installation of the RecoverPoint appliance is done with a GUI-based wizard.

Importing an OVF automatically is a bit harder than it sounds; OVF was designed for the vSphere admin to use. Luckily, there's a CLI-based ovftool that can be leveraged for automation with some tweaking. For the orchestration of all of this I used (obviously) vCenter Orchestrator. If you want to spend some money, there is a neat commercial vCO plugin that can deploy OVFs automatically and also convert VMs to OVF. With a little effort, you can make your own OVF deployer using built-in vCO functionality.


vCO can easily run CLI commands on a local or remote server. With this functionality we can run ovftool and deploy the RecoverPoint OVF to our vSphere environment. First we need to construct the proper command for ovftool so it can deploy the OVF successfully. I installed ovftool on a separate VM, but it can be installed on the vCO appliance as well (or any other server, for that matter). You can use ovftool in probe mode; it will tell you what kind of attributes a certain OVF file needs for deployment. For RecoverPoint/VE, the ovftool command in the vCO workflow is something like this:

cmd = "ovftool --datastore=" + datastore + " --diskMode=thin --name=" + vmName + " --ipAllocationPolicy=fixedPolicy ";
cmd += "--net:\"LAN Network\"=\""+lan+"\"" + " --net:\"WAN Network\"=\""+wan+"\"" + " --net:\"iSCSI1 Network\"=\""+iSCSI1+"\"";
cmd += " --net:\"iSCSI2 Network\"=\""+iSCSI2+"\"" + " --prop:ip=" + ipTemp + " --prop:dns=" + dns + " --prop:netmask=" + mask;
cmd += " --prop:gateway=" + gateway + " --acceptAllEulas --powerOn " + "/" + execFolder + "/" + ovaName + " vi://" + viUser + ":" + viPassword + "@" + vcenter;
cmd += "/" + datacenter + "/host/" + viCluster;

Not pretty, but it does the job. Now we just need a vCO workflow that defines the necessary attributes, makes an SSH connection to the server and runs the command above. If you want to use this for any other OVF, just run ovftool against the OVF file and deduce the necessary command; ovftool will tell you everything you need. The biggest issue with this approach is the clear-text admin password that is required for deploying a VM. Nope, a SecureString cannot be used here, so I need to find a solution for that. I think this approach could be used to build a general automatic OVF deployment tool, since ovftool will return the needed attributes for any OVF. Lots of coding needed, obviously.
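
For reference, probing is just a matter of pointing ovftool at the package without a target; it then prints the networks and OVF properties it expects (the file name below is an example):

# probe the package: with a source but no target, ovftool prints its
# networks and properties instead of deploying anything
ovftool RecoverPoint_VE.ova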

So that's done: the RP/VE appliance can be deployed automatically using nothing but VMware tools. The next step is to automate the configuration part. Think Expect.