VMware vROps - To HA or not to HA?
With the release of vROps 6.0, VMware introduced HA and all of a sudden everyone wanted it without first trying to understand what vROps HA is. I admit that after the initial product release there was a lot of confusion, but now a year later, after the dust has settled, let's get some things straight.
First things first: what is vROps HA? Let's get down to it and understand the benefits that HA provides. As the name implies, "High Availability" provides some level of protection against failure. At the most basic level, HA allows you to access and use the vROps solution while one of the nodes has failed for whatever reason. During an HA event, you should be able to log in to the Product UI, run reports, and view dashboards as if the application were fully available. Data collection will also continue uninterrupted. So what types of failure does HA protect against?
- ESXi host failure.
- ESXi data/VM network connectivity loss.
- Loss of vROps service availability.
- Storage Problems.
Before you enable vROps HA, though, ask yourself the following:
- Is vROps going to be classified as a Business Critical system? Meaning, if it goes down, will your company lose revenue and/or customers? If not, then you don't need vROps HA, because vROps availability does not impact your bottom line.
- You say it does not matter and that you have a policy which requires HA for everything. OK then. What are your Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for non-Business Critical systems? If they are anything higher than 5 minutes, then you don't need vROps HA, because vSphere HA and a proper backup solution will take care of most issues.
- As I hinted before, there is increased resource cost. When you enable vROps HA, it requires double the compute and storage resources to support the same number of collected objects and metrics you designed your environment for. For example, let's say your sizing calculations called for 4 medium nodes. If you enable HA, you will need 8 medium nodes. Remember, all nodes must be equal in size; you can't mix and match. You will need the additional compute power to calculate the parity information for duplicating the metric data across all of the nodes. Obviously, you also need the additional storage for storing the duplicate metric data. Can your organization handle the increased resource cost for something that's just nice to have? Continuing with our example, 4 medium nodes consume 32 vCPUs, 128GB of memory, and 4TB of storage; doubling that puts us at 64 vCPUs, 256GB of memory, and 8TB of storage. Like I said, the costs can add up quickly.
- Another consideration is physical server availability in the vSphere Cluster hosting the vROps solution. Do you have enough ESXi hosts for a one-to-one vROps node-to-ESXi host relationship? Why does that matter? It would be bad design to have more than one vROps node per ESXi host. Remember, vROps HA protects against single node failure only. If you lose a host with 2 vROps nodes on it, the solution will experience a critical failure. This may also lead to data corruption, and you may have to recover from backup. So when using vROps HA, host and datastore anti-affinity rules are a must. If you have to buy more physical hosts and storage to accommodate vROps HA, then that's increased cost and wait time.
- You had a vROps HA event. Now what? Have you considered what to do when that happens? Typically, you will have a vROps HA event for one of two reasons: A) ESXi host failure, or B) a storage issue. The first case is pretty straightforward: vSphere HA will kick in and start your node on another host. You log in to the Admin UI and reestablish HA. Easy. Now let's consider the second scenario: you lost a datastore, someone deleted a LUN, and one of your vROps nodes is gone. Now what? Well, you will either have to restore the entire vROps cluster from backup, or evict the damaged node from the cluster, deploy a new one, and add it back. If you restored, then you're back in business. However, if you deployed a new node, then you have to reestablish HA and wait for hours until all data gets replicated to the new node before HA is fully functional again. What!?! You thought it was automagic? That you set it and forget it, and vROps takes care of itself? No way! Just like with hard disk RAID systems, someone has to take action to replace the failed disk and initiate a rebuild. This is very similar. This is why using vROps HA to protect against ESXi host failure has little to no benefit: it could take longer to restore the vROps HA cluster than to just rely on plain old vSphere HA. In the second case, it boils down to your RTO and RPO, and how long of a vROps outage you can afford. Do you have a plan to deal with this?
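The sizing arithmetic and the one-node-per-host rule from the list above can be sketched in a few lines of Python. This is only an illustration: the per-node figures are derived by dividing the 4-node totals in the example (32 vCPUs, 128GB, 4TB) by four, not taken from any official sizing guide, and the host and node names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    vcpus: int
    memory_gb: int
    storage_tb: int


# Assumption: per-node numbers derived from the article's example
# (4 medium nodes = 32 vCPUs, 128 GB memory, 4 TB storage).
PER_MEDIUM_NODE = Footprint(vcpus=8, memory_gb=32, storage_tb=1)


def cluster_footprint(nodes_from_sizing: int, ha_enabled: bool) -> Footprint:
    """Total resources for the cluster; enabling HA doubles the node count."""
    total_nodes = nodes_from_sizing * 2 if ha_enabled else nodes_from_sizing
    return Footprint(
        vcpus=PER_MEDIUM_NODE.vcpus * total_nodes,
        memory_gb=PER_MEDIUM_NODE.memory_gb * total_nodes,
        storage_tb=PER_MEDIUM_NODE.storage_tb * total_nodes,
    )


def hosts_with_multiple_nodes(placement: dict) -> list:
    """Return ESXi hosts that run more than one vROps node.

    `placement` maps a vROps node name to the ESXi host it runs on.
    Any host in the result is a single point of failure for two or more nodes.
    """
    nodes_per_host = Counter(placement.values())
    return sorted(host for host, count in nodes_per_host.items() if count > 1)


if __name__ == "__main__":
    print(cluster_footprint(4, ha_enabled=False))
    print(cluster_footprint(4, ha_enabled=True))

    # Hypothetical placement: two nodes share esx02, which violates anti-affinity.
    placement = {
        "vrops-master": "esx01",
        "vrops-replica": "esx02",
        "vrops-data-1": "esx02",
        "vrops-data-2": "esx03",
    }
    print(hosts_with_multiple_nodes(placement))
```

For 4 medium nodes, the HA result matches the numbers above: 64 vCPUs, 256GB of memory, and 8TB of storage. The placement check is just a sanity test you could run against an inventory export; the actual enforcement still belongs in host and datastore anti-affinity rules.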
There is one final note I would like to make about vROps HA. For some reason, when people hear vROps HA, they think they can stretch a vROps cluster between two geographically separate datacenters, like a vSphere Cluster. Although possible, this should never be done, period! Why? Because the product was not designed to do that. The underlying in-memory database technology that provides HA, GemFire, is very latency sensitive and will fail if you spread the cluster nodes among datacenters. Then you say, "Well, we have dark fiber and latency is not an issue for us." Still, don't do it! It will work for a while and then fail, and you will be spending long hours on the phone with VMware Support trying to undo the mess you created. Trust me on this one. I've seen this, so don't try it at home. Besides latency, another common problem that stretched vROps HA clusters run into is an isolation event when the link between the two sites goes down. In such cases, HA does not work, because the cluster just lost half of its nodes, and that's a critical failure which HA does not protect against. For example, let's say we have a 4 node HA cluster with 2 nodes (master and data node) at datacenter A and 2 other nodes (master replica and another data node) at datacenter B. Since HA protects against single node failure, when we lose communication to the other site with its 2 nodes, that's more than HA can handle, so things collapse. The cluster dies and HA can't do anything about it. The same thing applies no matter how many nodes you have and how they are distributed between the sites. This is why vROps HA is not a disaster recovery (DR) solution: it protects against single node failure only.
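To make the 4 node example concrete, here is a toy Python model of the isolation event. It is a sketch under simple assumptions: it only counts how many nodes each site can no longer reach when the inter-site link drops, and compares that with the single node failure HA is designed for. It does not model node roles or GemFire behavior, so don't read it as saying an uneven split is safe. The node and site names are hypothetical.

```python
from collections import Counter

# Per the article: vROps HA protects against the loss of a single node.
MAX_TOLERATED_NODE_LOSS = 1


def nodes_unreachable_after_split(node_sites: dict) -> dict:
    """For each site, count the cluster nodes it can no longer reach
    once the link between sites goes down.

    `node_sites` maps a node name to the site it lives in.
    """
    nodes_per_site = Counter(node_sites.values())
    total_nodes = len(node_sites)
    return {site: total_nodes - count for site, count in nodes_per_site.items()}


if __name__ == "__main__":
    # The layout from the example: 2 nodes in datacenter A, 2 in datacenter B.
    layout = {
        "master": "A",
        "data-1": "A",
        "master-replica": "B",
        "data-2": "B",
    }
    for site, lost in nodes_unreachable_after_split(layout).items():
        verdict = "within" if lost <= MAX_TOLERATED_NODE_LOSS else "beyond"
        print(f"site {site}: lost contact with {lost} node(s), {verdict} what HA tolerates")
```

From either side of the broken link, two nodes have vanished at once, which is one more than HA is built to handle.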
I hope this vROps HA reality check will help you make a better decision, regardless of your final choice. Just don't get carried away, and keep things simple.