Cloud Director cells 10.1 and later failover methods.

HA support for the Cloud Director nodes was introduced in version 9.7 and has been improved in the 10.1 release. In Cloud Director 9.7 the only failover method was manual, so in case of a failed primary database node the failover had to be initiated manually from the web console: https://vcloud.domainname.com:5480. That all changed in Cloud Director 10.1!

In this blog post, I will show you how to configure the failover method to automatic and how to replace a failed primary node with a new standby node. This article is based on Cloud Director version 10.2.

Failover methods: manual vs. automatic

As I already mentioned, the Cloud Director HA setup is now able to perform an automatic failover when the primary database node fails. In the picture below, you see the manual failover process on the left and the automatic failover process on the right. The only difference between the two is that the failover itself no longer has to be triggered by hand; the automatic process still requires manual redeployment of a new standby node after a primary node has failed.

Cloud Director Failover methods

Perform manual failover

After you have deployed a Cloud Director HA setup, the failover method is configured as manual by default. Let’s start by demonstrating the manual failover method. A manual failover can be initiated from the Cloud Director web console. To do so, log in to the Cloud Director web console: https://vcd-node:5480

Cloud Director web console

After logging in to the Cloud Director web console, you will see the embedded database availability page. On this page you can see all the available Cloud Director cells and their current state (Primary or Standby).

The primary database node is currently the vCD-mgmt-1 (as shown in the previous picture). To initiate a failover to the second cell (vCD-mgmt-2), we only have to click on the switchover button of that cell.

Cloud Director embedded database availability
Cloud Director failover

As you can see, the Cloud Director cell named vCD-mgmt-2 is now the primary database node.

Cloud Director embedded database availability

How to configure the automatic failover mode

As mentioned before, the default failover method is manual. Unfortunately, it is not possible to change the failover method to automatic from the Cloud Director web console. The only way to change it is by using the Cloud Director API.

Cloud Director web console
Failover method is by default manual.

There are several ways to communicate with the Cloud Director API; please see the following link for additional information on using the Cloud Director API.

Note: The Cloud Director Appliance API can be used to get and change the state information of your Cloud Director appliances. The API is only accessible on port 5480. This API only returns JSON formatted data.
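
For example, you can read the current state of the cluster with a plain HTTPS call. The curl command below is a minimal sketch: it authenticates with the appliance root credentials, and the -k flag skips certificate verification, which is only appropriate in a lab environment.

# Query the appliance API for the current node and cluster state (returns JSON).
curl -k -u root -H 'Accept: application/json' https://vCD-node:5480/api/1.0.0/nodes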

In my case, I used Postman to connect to the Cloud Director API. Open Postman, change the HTTP request method to POST and set the request URL to “https://vCD-node:5480/api/1.0.0/nodes/failover/automatic”. It doesn’t matter which node you use; the configuration will be applied to all the nodes.

Note: To switch back to manual mode, use the following request URL: “https://vCD-node:5480/api/1.0.0/nodes/failover/manual”

cloud director postman

Configure the root credentials of the Cloud Director cell in the authorization tab as shown below:

postman
Use the root credentials of the Cloud Director cell.

Add the following header in the Headers tab:

  • Accept: application/json
10.1 vcd

We are now ready to send the API call that configures the failover method to automatic. Click the blue Send button to send the API call and verify the status of the response. The expected status is 202 ACCEPTED.

postman
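
If you prefer the command line over Postman, the same calls can be made with curl. The commands below are only a sketch based on the endpoints shown above: they use the appliance root credentials, -i prints the HTTP status line so you can check for 202 ACCEPTED, and -k skips certificate verification (lab use only).

# Switch the failover method to automatic (expect HTTP 202 ACCEPTED).
curl -k -i -X POST -u root -H 'Accept: application/json' https://vCD-node:5480/api/1.0.0/nodes/failover/automatic

# Switch back to manual mode if needed.
curl -k -i -X POST -u root -H 'Accept: application/json' https://vCD-node:5480/api/1.0.0/nodes/failover/manual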

We have now successfully changed the failover method from manual to automatic. To verify the change, go to the Cloud Director web console and check that the failover mode is now shown as automatic.

vcd

Remove the failed primary node

In this example, I will simulate a failure of the primary node (vcd-mgmt-1) by disconnecting its NICs in vCenter. So let’s start by logging in to the vCenter Server and editing the settings of the vcd-mgmt-1 node:

disable nic
I have deselected the Connected checkbox of the NICs.

As you can see in the picture below, the Cloud Director cluster health status has changed to degraded. The vcd-mgmt-2 node has become the new primary node during the automatic failover, and vcd-mgmt-1 now has the status failed.

degraded healthy status

Remove cell from Cloud Director

The only way to get a healthy cluster again is to redeploy the vcd-mgmt-1 node. First of all, we need to remove the vcd-mgmt-1 cell from the Cloud Director cluster. To do so, open the provider web GUI, go to Resources –> Cloud Resources –> Cloud Cells, select the inactive node, and click Unregister.

remove cell from cloud director

Unregister cell from the repmgr cluster

The next step is to remove the cell from the repmgr cluster as well. Open an SSH session (for example, with PuTTY) to any of the other nodes that are still running, log in as root, and switch to the postgres user:

sudo -i -u postgres

Use the following command to show the nodes in the repmgr cluster together with their status. We are looking for the vcd-mgmt-1 node, which should be in a failed state.

repmgr cluster show
putty
Make a note of the vcd-mgmt-1 ID.

Use the following command with the ID of the failed primary node to remove the failed node from the repmgr cluster:

/opt/vmware/vpostgres/current/bin/repmgr primary unregister --node-id=<node ID>
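
Putting the last few steps together, the sequence on one of the surviving nodes looks roughly like this. The node ID of 1 is only an example; use the ID you noted from the repmgr cluster show output.

# As root on a surviving node, switch to the postgres user.
sudo -i -u postgres

# List the cluster members and note the ID of the failed former primary.
/opt/vmware/vpostgres/current/bin/repmgr cluster show

# Unregister the failed former primary (replace 1 with the ID you noted).
/opt/vmware/vpostgres/current/bin/repmgr primary unregister --node-id=1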

We now have successfully removed the failed Cloud Director node from the repmgr cluster.

putty repmgr show

To verify this, open the web console of any other Cloud Director node and you should see that the failed node has been removed. The cluster health should be healthy again, and we are now able to deploy a new standby node with the previous cell name.

cloud director 10.2

Summary

I hope this article helps you with configuring and testing your Cloud Director HA setup. If you need any help or have questions on this topic, please do not hesitate to contact me.

3 Comments

  1. Hi,

    My system switches over to the standby cell successfully, but the web portal is not accessible anymore.
    On the appliance cluster page I see that it has switched over to another cell as primary.

    I had to manually SSH into the new primary cell and change the public address to the new cell for the web portal to be functional again.

    Can you explain how to set up this part?

    1. Hi Ardley,

      Did you create a public address that has those 3 nodes as members behind a load balancer, and did you configure that FQDN in the public address field in the Cloud Director web portal?
