In this article, we'll guide you through the process of stretching a VMware Cloud Foundation (VCF) management domain across two availability zones, enhancing your infrastructure's resilience and ensuring continuous availability. By the end of this tutorial, you'll have a solid understanding of how to configure and manage a stretched VCF deployment to protect your critical workloads.

Prerequisites

The requirements for stretching a management domain in VCF are documented on the VMware by Broadcom website. Our environment deviates from the VCF blueprint: we run two dedicated Edge nodes in AZ1 and two in AZ2. The table below lists the subnet requirements we use.

Function AZ1 AZ2 HA Layer 3 Gateway
VM management VLAN
ESXi Management VLAN (AZ1) X
vMotion VLAN (AZ1) X
vSAN VLAN (AZ1) X
NSX Host Overlay (AZ1) X
NSX Edge Uplink01 (AZ1) X X
NSX Edge Uplink02 (AZ1) X X
NSX Edge Overlay (AZ1) X
ESXi Management VLAN (AZ2) X
vMotion VLAN (AZ2) X
vSAN VLAN (AZ2) X
NSX Host Overlay (AZ2) X
NSX Edge Uplink01 (AZ2) X X
NSX Edge Uplink02 (AZ2) X X
NSX Edge Overlay (AZ2) X

Network Pool

By default, you will have only one network pool after deploying VCF with the VMware Cloud Builder. One of the first tasks is to commission the hosts in AZ2 within the SDDC Manager. To do this, you need to have a second network pool that will contain AZ2-specific networks for vMotion and vSAN for the hosts in AZ2.
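If you prefer to prepare the second network pool as a JSON body rather than through the UI, the following sketch shows one way to assemble it. The field names reflect our understanding of the SDDC Manager network pool API (POST /v1/network-pools) and should be checked against the API reference for your VCF version. The pool name, VLAN IDs, and addresses are hypothetical placeholders.

```python
import json


def build_network_pool_spec(name, networks):
    """Return a network pool body with one entry per network definition."""
    return {
        "name": name,
        "networks": [
            {
                "type": net["type"],  # "VMOTION" or "VSAN"
                "vlanId": net["vlan_id"],
                "mtu": net["mtu"],
                "subnet": net["subnet"],
                "mask": net["mask"],
                "gateway": net["gateway"],
                "ipPools": [{"start": net["start"], "end": net["end"]}],
            }
            for net in networks
        ],
    }


# Hypothetical AZ2 values; replace them with your own VLANs and subnets.
az2_pool = build_network_pool_spec(
    "m01-az2-np01",
    [
        {
            "type": "VMOTION", "vlan_id": 1102, "mtu": 9000,
            "subnet": "192.0.2.0", "mask": "255.255.255.0",
            "gateway": "192.0.2.1",
            "start": "192.0.2.10", "end": "192.0.2.50",
        },
        {
            "type": "VSAN", "vlan_id": 1103, "mtu": 9000,
            "subnet": "198.51.100.0", "mask": "255.255.255.0",
            "gateway": "198.51.100.1",
            "start": "198.51.100.10", "end": "198.51.100.50",
        },
    ],
)
print(json.dumps(az2_pool, indent=2))
```

Keeping the pool definition in a small script makes it easier to review the AZ2 addressing before anything is created in SDDC Manager.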


Commission AZ2 hosts

Once the AZ2-specific network pool has been created, you can commission the hosts in the SDDC Manager. Click on the "Commission Hosts" button to start the wizard.


Make sure you perform all the necessary checks from the checklist and click the "Proceed" button.


Add all AZ2 hosts and perform the validation. Once the validation is complete, click the "Next" button.


Click the 'Commission' button to initiate the host commissioning task in the SDDC Manager.


Once the commission task has been completed, you will see the AZ2 hosts in the inventory as Unassigned. These ESXi hosts can eventually be used to stretch the management domain.


vSAN witness node

Make sure that you have already deployed and configured the vSAN Witness node according to the VMware by Broadcom documentation, and register the vSAN Witness in the management domain as shown below.

VCF

Stretching the Management Domain

Retrieving IDs of Unassigned ESXi Hosts

With the prerequisites in place, we are now ready to communicate with the API of the SDDC Manager. Let’s start by collecting the ID numbers of the Unassigned ESXi hosts that we commissioned earlier.

Click on Developer Center, then API Explorer, and select Hosts. Choose GET from the options. Enter "UNASSIGNED_USEABLE" as the value for the status parameter and click Execute.


You will eventually receive a response with all the ESXi hosts that have the status UNASSIGNED_USEABLE. Note down all the IDs of these ESXi hosts, as we will use them in a later step.
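When the response contains many hosts, copying IDs by hand is error-prone. The sketch below pulls the FQDN-to-ID mapping out of a saved response. It assumes the response is a JSON object with an "elements" list whose entries carry "id", "fqdn", and "status", which is how we understand the hosts endpoint to respond. The first two sample entries reuse IDs from this article; the third host and its ID are hypothetical and only there to show the filter.

```python
def unassigned_host_ids(response):
    """Map FQDN -> host ID for hosts in the UNASSIGNED_USEABLE state."""
    return {
        host["fqdn"]: host["id"]
        for host in response.get("elements", [])
        if host.get("status") == "UNASSIGNED_USEABLE"
    }


# Trimmed sample response; real responses contain many more fields.
sample_response = {
    "elements": [
        {
            "id": "1e826b08-52c5-4e95-85c9-b205c777f55a",
            "fqdn": "m01-esx20001.domain.internal",
            "status": "UNASSIGNED_USEABLE",
        },
        {
            "id": "c59ec647-90fa-4537-a330-0debc63dc26b",
            "fqdn": "m01-esx20002.domain.internal",
            "status": "UNASSIGNED_USEABLE",
        },
        {
            # Hypothetical AZ1 host that is already part of the cluster.
            "id": "00000000-0000-0000-0000-000000000001",
            "fqdn": "m01-esx10001.domain.internal",
            "status": "ASSIGNED",
        },
    ]
}

for fqdn, host_id in unassigned_host_ids(sample_response).items():
    print(f"{fqdn}: {host_id}")
```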


Retrieving ID of cluster

We now need to retrieve the ID of the management domain cluster that we want to stretch between availability zones.

Click on Developer Center, then API Explorer, and select Clusters. Choose GET from the options and click Execute.


You will eventually receive a response with the details of the management domain cluster. Note down the ID of the cluster, as we will use it in a later step.
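If SDDC Manager also manages workload domains, the response lists several clusters, so it helps to select the right one by name. The following sketch does that on a saved response. It assumes an "elements" list with "id" and "name" per cluster; the cluster names and IDs in the sample are hypothetical.

```python
def find_cluster_id(response, cluster_name):
    """Return the ID of the single cluster called `cluster_name`."""
    matches = [
        cluster["id"]
        for cluster in response.get("elements", [])
        if cluster.get("name") == cluster_name
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one cluster named {cluster_name!r}, "
            f"found {len(matches)}"
        )
    return matches[0]


# Hypothetical sample response with a management and a workload cluster.
sample_clusters = {
    "elements": [
        {"id": "11111111-1111-1111-1111-111111111111", "name": "m01-cl01"},
        {"id": "22222222-2222-2222-2222-222222222222", "name": "w01-cl01"},
    ]
}

print(find_cluster_id(sample_clusters, "m01-cl01"))
```

Raising an error when the name matches zero or several clusters avoids stretching the wrong cluster by accident.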


Creating the JSON specification

Based on the information gathered in the previous steps, we are now ready to create a JSON specification to initiate the stretch management domain task. Since the VCF management domain has been deployed with the vSphere Distributed Switch profile "Profile-1," there is no need to configure additional uplinks in the JSON specification.

In the hostSpecs, enter the following details for each host:

Key Value
hostname FQDN of the ESXi Host
networkProfileName The network profile name we have created earlier
vdsName The vSphere distributed switch name that is in use by AZ1
id The ID of the host we have gathered earlier
licenseKey The ESXi license that needs to be applied on the ESXi host

In the networkSpec section, under networkProfiles, enter the following details:

Key Value
name The network profile name we have created earlier
ipAddressPoolName The NSX IP Pool that will be used (will be created in the next specification section)
uplinkProfileName The NSX uplink Profile that will be used (will be created in the next specification section)
vdsName The vSphere distributed switch name that is in use by AZ1

In the networkSpec section, under nsxClusterSpec, enter the following details:

Key Value
description Description that will be used for the IP Pool in NSX
name A name that will be used for the IP Pool in NSX
uplinkProfileName A name that will be used for the uplink Profile in NSX
vdsName The vSphere distributed switch name that is in use by AZ1

In the networkSpec section, under uplinkProfiles, provide the following details:

Key Value
name A name that will be used for the uplink profile
transportVlan The VLAN for the host TEP communication

In the witnessSpec, enter the following details:

Key Value
fqdn The FQDN of the vSAN witness node
vsanCidr The CIDR that is used by the vSAN witness node
vsanIp The IP address of the vSAN witness node
{
"clusterStretchSpec": {
  "hostSpecs": [
   {
    "hostname": "m01-esx20001.domain.internal",
    "hostNetworkSpec": {
     "networkProfileName": "m01-np02",
     "vmNics": [
      {
       "id": "vmnic0",
       "uplink": "uplink1",
       "vdsName": "m01-cl01-vds01"
      },
      {
       "id": "vmnic1",
       "uplink": "uplink2",
       "vdsName": "m01-cl01-vds01"
      }
     ]
    },
    "id": "1e826b08-52c5-4e95-85c9-b205c777f55a",
    "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX"
   },
   {
    "hostname": "m01-esx20002.domain.internal",
    "hostNetworkSpec": {
     "networkProfileName": "m01-np02",
     "vmNics": [
      {
       "id": "vmnic0",
       "uplink": "uplink1",
       "vdsName": "m01-cl01-vds01"
      },
      {
       "id": "vmnic1",
       "uplink": "uplink2",
       "vdsName": "m01-cl01-vds01"
      }
     ]
    },
    "id": "c59ec647-90fa-4537-a330-0debc63dc26b",
    "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX"
   },
   {
    "hostname": "m01-esx20003.domain.internal",
    "hostNetworkSpec": {
     "networkProfileName": "m01-np02",
     "vmNics": [
      {
       "id": "vmnic0",
       "uplink": "uplink1",
       "vdsName": "m01-cl01-vds01"
      },
      {
       "id": "vmnic1",
       "uplink": "uplink2",
       "vdsName": "m01-cl01-vds01"
      }
     ]
    },
    "id": "b1e3481f-fd65-4db9-8c8a-2e2138d017a8",
    "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX"
   },
   {
    "hostname": "m01-esx20004.domain.internal",
    "hostNetworkSpec": {
     "networkProfileName": "m01-np02",
     "vmNics": [
      {
       "id": "vmnic0",
       "uplink": "uplink1",
       "vdsName": "m01-cl01-vds01"
      },
      {
       "id": "vmnic1",
       "uplink": "uplink2",
       "vdsName": "m01-cl01-vds01"
      }
     ]
    },
    "id": "ae56f9b0-ddee-4f28-8389-f0ec82ea2055",
    "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX"
   }
  ],
  "isEdgeClusterConfiguredForMultiAZ": false,
  "networkSpec": {
   "networkProfiles": [
    {
     "isDefault": false,
     "name": "m01-np02",
     "nsxtHostSwitchConfigs": [
      {
       "ipAddressPoolName": "m01-cl01-az2-tep01",
       "uplinkProfileName": "m01-cl01-vds01-az2-hostswitch",
       "vdsName": "m01-cl01-vds01",
       "vdsUplinkToNsxUplink": [
        {
         "nsxUplinkName": "uplink-1",
         "vdsUplinkName": "uplink1"
        },
        {
         "nsxUplinkName": "uplink-2",
         "vdsUplinkName": "uplink2"
        }
       ]
      }
     ]
    }
   ],
   "nsxClusterSpec": {
    "ipAddressPoolsSpec": [
     {
      "description": "AZ2 ESXi Host Overlay TEP IP Pool",
      "name": "m01-cl01-az2-tep01",
      "subnets": [
       {
        "cidr": "xx.xx.57.0/24",
        "gateway": "xx.xx.57.1",
        "ipAddressPoolRanges": [
         {
          "end": "xx.xx.57.60",
          "start": "xx.xx.57.20"
         }
        ]
       }
      ]
     }
    ],
    "uplinkProfiles": [
     {
      "name": "m01-cl01-vds01-az2-hostswitch",
      "teamings": [
       {
        "activeUplinks": [
         "uplink-1",
         "uplink-2"
        ],
        "name": "DEFAULT",
        "policy": "LOADBALANCE_SRCID",
        "standByUplinks": []
       }
      ],
      "transportVlan": 1108
     }
    ]
   }
  },
  "witnessSpec": {
   "fqdn": "w01-vsw30001.domain.internal",
   "vsanCidr": "xx.xx.80.0/24",
   "vsanIp": "xx.xx.80.22"
  },
  "witnessTrafficSharedWithVsanTraffic": true
}
}
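The hostSpecs entries in the specification above differ only in hostname and ID, so they can be generated instead of typed. The following sketch is one possible way to do that, using the names and IDs from this article; the helper name build_host_spec is our own and not part of any VMware tooling.

```python
import json

VDS_NAME = "m01-cl01-vds01"
PROFILE_NAME = "m01-np02"
LICENSE_KEY = "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX"  # placeholder, as in the article

# FQDN -> host ID, as collected from the hosts API response.
AZ2_HOSTS = {
    "m01-esx20001.domain.internal": "1e826b08-52c5-4e95-85c9-b205c777f55a",
    "m01-esx20002.domain.internal": "c59ec647-90fa-4537-a330-0debc63dc26b",
    "m01-esx20003.domain.internal": "b1e3481f-fd65-4db9-8c8a-2e2138d017a8",
    "m01-esx20004.domain.internal": "ae56f9b0-ddee-4f28-8389-f0ec82ea2055",
}


def build_host_spec(fqdn, host_id):
    """Build one hostSpecs entry with vmnic0/vmnic1 mapped to uplink1/uplink2."""
    return {
        "hostname": fqdn,
        "hostNetworkSpec": {
            "networkProfileName": PROFILE_NAME,
            "vmNics": [
                {"id": f"vmnic{i}", "uplink": f"uplink{i + 1}", "vdsName": VDS_NAME}
                for i in range(2)
            ],
        },
        "id": host_id,
        "licenseKey": LICENSE_KEY,
    }


host_specs = [build_host_spec(fqdn, hid) for fqdn, hid in AZ2_HOSTS.items()]
print(json.dumps(host_specs, indent=1))
```

The printed list can be pasted into the hostSpecs key of the specification; the remaining sections stay as written.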

Perform validation of the JSON specification

We now have the JSON specification ready. Before starting the cluster stretch task, we need to validate it.

To do this, navigate to the Developer Center, then to API Explorer, and select Clusters. Choose the POST /v1/clusters/{id}/validations option. Enter the cluster ID we gathered earlier, paste the JSON specification into the clusterUpdateSpec field, and click Execute.


If the input is valid, you should receive a response with resultStatus set to "SUCCEEDED". Once you see this result, you can proceed with creating the cluster stretch task.
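When the validation does not succeed, the interesting part is which individual check failed. The sketch below summarizes a saved validation response. It assumes the response carries a top-level "resultStatus" plus a "validationChecks" list whose entries have "description" and "resultStatus"; that matches our reading of the API, but the exact shape may differ per VCF version. Both sample responses are hypothetical.

```python
def failed_checks(validation):
    """Return descriptions of validation checks that did not succeed."""
    return [
        check.get("description", "<no description>")
        for check in validation.get("validationChecks", [])
        if check.get("resultStatus") != "SUCCEEDED"
    ]


def validation_passed(validation):
    """True only if the overall result and every listed check succeeded."""
    return (
        validation.get("resultStatus") == "SUCCEEDED"
        and not failed_checks(validation)
    )


# Hypothetical responses for illustration.
good = {
    "resultStatus": "SUCCEEDED",
    "validationChecks": [
        {"description": "Validating input specification", "resultStatus": "SUCCEEDED"},
    ],
}
bad = {
    "resultStatus": "FAILED",
    "validationChecks": [
        {"description": "Validating input specification", "resultStatus": "SUCCEEDED"},
        {"description": "Validating witness host", "resultStatus": "FAILED"},
    ],
}

print(validation_passed(good), failed_checks(good))
print(validation_passed(bad), failed_checks(bad))
```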


Update cluster by Stretching a standard vSAN cluster

After a successful validation of the JSON specification, you are ready to update the cluster for stretching.

To do this, navigate to the Developer Center, then to API Explorer, and select Clusters. Choose the PATCH /v1/clusters/{id} option. Enter the cluster ID we gathered earlier, paste the JSON specification into the clusterUpdateSpec field, and click Execute.


In the response, you should see a task created for stretching the management domain vSAN cluster. This task will also appear in the SDDC Manager's tasks view at the bottom of the screen.
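The same call can be made outside the API Explorer. The sketch below only builds the PATCH request with Python's standard library and does not send it. The SDDC Manager hostname, the cluster ID, and the access token are placeholders, and the Bearer token header reflects our assumption about how the SDDC Manager API authenticates requests.

```python
import json
import urllib.request


def build_stretch_request(sddc_manager, cluster_id, access_token, stretch_spec):
    """Build (but do not send) the PATCH /v1/clusters/{id} request."""
    url = f"https://{sddc_manager}/v1/clusters/{cluster_id}"
    return urllib.request.Request(
        url,
        data=json.dumps(stretch_spec).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


request = build_stretch_request(
    "sddc-manager.domain.internal",          # hypothetical hostname
    "00000000-0000-0000-0000-000000000000",  # placeholder cluster ID
    "<access-token>",                        # placeholder token
    {"clusterStretchSpec": {}},              # placeholder; use the full specification
)
print(request.get_method(), request.full_url)

# Sending is left out on purpose. Passing `request` to
# urllib.request.urlopen() would start the stretch task.
```

Separating request construction from sending makes it possible to inspect the URL and body before anything changes in the environment.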


Issue during stretch task

One issue we encountered was a failure in the subtask "Configure Static Routes On ESXi Hosts Along With Witness Validation". We were unsure of the cause of this error, so we began by investigating the log files.


In the log file /var/log/vmware/vcf/domainmanager/domainmanager.log, we found the following error message: "Could not connect to the SSH server @ esxi-fqdn for configuration." Together with the "Connection refused" exception in the stack trace, this led us to believe that SSH was disabled on the ESXi hosts.

Error:

2024-07-11T08:53:37.320+0000 DEBUG [vcf_dm,668f9d8c598a98687e2d1a17ee230635,8c2c] [c.v.v.c.f.p.a.i.ConfigureStaticRouteOnHostsWithWitnessValidationAction,dm-exec-5]  Trying to ping xx.xx.80.22 from m01-esx20003.domain.internal
2024-07-11T08:53:37.330+0000 DEBUG [vcf_dm,668f9d8c598a98687e2d1a17ee230635,8c2c] [c.v.v.s.c.s.SecurityConfigurationServiceImpl,dm-exec-5]  Security config retrieved {"fipsMode":false}
2024-07-11T08:53:37.332+0000 ERROR [vcf_dm,668f9d8c598a98687e2d1a17ee230635,8c2c] [c.v.evo.sddc.common.util.SshUtil,dm-exec-5]  Unable to create jsch CLI session:
com.jcraft.jsch.JSchException: java.net.ConnectException: Connection refused
        at com.jcraft.jsch.Util.createSocket(Util.java:394)
        at com.jcraft.jsch.Session.connect(Session.java:215)
        at com.vmware.evo.sddc.common.util.SshUtil.getSession(SshUtil.java:678)
        at com.vmware.evo.sddc.common.util.SshUtil.getSession(SshUtil.java:626)
        at com.vmware.evo.sddc.common.util.command.SshCommandExecuter.<init>(SshCommandExecuter.java:46)
        at com.vmware.evo.sddc.common.util.command.SshCommandExecuterFactory.createSshCommandExecuter(SshCommandExecuterFactory.java:71)
        at com.vmware.evo.sddc.common.util.command.SshCommandExecuterFactory.createSshCommandExecuter(SshCommandExecuterFactory.java:42)
        at jdk.internal.reflect.GeneratedMethodAccessor2203.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:343)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:196)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:750)
        at org.springframework.validation.beanvalidation.MethodValidationInterceptor.invoke(MethodValidationInterceptor.java:141)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:184)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:750)
        at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:702)
        at com.vmware.evo.sddc.common.util.command.SshCommandExecuterFactory$$SpringCGLIB$$0.createSshCommandExecuter(<generated>)
        at com.vmware.vcf.common.fsm.plugins.action.impl.ConfigureStaticRouteOnHostsWithWitnessValidationAction.witnessConnectivityCheck(ConfigureStaticRouteOnHostsWithWitnessValidationAction.java:78)
        at com.vmware.vcf.common.fsm.plugins.action.impl.ConfigureStaticRouteOnHostsWithWitnessValidationAction.postValidate(ConfigureStaticRouteOnHostsWithWitnessValidationAction.java:66)
        at com.vmware.vcf.common.fsm.plugins.action.impl.ConfigureStaticRouteOnHostsWithWitnessValidationAction.postValidate(ConfigureStaticRouteOnHostsWithWitnessValidationAction.java:29)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionState.lambda$static$1(FsmActionState.java:23)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionState.invoke(FsmActionState.java:62)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionPlugin.invoke(FsmActionPlugin.java:159)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionPlugin.invoke(FsmActionPlugin.java:144)
        at com.vmware.evo.sddc.orchestrator.core.ProcessingTaskSubscriber.invokeMethod(ProcessingTaskSubscriber.java:400)
        at com.vmware.evo.sddc.orchestrator.core.ProcessingTaskSubscriber.processTask(ProcessingTaskSubscriber.java:561)
        at com.vmware.evo.sddc.orchestrator.core.ProcessingTaskSubscriber.accept(ProcessingTaskSubscriber.java:124)
        at jdk.internal.reflect.GeneratedMethodAccessor887.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at com.google.common.eventbus.Subscriber.invokeSubscriberMethod(Subscriber.java:85)
        at com.google.common.eventbus.Subscriber.lambda$dispatchEvent$0(Subscriber.java:71)
        at com.vmware.vcf.common.tracing.TraceRunnable.run(TraceRunnable.java:59)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.Net.connect0(Native Method)
        at java.base/sun.nio.ch.Net.connect(Net.java:579)
        at java.base/sun.nio.ch.Net.connect(Net.java:568)
        at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:588)
        at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
        at java.base/java.net.Socket.connect(Socket.java:633)
        at java.base/java.net.Socket.connect(Socket.java:583)
        at java.base/java.net.Socket.<init>(Socket.java:507)
        at java.base/java.net.Socket.<init>(Socket.java:287)
        at com.jcraft.jsch.Util$1.run(Util.java:362)
        ... 1 common frames omitted
2024-07-11T08:53:37.333+0000 ERROR [vcf_dm,668f9d8c598a98687e2d1a17ee230635,8c2c] [c.v.e.s.c.u.c.SshCommandExecuter,dm-exec-5]  Could not connect to the SSH server @ m01-esx20003.domain.internal for configuration.
com.jcraft.jsch.JSchException: java.net.ConnectException: Connection refused
        at com.jcraft.jsch.Util.createSocket(Util.java:394)
        at com.jcraft.jsch.Session.connect(Session.java:215)
        at com.vmware.evo.sddc.common.util.SshUtil.getSession(SshUtil.java:678)
        at com.vmware.evo.sddc.common.util.SshUtil.getSession(SshUtil.java:626)
        at com.vmware.evo.sddc.common.util.command.SshCommandExecuter.<init>(SshCommandExecuter.java:46)
        at com.vmware.evo.sddc.common.util.command.SshCommandExecuterFactory.createSshCommandExecuter(SshCommandExecuterFactory.java:71)
        at com.vmware.evo.sddc.common.util.command.SshCommandExecuterFactory.createSshCommandExecuter(SshCommandExecuterFactory.java:42)
        at jdk.internal.reflect.GeneratedMethodAccessor2203.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:343)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:196)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:750)
        at org.springframework.validation.beanvalidation.MethodValidationInterceptor.invoke(MethodValidationInterceptor.java:141)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:184)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:750)
        at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:702)
        at com.vmware.evo.sddc.common.util.command.SshCommandExecuterFactory$$SpringCGLIB$$0.createSshCommandExecuter(<generated>)
        at com.vmware.vcf.common.fsm.plugins.action.impl.ConfigureStaticRouteOnHostsWithWitnessValidationAction.witnessConnectivityCheck(ConfigureStaticRouteOnHostsWithWitnessValidationAction.java:78)
        at com.vmware.vcf.common.fsm.plugins.action.impl.ConfigureStaticRouteOnHostsWithWitnessValidationAction.postValidate(ConfigureStaticRouteOnHostsWithWitnessValidationAction.java:66)
        at com.vmware.vcf.common.fsm.plugins.action.impl.ConfigureStaticRouteOnHostsWithWitnessValidationAction.postValidate(ConfigureStaticRouteOnHostsWithWitnessValidationAction.java:29)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionState.lambda$static$1(FsmActionState.java:23)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionState.invoke(FsmActionState.java:62)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionPlugin.invoke(FsmActionPlugin.java:159)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionPlugin.invoke(FsmActionPlugin.java:144)
        at com.vmware.evo.sddc.orchestrator.core.ProcessingTaskSubscriber.invokeMethod(ProcessingTaskSubscriber.java:400)
        at com.vmware.evo.sddc.orchestrator.core.ProcessingTaskSubscriber.processTask(ProcessingTaskSubscriber.java:561)
        at com.vmware.evo.sddc.orchestrator.core.ProcessingTaskSubscriber.accept(ProcessingTaskSubscriber.java:124)
        at jdk.internal.reflect.GeneratedMethodAccessor887.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at com.google.common.eventbus.Subscriber.invokeSubscriberMethod(Subscriber.java:85)
        at com.google.common.eventbus.Subscriber.lambda$dispatchEvent$0(Subscriber.java:71)
        at com.vmware.vcf.common.tracing.TraceRunnable.run(TraceRunnable.java:59)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.Net.connect0(Native Method)
        at java.base/sun.nio.ch.Net.connect(Net.java:579)
        at java.base/sun.nio.ch.Net.connect(Net.java:568)
        at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:588)
        at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
        at java.base/java.net.Socket.connect(Socket.java:633)
        at java.base/java.net.Socket.connect(Socket.java:583)
        at java.base/java.net.Socket.<init>(Socket.java:507)
        at java.base/java.net.Socket.<init>(Socket.java:287)
        at com.jcraft.jsch.Util$1.run(Util.java:362)
        ... 1 common frames omitted

Stretch task completed

After we enabled the SSH service on the AZ2 ESXi hosts and retried the task, it completed successfully. When we consulted VMware by Broadcom, we learned that the commissioning task has been modified and that SSH is now disabled during this process.
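Before retrying a failed task, it can be useful to confirm that TCP port 22 is actually reachable on every AZ2 host. The sketch below is a generic TCP connect check and not part of the VCF tooling. It reports whether a connection can be opened, which is the step that was refused in the log above; it does not verify SSH authentication.

```python
import socket


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


def check_hosts(hosts):
    """Map each host name to the result of the TCP check on port 22."""
    return {host: ssh_port_open(host) for host in hosts}
```

Calling check_hosts with the four AZ2 FQDNs from the SDDC Manager appliance returns a dictionary of host names to True or False, so hosts that still refuse connections are easy to spot.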


As shown in the picture below, we now have a stretched VCF management domain cluster that spans both availability zones.

VCF