This week, I was troubleshooting an issue with one of the vROPS (vRealize Operations Manager) clusters at a customer. I was unable to perform the “bring the cluster online” task from the admin page of vROPS after a restart.
The issue
As shown in the picture below, I was unable to bring the cluster online. The cluster status was “failure” with the error message: Cluster failed to come online.
Performing the "bring cluster online" task results in a failure in the admin portal of vROPS.
Troubleshooting steps
Rebooting the vROPS nodes didn't make any difference, so let's take a look at the vROPS log files.
Tasks like "Bring Cluster Online" or "Take Cluster Offline" are logged in the following log file: /var/log/casa_logs/casa.log.
2022-10-18T10:59:56,906+0000 ERROR [ajp-nio-127.0.0.1-8011-exec-2] [Ff00006U] sysadmin.cassandra.PythonCassandraCommand:296 - Could not run command='/usr/lib/vmware-python-3/bin/python /usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/bin/vcopsConfigureRoles.py --action startServices --force cassandra'
stdout:
2022-10-18T10:59:56,881+0000 [11722] - admin - An unhandled exception occurred, exiting with exit code: 1,
Type: "<class 'urllib.error.URLError'>"
Value: "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)>"
Traceback: "Traceback (most recent call last):
File "/usr/lib/python3.7/urllib/request.py", line 1348, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/usr/lib/python3.7/http/client.py", line 1281, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.7/http/client.py", line 1327, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.7/http/client.py", line 1276, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.7/http/client.py", line 1036, in _send_output
self.send(msg)
File "/usr/lib/python3.7/http/client.py", line 976, in send
self.connect()
File "/usr/lib/python3.7/http/client.py", line 1451, in connect
server_hostname=server_hostname)
File "/usr/lib/python3.7/ssl.py", line 423, in wrap_socket
session=session
File "/usr/lib/python3.7/ssl.py", line 870, in _create
self.do_handshake()
File "/usr/lib/python3.7/ssl.py", line 1139, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/bin/vcopsConfigureRoles.py", line 1487, in <module>
main()
File "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/bin/vcopsConfigureRoles.py", line 1481, in main
doConfigurationsAndActions(loadStateFile, runConfigureRoles, rolesToModify, adminRoleConnectionString, context, runBringSliceOffline, runBringSliceOnline, runRepairRoles, doInitSliceId, runWriteRolesToStateFile, startServices, startServicesOnConfig, stopServices, serviceStatus, joinCasaCluster, waitForFirstbootScripts, setLock, enableDisableServices, disableAllServices, runPromoteNewMaster, oldPostgresMaster, enableHA, replica, enrollmentUserString, enrollmentThumbprintString, offlineReason, useHTTPSOnly, force, jsonOutput, args)
File "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/bin/vcopsConfigureRoles.py", line 1430, in doConfigurationsAndActions
runStartServices(runningRoleStateFile, rolesToModify, enableServices = enableDisableServices, services = args, force = force, context = context)
File "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/bin/vcopsConfigureRoles.py", line 370, in runStartServices
force=force, context=context)
File "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/bin/vcopsPlatformServices.py", line 307, in startPlatformServices
cassandra_check.wait_for_cassandra(this_node_only=True)
File "/usr/lib/vmware-vcopssuite/utilities/vmware/vcops/cassandra/check.py", line 512, in wait_for_cassandra
if self._retry_timeout_sec == 0 or self._are_enough_nodes_up(this_node_only):
File "/usr/lib/vmware-vcopssuite/utilities/vmware/vcops/cassandra/check.py", line 284, in _are_enough_nodes_up
is_CA_enabled = CassandraCheck.get_CA_enabled(logger)
File "/usr/lib/vmware-vcopssuite/utilities/vmware/vcops/cassandra/check.py", line 618, in get_CA_enabled
ca_state = CassandraGetApiExecutor.execute_get_api_and_return_response('localhost', '/config/cassandra/cluster/ca', logger)
File "/usr/lib/vmware-vcopssuite/utilities/vmware/vcops/cassandra/cassandra_get_api_executor.py", line 51, in execute_get_api_and_return_response
status_code, token_pair = vc_ops_http_utilities.login(hostname)
File "/usr/lib/vmware-vcopssuite/utilities/lib/vc_ops_http_utilities.py", line 295, in login
response = opener.open(authorization_request)
File "/usr/lib/python3.7/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/usr/lib/python3.7/urllib/request.py", line 543, in _open
'_open', req)
File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/usr/lib/python3.7/urllib/request.py", line 1391, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/usr/lib/python3.7/urllib/request.py", line 1350, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)>
2022-10-18T10:59:56 ERROR [11722] - root - An unhandled exception occurred, exiting with exit code: 1,
Type: "<class 'urllib.error.URLError'>"
Value: "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)>"
Traceback: "Traceback (most recent call last):
File "/usr/lib/python3.7/urllib/request.py", line 1348, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/usr/lib/python3.7/http/client.py", line 1281, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.7/http/client.py", line 1327, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.7/http/client.py", line 1276, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.7/http/client.py", line 1036, in _send_output
self.send(msg)
File "/usr/lib/python3.7/http/client.py", line 976, in send
self.connect()
File "/usr/lib/python3.7/http/client.py", line 1451, in connect
server_hostname=server_hostname)
File "/usr/lib/python3.7/ssl.py", line 423, in wrap_socket
session=session
File "/usr/lib/python3.7/ssl.py", line 870, in _create
self.do_handshake()
File "/usr/lib/python3.7/ssl.py", line 1139, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/bin/vcopsConfigureRoles.py", line 1487, in <module>
main()
File "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/bin/vcopsConfigureRoles.py", line 1481, in main
doConfigurationsAndActions(loadStateFile, runConfigureRoles, rolesToModify, adminRoleConnectionString, context, runBringSliceOffline, runBringSliceOnline, runRepairRoles, doInitSliceId, runWriteRolesToStateFile, startServices, startServicesOnConfig, stopServices, serviceStatus, joinCasaCluster, waitForFirstbootScripts, setLock, enableDisableServices, disableAllServices, runPromoteNewMaster, oldPostgresMaster, enableHA, replica, enrollmentUserString, enrollmentThumbprintString, offlineReason, useHTTPSOnly, force, jsonOutput, args)
File "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/bin/vcopsConfigureRoles.py", line 1430, in doConfigurationsAndActions
runStartServices(runningRoleStateFile, rolesToModify, enableServices = enableDisableServices, services = args, force = force, context = context)
File "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/bin/vcopsConfigureRoles.py", line 370, in runStartServices
force=force, context=context)
File "/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/bin/vcopsPlatformServices.py", line 307, in startPlatformServices
cassandra_check.wait_for_cassandra(this_node_only=True)
File "/usr/lib/vmware-vcopssuite/utilities/vmware/vcops/cassandra/check.py", line 512, in wait_for_cassandra
if self._retry_timeout_sec == 0 or self._are_enough_nodes_up(this_node_only):
File "/usr/lib/vmware-vcopssuite/utilities/vmware/vcops/cassandra/check.py", line 284, in _are_enough_nodes_up
is_CA_enabled = CassandraCheck.get_CA_enabled(logger)
File "/usr/lib/vmware-vcopssuite/utilities/vmware/vcops/cassandra/check.py", line 618, in get_CA_enabled
ca_state = CassandraGetApiExecutor.execute_get_api_and_return_response('localhost', '/config/cassandra/cluster/ca', logger)
File "/usr/lib/vmware-vcopssuite/utilities/vmware/vcops/cassandra/cassandra_get_api_executor.py", line 51, in execute_get_api_and_return_response
status_code, token_pair = vc_ops_http_utilities.login(hostname)
File "/usr/lib/vmware-vcopssuite/utilities/lib/vc_ops_http_utilities.py", line 295, in login
response = opener.open(authorization_request)
File "/usr/lib/python3.7/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/usr/lib/python3.7/urllib/request.py", line 543, in _open
'_open', req)
File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/usr/lib/python3.7/urllib/request.py", line 1391, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/usr/lib/python3.7/urllib/request.py", line 1350, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)>
"
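The relevant error in this log, raised while the cluster was being brought up, is: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired. The traceback shows that it is thrown when vcopsConfigureRoles.py starts the Cassandra service and tries to log in to the node's own HTTPS endpoint (localhost), so the expired certificate is the one presented by the vROPS node itself.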
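The traceback is long, and the actual cause is easy to miss between the stack frames. As a rough sketch (not part of the VMware tooling), the Python snippet below pulls the distinct certificate verification failure reasons out of a log and counts them. The function names are my own, and the sample line is copied from the log above.

```python
import re
from collections import Counter

# Matches the text Python's ssl module emits on a failed verification, e.g.
# "[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)"
# and captures the reason that follows "certificate verify failed:".
SSL_ERROR = re.compile(
    r'\[SSL: CERTIFICATE_VERIFY_FAILED\] certificate verify failed: ([^(>"]+)'
)


def summarize_ssl_failures(log_text: str) -> Counter:
    """Count each distinct verification failure reason found in the log text."""
    reasons = Counter()
    for match in SSL_ERROR.finditer(log_text):
        reasons[match.group(1).strip()] += 1
    return reasons


def summarize_log_file(path: str) -> Counter:
    """Same as above, but reads the log from a file such as casa.log."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return summarize_ssl_failures(handle.read())


# Sample line taken from the casa.log excerpt in this post.
sample = (
    'Value: "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify '
    'failed: certificate has expired (_ssl.c:1076)>"'
)
print(summarize_ssl_failures(sample))
```

On a vROPS node, summarize_log_file("/var/log/casa_logs/casa.log") would run the same check against the real log file, assuming you have read access to it.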
The solution
Note: Make sure you have a snapshot of the vROPS nodes in place before making any changes.
The first thing that came to my mind was: does the issue still persist with the default self-signed certificates?
So I followed the VMware KB on replacing the certificates of the vROPS nodes with the default certificates.
The following steps need to be performed on all of the vROPS nodes:
unset -f pathprepend
unset -f pathremove
unset -f pathappend
$VMWARE_PYTHON_BIN /usr/lib/vmware-casa/bin/activate_web_certificate.py DEFAULT
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcopssuite/utilities/bin/restartHttpd.py
Note: The unset commands are required to avoid errors caused by the Python version differences between 6.x/7.x and 8.x.
Running the commands on all of the vROPS nodes.
After the default certificates have been reloaded, the new self-signed SSL certificate will be active on the vROPS nodes.
You can verify that in your internet browser by checking the certificate details.
Self-signed certificates have been configured.
This is the moment we have all been waiting for. Will the "bring the cluster online" task work with the new default SSL certificates?
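Besides the browser, the expiry date can also be checked from a script. The sketch below is only an illustration: it assumes you already have the certificate's notAfter value as text, in the format that "openssl x509 -noout -enddate" prints after the "notAfter=" prefix. The function names and the two example dates are hypothetical; the actual expiry date of the customer's certificate is not shown in this post. Only the "now" timestamp is taken from the log above.

```python
import ssl
from datetime import datetime, timezone


def seconds_until_expiry(not_after: str, now: datetime) -> float:
    """Return the seconds between `now` and the certificate's notAfter time.

    `not_after` must look like "Oct 18 10:59:56 2022 GMT", which is the
    format ssl.cert_time_to_seconds() expects.
    """
    return ssl.cert_time_to_seconds(not_after) - now.timestamp()


def is_expired(not_after: str, now: datetime = None) -> bool:
    """True if the certificate's notAfter time is at or before `now`."""
    now = now or datetime.now(timezone.utc)
    return seconds_until_expiry(not_after, now) <= 0


# Time of the failed "bring cluster online" attempt, taken from casa.log.
failure_time = datetime(2022, 10, 18, 10, 59, 56, tzinfo=timezone.utc)

# Hypothetical notAfter values, for illustration only.
print(is_expired("Oct 17 00:00:00 2022 GMT", failure_time))
print(is_expired("Oct 19 00:00:00 2022 GMT", failure_time))
```

Running a check like this on a schedule, with a warning threshold of a few weeks, would flag a certificate before it expires instead of after the cluster fails to start.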
The cluster is finally online again.
After discovering that the SSL certificate was causing the issue, we requested a new domain-signed SSL certificate and configured it on the vROPS nodes. A subsequent cluster restart worked without any issues. We can conclude that an expired domain-signed SSL certificate prevents the vROPS cluster from coming online.