Reducing the CPU Quota in vCloud Director 9.7 results in an error
The error message
We were getting an error message when trying to reduce the CPU quota of an OrgVDC to 92 GHz in vCD 9.7: “Unable to update the Organization VDC configuration because the new value for CPU resource is less than current usage by all Virtual Machines within this Organization VDC”. This error message, combined with the general and allocation views of the OrgVDC, confused me a bit.
The OrgVDC that was giving us the error message was using 88 GHz of its total of 104 GHz, as shown in the picture below.
The next thing I noticed was the “used CPU allocation” in the allocation view of the OrgVDC: it showed 88 THz used of the 104 GHz configured CPU quota. 88 THz is quite a lot for an OrgVDC, right? According to VMware, this is a GUI bug.
In the task details, we got the same error message together with a job ID.
Troubleshooting
The first thing to do is to check the total amount of CPU in use by the OrgVDC. At the vCenter level, the OrgVDC was using 98 GHz in total. That explained why we couldn't lower the CPU quota to 92 GHz, but vCD was showing different values, as shown in the previous pictures. This pointed to an issue in vCloud Director, so I started to investigate the logs of the vCD cells. Most of the time, the logs contain additional information that is not available in the web GUI. To investigate the logs, open SSH sessions to all the vCD cells.
Once connected to the vCD cells, go into the logs folder by using the following command:
cd /opt/vmware/vcloud-director/logs
Now we will search the logs folder to find which log files contain the error. Use grep with the job ID from the task details; the following command prints the name of each file that mentions it (-i ignores case, -l lists only the file names):
grep -il "JOBID" *
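If you prefer to script the search, or want to run it against logs copied off the cells, the following Python sketch does roughly what `grep -il` does: it lists the regular files in a directory whose contents contain the job ID, ignoring case. It assumes the default logs directory used above; the function name `files_containing` and the script name `find_job.py` are hypothetical, and like `grep -il ... *` it does not descend into subdirectories.

```python
import os
import sys

# Default logs directory on a vCD cell, as used in the cd command above.
LOG_DIR = "/opt/vmware/vcloud-director/logs"


def files_containing(job_id, log_dir=LOG_DIR):
    """Return the sorted names of regular files in log_dir that contain
    job_id, compared case-insensitively (roughly what `grep -il` does)."""
    needle = job_id.lower()
    matches = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories, as `grep -il ... *` effectively does
        # errors="replace" keeps the scan going if a file has non-UTF-8 bytes.
        with open(path, encoding="utf-8", errors="replace") as handle:
            if any(needle in line.lower() for line in handle):
                matches.append(name)
    return matches


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_job.py <job-id> [log-dir]  (script name is hypothetical)
    directory = sys.argv[2] if len(sys.argv) > 2 else LOG_DIR
    for match in files_containing(sys.argv[1], directory):
        print(match)
```

Run it on each cell with the job ID from the task details; on both of our cells it would be expected to point at the same file grep found.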
Both vCD cells had a file called “vcloud-container-debug.log” with an entry for the job ID. In the first cell, we could see the following error message:
2019-11-22 12:49:48,298 | ERROR | task-service-activity-pool-72 | VdcService | Error updating VDC | requestId=358fadf6-eba4-43e4-b3e5-4119789c52ac,request=PUT https://domainname.com/api/admin/vdc/3c2cb132-6ac3-477a-bede-34eb072d9c21,requestTime=1574423385377,remoteAddress=10.0.10.241:53478,userAgent=Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/2010...,accept=application/*+json;version 32.0 vcd=3080855e-47b0-4c40-9df3-3d744e66e3c2,task=547186d2-5d71-4fd7-a68f-16370343c5ac activity=(com.vmware.vcloud.backendbase.management.system.TaskActivity,urn:uuid:547186d2-5d71-4fd7-a68f-16370343c5ac) com.vmware.vcloud.valc.exception.InvalidConfigException: Unable to update the Organization VDC configuration because the new value for CPU resource is less than current usage by all Virtual Machines within this Organization VDC. at com.vmware.vcloud.valc.activities.UpdateComputeActivity$FinalPhase.invoke(UpdateComputeActivity.java:316) at com.vmware.vcloud.activity.executors.ActivityRunner.runPhase(ActivityRunner.java:175) at com.vmware.vcloud.activity.executors.ActivityRunner.run(ActivityRunner.java:112) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: com.vmware.vcloud.fabric.compute.ValidationException: ValidationException VRP_NEWLY_CONFIGURED_CPU_RESOURCE_INSUFFICIENT at com.vmware.vcloud.fabric.compute.vrp.GenericVirtualResourcePool.validateCapacityUsage(GenericVirtualResourcePool.java:575) at
In the second cell:
2019-11-22 12:49:46,768 | ERROR | compute-fabric-activity-pool71 | GenericVirtualResourcePool | Current usage of 98,000Mhz cpu is more than the newlyconfigured cpu 92,000Mhz. | requestId=358fadf6-eba4-43e4-b3e5-4119789c52ac,request=PUT https://domainname.com/api/admin/vdc/3c2cb132-6ac3-477a-bede-34eb072d9c21,requestTime=1574423385377,remoteAddress=10.0.10.241:53478,userAgent=Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/2010...,accept=application/*+json;version 32.0 vcd=3080855e-47b0-4c40-9df3-3d744e66e3c2,task=547186d2-5d71-4fd7-a68f-16370343c5ac activity=(com.vmware.vcloud.backendbase.management.system.TaskActivity,urn:uuid:547186d2-5d71-4fd7-a68f-16370343c5ac) activity=(com.vmware.vcloud.valc.activities.UpdateVdcActivity,urn:uuid:1051200b-ecc1-4247-9c00-7d3b2285e92a) activity=(com.vmware.vcloud.valc.activities.UpdateComputeActivity,urn:uuid:145c4284-7727-4441-8ad3-781ead932016)
The additional error logs we needed were located on the second vCD cell. Error message: “Current usage of 98,000Mhz cpu is more than the newlyconfigured cpu 92,000Mhz.” The current usage in the log matches the usage from our earlier calculation. We tried lowering the CPU quota to 100 GHz instead, and that worked without any issues. With help from VMware, we managed to find the root cause of the issue.
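The second cell's message carries the two numbers that matter. As a minimal sketch, the Python snippet below pulls them out of the log line with a regular expression and prints the shortfall. The pattern is written against the exact wording shown above (including "newlyconfigured" without a space) and may need adjusting for other vCD versions; `parse_cpu_validation` is a hypothetical helper name.

```python
import re

# Matches the GenericVirtualResourcePool message from the second vCD cell.
USAGE_PATTERN = re.compile(
    r"Current usage of ([\d,]+)Mhz cpu is more than "
    r"the newlyconfigured cpu ([\d,]+)Mhz"
)


def parse_cpu_validation(line):
    """Return (current_usage_mhz, requested_mhz) from a log line, or None."""
    match = USAGE_PATTERN.search(line)
    if match is None:
        return None
    # Strip the thousands separators ("98,000" -> 98000) before converting.
    usage, requested = (int(value.replace(",", "")) for value in match.groups())
    return usage, requested


line = ("2019-11-22 12:49:46,768 | ERROR | compute-fabric-activity-pool71 | "
        "GenericVirtualResourcePool | Current usage of 98,000Mhz cpu is more "
        "than the newlyconfigured cpu 92,000Mhz.")
usage, requested = parse_cpu_validation(line)
print(f"usage={usage / 1000:g} GHz, requested={requested / 1000:g} GHz, "
      f"shortfall={(usage - requested) / 1000:g} GHz")
```

This prints `usage=98 GHz, requested=92 GHz, shortfall=6 GHz`. Assuming vCD only rejects a quota that is lower than the reported usage, any value of at least 98 GHz would pass, which fits the 100 GHz change succeeding.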
Resolution
To identify the VMs, we first need to find the vrp_id of the OrgVDC:
SELECT * FROM vrp where name like '%OrgName%'
Next, check the CPU limits and vCPU counts of all the VMs in that particular OrgVDC.
The query multiplies the vCPU count by the current vCPU speed of the OrgVDC. In this example, the OrgVDC has a vCPU speed of 2000 MHz.
Replace the vrp_id with the ID returned by the first query.
SELECT name,moref,vcpu_count,cpu_limit,(vcpu_count*2000) as vCPU FROM vm_inv where moref in( SELECT vmmoref FROM computevm where deployment_status in('DEPLOYED', 'PENDING_DEPLOYMENT') and vrp_id = 0x871A32890EB24B62B310F6E1C1E77C7A )
The cpu_limit should match the calculated vCPU value. In our case, 8 VMs did not match.
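The comparison the query sets up can be expressed as a small check: the expected limit is vcpu_count × vCPU speed, and any VM whose cpu_limit differs is a candidate. The Python sketch below assumes you have exported the name, vcpu_count and cpu_limit columns from the query; the `VmRow` type, the `mismatched` helper, and the rows and VM names are hypothetical sample data, not output from this environment.

```python
from dataclasses import dataclass

# vCPU speed configured on the OrgVDC in this example (2000 MHz).
VCPU_SPEED_MHZ = 2000


@dataclass
class VmRow:
    """One exported row of the vm_inv query (hypothetical structure)."""
    name: str
    vcpu_count: int
    cpu_limit: int  # MHz


def mismatched(rows, vcpu_speed_mhz=VCPU_SPEED_MHZ):
    """Return (name, expected_mhz, actual_mhz) for every VM whose cpu_limit
    differs from vcpu_count * vcpu_speed_mhz."""
    result = []
    for row in rows:
        expected = row.vcpu_count * vcpu_speed_mhz
        if row.cpu_limit != expected:
            result.append((row.name, expected, row.cpu_limit))
    return result


# Hypothetical rows for illustration only; use the output of your own query.
rows = [
    VmRow("vm-a", 2, 4000),
    VmRow("vm-b", 1, 3000),  # still carries the old 3000 MHz limit
    VmRow("vm-c", 4, 8000),
]
for name, expected, actual in mismatched(rows):
    print(f"{name}: expected {expected} MHz, cpu_limit {actual} MHz")
```

With the sample rows, only `vm-b` is reported, because its limit still reflects the old 3000 MHz vCPU speed.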
What we can conclude is the following:
At one point, the vCPU speed of the OrgVDC was 3 GHz; it is now 2 GHz. Eight VMs were powered on while the speed was 3 GHz and have remained powered on since, so each of them still consumes an extra 1 GHz.
SELECT vmc.name as vApp,vappvm.name as VM, vm_inv.moref, vm_inv.vcpu_count, vm_inv.cpu_limit,(vm_inv.vcpu_count*2000) as vCPU FROM vm_inv left join vm on vm.moref = vm_inv.moref left join vapp_vm vappvm on vappvm.svm_id = vm.id left join vm_container vmc on vmc.sg_id = vappvm.vapp_id left join computevm cvm on cvm.vmmoref = vm_inv.moref where cvm.vrp_id = 0x871A32890EB24B62B310F6E1C1E77C7A and vm_inv.cpu_limit = 3000 and cvm.deployment_status in('DEPLOYED', 'PENDING_DEPLOYMENT') order by vmc.name
With this query, we filtered all the VMs that still have a cpu_limit of 3000 and therefore do not match.
These VMs need to be rebooted to apply the new vCPU speed of 2 GHz. Once that is done, the current usage of 98 GHz drops to 90 GHz. After that, we were able to reduce the CPU quota to 92 GHz.
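The numbers in this case add up as follows. This worked check in Python uses the figures from this article and assumes each of the 8 stale VMs has a single vCPU (which is what filtering on cpu_limit = 3000 at 3000 MHz per vCPU implies) and that vCD only rejects a quota lower than the current usage.

```python
# Figures from this OrgVDC, all in MHz.
reported_usage = 98_000   # current usage according to the vCD log
stale_vms = 8             # VMs still running with the old 3000 MHz limit
vcpus_per_stale_vm = 1    # assumption: cpu_limit = 3000 at 3000 MHz per vCPU
old_speed = 3_000         # previous vCPU speed of the OrgVDC
new_speed = 2_000         # current vCPU speed of the OrgVDC
new_quota = 92_000        # the CPU quota we wanted to set

# Each stale vCPU is over-allocated by the difference in vCPU speed.
excess = stale_vms * vcpus_per_stale_vm * (old_speed - new_speed)
usage_after_reboot = reported_usage - excess

print(f"excess={excess} MHz, usage after reboot={usage_after_reboot} MHz")
# Assumption: vCD accepts a quota that is not lower than the current usage.
print("quota accepted" if new_quota >= usage_after_reboot else "quota rejected")
```

The excess is 8,000 MHz, so usage drops from 98,000 MHz to 90,000 MHz and the 92,000 MHz quota fits.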
I hope this simple trick helps you find additional information in the vCD log files that could lead you to the root cause of your own issue.