Symptoms:


A Zerto Live Failover or Failover Test takes a long time to complete. The VPG cannot be committed, or the VMs cannot be powered on after the operation. In Zerto 4.5 update 1, a hot fix was applied to increase the timeout value for VCD to 70 seconds. In versions after Zerto 4.5 update 1, a tweak is available to increase this timeout value: t_ZvmVCDSleepAfterCreateTempVolumeInSeconds

Note that the tweak only masks the underlying root cause, which is outlined below.

Identifying this in the ZVM and VCD logs: The following pointers can help identify the issue in an environment. They are advisory only.

ZVM: For each action performed, look for a delay between the start and the end of a task that should complete quickly. For example, a long delay between the following two entries is an indicator:

aab03342,baf5e799,17-01-13 09:20:57.93,I,1350,VCDProxy, PowerOnVM,Powering on VM=vm-60831 - we send the PowerOn command to vCD. 
aab03342,baf5e799,17-01-13 09:22:22.48,I,1350,VCDProxy,PowerOnVM,Done - we get a successful response back roughly 85 seconds later. 
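
The gap between two such entries can be measured rather than estimated by eye. The following is a minimal Python sketch. It assumes the timestamp is the third comma-separated field and uses a two-digit year followed by month and day; the function and variable names are illustrative and are not part of any Zerto tooling.

```python
from datetime import datetime

# Assumption: the third comma-separated field is the timestamp, written as
# yy-mm-dd HH:MM:SS.ff, as in the sample entries above.
ZVM_TIME_FORMAT = "%y-%m-%d %H:%M:%S.%f"


def zvm_timestamp(line):
    """Parse the timestamp field of a single ZVM log entry."""
    return datetime.strptime(line.split(",")[2].strip(), ZVM_TIME_FORMAT)


def zvm_delay_seconds(start_line, end_line):
    """Return the elapsed seconds between two ZVM log entries."""
    return (zvm_timestamp(end_line) - zvm_timestamp(start_line)).total_seconds()


# The two sample entries, without the explanatory annotations.
start = "aab03342,baf5e799,17-01-13 09:20:57.93,I,1350,VCDProxy, PowerOnVM,Powering on VM=vm-60831"
done = "aab03342,baf5e799,17-01-13 09:22:22.48,I,1350,VCDProxy,PowerOnVM,Done"

print(zvm_delay_seconds(start, done))
```

For the two sample entries this reports a delay of about 84.55 seconds for a single power-on request.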

VCD: A long delay between the start of VM placement and its completion is also an indicator in the VCD logs. Example:

Placement of VM begins:
2017-02-02 12:16:21,760 | INFO     | task-service-activity-pool-540 | PlacementSolverImpl    

Placement of VM is complete:
2017-02-02 12:28:42,133 | DEBUG    | storage-fabric-activity-pool-5685 | SdrsPlacementManagerImpl
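
The same measurement can be applied to the VCD entries. This is a minimal Python sketch, assuming the timestamp is the first pipe-separated field in the form shown in the sample; the helper names are hypothetical.

```python
from datetime import datetime

# Assumption: the first pipe-separated field is the timestamp, written as
# YYYY-MM-DD HH:MM:SS,mmm, as in the sample entries above.
VCD_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def vcd_timestamp(line):
    """Parse the leading timestamp of a single VCD log entry."""
    return datetime.strptime(line.split("|")[0].strip(), VCD_TIME_FORMAT)


def placement_seconds(begin_line, complete_line):
    """Return the elapsed seconds between placement start and completion."""
    return (vcd_timestamp(complete_line) - vcd_timestamp(begin_line)).total_seconds()


begin = "2017-02-02 12:16:21,760 | INFO     | task-service-activity-pool-540 | PlacementSolverImpl"
complete = "2017-02-02 12:28:42,133 | DEBUG    | storage-fabric-activity-pool-5685 | SdrsPlacementManagerImpl"

print(placement_seconds(begin, complete))
```

For the sample entries the placement takes about 740 seconds, a little over 12 minutes.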



Cause:


VCD slows down the process while it assesses where to place the new virtual machines on the hosts' datastores. During testing, this assessment can consume up to 50% of the associated failover time. This applies to VCD 8.1.x.

The root cause is a fragmented or slow database, together with cache misses when listing datastores.



Solution:


Schedule a VCD maintenance window to perform DBA tasks on the VCD database, in particular rebuilding the table indexes and clearing the VCD temporary tables.


Steps to Resolution:


1) First, stop the vCloud Director services on all cells in the environment. This is done by quiescing the cells to stop new tasks from starting and then stopping the services.
   
    Use the Cell Management Tool to Quiesce and Shut Down a Server
    http://pubs.vmware.com/vcd-810/topic/com.vmware.vcloud.install.doc_810/GUID-65C8B7B6-EC5E-4BDA-8564-56DD6671F5FE.html

2) Once the services have been stopped on all vCloud Director cells, the changes can be made to the vCD database. Take a backup of the database immediately before making the changes, in case they need to be rolled back.

3) Once the backup has been taken and all vCD cells are stopped, run the following statements to clear the temporary database information:

    delete from QRTZ_SCHEDULER_STATE;
    delete from QRTZ_FIRED_TRIGGERS;
    delete from QRTZ_PAUSED_TRIGGER_GRPS;
    delete from QRTZ_CALENDARS;
    delete from QRTZ_TRIGGER_LISTENERS;
    delete from QRTZ_BLOB_TRIGGERS;
    delete from QRTZ_CRON_TRIGGERS;
    delete from QRTZ_SIMPLE_TRIGGERS;
    delete from QRTZ_TRIGGERS;
    delete from QRTZ_JOB_LISTENERS;
    delete from QRTZ_JOB_DETAILS;

    delete from compute_resource_inv;
    delete from custom_field_manager_inv;
    delete from cluster_compute_resource_inv;
    delete from datacenter_inv;
    delete from datacenter_network_inv;
    delete from datastore_inv;
    delete from datastore_profile_inv;
    delete from dv_portgroup_inv;
    delete from dv_switch_inv;
    delete from folder_inv;
    delete from managed_server_inv;
    delete from managed_server_datastore_inv;
    delete from managed_server_network_inv;
    delete from network_inv;
    delete from resource_pool_inv;
    delete from storage_pod_inv;
    delete from storage_profile_inv;
    delete from task_inv;
    delete from vm_inv;
    delete from property_map;
   
4) Next, rebuild the indexes in the vCloud Director database. For MS SQL, we normally recommend the following, which is the same step taken by some vCloud Director database upgrades:

    USE [database-name]
    GO
    EXEC sp_MSforeachtable @command1="print '?' DBCC DBREINDEX ('?', ' ', 80)"
    GO


5) After making the changes above, start the vCloud Director cells. Start the services on the first cell only, using the following steps:

    Start vCloud Director Services
    http://pubs.vmware.com/vcd-810/topic/com.vmware.vcloud.install.doc_810/GUID-24B996D4-4786-49F6-9FD0-66FE928BBB7F.html

    You can monitor the startup progress on the vCD cell by running 'tail -f /opt/vmware/vcloud-director/logs/cell.log'


6) Once the vCloud Director UI is available again, monitor the vCenter Server reconnects under System, Manage & Monitor, vSphere Resources, vCenter Servers. Because the vCenter Server inventories were cleared above, it may take upwards of 30 minutes for all Syncing Inventory statuses to complete, depending on the vCenter sizes.


7) Once the vCenter reconnects have completed, start the remaining cells in the environment one at a time. Use 'tail -f /opt/vmware/vcloud-director/logs/cell.log' to confirm that a cell has reached 100% startup, then wait 10-15 minutes before starting the next cell.
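
The 100% startup check in steps 5 and 7 can also be scripted instead of watched by eye. The following Python sketch is illustrative only. It assumes that cell.log reports progress in lines containing text of the form "N% complete"; verify this against the actual log before relying on it. The sample lines are hypothetical and are not copied from a real log.

```python
import re

# Assumption: startup progress appears in cell.log as "<N>% complete".
PROGRESS_PATTERN = re.compile(r"(\d{1,3})% complete")


def startup_progress(lines):
    """Return the highest startup percentage found in the lines, or None."""
    highest = None
    for line in lines:
        match = PROGRESS_PATTERN.search(line)
        if match:
            percent = int(match.group(1))
            highest = percent if highest is None else max(highest, percent)
    return highest


def cell_started(lines):
    """Return True once a line reporting 100% has been seen."""
    return startup_progress(lines) == 100


# Hypothetical sample lines, for illustration only.
sample_lines = [
    "Application Initialization: 45% complete.",
    "Application Initialization: 100% complete.",
]

print(startup_progress(sample_lines), cell_started(sample_lines))
```

To check a real cell, the lines could be read with open('/opt/vmware/vcloud-director/logs/cell.log') and passed to cell_started().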
 


Affected Versions:


Zerto 4.5 update 1 and later


Hypervisor:


VMware