VCAP6-DCV Deployment – Objective 2.3 – Troubleshoot Complex Storage Solutions



Objective 2.3 is broken down as follows

  • Analyze and resolve storage multi-pathing and failover issues
  • Troubleshoot storage device connectivity
  • Analyze and resolve Virtual SAN configuration issues
  • Troubleshoot iSCSI connectivity issues
  • Analyze and resolve NFS issues
  • Troubleshoot RDM issues

This objective will come down to experience more than anything but I will try and cover what I can.


Analyze and resolve storage multi-pathing and failover issues

Multipathing and failover issues can occur for many different reasons and, as with a lot of this objective, it will come down to experience in the field.  I will concentrate only on the vSphere side rather than troubleshooting a switch or SAN controller failure.  In my lab I have a Nimble appliance with one controller, and my hosts are configured with a software iSCSI adapter with two VMkernel interfaces

vcap2.3-01

The failover sequence is as follows

  1. The connection along a given path is detected as offline
  2. ESXi host stops its iSCSI session
  3. As a result the iSCSI task is aborted
  4. The Native Multipathing Plugin (NMP) detects a host status of 0x1 – this translates to NO_CONNECT
  5. Once received, the NMP sends a TEST_UNIT_READY (TUR) command down the path to confirm that it is down
  6. If this fails, the Path Selection Policy (PSP) activates the next path for the device or LUN
  7. The NMP retries the queued commands down this path to ensure they complete successfully following the failover
  8. ESXi host sends a LUN reset if there is a pending SCSI reservation against the device or LUN – this ensures that the SCSI-2 based reservation from the previous initiator is broken
  9. ESXi host can now retry the next command in the queue
  10. Logs report the path failover was successful
  11. Software iSCSI initiator will also report the session as ONLINE
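Step 4 relies on reading a hexadecimal host status code. As a rough aid when reading logs, the sketch below maps a few host status codes to names. The function name `host_status_name` is hypothetical, and the table is only a partial subset of the values VMware documents, so anything not listed is reported as UNKNOWN rather than guessed.

```shell
# Hypothetical helper: translate a SCSI host status code into a name.
# Only a subset of codes is listed; this is a sketch, not a full table.
host_status_name() {
    case "$1" in
        0x0) echo "OK" ;;
        0x1) echo "NO_CONNECT" ;;
        0x2) echo "BUS_BUSY" ;;
        0x3) echo "TIMEOUT" ;;
        0x5) echo "ABORT" ;;
        0x8) echo "RESET" ;;
        *)   echo "UNKNOWN" ;;
    esac
}

host_status_name 0x1
```

Run with `0x1`, the code from step 4, this prints NO_CONNECT.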

For more detail see the VMware KB.  To see this process in the logs you must view the VMkernel log, found on the host at /var/log/vmkernel.log

vcap2.3-02

vcap2.3-03
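The log can be long, so it may help to filter it down to the multipathing and iSCSI entries. The sketch below is one way to do that. The function name `failover_lines` is hypothetical, and the search terms `NMP` and `iscsi` are an assumption about which lines are of interest, so adjust them to suit what your own log shows.

```shell
# Hypothetical helper: show only lines mentioning NMP or iSCSI.
# Defaults to the VMkernel log; pass another file as the first argument.
failover_lines() {
    grep -i -E 'NMP|iscsi' "${1:-/var/log/vmkernel.log}"
}

# Example (on the ESXi host): show the most recent matching entries
# failover_lines | tail -n 20
```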

Performance can be monitored by running esxtop.  VMware KB goes into detail which can be found here.  

To monitor performance per HBA I first run esxtop when connected to a host.  To change the display to the disk adapter (HBA) view I press d

vcap2.3-04

To monitor storage performance on a per LUN basis I press u

vcap2.3-05

Other than the read and write columns, the other columns are explained here

vcap2.3-11

Notice in the per LUN performance monitoring example that the IOPS for the vmhba41 adapter at that point in time was reported as 582.30.  That adapter is my software iSCSI adapter.


Troubleshoot storage device connectivity

I have picked out some of the items listed in the vSphere Troubleshooting guide.

A storage device may not appear correctly in the Web Client, or not all devices are available to all of the hosts.

  • Check cable connectivity
  • Check zoning for FC
  • Check access control config – for iSCSI, additionally check that CHAP, IP-based filtering and initiator name-based access control are set up correctly
  • Make sure cross patching is correct between storage controllers
  • For any changes rescan the HBA / host

The maximum queue depth can be changed on an ESXi host.  If an ESXi host generates more commands to a LUN than the LUN queue depth can handle, the excess commands are queued in the VMkernel, which increases latency.

To change the queue depth on a FC HBA run the following and reboot the host

>esxcli system module parameters set -p parameter=value -m module

For the parameter and module values see

vcap2.3-10

For iSCSI run the following

>esxcli system module parameters set -m iscsi_vmk -p iscsivmk_LunQDepth=value
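Because `value` has to be replaced by hand, a small wrapper can catch a typo before it reaches the host. The sketch below only builds and prints the command so it can be reviewed first. The function name `iscsi_qdepth_cmd` and the example depth of 64 are hypothetical, and the check is simply "positive integer", not the supported range for your storage.

```shell
# Hypothetical helper: print the esxcli command for a given iSCSI LUN
# queue depth. It does not run the command, it only echoes it.
iscsi_qdepth_cmd() {
    case "$1" in
        ''|*[!0-9]*)
            echo "queue depth must be a positive integer" >&2
            return 1 ;;
    esac
    [ "$1" -gt 0 ] || { echo "queue depth must be a positive integer" >&2; return 1; }
    echo "esxcli system module parameters set -m iscsi_vmk -p iscsivmk_LunQDepth=$1"
}

iscsi_qdepth_cmd 64
```

To confirm what is currently set, `esxcli system module parameters list -m iscsi_vmk` lists the module's parameters and their values.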


Analyze and resolve Virtual SAN configuration issues

Virtual SAN has its own troubleshooting guide that details more than I can cover.  This can be found here.

Things to look out for at a high level will be

  • Physical hardware compatibility
  • Supported VSAN configuration / disk group configuration
  • Absent vs Degraded
  • Expected failure behaviour for HDD / SSD failure, host failure or network connectivity failure
  • Adding new HDD or SSD
  • Identical VLAN / MTU / subnets across the hosts
  • Is multicast used on the network – multiple VSAN clusters on the same network must have unique multicast addresses
  • Check flow control is enabled

Once VSAN has been configured, some of the stand-out commands are as follows.  To view which VMkernel interface is being used run the following – this displays the interface name, multicast group address, master multicast address and master multicast port

>esxcli vsan network list

To see what vSwitch it is attached to and settings such as MTU run the following

>esxcli network ip interface list
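When comparing MTU across several hosts it can be easier to reduce that output to one line per interface. The sketch below does this with awk. The function name `vmk_mtus` is hypothetical, and it assumes the output contains indented `Name:` and `MTU:` lines for each interface, so check it against the output on your own hosts.

```shell
# Hypothetical helper: read "esxcli network ip interface list" output on
# stdin and print "<interface> <mtu>" for each VMkernel interface.
# Assumes each interface block has indented "Name:" and "MTU:" lines.
vmk_mtus() {
    awk '/^ *Name:/ { name = $2 }
         /^ *MTU:/  { print name, $2 }'
}

# Example (on the ESXi host):
# esxcli network ip interface list | vmk_mtus
```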

To check the cluster status

>esxcli vsan cluster get


Troubleshoot iSCSI connectivity issues

iSCSI connectivity is IP-based so it is easy to troubleshoot.  To test whether the target device, or anything else on the iSCSI subnet, is pingable I can specify my iSCSI VMkernel interface using the following

>vmkping -I vmk3 10.10.13.211

vcap2.3-06

Notice that the first time I run this I get no response.  I then change the VMkernel interface and it replies OK

There could be an issue with MTU configuration.  The MTU settings must match end to end for it to work correctly – that is from hosts to switch to storage device.  To test that the MTU settings are correct run the following – I am testing an MTU setting of 9000 (minus 28 bytes for overhead) to another ESXi host

>vmkping -I vmk3 10.0.99.40 -d -s 8972

vcap2.3-08
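The 28 bytes of overhead are the 20-byte IPv4 header plus the 8-byte ICMP header, which is where the 8972 value comes from. As a small sketch, the function below (the name `vmkping_payload` is hypothetical) works out the value to pass to `-s` for any MTU.

```shell
# Hypothetical helper: payload size to pass to "vmkping -s" for a given MTU.
# 20-byte IPv4 header + 8-byte ICMP header = 28 bytes of overhead.
vmkping_payload() {
    echo $(( $1 - 28 ))
}

vmkping_payload 9000   # prints 8972
vmkping_payload 1500   # prints 1472
```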

Other things to check for iSCSI could include

  • CHAP authentication is configured correctly
  • Access control to the storage device has been configured correctly either by initiator IP or name

Analyze and resolve NFS issues

Similar to iSCSI, troubleshooting NFS starts with troubleshooting IP connectivity.  Run vmkping, specifying the interface, to test connectivity and MTU settings.

Additionally run netcat (nc) command to see if the port is reachable on the NFS server – default port 2049.

>nc -z 10.10.11.200 2049

vcap2.3-09

Permissions on the NFS server must also be set correctly.  Make sure the permissions for the ESXi host have not been set to read-only and make sure the volume has not been mounted as read-only on the host.


Troubleshoot RDM issues

Storage vendors might require that VMs with RDMs ignore SCSI INQUIRY data cached by ESXi.  When a host first connects to the target storage device it issues the SCSI INQUIRY command to obtain basic identification data from the device which ESXi will cache, the data remains unchanged after this.

To configure a VM with an RDM to ignore the SCSI INQUIRY cache add the following to the .vmx file

>scsix:y.ignoreDeviceInquiryCache = "true"

Where x = the SCSI controller number and y = the SCSI target number of the RDM
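To avoid mistyping the x and y values, the line can be generated rather than typed. The sketch below prints the .vmx entry for a given controller and target number, using plain ASCII double quotes around the value. The function name `rdm_inquiry_line` and the example numbers 1 and 0 are hypothetical.

```shell
# Hypothetical helper: print the .vmx line for an RDM on SCSI controller $1,
# target $2, using plain ASCII double quotes around the value.
rdm_inquiry_line() {
    printf 'scsi%s:%s.ignoreDeviceInquiryCache = "true"\n' "$1" "$2"
}

rdm_inquiry_line 1 0
```

For controller 1 and target 0 this prints `scsi1:0.ignoreDeviceInquiryCache = "true"`, which can then be added to the .vmx file.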
