Objective 2.3 is broken down as follows
- Analyze and resolve storage multi-pathing and failover issues
- Troubleshoot storage device connectivity
- Analyze and resolve Virtual SAN configuration issues
- Troubleshoot iSCSI connectivity issues
- Analyze and resolve NFS issues
- Troubleshoot RDM issues
This objective will come down to experience more than anything, but I will try to cover what I can.
Analyze and resolve storage multi-pathing and failover issues
Multipathing and failover issues can occur for many different reasons and, as with a lot of this objective, it will come down to in-the-field experience. I will only concentrate on the vSphere side rather than looking at troubleshooting a switch or SAN controller failure. In my lab I have a Nimble appliance with one controller, and my hosts are configured with a software iSCSI adapter with two VMkernel interfaces.
The failover sequence is as follows
- The connection along a given path is detected as offline
- ESXi host stops its iSCSI session
- As a result the iSCSI task is aborted
- The Native Multipathing Plugin (NMP) detects a host status of 0x1, which translates to NO_CONNECT
- Once received, the NMP sends a TEST_UNIT_READY (TUR) command down the path to confirm that it is down
- If this fails, the Path Selection Policy (PSP) activates the next path for the device LUN
- The NMP retries the queued commands down this path to ensure they complete successfully following the failover
- ESXi host sends a LUN reset if there is a pending SCSI reservation against the device or LUN - this ensures that the SCSI-2 based reservation from the previous initiator is broken
- ESXi host can now retry the next command in the queue
- Logs report the path failover was successful
- Software iSCSI initiator will also report the session as ONLINE
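To see this from the host side, a couple of commands are worth knowing. As a rough sketch (device and path names will obviously differ in your environment), the following display the PSP in use for each device and the state of every path
>esxcli storage nmp device list
>esxcli storage core path list
A path reported as dead in the second command lines up with the failover sequence described above.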
To see more detail see VMware KB. To see this process in the logs you must view the VMkernel log found on the host at /var/log/vmkernel.log
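Rather than reading the whole log, the NMP and path related entries can be pulled out from the ESXi shell - the exact message text varies between builds, so treat the search string as a starting point
>grep -i nmp /var/log/vmkernel.log
>tail -f /var/log/vmkernel.log
The second command is useful for watching the failover happen live while you pull a path.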
Performance can be monitored by running esxtop. A VMware KB goes into detail, which can be found here.
To monitor performance per HBA I first run esxtop when connected to a host. To change the display to the disk adapter (HBA) view I press D
To monitor storage performance on a per-LUN basis I press U
Other than the read and write columns, the other column details are explained here
Notice in the example of monitoring performance per LUN that the IOPS at that point in time for the vmhba41 adapter were reported at 582.30. That adapter is my software iSCSI adapter.
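If you want to capture these counters over time rather than watching them interactively, esxtop also has a batch mode. A minimal sketch - 5 second samples, 12 iterations, with the output path just an example - would be
>esxtop -b -d 5 -n 12 > /tmp/storage-stats.csv
The resulting CSV can then be reviewed afterwards in Windows Perfmon or a spreadsheet.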
Troubleshoot storage device connectivity
Following the vSphere Troubleshooting guide I have picked out some of the items listed; the full list can be found in the vSphere Troubleshooting documentation.
Storage devices may not appear correctly in the Web Client, or not all devices are available to all of the hosts.
- Check cable connectivity
- Check zoning for FC
- Check access control config - for iSCSI, additionally check that CHAP, IP-based filtering and initiator name-based access control are set up correctly
- Make sure cross patching is correct between storage controllers
- For any changes rescan the HBA / host
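The rescan and a check of what the host can now see can both be done from the command line - as a rough example (a rescan can also be limited to a single adapter with the -A option)
>esxcli storage core adapter rescan --all
>esxcli storage core device list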
The maximum queue depth can be changed on an ESXi host. If an ESXi host generates more commands to a LUN than the LUN queue depth can handle, the excess commands are queued in the VMkernel, which increases latency.
To change the queue depth on a FC HBA run the following and reboot the host
>esxcli system module parameters set -p parameter=value -m module
For the parameter and module values to use for your particular HBA, see the VMware KB.
For iSCSI run the following
>esxcli system module parameters set -m iscsi_vmk -p iscsivmk_LunQDepth=value
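After the reboot it is worth confirming the parameter actually took. Listing the module parameters is a quick check - shown here for the software iSCSI module
>esxcli system module parameters list -m iscsi_vmk | grep -i LunQDepth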
Analyze and resolve Virtual SAN configuration issues
Virtual SAN has its own troubleshooting guide that details more than I can cover. This can be found here.
Things to look out for at a high level are
- Physical hardware compatibility
- Supported VSAN configuration / disk group configuration
- Absent vs Degraded
- Expected failure behaviour for HDD / SSD failure, host failure or network connectivity failure
- Adding new HDD or SSD
- Identical VLAN / MTU / subnets across the hosts
- Is multicast used on the network - multiple VSAN clusters on the same network must have unique multicast addresses
- Check flow control is enabled
Once VSAN has been configured, some of the standout commands are as follows. To view which VMkernel interface is being used run the following - this displays the interface name, multicast group address, master multicast address and master multicast port
>esxcli vsan network list
To see what vSwitch it is attached to and settings such as MTU run the following
>esxcli network ip interface list
To check the cluster status
>esxcli vsan cluster get
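To check the disk group configuration mentioned above - which disks have been claimed and whether each is SSD or HDD - the following is also handy
>esxcli vsan storage list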
Troubleshoot iSCSI connectivity issues
iSCSI connectivity is IP-based so it is straightforward to troubleshoot. To test whether the target device is pingable, or whether anything else on the iSCSI subnet is pingable, I can specify my iSCSI VMkernel interface using the following
>vmkping -I vmk3 10.10.13.211
Notice the first time I run this I get no response; I then change the VMkernel interface and it replies OK
There could be an issue with the MTU configuration. The MTU settings must match end to end for it to work correctly - that is, from hosts to switch to storage device. To test the MTU settings are correct run the following - I am testing an MTU setting of 9000 (minus 28 bytes for overhead) to another ESXi host
>vmkping -I vmk3 10.0.99.40 -d -s 8972
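If the jumbo frame test fails, confirm the MTU configured at each layer on the host. As a sketch, the standard vSwitch MTU can be checked with the command below, alongside the esxcli network ip interface list command shown earlier for the VMkernel interfaces (use the dvs variant if you are on a distributed switch)
>esxcli network vswitch standard list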
Other things to check for iSCSI could include
- CHAP authentication is configured correctly
- Access control to the storage device has been configured correctly either by initiator IP or name
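Once basic IP connectivity looks fine, the state of the iSCSI adapter and its sessions can also be checked from the host, roughly as follows
>esxcli iscsi adapter list
>esxcli iscsi session list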
Analyze and resolve NFS issues
Similar to iSCSI, troubleshooting NFS starts with troubleshooting IP connectivity. Run vmkping, specifying the interface, to test connectivity and MTU settings.
Additionally, run the netcat (nc) command to see if the port is reachable on the NFS server - the default port is 2049.
>nc -z 10.10.11.200 2049
Permissions on the NFS server must also be set correctly. Make sure the permissions for the ESXi host have not been set to Read-only, and make sure the volume has not been mounted as Read-only on the host.
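The mount state, including whether the datastore ended up read-only on the host, can be checked with the following - the read-only flag appears in the output on recent ESXi versions
>esxcli storage nfs list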
Troubleshoot RDM issues
Storage vendors might require that VMs with RDMs ignore SCSI INQUIRY data cached by ESXi. When a host first connects to the target storage device, it issues the SCSI INQUIRY command to obtain basic identification data from the device, which ESXi caches; the data remains unchanged after this.
To configure the VM with a RDM to ignore SCSI INQUIRY cache add the following to the .vmx file
>scsix:y.ignoreDeviceInquiryCache = "true"
Where x = the SCSI controller number and y = the SCSI target number of the RDM
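As an illustration only - the controller and target numbers here are made up, so use the values from your own VM's configuration - the entry for an RDM attached as SCSI 0:1 would be
>scsi0:1.ignoreDeviceInquiryCache = "true"
If you are unsure which device a mapping file points at, vmkfstools -q can be run against it (the path below is just an example)
>vmkfstools -q /vmfs/volumes/datastore1/MyVM/MyVM_1-rdm.vmdk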