Objective 2.3 is broken down as follows
- Analyze and resolve storage multi-pathing and failover issues
- Troubleshoot storage device connectivity
- Analyze and resolve Virtual SAN configuration issues
- Troubleshoot iSCSI connectivity issues
- Analyze and resolve NFS issues
- Troubleshoot RDM issues
This objective will come down to experience more than anything, but I will try to cover what I can.
Analyze and resolve storage multi-pathing and failover issues
Multipathing and failover issues can occur for many different reasons and, as with a lot of this objective, it will come down to in-the-field experience. I will only concentrate on the vSphere side rather than looking at troubleshooting a switch or SAN controller failure. In my lab I have a Nimble appliance with one controller, and my hosts are configured with a software iSCSI adapter with two VMkernel interfaces.
The failover sequence is as follows
- The connection along a given path is detected as offline
- ESXi host stops its iSCSI session
- As a result the iSCSI task is aborted
- The Native Multipathing Plugin (NMP) detects a host status of 0x1, which translates to NO_CONNECT
- Once received, the NMP sends a TEST_UNIT_READY (TUR) command down the path to confirm that it is down
- If this fails, the Path Selection Policy (PSP) activates the next path for the device LUN
- The NMP retries the queued commands down this path to ensure they complete successfully following the failover
- ESXi host sends a LUN reset if there is a pending SCSI reservation against the device or LUN - this ensures that the SCSI-2 based reservation from the previous initiator is broken
- ESXi host can now retry the next command in the queue
- Logs report the path failover was successful
- Software iSCSI initiator will also report the session as ONLINE
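To see this from the host side, a couple of commands are worth knowing. As a rough sketch (device and path names will obviously differ in your environment), the following display the PSP in use for each device and the state of every path
>esxcli storage nmp device list
>esxcli storage core path list
A path reported as dead in the second command lines up with the failover sequence described above.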
To see more detail see VMware KB. To see this process in the logs you must view the VMkernel log found on the host at /var/log/vmkernel.log
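Rather than reading the whole log, the NMP and path related entries can be pulled out from the ESXi shell - the exact message text varies between builds, so treat the search string as a starting point
>grep -i nmp /var/log/vmkernel.log
>tail -f /var/log/vmkernel.log
The second command is useful for watching the failover happen live while you pull a path.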
Performance can be monitored by running esxtop. A VMware KB goes into detail, which can be found here.
To monitor performance per HBA I first run esxtop when connected to a host. To change the display to the disk adapter (HBA) view I press D
To monitor storage performance on a per-LUN basis I press U
Other than the read and write columns, the other column details are explained here
Notice in the example of monitoring performance per LUN that the IOPS at that point in time for the vmhba41 adapter were reported at 582.30. That adapter is my software iSCSI adapter.
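If you want to capture these counters over time rather than watching them interactively, esxtop also has a batch mode. A minimal sketch - 5 second samples, 12 iterations, with the output path just an example - would be
>esxtop -b -d 5 -n 12 > /tmp/storage-stats.csv
The resulting CSV can then be reviewed afterwards in Windows Perfmon or a spreadsheet.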
Troubleshoot storage device connectivity
Following the vSphere Troubleshooting guide I have picked out some of the items listed; the full list can be found in the vSphere Troubleshooting documentation.
Storage devices may not appear correctly in the Web Client, or not all devices are available to all of the hosts.
- Check cable connectivity
- Check zoning for FC
- Check access control config - for iSCSI, additionally check that CHAP, IP-based filtering and initiator name-based access control are set up correctly
- Make sure cross patching is correct between storage controllers
- For any changes rescan the HBA / host
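The rescan and a check of what the host can now see can both be done from the command line - as a rough example (a rescan can also be limited to a single adapter with the -A option)
>esxcli storage core adapter rescan --all
>esxcli storage core device list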
The maximum queue depth can be changed on an ESXi host. If an ESXi host generates more commands to a LUN than the LUN queue depth can handle, the excess commands are queued in the VMkernel, which increases latency.
To change the queue depth on a FC HBA run the following and reboot the host
>esxcli system module parameters set -p parameter=value -m module
For the parameter and module values to use for your particular HBA, see the VMware KB.
For iSCSI run the following
>esxcli system module parameters set -m iscsi_vmk -p iscsivmk_LunQDepth=value
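After the reboot it is worth confirming the parameter actually took. Listing the module parameters is a quick check - shown here for the software iSCSI module
>esxcli system module parameters list -m iscsi_vmk | grep -i LunQDepth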
Analyze and resolve Virtual SAN configuration issues
Virtual SAN has its own troubleshooting guide that details more than I can cover. This can be found here.
Things to look out for at a high level are
- Physical hardware compatibility
- Supported VSAN configuration / disk group configuration
- Absent vs Degraded
- Expected failure behaviour for HDD / SSD failure, host failure or network connectivity failure
- Adding new HDD or SSD
- Identical VLAN / MTU / subnets across the hosts
- Is multicast used on the network - multiple VSAN clusters on the same network must have unique multicast addresses
- Check flow control is enabled
Once VSAN has been configured, some of the standout commands are as follows. To view which VMkernel interface is being used run the following - this displays the interface name, multicast group address, master multicast address and master multicast port
>esxcli vsan network list
To see what vSwitch it is attached to and settings such as MTU run the following
>esxcli network ip interface list
To check the cluster status
>esxcli vsan cluster get
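To check the disk group configuration mentioned above - which disks have been claimed and whether each is SSD or HDD - the following is also handy
>esxcli vsan storage list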
Troubleshoot iSCSI connectivity issues
iSCSI connectivity is IP-based so it is straightforward to troubleshoot. To test whether the target device is pingable, or whether anything else on the iSCSI subnet is pingable, I can specify my iSCSI VMkernel interface using the following
>vmkping -I vmk3 10.10.13.211
Notice the first time I run this I get no response; I then change the VMkernel interface and it replies OK
There could be an issue with the MTU configuration. The MTU settings must match end to end for it to work correctly - that is, from hosts to switch to storage device. To test the MTU settings are correct run the following - I am testing an MTU setting of 9000 (minus 28 bytes for overhead) to another ESXi host
>vmkping -I vmk3 10.0.99.40 -d -s 8972
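If the jumbo frame test fails, confirm the MTU configured at each layer on the host. As a sketch, the standard vSwitch MTU can be checked with the command below, alongside the esxcli network ip interface list command shown earlier for the VMkernel interfaces (use the dvs variant if you are on a distributed switch)
>esxcli network vswitch standard list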
Other things to check for iSCSI could include
- CHAP authentication is configured correctly
- Access control to the storage device has been configured correctly either by initiator IP or name
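Once basic IP connectivity looks fine, the state of the iSCSI adapter and its sessions can also be checked from the host, roughly as follows
>esxcli iscsi adapter list
>esxcli iscsi session list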
Analyze and resolve NFS issues
Similar to iSCSI, troubleshooting NFS starts with troubleshooting IP connectivity. Run vmkping, specifying the interface, to test connectivity and MTU settings.
Additionally, run the netcat (nc) command to see if the port is reachable on the NFS server - the default port is 2049.
>nc -z 10.10.11.200 2049
Permissions on the NFS server must also be set correctly. Make sure the permissions for the ESXi host have not been set to Read-only, and make sure the volume has not been mounted as Read-only on the host.
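The mount state, including whether the datastore ended up read-only on the host, can be checked with the following - the read-only flag appears in the output on recent ESXi versions
>esxcli storage nfs list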
Troubleshoot RDM issues
Storage vendors might require that VMs with RDMs ignore SCSI INQUIRY data cached by ESXi. When a host first connects to the target storage device, it issues the SCSI INQUIRY command to obtain basic identification data from the device, which ESXi caches; the data remains unchanged after this.
To configure the VM with a RDM to ignore SCSI INQUIRY cache add the following to the .vmx file
>scsix:y.ignoreDeviceInquiryCache = "true"
Where x = the SCSI controller number and y = the SCSI target number of the RDM
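As an illustration only - the controller and target numbers here are made up, so use the values from your own VM's configuration - the entry for an RDM attached as SCSI 0:1 would be
>scsi0:1.ignoreDeviceInquiryCache = "true"
If you are unsure which device a mapping file points at, vmkfstools -q can be run against it (the path below is just an example)
>vmkfstools -q /vmfs/volumes/datastore1/MyVM/MyVM_1-rdm.vmdk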