Topics: AIX, Oracle, SDD, Storage, System Admin

RAC OCR and VOTE LUNs

Consistent naming is not required for Oracle ASM devices, but LUNs used for the OCR and VOTE functions of Oracle RAC environments must have the same device names on all RAC systems. If the names of the OCR and VOTE devices differ between nodes, create a new device for each of these functions, on each of the RAC nodes, as follows:

First, check the PVIDs of each disk that is to be used as an OCR or VOTE device on all the RAC nodes. For example, if you're setting up a RAC cluster consisting of 2 nodes, called node1 and node2, check the disks as follows:

root@node1 # lspv | grep vpath | grep -i none
vpath6          00f69a11a2f620c5                    None
vpath7          00f69a11a2f622c8                    None
vpath8          00f69a11a2f624a7                    None
vpath13         00f69a11a2f62f1f                    None
vpath14         00f69a11a2f63212                    None

root@node2 /root # lspv | grep vpath | grep -i none
vpath4          00f69a11a2f620c5                    None
vpath5          00f69a11a2f622c8                    None
vpath6          00f69a11a2f624a7                    None
vpath9          00f69a11a2f62f1f                    None
vpath10         00f69a11a2f63212                    None
As you can see, vpath6 on node1 is the same disk as vpath4 on node2; you can determine this by looking at the PVID.
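
If you have many disks, comparing PVIDs by eye gets error-prone. Below is a minimal sketch that prints the PVIDs of both nodes sorted, so matching disks end up on adjacent lines; it assumes passwordless ssh from the node you run it on to both node1 and node2:
for node in node1 node2 ; do
  ssh $node "lspv | grep vpath | grep -i none" | while read dev pvid rest ; do
    echo "$pvid $node $dev"
  done
done | sort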

Check the major and minor numbers of each device:
root@node1 # cd /dev
root@node1 # lspv|grep vpath|grep None|awk '{print $1}'|xargs ls -als
0 brw-------    1 root     system       47,  6 Apr 28 18:56 vpath6
0 brw-------    1 root     system       47,  7 Apr 28 18:56 vpath7
0 brw-------    1 root     system       47,  8 Apr 28 18:56 vpath8
0 brw-------    1 root     system       47, 13 Apr 28 18:56 vpath13
0 brw-------    1 root     system       47, 14 Apr 28 18:56 vpath14

root@node2 # cd /dev
root@node2 # lspv|grep vpath|grep None|awk '{print $1}'|xargs ls -als
0 brw-------    1 root     system       47,  4 Apr 29 13:33 vpath4
0 brw-------    1 root     system       47,  5 Apr 29 13:33 vpath5
0 brw-------    1 root     system       47,  6 Apr 29 13:33 vpath6
0 brw-------    1 root     system       47,  9 Apr 29 13:33 vpath9
0 brw-------    1 root     system       47, 10 Apr 29 13:33 vpath10
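
If you prefer to gather the PVID and the minor number in a single pass, a small sketch like the following can be run on each node (it relies on the same lspv and ls output shown above; compare the PVID column across nodes to match the disks):
lspv | awk '/vpath/ && $3 == "None" {print $1, $2}' | while read dev pvid ; do
  minor=$(ls -l /dev/$dev | awk '{print $6}')
  echo "$dev pvid=$pvid minor=$minor"
done
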
Now, on each node, set up a consistent naming convention for the OCR and VOTE devices. For example, if you wish to set up 2 OCR and 3 VOTE devices:

On server node1:
# mknod /dev/ocr_disk01 c 47 6
# mknod /dev/ocr_disk02 c 47 7
# mknod /dev/voting_disk01 c 47 8
# mknod /dev/voting_disk02 c 47 13
# mknod /dev/voting_disk03 c 47 14
On server node2:
# mknod /dev/ocr_disk01 c 47 4
# mknod /dev/ocr_disk02 c 47 5
# mknod /dev/voting_disk01 c 47 6
# mknod /dev/voting_disk02 c 47 9
# mknod /dev/voting_disk03 c 47 10
This results in a consistent naming convention for the OCR and VOTE devices on both nodes:
root@node1 # ls -als /dev/*_disk*
0 crw-r--r-- 1 root system  47,  6 May 13 07:18 /dev/ocr_disk01
0 crw-r--r-- 1 root system  47,  7 May 13 07:19 /dev/ocr_disk02
0 crw-r--r-- 1 root system  47,  8 May 13 07:19 /dev/voting_disk01
0 crw-r--r-- 1 root system  47, 13 May 13 07:19 /dev/voting_disk02
0 crw-r--r-- 1 root system  47, 14 May 13 07:20 /dev/voting_disk03

root@node2 # ls -als /dev/*_disk*
0 crw-r--r-- 1 root system  47,  4 May 13 07:20 /dev/ocr_disk01
0 crw-r--r-- 1 root system  47,  5 May 13 07:20 /dev/ocr_disk02
0 crw-r--r-- 1 root system  47,  6 May 13 07:21 /dev/voting_disk01
0 crw-r--r-- 1 root system  47,  9 May 13 07:21 /dev/voting_disk02
0 crw-r--r-- 1 root system  47, 10 May 13 07:21 /dev/voting_disk03
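
If you have to repeat this on several nodes, the mknod commands can be generated from a small per-node mapping file. This is a sketch only; the file name /tmp/ocrvote.map and its layout (alias, major number, minor number) are assumptions, and the numbers must be the ones you collected on that particular node. For node1 the file would contain:
ocr_disk01 47 6
ocr_disk02 47 7
voting_disk01 47 8
voting_disk02 47 13
voting_disk03 47 14
The loop to create the devices from it:
while read alias major minor ; do
  mknod /dev/$alias c $major $minor
done < /tmp/ocrvote.map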

Topics: AIX, SAN, SDD, System Admin

Method error when running cfgmgr

If you see the following error when running cfgmgr:

Method error (/usr/lib/methods/fcmap >> /var/adm/essmap.out):
        0514-023 The specified device does not exist in the
                 customized device configuration database.
This error occurs when you have ESS driver filesets installed, but no ESS (type 2105) disks in use on the system. Check the disk types by running:
# lsdev -Cc disk | grep 2105
If no type 2105 disks are found, you can uninstall any ESS driver filesets:
# installp -u ibm2105.rte ibmpfe.essutil.fibre.data ibmpfe.essutil.rte
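
To double-check which ESS driver filesets are actually present before removing them, you can list them first (the fileset names are the ones used in the installp example above):
# lslpp -l | egrep "ibm2105|ibmpfe"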

Topics: AIX, SAN, SDD, Storage

PVID trouble

To add a PVID to a disk, enter:

# chdev -l vpathxx -a pv=yes
To clear all reservations from a previously used SAN disk:
# chpv -C vpathxx
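
To verify the result, check that lspv now shows a PVID in the second column, and, if you want to confirm what is written in the disk header itself, lquerypv can read it back (vpathxx is a placeholder, as above):
# lspv | grep vpathxx
# lquerypv -h /dev/vpathxx 80 10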

Topics: Installation, SAN, SDD

SDD upgrade from 1.6.X to 1.7.X

Whenever you need to perform an upgrade of SDD (and it is wise to keep it up-to-date), make sure you check the SDD documentation before doing so. Here are the quick steps to perform the update.

  • Check for any entries in the errorlog that could interfere with the upgrades:
    # errpt -a | more
  • Check if previously installed packages are OK:
    # lppchk -v
  • Commit any previously installed packages:
    # installp -c all
  • Make sure you have a recent mksysb image of the server and, before starting the updates to the rootvg, run an incremental TSM backup. It is also a good idea to prepare an alt_disk_install on the second boot disk.
  • For HACMP nodes: check the cluster status and log files to make sure the cluster is stable and ready for the upgrades.
  • Update fileset devices.fcp.disk.ibm to the latest level using smitty update_all.
  • For ESS environments: Update host attachment script ibm2105 and ibmpfe.essutil to the latest available levels using smitty update_all.
  • Enter the lspv command to find out all the SDD volume groups.
  • Enter the lsvgfs command for each SDD volume group to find out which file systems are mounted, e.g.:
    # lsvgfs vg_name
  • Enter the umount command to unmount all file systems belonging to the SDD volume groups.
  • Enter the varyoffvg command to vary off the volume groups (a combined sketch of the unmount and varyoff steps is shown at the end of this section).
  • If you are upgrading to an SDD version earlier than 1.6.0.0, or if you are upgrading to SDD 1.6.0.0 or later and your host is in an HACMP environment with nonconcurrent volume groups that are varied on on another host (that is, reserved by another host), run the vp2hd volume_group_name script to convert the volume group from SDD vpath devices to supported storage hdisk devices. Otherwise, skip this step.
  • Stop the SDD server:
    # stopsrc -s sddsrv
  • Remove all the SDD vpath devices:
    # rmdev -dl dpo -R
  • Use the smitty command to uninstall SDD: enter smitty deinstall and press Enter. The uninstallation process begins; complete it.
  • If you need to upgrade the AIX operating system, you could perform the upgrade now. If required, reboot the system after the operating system upgrade.
  • Use the smitty command to install the newer version of SDD. Note: it is also possible to run smitty update_all to simply update the SDD fileset without first uninstalling it, but IBM recommends uninstalling first, then patching the OS, and then installing the new SDD fileset.
  • Use the smitty device command to configure all the SDD vpath devices to the Available state.
  • Enter the lsvpcfg command to verify the SDD configuration.
  • If you are upgrading to an SDD version earlier than 1.6.0.0, run the hd2vp volume_group_name script for each SDD volume group to convert the physical volumes from supported storage hdisk devices back to the SDD vpath devices.
  • Enter the varyonvg command for each volume group that was previously varied offline.
  • Enter the lspv command to verify that all physical volumes of the SDD volume groups are SDD vpath devices.
  • Check for any errors:
    # errpt | more
    # lppchk -v
    # errclear 0
  • Enter the mount command to mount all file systems that were unmounted.
Attention: If an SDD volume group's physical volumes are a mix of hdisk devices and SDD vpath devices, you must run the dpovgfix utility to fix this problem. Otherwise, SDD will not function properly:
# dpovgfix vg_name
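
The unmount and varyoff steps referenced earlier can be combined into a small script. This is a sketch only: it assumes that every volume group living on vpath devices, except rootvg, should be taken offline, so review the list of volume groups it selects before running it:
for vg in $(lspv | awk '/vpath/ && $3 != "None" {print $3}' | sort -u) ; do
  [ "$vg" = "rootvg" ] && continue
  for fs in $(lsvgfs $vg) ; do
    umount $fs
  done
  varyoffvg $vg
done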

Topics: Hardware, SAN, SDD, Storage

How-to replace a failing HBA using SDD storage

This is a procedure for replacing a failing HBA or fibre channel adapter when used in combination with SDD storage:

  • Determine which adapter is failing (0, 1, 2, etcetera):
    # datapath query adapter
  • Check if there are dead paths for any vpaths:
    # datapath query device
  • Try to set a "degraded" adapter back to online using:
    # datapath set adapter 1 offline
    # datapath set adapter 1 online
    (the examples use adapter "1"; replace it with the number of the failing adapter).
  • If the adapter is still in a "degraded" status, open a call with IBM. They will most likely ask you to take a snap of the system and send the snap file to IBM for analysis, after which they will conclude whether the adapter needs to be replaced.
  • Involve the SAN storage team if the adapter needs to be replaced. They will have to update the WWN of the failing adapter when it is replaced by a new one with a new WWN.
  • If the adapter needs to be replaced, wait for the IBM CE to be onsite with the new HBA adapter. Note the new WWN and supply that to the SAN storage team.
  • Remove the adapter:
    # datapath remove adapter 1
    (replace "1" with the number of the failing adapter).
  • Check if the vpaths now all have one less path:
    # datapath query device | more
  • De-configure the adapter (this will also de-configure all the child devices, so you won't have to do this manually), by running: diag, choose Task Selection, Hot Plug Task, PCI Hot Plug manager, Unconfigure a Device. Select the correct adapter, e.g. fcs1, set "Unconfigure any Child Devices" to "yes", and "KEEP definition in database" to "no". Hit ENTER.
  • Replace the adapter: Run diag and choose Task Selection, Hot Plug Task, PCI Hot Plug manager, Replace/Remove a PCI Hot Plug Adapter. Choose the correct device (be careful, you won't see the adapter name here, but only "Unknown", because the device was unconfigured).
  • Have the IBM CE replace the adapter.
  • Close any events on the failing adapter on the HMC.
  • Validate that the notification LED is now off on the system, if not, go back into diag, choose Task Selection, Hot Plug Task, PCI Hot Plug Manager, and Disable the attention LED.
  • Check the adapter firmware level using:
    # lscfg -vl fcs1
    (replace this with the actual adapter name).

    And if required, update the adapter firmware microcode. Validate that the adapter is still functioning correctly by running:
    # errpt
    # lsdev -Cc adapter
  • Have the SAN admin update the WWN.
  • Run:
    # cfgmgr -S
  • Check the adapter and the child devices:
    # lsdev -Cc adapter
    # lsdev -p fcs1
    # lsdev -p fscsi1
    (replace this with the correct adapter name).
  • Add the paths to the device:
    # addpaths
  • Check if the vpaths have all paths again:
    # datapath query device | more
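
As a quick sanity check before and after the replacement, you can filter the datapath output for unhealthy paths. This is a sketch; the exact state strings (DEAD, INVALID, CLOSE_DEAD) may differ per SDD level:
# datapath query device | egrep -i "dead|invalid"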

Topics: SAN, SDD, Storage

Vpath commands

Check the relation between vpaths and hdisks:

# lsvpcfg
Check the status of the adapters according to SDD:
# datapath query adapter
Check on stale partitions:
# lsvg -o | lsvg -i | grep -i stale
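
To see which logical volumes actually carry the stale copies, loop over the online volume groups. A minimal sketch:
for vg in $(lsvg -o) ; do
  echo "### $vg"
  lsvg -l $vg | grep -i stale
done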

Topics: PowerHA / HACMP, SAN, SDD, Storage

Reservation bit

If you wish to get rid of the SCSI disk reservation bit on SCSI, SSA and VPATH devices, there are two ways of achieving this:

First, HACMP comes with some binaries that do this job:

# /usr/es/sbin/cluster/utilities/cl_SCSIdiskreset /dev/vpathx
Second, there is a small (unofficial) IBM binary tool called "lquerypr". This command is part of the SDD driver fileset. It can also release the persistent reservation bit and clear all reservations:

First check if you have any reservations on the vpath:
# lquerypr -vh /dev/vpathx
Clear it as follows:
# lquerypr -ch /dev/vpathx
In case this doesn't work, try the following sequence of commands:
# lquerypr -ch /dev/vpathx
# lquerypr -rh /dev/vpathx
# lquerypr -ph /dev/vpathx
If you'd like to see more information about lquerypr, simply run lquerypr without any options, and it will display extensive usage information.

For SDD, you should be able to use the following command to clear the persistent reservation:
# lquerypr -V -v -c /dev/vpathXX
For SDDPCM, use:
# pcmquerypr -V -v -c /dev/hdiskXX
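
To get an overview of the reservation status of all vpath devices on a system, a simple loop such as the one below can be used. This is a sketch only; check the lquerypr output on your SDD level before relying on it in scripts:
for d in /dev/vpath* ; do
  echo "### $d"
  lquerypr -vh $d
done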
