Topics: AIX, PowerHA / HACMP

AIX 5.3 end-of-service

The EOM (end of marketing) date for AIX 5.3 has been announced as 04/11, meaning that AIX 5.3 will no longer be marketed by IBM from April 2011, and that it is now time for customers to start thinking about upgrading to AIX 6.1. The EOS (end of service) date for AIX 5.3 is 04/12, meaning AIX 5.3 will be serviced by IBM until April 2012; after that, IBM will only service AIX 5.3 for an additional fee. The EOL (end of life) date is 04/16, i.e. April 2016. The final technology level for AIX 5.3 is technology level 12, although some service packs for TL12 will still be released.
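
If you want to check which AIX level a system is currently running before planning an upgrade, the oslevel command shows the version, technology level and service pack:

# oslevel -s

The output has the form 5300-12-01-1016: base level, technology level, service pack, and build date.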

IBM has also announced EOM and EOS dates for HACMP 5.4 and PowerHA 5.5, so if you're running either of these versions, you also need to upgrade to PowerHA 6.1:

  • Sep 30, 2010: EOM HACMP 5.4, PowerHA 5.5
  • Sep 30, 2011: EOS HACMP 5.4
  • Sep 30, 2012: EOS HACMP 5.5

Topics: AIX, EMC, Installation, PowerHA / HACMP, SAN, System Admin

Quick setup guide for HACMP

Use this procedure to quickly configure an HACMP cluster consisting of 2 nodes, with disk heartbeating.

Prerequisites:

Make sure you have the following in place:

  • Have the IP addresses and host names of both nodes, and of a service IP label, available. Add these to the /etc/hosts files on both nodes of the new HACMP cluster (an example is shown after this list).
  • Make sure you have the HACMP software installed on both nodes. Just install all the filesets from the HACMP CD-ROM, and you should be good.
  • Make sure you have this entry in /etc/inittab (as one of the last entries):
    clinit:a:wait:/bin/touch /usr/es/sbin/cluster/.telinit
  • In case you're using EMC SAN storage, make sure you configure your disks correctly as hdiskpower devices. Or, if you're using a mksysb image, you may want to follow the EMC ODM cleanup procedure.
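
For example, a minimal /etc/hosts setup for a two-node cluster could look like this (the host names and addresses below are made up for illustration):

10.251.14.50   node01
10.251.14.51   node02
10.251.20.10   serviceip
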
Steps:
  • Create the cluster and its nodes:
    # smitty hacmp
    Initialization and Standard Configuration
    Configure an HACMP Cluster and Nodes
    
    Enter a cluster name and select the nodes you're going to use. It is vital here to have the host names and IP addresses correctly entered in the /etc/hosts file of both nodes.
  • Create an IP service label:
    # smitty hacmp
    Initialization and Standard Configuration
    Configure Resources to Make Highly Available
    Configure Service IP Labels/Addresses
    Add a Service IP Label/Address
    
    Enter an IP Label/Address (press F4 to select one), and enter a Network name (again, press F4 to select one).
  • Set up a resource group:
    # smitty hacmp
    Initialization and Standard Configuration
    Configure HACMP Resource Groups
    Add a Resource Group
    
    Enter the name of the resource group. It's a good habit to make sure that a resource group name ends with "rg", so you can recognize it as a resource group. Also, select the participating nodes. For the "Fallback Policy", it is a good idea to change it to "Never Fallback". This way, when the primary node in the cluster comes up, and the resource group is up-and-running on the secondary node, you won't see a failover occur from the secondary to the primary node.

    Note: The order of the nodes is determined by the order you select the nodes here. If you put in "node01 node02" here, then "node01" is the primary node. If you want to have this any other way, now is a good time to correctly enter the order of node priority.
  • Add the Service IP/Label to the resource group:
    # smitty hacmp
    Initialization and Standard Configuration
    Configure HACMP Resource Groups
    Change/Show Resources for a Resource Group (standard)
    
    Select the resource group you've created earlier, and add the Service IP/Label.
  • Run a verification/synchronization:
    # smitty hacmp
    Extended Configuration
    Extended Verification and Synchronization
    
    Just hit [ENTER] here. Resolve any issues that come up from this synchronization attempt, and repeat the process until the verification/synchronization returns "Ok". It's a good idea here to select "Automatically correct errors".
  • Start the HACMP cluster:
    # smitty hacmp
    System Management (C-SPOC)
    Manage HACMP Services
    Start Cluster Services
    
    Select both nodes to start. Make sure to also start the Cluster Information Daemon.
  • Check the status of the cluster:
    # clstat -o
    # cldump
    
    Wait until the cluster is stable and both nodes are up.
Basically, the cluster is now up-and-running. However, during the Verification & Synchronization step, it will complain about not having a non-IP network. The next part describes how to set up a disk heartbeat network, which allows the nodes of the HACMP cluster to exchange disk heartbeat packets over a SAN disk. We're assuming here that you're using EMC storage. The process on other types of SAN storage is more or less similar, apart from some naming differences; e.g. SAN disks are called "hdiskpower" devices on EMC storage and "vpath" devices on IBM SAN storage.

First, look at the available SAN disk devices on your nodes, and select a small disk that won't be used to store any data, but only for the purpose of disk heartbeating. It is a good habit to request your SAN storage admin to zone a small LUN to both nodes of the HACMP cluster as a disk heartbeating device. Make a note of the PVID of this disk device. For example, if you choose to use device hdiskpower4:
# lspv | grep hdiskpower4
hdiskpower4   000a807f6b9cc8e5    None
So, we're going to set up the disk heartbeat network on device hdiskpower4, with PVID 000a807f6b9cc8e5:
  • Create a concurrent volume group:
    # smitty hacmp
    System Management (C-SPOC)
    HACMP Concurrent Logical Volume Management
    Concurrent Volume Groups
    Create a Concurrent Volume Group
    
    Select both nodes to create the concurrent volume group on by pressing F7 for each node. Then select the correct PVID. Give the new volume group a name, for example "hbvg".
  • Set up the disk heartbeat network:
    # smitty hacmp
    Extended Configuration
    Extended Topology Configuration
    Configure HACMP Networks
    Add a Network to the HACMP Cluster
    
    Select "diskhb" and accept the default Network Name.
  • Run a discovery:
    # smitty hacmp
    Extended Configuration
    Discover HACMP-related Information from Configured Nodes
    
  • Add the disk device:
    # smitty hacmp
    Extended Configuration
    Extended Topology Configuration
    Configure HACMP Communication Interfaces/Devices
    Add Communication Interfaces/Devices
    Add Discovered Communication Interface and Devices
    Communication Devices
    
    Select the same disk device on each node by pressing F7.
  • Run a Verification & Synchronization again, as described earlier above. Then check with clstat and/or cldump again whether the disk heartbeat network comes online; you can also check the topology services, as shown below.
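
A quick way to confirm that the disk heartbeat network is actually alive, is to look at the status of the topology services subsystem, and check that the diskhb network is listed and up:

# lssrc -ls topsvcs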

Topics: AIX, PowerHA / HACMP, System Admin

NFS mounts on HACMP failing

When you want to mount an NFS file system on a node of an HACMP cluster, there are a couple of items you need to check before it will work:

  • Make sure the hostname and IP address of the HACMP node are resolvable and provide the correct output, by running:
    # nslookup [hostname]
    # nslookup [ip-address]
    
  • The next thing you will want to check on the NFS server is whether the node names of your HACMP cluster nodes are correctly added to the /etc/exports file. If they are, run:
    # exportfs -va
  • The last, and tricky, item you will want to check is whether a service IP label is defined as an IP alias on the same adapter as your node's hostname, e.g.:
    # netstat -nr
    Routing tables
    Destination   Gateway       Flags  Refs  Use    If  Exp  Groups
    
    Route Tree for Protocol Family 2 (Internet):
    default       10.251.14.1   UG      4    180100 en1  -     -
    10.251.14.0   10.251.14.50  UHSb    0         0 en1  -     -
    10.251.14.50  127.0.0.1     UGHS    3    791253 lo0  -     -
    
    The example above shows you that the default gateway is defined on the en1 interface. The next command shows you where your Service IP label lives:
    # netstat -i
    Name  Mtu   Network   Address         Ipkts   Ierrs Opkts
    en1   1500  link#2    0.2.55.d3.75.77 2587851 0      940024
    en1   1500  10.251.14 node01          2587851 0      940024
    en1   1500  10.251.20 serviceip       2587851 0      940024
    lo0   16896 link#1                    1912870 0     1914185
    lo0   16896 127       loopback        1912870 0     1914185
    lo0   16896 ::1                       1912870 0     1914185
    
    As you can see, the Service IP label (called "serviceip" in the example above) is defined on en1. In that case, for NFS to work, you also want to add "serviceip" to the /etc/exports file on the NFS server and re-run "exportfs -va"; an example follows below. You should also make sure that hostname "serviceip" resolves correctly to an IP address (and, of course, that the IP address resolves to the correct hostname) on both the NFS server and the client.
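
For example, an entry in /etc/exports on the NFS server could look like this (the directory and host names are illustrative):

/export/data -access=node01:node02:serviceip,root=node01:node02:serviceip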

Topics: AIX, EMC, PowerHA / HACMP, SAN, Storage, System Admin

Missing disk method in HACMP configuration

You may run into an issue when trying to bring up a resource group, where the hacmp.out log file contains the following:

cl_disk_available[187] cl_fscsilunreset fscsi0 hdiskpower1 false
cl_fscsilunreset[124]: openx(/dev/hdiskpower1, O_RDWR, 0, SC_NO_RESERVE): Device busy
cl_fscsilunreset[400]: ioctl SCIOLSTART id=0X11000 lun=0X1000000000000 : Invalid argument
To resolve this, you will have to make sure that the SCSI reset disk method is configured in HACMP. For example, when using EMC storage:

Make sure the emcpowerreset utility is present as /usr/lpp/EMC/Symmetrix/bin/emcpowerreset.

Then add new custom disk method:
  • Enter the SMIT fastpath for HACMP: "smitty hacmp".
  • Select Extended Configuration.
  • Select Extended Resource Configuration.
  • Select HACMP Extended Resources Configuration.
  • Select Configure Custom Disk Methods.
  • Select Add Custom Disk Methods.
      Change/Show Custom Disk Methods

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

                                                 [Entry Fields]
* Disk Type (PdDvLn field from CuDv)             disk/pseudo/power
* New Disk Type                                  [disk/pseudo/power]
* Method to identify ghost disks                 [SCSI3]
* Method to determine if a reserve is held       [SCSI_TUR]
* Method to break reserve                        [/usr/lpp/EMC/Symmetrix/bin/emcpowerreset]
  Break reserves in parallel                     true
* Method to make the disk available              [MKDEV]
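
If you're unsure what value to enter in the "Disk Type (PdDvLn field from CuDv)" field, you can look it up in the ODM for one of your disks (the device name here is just an example):

# odmget -q name=hdiskpower1 CuDv | grep PdDvLn
        PdDvLn = "disk/pseudo/power"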

Topics: PowerHA / HACMP, System Admin

Synchronizing 2 HACMP nodes

In order to keep users, all their related settings, and crontab files synchronized on both nodes of an HACMP cluster, here's a script that you can use to do this for you:

sync.ksh
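
As a rough idea of what such a script can do, here's a minimal sketch; the peer node name and the file list are assumptions, and the actual sync.ksh may well work differently:

#!/usr/bin/ksh
# Sketch: push user definitions and crontab files to the other cluster node.
OTHER=node02                              # assumed name of the peer node
# Copy user- and group-related files.
for f in /etc/passwd /etc/group /etc/security/passwd /etc/security/user ; do
    scp -p $f ${OTHER}:$f
done
# Copy the crontab files of all users.
scp -p /var/spool/cron/crontabs/* ${OTHER}:/var/spool/cron/crontabs/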

Topics: AIX, Networking, PowerHA / HACMP

Using an alternative MAC address

HACMP is capable of using an alternative MAC address in combination with its service address. So, how do you set this MAC address without HACMP, just using the command line? (This could come in handy in case you wish to configure the service address on a system without having to start HACMP.)

# ifconfig enX down
# ifconfig enX detach
# chdev -l entX -a use_alt_addr=yes
# chdev -l entX -a alt_addr=0x00xxxxxxxxxx
# ifconfig enX xxx.xxx.xxx.xxx
# ifconfig enX up
And if you wish to remove it again:
# ifconfig enX down
# ifconfig enX detach
# chdev -l entX -a use_alt_addr=no
# chdev -l entX -a alt_addr=0x000000000000
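
You can verify which hardware address the adapter is currently using with the entstat command, for example:

# entstat -d entX | grep -i "hardware address"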

Topics: AIX, PowerHA / HACMP, System Admin

Email messages from the cron daemon

Some user accounts, mostly service accounts, may create a lot of email messages, for example when a lot of commands are run by the cron daemon for a specific user. There are a couple of ways to deal with this:

1. Make sure no unnecessary emails are sent at all

To avoid receiving messages from the cron daemon, always redirect the output of commands in crontabs to a file or to /dev/null. Make sure to redirect STDERR as well:

0 * * * * /path/to/command > /path/to/logfile 2>&1
1 * * * * /path/to/command > /dev/null 2>&1
2. Make sure the commands in the crontab actually exist

An entry in a crontab with a command that does not exist will generate an email message from the cron daemon to the user, informing the user about this issue. This may occur on HACMP clusters where crontab files are synchronized on all HACMP nodes. They need to be synchronized on all the nodes, just in case a resource group fails over to a standby node. However, the required file systems containing the commands may not be available on all the nodes at all times. To get around that, test if the command exists first:
0 * * * * [ -x /path/to/command ] && /path/to/command > /path/to/logfile 2>&1
3. Clean up the email messages regularly

The last way of dealing with this is to add another cron entry to a user's crontab that cleans out the mailbox every night. For example, the next command deletes all but the last 1000 messages from a user's mailbox:
0 * * * * echo d1-$(let num="$(echo f|mail|tail -1|awk '{print $2}')-1000";echo $num)|mail >/dev/null
4. Forward the email to the user

Very effective: Create a .forward file in the user's home directory, to forward all email messages to the user. If the user starts receiving many, many emails, he/she will surely do something about it when it gets annoying.
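
Setting this up can be as simple as the following (the user name and e-mail address are just examples):

# echo "user@example.com" > /home/username/.forward
# chown username /home/username/.forward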

Topics: Monitoring, PowerHA / HACMP

HACMP auto-verification

HACMP automatically runs a verification every night, usually around midnight. With a very simple command you can check the status of this verification run:

# tail -10 /var/hacmp/log/clutils.log 2>/dev/null|grep detected|tail -1
If this shows a return code of 0, the cluster verification ran without any errors. Anything else, you'll have to investigate. You can run this command on all your HACMP clusters, allowing you to verify your HACMP cluster status every day; see the example below.
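
For example, a small script like the following could collect the verification status of all your clusters from a central server (the node list file and the ssh setup are assumptions):

#!/usr/bin/ksh
# Report the last nightly HACMP verification result for each node.
# /etc/hacmp_nodes is an assumed file with one node name per line.
while read host ; do
    print -n "${host}: "
    ssh ${host} 'tail -10 /var/hacmp/log/clutils.log 2>/dev/null | grep detected | tail -1'
done < /etc/hacmp_nodes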

With the following smitty menu you can change the time when the auto-verification runs and if it should produce debug output or not:
# smitty clautover.dialog
You can check with:
# odmget HACMPcluster
# odmget HACMPtimersvc
Be aware that if you change the runtime of the auto-verification, you have to synchronize the cluster afterwards to update the other nodes in the cluster.

Topics: LVM, PowerHA / HACMP, System Admin

VGDA out of sync

With HACMP, you can run into the following error during a verification/synchronization:

WARNING: The LVM time stamp for shared volume group: testvg is inconsistent with the time stamp in the VGDA for the following nodes: host01

To correct the above condition, run verification & synchronization with "Automatically correct errors found during verification?" set to either 'Yes' or 'Interactive'. The cluster must be down for the corrective action to run.


This can happen when you've added additional space to a logical volume/file system from the command line instead of using the smitty hacmp menu. But you certainly don't want to take down the entire HACMP cluster just to resolve this warning.

First of all, you don't have to: the cluster will fail over nicely anyway, even without these VGDAs being in sync. But still, it is an annoying warning that you would like to get rid of.

Have a look at your shared logical volumes. By using the lsattr command, you can see if they are actually in sync or not:

host01 # lsattr -Z: -l testlv -a label -a copies -a size -a type -a strictness -Fvalue
/test:1:809:jfs2:y:

host02 # lsattr -Z: -l testlv -a label -a copies -a size -a type -a strictness -Fvalue
/test:1:806:jfs2:y:
Well, there you have it: one host reports testlv having a size of 809 LPs, the other says it's 806. Not good. You will run into this when you've used the extendlv and chfs commands to increase the size of a shared file system. You should have used the smitty menu.

The good thing is, HACMP will sync the VGDAs if you do some kind of logical volume operation through the smitty hacmp menu. So, either increase the size of a shared logical volume through the smitty menu by just one LP (and, of course, also increase the size of the corresponding file system); or create an additional shared logical volume of just one LP through smitty, and remove it again afterwards.

When you've done that, simply re-run the verification/synchronization, and you'll notice that the warning message is gone. Make sure you run the lsattr command again on your shared logical volumes on all the nodes in your cluster to confirm; see the example below.
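
A quick, illustrative way to compare the size of a shared logical volume on all nodes (assuming ssh access between the nodes):

# for h in host01 host02 ; do print -n "$h: "; ssh $h 'lsattr -El testlv -a size -Fvalue'; done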

Topics: AIX, Networking, PowerHA / HACMP

Specifying the default gateway on a specific interface

When you're using HACMP, you usually have multiple network adapters installed, and thus multiple network interfaces to deal with. If AIX configured the default gateway on the wrong interface (like on your management interface instead of the boot interface), you may want to change this, so network traffic isn't sent over the management interface. Here's how you can do this:

First, stop HACMP or do a take-over of the resource groups to another node; this will avoid any problems with applications when you start fiddling with the network configuration.

Then open up a virtual terminal window to the host on your HMC. Otherwise you would lose the connection as soon as you drop the current default gateway.

Now you need to determine where your current default gateway is configured. You can do this by typing:

# lsattr -El inet0
# netstat -nr
The lsattr command will show you the current default gateway route and the netstat command will show you the interface it is configured on. You can also check the ODM:
# odmget -q"attribute=route" CuAt
Now, delete the default gateway like this:
# lsattr -El inet0 | awk '$2 ~ /hopcount/ { print $2 }' | read GW
# chdev -l inet0 -a delroute=${GW}
If you now use the route command to specify the default gateway on a specific interface, like this:
# route add 0 [ip address of default gateway: xxx.xxx.xxx.254] -if enX
You will have a working entry for the default gateway. But the route command does not change anything in the ODM, so as soon as your system reboots, the default gateway is gone again. Not a good idea.

A better solution is to use the chdev command:
# chdev -l inet0 -a addroute=net,-hopcount,0,,0,[ip address of default gateway]
This will set the default gateway to the first interface available.

To specify the interface use:
# chdev -l inet0 -a addroute=net,-hopcount,0,if,enX,,0,[ip address of default gateway]
Substitute the correct interface for enX in the command above.

If you previously used the route add command, and after that you use chdev to enter the default gateway, this will fail. You have to delete the route first using route delete 0, and then issue the chdev command, as shown in the example below.
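
So the full sequence to move from a route-based to an ODM-based default gateway looks like this (the interface and gateway address are examples):

# route delete 0
# chdev -l inet0 -a addroute=net,-hopcount,0,if,en1,,0,10.251.14.1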

Afterwards, check if the new default gateway is properly configured:
# lsattr -El inet0
# odmget -q"attribute=route" CuAt
And of course, try to ping the IP address of the default gateway and some outside address. Now reboot your system and check if the default gateway remains configured on the correct interface. And start up HACMP again!
