AIX rootvg failure monitoring

AIX has a new “critical volume group” capability which monitors for the loss or failure of a volume group. You can apply this to any volume group, including rootvg. If applied to rootvg, you can monitor for the loss of the root volume group.
This feature may be useful if your AIX LPAR experiences a loss of SAN connectivity, e.g. total loss of access to SAN storage and/or all SAN switches. Typically, when this happens, AIX will continue to run in memory for a period of time and will not immediately crash. Often you can still log on to the AIX system, but if you attempt to write a file you’ll see an I/O error. Even then, the system may (potentially) remain up. When the SAN issue is resolved, the AIX system may continue running with its file systems in read-only mode (or not; it depends), but to really resolve the issue you would still need to reboot the AIX LPAR in order for it to regain access to its disks. This can result in the need to run fsck against file systems. Note that the behaviour you encounter will be influenced by a variety of factors, such as the length and type of the outage. As always, your mileage may vary!
You can encounter this behaviour with both VSCSI and NPIV SAN booted LPARs. This new AIX VG option, which caters for the scenario described above, is not enabled by default. From the chvg man page:
“-r y | n    Changes the critical volume group (VG) option of the volume group.
    n    Disables the critical VG option.
    y    If the VG is set to the critical VG, any I/O request failure starts the Logical Volume Manager (LVM) metadata write operation to check the state of the disk before returning the I/O failure. If the critical VG option is set to rootvg and if the volume group loses access to the quorum set of disks (or all disks if quorum is disabled), instead of moving the VG to an offline state, the node is crashed and a message is displayed on the console.”
PowerHA now also caters for and supports this, and it should already be enabled by default. You want this feature enabled for your HA clusters so that they respond appropriately to the loss of the root volume group and initiate a failover.
smitty sysmirror
  -> Custom Cluster Configuration
    -> Events
      -> System Events
        -> Change/Show Event Response
(smitty cm_c...)

Type or select values in entry fields. Press Enter AFTER making all desired changes.

* Event Name
* Response
* Active
"Exploitation of LVM rootvg failure monitoring

AIX LVM has recently added the capability to change a volume group to be known as a critical volume group. Though PowerHA has allowed critical volume groups in the past, that only applied to non-operating system/data volume groups. PowerHA v7.2 now also takes advantage of this functionality specifically for rootvg. If the volume group is set to the critical VG, any I/O request failure starts the Logical Volume Manager (LVM) metadata write operation to check the state of the disk before returning the I/O failure. If the critical VG option is set to rootvg and if the volume group loses access to the quorum set of disks (or all disks if quorum is disabled), instead of moving the VG to an offline state, the node is crashed and a message is displayed on the console. You can set and validate rootvg as a critical volume group by executing the commands shown below. The command only has to be run once, since we are using the CAA distributed command, clcmd.
# clcmd chvg -r y rootvg
# clcmd lsvg rootvg | grep CRIT
DISK BLOCK SIZE: 512       CRITICAL VG: yes
DISK BLOCK SIZE: 512       CRITICAL VG: yes"
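As a small sketch (not from the article), the clcmd output above can be checked mechanically. The helper below is an assumption of mine: only the line format "... CRITICAL VG: yes|no" shown above is relied on, and the function name all_nodes_critical is invented.

```shell
# Hypothetical helper (all_nodes_critical is my name, not an AIX command):
# reads one "... CRITICAL VG: yes|no" line per node on stdin and succeeds
# only if every node reports "yes".
all_nodes_critical() {
    # grep -v finds any line NOT reporting "yes"; invert so that a clean
    # run (all "yes") returns success.
    ! grep -qv "CRITICAL VG: yes"
}

# Assumed usage on the cluster, piping the output shown above:
#   clcmd lsvg rootvg | grep "CRITICAL VG" | all_nodes_critical \
#       || echo "WARNING: rootvg is not critical on every node"
```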
To test this new feature in my lab, I simulated a disk "failure" or accidental unmapping/removal of a rootvg disk from an LPAR.
On the AIX LPAR, prior to disk failure simulation, I turn on the “CRITICAL VG” option for rootvg.
# oslevel -s
7200-00-00-0000

# lsvg rootvg | grep CRIT
DISK BLOCK SIZE: 512       CRITICAL VG: no

# chvg -r y rootvg

# lsvg rootvg | grep CRIT
DISK BLOCK SIZE: 512       CRITICAL VG: yes
On the VIOS, I unmap the rootvg disk from the corresponding vhost adapter:
$ lsmap -vadapter vhost30
SVSA            Physloc...
--------------- ----------
vhost30         U828...

VTD              ...
Status           ...
LUN              ...
Backing device   volu...
Physloc          ...
Mirrored         ...

VTD              ...
Status           ...
LUN              ...
Backing device   volu...
Physloc          ...
Mirrored         ...
$ lu -list | grep volume-A | grep Lab
volu...
volu...
$ lu -unmap -luudid acc7
$ lsmap -vadapter vhost30
SVSA            Physloc...
--------------- ----------
vhost30         U828...

VTD              ...
Status           ...
LUN              ...
Backing device   volu...
Physloc          ...
Mirrored         ...
On the AIX LPAR, I attempt to create (write) a file in /tmp (which resides in rootvg):
# touch /tmp/mynewfile
The LPAR stops responding immediately. I can no longer connect or login to it.
I then remap the disk and restart the LPAR. The AIX error report shows that the system halted due to a critical VG going offline.
$ lu -map -luudid acc7

$ lsmap -vadapter vhost30
SVSA            Physloc...
--------------- ----------
vhost30         U828...

VTD              ...
Status           ...
LUN              ...
Backing device   volu...
Physloc          ...
Mirrored         ...

VTD              ...
Status           ...
LUN              ...
Backing device   volu...
Physloc          ...
Mirrored         ...
# errpt -a
...
LABEL:          KERNEL_PANIC
IDENTIFIER:     225E3B63

Date/Time:       Wed Mar  1 14:27:51 AEDT 2017
Sequence Number: 215
Machine Id:      00F94F584C00
Node Id:         aix72lab
Class:           S
Type:            TEMP
WPAR:            Global
Resource Name:   PANIC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
ASSERT STRING

PANIC STRING
Critical VG Force off, halting.
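After a halt like the one above, you can check the error log for the tell-tale panic string. The helper below is a hedged sketch of mine: the function name was_critical_vg_halt is invented, and only the "Critical VG Force off" string from the errpt output above is relied on.

```shell
# Hypothetical post-reboot check (was_critical_vg_halt is my name): decide
# whether a panic string matches the critical VG force-off message shown
# in the errpt output above.
was_critical_vg_halt() {
    case "$1" in
        *"Critical VG Force off"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumed usage, pulling the KERNEL_PANIC entry from the AIX error log:
#   was_critical_vg_halt "$(errpt -a -J KERNEL_PANIC | grep 'Critical VG')" \
#       && echo "Last halt was a critical rootvg force-off"
```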
This feature is available with AIX 6.1 or later.
IV52743: ADD CRITICAL VG SUPPORT FOR ROOTVG (APPLIES TO AIX 7100-03)
Hi Chris,
I gave "Only log the event" as the response, but the node still crashes
and reboots, and I am not able to see any notification regarding that
(even though the 'odmget HACMPevent' entry is 'action = "NOTIFY_ONLY"'). I
don't see any logs regarding the rootvg failure except the KERNEL_PANIC in
errpt. If it is supposed to notify somewhere, where would it be? Or if
there is a way to monitor this process, please do help.
Thanks
Did you sync/verify the cluster after you changed the event?
# odmget HACMPeventmgr

HACMPeventmgr:
        name = "ROOTVG"
        action = "NOTIFY_ONLY"
        active = 1
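If you want to pull a single field out of a stanza like the one above (for a monitoring script, say), a small awk wrapper does the job. This is a sketch under my own assumptions: odm_field is an invented name, and it only handles quoted values in the stanza format shown above.

```shell
# Hypothetical helper (odm_field is my name, not an AIX command): print the
# quoted value of a field from an ODM stanza such as the HACMPeventmgr
# output above. Reads the stanza on stdin.
odm_field() {
    # Split on double quotes; on the line matching the field name, the
    # quoted value is the second field.
    awk -F'"' -v f="$1" '$0 ~ f" =" { print $2 }'
}

# Assumed usage on a cluster node:
#   odmget -q "name=ROOTVG" HACMPeventmgr | odm_field action
```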
You could try (temporarily) disabling the critical VG option, in order to
test the effectiveness of this function. Use the errpt command to monitor the
events logged. If the node is "crashing" as you say, then the AIX errpt
command will provide some clues as to what happened at the time. There
may also be something logged to /var/adm/ras/syslog.caa (but I'm not
sure). It's very difficult to diagnose and troubleshoot this kind of
problem in the comments section of a blog. If this capability is not
behaving as you expect, then you may need to check the levels of AIX
& PowerHA that are installed. Are you running the latest supported
levels of each? If you need urgent assistance, please open a PMR with
IBM support. Thank you for your comment.
Hi,
In Change/Show Event Response, if we give the response as "Only log
the event", what would the behaviour be, and how do we monitor that process?
Please explain.
Thanks
Hi,
The PowerHA knowledge centre states: "You can use the rootvg system
event to monitor the loss of access to rootvg. If the system loses
access, PowerHA SystemMirror logs an event in the system error log and
reboots the system by default. You can change this setting using SMIT to
log an event but not reboot the system".
https://www.ibm.com/support/knowledgecenter/en/SSPHQG_7.2.2/com.ibm.powerha.plangd/ha_plan_loss_quorom.htm
You can use the AIX errpt command to display each event that is logged.
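As a hedged illustration of that errpt-based monitoring, the filter below is my own sketch: flag_rootvg_event is an invented name, and the KERNEL_PANIC/ROOTVG patterns are assumptions based on the labels seen earlier in this post.

```shell
# Hypothetical filter (flag_rootvg_event is my name): print an alert line
# for error-log entries that look related to the rootvg system event or a
# kernel panic; print nothing otherwise.
flag_rootvg_event() {
    case "$1" in
        *KERNEL_PANIC*|*ROOTVG*) echo "ALERT: $1" ;;
    esac
}

# Assumed usage, watching new entries as they are logged (errpt -c reports
# entries concurrently, as they occur):
#   errpt -c | while read -r line; do flag_rootvg_event "$line"; done
```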
Thanks for your comment.
regards,
Chris