AIX rootvg failure monitoring

AIX has a new “critical volume group” capability which monitors for the loss or failure of a volume group. You can apply this to any volume group, including rootvg. If applied to rootvg, you can monitor for the loss of the root volume group.
This feature may be useful if your AIX LPAR experiences a loss of SAN connectivity, e.g. total loss of access to SAN storage and/or all SAN switches. Typically, when this happens, AIX will continue to run in memory for a period of time and will not immediately crash. Often you can still log on to the AIX system, but if you attempt to write a file you’ll see an I/O error. Even then, the system may (potentially) remain up. When the SAN issue is resolved, the AIX system may continue running with its file systems in read-only mode (or not; it depends), but to really resolve the issue you would still need to reboot the AIX LPAR in order for it to regain access to its disks. This can result in the need to run fsck against file systems. Note that the behaviour you encounter will be influenced by a variety of factors, such as the length and type of the outage. As always, your mileage may vary!
You can encounter this behaviour with both VSCSI and NPIV SAN booted LPARs. This new AIX VG option, which caters for the scenario described above, is not enabled by default. From the chvg man page:
“-r y | n    Changes the critical volume group (VG) option of the volume group.
    n    Disables the critical VG option.
    y    If the VG is set to the critical VG, any I/O request failure starts the Logical Volume Manager (LVM) metadata write operation to check the state of the disk before returning the I/O failure. If the critical VG option is set to rootvg and if the volume group loses access to the quorum set of disks (or all disks if quorum is disabled), instead of moving the VG to an offline state, the node is crashed and a message is displayed on the console.”
PowerHA now also caters for and supports this, and it should already be enabled by default. You want this feature enabled for your HA clusters so that they respond appropriately to the loss of the root volume group and initiate a failover.
smitty sysmirror
  -> Custom Cluster Configuration
    -> Events
      -> System Events
        -> Change/Show Event Response
(smitty cm_c...)

Type or select values in entry fields. Press Enter AFTER making all desired changes.

* Event Name
* Response
* Active
"Exploitation of LVM rootvg failure monitoring

AIX LVM has recently added the capability to change a volume group to be known as a critical volume group. Though PowerHA has allowed critical volume groups in the past, that only applied to non-operating system/data volume groups. PowerHA v7.2 now also takes advantage of this functionality specifically for rootvg. If the volume group is set to the critical VG, any I/O request failure starts the Logical Volume Manager (LVM) metadata write operation to check the state of the disk before returning the I/O failure. If the critical VG option is set to rootvg and if the volume group loses access to the quorum set of disks (or all disks if quorum is disabled), instead of moving the VG to an offline state, the node is crashed and a message is displayed on the console. You can set and validate rootvg as a critical volume group by executing the commands shown below. The command only has to be run once, since we are using the CAA distributed command, clcmd.
# clcmd chvg -r y rootvg
# clcmd lsvg rootvg | grep CRIT
DISK BLOCK SIZE: 512       CRITICAL VG: yes
DISK BLOCK SIZE: 512       CRITICAL VG: yes"
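As a small sketch (not from the article), the clcmd output above can be checked mechanically. The helper below is an assumption of mine: only the line format "... CRITICAL VG: yes|no" shown above is relied on, and the function name all_nodes_critical is invented.

```shell
# Hypothetical helper (all_nodes_critical is my name, not an AIX command):
# reads one "... CRITICAL VG: yes|no" line per node on stdin and succeeds
# only if every node reports "yes".
all_nodes_critical() {
    # grep -v finds any line NOT reporting "yes"; invert so that a clean
    # run (all "yes") returns success.
    ! grep -qv "CRITICAL VG: yes"
}

# Assumed usage on the cluster, piping the output shown above:
#   clcmd lsvg rootvg | grep "CRITICAL VG" | all_nodes_critical \
#       || echo "WARNING: rootvg is not critical on every node"
```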
To test this new feature in my lab, I simulated a disk "failure" or accidental unmapping/removal of a rootvg disk from an LPAR.
On the AIX LPAR, prior to disk failure simulation, I turn on the “CRITICAL VG” option for rootvg.
# oslevel -s
7200-00-00-0000

# lsvg rootvg | grep CRIT
DISK BLOCK SIZE: 512       CRITICAL VG: no

# chvg -r y rootvg

# lsvg rootvg | grep CRIT
DISK BLOCK SIZE: 512       CRITICAL VG: yes
On the VIOS, I unmap the rootvg disk from the corresponding vhost adapter:
$ lsmap -vadapter vhost30
SVSA            Physloc...
--------------- ----------
vhost30         U828...

VTD              ...
Status           ...
LUN              ...
Backing device   volu...
Physloc          ...
Mirrored         ...

VTD              ...
Status           ...
LUN              ...
Backing device   volu...
Physloc          ...
Mirrored         ...
$ lu -list | grep volume-A | grep Lab
volu...
volu...
$ lu -unmap -luudid acc7
$ lsmap -vadapter vhost30
SVSA            Physloc...
--------------- ----------
vhost30         U828...

VTD              ...
Status           ...
LUN              ...
Backing device   volu...
Physloc          ...
Mirrored         ...
On the AIX LPAR, I attempt to create (write) a file in /tmp (which resides in rootvg):
# touch /tmp/mynewfile
The LPAR stops responding immediately. I can no longer connect or login to it.
I then remap the disk and restart the LPAR. The AIX error report shows that the system halted due to a critical VG going offline.
$ lu -map -luudid acc7

$ lsmap -vadapter vhost30
SVSA            Physloc...
--------------- ----------
vhost30         U828...

VTD              ...
Status           ...
LUN              ...
Backing device   volu...
Physloc          ...
Mirrored         ...

VTD              ...
Status           ...
LUN              ...
Backing device   volu...
Physloc          ...
Mirrored         ...
# errpt -a
...
LABEL:          KERNEL_PANIC
IDENTIFIER:     225E3B63

Date/Time:       Wed Mar  1 14:27:51 AEDT 2017
Sequence Number: 215
Machine Id:      00F94F584C00
Node Id:         aix72lab
Class:           S
Type:            TEMP
WPAR:            Global
Resource Name:   PANIC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
ASSERT STRING

PANIC STRING
Critical VG Force off, halting.
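After a halt like the one above, you can check the error log for the tell-tale panic string. The helper below is a hedged sketch of mine: the function name was_critical_vg_halt is invented, and only the "Critical VG Force off" string from the errpt output above is relied on.

```shell
# Hypothetical post-reboot check (was_critical_vg_halt is my name): decide
# whether a panic string matches the critical VG force-off message shown
# in the errpt output above.
was_critical_vg_halt() {
    case "$1" in
        *"Critical VG Force off"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumed usage, pulling the KERNEL_PANIC entry from the AIX error log:
#   was_critical_vg_halt "$(errpt -a -J KERNEL_PANIC | grep 'Critical VG')" \
#       && echo "Last halt was a critical rootvg force-off"
```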
This feature is available with AIX 6.1 or later.
IV52743: ADD CRITICAL VG SUPPORT FOR ROOTVG (APPLIES TO AIX 7100-03)
Hi Chris,
I gave "Only log the event" as the response, but the node still crashes
and reboots, and I am not able to see any notification regarding that
(even though the 'odmget HACMPevent' entry is 'action = "NOTIFY_ONLY"'). I
don't see any logs regarding the rootvg failure except the KERNEL_PANIC in
errpt. If it is supposed to notify somewhere, where would it be? Or if
there is a way to monitor this process, please do help.
Thanks
Did you sync/verify the cluster after you changed the event?
# odmget HACMPeventmgr

HACMPeventmgr:
        name = "ROOTVG"
        action = "NOTIFY_ONLY"
        active = 1
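If you want to pull a single field out of a stanza like the one above (for a monitoring script, say), a small awk wrapper does the job. This is a sketch under my own assumptions: odm_field is an invented name, and it only handles quoted values in the stanza format shown above.

```shell
# Hypothetical helper (odm_field is my name, not an AIX command): print the
# quoted value of a field from an ODM stanza such as the HACMPeventmgr
# output above. Reads the stanza on stdin.
odm_field() {
    # Split on double quotes; on the line matching the field name, the
    # quoted value is the second field.
    awk -F'"' -v f="$1" '$0 ~ f" =" { print $2 }'
}

# Assumed usage on a cluster node:
#   odmget -q "name=ROOTVG" HACMPeventmgr | odm_field action
```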
You could try (temporarily) disabling the critical VG option, in order to
test the effectiveness of this function. Use the errpt command to monitor the
events logged. If the node is "crashing" as you say, then the AIX errpt
command will provide some clues as to what happened at the time. There
may also be something logged to /var/adm/ras/syslog.caa (but I'm not
sure). It's very difficult to diagnose and troubleshoot this kind of
problem in the comments section of a blog. If this capability is not
behaving as you expect, then you may need to check the levels of AIX
& PowerHA that are installed. Are you running the latest supported
levels of each? If you need urgent assistance, please open a PMR with
IBM support. Thank you for your comment.
Hi,
In Change/Show Event Response, if we give the response as "Only log
the event", what would the behaviour be, and how do we monitor that process?
Please explain.
Thanks
Hi,
The PowerHA knowledge centre states: "You can use the rootvg system
event to monitor the loss of access to rootvg. If the system loses
access, PowerHA SystemMirror logs an event in the system error log and
reboots the system by default. You can change this setting using SMIT to
log an event but not reboot the system".
https://www.ibm.com/support/knowledgecenter/en/SSPHQG_7.2.2/com.ibm.powerha.plangd/ha_plan_loss_quorom.htm
You can use the AIX errpt command to display each event that is logged.
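As a hedged illustration of that errpt-based monitoring, the filter below is my own sketch: flag_rootvg_event is an invented name, and the KERNEL_PANIC/ROOTVG patterns are assumptions based on the labels seen earlier in this post.

```shell
# Hypothetical filter (flag_rootvg_event is my name): print an alert line
# for error-log entries that look related to the rootvg system event or a
# kernel panic; print nothing otherwise.
flag_rootvg_event() {
    case "$1" in
        *KERNEL_PANIC*|*ROOTVG*) echo "ALERT: $1" ;;
    esac
}

# Assumed usage, watching new entries as they are logged (errpt -c reports
# entries concurrently, as they occur):
#   errpt -c | while read -r line; do flag_rootvg_event "$line"; done
```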
Thanks for your comment.
regards,
Chris