DLPAR not doing it for you anymore? (Chris's AIX Blog)

The developerWorks Connections platform will be sunset on December 31, 2019. On January 1, 2020, this blog will no longer be available. More details available on our FAQ.

DLPAR not doing it for you anymore?

cggibbo Feb 25 2011 Comments (18) Visits (110168)

5 people like this

I received an email from one of my customers recently that simply said:

“mate …on lpar30, for some reason the IBM.DRM and others are failing to start .. um .. any chance you could have a quick look at that ?”

So I asked, “OK, so this use to work right?”. To which I received a relatively confused reply, ”.....yep it did....actually no....it’s never worked....or has it???...I’m not sure...”.

Based on my experience, the most common issue that prevents DLPAR operations from working are network problems. Before diving into the deep end and trying to debug RSCT, it’s always best to start with the basics. For example, can you ping the HMC from the LPAR? Can you ping the LPAR from the HMC? If either of these tests fails, check the network configuration on both components before doing anything else.

On the HMC check the network settings first e.g.

· Click on HMC Configuration and then Customize Network Settings.

– Verify the IP address, netmask, default gateway, network routes, DNS server are all set correctly.

– Check the LPAR communications box in HMC configuration screen for LAN adapter that is used for HMC-to-LPAR communication.

– By the way, unlike POWER4 systems, LPARs on POWER5 and POWER6 systems do not depend on host name resolution for DLPAR operations.

· Check routing on the LPAR and the HMC.

– Use ping and the HMC’s Test Network Connectivity task to verify the LPAR and the HMC can communicate with each other.

If you check the network and you are happy that the LPAR and the HMC can communicate, then perhaps you need to re-initialise the RMC subsystems on the AIX LPAR. Run the following commands:

# /usr/sbin/rsct/bin/rmcctrl –z

# /usr/sbin/rsct/bin/rmcctrl –A

# /usr/sbin/rsct/bin/rmcctrl –p

Wait up to 5 minutes before trying DLPAR again. If DLPAR still doesn’t work i.e. the HMC is still reporting no values for DCaps, and the IBM.DRM subsystem still won’t start, try using the recfgct command.

WARNING: The recfgct command referenced above is *not* supported for use by customers without direct IBM support instructions. It erases all RSCT configuration info and makes it look like the node was just installed. This may be fine for DLPAR recycling, but if you have any other products dependent on RSCT on the partition in question, you will be *broken*. In particular, PowerHA 7 will crash, and Tivoli SAMP will have all its cluster info destroyed, partitioning it from the rest of the domain until it can be manually re-added (and it may also crash, depending on the presence of resources). If you find that DLPAR is not working, and all other network checks and even the RMC recycling (-z/-A/-p) does not work, it is strongly recommended that you use the ctsnap command to gather data and contact IBM support. (Capturing iptrace for a few minutes would not be a bad idea either. A complementary tcpdump on the HMC would also be good, but this may not be possible for most customers given HMC's access restrictions.) Then, if you wish to proceed with recfgct and find that it does resolve whatever the problem was, it would be equally wise to gather another ctsnap after the partition is once again connected to the HMC, to compare to the previous one.

CAUTION: Running the recfgct command on a node in a RSCT peer domain or in a Cluster Aware AIX (CAA) environment should NOT be done before taking other precautions first. This note is not designed to cover all CAA or other RSCT cluster considerations so if you have an application that is RSCT aware such as PowerHA, VIOS Storage Pools and several others do not proceed until you have contacted support. If you need to determine if your system is a member of a CAA cluster then please refer to the Reliable Scalable Cluster Technology document titled, "Troubleshooting the resource monitoring and control (RMC) subsystem". http://www-01.ibm.com/support/knowledgecenter/SGVKBA_3.1.5/com.ibm.rsct315.trouble/bl507_diagrmc.htm

hscroot@hmc1:~> lspartition -dlpar

.....

<#5> LPAR:<24*9117-MMB*10284FP, , 192.168.1.15>

Active:<1>, OS:<AIX, 6.1, 6100-05-01-1016>, DCaps:<0x0>, CmdCaps:<0x0, 0x0>, PinnedMem:<768>

.....

# /usr/sbin/rsct/install/bin/recfgct

Wait 5 minutes. This should resolve your DLPAR issue. The IBM.DRM subsystem should now be active and there should be good (non-zero) values for DCaps:

# lssrc -g rsct_rm

Subsystem Group PID Status

IBM.DRM rsct_rm 6881300 active

IBM.CSMAgentRM rsct_rm 7274530 active

IBM.ServiceRM rsct_rm 6029480 active

IBM.AuditRM rsct_rm 6357058 active

IBM.ERRM rsct_rm 4456566 active

IBM.LPRM rsct_rm 6946986 active

hscroot@hmc1:~> lspartition -dlpar

....

<#5> LPAR:<24*9117-MMB*10284FP, , 192.168.1.15>

Active:<1>, OS:<AIX, 6.1, 6100-05-01-1016>, DCaps:<0xc5f>, CmdCaps:<0x1b, 0x1b>, PinnedMem:<994>

....

Only run the rmcctrl and recfgct commands if you believe something has become corrupt in the RMC configuration of the LPAR. The fastest way to fix a broken configuration or to clear out the RMC ACL files after cloning (via alt_disk migration) is to use the recfgct command.

These daemons should work “out of the box” and are not typically the cause of DLPAR issues. However, you can try stopping and starting the daemons when troubleshooting DLPAR issues.

The rmcctrl -z command just stops the daemons. The rmcctrl -A command ensures that the subsystem group (rsct) and the subsystem (ctrmc) objects are added to the SRC, and an appropriate entry added to the end of /etc/inittab and it starts the daemons.

The rmcctrl –p command enables the daemons for remote client connections i.e. from the HMC to the LPAR and vice versa.

If you are familiar with the System Resource Controller (SRC) you might be tempted to use stopsrc and startsrc commands to stop and start these daemons.

Do not do it; use the rmcctrl commands instead.

If /var is 100% full, use chfs to expand it. If there is no more space available, examine subdirectories and remove unnecessary files (for example, trace.*, core, and so forth). If /var is full, RMC subsystems may fail to function correctly.

The polling interval for the RMC daemons on the LPAR to check with the HMC daemons is 5-7 minutes; so you need to wait long enough for the daemons to start up and synchronize.

The Resource Monitoring and Control (RMC) daemons are part of the Reliable, Scalable Cluster Technology (RSCT) and are controlled by the System Resource Controller (SRC). These daemons run in all LPARs and communicate with equivalent RMC daemons running on the HMC. The daemons start automatically when the operating system starts and synchronize with the HMC RMC daemons.

The daemons in the LPARs and the daemons on the HMC must be able to communicate over the network for DLPAR operations to succeed. This is not the network connection between the managed system (FSP) and the HMC; it is the network connection between the operating system (AIX) in each LPAR and the HMC.

Note: Apart from rebooting, there is no way to stop and start the RMC daemons on the HMC.

The following links also contain some (out dated) information relating to DLPAR verification and troubleshooting. Even though it is quite old some of it is still relevant today and is good a place to start.

The most common reasons for failures with Dynamic Logical LPARing

http://www-01.ibm.com/support/docview.wss?rs=111&context=SWG10&q1=dcaps&uid=isg3T1010615&loc=en_US&cs=utf-8&lang=en

Dynamic LPAR tips and checklists for RMC authentication and authorization

http://www.ibm.com/developerworks/systems/articles/DLPARchecklist.html

Setting up the HMC/LPARs hostname and network (old but interesting)

http://computerstec.com/aixhowto/aixh18.htm

lsLPAR -dlpar, DCAPs values (old but still applies)

http://community.livejournal.com/eserver/18517.html

The previous link (above) provides some information relating to the values for DCaps and what they mean (also out dated):

0 - DR CPU capable (can move CPUs)

1 - DR MEM capable (can move memory)

2 - DR I/O capable (can move I/O resources)

3 - DR PCI Bridge (can move PCI bridges)

4 - DR Entitlement (POWER 5 can change shared entitlement)

5 - Multiple DR CPU (AIX 5.3 can move 2+ CPUs at once)

0x3f = max, and 0xf is common for AIX 5.2

If you are interested in how HMC and LPAR authentication works with DLPAR, then read on. Otherwise, happy DLPARing!

HMC and LPAR authentication (RSCT authentication)

The diagram below outlines how the HMC and an LPAR authenticate with each other in order for DLPAR operations to work. RSCT authentication is used to ensure the HMC is communicating with the correct LPAR.

Authentication is the process of ensuring that another party is who it claims to be. Authorization is the process by which a cluster software component grants or denies resources based on certain criteria. The RSCT component that implements authorization is RMC. It uses access control list (ACL) files to control user access to resources.

The RMC component subsystem uses cluster security services to map the operating system user identifiers, specified in the ACL file, to network security identifiers to determine if the user has the correct permissions. This is performed by the identity mapping service, which uses information stored in the identity mapping files ctsec_map.global and ctsec_map.local.

The RSCT authorization process in detail:

1. On the HMC: DMSRM pushes down the secret key and HMC IP address to NVRAM when it detects a new CEC; this process is repeated every five minutes. Each time an HMC is rebooted or DMSRM is restarted, a new key is used.

2. On the AIX LPAR: CSMAgentRM, through RTAS (Run-time Abstraction Services), reads the key and HMC IP address out from NVRAM. It will then authenticate the HMC. This process is repeated every five minutes on a LPAR to detect a new HMCs and if the key has changed. An HMC with a new key is treated as a new HMC and will go though the authentication and authorization processes again.

3. On the AIX LPAR: After authenticating the HMC, CSMAgentRM will contact the DMSRM on the HMC to create a ManagedNode resource in order to identify itself as a LPAR of this HMC. CSMAgentRM then creates a compatible ManagementServer resource on AIX. This can be displayed on AIX with the lssrsrc command. e.g.

root@aix6 / # lsrsrc "IBM.ManagementServer"

Resource Persistent Attributes for IBM.ManagementServer

resource 1:

Name = "192.168.1.244"

Hostname = "192.168.1.244"

ManagerType = "HMC"

LocalHostname = "10.153.3.133"

ClusterTM = "9078-160"

ClusterSNum = ""

ActivePeerDomain = ""

NodeNameList = {"aix6"}

4. On the AIX LPAR: After the creation of the ManagedNode and ManagementServer resources on the HMC and AIX respectively, CSMAgentRM grants HMC permission to access necessary resource classes on the LPAR. After granting the HMC permission, CSMAgentRM will change its ManagedNode, on the HMC, Status to 1. (It should be noted that without proper permission on AIX, the HMC would be able to establish a session with the LPAR but will not be able to query for OS information, DLPAR capabilities, or execute DLPAR commands afterwards.)

5. On the HMC: After the ManagedNode Status is changed to 1, LparCmdRM establishes a session with the LPAR, queries for operating system information and DLPAR capabilities, notifies CIMOM about the DLPAR capabilities of the LPAR, and then waits for the DLPAR commands from users.

Comments (18)

Add a Comment

Quarantine this Entry

purdym commented Dec 12 2015 Comment Permalink

Good article. For me the solution was the firewall rules on the interface in the HMC. Ensure RMC is allowed through the HMC's firewall.
hmc management -> change network setting -> lan adapter -> choose adapter -> details
-> firewall settings -> select RMC -> 'allow incoming'

Ensure RMC is now under 'Allowed hosts'.

Musfar commented May 7 2015 Comment Permalink

I ran into the problem recently. And after trying several things, we narrowed down to the network restrictions between the HMC subnet and LPARs. Learnt that the subnet of HMC is a general computing, while the LPARs' subnet is restricted. So, we requested the firewall team to allow HMC IP over port 657 to access restricted subnet of the LPARs. Hope this will be of some help to others.

M. Veidt commented Feb 11 2015 Comment Permalink

Hi Chris, great post. Also a common reason for RMC not working: HMC hostname is not resolvable via DNS or local lookup from the client LPAR. This should explicitly checked a well.

POWERHAguy commented Feb 6 2015 Comment Permalink

I did prove rcotter comment about running recfgct does indeed crash a PowerHA node cluster. So as my original notes state, take powerha and caa down before hand.

POWERHAguy commented Jan 29 2015 Comment Permalink

I just encountered a similar problem for cloning a primary node in PowerHA and making it a standby node. Though the environment as far as PowerHA was concerned worked, but errpt every 60 seconds was recording the errors I listed below. I went through the procedures above, though I stopped PowerHA on the standby node before starting. After performing the steps I got cthags was no longer found. I rebooted and all worked but I wasn't real happy with that.
So I recreated my environment again. The only step difference this time was when stopping PowerHA on the stby node I also stopped CAA services, went through steps, then restarted and told it to start CAA services again. Now this exact clmgr syntax only works with PowerHA 7.1.3 SP1 or above. Earlier versions of CAA/HA have different options/commands to stop/start CAA individually. This seems to have worked for me, hopefully it works for others. clmgr stop node dtcu0_stby WHEN=now MANAGE=offline STOP_CAA=yes stopsrc -g rsct_rm; stopsrc -g rsct /usr/bin/odmdelete -o CuAt -q 'attribute=node_uuid' /usr/sbin/rsct/bin/mknodeid -f (when I ran this step I go no output, I think it just pulls in the existing one from repos disk for node but not sure) lsattr -El cluster0 /usr/sbin/rsct/bin/lsnodeid /usr/sbin/rsct/install/bin/recfgct clmgr start node web WHEN=now MANAGE=auto START_CAA=yes ________________________________________________________________________________________- LABEL: CONFIGRM_ONLINEFAIL IDENTIFIER: E509DBCA LABEL: CONFIGRM_STARTED_ST IDENTIFIER: DE84C4DB LABEL: SRC_RSTRT IDENTIFIER: CB4A951F LABEL: CONFIGRM_EXIT_ONLIN IDENTIFIER: 68FD23E8

Rick Cotter commented Mar 22 2014 Comment Permalink

Warning: The recfgct command referenced above is *not* supported for use by customers without direct IBM support instructions. It erases all RSCT configuration info and makes it look like the node was just installed. This may be fine for DLPAR recycling, but if you have any other products dependent on RSCT on the partition in question, you will be *broken*. In particular, PowerHA 7 will crash, and Tivoli SAMP will have all its cluster info destroyed, partitioning it from the rest of the domain until it can be manually re-added (and it may also crash, depending on the presence of resources). If you find that DLPAR is not working, and all other network checks and even the RMC recycling (-z/-A/-p) does not work, it is strongly recommended that you use the ctsnap command to gather data and contact IBM support. (Capturing iptrace for a few minutes would not be a bad idea either. A complementary tcpdump on the HMC would also be good, but this may not be possible for most customers given HMC's access restrictions.) Then, if you wish to proceed with recfgct and find that it does resolve whatever the problem was, it would be equally wise to gather another ctsnap after the partition is once again connected to the HMC, to compare to the previous one.

cggibbo commented Feb 12 2014 Comment Permalink

Sounds looks like you might have a "ghost" adapters info on the HMC and in the VIOS. So now things are out of sync. You could try the following (at your own risk), to resolve the problem: From oem_setup_env on the VIOS: # /usr/sbin/drmgr -a -c slot -s U911X.MXX.1234E8C-V1-C164 -d 5 ; where the location code matches your adapter/slot config. Reconfigure the slot in the VIOS from padmin: $ cfgdev If the above works as expected then you should be able to remove the VFC adapter now, as padmin: $ rmdev -dev vfchostXYZ Then the HMC DLPAR remove on the slot should complete and leave the HMC and VIOS partition in a consistent state.

Nolte commented Feb 11 2014 Comment Permalink

Hi Chris, thank for the article very useful. But i have a problem: Adding in a dlpar mode a virtual fiber channel adapter to my vio, i have received an error in communication, but i have clicked "OK" in the window. After reset the connection following this article, the problem is that is not possible to remove the adaper from the vio running profile because : "0931-009 You specified a drc_name for a resource which is not assigned to this partition." effectively i don't have the vfchost on vio..(not even in "unknow state"). is like an error from HMC and a my error clicking "OK" and not "cancel" when the communication is bad. obviously after the reset is possible to make dlpar actions but is not possible delete the adapter even whit --force option in chhwres command. Thanks.

VEUT_xu_ma commented Mar 12 2013 Comment Permalink

Thanks for your article. I resolve my problem.

hillanes commented Nov 21 2012 Comment Permalink

Thks Chris, What about IVM environment, without HMC. I cant start RMC service. How can I troubleshoot my environment?

Activities	To Do List	High Priority Activities
Blogs	Latest Entries	Public Blogs Listing
Files	Shared With Me	Pinned Folders
Forums	I'm an Owner	Public Forums
Wikis	I'm an Owner	Public Wikis
My Home

Blogs

Chris's AIX Blog

About this blog

Related posts

Testing AIX Live Upd...

IBM Storage Insights...

IBM Spectrum Control...

IBM Power Systems Bi...

Tip: Rebooting with ...

Tags

Selected Tags

Related Tags

DLPAR not doing it for you anymore?

Send Email Notification

Quarantine this entry

Mark as Duplicate

Comments (18)