Why is smtctl slow?

 

A colleague of mine asked me for help with a mysterious issue. They claimed that the smtctl command was taking a very long time to run. He was comparing the run time of smtctl across two different LPARs. When he ran the command on one system (LPAR1) it returned almost immediately (which is the expected and typical behaviour of this command).

 

LPAR1 # timex smtctl -t 4

smtctl: SMT is now enabled. It will persist across reboots if

        you run the bosboot command before the next reboot.

 

real 2.84

user 0.40

sys  0.93

 

Running it on the other LPAR (LPAR2), was slower. Much slower.

 

LPAR2 # timex smtctl -t 4

smtctl: SMT is now enabled. It will persist across reboots if

        you run the bosboot command before the next reboot.

 

real 1456.56

user 0.41

sys  1.08

 

Why was this happening? Both LPARs were on the same S924 and had exactly the same amount of CPU/memory. I asked my colleague to collect some truss data from both the fast and the "slow" system i.e. truss -adeflo /tmp/smtctl_truss.out smtctl -t 4

 

Looking at the truss output, the first thing that stood out to me was that there were lots of "/usr/lib/dr/scripts/all/wpar_drs" processes being executed. I did not know why. There were no WPARs configured on this system and no DLPAR operations being performed. Also, on LPAR1, I did not observe any wpar_drs processes from smtctl!

 

LPAR2 # grep wpar smtctl_truss.out | wc -l

    1834

 

LPAR1 # grep wpar smtctl_truss.out | wc -l

      0

      

I wondered if this was a bug in an older AIX TL/SP, perhaps the two systems were running different AIX levels? But both systems were running AIX 7200-05-01-2038. Then I discovered that the "/usr/lib/dr/scripts/all/wpar_drs” script did not exist on LPAR1. I checked some of my internal Lab systems and the script did not exist on these either. Very curious. Something was off and I had not seen this before, as smtctl was usually very quick. We started digging around in "/usr/lib/dr/scripts" for some clues. On LPAR1, we found that there were some drmgr_scriptinfo* scripts in that directory. The drmgr_scriptinfo* files were tiny little database files containing the name/location of scripts to run during a DLPAR operation.

 

LPAR1 # ls -l /usr/lib/dr/scripts

total 32

drwxr-xr-x    2 root     system          256 Aug 31 23:45 all

-rw-------    1 root     system          765 Jul 02 2020  drmgr_scriptinfo.00F6DB0A4C00

-rw-------    1 root     system          765 Aug 31 23:45 drmgr_scriptinfo.LPAR1

-rw-------    1 root     system          765 Jul 02 2020  drmgr_scriptinfo.localhost

LPAR1 # ls -ltr /usr/lib/dr/scripts/all/

total 104

-r-x------    1 root     system        20365 Jun 01 2015  aud_acct_dr

-r-x------    1 root     system         4463 Jul 02 2020  cache_mgt_dr

-r-x------    1 root     system        10177 Aug 18 2020  ctrmc_MDdr

-r--------    1 root     system        10177 Aug 18 2020  .ctrmc_MDdr

On LPAR2, the wpar_drs script was located here.

 

LPAR2 # ls -al /usr/lib/dr/scripts/all/

total 152

drwxr-xr-x    2 root     system          256 Apr  5 16:43 ./

drwxr-xr-x    3 root     system         4096 Apr  5 16:43 ../

-r--------    1 root     system        20361 Jun 27 2012  .aud_acct_dr

-r--------    1 root     system        10177 Sep 24 2020  .ctrmc_MDdr

-r-x------    1 root     system        20361 Aug 22 2013  aud_acct_dr

-r-x------    1 root     system        10177 Sep 24 2020  ctrmc_MDdr

-r-x------    1 root     system         6442 Aug  8 2010  wpar_drs

 

As far as I could gather, this script was delivered via the bos.wpars fileset. But was not sure how (or when) it was copied to /usr/lib/dr/scripts/all/.

 

# lslpp -f bos.wpars|grep wpar_dr

                        /usr/lib/corrals/wpar_drs

                        /usr/sbin/wpar_drh

 

So, as an experiment, to rule out /usr/lib/dr/scripts/all/wpar_drs as the cause of the “smtctl slowness”, I proposed we moved it out of that directory and then tried running smtctl. My colleague was a little nervous about this, "Do we know which program calls this wpar_drs?", "If that program cannot find it, will it abend the smtctl operation?". To which I replied, I believe drmgr calls it and will only run it if it exists in /usr/lib/dr/scripts/all. To confirm this, I asked him to run "drmgr -l" and we saw wpar_drs listed (along with the other scripts located in the /usr/lib/dr/scripts/all).

 

LPAR2 # drmgr -l

DR Install Root Directory: /usr/lib/dr/scripts

Syslog ID: DRMGR

------------------------------------------------------------

/usr/lib/dr/scripts/all/aud_acct_dr              WPAR DR Script for Auditing and Accounting

        Vendor:IBM,     Version:1,      Date:03232007

        Script Timeout:60,      Admin Override Timeout:0

        Memory DR Percentage:100

        Resources Supported:

                Resource Name: wmig-checkpoint   Resource Usage: Checkpoint of Auditing and Accounting within a WPAR

                Resource Name: wmig-restart      Resource Usage: Restart of Auditing and Accounting within a WPAR

------------------------------------------------------------

/usr/lib/dr/scripts/all/ctrmc_MDdr               DR script to refresh Management Domain configuration

        Vendor:IBM,     Version:2,      Date:05252010

        Script Timeout:10,      Admin Override Timeout:0

        Memory DR Percentage:100

        Resources Supported:

                Resource Name: pmig      Resource Usage: Partition migration for RSCT Management Domain

                Resource Name: phib      Resource Usage: Partition hibernation for RSCT Management Domain

------------------------------------------------------------

/usr/lib/dr/scripts/all/wpar_drs                 WPAR DR script for DR Events

        Vendor:IBM,     Version:1,      Date:05312007

        Script Timeout:0,       Admin Override Timeout:0

        Memory DR Percentage:100

        Resources Supported:

                Resource Name: cpu       Resource Usage: Propagate CPU add/remove

                Resource Name: mem       Resource Usage: Propagate Memory add/remove

                Resource Name: capacity  Resource Usage: Propagate Capacity changes

                Resource Name: var_weight        Resource Usage: Propagate Var Weight changes

------------------------------------------------------------

 

After my colleague moved the /usr/lib/dr/scripts/all/wpar_drs script out of the way, and ran "drmgr -b" (to rebuild the DRMGR Scriptinfo file), he ran "drmgr -l" again and the wpar_drs script was no longer listed.

 

LPAR2 # mv /usr/lib/dr/scripts/all/wpar_drs /tmp/cg/wpar_drs.backup

 

LPAR2 # drmgr -b

Rebuild of Scriptinfo file is complete

 

LPAR2 # drmgr -l

DR Install Root Directory: /usr/lib/dr/scripts

Syslog ID: DRMGR

------------------------------------------------------------

/usr/lib/dr/scripts/all/aud_acct_dr              WPAR DR Script for Auditing and Accounting

        Vendor:IBM,     Version:1,      Date:03232007

        Script Timeout:60,      Admin Override Timeout:0

        Memory DR Percentage:100

        Resources Supported:

                Resource Name: wmig-checkpoint   Resource Usage: Checkpoint of Auditing and Accounting within a WPAR

                Resource Name: wmig-restart      Resource Usage: Restart of Auditing and Accounting within a WPAR

------------------------------------------------------------

/usr/lib/dr/scripts/all/ctrmc_MDdr               DR script to refresh Management Domain configuration

        Vendor:IBM,     Version:2,      Date:05252010

        Script Timeout:10,      Admin Override Timeout:0

        Memory DR Percentage:100

        Resources Supported:

                Resource Name: pmig      Resource Usage: Partition migration for RSCT Management Domain

                Resource Name: phib      Resource Usage: Partition hibernation for RSCT Management Domain

------------------------------------------------------------

 

He ran smtctl again. It was "fast".

 

LPAR2 # timex smtctl -t 4

smtctl: SMT is now enabled. It will persist across reboots if

        you run the bosboot command before the next reboot.

 

real 2.88

user 0.01

sys  0.57

 

This resolved his "problem". I still don’t know why this was happening (or how the system ended up in this configuration). And we’d probably need to engage AIX support to work out if this is a bug or something else. My colleague was not able to engage AIX support since the machine he was using did not have a SW maintenance agreement (and his benchmark project was close to completion anyway). But at least smtctl was now “fast”.

 

In summary, it appears that smtctl was calling the drmgr process and in turn the drmgr scripts (including wpar_drs). But why was it calling wpar_drs over 1800 times? I have no idea. It felt like a bug, but I’ll probably never know, unless I come across this issue again.

 

32768420: 87687581: 0.1035:        statx("/usr/lib/dr/scripts/drmgr_scriptinfo.LPAR2", 0x2FF22600, 128, 010) = 0

32768420: 87687581: 0.1038:        kopen("/usr/lib/dr/scripts/drmgr_scriptinfo.LPAR2", O_RDONLY|O_LARGEFILE) = 6

32768420: 87687581: 0.1060:        statx("/usr/lib/dr/scripts/drmgr_scriptinfo.LPAR2", 0x2FF22738, 128, 010) = 0

32768420: 87687581: 0.1062:        kopen("/usr/lib/dr/scripts/drmgr_scriptinfo.LPAR2", O_RDONLY|O_LARGEFILE) = 6

 

Thanks for reading!