Why is smtctl slow?
A colleague of mine asked me for help with a mysterious issue. They claimed that the smtctl command was taking a very long time to run. He was comparing the run time of smtctl across two different LPARs. When he ran the command on one system (LPAR1) it returned almost immediately (which is the expected and typical behaviour of this command).
LPAR1 # timex smtctl -t 4
smtctl:
SMT is now enabled. It will persist across reboots if
you run the bosboot command before the
next reboot.
real 2.84
user
0.40
sys 0.93
Running it on the other LPAR (LPAR2), was slower. Much slower.
LPAR2
# timex smtctl -t 4
smtctl:
SMT is now enabled. It will persist across reboots if
you run the bosboot command before the
next reboot.
real 1456.56
user
0.41
sys 1.08
Why was this happening? Both LPARs were on the same S924 and had exactly the same amount of CPU/memory. I asked my colleague to collect some truss data from both the fast and the "slow" system i.e. truss -adeflo /tmp/smtctl_truss.out smtctl -t 4
Looking at the truss output, the first thing that stood out to me was that there were lots of "/usr/lib/dr/scripts/all/wpar_drs" processes being executed. I did not know why. There were no WPARs configured on this system and no DLPAR operations being performed. Also, on LPAR1, I did not observe any wpar_drs processes from smtctl!
LPAR2 #
grep wpar smtctl_truss.out | wc -l
1834
LPAR1 # grep wpar smtctl_truss.out
| wc -l
0
I wondered if this was a bug in an older AIX TL/SP, perhaps the two systems were running different AIX levels? But both systems were running AIX 7200-05-01-2038. Then I discovered that the "/usr/lib/dr/scripts/all/wpar_drs” script did not exist on LPAR1. I checked some of my internal Lab systems and the script did not exist on these either. Very curious. Something was off and I had not seen this before, as smtctl was usually very quick. We started digging around in "/usr/lib/dr/scripts" for some clues. On LPAR1, we found that there were some drmgr_scriptinfo* scripts in that directory. The drmgr_scriptinfo* files were tiny little database files containing the name/location of scripts to run during a DLPAR operation.
LPAR1 #
ls -l /usr/lib/dr/scripts
total
32
drwxr-xr-x 2 root
system 256 Aug 31 23:45
all
-rw------- 1 root
system 765 Jul 02
2020 drmgr_scriptinfo.00F6DB0A4C00
-rw------- 1 root
system 765 Aug 31 23:45
drmgr_scriptinfo.LPAR1
-rw------- 1 root
system 765 Jul 02
2020 drmgr_scriptinfo.localhost
LPAR1 #
ls -ltr /usr/lib/dr/scripts/all/
total
104
-r-x------ 1 root
system 20365 Jun 01
2015 aud_acct_dr
-r-x------ 1 root
system 4463 Jul 02
2020 cache_mgt_dr
-r-x------ 1 root
system 10177 Aug 18
2020 ctrmc_MDdr
-r-------- 1 root
system 10177 Aug 18
2020 .ctrmc_MDdr
On LPAR2, the wpar_drs script was located here.
LPAR2 # ls
-al /usr/lib/dr/scripts/all/
total
152
drwxr-xr-x 2 root
system 256 Apr 5 16:43 ./
drwxr-xr-x 3 root
system 4096 Apr 5 16:43 ../
-r-------- 1 root
system 20361 Jun 27
2012 .aud_acct_dr
-r-------- 1 root
system 10177 Sep 24
2020 .ctrmc_MDdr
-r-x------ 1 root
system 20361 Aug 22
2013 aud_acct_dr
-r-x------ 1 root
system 10177 Sep 24
2020 ctrmc_MDdr
-r-x------ 1 root system 6442 Aug 8 2010
wpar_drs
As far as I could gather, this script was delivered via the bos.wpars fileset. But was not sure how (or when) it was copied to /usr/lib/dr/scripts/all/.
#
lslpp -f bos.wpars|grep wpar_dr
/usr/lib/corrals/wpar_drs
/usr/sbin/wpar_drh
So, as an experiment, to rule out /usr/lib/dr/scripts/all/wpar_drs as the cause of the “smtctl slowness”, I proposed we moved it out of that directory and then tried running smtctl. My colleague was a little nervous about this, "Do we know which program calls this wpar_drs?", "If that program cannot find it, will it abend the smtctl operation?". To which I replied, I believe drmgr calls it and will only run it if it exists in /usr/lib/dr/scripts/all. To confirm this, I asked him to run "drmgr -l" and we saw wpar_drs listed (along with the other scripts located in the /usr/lib/dr/scripts/all).
LPAR2 #
drmgr -l
DR
Install Root Directory: /usr/lib/dr/scripts
Syslog
ID: DRMGR
------------------------------------------------------------
/usr/lib/dr/scripts/all/aud_acct_dr WPAR DR Script for Auditing and
Accounting
Vendor:IBM, Version:1, Date:03232007
Script Timeout:60, Admin Override Timeout:0
Memory DR Percentage:100
Resources Supported:
Resource Name:
wmig-checkpoint Resource Usage: Checkpoint
of Auditing and Accounting within a WPAR
Resource Name:
wmig-restart Resource Usage: Restart
of Auditing and Accounting within a WPAR
------------------------------------------------------------
/usr/lib/dr/scripts/all/ctrmc_MDdr DR script to refresh Management
Domain configuration
Vendor:IBM, Version:2, Date:05252010
Script Timeout:10, Admin Override Timeout:0
Memory DR Percentage:100
Resources Supported:
Resource Name: pmig Resource Usage: Partition migration for
RSCT Management Domain
Resource Name: phib Resource Usage: Partition hibernation for
RSCT Management Domain
------------------------------------------------------------
/usr/lib/dr/scripts/all/wpar_drs WPAR DR script for DR Events
Vendor:IBM, Version:1, Date:05312007
Script Timeout:0, Admin Override Timeout:0
Memory DR
Percentage:100
Resources Supported:
Resource Name:
cpu Resource Usage: Propagate CPU
add/remove
Resource Name:
mem Resource Usage: Propagate
Memory add/remove
Resource Name:
capacity Resource Usage: Propagate
Capacity changes
Resource Name:
var_weight Resource Usage:
Propagate Var Weight changes
------------------------------------------------------------
After my colleague moved the /usr/lib/dr/scripts/all/wpar_drs script out of the way, and ran "drmgr -b" (to rebuild the DRMGR Scriptinfo file), he ran "drmgr -l" again and the wpar_drs script was no longer listed.
LPAR2 # mv
/usr/lib/dr/scripts/all/wpar_drs /tmp/cg/wpar_drs.backup
LPAR2
# drmgr -b
Rebuild
of Scriptinfo file is complete
LPAR2 #
drmgr -l
DR
Install Root Directory: /usr/lib/dr/scripts
Syslog
ID: DRMGR
------------------------------------------------------------
/usr/lib/dr/scripts/all/aud_acct_dr WPAR DR Script for Auditing and
Accounting
Vendor:IBM, Version:1, Date:03232007
Script Timeout:60, Admin Override Timeout:0
Memory DR Percentage:100
Resources Supported:
Resource Name:
wmig-checkpoint Resource Usage:
Checkpoint of Auditing and Accounting within a WPAR
Resource Name:
wmig-restart Resource Usage: Restart
of Auditing and Accounting within a WPAR
------------------------------------------------------------
/usr/lib/dr/scripts/all/ctrmc_MDdr DR script to refresh Management
Domain configuration
Vendor:IBM, Version:2, Date:05252010
Script Timeout:10, Admin Override Timeout:0
Memory DR Percentage:100
Resources Supported:
Resource Name: pmig Resource Usage: Partition migration for
RSCT Management Domain
Resource Name: phib Resource Usage: Partition hibernation for
RSCT Management Domain
------------------------------------------------------------
He ran smtctl again. It was "fast".
LPAR2 #
timex smtctl -t 4
smtctl:
SMT is now enabled. It will persist across reboots if
you run the bosboot command before the
next reboot.
real 2.88
user
0.01
sys 0.57
This resolved his "problem". I still don’t know why this was happening (or how the system ended up in this configuration). And we’d probably need to engage AIX support to work out if this is a bug or something else. My colleague was not able to engage AIX support since the machine he was using did not have a SW maintenance agreement (and his benchmark project was close to completion anyway). But at least smtctl was now “fast”.
In summary, it appears that smtctl was calling the drmgr process and in turn the drmgr scripts (including wpar_drs). But why was it calling wpar_drs over 1800 times? I have no idea. It felt like a bug, but I’ll probably never know, unless I come across this issue again.
32768420: 87687581: 0.1035: statx("/usr/lib/dr/scripts/drmgr_scriptinfo.LPAR2", 0x2FF22600, 128,
010) = 0
32768420: 87687581: 0.1038: kopen("/usr/lib/dr/scripts/drmgr_scriptinfo.LPAR2",
O_RDONLY|O_LARGEFILE) = 6
32768420: 87687581: 0.1060: statx("/usr/lib/dr/scripts/drmgr_scriptinfo.LPAR2", 0x2FF22738, 128,
010) = 0
32768420: 87687581: 0.1062:
kopen("/usr/lib/dr/scripts/drmgr_scriptinfo.LPAR2",
O_RDONLY|O_LARGEFILE) = 6
Thanks for reading!