
The mysterious case of the shared storage pool that would not start... after a power outage!

There was a power failure in our DC a while ago. This impacted one of our VIOS Shared Storage Pool (SSP) systems. We wanted to share with you the story of what happened when we tried to restart the VIOS SSP.

We had a single node VIOS SSP cluster (yes, you read that right, a single node). When electrical power was restored to the DC, we powered on the server and then activated the VIOS node. It booted fine, but the SSP failed to start.

The cluster -status command returned an error, stating that it was unable to connect to the (VIOS SSP) database.

[padmin@sys854-vios1]$ cluster -status

Unable to connect to Database. Please check the filesystem space, network, and system resources of the VIOS. If the error persists, contact your service representative.
Unable to connect to Database. Please check the filesystem space, network, and system resources of the VIOS. If the error persists, contact your service representative.

If you're curious, this VIOS was running version 4.1 of the VIOS operating system.

[padmin@sys854-vios1]$ ioslevel

4.1.0.10

In the /home/ios/logs/viod.log file, we saw the following message repeated over and over:

...
[2024-09-05 20:10:36.372] 0x1113 viod_primary.c viod_read_lock_ecf
1.128.1.56 5269 ERR -- Unable to get read lock on file /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/Election/dbn.ecf
errno:26
[2024-09-05 20:10:36.372] 0x1113 viod_primary.c viod_get_primary_ecf 1.128.1.56 3399 ERR -- failed to get ecf read lock rc: -1
...

The VIOS error report contained this information:

LABEL: POOL_START_FAILURE
IDENTIFIER: 042E4B14
Date/Time: Thu Sep 5 19:39:21 EDT 2024
Sequence Number: 265
Machine Id: 00CC53154C00
Node Id: sys854-vios1
Class: S
Type: PERM
WPAR: Global

Resource Name: POOL

Description
Pool start failure.

Detail Data
LINE NUMBER:
740

DETECTING MODULE
pool.c,Aug 29 2023 14:51:48 d2023_35A6

EVENT DATA:
poolId=000000000A080C640000000065BAC357
psa_flags=0x8
0967-176 Resources are busy.

We noticed that the SSP file system was mounted. This was promising: although the SSP database was not fully operational, the data file system, containing all of the SSP volumes for our LPARs, was still intact.

[padmin@sys854-vios1]$ df | grep SSP
/var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310 837812224 775171864 8% - - /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310

[root@sys854-vios1]# pooladm pool list
Pool Path
------------------------------
/var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310

All we had to do now was determine what the problem was with the SSP database.

If you're not aware, the VIOS uses Postgres as its database for storing the SSP configuration. On a VIOS with a properly functioning SSP setup, you'll find Postgres processes running under the vpgadmin userid. For example:

[padmin@sys854-vios1]$ ps -fu vpgadmin
UID PID PPID C STIME TTY TIME CMD
vpgadmin 5046724 16253270 0 Sep 11 - 0:12 postgres: CM: background writer
vpgadmin 11206928 16253270 0 Sep 11 - 1:53 postgres: CM: checkpointer
vpgadmin 11469148 14811394 0 Sep 11 - 0:16 postgres: SSP: walwriter
vpgadmin 12452214 16253270 0 Sep 11 - 0:33 postgres: CM: walwriter
vpgadmin 13697434 14811394 0 Sep 11 - 0:00 postgres: SSP: autovacuum launcher
vpgadmin 14090508 16253270 0 Sep 11 - 7:33 postgres: CM: logger
vpgadmin 14614900 14811394 0 Sep 11 - 0:00 postgres: SSP: logical replication launcher
vpgadmin 14811394 13828406 0 Sep 11 - 22:24 /usr/ios/db/postgres15/bin/postgres -D /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L
vpgadmin 14876990 14811394 0 Sep 11 - 7:53 postgres: SSP: logger
vpgadmin 14942716 14811394 0 Sep 11 - 0:10 postgres: SSP: archiver last was 000000010000000000000016.00000060.backup
vpgadmin 15597846 16253270 0 Sep 11 - 0:00 postgres: CM: autovacuum launcher
vpgadmin 15925668 14811394 0 Sep 11 - 0:11 postgres: SSP: checkpointer
vpgadmin 16253270 1 0 Sep 11 - 19:42 /usr/ios/db/postgres15/bin/postgres -D /home/ios/CM/DB
vpgadmin 16449904 14811394 0 Sep 11 - 0:11 postgres: SSP: background writer
vpgadmin 17432894 16253270 0 Sep 11 - 0:01 postgres: CM: logical replication launcher

[padmin@sys854-vios1]$

[root@sys854-vios1]# /usr/ios/db/bin/postgres -V
/usr/ios/db/postgres15/bin/postgres -V
postgres (PostgreSQL) 15.3

We found some internal notes that suggested the error may be related to a corrupt dbn.ecf file. This file is important in a multi-node VIOS SSP cluster, as it is used to elect the primary node in the cluster. So, we tried the following steps to repair the file.

[root@sys854-vios1]# lsattr -El cluster0 -a node_uuid
node_uuid 1e6a23e2-c084-11ee-8006-98be947267d2 OS image identifier True

[root@sys854-vios1]# stopsrc -s vio_daemon

[root@sys854-vios1]# rm /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG/postmaster.pid

[root@sys854-vios1]# cd /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/Election/

[root@sys854-vios1]# mv dbn.ecf dbn.ecf.20240906.old

[root@sys854-vios1]# echo "ecf_version:1" > dbn.ecf

[root@sys854-vios1]# echo "cluster_id:1ea1377ec08411ee800698be947267d2" >> dbn.ecf

[root@sys854-vios1]# cat dbn.ecf
ecf_version:1
cluster_id:1ea1377ec08411ee800698be947267d2
[root@sys854-vios1]#

[root@sys854-vios1]# startsrc -s vio_daemon

[root@sys854-vios1]# lssrc -ls vio_daemon | grep DBN
DBN NODE ID: 1e6a23e2c08411ee800698be947267d2
DBN Role: Primary

To no avail. The same dbn.ecf errors appeared in the viod.log file and the problem remained. None of the vhostX adapters would configure.

[padmin@sys854-vios1]$ cfgdev
Some error messages may contain invalid information
for the Virtual I/O Server environment.
Method error (/usr/lib/methods/cfg_vt_ngdisk -l vtscsi0 ):
        0514-033 Cannot perform the requested function because the
                 specified device is dependent on an attribute which does
                 not exist.
Method error (/usr/lib/methods/cfg_vt_ngdisk -l vtscsi1 ):
       0514-033 Cannot perform the requested function because the
                specified device is dependent on an attribute which does
                not exist.

Method error (/usr/lib/methods/cfg_vt_ngdisk -l vtscsi2 ):
       0514-033 Cannot perform the requested function because the
                specified device is dependent on an attribute which does
                not exist.

Method error (/usr/lib/methods/cfg_vt_ngdisk -l vtscsi3 ):
       0514-033 Cannot perform the requested function because the
                specified device is dependent on an attribute which does
                not exist.

This particular VIOS was used for development purposes in our lab. Regular backups were not captured for this system, so there was no mksysb (backupios) image taken for this VIOS prior to the power outage. However, we did find a couple of old (auto) SSP cluster backups in the cfgbackups directory in /home/padmin, from April of this year.

[padmin@sys854-vios1]$ ls -ltr cfgbackups
total 50208
-rw-r--r-- 1 padmin staff 7164 Apr 22 22:00 autoviosbr_sys854-vios1.tar.gz
-rw-r--r-- 1 padmin staff 25697221 Apr 22 22:01 autoviosbr_SSP.sys854sspcluster.tar.gz

These backups were automatically created when the VIOS and SSP configuration was last updated/changed. The following entry in root's crontab enabled this feature:

[root@sys854-vios1]# crontab -l | grep vio
0 * * * * /usr/ios/sbin/autoviosbr -start 1>/dev/null 2>/dev/null

So, worst case, we did have a backup (although old) that we could use to recover the SSP configuration if needed.
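For the record, had we needed to fall back on that old backup, the recovery would have gone roughly along these lines (a sketch only; we never actually ran it, and the exact viosbr options depend on the state of the cluster and its repository disk, so check the viosbr documentation for your VIOS level first):

[padmin@sys854-vios1]$ viosbr -view -clustername sys854sspcluster -file cfgbackups/autoviosbr_SSP.sys854sspcluster.tar.gz
[padmin@sys854-vios1]$ viosbr -restore -clustername sys854sspcluster -file cfgbackups/autoviosbr_SSP.sys854sspcluster.tar.gz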

Because we could still access the SSP file system, we decided to copy the SSP volumes (files) to another system, in case we needed to recover the LPARs. This would be a last resort!

We copied (scp'ed) the files in /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VOL1 to another system.

[root@sys854-vios1]# cd /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310
[root@sys854-vios1]# scp -pr VOL1 lpar1:/export/sys854_VOLS/
...

On the remote system:

root@lpar1 /export/sys854_VOLS/VOL1 # ls -ltr
total 167772192
-rwx------ 1 root system 21474836480 Jan 31 2024 an70lpar2-rootvg.048362b608a143c723fb0d23a4f623d5
-rwx------ 1 root system 21474836480 Jan 31 2024 an70lpar3-rootvg.58843c5f94233028f17dfdfe9345fb46
--w------- 1 root system 327 Jan 31 2024 .an70lpar1-rootvg.1880f7e6b63f621e0528bbfcbde91872
--w------- 1 root system 327 Jan 31 2024 .an70lpar4-rootvg.3036fe48792a0471d56cf9c8af19c54e
--w------- 1 root system 327 Jan 31 2024 .an70lpar2-rootvg.048362b608a143c723fb0d23a4f623d5
--w------- 1 root system 327 Jan 31 2024 .an70lpar3-rootvg.58843c5f94233028f17dfdfe9345fb46
-rwx------ 1 root system 21474836480 Jan 31 2024 an70lpar1-rootvg.1880f7e6b63f621e0528bbfcbde91872
-rwx------ 1 root system 21474836480 Jan 31 2024 an70lpar4-rootvg.3036fe48792a0471d56cf9c8af19c54e

Our thought was that if we had to restore the SSP configuration from backup, or, even worse, rebuild the VIOS SSP from scratch and reload the LPARs, we could use these SSP volume files. We recalled that there was a procedure we had used in the past to clone an SSP LPAR. The steps involve using the dd command to import an uncompressed copy of an SSP volume into a new (empty) SSP disk. You can review the procedure here:

Manually importing and exporting volumes

https://www.ibm.com/docs/en/powervc-cloud/2.2.1?topic=images-manually-importing-exporting-volumes
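In rough terms, the clone/import goes like this (a sketch based on that procedure; the LU name, size and UDID below are illustrative, and the new LU's backing file name must be read from lu -list after it is created):

[padmin@sys854-vios1]$ lu -create -lu an70lpar1-rootvg-restored -size 20480
[padmin@sys854-vios1]$ lu -list
[root@sys854-vios1]# dd if=/export/sys854_VOLS/VOL1/an70lpar1-rootvg.1880f7e6b63f621e0528bbfcbde91872 of=/var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VOL1/an70lpar1-rootvg-restored.<new_udid> bs=1M

The new LU can then be mapped to the client's vhost adapter in the usual way.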

But, we really did not want to rebuild the whole environment again, as this would've required a lot of time and effort. Instead, we decided to persist with more troubleshooting.

The next thing we tried was to attempt to manually start the SSP DB.

[root@sys854-vios1]# stopsrc -s vio_daemon
[root@sys854-vios1]# lssrc -s vio_daemon
Subsystem Group PID Status
vio_daemon inoperative

[root@sys854-vios1]# su - vpgadmin -c "export
LIBPATH=/usr/ios/db/postgres15/lib; /usr/ios/db/postgres15/bin/pg_ctl start -D
/var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG -l
/tmp/pg_ssp_db.log"
waiting for server to start.... done
server started

[root@sys854-vios1]# cat /tmp/pg_ssp_db.log
2024-09-11 11:56:25.203 GMT [14876956] FATAL: lock file "postmaster.pid" already exists
2024-09-11 11:56:25.203 GMT [14876956] HINT: Is another postmaster
(PID 5046584) running in data directory
"/var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG"?
2024-09-11 11:58:43.108 GMT [14483778] LOG: starting PostgreSQL 15.3
on powerpc-ibm-aix7.1.5.0, compiled by /opt/IBM/xlc/13.1.0/bin/xlc -q64
-qcpluscmt, 64-bit
2024-09-11 11:58:43.109 GMT [14483778] LOG: listening on IPv4 address
"127.0.0.1", port 5432
2024-09-11 11:58:43.130 GMT [14483778] LOG: listening on Unix socket
"/tmp/.s.PGSQL.5432"
2024-09-11 11:58:43.154 GMT [12452114] LOG: database system was shut
down at 2024-09-11 11:57:08 GMT
2024-09-11 11:58:43.178 GMT [14483778] LOG: database system is ready
to accept connections

Then we checked to see if the DB processes were now up and listening.

# su - vpgadmin -c "export LIBPATH=/usr/ios/db/postgres15/lib;/usr/ios/db/postgres15/bin/pg_ctl status -D /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG"
# proctree <pid SSP db>
# netstat -Aan | grep 6090

It appeared that the database would come up for a little while, then stop. Then it would attempt to restart again. Then stop again. It remained in this loop indefinitely.

[root@sys854-vios1]# cg=`su - vpgadmin -c "export
LIBPATH=/usr/ios/db/postgres15/lib; /usr/ios/db/postgres15/bin/pg_ctl
status -D /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG"` ;
echo $cg | awk -F \) '{print $1}' | awk -F PID: '{print $2}' | xargs proctree ; netstat -Aan | grep 6090

[root@sys854-vios1]# cg=`su - vpgadmin -c "export LIBPATH=/usr/ios/db/postgres15/lib; /usr/ios/db/postgres15/bin/pg_ctl status -D /v>
6554094 /usr/sbin/srcmstr
15073636 /usr/sbin/vio_daemon
17432844 /usr/ios/db/postgres15/bin/postgres -D /var/vio/SSP/sys854sspcluster/D_E_F_A_U
5046530 postgres: background writer
9830694 postgres: checkpointer
12779940 postgres: walwriter
15335748 postgres: logical replication launcher
15663520 postgres: autovacuum launcher
f1000f0004ada3c0 tcp4 0 0 127.0.0.1.6090 . LISTEN
f1000f0007fda808 stream 0 0 f1000b02a466b020 0 0 0 /tmp/.s.PGSQL.6090

[root@sys854-vios1]# cg=`su - vpgadmin -c "export
LIBPATH=/usr/ios/db/postgres15/lib; /usr/ios/db/postgres15/bin/pg_ctl
status -D /v>
6554094 /usr/sbin/srcmstr
15073636 /usr/sbin/vio_daemon
17432844 /usr/ios/db/postgres15/bin/postgres -D /var/vio/SSP/sys854sspcluster/D_E_F_A_U_
5046530 postgres: background writer
9830694 postgres: checkpointer
12779940 postgres: walwriter
15335748 postgres: logical replication launcher
15663520 postgres: autovacuum launcher
f1000f0004ada3c0 tcp4 0 0 127.0.0.1.6090 . LISTEN
f1000f0004ae03c0 tcp4 0 0 127.0.0.1.45822 127.0.0.1.6090 TIME_WAIT
f1000f0004acf3c0 tcp4 0 0 127.0.0.1.45823 127.0.0.1.6090 TIME_WAIT
f1000f0007fda808 stream 0 0 f1000b02a466b020 0 0 0 /tmp/.s.PGSQL.6090

[root@sys854-vios1]# cg=`su - vpgadmin -c "export
LIBPATH=/usr/ios/db/postgres15/lib; /usr/ios/db/postgres15/bin/pg_ctl status -D /v>
f1000f0004ae03c0 tcp4 0 0 127.0.0.1.45822 127.0.0.1.6090 TIME_WAIT
f1000f0004acf3c0 tcp4 0 0 127.0.0.1.45823 127.0.0.1.6090 TIME_WAIT

[root@sys854-vios1]#

This seemed to suggest there was a problem with the Postgres DB configuration, so we decided to check the contents of the postgresql.conf file.

[root@sys854-vios1]# cat /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG/postgresql.conf
[root@sys854-vios1]#

We'd uncovered a huge problem!

The database configuration file was empty!

It was now no surprise that the database was not starting.

All we needed to do now was to find a way to recover, repair, or rebuild the postgresql.conf file. But how?

A sample postgresql.conf file is provided with the ios.database.rte fileset.

[root@sys854-vios1]# lslpp -w /usr/ios/db/postgres15/share/postgresql/postgresql.conf.sample
  File                                        Fileset               Type
  ----------------------------------------------------------------------------
  /usr/ios/db/postgres15/share/postgresql/postgresql.conf.sample
                                            ios.database.rte        File

[root@sys854-vios1]# lslpp -f | grep postgresql.conf
/usr/ios/db/postgres15/share/postgresql/postgresql.conf.sample
/usr/ios/db/postgres13/share/postgresql/postgresql.conf.sample

Our first thought was to copy the sample configuration file and make the necessary changes to suit the SSP environment.

# cp /usr/ios/db/postgres15/share/postgresql/postgresql.conf.sample /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG/postgresql.conf

We stopped the DB and made a few edits to the config file, including changing the listening port to 6090. We thought that perhaps the sample postgres file would already be customised (enough) for the VIOS SSP, but it wasn't.

# su - vpgadmin -c "export LIBPATH=/usr/ios/db/postgres15/lib; /usr/ios/db/postgres15/bin/pg_ctl stop -D /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG"

# vi /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG/postgresql.conf
...
#port = 5432 # (change requires restart)
port = 6090
...

This didn't help much. The database continued attempting to start, failing, and then restarting again. We were probably missing something in the configuration file, and it was proving difficult (and time-consuming) to figure out what was wrong or missing. There were many items that needed to be changed in the sample file for the VIOS SSP. For example:

max_connections = 256                   # (change requires restart)
...
log_directory = '/home/ios/logs/pg_sspdb' # directory where log files are written,
                                        # can be absolute or relative to PGDATA
log_filename = 'pg_sspdb-%d-%H-%M.log'  # log file name pattern,
                                        # can include strftime() escapes
log_file_mode = 0644                    # creation mode for log files,
                                        # begin with 0 to use octal notation
log_truncate_on_rotation = off          # If on, an existing log file with the
                                        # same name as the new log file will be
                                        # truncated rather than appended to.
                                        # But such truncation only occurs on
                                        # time-driven rotation, not on restarts
                                        # or size-driven rotation.  Default is
                                        # off, meaning append to existing files
                                        # in all cases.
log_rotation_age = 1d                   # Automatic rotation of logfiles will
                                        # happen after that time.  0 disables.
log_rotation_size = 20MB                # Automatic rotation of logfiles will
                                        # happen after that much log output.
                                        # 0 disables.
...

We also considered copying a config file from another, different VIOS SSP cluster (as we had a few other SSP clusters in the environment to work from), but we decided against it.

One of our colleagues, from the VIOS development team, indicated that a (backup) copy of the customised postgres configuration file was kept in the /usr/ios/db/configuration_files directory on the VIOS.

[root@sys854-vios1]# ls -ltr /usr/ios/db/configuration_files/postgresql_SSP.conf
-rw-r--r-- 1 root db_users 23349 Jul 25 2023 /usr/ios/db/configuration_files/postgresql_SSP.conf

We were advised to copy this file over the top of the (empty) /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG/postgresql.conf file.
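Something along these lines did the trick (our notes didn't capture the exact commands, so treat this as a sketch; in particular, check that the ownership and permissions on the copied file match what Postgres expects, as shown in the listing below):

[root@sys854-vios1]# cp /usr/ios/db/configuration_files/postgresql_SSP.conf /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG/postgresql.conf
[root@sys854-vios1]# chown vpgadmin:db_users /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG/postgresql.conf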

[root@sys854-vios1]# ls -ltr /usr/ios/db/configuration_files/postgresql_SSP.conf
-rw-r--r--    1 root     db_users      23349 Jul 25 2023  /usr/ios/db/configuration_files/postgresql_SSP.conf

[root@sys854-vios1]# ls -ltr /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG/postgresql.conf

-rw-r--r--    1 vpgadmin db_users      23349 Sep 11 08:14 /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG/postgresql.conf

Then we attempted to start the database again.

# su - vpgadmin -c "export LIBPATH=/usr/ios/db/postgres15/lib;
/usr/ios/db/postgres15/bin/pg_ctl start -D /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG -l /tmp/pg_ssp_db.log"

The situation immediately looked a lot better. The netstat command showed lots of database port (listening) activity.

[root@sys854-vios1]# /SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG -l /tmp/pg_ssp_db.log"                                <
waiting for server to start.... done
server started

[root@sys854-vios1]# netstat -Aan | grep 6090
f1000f00069f43c0 tcp4       0      0  10.8.12.100.37575     10.8.12.100.6090      TIME_WAIT
f1000f00068383c0 tcp4       0      0  10.8.12.100.37576     10.8.12.100.6090      TIME_WAIT
f1000f0006840bc0 tcp4       0      0  10.8.12.100.37577     10.8.12.100.6090      TIME_WAIT
f1000f0006b3f3c0 tcp4       0      0  10.8.12.100.37578     10.8.12.100.6090      TIME_WAIT
f1000f00068debc0 tcp4       0      0  10.8.12.100.37579     10.8.12.100.6090      TIME_WAIT
f1000f00068dc3c0 tcp4       0      0  10.8.12.100.37580     10.8.12.100.6090      TIME_WAIT
f1000f00068b03c0 tcp4       0      0  10.8.12.100.37581     10.8.12.100.6090      TIME_WAIT
f1000f00068c43c0 tcp4       0      0  10.8.12.100.37582     10.8.12.100.6090      TIME_WAIT
f1000f0006891bc0 tcp4       0      0  10.8.12.100.37583     10.8.12.100.6090      TIME_WAIT
f1000f0004ad73c0 tcp4       0      0  10.8.12.100.37584     10.8.12.100.6090      TIME_WAIT
f1000f0004af03c0 tcp4       0      0  10.8.12.100.6090      10.8.12.100.37585     TIME_WAIT
f1000f00068583c0 tcp4       0      0  10.8.12.100.37586     10.8.12.100.6090      TIME_WAIT
f1000f00068563c0 tcp4       0      0  10.8.12.100.37587     10.8.12.100.6090      TIME_WAIT
f1000f0006855bc0 tcp4       0      0  10.8.12.100.37588     10.8.12.100.6090      TIME_WAIT
f1000f0006859bc0 tcp4       0      0  10.8.12.100.37589     10.8.12.100.6090      TIME_WAIT
f1000f0004acdbc0 tcp4       0      0  10.8.12.100.37590     10.8.12.100.6090      TIME_WAIT
f1000f0004afc3c0 tcp4       0      0  10.8.12.100.37591     10.8.12.100.6090      TIME_WAIT
f1000f00068553c0 tcp4       0      0  10.8.12.100.37592     10.8.12.100.6090      TIME_WAIT
f1000f00068af3c0 tcp4       0      0  10.8.12.100.37593     10.8.12.100.6090      TIME_WAIT
f1000f00068593c0 tcp4       0      0  10.8.12.100.37594     10.8.12.100.6090      TIME_WAIT
f1000f0006854bc0 tcp4       0      0  10.8.12.100.6090      10.8.12.100.37595     TIME_WAIT
f1000f0004ac6bc0 tcp4       0      0  10.8.12.100.37596     10.8.12.100.6090      TIME_WAIT
f1000f0004b083c0 tcp4       0      0  10.8.12.100.6090      10.8.12.100.37597     TIME_WAIT
f1000f0004add3c0 tcp4       0      0  10.8.12.100.37598     10.8.12.100.6090      TIME_WAIT
f1000f0004ac63c0 tcp4       0      0  10.8.12.100.37599     10.8.12.100.6090      TIME_WAIT
f1000f0004ac73c0 tcp4       0      0  10.8.12.100.37600     10.8.12.100.6090      TIME_WAIT
f1000f0004b043c0 tcp4       0      0  10.8.12.100.37601     10.8.12.100.6090      TIME_WAIT
f1000f00006453c0 tcp4       0      0  10.8.12.100.37602     10.8.12.100.6090      TIME_WAIT
f1000f00006e3bc0 tcp4       0      0  10.8.12.100.37603     10.8.12.100.6090      TIME_WAIT
f1000f0006890bc0 tcp4       0      0  10.8.12.100.37604     10.8.12.100.6090      TIME_WAIT
f1000f00068543c0 tcp4       0      0  10.8.12.100.6090      10.8.12.100.37605     TIME_WAIT
f1000f0004afe3c0 tcp4       0      0  10.8.12.100.37606     10.8.12.100.6090      TIME_WAIT
f1000f0004ac7bc0 tcp4       0      0  10.8.12.100.37607     10.8.12.100.6090      TIME_WAIT
f1000f00068903c0 tcp4       0      0  10.8.12.100.37608     10.8.12.100.6090      TIME_WAIT
f1000f000684b3c0 tcp4       0      0  10.8.12.100.6090      10.8.12.100.37609     TIME_WAIT
f1000f00068c3bc0 tcp4       0      0  10.8.12.100.37618     10.8.12.100.6090      TIME_WAIT
f1000f00068e03c0 tcp4       0      0  10.8.12.100.37619     10.8.12.100.6090      TIME_WAIT
f1000f0006892bc0 tcp4       0      0  10.8.12.100.37621     10.8.12.100.6090      TIME_WAIT
f1000f000685a3c0 tcp4       0      0  10.8.12.100.37622     10.8.12.100.6090      TIME_WAIT
f1000f0004ad03c0 tcp4       0      0  10.8.12.100.37623     10.8.12.100.6090      TIME_WAIT
f1000f0004af6bc0 tcp4       0      0  10.8.12.100.6090      10.8.12.100.37624     TIME_WAIT
f1000f00068db3c0 tcp6       0      0  *.6090                .                   LISTEN
f1000f00068573c0 tcp4       0      0  *.6090                .                   LISTEN
f1000f00069f9808 stream      0      0 f1000b02a531a820                0                0                0 /tmp/.s.PGSQL.6090
[root@sys854-vios1]#

The database now appeared to be running and stable.

[root@sys854-vios1]# su - vpgadmin -c "export
LIBPATH=/usr/ios/db/postgres15/lib; /usr/ios/db/postgres15/bin/pg_ctl status -D /var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG"
pg_ctl: server is running (PID: 14811394)
/usr/ios/db/postgres15/bin/postgres "-D"
"/var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG" "-p" "6090"

The /home/ios/logs/vdba.log file showed that the database was RUNNING. Previously, this file had been empty, even when we had tried to start the database (unsuccessfully) numerous times during our troubleshooting process.

[root@sys854-vios1]# tail /home/ios/logs/vdba.log
VDBA ENDING: PID = 5046600 | Completed at: 11-Sep-2024 07:57:09 | rc = 0
[12714460   11-Sep-2024 11:57:09] dba_db.c               1.5.2.6         stop_op_handler              @   560 : CM database is now STOPPED.
VDBA ENDING: PID = 12714460 | Completed at: 11-Sep-2024 11:57:09 | rc = 0
VDBA STARTING: PID = 15401420 | Started at: 11-Sep-2024 12:21:27
[15401420   11-Sep-2024 12:21:27] vdba.c                 1.3.1.4         get_valid_input              @   263 : argv[1] = -cm
[15401420   11-Sep-2024 12:21:27] vdba.c                 1.3.1.4         get_valid_input              @   263 : argv[2] = -start
[15401420   11-Sep-2024 12:21:27] vdba.c                 1.3.1.4         get_valid_input              @   409 : @@@ op = 'OP_START', db = 'DB_CM'
[15401420   11-Sep-2024 12:21:27] dba_db.c               1.5.2.6         vdba_popen                   @   618 : Incoming cmd='su - vpgadmin -c "export LIBPATH=/usr/ios/db/postgres15/lib; /usr/ios/db/postgres15/bin/pg_ctl start -w -D /home/ios/CM/DB"'
[15401420   11-Sep-2024 12:21:27] dba_db.c               1.5.2.6         start_op_handler             @   491 : CM database is now RUNNING.
VDBA ENDING: PID = 15401420 | Completed at: 11-Sep-2024 12:21:27 | rc = 0

We were able to run a couple of Postgres psql commands to test connections to the database. They were all successful. This was good news.

[root@sys854-vios1]# /usr/ios/db/bin/psql -U vpgadmin -l -p 6080 -P pager=off
                                             List of databases
 Name    |   Owner   | Encoding | Collate | Ctype | ICU Locale | Locale Provider |   Access privileges
-----------+-----------+----------+---------+-------+------------+-----------------+-----------------------
 postgres  | vpgadmin  | LATIN1   | en_US   | en_US |            | libc            |
 template0 | vpgadmin  | LATIN1   | en_US   | en_US |            | libc            | =c/vpgadmin          +
           |           |          |         |       |            |                 | vpgadmin=CTc/vpgadmin
 template1 | vpgadmin  | LATIN1   | en_US   | en_US |            | libc            | =c/vpgadmin          +
           |           |          |         |       |            |                 | vpgadmin=CTc/vpgadmin
 vioscm    | viosadmin | LATIN1   | en_US   | en_US |            | libc            |
(4 rows)

[root@sys854-vios1]# export LIBPATH=/usr/ios/db/postgres15/lib

[root@sys854-vios1]# export PGPASSWORD=_`lsattr -El vioscluster0 -a cluster_id -F value`

[root@sys854-vios1]# /usr/ios/db/postgres15/bin/psql -p 6090 -U viosadmin -d postgres -c 'SELECT datname, datallowconn FROM pg_database;'

  datname  | datallowconn
-----------+--------------
 postgres  | t
 vios      | t
 template1 | t
 template0 | f
(4 rows)

[root@sys854-vios1]# /usr/ios/db/bin/psql -U vpgadmin -d vios -p 6090 -P pager=off
psql (15.3)
Type "help" for help.

vios=# \db
       List of tablespaces
    Name    |  Owner   | Location
------------+----------+----------
 pg_default | vpgadmin |
 pg_global  | vpgadmin |

(2 rows)
vios=# \dn
      List of schemas
  Name  |       Owner
--------+-------------------
 public | pg_database_owner
 vios   | viosadmin
(2 rows)

vios=# \dt vios.*
                  List of relations
 Schema |          Name          | Type  |   Owner
--------+------------------------+-------+-----------
 vios   | capability             | table | viosadmin
 vios   | client_partition       | table | viosadmin
 vios   | clientimage            | table | viosadmin
 vios   | clientop               | table | viosadmin
 vios   | cluster_version        | table | viosadmin
 vios   | disk                   | table | viosadmin
 vios   | event                  | table | viosadmin
 vios   | event_attribute        | table | viosadmin
 vios   | failure_group          | table | viosadmin
 vios   | file                   | table | viosadmin
 vios   | fileset                | table | viosadmin
 vios   | filetotier             | table | viosadmin
 vios   | importpv               | table | viosadmin
 vios   | lu                     | table | viosadmin
 vios   | map                    | table | viosadmin
 vios   | master_image           | table | viosadmin
 vios   | node_version           | table | viosadmin
 vios   | nodecaching            | table | viosadmin
 vios   | partition              | table | viosadmin
 vios   | resource               | table | viosadmin
 vios   | sp_threshold_alert     | table | viosadmin
 vios   | storage_pool           | table | viosadmin
 vios   | tier                   | table | viosadmin
 vios   | tier_threshold_alert   | table | viosadmin
 vios   | transaction            | table | viosadmin
 vios   | vios_cluster           | table | viosadmin
 vios   | virtual_log            | table | viosadmin
 vios   | virtual_log_repository | table | viosadmin
 vios   | virtual_server_adapter | table | viosadmin
(29 rows)

vios=# \q
[root@sys854-vios1]#

The real test was to run the cluster -status command and see if it would report the status of the cluster successfully.

[padmin@sys854-vios1]$ cluster -status
Cluster Name         State
sys854sspcluster     OK
    Node Name        MTM           Partition Num  State  Pool State
    sys854-vios1     8286-41A0206C5315         1  OK     OK

It worked!

And the people rejoiced and there was singing in the streets..... ☺

The lu command successfully displayed all of the SSP volumes.

[padmin@sys854-vios1]$ lu -list
POOL_NAME: sys854ssp
TIER_NAME: SYSTEM
LU_NAME                 SIZE(MB)    UNUSED(MB)  UDID
an70lpar1-rootvg        20480       15510       1880f7e6b63f621e0528bbfcbde91872
an70lpar2-rootvg        20480       16160       048362b608a143c723fb0d23a4f623d5
an70lpar3-rootvg        20480       16160       58843c5f94233028f17dfdfe9345fb46
an70lpar4-rootvg        20480       15468       3036fe48792a0471d56cf9c8af19c54e

We were then able to configure the vhostX devices.

[padmin@sys854-vios1]$ cfgdev
[padmin@sys854-vios1]$ lsmap -all
SVSA            Physloc                                      Client Partition ID
--------------- -------------------------------------------- ------------------
vhost0          U8286.41A.06C5315-V1-C50                     0x00000002
VTD                   vcd0
Status                Available
LUN                   0x8100000000000000
Backing device
Physloc
Mirrored              N/A
VTD                   vtscsi0
Status                Available
LUN                   0x8200000000000000
Backing device        an70lpar1-rootvg.1880f7e6b63f621e0528bbfcbde91872
Physloc
Mirrored              N/A
SVSA            Physloc                                      Client Partition ID
--------------- -------------------------------------------- ------------------
vhost1          U8286.41A.06C5315-V1-C54                     0x00000004
VTD                   vcd3
Status                Available
LUN                   0x8100000000000000
Backing device
Physloc
Mirrored              N/A
VTD                   vtscsi3
Status                Available
LUN                   0x8200000000000000
Backing device        an70lpar4-rootvg.3036fe48792a0471d56cf9c8af19c54e
Physloc
Mirrored              N/A
SVSA            Physloc                                      Client Partition ID
--------------- -------------------------------------------- ------------------
vhost2          U8286.41A.06C5315-V1-C52                     0x00000003
VTD                   vcd1
Status                Available
LUN                   0x8100000000000000
Backing device
Physloc
Mirrored              N/A
VTD                   vtscsi1
Status                Available
LUN                   0x8200000000000000
Backing device        an70lpar2-rootvg.048362b608a143c723fb0d23a4f623d5
Physloc
Mirrored              N/A
SVSA            Physloc                                      Client Partition ID
--------------- -------------------------------------------- ------------------
vhost3          U8286.41A.06C5315-V1-C53                     0x00000005
VTD                   vcd2
Status                Available
LUN                   0x8100000000000000
Backing device
Physloc
Mirrored              N/A
VTD                   vtscsi2
Status                Available
LUN                   0x8200000000000000
Backing device        an70lpar3-rootvg.58843c5f94233028f17dfdfe9345fb46
Physloc
Mirrored              N/A
[padmin@sys854-vios1]$

Finally, we successfully started all of the LPARs.

The HMC UI also reported a healthy SSP cluster.


This brings us to the end of our mysterious tale. What started out as an error message stating 0967-176 Resources are busy turned out to be a problem with an empty database configuration file. It would have been really nice if one of the log files, somewhere, had given us a clue that the DB config file was corrupt or empty. We, of course, shared this thought with the VIOS SSP development team.
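In the meantime, a trivial preventative check like this one (our own addition, not an IBM-supplied tool) would at least flag an empty or missing SSP DB configuration file before hours are spent chasing other symptoms; adjust the pool path for your own cluster:

[root@sys854-vios1]# F=/var/vio/SSP/sys854sspcluster/D_E_F_A_U_L_T_061310/VIOSCFG/DB/PG/postgresql.conf
[root@sys854-vios1]# [ -s "$F" ] && echo "postgresql.conf looks OK" || echo "WARNING: postgresql.conf is missing or empty!"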

Of course, most people will never need to know or care about the SSP database or the fact that a Postgres database is running underneath. We were just unlucky. The power outage likely caused some sort of corruption (although we can't be 100% certain). In other words, most VIOS SSP users will never need to read this blog post. We're thankful for this.

Until next time.

FYI: Here's a copy of our SSP postgres config file (for reference): https://gibsonnet.net/aix/ibm/postgresql_SSP.conf.txt

Note: The SSP DB has changed over time. For all SSP clusters before VIOS 2.2.6.31 the database was SolidDB, but later versions have been migrated to PostgreSQL. So, if all of the nodes in your cluster are reporting as ON_LEVEL with a version older than 2.2.6.31, the DB should be SolidDB. If all of your nodes are ON_LEVEL with version 2.2.6.31 or higher, it should be a PostgreSQL DB.

You can check this with the ps command: for SolidDB, you'll see a "soliddb" process owned by the root user; for PostgreSQL, you'll see multiple "postgres" processes owned by the vpgadmin user.

Here are the port numbers for each DB: for SolidDB, 3801/tcp (marked as unassigned in "/etc/services"); for PostgreSQL, 6080/tcp and 6090/tcp (not in "/etc/services").
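For example, on a PostgreSQL-based cluster like ours, a quick check looks like this (the SolidDB equivalent is the same idea, using the soliddb process name and port 3801):

[root@sys854-vios1]# ps -fu vpgadmin | grep postgres
[root@sys854-vios1]# netstat -Aan | egrep '6080|6090'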

More information here: https://www.ibm.com/support/pages/how-setup-vios-firewall-ssp-cluster-environment

The Implementing IBM VM Recovery Manager for IBM Power Systems Redbook contains some more information about the SSP database, in the "The SSP cluster and HSDB database" section of the book (pages 122 to 142): https://www.redbooks.ibm.com/abstracts/sg248426.html