Investigating AIX system crashes with minidump
The AIX minidump facility was introduced with AIX 5.3 TL3. A minidump is a small compressed dump that is stored to NVRAM when a system crashes or a dump is initiated, and then written to the AIX error log on reboot. It can be used to see some of the system's state and do some debugging when a full dump is not available. It can also be used to get a quick snapshot of a crash without having to transfer the entire dump from the crashed system to IBM support.
Please refer to the following official guide, "How to examine a minidump in AIX".
"Using this crash stack IBM support personnel can then search through the database to find what the fault may mean". "The RAS effort mentioned..is part of an ongoing effort by AIX to increase stability and to make more information available for troubleshooting when a problem occurs. The ability to look at minidump data has helped solve many issues that would otherwise go unresolved".
http
Of course, minidumps have their limitations and do not replace the need for a full system dump in many cases.
"Limitations of minidumps
Here are some examples of how I've used minidump to assist me (and IBM support) in diagnosing the root cause of a system crash.
The first example is from a customer that found that one of their AIX partitions would crash when they ran optmem DPO (Dynamic Platform Optimiser) against one of their new POWER8 E880s. Immediately after optmem ran, one specific LPAR would crash. The LPAR reference LED code would show "888 102 300 C20". This LPAR was installed with AIX 7.1 TL3 SP4. When we tried to restart the partition, it would crash (and dump) several times before it would start successfully. Once the partition booted successfully, we noticed that there were several system dumps (SYSDUMP) and minidumps (COMPRESSED MINIMAL DUMP) shown in the AIX error report.
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
A6DF45AA   0411163117 I O RMCdaemon      The daemon is started.
2BFA76F6   0411163117 T S SYSPROC        SYSTEM SHUTDOWN BY USER
9DBCFDEE   0411163117 T O errdemon       ERROR LOGGING TURNED ON
192AC071   0411162917 T O errdemon       ERROR LOGGING TURNED OFF
A6DF45AA   0411162517 I O RMCdaemon      The daemon is started.
2BFA76F6   0411162517 T S SYSPROC        SYSTEM SHUTDOWN BY USER
9DBCFDEE   0411162517 T O errdemon       ERROR LOGGING TURNED ON
192AC071   0411162217 T O errdemon       ERROR LOGGING TURNED OFF
A6DF45AA   0411161117 I O RMCdaemon      The daemon is started.
67145A39   0411161017 U S SYSDUMP        SYSTEM DUMP
F48137AC   0411161017 U O minidump       COMPRESSED MINIMAL DUMP    << Minidump information
9D035E4D   0411161017 P S SYSVMM         DATA STORAGE INTERRUPT, PROCESSOR
9DBCFDEE   0411161017 T O errdemon       ERROR LOGGING TURNED ON
A6DF45AA   0411155917 I O RMCdaemon      The daemon is started.
67145A39   0411155817 U S SYSDUMP        SYSTEM DUMP
F48137AC   0411155817 U O minidump       COMPRESSED MINIMAL DUMP    << Minidump information
9D035E4D   0411155817 P S SYSVMM         DATA STORAGE INTERRUPT, PROCESSOR
9DBCFDEE   0411155817 T O errdemon       ERROR LOGGING TURNED ON
A6DF45AA   0411155717 I O RMCdaemon      The daemon is started.
67145A39   0411155717 U S SYSDUMP        SYSTEM DUMP
F48137AC   0411155717 U O minidump       COMPRESSED MINIMAL DUMP    << Minidump information
9D035E4D   0411155717 P S SYSVMM         DATA STORAGE INTERRUPT, PROC
9DBCFDEE   0411155717 T O errdemon       ERROR LOGGING TURNED ON
6AEB31F5   0410143317 I H sysplanar0     Platform Resource Reas
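On a busy system the dump entries can be buried among routine messages. The following is a minimal Python sketch, not an IBM tool, that picks the SYSDUMP and minidump rows out of an error report summary. It assumes the six-column layout shown above and that the summary text has already been saved or captured; the names ErrptEntry, parse_errpt_summary and crash_entries are made up for this illustration.

```python
from typing import NamedTuple


class ErrptEntry(NamedTuple):
    identifier: str
    timestamp: str    # MMDDhhmmYY, as printed in the summary
    err_type: str     # "T" column
    err_class: str    # "C" column
    resource: str
    description: str


def parse_errpt_summary(text):
    """Parse error report summary text into ErrptEntry tuples."""
    entries = []
    for line in text.splitlines():
        # Split into at most six fields so the description keeps its spaces.
        fields = line.strip().split(None, 5)
        if len(fields) < 6 or fields[0] == "IDENTIFIER":
            continue  # skip the header row and blank or short lines
        entries.append(ErrptEntry(*fields))
    return entries


def crash_entries(entries):
    """Keep only the full system dump and minidump entries."""
    return [e for e in entries if e.resource in ("SYSDUMP", "minidump")]


# Sample rows copied from the error report above.
SAMPLE = """\
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
A6DF45AA   0411161117 I O RMCdaemon      The daemon is started.
67145A39   0411161017 U S SYSDUMP        SYSTEM DUMP
F48137AC   0411161017 U O minidump       COMPRESSED MINIMAL DUMP
9D035E4D   0411161017 P S SYSVMM         DATA STORAGE INTERRUPT, PROCESSOR
"""

if __name__ == "__main__":
    for entry in crash_entries(parse_errpt_summary(SAMPLE)):
        print(entry.timestamp, entry.resource, entry.description)
```

Matching on the resource name rather than the identifier is a choice made for readability here; filtering on the identifiers shown above (67145A39 and F48137AC) would work equally well for this report.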
I used the minidump reporting tool (mdmprpt) to perform a quick analysis of the cause of the crash, by reviewing the "Stack Trace" information. The mdmprpt tool has a couple of options.
# mdmprpt -?
Usage: mdmprpt [[-l seq_no] [-i filename]] | [ -b filename ] [ -F ] [-r]
Process error log entries from the supplied file(s).
  -l seq_no     Format the minidump at the specified sequence number in the error log.
  -i filename   Uses the error log file specified by the filename param.
  -b filename   Uses the binary file specified by the filename param.
  -F            prints only dump failure information.
  -r            print the raw minidump, without formatting.
If no parameters are specified, the most recently logged minidump is shown. You may not specify -b with -i or -l.
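If you script around mdmprpt, the option rules in the usage text are easy to encode up front. Here is a small Python sketch that builds the argument list and rejects the one combination the usage message forbids (-b together with -i or -l). The function name build_mdmprpt_args and its parameter names are hypothetical; only the flags themselves come from the usage output above.

```python
def build_mdmprpt_args(seq_no=None, errlog_file=None, binary_file=None,
                       failures_only=False, raw=False):
    """Return an argv list for mdmprpt, based on the usage text above."""
    # The usage message states that -b may not be combined with -i or -l.
    if binary_file is not None and (seq_no is not None or errlog_file is not None):
        raise ValueError("-b cannot be combined with -i or -l")

    args = ["mdmprpt"]
    if seq_no is not None:
        args += ["-l", str(seq_no)]      # specific error log sequence number
    if errlog_file is not None:
        args += ["-i", errlog_file]      # alternate error log file
    if binary_file is not None:
        args += ["-b", binary_file]      # binary minidump file
    if failures_only:
        args.append("-F")                # dump failure information only
    if raw:
        args.append("-r")                # raw, unformatted minidump
    return args


if __name__ == "__main__":
    # With no options, mdmprpt shows the most recently logged minidump.
    print(build_mdmprpt_args())
    print(build_mdmprpt_args(seq_no=42, raw=True))
```

On an AIX host the resulting list could be handed to something like Python's subprocess.run with stdout pointed at a file, which mirrors the shell redirection used in this article.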
The most common usage is simply to run mdmprpt and redirect the output to a file for review.
# mdmprpt > /tmp
The most relevant section of the report is the "Stack Trace" section(s).
# vi /tmp
MINIDUMP VERSION 4D33 **** 64-bit Kernel, 57 Entries
Last Error Log Entry:
  Error ID: 9D035E4D  Resource Name: SYSVMM
  Detail Data: 0000000000000000 00007FFFFFFFD080 00000007000000C7 0000000000000115
Symptom Information:
  Crash Location: [00000000000ADE98] trc_generate_ea+78
  Component: COMP  Exception Type: 277
Data From CPU #1 (Faulting CPU) ****
...
Stack Trace:
  [00000000000ADE98] trc_generate_ea+78
    0000000000000002 F1000A002033C000 C0C0C0C0C0C0C0C1 8000002841900000
    00000000025BBC00 00000000000034C8 00000000009F3998 0000000003322000
  [00000000000ACF64] trc_inmem_recor+144
    0000000000000000 F1000815B0052D00 0000000000007800 000000000000F000
  ...
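The function names in the stack trace are what you will end up searching for, so it can help to pull them out of the report automatically. Below is a rough Python sketch that does this for the first "Stack Trace:" section of a saved report. It assumes frames look like "[16 hex digits] symbol+hexoffset", as in the excerpt above, and that the next "Data From CPU" heading ends the section; both are inferences from this one report, not a documented format. The names Frame and first_stack_trace are invented for the example.

```python
import re
from typing import NamedTuple


class Frame(NamedTuple):
    address: str
    symbol: str
    offset: int


# "[address] symbol+offset", with address and offset in hexadecimal.
FRAME_RE = re.compile(r"\[([0-9A-Fa-f]{16})\]\s+(\S+?)\+([0-9A-Fa-f]+)")


def first_stack_trace(report_text):
    """Return the frames of the first Stack Trace section in the report."""
    _, marker, rest = report_text.partition("Stack Trace:")
    if not marker:
        return []
    # Assumption: a new "Data From CPU" heading starts the next CPU's data.
    rest = rest.split("Data From CPU", 1)[0]
    return [Frame(addr, sym, int(off, 16))
            for addr, sym, off in FRAME_RE.findall(rest)]


# Excerpt copied from the report above.
SAMPLE = """\
Symptom Information:
  Crash Location: [00000000000ADE98] trc_generate_ea+78
Data From CPU #1 (Faulting CPU)
Stack Trace:
  [00000000000ADE98] trc_generate_ea+78
    0000000000000002 F1000A002033C000 C0C0C0C0C0C0C0C1 8000002841900000
  [00000000000ACF64] trc_inmem_recor+144
    0000000000000000 F1000815B0052D00 0000000000007800 000000000000F000
"""

if __name__ == "__main__":
    for frame in first_stack_trace(SAMPLE):
        print(frame.symbol)
```

The symbol names printed this way can be pasted straight into an APAR or web search.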
In this case, the stack trace showed the functions trc_generate_ea and trc_inmem_recor. Using these names, I was able to find several hits (both internally to IBM and using my preferred internet search tool) for the potential cause of the crash. The problem was a known issue, related to the following APAR.
IV69116: AIX CAN CRASH IBM POWER8 E880 (9119-MHE) WITH > THAN 128 CORES APPLIES TO AIX 7100-03 http
We verified our findings with IBM support (who performed their own internal search) and we concluded that an update would be required. We chose to update the AIX system to AIX 7.1 TL4 SP3, and the problem went away.
In the next example, the customer had updated one of their AIX partitions to AIX 7.1 TL4 SP3 (previously on AIX 7.1 TL4 SP1). This LPAR also housed a single AIX 5.3 versioned WPAR. The update was successful, but whenever they ran the stopwpar command, the entire partition would crash (dump). So, once again, I employed the mdmprpt utility to check the stack traces.
MINIDUMP VERSION 4D33 **** 64-bit Kernel, 48 Entries
Last Error Log Entry:
  Error ID: 9D035E4D  Resource Name: SYSVMM
  Detail Data: 000000000A000000 00000000C000C400 F1000000C01ED500 0000000000000086
Symptom Information:
  Crash Location: [F1000000C012B968] pcmUserGetDevIn+E8
  Component: COMP  Exception Type: 134
...
Stack Trace:
  [F1000000C012B968] pcmUserGetDevIn+E8
    000000002FF44EAC 00000000DEADBEEF 00000000A28F47CB 0000000000000005
    F1000A000036A500 0000000000010000 F1000000C01D3000 000000002FEF56E0
  [F1000000C01D1808] sddUserInterfac+588
    0000000000000002 00000000DEADBEEF 00000000A28F47CB 0000000000000088
  ...
The report showed us pcmUserGetDevIn and sddUserInterfac in the stack trace. Both are related to the IBM sddpcm device driver, and they appeared to be the likely culprits. Searching for these names turned up a couple of potential reasons for the crash, such as too many paths configured for an sddpcm device (we checked and confirmed that the path count was far below the maximum of 32).
So, my next question was, what version of sddpcm was installed? We found that the customer had (unexpectedly) two different versions of sddpcm installed. In the Global LPAR, the sddpcm version was 2.6.9.0 (the latest) and in the vWPAR, 2.6.6.0. During the TL update, the customer had updated sddpcm to the latest level available, but had forgotten to update sddpcm inside the vWPAR as well.
We then discovered that simply running "pcmpath query device" from inside the vWPAR would crash the partition. The latest sddpcm readme file contained the following information: "5557 Fix for AIX 7.1 crash during pcmpath query device"! This was indeed the cause of the system crash. After the customer updated sddpcm inside the vWPAR, the system was stable and the problem was resolved.
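A mismatch like this is easy to miss by eye, so here is a tiny Python sketch of the comparison. It assumes you have already collected the installed level string from each environment by whatever means you prefer; the dictionary keys and the function names parse_level and find_older_levels are purely illustrative.

```python
def parse_level(level):
    """Turn a dotted level such as '2.6.9.0' into a comparable tuple."""
    return tuple(int(part) for part in level.split("."))


def find_older_levels(levels_by_env):
    """Return the environments whose level is below the newest one seen."""
    newest = max(parse_level(level) for level in levels_by_env.values())
    return {env: level for env, level in levels_by_env.items()
            if parse_level(level) < newest}


if __name__ == "__main__":
    # The two sddpcm levels from this example.
    levels = {"Global LPAR": "2.6.9.0", "vWPAR": "2.6.6.0"}
    for env, level in find_older_levels(levels).items():
        print(env, "is behind at", level)
```

Comparing integer tuples rather than raw strings avoids the trap where "2.6.10.0" would sort before "2.6.9.0".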
If you encounter a system crash in the future, and there's minidump data available, why not consider using the minidump reporting tool to analyse the issue? It might just help speed up the root cause analysis of the problem.