|  |  | 
|  | Firmware-Assisted Dump | 
|  | ------------------------ | 
|  | July 2011 | 
|  |  | 
|  | The goal of firmware-assisted dump is to enable the dump of | 
|  | a crashed system, and to do so from a fully-reset system, and | 
|  | to minimize the total elapsed time until the system is back | 
|  | in production use. | 
|  |  | 
|  | - Firmware assisted dump (fadump) infrastructure is intended to replace | 
|  | the existing phyp assisted dump. | 
|  | - Fadump uses the same firmware interfaces and memory reservation model | 
|  | as phyp assisted dump. | 
|  | - Unlike phyp dump, fadump exports the memory dump through /proc/vmcore | 
|  | in the ELF format in the same way as kdump. This helps us reuse the | 
|  | kdump infrastructure for dump capture and filtering. | 
|  | - Unlike phyp dump, userspace tools do not need to refer to any sysfs | 
|  | interface while reading /proc/vmcore. | 
|  | - Unlike phyp dump, fadump allows the user to release all the memory reserved | 
|  | for dump with a single operation: echo 1 > /sys/kernel/fadump_release_mem. | 
|  | - Once enabled through kernel boot parameter, fadump can be | 
|  | started/stopped through /sys/kernel/fadump_registered interface (see | 
|  | sysfs files section below) and can be easily integrated with kdump | 
|  | service start/stop init scripts. | 
|  |  | 
|  | Compared with kdump or other strategies, firmware-assisted | 
|  | dump offers several strong, practical advantages: | 
|  |  | 
|  | -- Unlike kdump, the system has been reset, and loaded | 
|  | with a fresh copy of the kernel.  In particular, | 
|  | PCI and I/O devices have been reinitialized and are | 
|  | in a clean, consistent state. | 
|  | -- Once the dump is copied out, the memory that held the dump | 
|  | is immediately available to the running kernel. Therefore, | 
|  | unlike kdump, fadump doesn't need a second reboot to get the | 
|  | system back to the production configuration. | 
|  |  | 
|  | The above can only be accomplished by coordination with, | 
|  | and assistance from the Power firmware. The procedure is | 
|  | as follows: | 
|  |  | 
|  | -- The first kernel registers the sections of memory with the | 
|  | Power firmware for dump preservation during OS initialization. | 
|  | These registered sections of memory are reserved by the first | 
|  | kernel during early boot. | 
|  |  | 
|  | -- When a system crashes, the Power firmware will save | 
|  | the low memory (boot memory, whose size is the larger of 5% of | 
|  | system RAM or 256MB) to the previously registered region. It will | 
|  | also save system registers and hardware PTEs. | 
|  |  | 
|  | NOTE: The term 'boot memory' means the size of the low memory chunk | 
|  | that is required for a kernel to boot successfully when | 
|  | booted with restricted memory. By default, the boot memory | 
|  | size will be the larger of 5% of system RAM or 256MB. | 
|  | Alternatively, the user can specify the boot memory size | 
|  | through the boot parameter 'crashkernel=', which overrides | 
|  | the default calculated size. Use this option if the default | 
|  | boot memory size is not sufficient for the second kernel to | 
|  | boot successfully. For the syntax of the crashkernel= parameter, | 
|  | refer to Documentation/kdump/kdump.txt. If any offset is | 
|  | provided in the crashkernel= parameter, it will be ignored, | 
|  | as fadump uses a predefined offset to reserve memory | 
|  | for boot memory dump preservation in case of a crash. | 
|  |  | 
|  | -- After the low memory (boot memory) area has been saved, the | 
|  | firmware will reset PCI and other hardware state.  It will | 
|  | *not* clear the RAM. It will then launch the bootloader, as | 
|  | normal. | 
|  |  | 
|  | -- The freshly booted kernel will notice that there is a new | 
|  | node (ibm,dump-kernel) in the device tree, indicating that | 
|  | there is crash data available from a previous boot. During | 
|  | early boot, the OS will reserve the rest of the memory above | 
|  | the boot memory size, effectively booting with a restricted | 
|  | memory size. This ensures that the second kernel will not | 
|  | touch any of the dump memory area. | 
|  |  | 
|  | -- User-space tools will read /proc/vmcore to obtain the contents | 
|  | of memory, which holds the previously crashed kernel's dump in ELF | 
|  | format. The userspace tools may copy this info to disk or over | 
|  | the network (NAS, SAN, iSCSI, etc.) as desired. | 
|  |  | 
|  | -- Once the userspace tool is done saving the dump, it will echo | 
|  | '1' to /sys/kernel/fadump_release_mem to release the reserved | 
|  | memory back to general use, except the memory required for | 
|  | the next firmware-assisted dump registration. | 
|  |  | 
|  | e.g. | 
|  | # echo 1 > /sys/kernel/fadump_release_mem | 
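The last two steps above (save /proc/vmcore, then release the reserved
memory) can be sketched as a small shell helper. This is only an
illustration: fadump_save_and_release is a hypothetical name, plain cp
stands in for whatever capture/filtering tool the kdump scripts actually
use, and all paths are parameters so the logic can be tried on ordinary
files.

```shell
# Sketch only. fadump_save_and_release is a hypothetical helper, not part
# of any kdump package. Arguments:
#   $1  dump source       (defaults to /proc/vmcore)
#   $2  destination file  (required)
#   $3  release control   (defaults to /sys/kernel/fadump_release_mem)
fadump_save_and_release() {
    vmcore=${1:-/proc/vmcore}
    dest=$2
    release=${3:-/sys/kernel/fadump_release_mem}

    # No readable vmcore: this boot did not follow a crash, nothing to save.
    [ -r "$vmcore" ] || { echo "no dump available" >&2; return 1; }

    # Copy the dump out first; release memory only if the copy succeeded.
    cp "$vmcore" "$dest" || { echo "failed to save dump" >&2; return 2; }

    # Hand the reserved dump memory back to the running kernel.
    echo 1 > "$release"
}
```

Releasing only after a successful copy matters because the dump contents
are no longer preserved once the reserved memory is handed back.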
|  |  | 
|  | Please note that the firmware-assisted dump feature | 
|  | is only available on Power6 and above systems with recent | 
|  | firmware versions. | 
|  |  | 
|  | Implementation details: | 
|  | ---------------------- | 
|  |  | 
|  | During boot, a check is made to see if firmware supports | 
|  | this feature on that particular machine. If it does, then | 
|  | we check to see if an active dump is waiting for us. If so, | 
|  | everything except the boot memory size of RAM is reserved during | 
|  | early boot (see Fig. 2). This area is released once the userland | 
|  | scripts (e.g. kdump scripts) have finished collecting the dump. | 
|  | If there is dump data, then the | 
|  | /sys/kernel/fadump_release_mem file is created, and the reserved | 
|  | memory is held. | 
|  |  | 
|  | If there is no waiting dump data, then only the memory required | 
|  | to hold CPU state, HPTE region, boot memory dump and elfcore | 
|  | header is reserved, usually at an offset greater than the boot | 
|  | memory size (see Fig. 1). This area is *not* released: this region | 
|  | will be kept permanently reserved, so that it can act as a receptacle | 
|  | for a copy of the boot memory content in addition to CPU state | 
|  | and HPTE region, in case a crash does occur. | 
|  |  | 
|  | o Memory Reservation during first kernel | 
|  |  | 
|  | Low memory                                         Top of memory | 
|  | 0      boot memory size                                       | | 
|  | |           |                |<--Reserved dump area -->|      | | 
|  | V           V                |   Permanent Reservation |      V | 
|  | +-----------+----------/ /---+---+----+-----------+----+------+ | 
|  | |           |                |CPU|HPTE|  DUMP     |ELF |      | | 
|  | +-----------+----------/ /---+---+----+-----------+----+------+ | 
|  |       |                                           ^ | 
|  |       |                                           | | 
|  |       \                                           / | 
|  |        ------------------------------------------- | 
|  |         Boot memory content gets transferred to | 
|  |         reserved area by firmware at the time of | 
|  |         crash | 
|  |                  Fig. 1 | 
|  |  | 
|  | o Memory Reservation during second kernel after crash | 
|  |  | 
|  | Low memory                                        Top of memory | 
|  | 0      boot memory size                                       | | 
|  | |           |<------------- Reserved dump area ----------- -->| | 
|  | V           V                                                 V | 
|  | +-----------+----------/ /---+---+----+-----------+----+------+ | 
|  | |           |                |CPU|HPTE|  DUMP     |ELF |      | | 
|  | +-----------+----------/ /---+---+----+-----------+----+------+ | 
|  |       |                                              | | 
|  |       V                                              V | 
|  |  Used by second                                /proc/vmcore | 
|  |  kernel to boot | 
|  |                  Fig. 2 | 
|  |  | 
|  | Currently the dump will be copied from /proc/vmcore to | 
|  | a new file upon user intervention. The dump data available through | 
|  | /proc/vmcore will be in ELF format. Hence the existing kdump | 
|  | infrastructure (kdump scripts) to save the dump works fine with | 
|  | minor modifications. | 
|  |  | 
|  | The tools to examine the dump will be the same as the ones | 
|  | used for kdump. | 
|  |  | 
|  | How to enable firmware-assisted dump (fadump): | 
|  | ------------------------------------- | 
|  |  | 
|  | 1. Set config option CONFIG_FA_DUMP=y and build the kernel. | 
|  | 2. Boot into the Linux kernel with the 'fadump=on' kernel cmdline option. | 
|  | 3. Optionally, the user can also set 'crashkernel=' on the kernel cmdline | 
|  | to specify the size of the memory to reserve for boot memory dump | 
|  | preservation. | 
|  |  | 
|  | NOTE: 1. The 'fadump_reserve_mem=' parameter has been deprecated. Instead | 
|  | use 'crashkernel=' to specify the size of the memory to reserve | 
|  | for boot memory dump preservation. | 
|  | 2. If firmware-assisted dump fails to reserve memory, then it | 
|  | will fall back to the existing kdump mechanism if the 'crashkernel=' | 
|  | option is set on the kernel cmdline. | 
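When deciding whether 'crashkernel=' is needed, it can help to estimate
the default boot memory size first. The sketch below only mirrors the
"larger of 5% of system RAM or 256MB" rule stated in this document;
fadump_default_boot_mem_mb is a hypothetical helper, and the size the
kernel actually reserves may differ (for example due to alignment).

```shell
# Sketch only. fadump_default_boot_mem_mb is a hypothetical helper that
# applies the documented rule: larger of 5% of system RAM or 256MB.
#   $1  total system RAM in MB
# Prints the estimated default boot memory size in MB.
fadump_default_boot_mem_mb() {
    total_mb=$1
    five_pct=$(( total_mb * 5 / 100 ))   # integer division, rounds down
    if [ "$five_pct" -gt 256 ]; then
        echo "$five_pct"
    else
        echo 256
    fi
}
```

For example, "fadump_default_boot_mem_mb 4096" prints 256 (5% of 4GB is
below the 256MB floor), while "fadump_default_boot_mem_mb 65536" prints
3276.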
|  |  | 
|  | Sysfs/debugfs files: | 
|  | ------------ | 
|  |  | 
|  | The firmware-assisted dump feature uses the sysfs file system to hold | 
|  | the control files and a debugfs file to display the reserved memory regions. | 
|  |  | 
|  | Here is the list of files under kernel sysfs: | 
|  |  | 
|  | /sys/kernel/fadump_enabled | 
|  |  | 
|  | This is used to display the fadump status. | 
|  | 0 = fadump is disabled | 
|  | 1 = fadump is enabled | 
|  |  | 
|  | This interface can be used by kdump init scripts to identify if | 
|  | fadump is enabled in the kernel and act accordingly. | 
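As an illustration of how an init script might use this file, the
following sketch picks a dump mechanism based on its content.
dump_mechanism is a hypothetical helper name; treating a missing file
as "use kdump" is an assumption of this sketch, not documented
behaviour.

```shell
# Sketch only. dump_mechanism is a hypothetical helper for an init script.
#   $1  status file (defaults to /sys/kernel/fadump_enabled)
# Prints "fadump" if the file reads 1, otherwise "kdump".
dump_mechanism() {
    enabled_file=${1:-/sys/kernel/fadump_enabled}
    if [ -r "$enabled_file" ] && [ "$(cat "$enabled_file")" = "1" ]; then
        echo fadump
    else
        # Assumption: a missing file or a 0 means fadump is not in use.
        echo kdump
    fi
}
```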
|  |  | 
|  | /sys/kernel/fadump_registered | 
|  |  | 
|  | This is used to display the fadump registration status as well | 
|  | as to control (start/stop) the fadump registration. | 
|  | 0 = fadump is not registered. | 
|  | 1 = fadump is registered and ready to handle system crash. | 
|  |  | 
|  | To register fadump, echo 1 > /sys/kernel/fadump_registered; to | 
|  | un-register and stop fadump, echo 0 > /sys/kernel/fadump_registered. | 
|  | Once fadump is un-registered, a system crash will not | 
|  | be handled and the vmcore will not be captured. This interface can be | 
|  | easily integrated with kdump service start/stop. | 
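A service start/stop hook built on this file could look roughly like
the sketch below. fadump_ctl is a hypothetical name and the
start/stop/status verbs are an illustrative convention; only the 0/1
values written to and read from the file come from this document.

```shell
# Sketch only. fadump_ctl is a hypothetical wrapper around the
# fadump_registered control file.
#   $1  action: start | stop | status
#   $2  control file (defaults to /sys/kernel/fadump_registered)
fadump_ctl() {
    action=$1
    reg_file=${2:-/sys/kernel/fadump_registered}
    case "$action" in
        start)  echo 1 > "$reg_file" ;;   # register fadump with firmware
        stop)   echo 0 > "$reg_file" ;;   # un-register; crashes not handled
        status)
            if [ "$(cat "$reg_file")" = "1" ]; then
                echo "registered"
            else
                echo "not registered"
            fi ;;
        *)
            echo "usage: fadump_ctl {start|stop|status} [file]" >&2
            return 2 ;;
    esac
}
```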
|  |  | 
|  | /sys/kernel/fadump_release_mem | 
|  |  | 
|  | This file is available only when fadump is active during the | 
|  | second kernel. It is used to release the reserved memory | 
|  | regions that are held for saving the crash dump. To release the | 
|  | reserved memory, echo 1 to it: | 
|  |  | 
|  | echo 1  > /sys/kernel/fadump_release_mem | 
|  |  | 
|  | After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region | 
|  | file will change to reflect the new memory reservations. | 
|  |  | 
|  | The existing userspace tools (kdump infrastructure) can be easily | 
|  | enhanced to use this interface to release the memory reserved for | 
|  | dump and continue without a second reboot. | 
|  |  | 
|  | Here is the list of files under powerpc debugfs: | 
|  | (Assuming debugfs is mounted on /sys/kernel/debug directory.) | 
|  |  | 
|  | /sys/kernel/debug/powerpc/fadump_region | 
|  |  | 
|  | This file shows the reserved memory regions if fadump is | 
|  | enabled; otherwise this file is empty. The output format | 
|  | is: | 
|  | <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size> | 
|  |  | 
|  | e.g. | 
|  | Contents when fadump is registered during first kernel | 
|  |  | 
|  | # cat /sys/kernel/debug/powerpc/fadump_region | 
|  | CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0 | 
|  | HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0 | 
|  | DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0 | 
|  |  | 
|  | Contents when fadump is active during second kernel | 
|  |  | 
|  | # cat /sys/kernel/debug/powerpc/fadump_region | 
|  | CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020 | 
|  | HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000 | 
|  | DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000 | 
|  | : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000 | 
|  |  | 
|  | NOTE: Please refer to Documentation/filesystems/debugfs.txt on | 
|  | how to mount the debugfs filesystem. | 
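The output format above is simple enough to post-process in a script.
The following sketch reads a fadump_region file, prints each region's
reserved and dumped sizes in decimal bytes, and classifies the region
as empty, partial or complete. fadump_region_report and the state names
are hypothetical; the parsing assumes exactly the line format shown
above, and a region with a blank name is reported as "unnamed".

```shell
# Sketch only. fadump_region_report is a hypothetical helper.
#   $1  region file (defaults to /sys/kernel/debug/powerpc/fadump_region)
# Prints: <name> <reserved-bytes> <dumped-bytes> <state>
# Returns non-zero if any address range does not span <reserved-size>.
fadump_region_report() {
    region_file=${1:-/sys/kernel/debug/powerpc/fadump_region}
    rc=0
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue
        # Region name: text before the first ':', with spaces removed.
        name=$(printf '%s' "${line%%:*}" | tr -d ' ')
        # Address range: text between '[' and ']'.
        range=${line#*\[}
        range=${range%%\]*}
        start=${range%-*}
        end=${range#*-}
        # Reserved size: first word after '] '.
        rest=${line#*\] }
        size=${rest%% *}
        # Dumped size: text after 'Dumped: '.
        dumped=${line##*Dumped: }

        # Cross-check: the range should span exactly <reserved-size> bytes.
        [ $(( $end - $start + 1 )) -eq $(( $size )) ] || rc=1

        if [ $(( $dumped )) -eq 0 ]; then
            state=empty
        elif [ $(( $dumped )) -eq $(( $size )) ]; then
            state=complete
        else
            state=partial
        fi
        printf '%s %d %d %s\n' "${name:-unnamed}" \
            "$(( $size ))" "$(( $dumped ))" "$state"
    done < "$region_file"
    return $rc
}
```

With the first-kernel example above, every region would be reported as
empty; with the second-kernel example, every region would be reported
as complete.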
|  |  | 
|  |  | 
|  | TODO: | 
|  | ----- | 
|  | o Need to come up with a better approach to find a more | 
|  | accurate boot memory size that is required for a kernel to | 
|  | boot successfully when booted with restricted memory. | 
|  | o The fadump implementation introduces a fadump crash info structure | 
|  | in the scratch area before the ELF core header. The idea of introducing | 
|  | this structure is to pass some important crash info data to the second | 
|  | kernel, which will help the second kernel populate the ELF core header | 
|  | with correct data before it gets exported through /proc/vmcore. The | 
|  | current design does not address the possibility of introducing | 
|  | additional fields (in future) to this structure without affecting | 
|  | compatibility. Need to come up with a better approach to address this. | 
|  | The possible approaches are: | 
|  | 1. Introduce version field for version tracking, bump up the version | 
|  | whenever a new field is added to the structure in future. The version | 
|  | field can be used to find out what fields are valid for the current | 
|  | version of the structure. | 
|  | 2. Reserve the area of predefined size (say PAGE_SIZE) for this | 
|  | structure and have unused area as reserved (initialized to zero) | 
|  | for future field additions. | 
|  | The advantage of approach 1 over 2 is that we don't need to reserve extra space. | 
|  | --- | 
|  | Author: Mahesh Salgaonkar <[email protected]> | 
|  | This document is based on the original documentation written for phyp | 
|  | assisted dump by Linas Vepstas and Manish Ahuja. |