# STG
STG stands for Symbol-Type Graph.
# Overview
STG models Application Binary Interfaces. It supports extraction of ABIs from
DWARF and ingestion of BTF and libabigail XML into its model. Its primary
purpose is monitoring an ABI for changes over time and reporting such changes in
a comprehensible fashion.
STG captures symbol information, the size and layout of structs, function
argument and return types and much more, in a graph representation. Difference
reporting happens via a graph comparison.
Currently, STG functionality is exposed as two command-line tools, `stg` (for
ABI extraction) and `stgdiff` (for ABI comparison), and a native file format.
## Model
STG's model is an *abstraction* which does not and cannot capture every possible
interface property, invariant or behaviour. Conversely, the model includes
distinctions which are API significant but not ABI significant.
Concretely, STG's model is a rooted, connected, directed graph where each kind
of node corresponds to a meaningful ABI entity such as a symbol, function type
or struct member.
Nodes have specific attributes, such as name or size. Outgoing edges specify
things like return type. STG's model does not impose any constraints on which
nodes may be joined by edges.
Each node has an identity. However, for the purpose of comparison, nodes are
considered equal if they are of the same kind, have the same attributes and
matching outgoing edges and all nodes reachable via a pair of matching edges are
(recursively) equal. Renumbering nodes, (de)duplicating nodes and
adding/removing unreachable nodes do not affect this relationship.
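
As a rough illustration of this equality rule, the following sketch compares two nodes recursively. The `Node` class, the edge labels and the `equal` function are hypothetical names chosen for this example and are not STG's implementation. Pairs of nodes already under comparison are assumed equal, which is what lets the recursion terminate on cyclic graphs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                  # e.g. "struct", "pointer"
    attrs: dict = field(default_factory=dict)  # e.g. {"name": "node", "size": 8}
    edges: dict = field(default_factory=dict)  # edge label -> target node id


def equal(graph_a, id_a, graph_b, id_b, assumed=None):
    """Return True if the two nodes are structurally equal."""
    if assumed is None:
        assumed = set()
    if (id_a, id_b) in assumed:
        return True  # already being compared further up the call stack
    a, b = graph_a[id_a], graph_b[id_b]
    if a.kind != b.kind or a.attrs != b.attrs or a.edges.keys() != b.edges.keys():
        return False
    assumed.add((id_a, id_b))
    return all(
        equal(graph_a, a.edges[label], graph_b, b.edges[label], assumed)
        for label in a.edges
    )


# struct node { struct node *next; } as a two-node cycle.
graph_a = {
    1: Node("struct", {"name": "node", "size": 8}, {"member:next": 2}),
    2: Node("pointer", {}, {"pointee": 1}),
}
# The same graph renumbered, plus an unreachable node.
graph_b = {
    20: Node("struct", {"name": "node", "size": 8}, {"member:next": 10}),
    10: Node("pointer", {}, {"pointee": 20}),
    99: Node("primitive", {"name": "int"}),
}
# A genuinely different struct: the size attribute has changed.
graph_c = {
    1: Node("struct", {"name": "node", "size": 16}, {"member:next": 2}),
    2: Node("pointer", {}, {"pointee": 1}),
}
```

Here `equal(graph_a, 1, graph_b, 20)` holds despite the different numbering and the extra unreachable node, while `equal(graph_a, 1, graph_c, 1)` does not.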
### Symbols
As modelled by STG, symbols correspond closely to ELF symbols as seen in
`.dynsym` for shared object files or in `.symtab` for object files. In the case
of the Linux kernel, the `.symtab` is enriched with metadata and the effective
"ksymtab" is actually a subset of the ELF symbols together with CRC and
namespace information.
STG links symbols to their source-level types where these are known. Symbols
defined purely in assembly language will not have type information.
The symbol table is contained in the root node of the graph, which is an
*Interface* node.
### Types
STG models the C, C++ and (to a limited extent) Rust type systems, though not
every feature is covered.
For example, C++ template value parameters are poorly modelled because doing so
would require modelling C++ *values* as well as types, something that DWARF
itself doesn't do to the full extent permitted by C++20.
As type definitions are in general mutually recursive, an STG ABI is in general
a cyclic graph.
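
To make the cyclic case concrete, the sketch below writes out a hypothetical adjacency list for two mutually recursive C structs and checks for a cycle with a depth-first search. The node labels and the `has_cycle` helper are made up for this example.

```python
# Hypothetical adjacency list for:
#   struct a { struct b *peer; };
#   struct b { struct a *peer; };
edges = {
    "struct a": ["member a::peer"],
    "member a::peer": ["pointer to struct b"],
    "pointer to struct b": ["struct b"],
    "struct b": ["member b::peer"],
    "member b::peer": ["pointer to struct a"],
    "pointer to struct a": ["struct a"],
}


def has_cycle(edges, start):
    """Report whether a cycle is reachable from start."""
    in_progress, done = set(), set()

    def visit(node):
        if node in in_progress:
            return True   # back edge: this node is on the current path
        if node in done:
            return False  # fully explored earlier, no cycle through it
        in_progress.add(node)
        found = any(visit(target) for target in edges.get(node, []))
        in_progress.discard(node)
        done.add(node)
        return found

    return visit(start)
```

Any tool that walks such a graph therefore needs to track visited nodes, as `has_cycle` does, to avoid looping forever.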
The root node of the graph can also contain a list of interface types, which are
not necessarily reachable from the interface symbols.
## Supported Input Formats, Parsers and Limitations
STG can read its own native format for processing or comparison. It can also
process libabigail XML and BTF (`.BTF` ELF sections), with some limitations due
to model, design and implementation differences including missing features.
### Kinds of Node
STG has the following kinds of node.
* **Special** - used for `void` and `...`
* **Pointer / Reference** - `*`, `&` and `&&`
* **Pointer to Member** - `foo::*`
* **Typedef** - `typedef` and `using ... = ...`
* **Qualified** - `const` and friends
* **Primitive** - concrete types such as `int` and friends
* **Array** - `foo[N]` - there is no distinction between zero and
indeterminate length in the model
* **Base Class** - inheritance metadata
* **Method** - (only) virtual function
* **Member** - data member
* **Variant Member** - discriminated member
* **Struct / Union** - `struct foo` etc., Rust tuples too
* **Enumeration** - including the underlying value type - only values that are
  within the range of a signed 64-bit integer are correctly modelled
* **Variant** - for Rust enums holding data
* **Function** - multiple arguments, single return type
* **ELF Symbol** - name, version, ELF metadata, Linux kernel metadata
* **Interface** - top-level collection of symbols and types
An STG ABI consists of a rooted, connected graph of such nodes, and *nothing
else*. STG is blind to anything that cannot be represented by its model.
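
One consequence of the graph being rooted and connected is that only nodes reachable from the root belong to the ABI. The sketch below, which uses a hypothetical dictionary-of-lists representation and made-up node names, collects the reachable set and drops everything else.

```python
def reachable(edges, root):
    """Return the set of nodes reachable from root."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return seen


def prune(edges, root):
    """Drop nodes that cannot be reached from root."""
    keep = reachable(edges, root)
    return {node: targets for node, targets in edges.items() if node in keep}


edges = {
    "interface": ["symbol f", "struct s"],  # symbols and interface types
    "symbol f": ["function"],
    "function": ["int"],
    "struct s": ["member s::x"],
    "member s::x": ["int"],
    "int": [],
    "typedef orphan": ["int"],              # not reachable from the root
}
pruned = prune(edges, "interface")
```

In this example `struct s` survives only because the root lists it as an interface type, while `typedef orphan` is removed.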
### Native Format
STG's native file format is a protocol buffer text format. It is suitable for
revision control, rather than human consumption. It is effectively described by
[`stg.proto`](../stg.proto).
In this textual serialisation of ABI graphs, external node identifiers and node
order are chosen to minimise file changes when a small subset of the graph
changes.
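
The text above does not say how identifiers are chosen. Purely as an assumption for illustration, one way to get identifiers that stay put when unrelated nodes change is to derive each one from a hash of the node's own local properties. The `assign_ids` function and the choice of CRC-32 below are hypothetical and are not taken from STG.

```python
import zlib


def assign_ids(nodes):
    """Map each (kind, name) pair to a 32-bit identifier.

    Assumption of this sketch: the identifier depends only on the node's
    kind and name, so changes elsewhere in the graph leave it untouched.
    """
    ids, used = {}, set()
    for key in sorted(set(nodes)):
        candidate = zlib.crc32(f"{key[0]}:{key[1]}".encode())
        while candidate in used:  # resolve hash collisions deterministically
            candidate = (candidate + 1) & 0xFFFFFFFF
        used.add(candidate)
        ids[key] = candidate
    return ids


before = assign_ids([("struct", "foo"), ("typedef", "foo_t"), ("primitive", "int")])
after = assign_ids(
    [("struct", "foo"), ("typedef", "foo_t"), ("primitive", "int"), ("struct", "bar")]
)
# Emitting nodes sorted by identifier keeps the file order stable as well.
ordered = sorted(after.items(), key=lambda item: item[1])
```

Adding `struct bar` introduces one new identifier and, barring a hash collision, leaves the existing ones alone, so a textual diff of the two serialisations would stay small.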
As an example, this is the definition of the **Typedef** node kind:
```proto
message Typedef {
fixed32 id = 1;
string name = 2;
fixed32 referred_type_id = 3;
}
```
### Abigail (a.k.a. libabigail XML)
[libabigail](https://sourceware.org/libabigail/) is another project for ABI
monitoring. It uses a format that can be parsed as XML.
This command will transform Abigail into STG:
```shell
stg --abi library.xml --output library.stg
```
The main features modelled in Abigail but not STG are:
* source file, line and column information
* C++ access specifiers (public, protected, private)
The Abigail reader has these distinct phases of operation:
1. text parsed into an XML tree
2. XML cleaning - whitespace and unused attributes are stripped
3. XML tidying - issues like duplicate nodes are resolved, if possible
4. XML parsed into a graph with symbol information held separately
5. symbols and root node added to the graph
6. useless type qualifiers are stripped in post-processing
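
The cleaning phase (step 2) can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration: the `clean` function, the set of attributes treated as unused and the XML fragment are assumptions made for this example, not the reader's actual code.

```python
import xml.etree.ElementTree as ET

# Assumed for illustration: attributes carrying source locations, which the
# STG model does not represent.
UNUSED_ATTRIBUTES = {"filepath", "line", "column"}


def clean(element):
    """Strip whitespace-only text and unused attributes, recursively."""
    if element.text is not None and not element.text.strip():
        element.text = None
    if element.tail is not None and not element.tail.strip():
        element.tail = None
    for name in UNUSED_ATTRIBUTES.intersection(element.attrib):
        del element.attrib[name]
    for child in element:
        clean(child)


# A simplified, made-up fragment in the style of libabigail XML.
document = """
<abi-corpus>
  <abi-instr>
    <type-decl name="int" size-in-bits="32" id="type-id-1"
               filepath="a.c" line="1" column="1"/>
  </abi-instr>
</abi-corpus>
"""
root = ET.fromstring(document.strip())
clean(root)
print(ET.tostring(root, encoding="unicode"))
```

After cleaning, the `type-decl` element keeps only `name`, `size-in-bits` and `id`, and the indentation between elements is gone.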
### BTF
[BTF](https://docs.kernel.org/bpf/btf.html) is typically used for the Linux
kernel where it is generated by `pahole -J` from ELF and DWARF information. It
can also be generated natively instead of DWARF using `gcc -gbtf` and by Clang,
but only for eBPF targets.
This command will transform BTF into STG:
```shell
stg --btf vmlinux --output vmlinux.stg
```
STG has primarily been tested against the `pahole` (libbtf) dialect of BTF and
support is not complete:
* split BTF is not supported at all
* any `.BTF.ext` section is just ignored
* some kinds of BTF node are not handled:
  * `BTF_KIND_DATASEC` - skip
  * `BTF_KIND_DECL_TAG` - abort
  * `BTF_KIND_TYPE_TAG` - abort
The BTF reader has these distinct phases of operation:
1. file is opened as ELF and `.BTF` section data found
2. BTF header processed
3. BTF nodes parsed into a graph with symbol information held separately
4. symbols and root node added to the graph
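
Step 2 can be sketched as follows. The field layout follows `struct btf_header` as described in the kernel's BTF documentation. The function names are hypothetical, and the sketch assumes little-endian data that has already been extracted from the `.BTF` section.

```python
import struct
from typing import NamedTuple

BTF_MAGIC = 0xEB9F
# Little-endian layout of struct btf_header: u16, u8, u8, then five u32 fields.
HEADER = struct.Struct("<HBBIIIII")


class BtfHeader(NamedTuple):
    magic: int
    version: int
    flags: int
    hdr_len: int
    type_off: int
    type_len: int
    str_off: int
    str_len: int


def parse_header(section: bytes) -> BtfHeader:
    if len(section) < HEADER.size:
        raise ValueError("section too small for a BTF header")
    header = BtfHeader(*HEADER.unpack_from(section))
    if header.magic != BTF_MAGIC:
        raise ValueError(f"unexpected BTF magic {header.magic:#x}")
    return header


def split_sections(section: bytes, header: BtfHeader):
    """Return the type and string data; offsets count from the header's end."""
    base = header.hdr_len
    type_start = base + header.type_off
    str_start = base + header.str_off
    types = section[type_start:type_start + header.type_len]
    strings = section[str_start:str_start + header.str_len]
    return types, strings


# A tiny synthetic section: header, 4 bytes of type data, 3 bytes of strings.
section = HEADER.pack(BTF_MAGIC, 1, 0, HEADER.size, 0, 4, 4, 3) + b"TTTT" + b"\0a\0"
header = parse_header(section)
types, strings = split_sections(section, header)
```

A real reader would go on to decode the type data into nodes (step 3). That part is omitted here.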
### DWARF
The ELF / DWARF reader operates similarly to the other readers at a high level,
but much more work has to be done to turn ELF symbols and DWARF DIEs into STG
nodes.
1. the ELF file is checked for DWARF - missing DWARF results in a warning
2. the ELF symbols are read (from `.dynsym` in the case of a shared object file)
3. the DWARF information is parsed into a partial STG graph
4. the ELF and DWARF information are stitched together, adding symbols and a
root node to the graph
5. useless type qualifiers are stripped in post-processing
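
The stitching in step 4 can be pictured with the sketch below. It is heavily simplified: matching symbols to DWARF information purely by name is an assumption of this example, and the `stitch` function, the dictionary layout and the symbol names are all hypothetical.

```python
def stitch(elf_symbols, dwarf_types):
    """Combine ELF symbol names with DWARF-derived type ids under one root.

    elf_symbols: iterable of symbol names read from the symbol table.
    dwarf_types: dict mapping a name to the id of its type node.
    """
    symbols = {}
    for name in sorted(elf_symbols):
        symbols[name] = {
            "kind": "elf_symbol",
            "name": name,
            # None when no type information is known, for example for a
            # symbol defined purely in assembly language.
            "type_id": dwarf_types.get(name),
        }
    return {"kind": "interface", "symbols": symbols}


root = stitch(["memcpy_asm", "open_device"], {"open_device": 0x1234})
```

In this example `open_device` is linked to its type node, while `memcpy_asm` ends up in the interface without type information.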
## Output preprocessing
Before `stg` outputs a serialised graph, it performs:
1. a type normalisation step that unifies overlapping type definitions
2. a final deduplication step to eliminate other redundant nodes
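
Deduplication on a possibly cyclic graph can be illustrated with partition refinement: start with all nodes in one class and keep splitting classes until nodes in the same class agree on kind, attributes and the classes of their edge targets. This is a generic technique chosen for the example and is not claimed to be the algorithm `stg` uses. The tuple-based node representation and the function names are hypothetical, and edges are assumed to be listed in a canonical order.

```python
def equivalence_classes(nodes):
    """nodes: id -> (kind, attrs tuple, tuple of (edge label, target id)).

    Returns id -> class number. Nodes share a class when they cannot be told
    apart by kind, attributes or anything reachable along matching edges.
    """
    labels = {node_id: 0 for node_id in nodes}
    count = 1
    while True:
        numbering, refined = {}, {}
        for node_id, (kind, attrs, edges) in nodes.items():
            signature = (
                labels[node_id], kind, attrs,
                tuple((label, labels[target]) for label, target in edges),
            )
            refined[node_id] = numbering.setdefault(signature, len(numbering))
        if len(numbering) == count:  # no class was split: partition is stable
            return refined
        labels, count = refined, len(numbering)


def deduplicate(nodes, root):
    """Keep one representative per class and redirect edges to it."""
    classes = equivalence_classes(nodes)
    representative = {}
    for node_id in sorted(nodes):
        representative.setdefault(classes[node_id], node_id)
    remap = {node_id: representative[classes[node_id]] for node_id in nodes}
    merged = {
        node_id: (kind, attrs, tuple((label, remap[t]) for label, t in edges))
        for node_id, (kind, attrs, edges) in nodes.items()
        if remap[node_id] == node_id
    }
    return merged, remap[root]


int_attrs = (("name", "int"),)
nodes = {
    1: ("primitive", int_attrs, ()),
    2: ("primitive", int_attrs, ()),          # duplicate of node 1
    3: ("pointer", (), (("pointee", 1),)),
    4: ("pointer", (), (("pointee", 2),)),    # duplicate of node 3
    5: ("struct", (("name", "pair"),), (("member:a", 3), ("member:b", 4))),
}
merged, new_root = deduplicate(nodes, 5)
```

Nodes 2 and 4 are merged into nodes 1 and 3, and both members of the struct end up pointing at node 3.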