compiler/rustc_codegen_llvm/src/debuginfo/doc.md - toolchain/rustc - Git at Google

 # Debug Info Module

 This module serves the purpose of generating debug symbols. We use LLVM's
 [source level debugging](https://llvm.org/docs/SourceLevelDebugging.html)
 features for generating the debug information. The general principle is
 this:

 Given the right metadata in the LLVM IR, the LLVM code generator is able to
 create DWARF debug symbols for the given code. The
 [metadata](https://llvm.org/docs/LangRef.html#metadata-type) is structured
 much like DWARF *debugging information entries* (DIE), representing type
 information such as datatype layout, function signatures, block layout,
 variable location and scope information, etc. It is the purpose of this
 module to generate correct metadata and insert it into the LLVM IR.

 As the exact format of metadata trees may change between different LLVM
 versions, we now use LLVM
 [DIBuilder](https://llvm.org/docs/doxygen/html/classllvm_1_1DIBuilder.html)
 to create metadata where possible. This will hopefully ease the adaption of
 this module to future LLVM versions.

 The public API of the module is a set of functions that will insert the
 correct metadata into the LLVM IR when called with the right parameters.
 The module is thus driven from an outside client with functions like
 `debuginfo::create_local_var_metadata(bx: block, local: &ast::local)`.

 Internally the module will try to reuse already created metadata by
 utilizing a cache. The way to get a shared metadata node when needed is
 thus to just call the corresponding function in this module:

     let file_metadata = file_metadata(cx, file);

 The function will take care of probing the cache for an existing node for
 that exact file path.

 All private state used by the module is stored within either the
 CrateDebugContext struct (owned by the CodegenCx) or the
 FunctionDebugContext (owned by the FunctionCx).

 This file consists of three conceptual sections:
 1. The public interface of the module
 2. Module-internal metadata creation functions
 3. Minor utility functions


 ## Recursive Types

 Some kinds of types, such as structs and enums can be recursive. That means
 that the type definition of some type X refers to some other type which in
 turn (transitively) refers to X. This introduces cycles into the type
 referral graph. A naive algorithm doing an on-demand, depth-first traversal
 of this graph when describing types, can get trapped in an endless loop
 when it reaches such a cycle.

 For example, the following simple type for a singly-linked list...

 ```
 struct List {
     value: i32,
     tail: Option<Box<List>>,
 }
 ```

 will generate the following callstack with a naive DFS algorithm:

 ```
 describe(t = List)
   describe(t = i32)
   describe(t = Option<Box<List>>)
     describe(t = Box<List>)
       describe(t = List) // at the beginning again...
       ...
 ```

 To break cycles like these, we use "forward declarations". That is, when
 the algorithm encounters a possibly recursive type (any struct or enum), it
 immediately creates a type description node and inserts it into the cache
 *before* describing the members of the type. This type description is just
 a stub (as type members are not described and added to it yet) but it
 allows the algorithm to already refer to the type. After the stub is
 inserted into the cache, the algorithm continues as before. If it now
 encounters a recursive reference, it will hit the cache and does not try to
 describe the type anew.

 This behavior is encapsulated in the 'RecursiveTypeDescription' enum,
 which represents a kind of continuation, storing all state needed to
 continue traversal at the type members after the type has been registered
 with the cache. (This implementation approach might be a tad over-
 engineered and may change in the future)


 ## Source Locations and Line Information

 In addition to data type descriptions the debugging information must also
 allow to map machine code locations back to source code locations in order
 to be useful. This functionality is also handled in this module. The
 following functions allow to control source mappings:

 + `set_source_location()`
 + `clear_source_location()`
 + `start_emitting_source_locations()`

 `set_source_location()` allows to set the current source location. All IR
 instructions created after a call to this function will be linked to the
 given source location, until another location is specified with
 `set_source_location()` or the source location is cleared with
 `clear_source_location()`. In the later case, subsequent IR instruction
 will not be linked to any source location. As you can see, this is a
 stateful API (mimicking the one in LLVM), so be careful with source
 locations set by previous calls. It's probably best to not rely on any
 specific state being present at a given point in code.

 One topic that deserves some extra attention is *function prologues*. At
 the beginning of a function's machine code there are typically a few
 instructions for loading argument values into allocas and checking if
 there's enough stack space for the function to execute. This *prologue* is
 not visible in the source code and LLVM puts a special PROLOGUE END marker
 into the line table at the first non-prologue instruction of the function.
 In order to find out where the prologue ends, LLVM looks for the first
 instruction in the function body that is linked to a source location. So,
 when generating prologue instructions we have to make sure that we don't
 emit source location information until the 'real' function body begins. For
 this reason, source location emission is disabled by default for any new
 function being codegened and is only activated after a call to the third
 function from the list above, `start_emitting_source_locations()`. This
 function should be called right before regularly starting to codegen the
 top-level block of the given function.

 There is one exception to the above rule: `llvm.dbg.declare` instruction
 must be linked to the source location of the variable being declared. For
 function parameters these `llvm.dbg.declare` instructions typically occur
 in the middle of the prologue, however, they are ignored by LLVM's prologue
 detection. The `create_argument_metadata()` and related functions take care
 of linking the `llvm.dbg.declare` instructions to the correct source
 locations even while source location emission is still disabled, so there
 is no need to do anything special with source location handling here.

 ## Unique Type Identification

 In order for link-time optimization to work properly, LLVM needs a unique
 type identifier that tells it across compilation units which types are the
 same as others. This type identifier is created by
 `TypeMap::get_unique_type_id_of_type()` using the following algorithm:

 1. Primitive types have their name as ID

 2. Structs, enums and traits have a multipart identifier

   1. The first part is the SVH (strict version hash) of the crate they
      were originally defined in

   2. The second part is the ast::NodeId of the definition in their
      original crate

   3. The final part is a concatenation of the type IDs of their concrete
      type arguments if they are generic types.

 3. Tuple-, pointer-, and function types are structurally identified, which
    means that they are equivalent if their component types are equivalent
    (i.e., `(i32, i32)` is the same regardless in which crate it is used).

 This algorithm also provides a stable ID for types that are defined in one
 crate but instantiated from metadata within another crate. We just have to
 take care to always map crate and `NodeId`s back to the original crate
 context.

 As a side-effect these unique type IDs also help to solve a problem arising
 from lifetime parameters. Since lifetime parameters are completely omitted
 in debuginfo, more than one `Ty` instance may map to the same debuginfo
 type metadata, that is, some struct `Struct<'a>` may have N instantiations
 with different concrete substitutions for `'a`, and thus there will be N
 `Ty` instances for the type `Struct<'a>` even though it is not generic
 otherwise. Unfortunately this means that we cannot use `ty::type_id()` as
 cheap identifier for type metadata -- we have done this in the past, but it
 led to unnecessary metadata duplication in the best case and LLVM
 assertions in the worst. However, the unique type ID as described above
 *can* be used as identifier. Since it is comparatively expensive to
 construct, though, `ty::type_id()` is still used additionally as an
 optimization for cases where the exact same type has been seen before
 (which is most of the time).
	# Debug Info Module

	This module serves the purpose of generating debug symbols. We use LLVM's
	[source level debugging](https://llvm.org/docs/SourceLevelDebugging.html)
	features for generating the debug information. The general principle is
	this:

	Given the right metadata in the LLVM IR, the LLVM code generator is able to
	create DWARF debug symbols for the given code. The
	[metadata](https://llvm.org/docs/LangRef.html#metadata-type) is structured
	much like DWARF debugging information entries (DIE), representing type
	information such as datatype layout, function signatures, block layout,
	variable location and scope information, etc. It is the purpose of this
	module to generate correct metadata and insert it into the LLVM IR.

	As the exact format of metadata trees may change between different LLVM
	versions, we now use LLVM
	[DIBuilder](https://llvm.org/docs/doxygen/html/classllvm_1_1DIBuilder.html)
	to create metadata where possible. This will hopefully ease the adaption of
	this module to future LLVM versions.

	The public API of the module is a set of functions that will insert the
	correct metadata into the LLVM IR when called with the right parameters.
	The module is thus driven from an outside client with functions like
	`debuginfo::create_local_var_metadata(bx: block, local: &ast::local)`.

	Internally the module will try to reuse already created metadata by
	utilizing a cache. The way to get a shared metadata node when needed is
	thus to just call the corresponding function in this module:

	let file_metadata = file_metadata(cx, file);

	The function will take care of probing the cache for an existing node for
	that exact file path.

	All private state used by the module is stored within either the
	CrateDebugContext struct (owned by the CodegenCx) or the
	FunctionDebugContext (owned by the FunctionCx).

	This file consists of three conceptual sections:
	1. The public interface of the module
	2. Module-internal metadata creation functions
	3. Minor utility functions


	## Recursive Types

	Some kinds of types, such as structs and enums can be recursive. That means
	that the type definition of some type X refers to some other type which in
	turn (transitively) refers to X. This introduces cycles into the type
	referral graph. A naive algorithm doing an on-demand, depth-first traversal
	of this graph when describing types, can get trapped in an endless loop
	when it reaches such a cycle.

	For example, the following simple type for a singly-linked list...

	```
	struct List {
	value: i32,
	tail: Option<Box<List>>,
	}
	```

	will generate the following callstack with a naive DFS algorithm:

	```
	describe(t = List)
	describe(t = i32)
	describe(t = Option<Box<List>>)
	describe(t = Box<List>)
	describe(t = List) // at the beginning again...
	...
	```

	To break cycles like these, we use "forward declarations". That is, when
	the algorithm encounters a possibly recursive type (any struct or enum), it
	immediately creates a type description node and inserts it into the cache
	before describing the members of the type. This type description is just
	a stub (as type members are not described and added to it yet) but it
	allows the algorithm to already refer to the type. After the stub is
	inserted into the cache, the algorithm continues as before. If it now
	encounters a recursive reference, it will hit the cache and does not try to
	describe the type anew.

	This behavior is encapsulated in the 'RecursiveTypeDescription' enum,
	which represents a kind of continuation, storing all state needed to
	continue traversal at the type members after the type has been registered
	with the cache. (This implementation approach might be a tad over-
	engineered and may change in the future)


	## Source Locations and Line Information

	In addition to data type descriptions the debugging information must also
	allow to map machine code locations back to source code locations in order
	to be useful. This functionality is also handled in this module. The
	following functions allow to control source mappings:

	+ `set_source_location()`
	+ `clear_source_location()`
	+ `start_emitting_source_locations()`

	`set_source_location()` allows to set the current source location. All IR
	instructions created after a call to this function will be linked to the
	given source location, until another location is specified with
	`set_source_location()` or the source location is cleared with
	`clear_source_location()`. In the later case, subsequent IR instruction
	will not be linked to any source location. As you can see, this is a
	stateful API (mimicking the one in LLVM), so be careful with source
	locations set by previous calls. It's probably best to not rely on any
	specific state being present at a given point in code.

	One topic that deserves some extra attention is function prologues. At
	the beginning of a function's machine code there are typically a few
	instructions for loading argument values into allocas and checking if
	there's enough stack space for the function to execute. This prologue is
	not visible in the source code and LLVM puts a special PROLOGUE END marker
	into the line table at the first non-prologue instruction of the function.
	In order to find out where the prologue ends, LLVM looks for the first
	instruction in the function body that is linked to a source location. So,
	when generating prologue instructions we have to make sure that we don't
	emit source location information until the 'real' function body begins. For
	this reason, source location emission is disabled by default for any new
	function being codegened and is only activated after a call to the third
	function from the list above, `start_emitting_source_locations()`. This
	function should be called right before regularly starting to codegen the
	top-level block of the given function.

	There is one exception to the above rule: `llvm.dbg.declare` instruction
	must be linked to the source location of the variable being declared. For
	function parameters these `llvm.dbg.declare` instructions typically occur
	in the middle of the prologue, however, they are ignored by LLVM's prologue
	detection. The `create_argument_metadata()` and related functions take care
	of linking the `llvm.dbg.declare` instructions to the correct source
	locations even while source location emission is still disabled, so there
	is no need to do anything special with source location handling here.

	## Unique Type Identification

	In order for link-time optimization to work properly, LLVM needs a unique
	type identifier that tells it across compilation units which types are the
	same as others. This type identifier is created by
	`TypeMap::get_unique_type_id_of_type()` using the following algorithm:

	1. Primitive types have their name as ID

	2. Structs, enums and traits have a multipart identifier

	1. The first part is the SVH (strict version hash) of the crate they
	were originally defined in

	2. The second part is the ast::NodeId of the definition in their
	original crate

	3. The final part is a concatenation of the type IDs of their concrete
	type arguments if they are generic types.

	3. Tuple-, pointer-, and function types are structurally identified, which
	means that they are equivalent if their component types are equivalent
	(i.e., `(i32, i32)` is the same regardless in which crate it is used).

	This algorithm also provides a stable ID for types that are defined in one
	crate but instantiated from metadata within another crate. We just have to
	take care to always map crate and `NodeId`s back to the original crate
	context.

	As a side-effect these unique type IDs also help to solve a problem arising
	from lifetime parameters. Since lifetime parameters are completely omitted
	in debuginfo, more than one `Ty` instance may map to the same debuginfo
	type metadata, that is, some struct `Struct<'a>` may have N instantiations
	with different concrete substitutions for `'a`, and thus there will be N
	`Ty` instances for the type `Struct<'a>` even though it is not generic
	otherwise. Unfortunately this means that we cannot use `ty::type_id()` as
	cheap identifier for type metadata -- we have done this in the past, but it
	led to unnecessary metadata duplication in the best case and LLVM
	assertions in the worst. However, the unique type ID as described above
	can be used as identifier. Since it is comparatively expensive to
	construct, though, `ty::type_id()` is still used additionally as an
	optimization for cases where the exact same type has been seen before
	(which is most of the time).