:tocdepth: 3

.. _module-pw_tokenizer-detokenization:

==============
Detokenization
==============
.. pigweed-module-subpage::
   :name: pw_tokenizer

Detokenization is the process of expanding a token to the string it represents
and decoding its arguments. ``pw_tokenizer`` provides Python, C++ and
TypeScript detokenization libraries.

--------------------------------
Example: decoding tokenized logs
--------------------------------
A project might tokenize its log messages with the
:ref:`module-pw_tokenizer-base64-format`. Consider the following log file, which
has four tokenized logs and one plain text log:

.. code-block:: text

   20200229 14:38:58 INF $HL2VHA==
   20200229 14:39:00 DBG $5IhTKg==
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF $EgFj8lVVAUI=
   20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=

The project's log strings are stored in a database like the following:

.. code-block:: text

   1c95bd1c, ,"Initiating retrieval process for recovery object"
   2a5388e4, ,"Determining optimal approach and coordinating vectors"
   3743540c, ,"Recovery object retrieval failed with status %s"
   f2630112, ,"Calculated acceptable probability of success (%.2f%%)"
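
Each ``$``-prefixed message above is the prefixed Base64 of a binary tokenized
message: the 32-bit token in little-endian byte order, followed by any encoded
arguments. For example, the first log's token, ``1c95bd1c``, can be reproduced
with a few lines of Python (shown for illustration only):

.. code-block:: python

   import base64
   import struct

   # The binary message is the 32-bit token in little-endian byte order. This
   # log has no arguments, so no argument bytes follow the token.
   binary_message = struct.pack('<I', 0x1C95BD1C)

   # Prefixed Base64 is a '$' followed by the Base64 of the binary message.
   print('$' + base64.b64encode(binary_message).decode())  # Prints $HL2VHA==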

Using the detokenizing tools with the database, the logs can be decoded:

.. code-block:: text

   20200229 14:38:58 INF Initiating retrieval process for recovery object
   20200229 14:39:00 DBG Determining optimal approach and coordinating vectors
   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
   20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
   20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY

.. note::

   This example uses the :ref:`module-pw_tokenizer-base64-format`, which
   occupies about 4/3 (133%) as much space as the default binary format when
   encoded. For projects that need to interleave tokenized messages with plain
   text, Base64 is a worthwhile tradeoff.

------------------------
Detokenization in Python
------------------------
To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer``
package, and instantiate it with paths to token databases or ELF files.

.. code-block:: python

   import pw_tokenizer

   detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')

   def process_log_message(log_message):
       result = detokenizer.detokenize(log_message.payload)
       print(result)

The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
class, which can be used in place of the standard ``Detokenizer``. This class
monitors its database files and automatically reloads them when they change,
which is helpful for long-running tools that use detokenization. The class
also accepts a token domain for each database file using the
``<path>#<domain>`` format, as shown below.
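
For example, a long-running log viewer might construct its detokenizer as
follows. This is a minimal sketch: the paths and domain name are placeholders,
and it assumes ``AutoUpdatingDetokenizer`` is importable from
``pw_tokenizer.detokenize`` and accepts the same kinds of database paths as
``Detokenizer``.

.. code-block:: python

   from pw_tokenizer import detokenize

   # Reloads these databases automatically if the files change on disk. The
   # '#some_domain' suffix selects a token domain within that ELF file
   # (hypothetical path and domain names).
   detokenizer = detokenize.AutoUpdatingDetokenizer(
       'out/firmware_image.elf#some_domain',
       'path/to/database.csv',
   )

   def process_log_message(log_message: bytes) -> str:
       return str(detokenizer.detokenize(log_message))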

For messages that are optionally tokenized and may be encoded as binary,
Base64, or plaintext UTF-8, use
:func:`pw_tokenizer.proto.decode_optionally_tokenized`. This will attempt to
determine the correct method to detokenize and always provide a printable
string.
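
For example, a rough sketch (assuming the function takes the detokenizer and
the raw field bytes and returns a ``str``):

.. code-block:: python

   import pw_tokenizer
   from pw_tokenizer import proto

   detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv')

   def field_to_text(field_bytes: bytes) -> str:
       # Handles binary tokenized, prefixed Base64, and plain UTF-8 input.
       return proto.decode_optionally_tokenized(detokenizer, field_bytes)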

.. _module-pw_tokenizer-base64-decoding:

Decoding Base64
===============
The Python ``Detokenizer`` class supports decoding and detokenizing prefixed
Base64 messages with ``detokenize_base64`` and related methods.
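
For example, a minimal sketch (the log line here is the first message from the
example above):

.. code-block:: python

   import pw_tokenizer

   detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv')

   # Replaces any '$'-prefixed Base64 messages found in the input with their
   # detokenized form, leaving the surrounding text unchanged.
   decoded = detokenizer.detokenize_base64(b'INF $HL2VHA==')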

.. tip::
   The Python detokenization tools support recursive detokenization for prefixed
   Base64 text. Tokenized strings found in detokenized text are detokenized, so
   prefixed Base64 messages can be passed as ``%s`` arguments.

For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could be
passed as an argument to the printf-style string ``Nested message: %s``, which
encodes to ``$pEVTYQkkUmhZam1RPT0=``. The detokenizer would decode the message
as follows:

::

   "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"

Base64 decoding is supported in C++ or C with the
``pw::tokenizer::PrefixedBase64Decode`` or ``pw_tokenizer_PrefixedBase64Decode``
functions.

Investigating undecoded Base64 messages
---------------------------------------
Tokenized messages cannot be decoded if the token is not recognized. The Python
package includes the ``parse_message`` tool, which parses tokenized Base64
messages without looking up the token in a database. This tool attempts to guess
the types of the arguments and displays potential ways to decode them.

This tool can be used to extract argument information from an otherwise unusable
message. It could help identify which statement in the code produced the
message. This tool is not particularly helpful for tokenized messages without
arguments, since all it can do is show the value of the unknown token.

The tool is executed by passing Base64 tokenized messages, with or without the
``$`` prefix, to ``pw_tokenizer.parse_message``. Pass ``-h`` or ``--help`` to
see full usage information.

Example
^^^^^^^
.. code-block:: text

   $ python -m pw_tokenizer.parse_message '$329JMwA=' koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw== --specs %s %d

   INF Decoding arguments for '$329JMwA='
   INF Binary: b'\xdfoI3\x00' [df 6f 49 33 00] (5 bytes)
   INF Token: 0x33496fdf
   INF Args: b'\x00' [00] (1 bytes)
   INF Decoding with up to 8 %s or %d arguments
   INF Attempt 1: [%s]
   INF Attempt 2: [%d] 0

   INF Decoding arguments for '$koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw=='
   INF Binary: b'\x92\x84\xa5\xe7n\x13FAILED_PRECONDITION\x02OK' [92 84 a5 e7 6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (28 bytes)
   INF Token: 0xe7a58492
   INF Args: b'n\x13FAILED_PRECONDITION\x02OK' [6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (24 bytes)
   INF Decoding with up to 8 %s or %d arguments
   INF Attempt 1: [%d %s %d %d %d] 55 FAILED_PRECONDITION 1 -40 -38
   INF Attempt 2: [%d %s %s] 55 FAILED_PRECONDITION OK

.. _module-pw_tokenizer-protobuf-tokenization-python:

Detokenizing protobufs
======================
The :py:mod:`pw_tokenizer.proto` Python module defines functions that may be
used to detokenize protobuf objects in Python. The function
:py:func:`pw_tokenizer.proto.detokenize_fields` detokenizes all fields
annotated as tokenized, replacing them with their detokenized version. For
example:

.. code-block:: python

   my_detokenizer = pw_tokenizer.Detokenizer(some_database)

   my_message = SomeMessage(tokenized_field=b'$YS1EMQ==')
   pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message)

   assert my_message.tokenized_field == b'The detokenized string! Cool!'

Decoding optionally tokenized strings
-------------------------------------
The encoding used for an optionally tokenized field is not recorded in the
protobuf. Despite this, the text can reliably be decoded. This is accomplished
by attempting to decode the field as binary or Base64 tokenized data before
treating it like plain text.

The following diagram describes the decoding process for optionally tokenized
fields in detail.

.. mermaid::

   flowchart TD
     start([Received bytes]) --> binary

     binary[Decode as<br>binary tokenized] --> binary_ok
     binary_ok{Detokenizes<br>successfully?} -->|no| utf8
     binary_ok -->|yes| done_binary([Display decoded binary])

     utf8[Decode as UTF-8] --> utf8_ok
     utf8_ok{Valid UTF-8?} -->|no| base64_encode
     utf8_ok -->|yes| base64

     base64_encode[Encode as<br>tokenized Base64] --> display
     display([Display encoded Base64])

     base64[Decode as<br>Base64 tokenized] --> base64_ok

     base64_ok{Fully<br>or partially<br>detokenized?} -->|no| is_plain_text
     base64_ok -->|yes| base64_results

     is_plain_text{Text is<br>printable?} -->|no| base64_encode
     is_plain_text -->|yes| plain_text

     base64_results([Display decoded Base64])
     plain_text([Display text])
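
The Python sketch below mirrors this flow. It is illustrative only, not the
actual ``pw_tokenizer`` implementation; it assumes a detokenizer whose result
objects have an ``ok()`` method and reuses the ``detokenize_base64`` helper
described earlier.

.. code-block:: python

   import base64

   def decode_optionally_tokenized_sketch(detokenizer, data: bytes) -> str:
       """Illustrates the decision flow for optionally tokenized bytes."""
       # 1. Try to decode the bytes as a binary tokenized message.
       result = detokenizer.detokenize(data)
       if result.ok():
           return str(result)

       # 2. Not binary tokenized; check whether the bytes are valid UTF-8.
       try:
           text = data.decode()
       except UnicodeDecodeError:
           # Not UTF-8 either: re-encode as prefixed Base64 so the message can
           # still be recovered later with an updated token database.
           return '$' + base64.b64encode(data).decode()

       # 3. Valid UTF-8: detokenize any prefixed Base64 messages it contains.
       detokenized = detokenizer.detokenize_base64(data)
       if detokenized != data:  # Something was fully or partially detokenized.
           return detokenized.decode()

       # 4. Nothing tokenized was found; show printable text as plain text.
       if text.isprintable():
           return text
       return '$' + base64.b64encode(data).decode()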

Potential decoding problems
---------------------------
The decoding process for optionally tokenized fields will yield correct results
in almost every situation. In rare circumstances, it is possible for it to fail,
but these failures can be avoided with a low-overhead mitigation if desired.

There are two ways in which the decoding process may fail.

Accidentally interpreting plain text as tokenized binary
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a plain-text string happens to decode as a binary tokenized message, the
incorrect message could be displayed. This is very unlikely to occur. While many
tokens will incidentally end up being valid UTF-8 strings, it is highly unlikely
that a device will happen to log one of these strings as plain text. The
overwhelming majority of these strings will be nonsense.

If an implementation wishes to guard against this extremely improbable
situation, it can be prevented by appending 0xFF (or another byte that is never
valid in UTF-8) to binary tokenized data that happens to be valid UTF-8 (or to
all binary tokenized messages, if desired). When decoding, any extra 0xFF byte
is discarded.
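
A minimal sketch of this mitigation on the encoding side (a hypothetical
helper, not part of ``pw_tokenizer``):

.. code-block:: python

   def append_guard_byte(tokenized_message: bytes) -> bytes:
       """Appends 0xFF when the tokenized bytes happen to be valid UTF-8."""
       try:
           tokenized_message.decode()
       except UnicodeDecodeError:
           return tokenized_message  # Cannot be mistaken for plain text.
       # 0xFF never appears in valid UTF-8, so the decoder can detect the
       # guard byte and discard it after detokenization.
       return tokenized_message + b'\xff'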

Displaying undecoded binary as plain text instead of Base64
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a message fails to decode as binary tokenized and it is not valid UTF-8, it
is displayed as tokenized Base64. This makes it easily recognizable as a
tokenized message and makes it simple to decode later from the text output (for
example, with an updated token database).

A binary message for which the token is not known may coincidentally be valid
UTF-8 or ASCII; for example, 6.25% of all 4-byte sequences are composed only of
ASCII characters. When decoding with an out-of-date token database, it is
possible that some binary tokenized messages will be displayed as plain text
rather than tokenized Base64.

This situation is likely to occur, but it should be infrequent. Even if it does
happen, it is not a serious issue. A very small number of strings will be
displayed incorrectly, but these strings cannot be decoded anyway. One nonsense
string (e.g. ``a-D1``) would be displayed instead of another (``$YS1EMQ==``).
Updating the token database would resolve the issue, though the non-Base64 logs
would be difficult to decode later from a log file.

This situation can be avoided with the same approach described in
`Accidentally interpreting plain text as tokenized binary`_. Appending
an invalid UTF-8 character prevents the undecoded binary message from being
interpreted as plain text.

---------------------
Detokenization in C++
---------------------
The C++ detokenization libraries can be used in C++ or any language that can
call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
Java Native Interface (JNI) implementation is provided.

The C++ detokenization library uses binary-format token databases (created with
``database.py create --type binary``). Read a binary format database from a
file or include it in the source code. Pass the database array to
``TokenDatabase::Create``, and construct a detokenizer.

.. code-block:: cpp

   Detokenizer detokenizer(TokenDatabase::Create(token_database_array));

   std::string ProcessLog(span<uint8_t> log_data) {
     return detokenizer.Detokenize(log_data).BestString();
   }

The ``TokenDatabase`` class verifies that its data is valid before using it. If
it is invalid, ``TokenDatabase::Create`` returns an empty database for which
``ok()`` returns false. If the token database is included in the source code,
this check can be done at compile time.

.. code-block:: cpp

   // This line fails to compile with a static_assert if the database is invalid.
   constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();

   Detokenizer OpenDatabase(std::string_view path) {
     std::vector<uint8_t> data = ReadWholeFile(path);

     TokenDatabase database = TokenDatabase::Create(data);

     // This checks if the file contained a valid database. It is safe to use a
     // TokenDatabase that failed to load (it will be empty), but it may be
     // desirable to provide a default database or otherwise handle the error.
     if (database.ok()) {
       return Detokenizer(database);
     }
     return Detokenizer(kDefaultDatabase);
   }

----------------------------
Detokenization in TypeScript
----------------------------
To detokenize in TypeScript, import ``Detokenizer`` from the ``pigweedjs``
package, and instantiate it with a CSV token database.

.. code-block:: typescript

   import { pw_tokenizer, pw_hdlc } from 'pigweedjs';
   const { Detokenizer } = pw_tokenizer;
   const { Frame } = pw_hdlc;

   const detokenizer = new Detokenizer(String(tokenCsv));

   function processLog(frame: Frame) {
     const result = detokenizer.detokenize(frame);
     console.log(result);
   }

For messages that are encoded in Base64, use ``Detokenizer::detokenizeBase64``.
``detokenizeBase64`` will also attempt to detokenize nested Base64 tokens. There
is also ``detokenizeUint8Array``, which works just like ``detokenize`` but
expects a ``Uint8Array`` instead of a ``Frame`` argument.

.. _module-pw_tokenizer-cli-detokenizing:

---------------------
Detokenizing CLI tool
---------------------
``pw_tokenizer`` provides two standalone command line utilities for detokenizing
Base64-encoded tokenized strings.

* ``detokenize.py`` -- Detokenizes Base64-encoded strings in files or from
  stdin.
* ``serial_detokenizer.py`` -- Detokenizes Base64-encoded strings from a
  connected serial device.

If the ``pw_tokenizer`` Python package is installed, these tools may be executed
as runnable modules. For example:

.. code-block:: sh

   # Detokenize Base64-encoded strings in a file
   python -m pw_tokenizer.detokenize -i input_file.txt

   # Detokenize Base64-encoded strings in output from a serial device
   python -m pw_tokenizer.serial_detokenizer --device /dev/ttyACM0

See the ``--help`` options for these tools for full usage information.

--------
Appendix
--------

.. _module-pw_tokenizer-python-detokenization-c99-printf-notes:

Python detokenization: C99 ``printf`` compatibility notes
=========================================================
This implementation is designed to align with the
`C99 specification, section 7.19.6
<https://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf>`_.
Notably, this specification is slightly different from what most compilers
implement, because each compiler chooses to interpret undefined behavior in
slightly different ways. Treat the following description as the source of
truth.

This implementation supports:

- Overall Format: ``%[flags][width][.precision][length][specifier]``
- Flags (Zero or More)

  - ``-``: Left-justify within the given field width; right justification is
    the default (see the Width modifier).
  - ``+``: Forces the result to be preceded with a plus or minus sign (``+``
    or ``-``), even for positive numbers. By default, only negative numbers
    are preceded with a ``-`` sign.
  - (space): If no sign is going to be written, a blank space is inserted
    before the value.
  - ``#``: Specifies that an alternative print syntax should be used.

    - Used with the ``o``, ``x``, or ``X`` specifiers, the value is preceded
      with ``0``, ``0x``, or ``0X``, respectively, for values other than zero.
    - Used with ``a``, ``A``, ``e``, ``E``, ``f``, ``F``, ``g``, or ``G``, it
      forces the written output to contain a decimal point even if no more
      digits follow. By default, if no digits follow, no decimal point is
      written.

  - ``0``: Left-pads the number with zeroes (``0``) instead of spaces when
    padding is specified (see the width sub-specifier).

- Width (Optional)

  - ``(number)``: Minimum number of characters to be printed. If the value to
    be printed is shorter than this number, the result is padded with blank
    spaces or ``0`` if the ``0`` flag is present. The value is not truncated
    even if the result is larger. If the value is negative and the ``0`` flag
    is present, the ``0``\s are padded after the ``-`` symbol.
  - ``*``: The width is not specified in the format string, but as an
    additional integer value argument preceding the argument that has to be
    formatted.

- Precision (Optional)

  - ``.(number)``

    - For ``d``, ``i``, ``o``, ``u``, ``x``, and ``X``, specifies the minimum
      number of digits to be written. If the value to be written is shorter
      than this number, the result is padded with leading zeros. The value is
      not truncated even if the result is longer.

    - A precision of ``0`` means that no character is written for the value
      ``0``.

    - For ``a``, ``A``, ``e``, ``E``, ``f``, and ``F``, specifies the number
      of digits to be printed after the decimal point. By default, this is
      ``6``.

    - For ``g`` and ``G``, specifies the maximum number of significant digits
      to be printed.

    - For ``s``, specifies the maximum number of characters to be printed. By
      default all characters are printed until the ending null character is
      encountered.

    - If the period is specified without an explicit value for precision,
      ``0`` is assumed.

  - ``.*``: The precision is not specified in the format string, but as an
    additional integer value argument preceding the argument that has to be
    formatted.

- Length (Optional)

  - ``hh``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
    to convey the argument will be a ``signed char`` or ``unsigned char``.
    However, this is largely ignored in the implementation due to it not being
    necessary for Python or argument decoding (since the argument is always
    encoded at least as a 32-bit integer).
  - ``h``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
    to convey the argument will be a ``signed short int`` or
    ``unsigned short int``. However, this is largely ignored in the
    implementation due to it not being necessary for Python or argument
    decoding (since the argument is always encoded at least as a 32-bit
    integer).
  - ``l``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
    to convey the argument will be a ``signed long int`` or
    ``unsigned long int``. Also usable with ``c`` and ``s`` to specify that
    the arguments will be encoded with ``wchar_t`` values (which isn't
    different from normal ``char`` values). However, this is largely ignored in
    the implementation due to it not being necessary for Python or argument
    decoding (since the argument is always encoded at least as a 32-bit
    integer).
  - ``ll``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
    to convey the argument will be a ``signed long long int`` or
    ``unsigned long long int``. This is required to properly decode the
    argument as a 64-bit integer.
  - ``L``: Usable with the ``a``, ``A``, ``e``, ``E``, ``f``, ``F``, ``g``, or
    ``G`` conversion specifiers to convey the argument will be a
    ``long double``. However, this is ignored in the implementation because the
    floating point value encoding is unaffected by bit width.
  - ``j``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
    to convey the argument will be an ``intmax_t`` or ``uintmax_t``.
  - ``z``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
    to convey the argument will be a ``size_t``. This will force the argument
    to be decoded as an unsigned integer.
  - ``t``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
    to convey the argument will be a ``ptrdiff_t``.
  - If a length modifier is provided for an incorrect specifier, it is ignored.

- Specifier (Required)

  - ``d`` / ``i``: Used for signed decimal integers.

  - ``u``: Used for unsigned decimal integers.

  - ``o``: Used for unsigned integers and specifies formatting should be as an
    octal number.

  - ``x``: Used for unsigned integers and specifies formatting should be as a
    hexadecimal number using all lowercase letters.

  - ``X``: Used for unsigned integers and specifies formatting should be as a
    hexadecimal number using all uppercase letters.

  - ``f``: Used for floating-point values and specifies to use lowercase,
    decimal floating point formatting.

    - Default precision is ``6`` decimal places unless explicitly specified.

  - ``F``: Used for floating-point values and specifies to use uppercase,
    decimal floating point formatting.

    - Default precision is ``6`` decimal places unless explicitly specified.

  - ``e``: Used for floating-point values and specifies to use lowercase,
    exponential (scientific) formatting.

    - Default precision is ``6`` decimal places unless explicitly specified.

  - ``E``: Used for floating-point values and specifies to use uppercase,
    exponential (scientific) formatting.

    - Default precision is ``6`` decimal places unless explicitly specified.

  - ``g``: Used for floating-point values and specifies to use ``f`` or ``e``
    formatting, depending on which would be the shortest representation.

    - Precision specifies the number of significant digits, not just digits
      after the decimal place.

    - If the precision is specified as ``0``, it is interpreted to mean ``1``.

    - ``e`` formatting is used if the exponent would be less than ``-4`` or
      is greater than or equal to the precision.

    - Trailing zeros are removed unless the ``#`` flag is set.

    - A decimal point only appears if it is followed by a digit.

    - ``NaN`` or infinities always follow ``f`` formatting.

  - ``G``: Used for floating-point values and specifies to use ``F`` or ``E``
    formatting, depending on which would be the shortest representation.

    - Precision specifies the number of significant digits, not just digits
      after the decimal place.

    - If the precision is specified as ``0``, it is interpreted to mean ``1``.

    - ``E`` formatting is used if the exponent would be less than ``-4`` or
      is greater than or equal to the precision.

    - Trailing zeros are removed unless the ``#`` flag is set.

    - A decimal point only appears if it is followed by a digit.

    - ``NaN`` or infinities always follow ``F`` formatting.

  - ``c``: Used for formatting a ``char`` value.

  - ``s``: Used for formatting a string of ``char`` values.

    - If width is specified, the null terminator character is included as a
      character for the width count.

    - If precision is specified, no more ``char``\s than that value will be
      written from the string (padding is used to fill additional width).

  - ``p``: Used for formatting a pointer address.

  - ``%``: Prints a single ``%``. Only valid as ``%%`` (supports no flags,
    width, precision, or length modifiers).
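
Several of the flag and precision rules above behave the same way in Python's
own ``%`` formatting, which can be a convenient way to sanity-check expected
output (a rough illustration only; Python and C99 differ in some corners):

.. code-block:: python

   # '+' forces a sign, even for positive numbers.
   assert '%+d' % 5 == '+5'

   # '#' adds the 0x prefix for nonzero hexadecimal values.
   assert '%#x' % 255 == '0xff'

   # '0' pads with zeros; for negative values the zeros go after the '-'.
   assert '%08.2f' % -3.5 == '-0003.50'

   # Precision for %s limits the number of characters printed.
   assert '%.3s' % 'tokenized' == 'tok'

   # %g drops trailing zeros and switches to exponential form as needed.
   assert '%g' % 0.000012 == '1.2e-05'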

Underspecified details:

- If both ``+`` and (space) flags appear, the (space) is ignored.
- The ``+`` and (space) flags will error if used with ``c`` or ``s``.
- The ``#`` flag will error if used with ``d``, ``i``, ``u``, ``c``, ``s``, or
  ``p``.
- The ``0`` flag will error if used with ``c``, ``s``, or ``p``.
- Both ``+`` and (space) can work with the unsigned integer specifiers ``u``,
  ``o``, ``x``, and ``X``.
- If a length modifier is provided for an incorrect specifier, it is ignored.
- The ``z`` length modifier will decode arguments as signed as long as ``d`` or
  ``i`` is used.
- ``p`` is implementation defined.

  - For this implementation, it will print with a ``0x`` prefix and then the
    pointer value printed using ``%08X``.

  - ``p`` supports the ``+``, ``-``, and (space) flags, but not the ``#`` or
    ``0`` flags.

  - None of the length modifiers are usable with ``p``.

  - This implementation will try to adhere to user-specified width (assuming
    the width provided is larger than the guaranteed minimum of ``10``).

  - Specifying precision for ``p`` is considered an error.

- Only ``%%`` is allowed with no other modifiers. Things like ``%+%`` will fail
  to decode. Some C stdlib implementations accept other modifiers between the
  two ``%`` characters, but ignore them in the output.
- If a width is specified with the ``0`` flag for a negative value, the padded
  ``0``\s will appear after the ``-`` symbol.
- A precision of ``0`` for ``d``, ``i``, ``u``, ``o``, ``x``, or ``X`` means
  that no character is written for the value ``0``.
- Precision cannot be specified for ``c``.
- Using ``*`` or fixed precision with the ``s`` specifier still requires the
  string argument to be null-terminated. This is because argument encoding
  happens on the C/C++ side, while the precision value is not read or otherwise
  used until decoding happens in this Python code.

Non-conformant details:

- ``n`` specifier: We do not support the ``n`` specifier, since it is impossible
  to retroactively tell the original program how many characters have been
  printed; decoding happens long after the device sent the message, usually on
  a separate processing device entirely.