:mod:`tokenize` --- Tokenizer for Python source
===============================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.
.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

**Source code:** :source:`Lib/tokenize.py`

--------------

The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python.  The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers," including
colorizers for on-screen displays.

To simplify token stream handling, all :ref:`operators` and :ref:`delimiters`
tokens are returned using the generic :data:`token.OP` token type.  The exact
type can be determined by checking the token string (the second item of the
tuple returned by :func:`generate_tokens`) for the character sequence that
identifies a specific operator token.

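For example, a short sketch along these lines, reading the source from an
in-memory string, can tell operators apart by their token string::

    from StringIO import StringIO
    from tokenize import generate_tokens, OP

    for toknum, tokval, _, _, _ in generate_tokens(StringIO("x += 1\n").readline):
        if toknum == OP:
            print tokval        # prints '+=' -- the token string names the operator
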
The primary entry point is a :term:`generator`:

.. function:: generate_tokens(readline)

   The :func:`generate_tokens` generator requires one argument, *readline*,
   which must be a callable object that provides the same interface as the
   :meth:`readline` method of built-in file objects (see section
   :ref:`bltin-file-objects`).  Each call to the function should return one line
   of input as a string.  Alternatively, *readline* may be a callable object that
   signals completion by raising :exc:`StopIteration`.

   The generator produces 5-tuples with these members: the token type; the token
   string; a 2-tuple ``(srow, scol)`` of ints specifying the row and column
   where the token begins in the source; a 2-tuple ``(erow, ecol)`` of ints
   specifying the row and column where the token ends in the source; and the
   line on which the token was found.  The line passed (the last tuple item) is
   the *physical* line.

   .. versionadded:: 2.2

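For example, a minimal sketch that prints each tuple, assuming the source is
supplied from an in-memory string::

    from StringIO import StringIO
    from token import tok_name
    from tokenize import generate_tokens

    source = "answer = 42  # the answer\n"
    g = generate_tokens(StringIO(source).readline)
    for toknum, tokval, (srow, scol), (erow, ecol), line in g:
        # e.g. NAME 'answer' (1, 0) (1, 6)
        print tok_name[toknum], repr(tokval), (srow, scol), (erow, ecol)
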
An older entry point is retained for backward compatibility:


.. function:: tokenize(readline[, tokeneater])

   The :func:`tokenize` function accepts two parameters: one representing the
   input stream, and one providing an output mechanism for :func:`tokenize`.

   The first parameter, *readline*, must be a callable object that provides the
   same interface as the :meth:`readline` method of built-in file objects (see
   section :ref:`bltin-file-objects`).  Each call to the function should return
   one line of input as a string.  Alternatively, *readline* may be a callable
   object that signals completion by raising :exc:`StopIteration`.

   .. versionchanged:: 2.5
      Added :exc:`StopIteration` support.

   The second parameter, *tokeneater*, must also be a callable object.  It is
   called once for each token, with five arguments, corresponding to the tuples
   generated by :func:`generate_tokens`.

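A small sketch of such a callback, where the ``print_token`` name is purely
illustrative, might look like this::

    from StringIO import StringIO
    from token import tok_name
    import tokenize

    def print_token(toknum, tokval, start, end, line):
        # called once per token by tokenize.tokenize()
        print tok_name[toknum], repr(tokval), start, end

    tokenize.tokenize(StringIO("x = 1\n").readline, print_token)
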
All constants from the :mod:`token` module are also exported from
:mod:`tokenize`, as are two additional token type values that might be passed to
the *tokeneater* function by :func:`tokenize`:


.. data:: COMMENT

   Token value used to indicate a comment.


.. data:: NL

   Token value used to indicate a non-terminating newline.  The NEWLINE token
   indicates the end of a logical line of Python code; NL tokens are generated
   when a logical line of code is continued over multiple physical lines.

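The distinction can be seen in a short sketch along these lines, where the
comment and the line break inside the parentheses produce :data:`COMMENT` and
:data:`NL` tokens, while the end of the logical line produces a single
``NEWLINE``::

    from StringIO import StringIO
    from token import tok_name
    from tokenize import generate_tokens, COMMENT, NL, NEWLINE

    source = "total = (1 +   # first part\n         2)\n"
    for toknum, tokval, _, _, _ in generate_tokens(StringIO(source).readline):
        if toknum in (COMMENT, NL, NEWLINE):
            print tok_name[toknum], repr(tokval)
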
Another function is provided to reverse the tokenization process.  This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

   Converts tokens back into Python source code.  The *iterable* must return
   sequences with at least two elements, the token type and the token string.
   Any additional sequence elements are ignored.

   The reconstructed script is returned as a single string.  The result is
   guaranteed to tokenize back to match the input so that the conversion is
   lossless and round-trips are assured.  The guarantee applies only to the
   token type and token string, as the spacing between tokens (column positions)
   may change.

   .. versionadded:: 2.5

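A minimal sketch of such a round trip, assuming a short in-memory source string,
might look like::

    from StringIO import StringIO
    from tokenize import generate_tokens, untokenize

    source = "x = 3.14 * radius ** 2\n"
    tokens = [(toknum, tokval)
              for toknum, tokval, _, _, _ in generate_tokens(StringIO(source).readline)]
    rebuilt = untokenize(tokens)
    # 'rebuilt' may be spaced differently from 'source', but tokenizing it
    # again yields the same token types and token strings.
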
Example of a script rewriter that transforms float literals into Decimal
objects::

    from StringIO import StringIO
    from tokenize import generate_tokens, untokenize, NUMBER, STRING, NAME, OP

    def decistmt(s):
        """Substitute Decimals for floats in a string of statements.

        >>> from decimal import Decimal
        >>> s = 'print +21.3e-5*-.1234/81.7'
        >>> decistmt(s)
        "print +Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7')"

        >>> exec(s)
        -3.21716034272e-007
        >>> exec(decistmt(s))
        -3.217160342717258261933904529E-7

        """
        result = []
        g = generate_tokens(StringIO(s).readline)   # tokenize the string
        for toknum, tokval, _, _, _ in g:
            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
                result.extend([
                    (NAME, 'Decimal'),
                    (OP, '('),
                    (STRING, repr(tokval)),
                    (OP, ')')
                ])
            else:
                result.append((toknum, tokval))
        return untokenize(result)