| Writing Extensions for Python-Markdown |
| ====================================== |
| |
| Overview |
| -------- |
| |
| Python-Markdown includes an API for extension writers to plug their own |
| custom functionality and/or syntax into the parser. There are preprocessors |
| which allow you to alter the source before it is passed to the parser, |
| inline patterns which allow you to add, remove or override the syntax of |
| any inline elements, and postprocessors which allow munging of the |
| output of the parser before it is returned. If you really want to dive in, |
| there are also blockprocessors which are part of the core BlockParser. |
| |
| As the parser builds an [ElementTree][] object which is later rendered |
| as Unicode text, there are also some helpers provided to ease manipulation of |
| the tree. Each part of the API is discussed in its respective section below. |
| Additionally, reading the source of some [[Available Extensions]] may be helpful. |
| For example, the [[Footnotes]] extension uses most of the features documented |
| here. |
| |
| * [Preprocessors][] |
| * [InlinePatterns][] |
| * [Treeprocessors][] |
| * [Postprocessors][] |
| * [BlockParser][] |
| * [Working with the ElementTree][] |
| * [Integrating your code into Markdown][] |
| * [extendMarkdown][] |
| * [OrderedDict][] |
| * [registerExtension][] |
| * [Config Settings][] |
| * [makeExtension][] |
| |
| <h3 id="preprocessors">Preprocessors</h3> |
| |
| Preprocessors munge the source text before it is passed into the Markdown |
| core. This is an excellent place to clean up bad syntax, extract things the |
| parser may otherwise choke on and perhaps even store them for later retrieval. |
| |
| Preprocessors should inherit from ``markdown.preprocessors.Preprocessor`` and |
| implement a ``run`` method with one argument ``lines``. The ``run`` method of |
| each Preprocessor will be passed the entire source text as a list of Unicode |
| strings. Each string will contain one line of text. The ``run`` method should |
| return either that list, or an altered list of Unicode strings. |
| |
| A pseudo example: |
| |
| class MyPreprocessor(markdown.preprocessors.Preprocessor): |
|     def run(self, lines): |
|         new_lines = [] |
|         for line in lines: |
|             m = MYREGEX.match(line) |
|             if m: |
|                 # do stuff |
|                 pass |
|             else: |
|                 new_lines.append(line) |
|         return new_lines |
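
To make the ``run`` contract concrete, here is a minimal standalone sketch that does not import Markdown at all. The class name, the ``%%`` comment syntax and the ``stash`` attribute are assumptions made purely for illustration; a real preprocessor would inherit from ``markdown.preprocessors.Preprocessor`` as described above.

```python
import re

# Hypothetical syntax: lines starting with "%%" are comments to be removed.
COMMENT_RE = re.compile(r'^%%\s?(.*)$')


class CommentStripperSketch(object):
    """Stand-in showing the shape of a preprocessor's ``run`` method."""

    def __init__(self):
        self.stash = []  # removed text, kept for later retrieval

    def run(self, lines):
        new_lines = []
        for line in lines:
            m = COMMENT_RE.match(line)
            if m:
                self.stash.append(m.group(1))  # store it, do not emit it
            else:
                new_lines.append(line)
        return new_lines


pre = CommentStripperSketch()
result = pre.run(['A paragraph.', '%% a private note', 'More text.'])
```

The method receives a list of lines and hands back a list of lines; whatever it removed stays available on the instance.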
| |
| <h3 id="inlinepatterns">Inline Patterns</h3> |
| |
| Inline Patterns implement the inline HTML element syntax for Markdown such as |
| ``*emphasis*`` or ``[links](http://example.com)``. Pattern objects should be |
| instances of classes that inherit from ``markdown.inlinepatterns.Pattern`` or |
| one of its children. Each pattern object uses a single regular expression and |
| must have the following methods: |
| |
| * **``getCompiledRegExp()``**: |
| |
| Returns a compiled regular expression. |
| |
| * **``handleMatch(m)``**: |
| |
| Accepts a match object and returns an ElementTree element or a plain |
| Unicode string. |
| |
| Note that any regular expression returned by ``getCompiledRegExp`` must capture |
| the whole block. Therefore, they should all start with ``r'^(.*?)'`` and end |
| with ``r'(.*?)$'``. When using the default ``getCompiledRegExp()`` method |
| provided in the ``Pattern`` class, you can pass in a regular expression without |
| that wrapping and ``getCompiledRegExp`` will wrap your expression for you. This |
| means that the first group of your match will be ``m.group(2)``, as |
| ``m.group(1)`` will match everything before the pattern. |
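
The effect of the wrapping on group numbers can be checked with the standard ``re`` module alone. The following is only a sketch of the idea: the expression is the oversimplified emphasis regex used in this section, and the ``re.DOTALL`` flag is an assumption rather than something stated here.

```python
import re

# An oversimplified emphasis expression with a single capture group.
MYPATTERN = r'\*([^*]+)\*'

# Wrap it as described: a lazy group in front and a lazy group behind.
wrapped = re.compile(r'^(.*?)%s(.*?)$' % MYPATTERN, re.DOTALL)

m = wrapped.match('some *emphasized* text')
before = m.group(1)   # everything before the pattern
content = m.group(2)  # first group of the original expression
after = m.group(3)    # everything after the pattern
```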
| |
| For an example, consider this simplified emphasis pattern: |
| |
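| class EmphasisPattern(markdown.inlinepatterns.Pattern): |
|     def handleMatch(self, m): |
|         el = markdown.etree.Element('em') |
|         # group(1) is the text before the match, so the first group |
|         # of the pattern itself is group(2) |
|         el.text = m.group(2) |
|         return el |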
| |
| As discussed in [Integrating Your Code Into Markdown][], an instance of this |
| class will need to be provided to Markdown. That instance would be created |
| like so: |
| |
| # an oversimplified regex |
| MYPATTERN = r'\*([^*]+)\*' |
| # pass in pattern and create instance |
| emphasis = EmphasisPattern(MYPATTERN) |
| |
| Actually it would not be necessary to create that pattern (and not just because |
| a more sophisticated emphasis pattern already exists in Markdown). The fact is, |
| that example pattern is not very DRY. A pattern for `**strong**` text would |
| be almost identical, with the exception that it would create a 'strong' element. |
| Therefore, Markdown provides a number of generic pattern classes that can |
| provide some common functionality. For example, both emphasis and strong are |
| implemented with separate instances of the ``SimpleTagPattern`` listed below. |
| Feel free to use or extend any of these Pattern classes. |
| |
| **Generic Pattern Classes** |
| |
| * **``SimpleTextPattern(pattern)``**: |
| |
| Returns simple text of ``group(2)`` of a ``pattern``. |
| |
| * **``SimpleTagPattern(pattern, tag)``**: |
| |
| Returns an element of type "`tag`" with a text attribute of ``group(3)`` |
| of a ``pattern``. ``tag`` should be a string of an HTML element (e.g. 'em'). |
| |
| * **``SubstituteTagPattern(pattern, tag)``**: |
| |
| Returns an element of type "`tag`" with no children or text (e.g. 'br'). |
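
As a rough illustration of why a generic tag pattern reads its content from ``group(3)``, the sketch below rebuilds the idea with the standard library only. ``TagPatternSketch`` and both regular expressions are hypothetical stand-ins, not the classes shipped with Markdown, and the standard library's ``xml.etree.ElementTree`` is used in place of ``markdown.etree``.

```python
import re
import xml.etree.ElementTree as etree


class TagPatternSketch(object):
    """Stand-in for a generic pattern: one class, many tags."""

    def __init__(self, pattern, tag):
        # Wrapping shifts group numbers by one, so "\2" inside ``pattern``
        # refers to the first group written in ``pattern`` itself.
        self.compiled_re = re.compile(r'^(.*?)%s(.*?)$' % pattern, re.DOTALL)
        self.tag = tag

    def getCompiledRegExp(self):
        return self.compiled_re

    def handleMatch(self, m):
        el = etree.Element(self.tag)
        el.text = m.group(3)  # group(2) is the delimiter, group(3) the content
        return el


# Hypothetical, oversimplified expressions for the two tags.
emphasis = TagPatternSketch(r'(\*)([^*]+)\2', 'em')
strong = TagPatternSketch(r'(\*\*)(.+?)\2', 'strong')

em_el = emphasis.handleMatch(
    emphasis.getCompiledRegExp().match('an *important* word'))
strong_el = strong.handleMatch(
    strong.getCompiledRegExp().match('a **bold** word'))
```

Only the expression and the tag name differ between the two instances, which is the duplication the generic classes are meant to remove.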
| |
| There may be other Pattern classes in the Markdown source that you could extend |
| or use as well. Read through the source and see if there is anything you can |
| use. You might even get a few ideas for different approaches to your specific |
| situation. |
| |
| <h3 id="treeprocessors">Treeprocessors</h3> |
| |
| Treeprocessors manipulate an ElementTree object after it has passed through the |
| core BlockParser. This is where additional manipulation of the tree takes |
| place. Additionally, the InlineProcessor is a Treeprocessor which steps through |
| the tree and runs the InlinePatterns on the text of each Element in the tree. |
| |
| A Treeprocessor should inherit from ``markdown.treeprocessors.Treeprocessor`` |
| and over-ride the ``run`` method, which takes one argument ``root`` (an |
| ElementTree element) and returns either that root element or a modified root element. |
| |
| A pseudo example: |
| |
| class MyTreeprocessor(markdown.treeprocessors.Treeprocessor): |
|     def run(self, root): |
|         # do stuff with root, modifying it in place |
|         return root |
| |
| For specifics on manipulating the ElementTree, see |
| [Working with the ElementTree][] below. |
| |
| <h3 id="postprocessors">Postprocessors</h3> |
| |
| Postprocessors manipulate the document after the ElementTree has been |
| serialized into a string. Postprocessors should be used to work with the |
| text just before output. |
| |
| A Postprocessor should inherit from ``markdown.postprocessors.Postprocessor`` |
| and over-ride the ``run`` method which takes one argument ``text`` and returns |
| a Unicode string. |
| |
| For example, this may be an appropriate place to add a table of contents to a |
| document: |
| |
| class TocPostprocessor(markdown.postprocessors.Postprocessor): |
|     def run(self, text): |
|         return MYMARKERRE.sub(MyToc, text) |
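
A runnable variation on the same idea, written without Markdown so it can be tried in isolation, might look like the following. The ``[TOC]`` marker, the two regular expressions and the class name are all assumptions for this sketch; it only shows a ``run(text)`` method that takes serialized text and returns altered text.

```python
import re

# Hypothetical marker an author would type where the contents belong.
TOC_MARKER_RE = re.compile(r'<p>\[TOC\]</p>')
HEADING_RE = re.compile(r'<h2>(.*?)</h2>')


class TocPostprocessorSketch(object):
    """Stand-in showing a ``run(text)`` that returns the altered text."""

    def run(self, text):
        titles = HEADING_RE.findall(text)
        items = ''.join('<li>%s</li>' % title for title in titles)
        toc = '<ul>%s</ul>' % items
        # A function replacement avoids backslash handling inside ``toc``.
        return TOC_MARKER_RE.sub(lambda m: toc, text)


html = '<p>[TOC]</p>\n<h2>First</h2>\n<h2>Second</h2>'
output = TocPostprocessorSketch().run(html)
```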
| |
| <h3 id="blockparser">BlockParser</h3> |
| |
| Sometimes, pre/tree/postprocessors and Inline Patterns aren't going to do what |
| you need. Perhaps you want a new type of block that needs to be integrated |
| into the core parsing. In such a situation, you can add/change/remove |
| functionality of the core ``BlockParser``. The BlockParser is composed of a |
| number of Blockprocessors. The BlockParser steps through each block of text |
| (split by blank lines) and passes each block to the appropriate Blockprocessor. |
| That Blockprocessor parses the block and adds it to the ElementTree. The |
| [[Definition Lists]] extension would be a good example of an extension that |
| adds/modifies Blockprocessors. |
| |
| A Blockprocessor should inherit from ``markdown.blockprocessors.BlockProcessor`` |
| and implement both the ``test`` and ``run`` methods. |
| |
| The ``test`` method is used by BlockParser to identify the type of block. |
| Therefore the ``test`` method must return a boolean value. If the test returns |
| ``True``, then the BlockParser will call that Blockprocessor's ``run`` method. |
| If it returns ``False``, the BlockParser will move on to the next |
| BlockProcessor. |
| |
| The **``test``** method takes two arguments: |
| |
| * **``parent``**: The parent etree Element of the block. This can be useful as |
| the block may need to be treated differently if it is inside a list, for |
| example. |
| |
| * **``block``**: A string of the current block of text. The test may be a |
| simple string method (such as ``block.startswith(some_text)``) or a complex |
| regular expression. |
| |
| The **``run``** method takes two arguments: |
| |
| * **``parent``**: A pointer to the parent etree Element of the block. The run |
| method will most likely attach additional nodes to this parent. Note that |
| nothing is returned by the method. The ElementTree object is altered in place. |
| |
| * **``blocks``**: A list of all remaining blocks of the document. Your run |
| method must remove (pop) the first block from the list (which is altered in |
| place, not returned) and parse that block. You may find that a block of text |
| legitimately contains multiple block types. Therefore, after processing the |
| first type, your processor can insert the remaining text into the beginning |
| of the ``blocks`` list for future parsing. |
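
The interplay of ``test`` and ``run`` can be sketched without Markdown itself. Everything named below (the ``NOTE:`` syntax, both classes and the ``parse_blocks`` driver) is hypothetical and only mimics the dispatch described above; a real Blockprocessor would inherit from ``markdown.blockprocessors.BlockProcessor``.

```python
import xml.etree.ElementTree as etree


class NoteBlockSketch(object):
    """Stand-in for a Blockprocessor handling a made-up "NOTE:" block."""

    def test(self, parent, block):
        return block.startswith('NOTE:')

    def run(self, parent, blocks):
        block = blocks.pop(0)               # remove the block being handled
        first, _, rest = block.partition('\n')
        note = etree.SubElement(parent, 'div')
        note.set('class', 'note')
        note.text = first[len('NOTE:'):].strip()
        if rest:
            blocks.insert(0, rest)          # leave the remainder for later


class ParagraphBlockSketch(object):
    """Fallback that accepts any block."""

    def test(self, parent, block):
        return True

    def run(self, parent, blocks):
        etree.SubElement(parent, 'p').text = blocks.pop(0)


def parse_blocks(parent, blocks, processors):
    # Hand each block to the first processor whose test passes.
    while blocks:
        for processor in processors:
            if processor.test(parent, blocks[0]):
                processor.run(parent, blocks)
                break


root = etree.Element('div')
source = 'NOTE: read this\ntrailing line\n\nA paragraph.'
parse_blocks(root, source.split('\n\n'),
             [NoteBlockSketch(), ParagraphBlockSketch()])
tags = [child.tag for child in root]
```

Note how the first processor consumes only the line it understands and pushes the rest back onto the front of ``blocks``, where the fallback picks it up on the next pass.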
| |
| Please be aware that a single block can span multiple text blocks. For example, |
| the official Markdown syntax rules state that a blank line does not end a |
| Code Block. If the next block of text is also indented, then it is part of |
| the previous block. The BlockParser was specifically designed to |
| address these types of situations. If you look at the ``CodeBlockProcessor`` |
| in the core, you will see that it checks the last child of the ``parent``. |
| If the last child is a code block (``<pre><code>...</code></pre>``), then it |
| appends that block to the previous code block rather than creating a new |
| code block. |
| |
| Each BlockProcessor has the following utility methods available: |
| |
| * **``lastChild(parent)``**: |
| |
| Returns the last child of the given etree Element or ``None`` if it had no |
| children. |
| |
| * **``detab(text)``**: |
| |
| Removes one level of indent (four spaces by default) from the front of each |
| line of the given text string. |
| |
| * **``looseDetab(text, level)``**: |
| |
| Removes "level" levels of indent (defaults to 1) from the front of each line |
| of the given text string. However, this method allows secondary lines to |
| not be indented, as do some parts of the Markdown syntax. |
| |
| Each BlockProcessor also has a pointer to the containing BlockParser instance at |
| ``self.parser``, which can be used to check or alter the state of the parser. |
| The BlockParser tracks its state in a stack at ``parser.state``. The state |
| stack is an instance of the ``State`` class. |
| |
| **``State``** is a subclass of ``list`` and has the additional methods: |
| |
| * **``set(state)``**: |
| |
| Set a new state to string ``state``. The new state is appended to the end |
| of the stack. |
| |
| * **``reset()``**: |
| |
| Step back one step in the stack. The last state at the end is removed from |
| the stack. |
| |
| * **``isstate(state)``**: |
| |
| Test that the top (current) level of the stack is of the given string |
| ``state``. |
| |
| Note that to ensure that the state stack doesn't become corrupted, each time a |
| state is set for a block, that state *must* be reset when the parser finishes |
| parsing that block. |
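
The behaviour described for ``State`` is small enough to re-create as a sketch. The class below is an illustrative re-implementation, not the library's source, and ``parse_list_block`` is a hypothetical helper showing one way to guarantee that every ``set`` is paired with a ``reset``.

```python
class StateSketch(list):
    """Re-implementation of the described behaviour, for illustration."""

    def set(self, state):
        self.append(state)          # push the new state

    def reset(self):
        self.pop()                  # drop the most recent state

    def isstate(self, state):
        return bool(self) and self[-1] == state


def parse_list_block(state, work):
    state.set('list')
    try:
        return work()               # nested parsing would happen here
    finally:
        state.reset()               # always undo the state that was set


state = StateSketch()
seen_inside = parse_list_block(state, lambda: state.isstate('list'))
```

Using ``try``/``finally`` keeps the stack balanced even if the nested work raises an exception.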
| |
| An instance of the **``BlockParser``** is found at ``Markdown.parser``. |
| ``BlockParser`` has the following methods: |
| |
| * **``parseDocument(lines)``**: |
| |
| Given a list of lines, an ElementTree object is returned. This should be |
| passed an entire document and is the only method the ``Markdown`` class |
| calls directly. |
| |
| * **``parseChunk(parent, text)``**: |
| |
| Parses a chunk of markdown text composed of multiple blocks and attaches |
| those blocks to the ``parent`` Element. The ``parent`` is altered in place |
| and nothing is returned. Extensions would most likely use this method for |
| block parsing. |
| |
| * **``parseBlocks(parent, blocks)``**: |
| |
| Parses a list of blocks of text and attaches those blocks to the ``parent`` |
| Element. The ``parent`` is altered in place and nothing is returned. This |
| method will generally only be used internally to recursively parse nested |
| blocks of text. |
| |
| While it is not recommended, an extension could subclass or completely replace |
| the ``BlockParser``. The new class would have to provide the same public API. |
| However, be aware that other extensions may expect the core parser provided |
| and will not work with such a drastically different parser. |
| |
| <h3 id="working_with_et">Working with the ElementTree</h3> |
| |
| As mentioned, the Markdown parser converts a source document to an |
| [ElementTree][] object before serializing that back to Unicode text. |
| Markdown has provided some helpers to ease that manipulation within the context |
| of the Markdown module. |
| |
| First, to get access to the ElementTree module import ElementTree from |
| ``markdown`` rather than importing it directly. This will ensure you are using |
| the same version of ElementTree as markdown. The module is named ``etree`` |
| within Markdown. |
| |
| from markdown import etree |
| |
| ``markdown.etree`` tries to import ElementTree from any known location, first |
| as a standard library module (from ``xml.etree`` in Python 2.5), then as a third |
| party package (``elementtree``). In each instance, ``cElementTree`` is tried |
| first, then ``ElementTree`` if the faster C implementation is not available on |
| your system. |
| |
| Sometimes you may want text inserted into an element to be parsed by |
| [InlinePatterns][]. In such a situation, simply insert the text as you normally |
| would and the text will be automatically run through the InlinePatterns. |
| However, if you do *not* want some text to be parsed by InlinePatterns, |
| then insert the text as an ``AtomicString``. |
| |
| some_element.text = markdown.AtomicString(some_text) |
| |
| Here's a basic example which creates an HTML table. Note that the contents of |
| the second cell (``td2``) will be run through InlinePatterns later: |
| |
| table = etree.Element("table") |
| table.set("cellpadding", "2") # Set cellpadding to 2 |
| tr = etree.SubElement(table, "tr") # Add child tr to table |
| td1 = etree.SubElement(tr, "td") # Add child td1 to tr |
| td1.text = markdown.AtomicString("Cell content") # Add plain text content |
| td2 = etree.SubElement(tr, "td") # Add second td to tr |
| td2.text = "*text* with **inline** formatting." # Add markup text |
| table.tail = "Text after table" # Add text after table |
| |
| You can also manipulate an existing tree. Consider the following example which |
| adds a ``class`` attribute to ``<a>`` elements: |
| |
| def set_link_class(self, element): |
|     for child in element: |
|         if child.tag == "a": |
|             child.set("class", "myclass")  # set the class attribute |
|         self.set_link_class(child)  # run recursively on children |
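
For experimenting outside of an extension, the same traversal can be written as a plain function. This sketch assumes the standard library's ``xml.etree.ElementTree`` in place of ``markdown.etree``, and the sample markup and the ``class_name`` parameter are made up for the demonstration.

```python
import xml.etree.ElementTree as etree


def set_link_class(element, class_name='myclass'):
    """Add a class attribute to every <a> below ``element``."""
    for child in element:
        if child.tag == 'a':
            child.set('class', class_name)
        set_link_class(child, class_name)  # descend into the child


root = etree.fromstring(
    '<div><p>See <a href="#x">this</a> and '
    '<em><a href="#y">that</a></em>.</p></div>'
)
set_link_class(root)
classes = [a.get('class') for a in root.iter('a')]
```

Both links receive the attribute, including the one nested inside ``<em>``, because the function recurses into every child rather than only into direct children of the root.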
| |
| For more information about working with ElementTree see the ElementTree |
| [Documentation](http://effbot.org/zone/element-index.htm) |
| ([Python Docs](http://docs.python.org/lib/module-xml.etree.ElementTree.html)). |
| |
| <h3 id="integrating_into_markdown">Integrating Your Code Into Markdown</h3> |
| |
| Once you have the various pieces of your extension built, you need to tell |
| Markdown about them and ensure that they are run in the proper sequence. |
| Markdown accepts an ``Extension`` instance for each extension. Therefore, you |
| will need to define a class that extends ``markdown.Extension`` and over-rides |
| the ``extendMarkdown`` method. Within this class you will manage configuration |
| options for your extension and attach the various processors and patterns to |
| the Markdown instance. |
| |
| It is important to note that the order of the various processors and patterns |
| matters. For example, if we replace ``http://...`` links with ``<a>`` elements, and |
| *then* try to deal with inline html, we will end up with a mess. Therefore, |
| the various types of processors and patterns are stored within an instance of |
| the Markdown class in [OrderedDict][]s. Your ``Extension`` class will need to |
| manipulate those OrderedDicts appropriately. You may insert instances of your |
| processors and patterns into the appropriate location in an OrderedDict, remove |
| a built-in instance, or replace a built-in instance with your own. |
| |
| <h4 id="extendmarkdown">extendMarkdown</h4> |
| |
| The ``extendMarkdown`` method of a ``markdown.Extension`` class accepts two |
| arguments: |
| |
| * **``md``**: |
| |
| A pointer to the instance of the Markdown class. You should use this to |
| access the [OrderedDict][]s of processors and patterns. They are found |
| under the following attributes: |
| |
| * ``md.preprocessors`` |
| * ``md.inlinePatterns`` |
| * ``md.parser.blockprocessors`` |
| * ``md.treeprocessors`` |
| * ``md.postprocessors`` |
| |
| Some other things you may want to access in the markdown instance are: |
| |
| * ``md.htmlStash`` |
| * ``md.output_formats`` |
| * ``md.set_output_format()`` |
| * ``md.registerExtension()`` |
| |
| * **``md_globals``**: |
| |
| Contains all the various global variables within the markdown module. |
| |
| Of course, with access to those items, theoretically you have the option of |
| changing anything through various [monkey_patching][] techniques. However, you |
| should be aware that the various undocumented or private parts of markdown |
| may change without notice and your monkey_patches may break with a new release. |
| Therefore, what you really should be doing is inserting processors and patterns |
| into the markdown pipeline. Consider yourself warned. |
| |
| [monkey_patching]: http://en.wikipedia.org/wiki/Monkey_patch |
| |
| A simple example: |
| |
| class MyExtension(markdown.Extension): |
|     def extendMarkdown(self, md, md_globals): |
|         # Insert instance of 'mypattern' before 'references' pattern |
|         md.inlinePatterns.add('mypattern', MyPattern(md), '<references') |
| |
| <h4 id="ordereddict">OrderedDict</h4> |
| |
| An OrderedDict is a dictionary-like object that retains the order of its |
| items. The items are ordered in the order in which they were appended to |
| the OrderedDict. However, an item can also be inserted into the OrderedDict |
| in a specific location in relation to the existing items. |
| |
| Think of OrderedDict as a combination of a list and a dictionary as it has |
| methods common to both. For example, you can get and set items using the |
| ``od[key] = value`` syntax and the methods ``keys()``, ``values()``, and |
| ``items()`` work as expected with the keys, values and items returned in the |
| proper order. At the same time, you can use ``insert()``, ``append()``, and |
| ``index()`` as you would with a list. |
| |
| Generally speaking, within Markdown extensions you will be using the special |
| helper method ``add()`` to add additional items to an existing OrderedDict. |
| |
| The ``add()`` method accepts three arguments: |
| |
| * **``key``**: A string. The key is used for later reference to the item. |
| |
| * **``value``**: The object instance stored in this item. |
| |
| * **``location``**: Optional. The item's location in relation to other items. |
| |
| Note that the location can consist of a few different values: |
| |
| * The special strings ``"_begin"`` and ``"_end"`` insert that item at the |
| beginning or end of the OrderedDict respectively. |
| |
| * A less-than sign (``<``) followed by an existing key (e.g. |
| ``"<somekey"``) inserts that item before the existing key. |
| |
| * A greater-than sign (``>``) followed by an existing key (e.g. |
| ``">somekey"``) inserts that item after the existing key. |
| |
| Consider the following example: |
| |
| >>> import markdown |
| >>> od = markdown.OrderedDict() |
| >>> od['one'] = 1 # The same as: od.add('one', 1, '_begin') |
| >>> od['three'] = 3 # The same as: od.add('three', 3, '>one') |
| >>> od['four'] = 4 # The same as: od.add('four', 4, '_end') |
| >>> od.items() |
| [('one', 1), ('three', 3), ('four', 4)] |
| |
| Note that when building an OrderedDict in order, the extra features of the |
| ``add`` method offer no real value and are not necessary. However, when |
| manipulating an existing OrderedDict, ``add`` can be very helpful. So let's |
| insert another item into the OrderedDict. |
| |
| >>> od.add('two', 2, '>one') # Insert after 'one' |
| >>> od.values() |
| [1, 2, 3, 4] |
| |
| Now let's insert another item. |
| |
| >>> od.add('twohalf', 2.5, '<three') # Insert before 'three' |
| >>> od.keys() |
| ['one', 'two', 'twohalf', 'three', 'four'] |
| |
| Note that we also could have set the location of "twohalf" to be 'after two' |
| (i.e.: ``'>two'``). However, it's unlikely that you will have control over the |
| order in which extensions will be loaded, and this could affect the final |
| sorted order of an OrderedDict. For example, suppose an extension adding |
| 'twohalf' in the above examples was loaded before a separate extension which |
| adds 'two'. You may need to take this into consideration when adding your |
| extension components to the various markdown OrderedDicts. |
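
To see how the location strings translate into positions, here is a deliberately tiny sketch of the insertion logic. ``OrderedDictSketch`` is not the class shipped with Markdown; it keeps a list of keys next to a plain dictionary and implements only ``add()`` and ``keys()``.

```python
class OrderedDictSketch(object):
    """Tiny illustration of location-based insertion; not the real class."""

    def __init__(self):
        self._keys = []     # preserves order
        self._values = {}   # maps key -> value

    def add(self, key, value, location='_end'):
        if location == '_begin':
            index = 0
        elif location == '_end':
            index = len(self._keys)
        elif location.startswith('<'):
            index = self._keys.index(location[1:])       # before that key
        elif location.startswith('>'):
            index = self._keys.index(location[1:]) + 1   # after that key
        else:
            raise ValueError('unknown location: %r' % location)
        self._keys.insert(index, key)
        self._values[key] = value

    def keys(self):
        return list(self._keys)


od = OrderedDictSketch()
od.add('one', 1, '_begin')
od.add('three', 3, '>one')
od.add('four', 4, '_end')
od.add('two', 2, '>one')
od.add('twohalf', 2.5, '<three')
order = od.keys()
```

In this sketch, a location that names a key which has not been added yet raises ``ValueError`` from ``list.index``. That is one way the load-order concern can surface; how the real class reacts is not covered here.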
| |
| Once an OrderedDict is created, the items are available via key: |
| |
| MyNode = od['somekey'] |
| |
| Therefore, to delete an existing item: |
| |
| del od['somekey'] |
| |
| To change the value of an existing item (leaving location unchanged): |
| |
| od['somekey'] = MyNewObject() |
| |
| To change the location of an existing item: |
| |
| od.link('somekey', '<otherkey') |
| |
| <h4 id="registerextension">registerExtension</h4> |
| |
| Some extensions may need to have their state reset between multiple runs of the |
| Markdown class. For example, consider the following use of the [[Footnotes]] |
| extension: |
| |
| md = markdown.Markdown(extensions=['footnotes']) |
| html1 = md.convert(text_with_footnote) |
| md.reset() |
| html2 = md.convert(text_without_footnote) |
| |
| Without calling ``reset``, the footnote definitions from the first document will |
| be inserted into the second document as they are still stored within the class |
| instance. Therefore the ``Extension`` class needs to define a ``reset`` method |
| that will reset the state of the extension (e.g. ``self.footnotes = {}``). |
| However, as many extensions do not have a need for ``reset``, ``reset`` is only |
| called on extensions that are registered. |
| |
| To register an extension, call ``md.registerExtension`` from within your |
| ``extendMarkdown`` method: |
| |
| |
| def extendMarkdown(self, md, md_globals): |
|     md.registerExtension(self) |
|     # insert processors and patterns here |
| |
| Then, each time ``reset`` is called on the Markdown instance, the ``reset`` |
| method of each registered extension will be called as well. You should also |
| note that ``reset`` will be called on each registered extension after it is |
| initialized the first time. Keep that in mind when over-riding the extension's |
| ``reset`` method. |
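
The registration and reset cycle can be modelled in a few lines. Both classes below are stand-ins invented for this sketch (``MarkdownSketch`` is not the real ``Markdown`` class); they only show that ``reset`` reaches registered extensions and that it runs once right after initialization.

```python
class MarkdownSketch(object):
    """Stand-in for the Markdown class; models registration and reset only."""

    def __init__(self, extensions):
        self.registeredExtensions = []
        for ext in extensions:
            ext.extendMarkdown(self, {})
        self.reset()                      # reset once after initialization

    def registerExtension(self, extension):
        self.registeredExtensions.append(extension)

    def reset(self):
        for extension in self.registeredExtensions:
            extension.reset()


class FootnoteLikeExtension(object):
    """Hypothetical extension holding per-document state."""

    def extendMarkdown(self, md, md_globals):
        md.registerExtension(self)

    def reset(self):
        self.footnotes = {}               # must not rely on earlier state


ext = FootnoteLikeExtension()
md = MarkdownSketch([ext])
ext.footnotes['1'] = 'Collected while converting the first document.'
md.reset()
```

Because ``reset`` is called before any document is converted, it has to work on a freshly created extension; here it creates ``self.footnotes`` rather than assuming it already exists.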
| |
| <h4 id="configsettings">Config Settings</h4> |
| |
| If an extension uses any parameters that the user may want to change, |
| those parameters should be stored in ``self.config`` of your |
| ``markdown.Extension`` class in the following format: |
| |
| self.config = {parameter_1_name: [value1, description1], |
|                parameter_2_name: [value2, description2]} |
| |
| When stored this way the config parameters can be over-ridden from the |
| command line or at the time Markdown is initialized: |
| |
| markdown.py -x "myextension(SOME_PARAM=2)" inputfile.txt > output.txt |
| |
| Note that parameters should always be assumed to be set to string |
| values, and should be converted at run time. For example: |
| |
| i = int(self.getConfig("SOME_PARAM")) |
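
The ``[value, description]`` layout and the string conversion can be tried in isolation with a small stand-in. ``ConfigurableSketch`` is hypothetical and is not ``markdown.Extension``; its ``getConfig`` and ``setConfig`` methods are minimal re-creations written for this sketch, and the ``configs`` argument is assumed to be a dictionary.

```python
class ConfigurableSketch(object):
    """Stand-in for an extension class; only models ``self.config``."""

    def __init__(self, configs=None):
        # name -> [value, description]
        self.config = {'SOME_PARAM': ['1', 'A hypothetical numeric setting']}
        for key, value in (configs or {}).items():
            self.setConfig(key, value)

    def getConfig(self, key):
        return self.config[key][0]

    def setConfig(self, key, value):
        self.config[key][0] = value


# Values coming from the command line arrive as strings.
ext = ConfigurableSketch(configs={'SOME_PARAM': '2'})
i = int(ext.getConfig('SOME_PARAM'))
```

Over-riding a value replaces only the first list item, so the description stays attached to the parameter.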
| |
| <h4 id="makeextension">makeExtension</h4> |
| |
| Each extension should ideally be placed in its own module starting |
| with the ``mdx_`` prefix (e.g. ``mdx_footnotes.py``). The module must |
| provide a module-level function called ``makeExtension`` that takes |
| an optional parameter consisting of a dictionary of configuration over-rides |
| and returns an instance of the extension. An example from the footnote |
| extension: |
| |
| def makeExtension(configs=None): |
|     return FootnoteExtension(configs=configs) |
| |
| By following the above example, when Markdown is passed the name of your |
| extension as a string (e.g. ``'footnotes'``), it will automatically import |
| the module and call the ``makeExtension`` function, initializing your extension. |
| |
| You may have noted that the extensions packaged with Python-Markdown do not |
| use the ``mdx_`` prefix in their module names. This is because they are all |
| part of the ``markdown.extensions`` package. Markdown will first try to import |
| from ``markdown.extensions.extname`` and upon failure, ``mdx_extname``. If both |
| fail, Markdown will continue without the extension. |
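
The lookup order described above could be approximated as follows. This is a sketch of the idea using ``importlib`` and is not the loader Markdown actually uses; the function name is made up, and returning ``None`` is simply this sketch's way of modelling "continue without the extension".

```python
import importlib


def load_extension_sketch(ext_name, configs=None):
    """Try the packaged location first, then the ``mdx_`` module."""
    candidates = ['markdown.extensions.%s' % ext_name, 'mdx_%s' % ext_name]
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue                      # try the next candidate
        return module.makeExtension(configs)
    return None                           # both imports failed; carry on


missing = load_extension_sketch('no_such_extension_for_this_sketch')
```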
| |
| However, Markdown will also accept an already existing instance of an extension. |
| For example: |
| |
| import markdown |
| import myextension |
| configs = {...} |
| myext = myextension.MyExtension(configs=configs) |
| md = markdown.Markdown(extensions=[myext]) |
| |
| This is useful if you need to implement a large number of extensions with more |
| than one residing in a module. |
| |
| [Preprocessors]: #preprocessors |
| [InlinePatterns]: #inlinepatterns |
| [Treeprocessors]: #treeprocessors |
| [Postprocessors]: #postprocessors |
| [BlockParser]: #blockparser |
| [Working with the ElementTree]: #working_with_et |
| [Integrating your code into Markdown]: #integrating_into_markdown |
| [extendMarkdown]: #extendmarkdown |
| [OrderedDict]: #ordereddict |
| [registerExtension]: #registerextension |
| [Config Settings]: #configsettings |
| [makeExtension]: #makeextension |
| [ElementTree]: http://effbot.org/zone/element-index.htm |