Parse Sphinx-like documentation

I have a Sphinx formatted docstring from which I would like to extract the different parts (param, return, type, rtype, etc) for further processing. How can I achieve this?

Hernan asked Jan 16 '23 05:01


1 Answer

You could use docutils, which is what Sphinx is built on. In another answer I used docutils.core.publish_doctree to parse a reStructuredText document (actually just a string of text) into a document tree, converted that tree to XML, and then extracted the field lists from the XML with the xml.dom.minidom methods. An alternative is xml.etree.ElementTree, which is far easier to use in my opinion.

First, note that whenever docutils encounters a block of reStructuredText like

:param x: Some parameter

the resulting XML representation is (I know, it is quite verbose):

<field_list>
    <field>
        <field_name>
            param x
        </field_name>
        <field_body>
            <paragraph>
                Some parameter
            </paragraph>
        </field_body>
    </field>
</field_list>
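As a small illustration of how this structure maps onto ElementTree paths, here is a minimal sketch that parses the snippet above directly, without docutils. The variable names are only for illustration, and the .strip() calls are there because the snippet is indented for readability.

```python
import xml.etree.ElementTree as etree

# The XML shown above, hard-coded so the example needs only the stdlib.
xml_snippet = """<field_list>
    <field>
        <field_name>
            param x
        </field_name>
        <field_body>
            <paragraph>
                Some parameter
            </paragraph>
        </field_body>
    </field>
</field_list>"""

field_list = etree.fromstring(xml_snippet)

# Each <field> holds one <field_name> and one <field_body>.
field = field_list.find('field')

# findtext() accepts a path relative to the element it is called on.
name = field.findtext('field_name').strip()
body = field.findtext('field_body/paragraph').strip()

print((name, body))
```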

The following code takes all field_list elements in a document and collects, for each field, the text of field/field_name and the serialized field/field_body element as a 2-tuple in a list. You can then post-process this list however you wish.

from docutils.core import publish_doctree
import xml.etree.ElementTree as etree

source = """Some help text

:param x: some parameter
:type x: and it's type

:return: Some text
:rtype: Return type

Some trailing text. I have no idea if the above is valid Sphinx
documentation!
"""

# Parse the reStructuredText and convert the resulting doctree to an
# xml.dom.minidom document.
dom = publish_doctree(source).asdom()

# Convert to etree.ElementTree since this is easier to work with than
# xml.dom.minidom
doctree = etree.fromstring(dom.toxml())

# Get all field lists that are direct children of the document root
# (use './/field_list' instead to include nested ones as well).
field_lists = doctree.findall('field_list')

fields = [f for field_list in field_lists
          for f in field_list.findall('field')]

field_names = [name.text for field in fields
               for name in field.findall('field_name')]

# encoding='unicode' makes tostring() return str rather than bytes (Python 3).
field_text = [etree.tostring(element, encoding='unicode') for field in fields
              for element in field.findall('field_body')]

print(list(zip(field_names, field_text)))

This yields the list:

[('param x', '<field_body><paragraph>some parameter</paragraph></field_body>'),
 ('type x', "<field_body><paragraph>and it's type</paragraph></field_body>"), 
 ('return', '<field_body><paragraph>Some text</paragraph></field_body>'), 
 ('rtype', '<field_body><paragraph>Return type</paragraph></field_body>')]

So the first item in each tuple is the field name (e.g. return, param x) and the second item is the corresponding body, still wrapped in its XML tags. This is obviously not the cleanest output, but the code above is easy to modify, so I leave it to the OP to get the exact output they want.
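One possible way to clean this up, as a rough sketch rather than the only approach: strip the XML tags from each body and split the field name into its kind and optional argument. The clean_field helper is a hypothetical name introduced here, and the input is just the list printed above, hard-coded so the example runs without docutils.

```python
import xml.etree.ElementTree as etree

# Sample input: the (field name, serialized field_body) pairs printed above.
pairs = [
    ('param x', '<field_body><paragraph>some parameter</paragraph></field_body>'),
    ('type x', "<field_body><paragraph>and it's type</paragraph></field_body>"),
    ('return', '<field_body><paragraph>Some text</paragraph></field_body>'),
    ('rtype', '<field_body><paragraph>Return type</paragraph></field_body>'),
]


def clean_field(name, body_xml):
    """Hypothetical helper: return (kind, argument, text) for one field."""
    # 'param x' -> kind 'param', argument 'x'; 'return' has no argument.
    kind, _, argument = name.strip().partition(' ')
    body = etree.fromstring(body_xml)
    # itertext() yields all text under field_body, including text inside
    # inline markup; split/join collapses runs of whitespace.
    text = ' '.join(''.join(body.itertext()).split())
    return kind, argument or None, text


parsed = [clean_field(name, body_xml) for name, body_xml in pairs]
for row in parsed:
    print(row)
```

From here the triples can be grouped however the further processing needs, for example into a dict keyed by parameter name.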

Chris answered Jan 18 '23 23:01