I have a Sphinx-formatted docstring from which I would like to extract the different parts (param, return, type, rtype, etc.) for further processing. How can I achieve this?
You could use docutils, which is what Sphinx is built upon. In this other answer I use docutils.core.publish_doctree to get an XML representation of a reStructuredText document (actually a string of text) and then extract field lists from that XML using the xml.dom.minidom methods. An alternative is to use xml.etree.ElementTree, which is far easier to use in my opinion.
First, however, note that every time docutils encounters a block of reStructuredText like
:param x: Some parameter
the resulting XML representation is (I know, it is quite verbose):
<field_list>
    <field>
        <field_name>
            param x
        </field_name>
        <field_body>
            <paragraph>
                Some parameter
            </paragraph>
        </field_body>
    </field>
</field_list>
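To make that structure concrete, here is a minimal sketch that needs only the standard library. It assumes the XML above has been compacted into a single string (the xml_text variable is my own stand-in, not something docutils hands you under that name) and pulls the field name and the body text out with xml.etree.ElementTree:

```python
import xml.etree.ElementTree as etree

# Hand-written stand-in for the XML shown above, with the whitespace removed.
xml_text = (
    "<field_list><field>"
    "<field_name>param x</field_name>"
    "<field_body><paragraph>Some parameter</paragraph></field_body>"
    "</field></field_list>"
)

field_list = etree.fromstring(xml_text)
pairs = []
for field in field_list.findall("field"):
    name = field.findtext("field_name")
    # itertext() walks every nested text node, so the <paragraph> wrapper
    # (and any inline markup inside it) is flattened to plain text.
    body = "".join(field.find("field_body").itertext())
    pairs.append((name, body))

print(pairs)
```

The same findall/findtext calls work unchanged on a full document tree, which is what the rest of this answer relies on.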
The following code takes all field_list elements in a document and puts the text from field/field_name and the serialized field/field_body element as a 2-tuple in a list. You can then manipulate this however you wish for post-processing.
from docutils.core import publish_doctree
import xml.etree.ElementTree as etree

# The blank lines matter: without them the field list is parsed as part of
# the surrounding paragraph.
source = """Some help text

:param x: some parameter
:type x: and it's type
:return: Some text
:rtype: Return type

Some trailing text. I have no idea if the above is valid Sphinx
documentation!
"""

doctree = publish_doctree(source).asdom()

# Convert to etree.ElementTree since this is easier to work with than
# xml.dom.minidom
doctree = etree.fromstring(doctree.toxml())

# Get all field lists in the document.
field_lists = doctree.findall('field_list')

fields = [f for field_list in field_lists
          for f in field_list.findall('field')]

field_names = [name.text for field in fields
               for name in field.findall('field_name')]

# encoding='unicode' makes tostring() return str rather than bytes.
field_text = [etree.tostring(element, encoding='unicode') for field in fields
              for element in field.findall('field_body')]

print(list(zip(field_names, field_text)))
This yields the list:
[('param x', '<field_body><paragraph>some parameter</paragraph></field_body>'),
('type x', "<field_body><paragraph>and it's type</paragraph></field_body>"),
('return', '<field_body><paragraph>Some text</paragraph></field_body>'),
('rtype', '<field_body><paragraph>Return type</paragraph></field_body>')]
So the first item in each tuple is the field name (e.g. :return:, :param x:, etc., without the colons) and the second item is the corresponding field body. Obviously this text is not the cleanest output, but the above code is pretty easy to modify, so I leave it up to the OP to get the exact output they want.
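As one possible way to clean that up, here is a small sketch that works on the list of tuples shown above using only the standard library. The tidy_fields helper is a name I made up for illustration. It assumes each field name is either a bare keyword (return) or a keyword followed by a single argument (param x), so a typed form such as :param int x: would need extra handling:

```python
import xml.etree.ElementTree as etree


def tidy_fields(pairs):
    """Turn (field_name, field_body_xml) pairs into (kind, argument, text) triples."""
    tidy = []
    for name, body_xml in pairs:
        # "param x" -> kind "param", argument "x"; "return" -> argument None.
        kind, _, argument = name.partition(" ")
        # Drop the XML tags and keep only the text inside the field body.
        text = "".join(etree.fromstring(body_xml).itertext())
        tidy.append((kind, argument or None, text))
    return tidy


# The list produced by the code above, copied in so this runs on its own.
pairs = [
    ("param x", "<field_body><paragraph>some parameter</paragraph></field_body>"),
    ("type x", "<field_body><paragraph>and it's type</paragraph></field_body>"),
    ("return", "<field_body><paragraph>Some text</paragraph></field_body>"),
    ("rtype", "<field_body><paragraph>Return type</paragraph></field_body>"),
]

for row in tidy_fields(pairs):
    print(row)
```

Each row is then a plain (kind, argument, text) triple, which is easy to group by argument if you want the param and type entries for x side by side.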