
Parsing elements from a Markdown file in Python 3

How might I get a list of elements from a Markdown file in Python 3? I'm specifically interested in getting a list of all images and links (along with relevant information like alt text and link text) out of a Markdown file.

There is some prior art in this area, but it is almost exactly two years old at this point, and I expect that the landscape has changed a bit.

Bonus points if the parser you come up with supports MultiMarkdown.

Andrew Spott asked Dec 03 '16 07:12



1 Answer

With two Python packages, pypandoc and panflute, you can do this quite pythonically in a few lines.

Given a text file example.md, and assuming you have Python 3.3+, a working pandoc installation (pypandoc is a wrapper around it), and have already run pip install pypandoc panflute, place the sample code below in the same folder and run it from the shell or from e.g. IDLE:

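import io
import pypandoc
import panflute

def action(elem, doc):
    # Called once for every element in the AST; collect images and links.
    if isinstance(elem, panflute.Image):
        doc.images.append(elem)
    elif isinstance(elem, panflute.Link):
        doc.links.append(elem)

if __name__ == '__main__':
    # Have pandoc parse the file and return its AST as a JSON string.
    data = pypandoc.convert_file('example.md', 'json')
    doc = panflute.load(io.StringIO(data))
    doc.images = []
    doc.links = []
    # No prepare function is passed: the lists are already initialised above.
    doc = panflute.run_filter(action, doc=doc)

    print("\nList of image URLs:")
    for image in doc.images:
        print(image.url)

    print("\nList of link URLs and link text:")
    for link in doc.links:
        print(link.url, panflute.stringify(link))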

The steps are:

  1. Use pypandoc to obtain a JSON string that contains the AST of the Markdown document.
  2. Load it into panflute to create a Doc object (panflute requires a stream, so we use StringIO).
  3. Use the run_filter function to visit every element and collect the Image and Link objects.
  4. Then you can print the URLs, alt text, etc.
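
To make the steps above more concrete, here is a minimal sketch of what the JSON AST from step 1 looks like and how images and links could be pulled out of it with only the standard library. This is not the panflute approach itself but a hand-rolled walk over the same data. SAMPLE_AST is a hand-written, abbreviated example that I am assuming matches the shape pandoc emits (nodes with "t" and "c" keys); the helper names inline_text, collect and extract are hypothetical.

```python
import json

# Hand-written sample (an assumption, not real pandoc output) in the shape
# pandoc emits for:  ![A cat](cat.png "Cat") and [the docs](https://example.com)
SAMPLE_AST = json.dumps({
    "pandoc-api-version": [1, 22],
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Image", "c": [
            ["", [], []],                                   # attributes
            [{"t": "Str", "c": "A"}, {"t": "Space"},
             {"t": "Str", "c": "cat"}],                     # alt text inlines
            ["cat.png", "Cat"],                             # [url, title]
        ]},
        {"t": "Space"}, {"t": "Str", "c": "and"}, {"t": "Space"},
        {"t": "Link", "c": [
            ["", [], []],
            [{"t": "Str", "c": "the"}, {"t": "Space"},
             {"t": "Str", "c": "docs"}],                    # link text inlines
            ["https://example.com", ""],
        ]},
    ]}],
})


def inline_text(inlines):
    """Flatten Str/Space inlines to plain text (other inline types are skipped)."""
    parts = []
    for node in inlines:
        if node.get("t") == "Str":
            parts.append(node["c"])
        elif node.get("t") in ("Space", "SoftBreak"):
            parts.append(" ")
    return "".join(parts)


def collect(node, images, links):
    """Recursively visit dicts and lists, recording Image and Link nodes."""
    if isinstance(node, dict):
        if node.get("t") in ("Image", "Link"):
            _attr, inlines, (url, title) = node["c"]
            target = images if node["t"] == "Image" else links
            target.append({"text": inline_text(inlines), "url": url, "title": title})
        for value in node.values():
            collect(value, images, links)
    elif isinstance(node, list):
        for item in node:
            collect(item, images, links)


def extract(ast_json):
    """Return (images, links) found in a pandoc-style JSON AST string."""
    images, links = [], []
    collect(json.loads(ast_json)["blocks"], images, links)
    return images, links


images, links = extract(SAMPLE_AST)
print(images)
print(links)
```

In practice you would pass the string returned by pypandoc.convert_file('example.md', 'json') to extract instead of SAMPLE_AST. panflute does the same traversal for you and handles nested formatting in alt text and link text more completely, so prefer it unless you want to avoid the extra dependency.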
Sergio Correia answered Sep 20 '22 11:09