
Python: How to convert Markdown-formatted text to plain text

I need to convert Markdown text to plain text so that I can display a summary on my website. I'd like the code to be in Python.

Krish asked Apr 17 '09 19:04




2 Answers

The Markdown and BeautifulSoup (now called beautifulsoup4) modules will help do what you describe.

Once you have converted the Markdown to HTML, you can use an HTML parser to strip out the plain text.

Your code might look something like this:

    from bs4 import BeautifulSoup
    from markdown import markdown

    # Render the Markdown source to HTML, then keep only the text nodes.
    html = markdown(some_markdown_string)
    text = ''.join(BeautifulSoup(html, "html.parser").find_all(string=True))
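If pulling in BeautifulSoup just for the stripping step feels heavy, the "use an HTML parser to strip out the plain text" half can probably be done with the standard library's html.parser instead. This is not part of the answer above, only a minimal sketch: TextExtractor and html_to_text are made-up names, and the hard-coded HTML string stands in for whatever markdown() returns.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.parts.append(data)


def html_to_text(html):
    """Return the concatenated text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Stand-in for the HTML that markdown() would produce.
sample_html = "<p>Some <em>emphasised</em> claim &amp; a <a href='#'>link</a></p>"
print(html_to_text(sample_html))  # Some emphasised claim & a link
```

Like the BeautifulSoup version, this keeps the text exactly as it appears between tags, so you may still want to collapse whitespace before showing it as a summary.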
Jason Coon answered Oct 26 '22 20:10


Although this is a very old question, I'd like to suggest a solution I came up with recently. It neither uses BeautifulSoup nor incurs the overhead of converting to HTML and back.

The markdown module's core class, Markdown, has a property called output_formats. It is not configurable, but it can be patched like almost anything else in Python. The property is a dict that maps each output format name to a rendering function. By default it has two output formats, 'html' and 'xhtml'. With a little help it can also have a plain-text rendering function, which is easy to write:

    from markdown import Markdown
    from io import StringIO


    def unmark_element(element, stream=None):
        # Walk the element tree depth-first, writing only text and tail strings.
        if stream is None:
            stream = StringIO()
        if element.text:
            stream.write(element.text)
        for sub in element:
            unmark_element(sub, stream)
        if element.tail:
            stream.write(element.tail)
        return stream.getvalue()


    # patching Markdown
    Markdown.output_formats["plain"] = unmark_element
    __md = Markdown(output_format="plain")
    # The plain output has no wrapping tags for convert() to strip.
    __md.stripTopLevelTags = False


    def unmark(text):
        return __md.convert(text)

The unmark function takes Markdown text as input and returns it with all the Markdown formatting characters stripped out.
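To see why the recursive text/tail walk in unmark_element is enough, it can be tried on a tree built by hand with xml.etree.ElementTree, which as far as I know is the element type Python-Markdown hands to its rendering functions. This is only a sketch under that assumption; the tiny tree below is invented for illustration and nothing here needs the markdown package installed.

```python
import xml.etree.ElementTree as ET
from io import StringIO


def unmark_element(element, stream=None):
    # Same traversal as in the answer: text first, then children, then tail.
    if stream is None:
        stream = StringIO()
    if element.text:
        stream.write(element.text)
    for sub in element:
        unmark_element(sub, stream)
    if element.tail:
        stream.write(element.tail)
    return stream.getvalue()


# Hand-built equivalent of <p>Hello <em>big</em> world</p>
p = ET.Element("p")
p.text = "Hello "          # text before the first child
em = ET.SubElement(p, "em")
em.text = "big"            # text inside <em>
em.tail = " world"         # text after </em>, still inside <p>

print(unmark_element(p))  # Hello big world
```

The tail attribute is the part that is easy to forget: text that follows a closing tag belongs to the child element, not the parent, so skipping it would drop " world" here.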

Pavel Vorobyov answered Oct 26 '22 19:10