
HTML to Markdown with html2text

I can successfully convert HTML into Markdown in Python using the html2text library. My code looks like this:

import html2text


def mark_down_formatting(html_text, url):
    h = html2text.HTML2Text()

    # Disable line wrapping and resolve relative URLs against `url`
    h.body_width = 0
    h.protect_links = True
    h.wrap_links = False
    h.baseurl = url

    md_text = h.handle(html_text)

    return md_text

This worked well for a while, but it has limits: I can't find any way to customize the output in the documentation.

Actually, I don't need much customization. I only need the HTML tag <span class="searched_found">example text</span> to be converted into whatever Markdown I specify, for example +example text+

So I'm looking for a solution to this. Since html2text is a good library that lets me configure options such as the hyperlink ones shown above, a solution based on it would be ideal.

UPDATE:

I have a solution using the BeautifulSoup library, but I consider it a temporary patch since it adds another dependency and a lot of unnecessary processing. It edits the HTML before converting it to Markdown:

from bs4 import BeautifulSoup


def processing_to_markdown(html_text, url, delimiter):
    # Not using the "lxml" parser since I see a lot of different HTML
    # and "lxml" tends to drop content when parsing very big HTML
    # that has some errors inside
    soup = BeautifulSoup(html_text, "html.parser")

    # Finds all <span class="searched_found">...</span> tags
    for tag in soup.find_all('span', class_="searched_found"):
        # tag.string is None when the span contains nested tags, so add the
        # delimiters as extra children instead of concatenating strings
        tag.insert(0, delimiter)
        tag.append(delimiter)
        tag.unwrap()  # Removes the tags to only keep the contents

    html_text = str(soup)  # use unicode(soup) on Python 2

    return mark_down_formatting(html_text, url)

With very long HTML content this is quite slow, because the HTML is parsed twice: once by BeautifulSoup and then by html2text.
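For comparison, here is a minimal sketch of the same pre-processing step using only the standard library's html.parser, which drops the BeautifulSoup dependency and avoids building a full tree. It is still a second pass over the HTML, so it only reduces the extra cost rather than removing it. The class and function names (SpanDelimiterRewriter, strip_marker_spans) are made up for illustration, and the sketch assumes the delimiter contains no HTML special characters.

```python
from html.parser import HTMLParser


class SpanDelimiterRewriter(HTMLParser):
    """Re-emit the HTML as-is, except that <span class="searched_found">
    wrappers are replaced by a delimiter string (hypothetical helper)."""

    def __init__(self, delimiter, target_class="searched_found"):
        # convert_charrefs=False keeps entities such as &amp; untouched
        super().__init__(convert_charrefs=False)
        self.delimiter = delimiter
        self.target_class = target_class
        self.parts = []
        self.span_stack = []  # one bool per open <span>: True if it is a target

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            is_target = self.target_class in classes
            self.span_stack.append(is_target)
            if is_target:
                self.parts.append(self.delimiter)
                return
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through verbatim
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # End tags carry no attributes, so the stack tells us what was opened
        if tag == "span" and self.span_stack and self.span_stack.pop():
            self.parts.append(self.delimiter)
            return
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)


def strip_marker_spans(html_text, delimiter):
    """Return html_text with the marker spans replaced by the delimiter."""
    rewriter = SpanDelimiterRewriter(delimiter)
    rewriter.feed(html_text)
    rewriter.close()
    return "".join(rewriter.parts)
```

The result of strip_marker_spans(html_text, "+") could then be passed to mark_down_formatting in place of the BeautifulSoup output. Whether this is actually faster on real pages would need measuring.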

Qrom asked Jul 11 '17 12:07



1 Answer

markdownify can help. It parses the HTML with BeautifulSoup:

soup = BeautifulSoup(html, 'html.parser')

The conversion can be customized per tag by subclassing MarkdownConverter and overriding the matching convert_<tag> method:

import markdownify

"""
https://stackoverflow.com/questions/45034227/html-to-markdown-with-html2text
https://beautiful-soup-4.readthedocs.io/en/latest/#multi-valued-attributes
https://beautiful-soup-4.readthedocs.io/en/latest/#contents-and-children
"""

class CustomMarkdownConverter(markdownify.MarkdownConverter):
    def convert_a(self, el, text, convert_as_inline):
        classList = el.get("class")
        if classList and "searched_found" in classList:
            # custom transformation: drop the <a class="searched_found">
            # wrapper and keep only its children; `text` already holds
            # their converted Markdown
            return text
        # default transformation
        return super().convert_a(el, text, convert_as_inline)

# Create shorthand method for conversion
def md4html(html, **options):
    return CustomMarkdownConverter(**options).convert(html)

md = md4html("""<a class="searched_found"><b>hello</b> world</a>""")
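
If staying with html2text is preferred, as the question asks, its tag_callback hook may be enough. The following is a minimal sketch, assuming the installed html2text version exposes tag_callback and calls it as callback(parser, tag, attrs, start) with attrs as a dict. The helper name make_span_callback is made up for illustration, and html2text is imported inside the function only so that the callback can be exercised without the library installed.

```python
def make_span_callback(delimiter, target_class="searched_found"):
    """Build a callback for html2text's tag_callback hook (hypothetical helper)."""
    open_spans = []  # one bool per currently open <span>

    def callback(h, tag, attrs, start):
        if tag != "span":
            return None  # let html2text handle every other tag as usual
        if start:
            classes = (attrs.get("class") or "").split()
            is_target = target_class in classes
            open_spans.append(is_target)
        else:
            # End tags carry no attributes, so rely on the stack instead
            is_target = open_spans.pop() if open_spans else False
        if is_target:
            h.o(delimiter)  # write the delimiter straight to the output
            return True  # skip html2text's default handling of this tag
        return None

    return callback


def mark_down_formatting(html_text, url, delimiter="+"):
    # Third-party dependency, imported lazily (see the note above)
    import html2text

    h = html2text.HTML2Text()
    h.body_width = 0
    h.protect_links = True
    h.wrap_links = False
    h.baseurl = url
    h.tag_callback = make_span_callback(delimiter)
    return h.handle(html_text)
```

This keeps everything in a single html2text pass, so no second parser is involved. A fresh callback is created on every call because it keeps per-document state in open_spans.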

Mila Nautikus answered Oct 20 '22 20:10