HTML to Markdown with html2text

I can successfully convert HTML into Markdown in Python using the html2text library. My code looks like this:

import html2text


def mark_down_formatting(html_text, url):
    h = html2text.HTML2Text()

    # No line wrapping; keep links intact and resolve them against `url`
    h.body_width = 0
    h.protect_links = True
    h.wrap_links = False
    h.baseurl = url

    md_text = h.handle(html_text)

    return md_text

This worked well for a while, but it has limits: I can't find any way to customize the output in the documentation.

I don't actually need much customization. I only need this HTML tag, <span class="searched_found">example text</span>, to be converted into Markdown using a delimiter of my choosing, for example: +example text+

So I'm looking for a way to do this. Since html2text is a good library that already lets me configure options such as the hyperlink handling shown above, a solution based on it would be ideal.

UPDATE:

I have a solution using the BeautifulSoup library, but I consider it a temporary patch: it adds another dependency and a lot of unnecessary processing. It edits the HTML before converting it to Markdown:

from bs4 import BeautifulSoup


def processing_to_markdown(html_text, url, delimiter):
    # Not using the "lxml" parser: I see a lot of different HTML, and "lxml"
    # tends to drop content when parsing very large HTML that contains errors
    soup = BeautifulSoup(html_text, "html.parser")

    # Find all <span class="searched_found">...</span> tags
    for tag in soup.find_all("span", class_="searched_found"):
        # get_text() also covers spans with nested tags, where tag.string is None
        tag.string = delimiter + tag.get_text() + delimiter
        tag.unwrap()  # Remove the tag itself and keep only its text

    # Python 3: str(soup) serializes the tree (unicode(soup) on Python 2)
    html_text = str(soup)

    return mark_down_formatting(html_text, url)

With very long HTML content, this proves to be quite slow as we parse the HTML twice, once with BeautifulSoup and then with html2text.
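If the second full parse is the bottleneck, a cheaper pre-pass may be enough for well-behaved input. The sketch below swaps BeautifulSoup for a standard-library regular expression; it assumes the span is written exactly as `<span class="searched_found">` with no other attributes and no nested `<span>` inside it, and the name `mark_spans` is hypothetical. Regex-based HTML rewriting is fragile, so treat this as a stopgap rather than a general solution.

```python
import re

# Matches only the exact form <span class="searched_found">...</span>
SPAN_RE = re.compile(
    r'<span\s+class="searched_found"\s*>(.*?)</span>',
    re.DOTALL,
)


def mark_spans(html_text, delimiter="+"):
    """Replace each matching span with its inner HTML wrapped in `delimiter`."""
    # A function replacement avoids backslash escapes in `delimiter`
    # being interpreted as group references.
    return SPAN_RE.sub(lambda m: delimiter + m.group(1) + delimiter, html_text)
```

The result can then be passed to `mark_down_formatting` as before. Depending on the delimiter, html2text may escape characters that are significant in Markdown, so the final output is worth checking.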

markdownify can help.

markdownify uses BeautifulSoup for parsing:

soup = BeautifulSoup(html, 'html.parser')

The transformation can be customized by subclassing its converter:

import markdownify

# References:
# https://stackoverflow.com/questions/45034227/html-to-markdown-with-html2text
# https://beautiful-soup-4.readthedocs.io/en/latest/#multi-valued-attributes
# https://beautiful-soup-4.readthedocs.io/en/latest/#contents-and-children


class CustomMarkdownConverter(markdownify.MarkdownConverter):
    def convert_a(self, el, text, convert_as_inline):
        # "class" is a multi-valued attribute, so BeautifulSoup returns a list
        class_list = el.get("class")
        if class_list and "searched_found" in class_list:
            # custom transformation: unwrap <a class="searched_found">
            # `text` already holds the converted child nodes
            return text
        # default transformation
        return super().convert_a(el, text, convert_as_inline)


# Create shorthand method for conversion
def md4html(html, **options):
    return CustomMarkdownConverter(**options).convert(html)


md = md4html("""<a class="searched_found"><b>hello</b> world</a>""")

Tags: python, html, markdown, parsing

Qrom

1 Answer

Mila Nautikus
