I can successfully transform some HTML code into Markdown in Python using the html2text library, and it looks like this:
import html2text

def mark_down_formatting(html_text, url):
    h = html2text.HTML2Text()
    # Options to transform URL into absolute links
    h.body_width = 0
    h.protect_links = True
    h.wrap_links = False
    h.baseurl = url
    md_text = h.handle(html_text)
    return md_text
And it was nice for a time, but it has limits: I can't find any way to customize the output in the documentation.
Actually I don't need a lot of customisation. I only need this HTML tag, <span class="searched_found">example text</span>, to be transformed into whatever Markdown I specify, for example +example text+.
So I'm searching for a solution to my problem. Since html2text is a good library that lets me configure some options, like the ones I showed for the hyperlinks, it would be nice to have a solution based on this library.
I have a solution using the BeautifulSoup library, but I consider it a temporary patch since it adds another dependency and a lot of unnecessary processing. What I did here was edit the HTML before converting it to Markdown:
from bs4 import BeautifulSoup

def processing_to_markdown(html_text, url, delimiter):
    # Not using the "lxml" parser since I get to see a lot of different HTML,
    # and the "lxml" parser tends to drop content when parsing very big HTML
    # that has some errors inside
    soup = BeautifulSoup(html_text, "html.parser")
    # Find all <span class="searched_found">...</span> tags
    for tag in soup.find_all('span', class_="searched_found"):
        # Put the delimiter on both sides of the tag; unlike tag.string,
        # this also works when the span contains nested tags
        tag.insert_before(delimiter)
        tag.insert_after(delimiter)
        tag.unwrap()  # Remove the tag itself and keep only its content
    html_text = str(soup)  # unicode(soup) on Python 2
    return mark_down_formatting(html_text, url)
With very long HTML content, this proves to be quite slow as we parse the HTML twice, once with BeautifulSoup and then with html2text.
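One way to stay within html2text and avoid the second parse might be its `tag_callback` hook, which is called for every start and end tag and can take over a tag by returning True. The sketch below assumes the installed html2text version exposes `tag_callback` and the `o()` output method; `make_searched_found_callback` is a hypothetical helper name, not part of the library.

```python
def make_searched_found_callback(delimiter="+", css_class="searched_found"):
    """Build a callback for html2text's tag_callback hook (hypothetical helper)."""
    # One entry per currently open <span>: True if it carried css_class.
    # End tags have no attributes, so the stack remembers what was opened.
    open_spans = []

    def callback(h2t, tag, attrs, start):
        if tag != "span":
            return None  # let html2text handle every other tag as usual
        if start:
            classes = (dict(attrs or {}).get("class") or "").split()
            matched = css_class in classes
            open_spans.append(matched)
        else:
            matched = open_spans.pop() if open_spans else False
        if matched:
            h2t.o(delimiter)  # write the delimiter into the Markdown output
            return True       # tell html2text this tag is fully handled
        return None

    return callback

# Assumed wiring inside mark_down_formatting (requires html2text):
#     h.tag_callback = make_searched_found_callback("+")
```

With this in place, <span class="searched_found">example text</span> would be expected to come out wrapped in the delimiter, without BeautifulSoup. Because the callback keeps a stack of open spans, it is safer to build a fresh callback for each document.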
html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format). It also has an option to escape all special characters: the output is less readable, but it avoids corner-case formatting issues.
To convert HTML to Markdown, I recommend using the Markdownify package by Matthew Dapena-Tretter. Install it with pip; after that, converting HTML to Markdown is straightforward.
markdownify can help. It uses BeautifulSoup for parsing:
soup = BeautifulSoup(html, 'html.parser')
The transformation can be customized by subclassing its MarkdownConverter:
import markdownify
"""
https://stackoverflow.com/questions/45034227/html-to-markdown-with-html2text
https://beautiful-soup-4.readthedocs.io/en/latest/#multi-valued-attributes
https://beautiful-soup-4.readthedocs.io/en/latest/#contents-and-children
"""
class CustomMarkdownConverter(markdownify.MarkdownConverter):
    def convert_a(self, el, text, convert_as_inline):
        classList = el.get("class")
        if classList and "searched_found" in classList:
            # custom transformation:
            # unwrap <a class="searched_found"> by returning only the
            # already-converted child content passed in as `text`
            return text
        # default transformation
        return super().convert_a(el, text, convert_as_inline)

# Create shorthand method for conversion
def md4html(html, **options):
    return CustomMarkdownConverter(**options).convert(html)

md = md4html("""<a class="searched_found"><b>hello</b> world</a>""")
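The converter above customizes <a> tags, while the question asks about <span class="searched_found">. A minimal sketch of the same idea for spans follows, assuming markdownify dispatches to a `convert_span` method when a subclass defines one. `SearchedFoundSpanMixin`, `span_delimiter` and `span_class` are hypothetical names, and the catch-all arguments are there because the extra argument passed to `convert_*` methods appears to differ between markdownify releases.

```python
class SearchedFoundSpanMixin:
    """Hypothetical mixin: wrap <span class="searched_found"> text in a delimiter."""

    span_delimiter = "+"            # illustrative default, not a markdownify option
    span_class = "searched_found"

    def convert_span(self, el, text, *_args, **_kwargs):
        # `text` is the already-converted content of the span.
        # Extra positional/keyword arguments are accepted and ignored.
        classes = el.get("class") or []
        if isinstance(classes, str):
            classes = classes.split()
        if self.span_class in classes:
            return self.span_delimiter + text + self.span_delimiter
        return text  # any other span: keep its content unchanged

# Assumed wiring (requires markdownify):
#     class SpanConverter(SearchedFoundSpanMixin, markdownify.MarkdownConverter):
#         pass
#     md = SpanConverter().convert(html)
```

Note that this still parses with BeautifulSoup under the hood, so it does not remove that dependency; it only replaces the separate pre-processing step.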