
Best way to convert HTML to plaintext using Python

I'm working on a project that involves converting a large amount of HTML content to plain text. I have a custom-written module that does the job OK, but I'm wondering whether there are standard tools that would help get the job done.

Brian Tol asked Nov 03 '09 15:11


People also ask

How do I convert HTML to text in Python?

In Python 3.4+, html.unescape() decodes HTML entities such as &amp;amp; and &amp;lt; back into the plain characters they stand for. html.escape() does the opposite: it replaces characters such as &, < and > with their entity equivalents. Neither function removes tags, so decoding entities is only one part of converting HTML to text.
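A quick sketch of the standard-library html.escape() and html.unescape() functions may help here. The sample strings are made up for illustration; note that the tags survive decoding, so this alone does not produce plain text.

```python
import html

# Entity-encoded markup, as it might appear inside another HTML document.
raw = "&lt;p&gt;Fish &amp; chips&lt;/p&gt;"

# unescape() turns entities back into characters; the <p> tags are NOT removed.
decoded = html.unescape(raw)

# escape() goes the other way, replacing &, < and > with entities.
encoded = html.escape("a < b & c")

print(decoded)
print(encoded)
```

Stripping the tags themselves needs an HTML parser, not an entity decoder.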

How do I convert HTML to Markdown in Python?

This method is useful if you're bulk converting a bunch of HTML files into Markdown: just iterate over a list of HTML files and save each result to a Markdown file.

    from markdownify import markdownify

    # Read one HTML file and convert it to Markdown with ATX-style (#) headings.
    with open("./hello-world.html", "r") as f:
        source = f.read()
    markdown = markdownify(source, heading_style="ATX")
    print(markdown)  # the source shows output like: ## Hello, World!


2 Answers

html2text seems to be a good option.

Chris Ballance answered Oct 04 '22 10:10


Here's a Python library that does HTML parsing:

  • lxml.html

BeautifulSoup is another option.
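For readers who want to see the parse-then-extract idea without installing anything, here is a minimal sketch. It swaps in the standard library's html.parser for lxml.html or BeautifulSoup, so it is not the API of either library. The names TextExtractor and html_to_text are hypothetical, and the list of block-level tags is an assumption that real documents may need extended.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping script/style bodies."""

    SKIP = {"script", "style"}
    # Assumed set of tags that should start a new line in the output.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Ignore anything inside <script> or <style>.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Title</h1><p>Fish &amp; <b>chips</b></p>"
    "<script>alert(1)</script></body></html>"
)
print(html_to_text(sample))
```

A dedicated library will generally cope better with malformed markup, tables, and links than a sketch like this one.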

tcarobruce answered Oct 04 '22 12:10