How to remove all HTML tags from a downloaded page

Tags:

python

I have downloaded a page using urlopen. How do I remove all HTML tags from it? Is there a regexp that replaces all <*> tags?

Oleg Tarasenko asked Jul 28 '10 09:07


3 Answers

You could use html2text, which is meant to produce a readable plain-text equivalent of an HTML source (programmatically from Python or as a command-line tool). I am extrapolating your needs from your question, though...

Pierre answered Oct 21 '22 08:10


I can also recommend BeautifulSoup, which is an easy-to-use HTML parser. With it you would do something like:

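from bs4 import BeautifulSoup  # BeautifulSoup 4 (the beautifulsoup4 package)

# html holds the downloaded page source
soup = BeautifulSoup(html, "html.parser")
# Join every text node in the document, dropping the tags
all_text = ''.join(soup.find_all(string=True))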

This way you get all the text from an HTML document.
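
If installing a third-party package is not an option, the same "collect every text node" idea can probably be approximated with the standard library's html.parser module. This is only a minimal sketch, assuming Python 3; it is not BeautifulSoup, and the names TextExtractor and strip_tags are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores the tags around them."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("<p>Hello, <b>world</b>!</p>"))  # prints: Hello, world!
```

As with the snippet above, the text inside script and style elements is kept, so those would need separate handling if only the visible text is wanted.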

Uli Held answered Oct 21 '22 06:10


There's a great Python library called bleach. The call below removes all HTML tags and leaves everything else in place (note that it does not remove the content of tags that are not visible).

import bleach  # third-party package
bleach.clean(thestring, tags=[], attributes={}, strip=True)  # styles= omitted: newer bleach releases no longer accept it

Jeremy Robin answered Oct 21 '22 08:10