I'm trying to convert a chunk of HTML text with BeautifulSoup. Here is an example:
<div> <p> Some text <span>more text</span> even more text </p> <ul> <li>list item</li> <li>yet another list item</li> </ul> </div> <p>Some other text</p> <ul> <li>list item</li> <li>yet another list item</li> </ul>
I tried doing something like:
def parse_text(contents_string) Newlines = re.compile(r'[\r\n]\s+') bs = BeautifulSoup.BeautifulSoup(contents_string, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES) txt = bs.getText('\n') return Newlines.sub('\n', txt)
...but that way my span element is always on a new line. This is of course a simple example. Is there a way to get the text in the HTML page as the way it will be rendered in the browser (no css rules required, just the regular way div, span, li, etc. elements are rendered) in Python?
To extract text from HTML file using Python, we can use BeautifulSoup. We call urllib. request. urlopen with the url we want to get the HTML text from.
BeautifulSoup is a scraping library, so it's probably not the best choice for doing HTML rendering. If it's not essential to use BeautifulSoup, you should take a look at html2text
. For example:
import html2text html = open("foobar.html").read() print html2text.html2text(html)
This outputs:
Some text more text even more text * list item * yet another list item Some other text * list item * yet another list item
I was encountering the same problem trying to parse the rendered HTML. Basically it seems that BS is not the ideal package for this. @Del gives the great html2text solution.
On a differet SO question: BeautifulSoup get_text does not strip all tags and JavaScript @Helge mentioned using nltk. Unfortunately nltk appears to be discontinuing this method.
I tried both html2text and nltk.clean_html and was surprised by the timing results so thought they warranted an answer for posterity. Of course, the speeds highly depend on the contents of the data...
Answer from @Helge (nltk).
import nltk %timeit nltk.clean_html(html) was returning 153 us per loop
It worked really well to return a string with rendered html. This nltk module was faster than even html2text, though perhaps html2text is more robust.
Answer above from @del
betterHTML = html.decode(errors='ignore') %timeit html2text.html2text(betterHTML) %3.09 ms per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With