Currently I am having trouble typing this because, according to top, my processor is at 100% and my memory is at 85.7%, all of it taken up by Python.
Why? Because I had it go through a 250-meg file to remove markup. 250 megs, that's it! I've been manipulating these files in Python with plenty of other modules and tools; BeautifulSoup is the first code to give me any problems with something this small. How does it take nearly 4 gigs of RAM to manipulate 250 megs of HTML?
The one-liner that I found (on Stack Overflow) and have been using is this:
''.join(BeautifulSoup(corpus).findAll(text=True))
Additionally, this seems to remove everything BUT markup, which is sort of the opposite of what I want to do. I'm sure that BeautifulSoup can do that, too, but the speed issue remains.
Is there anything that will do something similar (remove markup, leave text reliably) and NOT require a Cray to run?
lxml.html is FAR more efficient.
http://lxml.de/lxmlhtml.html
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
Looks like this will do what you want.
import lxml.html
t = lxml.html.fromstring("...")
t.text_content()
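For a file as large as the 250-meg corpus described above, a minimal sketch along the same lines (the filename is just a placeholder):

import lxml.html

# Parse the file directly rather than reading it into one big string first;
# lxml builds its tree in C, so it tends to stay far leaner than BeautifulSoup.
tree = lxml.html.parse('corpus.html')    # placeholder path for the 250-meg file
text = tree.getroot().text_content()     # all the text, with the markup stripped out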
A couple of other similar questions:
python [lxml] - cleaning out html tags
lxml.etree, element.text doesn't return the entire text from an element
Filter out HTML tags and resolve entities in python
You probably want to clean the HTML to remove all scripts and CSS, and then extract the text using .text_content():
from lxml import html
from lxml.html.clean import clean_html
tree = html.parse('http://www.example.com')
tree = clean_html(tree)
text = tree.getroot().text_content()
(From: Remove all html in python?)
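If you need finer control over what gets stripped, here is a hedged sketch using lxml's Cleaner class with explicit options (the flags and the filename are illustrative, not the only reasonable choices):

from lxml import html
from lxml.html.clean import Cleaner

# Explicitly ask for scripts, stylesheets, and comments to be removed;
# other Cleaner options are left at their defaults.
cleaner = Cleaner(scripts=True, style=True, comments=True)

tree = html.parse('corpus.html')         # placeholder path; a URL also works
tree = cleaner.clean_html(tree)
text = tree.getroot().text_content()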