Speedier/less resource-demolishing way to strip html from large files than BeautifulSoup? Or, a better way to use BeautifulSoup?

Currently I am having trouble typing this because, according to top, my processor is at 100% and my memory is at 85.7%, all being taken up by python.

Why? Because I had it go through a 250-meg file to remove markup. 250 megs, that's it! I've been manipulating these files in python with so many other modules and things; BeautifulSoup is the first code to give me any problems with something so small. How are nearly 4 gigs of RAM used to manipulate 250 megs of html?

The one-liner that I found (on stackoverflow) and have been using was this:

''.join(BeautifulSoup(corpus).findAll(text=True))

Additionally, this seems to remove everything BUT markup, which is sort of the opposite of what I want to do. I'm sure that BeautifulSoup can do that, too, but the speed issue remains.

Is there anything that will do something similar (remove markup, leave text reliably) and NOT require a Cray to run?

asked Jan 24 '11 by WaxProlix

1 Answer

lxml.html is FAR more efficient.

http://lxml.de/lxmlhtml.html


http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Looks like this will do what you want.

import lxml.html

t = lxml.html.fromstring("...")  # "..." stands in for the raw HTML string
t.text_content()                 # returns the text with all markup stripped
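
Since the corpus in the question is a 250 MB file on disk, a minimal sketch of the same idea reading from a file (corpus.html is a hypothetical stand-in for that file) would be:

import lxml.html

tree = lxml.html.parse('corpus.html')  # 'corpus.html' stands in for the question's 250 MB file
print(tree.getroot().text_content())   # parse() returns an ElementTree, hence getroot()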

A couple of other similar questions:

python [lxml] - cleaning out html tags

lxml.etree, element.text doesn't return the entire text from an element

Filter out HTML tags and resolve entities in python

UPDATE:

You probably want to clean the HTML to remove all scripts and CSS, and then extract the text using .text_content().

from lxml import html
from lxml.html.clean import clean_html

tree = html.parse('http://www.example.com')  # parse straight from a URL (or a local file path)
tree = clean_html(tree)                      # strip <script>/<style> elements and other cruft

text = tree.getroot().text_content()         # plain text of the whole document

(From: Remove all html in python?)
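
The module-level clean_html above is just the Cleaner class from the same module with its default settings; if you want explicit control over what gets stripped, a sketch along these lines should work (the option choices are illustrative, and corpus.html is again a hypothetical local file):

from lxml import html
from lxml.html.clean import Cleaner

cleaner = Cleaner(scripts=True, style=True)  # drop <script> and <style> content; many more options exist

tree = cleaner.clean_html(html.parse('corpus.html'))  # clean_html returns a cleaned copy of the tree
text = tree.getroot().text_content()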

answered Nov 10 '22 by Acorn