I need to quickly extract text from HTML files. I am using the following regular expressions instead of a full-fledged parser since I need to be fast rather than accurate (I have more than a terabyte of text). The profiler shows that most of the time in my script is spent in the re.sub procedure. What are good ways of speeding up my process? I can implement some portions in C, but I wonder whether that will help given that the time is spent inside re.sub, which I think would be efficiently implemented.
import re

# Remove scripts, styles, tags, entities, and extraneous spaces
# (raw strings so the backslashes survive; re.S so the script/style
# patterns also match across newlines):
scriptRx = re.compile(r"<script.*?/script>", re.I | re.S)
styleRx = re.compile(r"<style.*?/style>", re.I | re.S)
tagsRx = re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>")
entitiesRx = re.compile(r"&[0-9a-zA-Z]+;")
spacesRx = re.compile(r"\s{2,}")
....
text = scriptRx.sub(" ", text)
text = styleRx.sub(" ", text)
....
Thanks!
First, use an HTML parser built for this, like BeautifulSoup:
http://www.crummy.com/software/BeautifulSoup/
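For example, a minimal sketch using the bs4 package (extract_text is a helper name of my choosing, not part of the library):

from bs4 import BeautifulSoup

def extract_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style subtrees entirely before extracting text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # get_text handles tags and entities; join text nodes with spaces.
    return soup.get_text(" ", strip=True)

A real parser is slower per byte than a regex pass, but it gets nesting, comments, and malformed markup right, so it is worth measuring before ruling it out.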
Then you can identify any remaining slow spots with the profiler:
http://docs.python.org/library/profile.html
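For instance, a quick way to see where the time actually goes (clean and page.html below are stand-ins for your own pipeline and input, not names from your code):

import cProfile
import re

spacesRx = re.compile(r"\s{2,}")

def clean(text):
    # Stand-in for the full chain of substitutions.
    return spacesRx.sub(" ", text)

sample = open("page.html").read()  # hypothetical input file
cProfile.run("clean(sample)", sort="cumtime")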
And for learning about regular expressions, I've found Mastering Regular Expressions very valuable, regardless of the programming language:
http://oreilly.com/catalog/9781565922570
Also:
How can I debug a regular expression in Python?
Given the clarified use case, the above is not what you want for this request. My alternate recommendation would be: Speeding up regular expressions in Python
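One commonly suggested speedup, sketched here under the assumption that every match should become a single space as in your code, is to fold the five patterns into one alternation so the text is scanned once instead of five times:

import re

# One combined pattern: earlier alternatives win, so script/style blocks
# are swallowed whole before the generic tag pattern can see them.
combinedRx = re.compile(
    r"<script.*?/script>"
    r"|<style.*?/style>"
    r"|<[!/]?[a-zA-Z-]+[^<>]*>"
    r"|&[0-9a-zA-Z]+;"
    r"|\s{2,}",
    re.I | re.S,
)

def strip_html(text):
    # Adjacent matches each insert their own space, so collapse runs after.
    return re.sub(r"\s{2,}", " ", combinedRx.sub(" ", text))

Whether the single pass wins on your data is an empirical question; profile both variants on a representative sample before committing.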