Speeding up regular expressions in Python

I need to quickly extract text from HTML files. I am using the following regular expressions instead of a full-fledged parser since I need to be fast rather than accurate (I have more than a terabyte of text). The profiler shows that most of the time in my script is spent in the re.sub procedure. What are good ways of speeding up my process? I can implement some portions in C, but I wonder whether that will help given that the time is spent inside re.sub, which I think would be efficiently implemented.

import re

# Remove scripts, styles, tags, entities, and extraneous spaces.
# re.S lets ".*?" cross newlines, so multi-line <script>/<style> blocks match.
scriptRx    = re.compile(r"<script.*?/script>", re.I | re.S)
styleRx     = re.compile(r"<style.*?/style>", re.I | re.S)
tagsRx      = re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>")
entitiesRx  = re.compile(r"&[0-9a-zA-Z]+;")
spacesRx    = re.compile(r"\s{2,}")
...
text = scriptRx.sub(" ", text)
text = styleRx.sub(" ", text)
...
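One common micro-optimization worth trying (a sketch under assumptions, not part of the original question): combine the separate patterns into a single alternation so each document is scanned once instead of five times. The `strip_html` helper and sample input below are hypothetical.

```python
import re

# Combine the five patterns into one alternation; order matters so that
# <script>/<style> blocks are consumed before the generic tag alternative.
combinedRx = re.compile(
    r"<script.*?/script>|<style.*?/style>"   # script/style blocks
    r"|<[!/]?[a-zA-Z-]+[^<>]*>"              # any other tag
    r"|&[0-9a-zA-Z]+;"                       # entities
    r"|\s{2,}",                              # runs of whitespace
    re.I | re.S,
)

def strip_html(text):
    # Single pass over the input instead of five consecutive re.sub calls.
    return combinedRx.sub(" ", text)

print(strip_html("<p>Hello&nbsp;world</p>\n\n<script>var x;</script>"))
```

Whether this actually wins depends on the input mix, so it is worth measuring against the multi-pass version on a representative sample.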

Thanks!

asked Dec 07 '22 by Abhi
1 Answer

First, use an HTML parser built for this, like BeautifulSoup:

http://www.crummy.com/software/BeautifulSoup/
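To illustrate the parser-based approach, here is a minimal sketch using only the stdlib `html.parser` module (BeautifulSoup wraps parsers like this one with a friendlier API); the `TextExtractor` class and sample input are hypothetical, not from the answer:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping <script>/<style> subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes entities like &amp;
        self.parts = []
        self.skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    # Collapse runs of whitespace, like the spacesRx pass in the question.
    return " ".join("".join(parser.parts).split())

print(extract_text("<p>Hello &amp; <b>world</b></p><script>var x;</script>"))
# Hello & world
```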

Then you can identify any remaining slow spots with the profiler:

http://docs.python.org/library/profile.html
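For example, a minimal profiling sketch with `cProfile`/`pstats` might look like this, where `clean()` is a hypothetical stand-in for the substitution chain in the question:

```python
import cProfile
import io
import pstats
import re

spacesRx = re.compile(r"\s{2,}")

def clean(text):
    # Stand-in for the full regex pipeline being profiled.
    return spacesRx.sub(" ", text)

pr = cProfile.Profile()
pr.enable()
for _ in range(1000):
    clean("a  b   c " * 100)
pr.disable()

# Print the five entries with the highest cumulative time.
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

The report shows where the time actually goes (e.g. inside the compiled pattern's `sub` method), which tells you whether rewriting the patterns or switching to a parser is the better bet.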

And for learning about regular expressions, I've found Mastering Regular Expressions very valuable, regardless of the programming language:

http://oreilly.com/catalog/9781565922570

Also:

How can I debug a regular expression in python?

Given the clarified use case, though, the above is not what you want for this request. My alternate recommendation would be: Speeding up regular expressions in Python

answered Jan 03 '23 by eruciform