I need to quickly extract text from HTML files. I am using the following regular expressions instead of a full-fledged parser since I need to be fast rather than accurate (I have more than a terabyte of text). The profiler shows that most of the time in my script is spent in the re.sub procedure. What are good ways of speeding up my process? I can implement some portions in C, but I wonder whether that will help given that the time is spent inside re.sub, which I think would be efficiently implemented.
import re

# Remove scripts, styles, tags, entities, and extraneous spaces
# (raw strings so the backslashes survive; re.S so the script/style
# patterns also match across newlines):
scriptRx = re.compile(r"<script.*?/script>", re.I | re.S)
styleRx = re.compile(r"<style.*?/style>", re.I | re.S)
tagsRx = re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>")
entitiesRx = re.compile(r"&[0-9a-zA-Z]+;")
spacesRx = re.compile(r"\s{2,}")
....
text = scriptRx.sub(" ", text)
text = styleRx.sub(" ", text)
....
Thanks!
First, use an HTML parser built for this, like BeautifulSoup:
http://www.crummy.com/software/BeautifulSoup/
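For example, a minimal sketch using the bs4 package (extract_text is a helper name of my choosing, not part of the library):

from bs4 import BeautifulSoup

def extract_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style subtrees entirely before extracting text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # get_text handles tags and entities; join text nodes with spaces.
    return soup.get_text(" ", strip=True)

A real parser is slower per byte than a regex pass, but it gets nesting, comments, and malformed markup right, so it is worth measuring before ruling it out.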
Then you can identify any remaining slow spots with the profiler:
http://docs.python.org/library/profile.html
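For instance, a quick way to see where the time actually goes (clean and page.html below are stand-ins for your own pipeline and input, not names from your code):

import cProfile
import re

spacesRx = re.compile(r"\s{2,}")

def clean(text):
    # Stand-in for the full chain of substitutions.
    return spacesRx.sub(" ", text)

sample = open("page.html").read()  # hypothetical input file
cProfile.run("clean(sample)", sort="cumtime")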
And for learning about regular expressions, I've found Mastering Regular Expressions very valuable, regardless of the programming language:
http://oreilly.com/catalog/9781565922570
Also:
How can I debug a regular expression in Python?
Given the clarified use case, the above is not what you want for this request. My alternate recommendation would be: Speeding up regular expressions in Python
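One commonly suggested speedup, sketched here under the assumption that every match should become a single space as in your code, is to fold the five patterns into one alternation so the text is scanned once instead of five times:

import re

# One combined pattern: earlier alternatives win, so script/style blocks
# are swallowed whole before the generic tag pattern can see them.
combinedRx = re.compile(
    r"<script.*?/script>"
    r"|<style.*?/style>"
    r"|<[!/]?[a-zA-Z-]+[^<>]*>"
    r"|&[0-9a-zA-Z]+;"
    r"|\s{2,}",
    re.I | re.S,
)

def strip_html(text):
    # Adjacent matches each insert their own space, so collapse runs after.
    return re.sub(r"\s{2,}", " ", combinedRx.sub(" ", text))

Whether the single pass wins on your data is an empirical question; profile both variants on a representative sample before committing.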