Here are my Perl and Python scripts to do some simple text processing on about 21 log files, each about 300 KB to 1 MB (maximum), with the set of logs repeated 5 times (125 files in total).
Python Code (modified to use compiled regexes and `re.I`)
```python
#!/usr/bin/python
import re
import fileinput

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for line in fileinput.input():
    fn = fileinput.filename()
    currline = line.rstrip()
    mprev = exists_re.search(currline)
    if mprev:
        xlogtime = mprev.group(1)
    mcurr = location_re.search(currline)
    if mcurr:
        print fn, xlogtime, mcurr.group(1)
```
Perl Code
```perl
#!/usr/bin/perl
while (<>) {
    chomp;
    if (m/^(.*?) INFO.*Such a record already exists/i) {
        $xlogtime = $1;
    }
    if (m/^AwbLocation (.*?) insert into/i) {
        print "$ARGV $xlogtime $1\n";
    }
}
```
And, on my PC, both scripts generate exactly the same result file of 10,790 lines. Here are the timings on Cygwin's Perl and Python implementations.
```
User@UserHP /cygdrive/d/tmp/Clipboard
# time /tmp/scripts/python/afs/process_file.py *log* *log* *log* *log* *log* > summarypy.log

real    0m8.185s
user    0m8.018s
sys     0m0.092s

User@UserHP /cygdrive/d/tmp/Clipboard
# time /tmp/scripts/python/afs/process_file.pl *log* *log* *log* *log* *log* > summarypl.log

real    0m1.481s
user    0m1.294s
sys     0m0.124s
```
Originally, it took 10.2 seconds using Python and only 1.9 seconds using Perl for this simple text processing. (UPDATE) After switching Python to compiled regexes, it now takes 8.2 seconds in Python and 1.5 seconds in Perl. Perl is still much faster.
Is there a way to improve the speed of Python at all, or is it simply a given that Perl will be the speedy one for simple text processing?
By the way, this was not the only test I did for simple text processing... and no matter how I wrote the source code, Perl always won by a large margin. Not once did Python perform better for simple `m/regex/` match-and-print tasks.
Please do not suggest using C, C++, Assembly, other flavours of Python, etc.
I am looking for a solution using standard Python with its built-in modules, compared against standard Perl (not even using modules). Boy, I wish I could use Python for all my tasks due to its readability, but giving up that much speed? I don't think so.
So, please suggest how the code can be improved to get results comparable with Perl.
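For what it's worth, here is one sketch of the same logic written without `fileinput` (which can add noticeable per-line overhead), using a plain `open()` per file instead. This is shown in Python 3 syntax and the helper name `process_lines` is my own, not from the scripts above:

```python
import re
import sys

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

def process_lines(fn, lines, out=sys.stdout):
    # Same logic as the loop above: remember the most recent
    # "already exists" timestamp, and print it alongside each
    # matching AwbLocation insert line.
    xlogtime = None
    for line in lines:
        line = line.rstrip()
        m = exists_re.search(line)
        if m:
            xlogtime = m.group(1)
        m = location_re.search(line)
        if m:
            out.write("%s %s %s\n" % (fn, xlogtime, m.group(1)))
```

A driver would simply loop over the log paths and call `process_lines(path, open(path))` for each. Whether this closes the gap with Perl on any given machine would have to be measured.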
UPDATE: 2012-10-18
As other users suggested, Perl has its place and Python has its.
So, for this question, one can safely conclude that for simple regex matching on each line of hundreds or thousands of text files, writing the results to a file (or printing to screen), Perl will always, always WIN in performance. It's as simple as that.
Please note that when I say Perl wins in performance, only standard Perl and Python are compared, without resorting to obscure modules (obscure for a normal user like me) and without calling C, C++, or assembly libraries from Python or Perl. We don't have time to learn all those extra steps and installations for a simple text-matching job.
So, Perl rocks for text processing and regex.
Python has its place to rock in other places.
Update 2013-05-29: An excellent article that does a similar comparison is here. Perl again wins for simple text matching... for more details, read the article.
Comparing the speed: Perl is about 8 times faster than Python.
Perl is known for its powerful regex and string operations, as it was influenced by powerful UNIX tools like sed and awk. For regex and string operations such as matching, substitution, and replacement, Perl often outperforms Python, which can take a few more lines of code to achieve the same result.
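As a minimal illustration of the kind of one-liner substitution being compared here (the Python form is shown; Perl's equivalent would be `$s =~ s/error/warn/g;`, and the sample string is made up):

```python
import re

s = "error: disk full; error: retrying"
# Replace every occurrence of "error" with "warn".
print(re.sub(r'error', 'warn', s))  # -> warn: disk full; warn: retrying
```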
Answer: Python and Perl are high-level programming languages that were invented to serve different purposes. Perl was developed by Larry Wall as a Unix-based scripting language for generating reports easily, whereas Python was developed to offer code readability for writing both small and large programs. Perl was created by Larry Wall in 1987, Python by Guido van Rossum in 1989.
This is exactly the sort of stuff that Perl was designed to do, so it doesn't surprise me that it's faster.
One easy optimization in your Python code would be to precompile those regexes, so they aren't getting recompiled each time.
```python
exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists')
location_re = re.compile(r'^AwbLocation (.*?) insert into')
```
And then in your loop:
mprev = exists_re.search(currline)
and
mcurr = location_re.search(currline)
That by itself won't magically bring your Python script in line with your Perl script, but repeatedly calling re in a loop without compiling first is bad practice in Python.
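One caveat worth noting (my own observation, not part of the answer above): CPython's `re` module internally caches compiled patterns, so module-level `re.search(pattern, line)` mostly pays a cache lookup per call rather than a full recompile. Precompiling still helps, but the gain is usually modest. A rough sketch for measuring it yourself, with timings that will of course vary by machine:

```python
import re
import timeit

pat = r'^(.*?) INFO.*Such a record already exists'
line = '2012-10-01 INFO Such a record already exists'
compiled = re.compile(pat)

# Both forms find the same match; precompiling only skips
# the per-call pattern-cache lookup inside the re module.
assert re.search(pat, line).group(1) == compiled.search(line).group(1)

t_module = timeit.timeit(lambda: re.search(pat, line), number=100000)
t_compiled = timeit.timeit(lambda: compiled.search(line), number=100000)
print('module-level: %.3fs  precompiled: %.3fs' % (t_module, t_compiled))
```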
Hypothesis: Perl spends less time backtracking in lines that don't match due to optimisations it has that Python doesn't.
What do you get by replacing
^(.*?) INFO.*Such a record already exists
with
^((?:(?! INFO).)*?) INFO.*Such a record already exists
or
^(?>(.*?) INFO).*Such a record already exists
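A side note on trying these in Python: the negative-lookahead variant is valid in Python's `re` as-is, while the atomic-group `(?>...)` form requires Python 3.11+ (or the third-party `regex` module). A quick sketch, on an assumed sample line (the real log format may differ), checking that the lookahead variant captures the same prefix as the original:

```python
import re

# Assumed sample line for illustration only.
line = '2012-10-01 12:00:00 INFO foo Such a record already exists'

orig = re.compile(r'^(.*?) INFO.*Such a record already exists')
alt = re.compile(r'^((?:(?! INFO).)*?) INFO.*Such a record already exists')

m1, m2 = orig.search(line), alt.search(line)
# Both should capture everything up to the first " INFO".
assert m1 and m2 and m1.group(1) == m2.group(1)
print(m1.group(1))  # -> 2012-10-01 12:00:00
```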