Comparison of Python and Perl solutions to Wide Finder challenge

Question

I'd be very grateful if you could compare the winning O’Rourke's Perl solution to Lundh's Python solution, as I don't know Perl good enough to understand what's going on there. More specifically I'd like to know what gave Perl version 3x advantage: algorithmic superiority, quality of C extensions, other factors?

Wide Finder: Results

Leon Timmermans · Accepted Answer

The better regex implementation of perl is one part of the story. That can't explain however why the perl implementation scales better. The difference become bigger with more processors. For some reason the python implementation has an issue there.

Armin Ronacher · Answer

Perl is heavily optimized for text processing. There are so many factors that it's hard to say what's the exact difference. Text is represented completely differently internally (utf-8 versus utf-16/utf-32) and the regular expression engines are completely different too. Python's regular expression engine is a custom one and not as much used as the perl one. There are very few developers working on it (I think it's largely unmaintained) in contrast to the Perl one which is basically the "core of the language".

After all Perl is the text processing language.

15 revs · Answer

Many-core Engine (MCE) has been released for Perl. MCE does quite well at this, even when reading directly from disk with 8 workers (cold cache). Compare to wf_mmap. MCE follows a bank queuing model when reading input data. Look under the images folder for slides on it.

The source code is hosted at http://code.google.com/p/many-core-engine-perl/

The perl documentation can be read at https://metacpan.org/module/MCE

An implementation of the Wide Finder with MCE is provided under examples/tbray/

https://metacpan.org/source/MARIOROY/MCE-1.514/examples/tbray/

Enjoy MCE.

Script....:  baseline1  baseline2  wf_mce1  wf_mce2  wf_mce3  wf_mmap
Cold cache:      1.674      1.370    1.252    1.182    1.174    3.056
Warm cache:      1.236      0.923    0.277    0.106    0.098    0.092

Kyle Burton · Answer

The Perl implementation uses the mmap system call. What that call does is establish a pointer which to the process appears to be a normal segment of memory or buffer to the program. It maps the contents of a file to a region of memory. There are performances advantages of doing this vs normal file IO (read) - one is that there are no user-space library calls necessary to get access to the data, another is that there are often less copy operations necessary (eg: moving data between kernel and user space).

Perl's strings and regular expressions are 8-bit byte based (as opposed to utf16 for Java for example), so Perl's native 'character type' is the same encoding of the mmapped file.

When the regular expression engine then operates on the mmap backed variable, it is directly accessing the file data via the mamped memory region - without going through Perl's IO functions, or even libc's IO functions.

The mmap is probably largely responsible for the performance difference vs the Python version using the normal Python IO libraries - which additionally introduce the overhead of looking for line breaks.

The Perl program also supports a -J to parallelize the processing, where the oepen "-|" causes a fork() where the file handle in the parent is to the child's stdout. The child processes serialize their results to stdout and the parent de-serializes them to coordinate and summarize the results.

Comparison of Python and Perl solutions to Wide Finder challenge

Tags:

performance

python

perl

analysis

Constantin

4 Answers

Leon Timmermans

Armin Ronacher

15 revs

Kyle Burton

Recent Activity

Donate For Us

Comparison of Python and Perl solutions to Wide Finder challenge

Tags:

performance

python

perl

analysis

Constantin

4 Answers

Leon Timmermans

Armin Ronacher

15 revs

Kyle Burton

Related questions

Recent Activity

Donate For Us