Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I determine why my Racket code runs so slowly?

Just for fun, I wrote a quick Racket command-line script to parse old Unix fortune files. Fortune files are just giant text files, with a single % on a blank line separating entries.

Just as a quick first hack, I wrote the following Racket code:

(define fortunes
    (with-input-from-file "fortunes.txt"
      (λ ()
        (regexp-split #rx"%" (port->string)))))

I thought it would run nearly instantly. Instead, it takes a very long time to run—on the order of a couple of minutes. In comparison, what I think of as equivalent Python:

with open('fortunes.txt') as f:
    fortunes = f.read().split('%')

executes immediately, with equivalent results to the Racket code.

What am I doing wrong here? Yes, there's some obvious low-hanging fruit, such as I'm sure that things would be better if I didn't slurp the whole file into RAM with port->string, but the behavior is so pathologically bad I feel as if I must be doing something stupid at a much higher level than that.

Is there a more Racket-like way to do this with equivalently better performance? Is Racket I/O really poor for some operations? Is there some way to profile my code slightly deeper than the naïve profiler in DrRacket so I can figure out what about a given line is causing a problem?

EDIT: The fortunes file I'm using is FreeBSD's as found at http://fortunes.cat-v.org/freebsd/, which weighs in at about 2 MB. The best runtime for Racket 5.1.3 x64 on OS X Lion was:

real    1m1.479s
user    0m57.400s
sys     0m0.691s

For Python 2.7.1 x64, it was:

real    0m0.057s
user    0m0.029s
sys     0m0.015s

Eli's right that the time is being spent almost entirely in regexp-split (although a full second appears to be spent in port->string), but it's not clear to me that there's a preferred yet equally simple method.

like image 951
Benjamin Pollack Avatar asked Aug 16 '11 23:08

Benjamin Pollack


2 Answers

Looks like most of the cost is due to running regexp-split on a string. The fastest alternative that I found was splitting a byte-string, then converting the results to a strings:

(map bytes->string/utf-8
     (call-with-input-file "db"
       (λ (i) (regexp-split #rx#"%" (port->bytes i)))))

With a random fortune DB of ~2MB, your code takes about 35s, and this version takes 33ms.

(I'm not sure why it takes so long on a string, yet, but it's definitely way too slow.)

EDIT: We tracked it to an efficiency bug. Rough description: when Racket does a regexp-match on a string, it will convert large parts of the string to a byte string (in UTF-8) for the search. This function is the core one that is implemented in C. regexp-split uses it repeatedly to find all of the matches, and therefore keeps re-doing this conversion. I'm looking at a way to do things better, but until it's fixed, use the above workaround.

like image 133
Eli Barzilay Avatar answered Nov 15 '22 06:11

Eli Barzilay


This is now fixed in the latest Git HEAD version of Racket, see: github.com/plt/racket/commit/8eefaba. Your example now runs in 0.1 seconds for me.

like image 29
Sam Tobin-Hochstadt Avatar answered Nov 15 '22 06:11

Sam Tobin-Hochstadt