Just for fun, I wrote a quick Racket command-line script to parse old Unix fortune files. Fortune files are just giant text files, with a single %
on a blank line separating entries.
Just as a quick first hack, I wrote the following Racket code:
(define fortunes
(with-input-from-file "fortunes.txt"
(λ ()
(regexp-split #rx"%" (port->string)))))
I thought it would run nearly instantly. Instead, it takes a very long time to run—on the order of a couple of minutes. In comparison, what I think of as equivalent Python:
with open('fortunes.txt') as f:
fortunes = f.read().split('%')
executes immediately, with equivalent results to the Racket code.
What am I doing wrong here? Yes, there's some obvious low-hanging fruit, such as I'm sure that things would be better if I didn't slurp the whole file into RAM with port->string
, but the behavior is so pathologically bad I feel as if I must be doing something stupid at a much higher level than that.
Is there a more Racket-like way to do this with equivalently better performance? Is Racket I/O really poor for some operations? Is there some way to profile my code slightly deeper than the naïve profiler in DrRacket so I can figure out what about a given line is causing a problem?
EDIT: The fortunes file I'm using is FreeBSD's as found at http://fortunes.cat-v.org/freebsd/, which weighs in at about 2 MB. The best runtime for Racket 5.1.3 x64 on OS X Lion was:
real 1m1.479s
user 0m57.400s
sys 0m0.691s
For Python 2.7.1 x64, it was:
real 0m0.057s
user 0m0.029s
sys 0m0.015s
Eli's right that the time is being spent almost entirely in regexp-split
(although a full second appears to be spent in port->string
), but it's not clear to me that there's a preferred yet equally simple method.
Looks like most of the cost is due to running regexp-split
on a string. The fastest alternative that I found was splitting a byte-string, then converting the results to a strings:
(map bytes->string/utf-8
(call-with-input-file "db"
(λ (i) (regexp-split #rx#"%" (port->bytes i)))))
With a random fortune DB of ~2MB, your code takes about 35s, and this version takes 33ms.
(I'm not sure why it takes so long on a string, yet, but it's definitely way too slow.)
EDIT: We tracked it to an efficiency bug. Rough description: when Racket does a regexp-match
on a string, it will convert large parts of the string to a byte string (in UTF-8) for the search. This function is the core one that is implemented in C. regexp-split
uses it repeatedly to find all of the matches, and therefore keeps re-doing this conversion. I'm looking at a way to do things better, but until it's fixed, use the above workaround.
This is now fixed in the latest Git HEAD version of Racket, see: github.com/plt/racket/commit/8eefaba. Your example now runs in 0.1 seconds for me.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With