I'm trying to extract all the unique characters from my utf-8 French dictionary file using this Ruby code. The dictionary is 3.7 MB. For some reason it takes my decent computer about half an hour to execute. Any ideas?
c = Set.new
f = open "dict"
s = f.read
f.close
for i in 0..s.length-1
c << s[i]
end
Reading in the entire file in one go before performing any computation on it prevents IO from being interleaved with computation. Furthermore, it increases memory pressure (potentially important if you're running close to the limit of your memory) and drastically reduces cache coherency.
I wrote the following little script which executes in .3 seconds on my /usr/share/dict/words
file -- less than a megabyte, but still large enough to be slightly interesting:
$ cat /tmp/set.rb
#!/usr/bin/ruby
require 'set'
c = Set.new
f = open "/usr/share/dict/words"
f.each_char do |char|
c << char
end
p c
$ time /tmp/set.rb
#<Set: {"A", "\n", "'", "s", "B", "M", "C", "T", "H", "I", "D", "S", "O", "L", "P", "W", "Z", "a", "c", "h", "e", "n", "l", "i", "y", "r", "o", "b", "d", "t", "u", "j", "g", "m", "p", "v", "x", "f", "k", "z", "w", "q", "ó", "ü", "á", "ö", "ñ", "E", "F", "R", "U", "N", "G", "K", "é", "ä", "Q", "è", "V", "J", "X", "ç", "ô", "í", "Y", "â", "û", "ê", "å", "Å"}>
real 0m0.341s
user 0m0.340s
sys 0m0.000s
Your program was still executing one minute later, and I gave up.
The main difference is that mine uses the built-in iterators to read into a buffer a smaller amount of the file (probably 4k-16k) and hand me a specific character each iteration through. This will re-use the same small amounts of memory over and over again and allow the CPU's relatively small cache lines to store the entirety of the data.
Edit
With a small test case I was able to isolate the speed difference mostly to the each_char
vs string sub-scripting. Jörg points out that string subscripting is an O(N) operation -- because UTF-8 strings cannot be simply indexed by multiplication as one might expect, finding the Nth character means starting from the beginning. Thus, your approach is O(N^2) and mine was just O(N), and that goes much much further to explaining the performance difference. I'm finally content that we figured out the core cause.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With