Short version: how do I read from STDIN (or a file) char by char in Ruby while maintaining high performance? (The problem is probably not Ruby-specific.)
Long version: While learning Ruby I'm designing a little utility that has to read piped text data, find and collect the numbers in it, and do some processing.
cat huge_text_file.txt | program.rb
input > 123123sdas234sdsd5a ...
output > 123123, 234, 5, ...
The text input might be huge (gigabytes) and it might not contain newlines or whitespace (any non-digit char is a separator), so I read it char by char, though I had my concerns about the performance. It turns out that doing it this way is incredibly slow.
Simply reading char by char with no processing on a 900 KB input file takes around 7 seconds!
while c = STDIN.read(1)
end
If I input data with newlines and read line by line, the same file is read 100 times faster.
while s = STDIN.gets
end
It seems like reading from a pipe with STDIN.read(1) doesn't involve any buffering, and every time a read happens the hard drive is hit. But shouldn't the data be cached by the OS?
Doesn't STDIN.gets read char by char internally until it encounters '\n'?
Using C, I would probably read the data in chunks, though I would have to deal with numbers being split at the buffer boundary, and that doesn't look like an elegant solution for Ruby. So what is the proper way of doing this?
P.S. Timing a read of the same file in Python:
for line in f:
    line
f.close()
Running time is 0.01 sec.
c = f.read(1)
while c:
    c = f.read(1)
f.close()
Running time is 0.17 sec.
Thanks!
This script reads the IO object word by word, and executes the block every time 1000 words have been found or the end of the file has been reached.
No more than 1000 words will be kept in memory at the same time. Note that using " " as separator means that "words" might contain newlines.
This script uses IO#each to specify a separator (a space in this case, to get an Enumerator of words), lazy to avoid doing any operation on the whole file content, and each_slice to get an array of batch_size words.
batch_size = 1000
STDIN.each(" ").lazy.each_slice(batch_size) do |batch|
  # batch is an Array of up to batch_size words (the last batch may be shorter)
end
Instead of using cat and |, you could also read the file directly:
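batch_size = 1000
# The block form of File.open closes the file once reading is done.
File.open('huge_text_file.txt') do |file|
  file.each(" ").lazy.each_slice(batch_size) do |batch|
    # batch is an Array of up to batch_size words (the last batch may be shorter)
  end
end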
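To see what the batches actually contain, here is a small sketch that uses a StringIO in place of STDIN (the sample text and the batch size of 2 are just assumptions for the demo). Each "word" keeps its trailing separator, and a newline stays inside a word because only " " splits:

```ruby
require 'stringio'

# StringIO stands in for STDIN so the example is self-contained.
io = StringIO.new("12 abc 7\n8 x")

# Same chain as above, forced into an Array only to inspect the result.
batches = io.each(" ").lazy.each_slice(2).to_a
# => [["12 ", "abc "], ["7\n8 ", "x"]]

# Strip the trailing separator if only the word itself is wanted.
words = batches.map { |batch| batch.map(&:strip) }
# => [["12", "abc"], ["7\n8", "x"]]
```

On a real stream you would keep the block form instead of calling to_a, so that only one batch is held in memory at a time.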
With this code, as long as the numbers are separated by spaces, no number will be split and no extra logic is needed. It should be much faster than reading the file char by char, and it will use much less memory than reading the whole file into a String.
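If the input has no whitespace at all, as described in the question, a separator-based read has nothing to split on. One possible alternative is the chunked read mentioned in the question, carrying any trailing digits over to the next chunk. This is only a sketch: each_number is a hypothetical helper name, and the 64 KB default chunk size is an arbitrary assumption, not a tuned value.

```ruby
require 'stringio'

# Hypothetical helper: yields every run of digits in io as an Integer,
# reading chunk_size bytes at a time instead of one char at a time.
def each_number(io, chunk_size = 64 * 1024)
  return enum_for(:each_number, io, chunk_size) unless block_given?

  carry = "" # digits at the end of the previous chunk, possibly incomplete
  while (chunk = io.read(chunk_size))
    data = carry + chunk
    # If the data ends in digits, hold them back: the number may continue
    # in the next chunk.
    tail = data[/\d+\z/] || ""
    body = data[0, data.length - tail.length]
    body.scan(/\d+/) { |digits| yield digits.to_i }
    carry = tail
  end
  # Whatever is still held back at EOF is a complete number.
  yield carry.to_i unless carry.empty?
end

# Tiny chunk size (4 bytes) to force numbers to straddle chunk boundaries.
numbers = each_number(StringIO.new("123123sdas234sdsd5a"), 4).to_a
# => [123123, 234, 5]
```

With a pipe you would call each_number(STDIN) { |n| ... } so the numbers are processed as they arrive rather than collected into an Array.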