
Ruby i/o performance - reading file char by char

Short version: how do I read from STDIN (or a file) char by char in Ruby while maintaining high performance? (The problem is probably not Ruby-specific.)

Long version: While learning Ruby, I'm designing a little utility that has to read piped text data, find and collect the numbers in it, and do some processing.

cat huge_text_file.txt | program.rb

input  > 123123sdas234sdsd5a ...
output > 123123, 234, 5, ...

The text input might be huge (gigabytes) and might not contain any newlines or whitespace (any non-digit character acts as a separator), so I went with char-by-char reading (despite my concerns about performance), and it turns out this is incredibly slow.

Simply reading char by char, with no processing at all, on a 900 KB input file takes around 7 seconds!

while c = STDIN.read(1)
end

If I feed in data with newlines and read line by line, the same file is read about 100x faster.

while s = STDIN.gets
end

It seems like reading from a pipe with STDIN.read(1) doesn't involve any buffering, and every time a read happens the hard drive is hit - but shouldn't it be cached by the OS?

Doesn't STDIN.gets read char by char internally until it encounters '\n'?

Using C, I would probably read the data in chunks, though then I would have to deal with numbers being split across buffer boundaries (see the rough sketch below), and that doesn't feel like an elegant solution for Ruby. So what is the proper way of doing this?
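In Ruby, I imagine the chunked approach would look roughly like this (just a sketch I put together; the chunk size is arbitrary and the carry-over handling is mine):

CHUNK_SIZE = 64 * 1024
carry = ""

while chunk = STDIN.read(CHUNK_SIZE)
  buffer = carry + chunk
  # Keep a trailing run of digits for the next iteration,
  # since it might continue in the next chunk.
  if m = buffer.match(/\d+\z/)
    carry = m[0]
    buffer = m.pre_match
  else
    carry = ""
  end
  buffer.scan(/\d+/) { |number| } # process each complete number here
end

carry.scan(/\d+/) { |number| }    # digits left over at EOF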

P.S. Timing reading the same file in Python:

f = open('huge_text_file.txt')
for line in f:
    line
f.close()

Running time is 0.01 sec.

f = open('huge_text_file.txt')
c = f.read(1)
while c:
    c = f.read(1)
f.close()

Running time is 0.17 sec.
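
For comparison, here is how I would time the Ruby variants on the same file (a rough Benchmark sketch; IO#each_char is a buffered char-by-char alternative I'd want to compare too):

require 'benchmark'

path = 'huge_text_file.txt'

Benchmark.bm(10) do |x|
  x.report('read(1)')   { File.open(path) { |f| nil while f.read(1) } }
  x.report('each_char') { File.open(path) { |f| f.each_char { |c| } } }
  x.report('gets')      { File.open(path) { |f| nil while f.gets } }
end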

Thanks!

asked Dec 10 '16 by epsylon

1 Answer

This script reads the IO object word by word, and executes the block every time 1000 words have been found or the end of the file has been reached.

No more than 1000 words will be kept in memory at the same time. Note that using " " as the separator means that "words" might contain newlines.

This script uses IO#each with a separator (a space in this case) to get an Enumerator of words, lazy to avoid operating on the whole file content at once, and each_slice to get arrays of batch_size words.

batch_size = 1000

STDIN.each(" ").lazy.each_slice(batch_size) do |batch|
  # batch is an Array of batch_size words
end

Instead of using cat and |, you could also read the file directly:

batch_size = 1000

File.open('huge_text_file.txt').each(" ").lazy.each_slice(batch_size) do |batch|
  # batch is an Array of batch_size words
end

With this code, no number will be split across reads, no extra logic is needed to stitch them back together, it should be much faster than reading the file char by char, and it will use much less memory than reading the whole file into a String.
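
For instance, extracting the numbers from each batch could then look something like this (just a sketch; what you do with each number is up to you):

batch_size = 1000

STDIN.each(" ").lazy.each_slice(batch_size) do |batch|
  batch.each do |word|
    word.scan(/\d+/) do |number|
      # process each number here
    end
  end
end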

answered Nov 02 '22 by Eric Duminil