
Lazily reading file paragraph by paragraph

I've got some data stored in a file where each block of interest is stored in a paragraph like so:

hello
there

kind

people
of

stack
overflow

I have tried reading each paragraph with the following code, but it does not work:

paragraphs = File.open("hundreds_of_gigs").lazy.to_enum.grep(/.*\n\n/) do |p| 
  puts p
end

With the regex I am trying to say: "match anything that ends with two newlines"

What am I doing wrong?

Any lazy way of solving this is appreciated. The terser the method, the better.

The Unfun Cat asked Jan 10 '23 07:01


1 Answer

IO#readline("\n\n") will do what you want. File is a subclass of IO and has all its methods, even though they are not listed on the File rubydoc page.

It reads line by line, where the end of a line is the given separator.
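This is also why the grep in the question finds nothing. With the default separator, every element the enumerator yields ends in a single newline, so no element can ever contain "\n\n". The sketch below demonstrates that on a small temporary file; the sample text and the helper name matching_lines are made up for illustration.

```ruby
require "tempfile"

# Hypothetical sample data: two paragraphs separated by a blank line.
GREP_SAMPLE = "hello\nthere\n\nkind\n"

# Lazily greps the file line by line, using the default "\n" separator.
def matching_lines(path, pattern)
  File.foreach(path).lazy.grep(pattern).to_a
end

Tempfile.create("grep_demo") do |tmp|
  tmp.write(GREP_SAMPLE)
  tmp.flush

  # Each element is one physical line, so none of them contains "\n\n".
  p File.foreach(tmp.path).to_a
  # The question's pattern therefore never matches.
  p matching_lines(tmp.path, /.*\n\n/)
end
```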

E.g.:

f = File.open("your_file")
f.readline("\n\n") # => "hello\nthere\n\n"
f.readline("\n\n") # => "kind\n\n"
f.readline("\n\n") # => "people\nof\n\n"
f.readline("\n\n") # => "stack\noverflow\n\n"
f.close

Each call to readline lazily reads the next "line" of the file (here, one paragraph), starting from the top.
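One thing to watch for when calling it in a loop: IO#readline raises EOFError once the file is exhausted, whereas IO#gets takes the same separator argument and returns nil instead. A minimal sketch of a loop built on gets follows; each_paragraph and the sample text are hypothetical names chosen for this example.

```ruby
require "tempfile"

# Hypothetical sample data mirroring the layout in the question.
PARAGRAPH_SAMPLE = "hello\nthere\n\nkind\n\npeople\nof\n\nstack\noverflow\n"

# Yields one paragraph at a time. IO#gets returns nil at end of file,
# so the loop ends cleanly (IO#readline would raise EOFError here).
def each_paragraph(path)
  return enum_for(:each_paragraph, path) unless block_given?

  File.open(path) do |f|
    while (paragraph = f.gets("\n\n"))
      yield paragraph
    end
  end
end

Tempfile.create("paragraphs") do |tmp|
  tmp.write(PARAGRAPH_SAMPLE)
  tmp.flush
  each_paragraph(tmp.path) { |paragraph| p paragraph }
end
```

Only one paragraph is held in memory at a time, which matters for a file of the size mentioned in the question.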

Or you can use IO#each_line("\n\n") to iterate over the file.

E.g.:

File.open("your_file") do |f|
  f.each_line("\n\n") do |line|
    puts line
  end
end

# line takes these values in turn:
# "hello\nthere\n\n"
# "kind\n\n"
# "people\nof\n\n"
# "stack\noverflow\n\n"
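Since the question asks for a lazy and terse approach, the same separator can be combined with Enumerator::Lazy. The following is a sketch, not the only way to do it: File.foreach without a block returns an enumerator, and lazy_paragraphs is a hypothetical helper name. It strips the trailing newlines and drops empty chunks, which goes slightly beyond what the answer above does.

```ruby
require "tempfile"

# Returns a lazy enumerator of paragraphs with surrounding whitespace
# removed. Nothing is read from disk until the enumerator is consumed.
def lazy_paragraphs(path, separator = "\n\n")
  File.foreach(path, separator).lazy.map(&:strip).reject(&:empty?)
end

Tempfile.create("lazy_demo") do |tmp|
  tmp.write("hello\nthere\n\nkind\n\npeople\nof\n\nstack\noverflow\n")
  tmp.flush

  # Only as much of the file as is needed for two paragraphs is read.
  p lazy_paragraphs(tmp.path).first(2) # => ["hello\nthere", "kind"]
end
```

Passing "" as the separator instead selects Ruby's paragraph mode, which also treats a run of several blank lines as a single paragraph break.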
dfherr answered Jan 19 '23 06:01
