
Dealing with large CSV files (20 GB) in Ruby

Tags: parsing, ruby, csv

I am working on a small problem and would like some advice on how to solve it: given a CSV file with an unknown number of columns and rows, output a list of columns with their values and the number of times each value is repeated, without using any library.

If the file is small this isn't a problem, but when it is a few gigabytes I get NoMemoryError: failed to allocate memory. Is there a way to create a hash that is read from disk instead of loading the file into memory? You can do that in Perl with tied hashes.

EDIT: Will IO.foreach load the file into memory? How about File.open(filename).each?

fenec asked Dec 12 '12 21:12



2 Answers

Read the file one line at a time, discarding each line as you go:

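File.open("big.csv") do |csv|
  csv.each_line do |line|
    # chomp drops the trailing newline so it doesn't end up in the last field.
    # Note: a plain split does not handle quoted fields that contain commas.
    values = line.chomp.split(",")
    # process the values
  end
end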

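With this method only one line is held in memory at a time, so reading the file itself will not exhaust memory. Whatever you accumulate while processing (such as a hash of counts) still grows with the number of distinct values.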
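
As a rough sketch of what the processing step could look like for the counting task in the question: the helper below, count_values, is a hypothetical name and not part of the original answer. It accepts anything that responds to each_line and keeps one tally hash per column index. It assumes a plain comma separator, no quoted fields, and no header row.

```ruby
# Hypothetical helper: tally how often each value appears in each column.
# Assumes plain comma-separated fields (no quoting, no embedded commas).
def count_values(io)
  # counts[column_index][value] => number of occurrences
  counts = Hash.new { |hash, column| hash[column] = Hash.new(0) }
  io.each_line do |line|
    # The -1 limit keeps trailing empty fields instead of dropping them.
    line.chomp.split(",", -1).each_with_index do |value, column|
      counts[column][value] += 1
    end
  end
  counts
end
```

It could be called as File.open("big.csv") { |csv| count_values(csv) }. Only the tallies stay in memory, never the lines themselves.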

marcus erronius answered Sep 18 '22 16:09


Do you read the whole file at once? Reading it line by line, e.g. with ruby -pe, ruby -ne or $stdin.each, should reduce memory usage, because lines that have already been processed can be garbage collected.

data = {}
$stdin.each do |line|
  # Process line, store results in the data hash.
end

Save it as script.rb and pipe the huge CSV file into this script's standard input:

ruby script.rb < data.csv

If you would rather not read from standard input, only a small change is needed:

data = {}
# File.foreach streams the file line by line and closes it when done.
File.foreach("data.csv") do |line|
  # Process line, store results in the data hash.
end
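
If the data hash itself grows too large (for instance, a column of mostly unique values), one option is to trade time for memory and count a single column per pass over the file. The following is only a sketch under that assumption. count_column is a hypothetical name, and the code assumes unquoted, comma-separated fields.

```ruby
# Hypothetical helper: count the values of one column in one streaming pass.
# Only that column's tally is held in memory; the file is re-read per column.
def count_column(path, index)
  tally = Hash.new(0)
  # File.foreach yields one line at a time and closes the file when done.
  File.foreach(path) do |line|
    value = line.chomp.split(",", -1)[index]
    tally[value] += 1 unless value.nil? # skip rows that are too short
  end
  tally
end
```

Calling count_column("data.csv", 0), then count_column("data.csv", 1), and so on reads the file once per column, which is slower but keeps at most one column's tally in memory at a time.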

Jan answered Sep 21 '22 16:09