I grabbed the KDD Track 1 dataset from Kaggle and decided to load a ~2.5 GB, 3-column CSV file into memory on my 16 GB high-memory EC2 instance:

import numpy as np

data = np.loadtxt('rec_log_train.txt')

The Python session ate up all of my memory (100%) and was then killed.
I then read the same file using R (via read.table); it used less than 5 GB of RAM, which collapsed to under 2 GB after I called the garbage collector.
My question is: why did this fail under NumPy, and what is the proper way to read a file like this into memory? Yes, I could use generators and sidestep the problem, but that is not the goal.
One way to process a large file is pandas' read_csv(chunksize): the entries are read in chunks of a reasonable size, each chunk is processed, and only then is the next chunk read. The chunksize parameter specifies the size of a chunk as a number of lines; see the sketch below.
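For example, a minimal sketch of chunked reading, assuming the tab-delimited file from the question; the one-million-line chunk size is arbitrary, and in practice you would aggregate or filter each chunk rather than keep them all:

import pandas as pd

chunks = []
for chunk in pd.read_csv('rec_log_train.txt', sep='\t', header=None, chunksize=1000000):
    # Each chunk is an ordinary DataFrame; process or reduce it here.
    chunks.append(chunk)
data = pd.concat(chunks, ignore_index=True)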
NumPy's loadtxt() function loads data from a text file and stores it in an ndarray. It is meant to be a fast reader for simple text files, and every row in the file must have the same number of values.
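For illustration, a loadtxt call with an explicit delimiter and dtype; the tab delimiter matches the code below, and the 32-bit dtype (an assumption about the data) halves the size of the resulting array compared with the default float64:

import numpy as np

# Explicit delimiter and a narrower dtype; the parsing itself is still done line by line in Python.
data = np.loadtxt('rec_log_train.txt', delimiter='\t', dtype=np.float32)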
import re

import numpy as np
import pandas


def load_file(filename, num_cols, delimiter='\t'):
    """Parse a delimited text file into a DataFrame, caching the array as a .npy file."""
    try:
        # Fast path: a previous call already serialized the parsed array.
        data = np.load(filename + '.npy')
    except IOError:
        splitter = re.compile(delimiter)

        def items(infile):
            # Yield one field at a time so no intermediate list of rows is built.
            for line in infile:
                for item in splitter.split(line):
                    yield item

        with open(filename, 'r') as infile:
            # fromiter consumes the generator lazily; count=-1 means "read until exhausted".
            data = np.fromiter(items(infile), np.float64, -1)
        data = data.reshape((-1, num_cols))
        np.save(filename, data)  # cached as filename + '.npy' for subsequent calls
    return pandas.DataFrame(data)
This reads the 2.5 GB file and serializes the output matrix to disk. The input file is read lazily, so no intermediate data structures are built and memory use stays minimal. The initial load takes a long time, but each subsequent load (of the serialized .npy file) is fast. Please let me know if you have tips!
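A hypothetical call, assuming the three-column, tab-delimited file from the question:

df = load_file('rec_log_train.txt', num_cols=3)
print(df.shape)  # expect (number_of_rows, 3)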
Try recfile for now: http://code.google.com/p/recfile/ . There are a couple of efforts I know of to build a fast C/C++ file reader for NumPy; it's on my short to-do list for pandas because slow, memory-hungry text parsing causes problems like these. Warren Weckesser also has a project here: https://github.com/WarrenWeckesser/textreader . I don't know which one is better; try them both?
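For what it's worth, current pandas ships a C parser in read_csv, so a hedged alternative today is to let it build the array directly; the column names and int32 dtype below are assumptions about the file:

import numpy as np
import pandas as pd

# read_csv's C engine parses in compiled code; explicit dtypes keep memory close
# to the size of the final array. Column names and dtype are hypothetical.
df = pd.read_csv('rec_log_train.txt', sep='\t', header=None,
                 names=['user', 'item', 'result'], dtype=np.int32)
data = df.to_numpy()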