
Why is loading this file taking so much memory?

Trying to load a file into Python. It's a very big file (1.5 GB), but I have the memory available and I just need to do this once (hence Python: I only need to sort the file one time, so Python was an easy choice).

My issue is that loading this file results in way too much memory usage. When I've loaded about 10% of the lines into memory, Python is already using 700 MB, which is clearly too much. At around 50% the script hangs, using 3.03 GB of real memory (and slowly rising).

I know this isn't the most memory-efficient way to sort a file, but I just want it to work so I can move on to more important problems :D So, what is it about the following Python code that's causing the massive memory usage?

print 'Loading file into memory'
input_file = open(input_file_name, 'r')
input_file.readline() # Toss out the header
lines = []
totalLines = 31164015.0
currentLine = 0.0
printEvery100000 = 0
for line in input_file:
    currentLine += 1.0
    lined = line.split('\t')
    printEvery100000 += 1
    if printEvery100000 == 100000:
        print str(currentLine / totalLines)
        printEvery100000 = 0
    # timestamp_pos, personID_pos, x_pos, y_pos are column indices defined elsewhere in the script
    lines.append((lined[timestamp_pos].strip(), lined[personID_pos].strip(),
                  lined[x_pos].strip(), lined[y_pos].strip()))
input_file.close()
print 'Done loading file into memory'

EDIT: In case anyone is unsure, the general consensus seems to be that each variable allocated eats up more and more memory. I "fixed" it in this case by calling readlines(), which still loads all the data, but only has one string object's worth of overhead for each line. This loads the entire file using about 1.7 GB. Then, when I call lines.sort(), I pass a key function that splits on tabs and returns the right column value, converted to an int. This is computationally slow, and memory-intensive overall, but it works. Learned a ton about variable allocation overhead today :D
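For reference, a minimal sketch of that workaround (assuming the same input_file_name and timestamp_pos from the code above):

input_file = open(input_file_name, 'r')
input_file.readline()           # Toss out the header
lines = input_file.readlines()  # one plain str per line, far less overhead
input_file.close()

# Split lazily inside the key function instead of storing a tuple per line
lines.sort(key=lambda line: int(line.split('\t')[timestamp_pos]))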

asked Oct 11 '22 by Hamy
1 Answer

Here is a rough estimate of the memory needed, based on the constants derived from your example. At a minimum you have to figure in the Python internal object overhead for each split line, plus the overhead for each string.

The estimate comes to 9.1 GB to store the file in memory, assuming the following constants (which overstate things a bit, since you're only keeping part of each line):

  • 1.5 GB file size
  • 31,164,015 total lines
  • each line split into a list with 4 pieces

Code:

import sys

def sizeof(lst):
    # size of the list object itself plus each string it references
    return sys.getsizeof(lst) + sum(sys.getsizeof(v) for v in lst)

GIG = 1024**3
file_size = 1.5 * GIG
lines = 31164015
num_cols = 4
avg_line_len = int(file_size / float(lines))

# build a representative 4-element list of strings for one split line
val = 'a' * (avg_line_len / num_cols)
lst = [val] * num_cols

line_size = sizeof(lst)
print 'avg line size: %d bytes' % line_size
print 'approx. memory needed: %.1f GB' % ((line_size * lines) / float(GIG))

Returns:

avg line size: 312 bytes
approx. memory needed: 9.1 GB
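
For comparison, a quick sanity check of the per-line cost of one raw string versus a tuple of four split strings illustrates why the readlines() workaround from the question's edit fits in roughly 1.7 GB. The sample row below is made up:

import sys

line = '1317305580\t42\t123.45\t678.90\n'   # hypothetical tab-separated row
fields = tuple(f.strip() for f in line.split('\t'))

tuple_size = sys.getsizeof(fields) + sum(sys.getsizeof(f) for f in fields)
print 'raw line: %d bytes' % sys.getsizeof(line)
print 'split tuple: %d bytes' % tuple_size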
answered Oct 14 '22 by samplebias