Python: How to read huge text file into memory

I'm using Python 2.6 on a Mac Mini with 1GB of RAM. I want to read a huge text file into memory:

$ ls -l links.csv; file links.csv; tail links.csv
-rw-r--r--  1 user  user  469904280 30 Nov 22:42 links.csv
links.csv: ASCII text, with CRLF line terminators
4757187,59883
4757187,99822
4757187,66546
4757187,638452
4757187,4627959
4757187,312826
4757187,6143
4757187,6141
4757187,3081726
4757187,58197

So each line in the file consists of a tuple of two comma-separated integer values. I want to read in the whole file and sort it according to the second column. I know that I could do the sorting without reading the whole file into memory, but I thought that for a 500MB file I should still be able to do it in memory, since I have 1GB available.

However, when I try to read the file in, Python seems to allocate a lot more memory than the file needs on disk, so even with 1GB of RAM I'm not able to read the 500MB file into memory. My Python code for reading the file and printing some information about the memory consumption is:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

infile = open("links.csv", "r")

edges = []
count = 0
# count the total number of lines in the file
for line in infile:
    count = count + 1

total = count
print "Total number of lines: ", total

infile.seek(0)
count = 0
for line in infile:
    edge = tuple(map(int, line.strip().split(",")))
    edges.append(edge)
    count = count + 1
    # for every million lines print memory consumption
    if count % 1000000 == 0:
        print "Position: ", edge
        print "Read ", float(count) / float(total) * 100, "%."
        mem = sys.getsizeof(edges)
        for edge in edges:
            mem = mem + sys.getsizeof(edge)
            for node in edge:
                mem = mem + sys.getsizeof(node)
        print "Memory (Bytes): ", mem

The output I got was:

Total number of lines:  30609720
Position:  (9745, 2994)
Read  3.26693612356 %.
Memory (Bytes):  64348736
Position:  (38857, 103574)
Read  6.53387224712 %.
Memory (Bytes):  128816320
Position:  (83609, 63498)
Read  9.80080837067 %.
Memory (Bytes):  192553000
Position:  (139692, 1078610)
Read  13.0677444942 %.
Memory (Bytes):  257873392
Position:  (205067, 153705)
Read  16.3346806178 %.
Memory (Bytes):  320107588
Position:  (283371, 253064)
Read  19.6016167413 %.
Memory (Bytes):  385448716
Position:  (354601, 377328)
Read  22.8685528649 %.
Memory (Bytes):  448629828
Position:  (441109, 3024112)
Read  26.1354889885 %.
Memory (Bytes):  512208580

Already after reading only 25% of the 500MB file, Python consumes 500MB. So it seems that storing the content of the file as a list of tuples of ints is not very memory efficient. Is there a better way to do it, so that I can read my 500MB file into my 1GB of memory?

asked Dec 13 '09 by asmaier

2 Answers

There is a recipe for sorting files larger than RAM on this page, though you'd have to adapt it for your case involving CSV-format data. There are also links to additional resources there.
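
For illustration, here is a minimal sketch of that external-sort idea adapted to the CSV from the question (this is not the linked recipe itself; the chunk size and the sort_csv_external helper name are arbitrary choices): sort fixed-size chunks in memory, spill each sorted chunk to a temporary file, then merge the chunks with heapq.merge.

import heapq
import itertools
import tempfile

def sort_csv_external(infile_name, outfile_name, chunk_lines=1000000):
    # Phase 1: sort manageable chunks in memory and spill them to temp files.
    tempfiles = []
    infile = open(infile_name, "r")
    while True:
        lines = list(itertools.islice(infile, chunk_lines))
        if not lines:
            break
        rows = [tuple(map(int, line.strip().split(","))) for line in lines]
        rows.sort(key=lambda r: r[1])        # sort this chunk by the second column
        tf = tempfile.TemporaryFile()
        for a, b in rows:
            tf.write("%d,%d\n" % (a, b))
        tf.seek(0)
        tempfiles.append(tf)
    infile.close()

    # Phase 2: merge the sorted chunks; only one line per chunk is in memory at a time.
    def keyed(f):
        for line in f:
            yield (int(line.split(",")[1]), line)   # (sort key, original line)

    outfile = open(outfile_name, "w")
    for key, line in heapq.merge(*[keyed(tf) for tf in tempfiles]):
        outfile.write(line)
    outfile.close()

sort_csv_external("links.csv", "links_sorted.csv")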

Edit: True, the file on disk is not "larger than RAM", but the in-memory representation can easily become much larger than the available RAM. For one thing, your own program doesn't get the entire 1GB (OS overhead etc.). For another, even if you stored this in the most compact form for pure Python (two lists of integers, assuming a 32-bit machine etc.), you'd be using 934MB for those 30 million pairs of integers: each int object takes 12 bytes and each list slot is a 4-byte pointer, i.e. 32 bytes per pair, and 32 bytes times 30.6 million rows is roughly 934MB.

Using numpy you can also do the job, using only about 250MB. It isn't particularly fast to load this way, since you have to count the lines and pre-allocate the array, but it may then be the fastest actual sort given that it's in-memory:

import time
import numpy as np
import csv

start = time.time()
def elapsed():
    return time.time() - start

# count data rows, to preallocate array
f = open('links.csv', 'rb')
def count(f):
    while 1:
        block = f.read(65536)
        if not block:
            break
        yield block.count(',')

linecount = sum(count(f))
print '\n%.3fs: file has %s rows' % (elapsed(), linecount)

# pre-allocate array and load data into array
m = np.zeros(linecount, dtype=[('a', np.uint32), ('b', np.uint32)])
f.seek(0)
f = csv.reader(open('links.csv', 'rb'))
for i, row in enumerate(f):
    m[i] = int(row[0]), int(row[1])

print '%.3fs: loaded' % elapsed()

# sort in-place
m.sort(order='b')

print '%.3fs: sorted' % elapsed()

Output on my machine with a sample file similar to what you showed:

6.139s: file has 33253213 lines
238.130s: read into memory
517.669s: sorted

The default sort in numpy is quicksort. The ndarray.sort() routine (which sorts in-place) can also take the keyword argument kind="mergesort" or kind="heapsort", but it appears neither of these is capable of sorting a record array. Incidentally, a record array was the only way I could see to sort the two columns together, as opposed to the default, which would sort them independently (totally messing up your data).
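
To illustrate that last point, here is a tiny example (not part of the timing run above) showing that sorting a structured/record array with order='b' keeps each pair together, whereas sorting plain columns independently decouples them:

import numpy as np

# Sorting a record array on field 'b' reorders whole rows,
# so every (a, b) pair stays intact.
m = np.array([(1, 30), (2, 10), (3, 20)],
             dtype=[('a', np.uint32), ('b', np.uint32)])
m.sort(order='b')
print m          # rows now ordered by 'b': (2,10), (3,20), (1,30)

# Sorting two plain columns independently loses the pairing.
a = np.array([1, 2, 3])
b = np.array([30, 10, 20])
a.sort()
b.sort()
print a, b       # pairs like (1,10) now exist that were never in the data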

answered Sep 21 '22 by Peter Hansen


All Python objects have a memory overhead on top of the data they actually store. According to getsizeof on my 32-bit Ubuntu system, a tuple has an overhead of 32 bytes and an int takes 12 bytes, so each row in your file takes 56 bytes plus a 4-byte pointer in the list; I presume it will be a lot more on a 64-bit system. This is in line with the figures you gave and means your 30 million rows will take about 1.8GB.
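
If you want to verify those per-object costs on your own interpreter, a quick sketch like the following works (the sample values are taken from the question's data; the exact numbers depend on platform and Python version):

import sys

# Object sizes differ between 32-bit and 64-bit builds and across Python
# versions, so measure on the interpreter you actually run.
row = (4757187, 59883)                  # one parsed row from links.csv
print "int:  ", sys.getsizeof(row[0])   # 12 bytes on 32-bit CPython 2.x
print "tuple:", sys.getsizeof(row)      # tuple header plus two element pointers
print "row:  ", sys.getsizeof(row) + sys.getsizeof(row[0]) + sys.getsizeof(row[1])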

I suggest that instead of using Python you use the Unix sort utility. I am not a Mac-head, but I presume the OS X sort options are the same as in the Linux version, so this should work:

sort -n -t, -k2 links.csv 

-n means sort numerically

-t, means use a comma as the field separator

-k2 means sort on the second field

This will sort the file and write the result to stdout. You could redirect it to another file or pipe it to your Python program for further processing.

Edit: If you do not want to sort the file before you run your Python script, you could use the subprocess module to create a pipe to the shell sort utility, then read the sorted results from the output of the pipe, as in the sketch below.
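
A minimal sketch of that approach, assuming the sort flags shown above and the links.csv name from the question (the loop body is just a placeholder):

import subprocess

# Pipe the shell sort's output straight into this script and consume it
# line by line, so the fully sorted data never has to sit in a Python list.
p = subprocess.Popen(["sort", "-n", "-t,", "-k2", "links.csv"],
                     stdout=subprocess.PIPE)

for line in p.stdout:
    a, b = map(int, line.strip().split(","))
    # ... whatever further processing you need with (a, b) goes here ...

p.wait()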

answered Sep 18 '22 by Dave Kirby