
Loading Large File in Python

Tags:

python

I'm using Python 2.6.2 [GCC 4.3.3] running on Ubuntu 9.04. I need to read a big datafile (~1GB, >3 million lines), line by line, using a Python script.

I tried the methods below, but each of them uses a very large amount of memory (~3GB):

for line in open('datafile', 'r').readlines():
    process(line)

or,

for line in file(datafile):
    process(line)

Is there a better way to load a large file line by line, say

  • a) by explicitly limiting the maximum number of lines the file may hold in memory at any one time? Or
  • b) by loading it in chunks of, say, 1024 bytes, provided the last line of each chunk is read completely rather than truncated? (A rough sketch of this idea appears right after this list.)
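
To make (b) concrete, here is a minimal sketch of what I have in mind; read_lines_in_chunks is just an illustrative name (not an existing API) and process is the same placeholder as above. It reads fixed-size chunks and carries any trailing partial line over to the next chunk, so no line is ever truncated.

def read_lines_in_chunks(path, chunk_size=1024 * 1024):
    # Hypothetical helper: yields complete lines while holding only one
    # chunk (plus at most one partial line) in memory at a time.
    leftover = ''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            lines = chunk.split('\n')
            leftover = lines.pop()   # the last piece may be an incomplete line
            for line in lines:
                yield line
    if leftover:
        yield leftover               # final line without a trailing newline

for line in read_lines_in_chunks('datafile'):
    process(line)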

Several suggestions I found point to the methods I mentioned above and have already tried; I'm trying to see if there is a better way to handle this. My search has not been fruitful so far. I appreciate your help.

P.S. I have done some memory profiling using Heapy and found no memory leaks in the Python code I am using.

Update 20 August 2012, 16:41 (GMT+1)

I tried both approaches suggested by J.F. Sebastian, mgilson and IamChuckB (datafile is a variable):

with open(datafile) as f:
    for line in f:
        process(line)

Also,

import fileinput
for line in fileinput.input([datafile]):
    process(line)

Strangely, both of them use ~3GB of memory; my datafile in this test is 765.2MB and consists of 21,181,079 lines. I see the memory usage grow over time (in steps of around 40-80MB) before stabilizing at 3GB.

An elementary question: is it necessary to flush the line after use?

I did memory profiling using Heapy to understand this better.
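
For reference, the snapshots below come from Heapy (the guppy package). A minimal sketch of how I take them, assuming my understanding of the API is right (in particular, that the .byvia classifier is what produces the "Referred Via" breakdowns):

from guppy import hpy

h = hpy()
# (optionally call h.setrelheap() here to measure only objects allocated afterwards)

# ... run the file-processing loop ...

heap = h.heap()
print heap               # Level 1: partition by kind (class / dict of class)
print heap[0].byvia      # Level 2: largest row broken down by "Referred Via"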

Level 1 Profiling

Partition of a set of 36043 objects. Total size = 5307704 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  15934  44  1301016  25   1301016  25 str
     1     50   0   628400  12   1929416  36 dict of __main__.NodeStatistics
     2   7584  21   620936  12   2550352  48 tuple
     3    781   2   590776  11   3141128  59 dict (no owner)
     4     90   0   278640   5   3419768  64 dict of module
     5   2132   6   255840   5   3675608  69 types.CodeType
     6   2059   6   247080   5   3922688  74 function
     7   1716   5   245408   5   4168096  79 list
     8    244   1   218512   4   4386608  83 type
     9    224   1   213632   4   4600240  87 dict of type
<104 more rows. Type e.g. '_.more' to view.>

============================================================

Level 2 Profiling for Level 1-Index 0

Partition of a set of 15934 objects. Total size = 1301016 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0   2132  13   274232  21    274232  21 '.co_code'
     1   2132  13   189832  15    464064  36 '.co_filename'
     2   2024  13   114120   9    578184  44 '.co_lnotab'
     3    247   2   110672   9    688856  53 "['__doc__']"
     4    347   2    92456   7    781312  60 '.func_doc', '[0]'
     5    448   3    27152   2    808464  62 '[1]'
     6    260   2    15040   1    823504  63 '[2]'
     7    201   1    11696   1    835200  64 '[3]'
     8    188   1    11080   1    846280  65 '[0]'
     9    157   1     8904   1    855184  66 '[4]'
<4717 more rows. Type e.g. '_.more' to view.>

Level 2 Profiling for Level 1-Index 1

Partition of a set of 50 objects. Total size = 628400 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0     50 100   628400 100    628400 100 '.__dict__'

Level 2 Profiling for Level 1-Index 2

Partition of a set of 7584 objects. Total size = 620936 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0   1995  26   188160  30    188160  30 '.co_names'
     1   2096  28   171072  28    359232  58 '.co_varnames'
     2   2078  27   157608  25    516840  83 '.co_consts'
     3    261   3    21616   3    538456  87 '.__mro__'
     4    331   4    21488   3    559944  90 '.__bases__'
     5    296   4    20216   3    580160  93 '.func_defaults'
     6     55   1     3952   1    584112  94 '.co_freevars'
     7     47   1     3456   1    587568  95 '.co_cellvars'
     8     35   0     2560   0    590128  95 '[0]'
     9     27   0     1952   0    592080  95 '.keys()[0]'
<189 more rows. Type e.g. '_.more' to view.>

Level 2 Profiling for Level 1-Index 3

Partition of a set of 781 objects. Total size = 590776 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0      1   0    98584  17     98584  17 "['locale_alias']"
     1     29   4    35768   6    134352  23 '[180]'
     2     28   4    34720   6    169072  29 '[90]'
     3     30   4    34512   6    203584  34 '[270]'
     4     27   3    33672   6    237256  40 '[0]'
     5     25   3    26968   5    264224  45 "['data']"
     6      1   0    24856   4    289080  49 "['windows_locale']"
     7     64   8    20224   3    309304  52 "['inters']"
     8     64   8    17920   3    327224  55 "['galog']"
     9     64   8    17920   3    345144  58 "['salog']"
<84 more rows. Type e.g. '_.more' to view.>

============================================================

Level 3 Profiling for Level 2-Index 0, Level 1-Index 0

Partition of a set of 2132 objects. Total size = 274232 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0   2132 100   274232 100    274232 100 '.co_code'

Level 3 Profiling for Level 2-Index 0, Level 1-Index 1

Partition of a set of 50 objects. Total size = 628400 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0     50 100   628400 100    628400 100 '.__dict__'

Level 3 Profiling for Level 2-Index 0, Level 1-Index 2

Partition of a set of 1995 objects. Total size = 188160 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0   1995 100   188160 100    188160 100 '.co_names'

Level 3 Profiling for Level 2-Index 0, Level 1-Index 3

Partition of a set of 1 object. Total size = 98584 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0      1 100    98584 100     98584 100 "['locale_alias']"

Still troubleshooting this.

Do share with me if you have faced this before.

Thanks for your help.

Update 21 August 2012, 01:55 (GMT+1)

  1. mgilson, the process function is used to post-process a Network Simulator 2 (NS2) tracefile. Some of the lines in the tracefile are shown below; a minimal parsing sketch follows the sample. I am using numerous objects, counters, tuples, and dictionaries in the Python script to learn how a wireless network performs.
s 1.231932886 _25_ AGT  --- 0 exp 10 [0 0 0 0 Y Y] ------- [25:0 0:0 32 0 0] 
s 1.232087886 _25_ MAC  --- 0 ARP 86 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776108 _42_ MAC  --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776625 _34_ MAC  --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776633 _9_ MAC  --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776658 _0_ MAC  --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232856942 _35_ MAC  --- 0 ARP 28 [0 ffffffff 64 806 Y Y] ------- [REQUEST 100/25 0/0]
s 1.232871658 _0_ MAC  --- 0 ARP 86 [13a 67 1 806 Y Y] ------- [REPLY 1/0 103/25]
r 1.233096712 _29_ MAC  --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097047 _4_ MAC  --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097050 _26_ MAC  --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097051 _1_ MAC  --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233109522 _25_ MAC  --- 0 ARP 28 [13a 67 1 806 Y Y] ------- [REPLY 1/0 103/25]
s 1.233119522 _25_ MAC  --- 0 ACK 38 [0 1 67 0 Y Y] 
r 1.233236204 _17_ MAC  --- 0 ARP 28 [0 ffffffff 65 806 Y Y] ------- [REQUEST 101/25 0/0]
r 1.233236463 _20_ MAC  --- 0 ARP 28 [0 ffffffff 65 806 Y Y] ------- [REQUEST 101/25 0/0]
D 1.233236694 _18_ MAC  COL 0 ARP 86 [0 ffffffff 65 806 67 1] ------- [REQUEST 101/25 0/0]
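
For context only, a hypothetical, stripped-down sketch of the kind of per-line parsing process() does (field positions follow the old NS2 wireless trace format shown above; the real function updates many more counters across my classes):

from collections import defaultdict

events_per_node = defaultdict(int)

def process(line):
    # Illustrative only: the real process() maintains ~136 counters in 3 classes.
    fields = line.split()
    if len(fields) < 8:
        return                      # skip malformed or short lines
    event = fields[0]               # s = send, r = receive, D = drop
    time = float(fields[1])         # simulation time in seconds
    node = fields[2].strip('_')     # "_25_" -> "25"
    layer = fields[3]               # AGT, MAC, RTR, ...
    pkt_type = fields[6]            # exp, ARP, ACK, ...
    events_per_node[(node, event, pkt_type)] += 1
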
  2. The aim of doing three-level profiling with Heapy is to help me narrow down which object(s) are eating up most of the memory. Unfortunately, as you can see, I could not tell which one specifically needs tweaking, as the output is too generic. For example, although I know that "dict of __main__.NodeStatistics" accounts for only 50 of the 36043 objects (0.1%), it takes up 12% of the total memory used to run the script, and I am unable to find which specific dictionary I would need to look into.

  3. I tried implementing David Eyk's suggestion as below (snippet), manually garbage-collecting every 500,000 lines:

import gc

for i, line in enumerate(file(datafile)):
    if i % 500000 == 0:
        print '-----------This is line number', i
        collected = gc.collect()
        print "Garbage collector: collected %d objects." % (collected)

Unfortunately, the memory usage is still at 3GB and the output (snippet) is as below,

-----------This is line number 0
Garbage collector: collected 0 objects.
-----------This is line number 500000
Garbage collector: collected 0 objects.
  4. Having implemented martineau's suggestion, I see the memory usage is now 22MB, down from the earlier 3GB! Something I had been looking forward to achieving. The strange thing is the following:

I did the same memory profiling as before,

Level 1 Profiling

Partition of a set of 35474 objects. Total size = 5273376 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  15889  45  1283640  24   1283640  24 str
     1     50   0   628400  12   1912040  36 dict of __main__.NodeStatistics
     2   7559  21   617496  12   2529536  48 tuple
     3    781   2   589240  11   3118776  59 dict (no owner)
     4     90   0   278640   5   3397416  64 dict of module
     5   2132   6   255840   5   3653256  69 types.CodeType
     6   2059   6   247080   5   3900336  74 function
     7   1716   5   245408   5   4145744  79 list
     8    244   1   218512   4   4364256  83 type
     9    224   1   213632   4   4577888  87 dict of type
<104 more rows. Type e.g. '_.more' to view.>

Comparing the previous memory profiling output with the above: str has 45 fewer objects (17376 bytes less), tuple has 25 fewer objects (3440 bytes less), and dict (no owner), though unchanged in object count, has shrunk by 1536 bytes. All other entries are the same, including dict of __main__.NodeStatistics. The total number of objects is 35474. This small reduction in objects (0.2%) produced a 99.3% memory saving (22MB, down from 3GB). Very strange.

As you can see, although I know where the memory starvation is occurring, I am not yet able to narrow down what is causing the bleed.

Will continue to investigate this.

Thanks for all the pointers; I am using this opportunity to learn a lot about Python, as I am not an expert. I appreciate the time you have taken to assist me.

Update 23 August 2012, 00:01 (GMT+1) -- SOLVED

  1. I continued debugging using the minimalistic code per martineau's suggestion. I began to add code to the process function and observe the memory bleed.

  2. I find the memory starts to bleed when I add a class like the one below:

class PacketStatistics(object):
    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0
        ...

I am using 3 classes with 136 counters.

  3. I discussed this issue with my friend Gustavo Carneiro; he suggested using __slots__ in place of the per-instance __dict__.

  4. I converted the class as below:

class PacketStatistics(object):
    __slots__ = ('event_id', 'event_source', 'event_dest',...)
    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0
        ...
  5. When I converted all 3 classes, the memory usage dropped from the earlier 3GB to 504MB. A whopping 80% saving in memory usage!

  6. Below is the memory profiling after the __dict__ to __slots__ conversion (a small illustrative comparison of the two kinds of instances follows it).

Partition of a set of 36157 objects. Total size = 4758960 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  15966  44  1304424  27   1304424  27 str
     1   7592  21   624776  13   1929200  41 tuple
     2    780   2   587424  12   2516624  53 dict (no owner)
     3     90   0   278640   6   2795264  59 dict of module
     4   2132   6   255840   5   3051104  64 types.CodeType
     5   2059   6   247080   5   3298184  69 function
     6   1715   5   245336   5   3543520  74 list
     7    225   1   232344   5   3775864  79 dict of type
     8    244   1   223952   5   3999816  84 type
     9    166   0   190096   4   4189912  88 dict of class
<101 more rows. Type e.g. '_.more' to view.>

The dict of __main__.NodeStatistics is not in the top 10 anymore.
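
For anyone comparing the two layouts, a small illustrative sketch (the class names here are made up, not the ones from my script): an ordinary instance keeps its attributes in a per-instance __dict__, while a __slots__ instance stores them in fixed slots, so the dictionary overhead disappears.

import sys

class WithDict(object):
    def __init__(self):
        self.a = self.b = self.c = 0

class WithSlots(object):
    __slots__ = ('a', 'b', 'c')
    def __init__(self):
        self.a = self.b = self.c = 0

d, s = WithDict(), WithSlots()
print sys.getsizeof(d) + sys.getsizeof(d.__dict__)   # instance plus its attribute dict
print sys.getsizeof(s)                               # slotted instance, no __dict__ at all
print hasattr(s, '__dict__')                         # False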

I am happy with the result and glad to close this issue.

Thanks for all your guidance. Truly appreciate it.

rgds Saravanan K

Asked Aug 20 '12 by Saravanan K

2 Answers

with open('datafile') as f:
    for line in f:
        process(line)

This works because file objects are iterators, yielding one line at a time until there are no more lines to yield.
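
If you additionally want to bound how many lines are held at once (option (a) in the question), one possible variation on the same idea uses itertools.islice on the file iterator; read_in_batches is just an illustrative name, and process() is the question's placeholder:

from itertools import islice

def read_in_batches(path, batch_size=100000):
    # Only one batch of at most batch_size lines is in memory at any time.
    with open(path) as f:
        while True:
            batch = list(islice(f, batch_size))
            if not batch:
                break
            yield batch

for batch in read_in_batches('datafile', batch_size=500000):
    for line in batch:
        process(line)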

Answered by mgilson


The fileinput module will let you read the file line by line without loading it entirely into memory (see the fileinput documentation in the Python docs).

import fileinput
for line in fileinput.input(['myfile']):
    do_something(line)

Code example taken from yak.net
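
As a side note (not part of the original example), fileinput also tracks the current file name and the line number within it, which can be handy when post-processing several trace files in one pass; the file names below are hypothetical:

import fileinput

for line in fileinput.input(['trace1.tr', 'trace2.tr']):
    if fileinput.isfirstline():
        print 'Now reading %s' % fileinput.filename()
    do_something(line)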

Answered by chucksmash