I'm using Python 2.6.2 [GCC 4.3.3] on Ubuntu 9.04. I need to read a big datafile (~1GB, >3 million lines) line by line in a Python script.
I tried the methods below, but each one uses a very large amount of memory (~3GB):
for line in open('datafile', 'r').readlines():
    process(line)
or,
for line in file(datafile):
    process(line)
Is there a better way to load a large file line by line?
Several suggestions gave the methods I mentioned above, which I have already tried; I'm trying to see if there is a better way to handle this. My search has not been fruitful so far. I appreciate your help.
P/S: I have done some memory profiling using Heapy and found no memory leaks in the Python code I am using.
Update 20 August 2012, 16:41 (GMT+1)
Tried both approaches as suggested by J.F. Sebastian, mgilson and IamChuckB (datafile is a variable):
with open(datafile) as f:
    for line in f:
        process(line)
Also,
import fileinput
for line in fileinput.input([datafile]):
    process(line)
Strangely, both of them use ~3GB of memory; my datafile in this test is 765.2MB, consisting of 21,181,079 lines. I see the memory incrementing over time (in roughly 40-80MB steps) before stabilizing at 3GB.
An elementary doubt: is it necessary to flush the line after use?
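On the flushing doubt: individual lines never need explicit flushing; CPython frees each string as soon as nothing references it. Memory only accumulates if `process()` keeps the lines alive. A self-contained sketch (both `process` functions here are hypothetical stand-ins, not from the original script):

```python
kept = []

def process(line):
    # Hypothetical worst case: storing every line keeps them all
    # alive, so memory grows in proportion to the file size.
    kept.append(line)

def process_ok(line):
    # Only inspecting the line lets CPython free it as soon as
    # the loop advances; no explicit flush is needed.
    return len(line)

process('first line\n')
print(kept)  # ['first line\n']
```

If the real `process()` stashes lines (or substrings of them) in a long-lived structure, that alone explains steadily growing memory.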
I did memory profiling using Heapy to understand this better.
Level 1 Profiling
Partition of a set of 36043 objects. Total size = 5307704 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 15934 44 1301016 25 1301016 25 str
1 50 0 628400 12 1929416 36 dict of __main__.NodeStatistics
2 7584 21 620936 12 2550352 48 tuple
3 781 2 590776 11 3141128 59 dict (no owner)
4 90 0 278640 5 3419768 64 dict of module
5 2132 6 255840 5 3675608 69 types.CodeType
6 2059 6 247080 5 3922688 74 function
7 1716 5 245408 5 4168096 79 list
8 244 1 218512 4 4386608 83 type
9 224 1 213632 4 4600240 87 dict of type
<104 more rows. Type e.g. '_.more' to view.>
============================================================
Level 2 Profiling for Level 1-Index 0
Partition of a set of 15934 objects. Total size = 1301016 bytes.
Index Count % Size % Cumulative % Referred Via:
0 2132 13 274232 21 274232 21 '.co_code'
1 2132 13 189832 15 464064 36 '.co_filename'
2 2024 13 114120 9 578184 44 '.co_lnotab'
3 247 2 110672 9 688856 53 "['__doc__']"
4 347 2 92456 7 781312 60 '.func_doc', '[0]'
5 448 3 27152 2 808464 62 '[1]'
6 260 2 15040 1 823504 63 '[2]'
7 201 1 11696 1 835200 64 '[3]'
8 188 1 11080 1 846280 65 '[0]'
9 157 1 8904 1 855184 66 '[4]'
<4717 more rows. Type e.g. '_.more' to view.>
Level 2 Profiling for Level 1-Index 1
Partition of a set of 50 objects. Total size = 628400 bytes.
Index Count % Size % Cumulative % Referred Via:
0 50 100 628400 100 628400 100 '.__dict__'
Level 2 Profiling for Level 1-Index 2
Partition of a set of 7584 objects. Total size = 620936 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1995 26 188160 30 188160 30 '.co_names'
1 2096 28 171072 28 359232 58 '.co_varnames'
2 2078 27 157608 25 516840 83 '.co_consts'
3 261 3 21616 3 538456 87 '.__mro__'
4 331 4 21488 3 559944 90 '.__bases__'
5 296 4 20216 3 580160 93 '.func_defaults'
6 55 1 3952 1 584112 94 '.co_freevars'
7 47 1 3456 1 587568 95 '.co_cellvars'
8 35 0 2560 0 590128 95 '[0]'
9 27 0 1952 0 592080 95 '.keys()[0]'
<189 more rows. Type e.g. '_.more' to view.>
Level 2 Profiling for Level 1-Index 3
Partition of a set of 781 objects. Total size = 590776 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1 0 98584 17 98584 17 "['locale_alias']"
1 29 4 35768 6 134352 23 '[180]'
2 28 4 34720 6 169072 29 '[90]'
3 30 4 34512 6 203584 34 '[270]'
4 27 3 33672 6 237256 40 '[0]'
5 25 3 26968 5 264224 45 "['data']"
6 1 0 24856 4 289080 49 "['windows_locale']"
7 64 8 20224 3 309304 52 "['inters']"
8 64 8 17920 3 327224 55 "['galog']"
9 64 8 17920 3 345144 58 "['salog']"
<84 more rows. Type e.g. '_.more' to view.>
============================================================
Level 3 Profiling for Level 2-Index 0, Level 1-Index 0
Partition of a set of 2132 objects. Total size = 274232 bytes.
Index Count % Size % Cumulative % Referred Via:
0 2132 100 274232 100 274232 100 '.co_code'
Level 3 Profiling for Level 2-Index 0, Level 1-Index 1
Partition of a set of 50 objects. Total size = 628400 bytes.
Index Count % Size % Cumulative % Referred Via:
0 50 100 628400 100 628400 100 '.__dict__'
Level 3 Profiling for Level 2-Index 0, Level 1-Index 2
Partition of a set of 1995 objects. Total size = 188160 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1995 100 188160 100 188160 100 '.co_names'
Level 3 Profiling for Level 2-Index 0, Level 1-Index 3
Partition of a set of 1 object. Total size = 98584 bytes.
Index Count % Size % Cumulative % Referred Via:
0 1 100 98584 100 98584 100 "['locale_alias']"
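As an aside for readers on modern Python 3: the stdlib tracemalloc module gives a breakdown comparable to Heapy's Level 1 table, grouped by allocation site instead of by type. A minimal sketch (the list comprehension is just a dummy workload; the original post is on Python 2, where tracemalloc does not exist):

```python
import tracemalloc

tracemalloc.start()

# Dummy workload so the snapshot has something to show.
data = [str(i) for i in range(100000)]

snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics('lineno')

# Print the ten biggest allocation sites, analogous to the
# top rows of Heapy's partition table.
for stat in top[:10]:
    print(stat)
```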
Still troubleshooting this.
Do share with me if you have faced this before.
Thanks for your help.
Update 21 August 2012, 01:55 (GMT+1)
Here is a sample of my datafile:
s 1.231932886 _25_ AGT --- 0 exp 10 [0 0 0 0 Y Y] ------- [25:0 0:0 32 0 0]
s 1.232087886 _25_ MAC --- 0 ARP 86 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776108 _42_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776625 _34_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776633 _9_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776658 _0_ MAC --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232856942 _35_ MAC --- 0 ARP 28 [0 ffffffff 64 806 Y Y] ------- [REQUEST 100/25 0/0]
s 1.232871658 _0_ MAC --- 0 ARP 86 [13a 67 1 806 Y Y] ------- [REPLY 1/0 103/25]
r 1.233096712 _29_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097047 _4_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097050 _26_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097051 _1_ MAC --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233109522 _25_ MAC --- 0 ARP 28 [13a 67 1 806 Y Y] ------- [REPLY 1/0 103/25]
s 1.233119522 _25_ MAC --- 0 ACK 38 [0 1 67 0 Y Y]
r 1.233236204 _17_ MAC --- 0 ARP 28 [0 ffffffff 65 806 Y Y] ------- [REQUEST 101/25 0/0]
r 1.233236463 _20_ MAC --- 0 ARP 28 [0 ffffffff 65 806 Y Y] ------- [REQUEST 101/25 0/0]
D 1.233236694 _18_ MAC COL 0 ARP 86 [0 ffffffff 65 806 67 1] ------- [REQUEST 101/25 0/0]
The aim of the three-level Heapy profiling is to help me narrow down which object(s) are eating up most of the memory. As you can see, unfortunately I could not see which one specifically needs tweaking, as the output is too generic. For example, although I know "dict of __main__.NodeStatistics" has only 50 objects out of 36,043 (0.1%), yet takes up 12% of the total memory used to run the script, I am unable to find which specific dictionary I need to look into.
I tried implementing David Eyk's suggestion as below (snippet), manually garbage collecting every 500,000 lines:

import gc
for i, line in enumerate(file(datafile)):
    if i % 500000 == 0:
        print '-----------This is line number', i
        collected = gc.collect()
        print "Garbage collector: collected %d objects." % (collected)
Unfortunately, the memory usage is still at 3GB and the output (snippet) is as below,
-----------This is line number 0
Garbage collector: collected 0 objects.
-----------This is line number 500000
Garbage collector: collected 0 objects.
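The collector reporting 0 objects is actually expected: gc.collect() only reclaims reference *cycles*, while objects that simply go out of scope are freed immediately by reference counting. So if memory keeps growing despite manual collection, the lines are still being referenced somewhere, not waiting to be collected. A small sketch of the difference:

```python
import gc

# Non-cyclic garbage is reclaimed immediately by reference
# counting; the cycle collector finds nothing to report.
obj = [1, 2, 3]
del obj
print(gc.collect())  # usually 0 here

# A self-referential cycle cannot be freed by refcounting alone,
# so the collector finds and frees it.
cycle = []
cycle.append(cycle)
del cycle
print(gc.collect())  # at least 1: the cycle was collected
```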
I did the same memory profiling as before,
Level 1 Profiling
Partition of a set of 35474 objects. Total size = 5273376 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 15889 45 1283640 24 1283640 24 str
1 50 0 628400 12 1912040 36 dict of __main__.NodeStatistics
2 7559 21 617496 12 2529536 48 tuple
3 781 2 589240 11 3118776 59 dict (no owner)
4 90 0 278640 5 3397416 64 dict of module
5 2132 6 255840 5 3653256 69 types.CodeType
6 2059 6 247080 5 3900336 74 function
7 1716 5 245408 5 4145744 79 list
8 244 1 218512 4 4364256 83 type
9 224 1 213632 4 4577888 87 dict of type
<104 more rows. Type e.g. '_.more' to view.>
Comparing the previous memory profiling output with the above: str shrank by 45 objects (17,376 bytes), tuple shrank by 25 objects (3,440 bytes), and dict (no owner), though its object count is unchanged, shrank by 1,536 bytes. All other objects are the same, including dict of __main__.NodeStatistics. The total number of objects is now 35,474. The small reduction in objects (0.2%) yielded only a tiny memory saving (22MB out of 3GB). Very strange.
In other words, although I know roughly where the memory starvation is occurring, I am not yet able to narrow down which object is causing the bleed.
Will continue to investigate this.
Thanks for all the pointers; I'm using this opportunity to learn a lot about Python, as I'm not an expert. I appreciate the time you've taken to assist me.
Update 23 August 2012, 00:01 (GMT+1) -- SOLVED
I continued debugging using the minimalistic code per martineau's suggestion. I began adding code to the process function and observing the memory bleed.
I found that the memory starts to bleed when I add a class as below:
class PacketStatistics(object):
    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0
        ...
I am using 3 classes with 136 counters.
I discussed this issue with my friend Gustavo Carneiro; he suggested using __slots__ to replace __dict__.
I converted the class as below:
class PacketStatistics(object):
    __slots__ = ('event_id', 'event_source', 'event_dest', ...)
    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0
        ...
When I converted all 3 classes, memory usage dropped from 3GB to 504MB. A whopping 80% memory saving!!
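The saving comes from instances no longer carrying a per-instance __dict__: with __slots__, attributes live in fixed slots on the instance itself. A minimal self-contained illustration (class and attribute names here are made up for the example):

```python
import sys

class WithDict(object):
    def __init__(self):
        self.event_id = 0
        self.event_source = 0

class WithSlots(object):
    __slots__ = ('event_id', 'event_source')
    def __init__(self):
        self.event_id = 0
        self.event_source = 0

d, s = WithDict(), WithSlots()

# A slots instance has no per-instance __dict__ at all.
print(hasattr(d, '__dict__'), hasattr(s, '__dict__'))  # True False

# The dict alone costs this much per instance, on top of the object:
print(sys.getsizeof(d.__dict__))
```

With millions of instances (or, as here, 3 classes x 136 counters each), that per-instance dict overhead adds up quickly.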
Below is the memory profiling after the dict-to-__slots__ conversion.
Partition of a set of 36157 objects. Total size = 4758960 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 15966 44 1304424 27 1304424 27 str
1 7592 21 624776 13 1929200 41 tuple
2 780 2 587424 12 2516624 53 dict (no owner)
3 90 0 278640 6 2795264 59 dict of module
4 2132 6 255840 5 3051104 64 types.CodeType
5 2059 6 247080 5 3298184 69 function
6 1715 5 245336 5 3543520 74 list
7 225 1 232344 5 3775864 79 dict of type
8 244 1 223952 5 3999816 84 type
9 166 0 190096 4 4189912 88 dict of class
<101 more rows. Type e.g. '_.more' to view.>
The dict of __main__.NodeStatistics is not in the top 10 anymore.
I am happy with the result and glad to close this issue.
Thanks for all your guidance. Truly appreciate it.
rgds Saravanan K
with open('datafile') as f:
    for line in f:
        process(line)
This works because files are iterators yielding 1 line at a time until there are no more lines to yield.
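A quick way to see the iterator behaviour for yourself, using a small throwaway temp file (path and contents are made up for the demonstration):

```python
import os
import tempfile

# Create a small throwaway file to iterate over.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'w') as f:
    f.write('a\nb\nc\n')

# Iterating the file object yields one line per step; the whole
# file is never held in memory at once.
with open(path) as f:
    lines = [line.rstrip('\n') for line in f]

print(lines)  # ['a', 'b', 'c']
os.remove(path)
```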
The fileinput module will let you read it line by line without loading the entire file into memory. pydocs
import fileinput
for line in fileinput.input(['myfile']):
    do_something(line)
Code example taken from yak.net