Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

If RAM isn't a concern, is reading line by line faster or reading everything into RAM and access it? - Python

If RAM isn't a concern (I have close to 200GB on the server), is reading line by line faster or reading everything into RAM and access it? Each line will be a string of around 200-500 unicode characters. There are close to 2 million lines for each file.

Line-by-line

import codecs
for i in codecs.open('unicodefile','r','utf8'):
  print i

Reading into RAM

import codecs
for i in codecs.open('unicodefile','r','utf8').readlines():
  print i
like image 975
alvas Avatar asked Feb 23 '13 10:02

alvas


1 Answers

I used cProfile on a ~1MB dictionary words file. I read the same file 3 times. The first reads tho whole file in just to even the playing field in terms of it being stored in cache. Here is the simple code:

def first_read():
    codecs.open(file, 'r', 'utf8').readlines()

def line_by_line():
    for i in codecs.open(file, 'r', 'utf8'):
        pass

def at_once():
    for i in codecs.open(file, 'r', 'utf8').readlines():
        pass

first_read()
cProfile.run('line_by_line()')
cProfile.run('at_once()')

And here are the results:

Line by line:

         366959 function calls in 1.762 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.762    1.762 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 codecs.py:322(__init__)
        1    0.000    0.000    0.000    0.000 codecs.py:395(__init__)
    14093    0.087    0.000    0.131    0.000 codecs.py:424(read)
    57448    0.285    0.000    0.566    0.000 codecs.py:503(readline)
    57448    0.444    0.000    1.010    0.000 codecs.py:612(next)
        1    0.000    0.000    0.000    0.000 codecs.py:651(__init__)
    57448    0.381    0.000    1.390    0.000 codecs.py:681(next)
        1    0.000    0.000    0.000    0.000 codecs.py:686(__iter__)
        1    0.000    0.000    0.000    0.000 codecs.py:841(open)
        1    0.372    0.372    1.762    1.762 test.py:9(line_by_line)
    13316    0.011    0.000    0.023    0.000 utf_8.py:15(decode)
        1    0.000    0.000    0.000    0.000 {_codecs.lookup}
    27385    0.027    0.000    0.027    0.000 {_codecs.utf_8_decode}
    98895    0.011    0.000    0.011    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    13316    0.099    0.000    0.122    0.000 {method 'endswith' of 'unicode' objects}
       27    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
    14069    0.027    0.000    0.027    0.000 {method 'read' of 'file' objects}
    13504    0.020    0.000    0.020    0.000 {method 'splitlines' of 'unicode' objects}
        1    0.000    0.000    0.000    0.000 {open}

All at once:

         15 function calls in 0.023 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.023    0.023 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 codecs.py:322(__init__)
        1    0.000    0.000    0.000    0.000 codecs.py:395(__init__)
        1    0.000    0.000    0.003    0.003 codecs.py:424(read)
        1    0.000    0.000    0.014    0.014 codecs.py:576(readlines)
        1    0.000    0.000    0.000    0.000 codecs.py:651(__init__)
        1    0.000    0.000    0.014    0.014 codecs.py:677(readlines)
        1    0.000    0.000    0.000    0.000 codecs.py:841(open)
        1    0.009    0.009    0.023    0.023 test.py:13(at_once)
        1    0.000    0.000    0.000    0.000 {_codecs.lookup}
        1    0.003    0.003    0.003    0.003 {_codecs.utf_8_decode}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.001    0.001    0.001    0.001 {method 'read' of 'file' objects}
        1    0.010    0.010    0.010    0.010 {method 'splitlines' of 'unicode' objects}
        1    0.000    0.000    0.000    0.000 {open}

As you can see from the results, reading the whole file in at once is much faster, but you run the risk of a MemoryError being thrown in the file is too large.

like image 108
Gareth Webber Avatar answered Sep 30 '22 18:09

Gareth Webber