If RAM isn't a concern (I have close to 200 GB on the server), is reading line by line faster, or is reading everything into RAM and accessing it there faster? Each line will be a string of around 200-500 Unicode characters. There are close to 2 million lines in each file.
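For scale, a rough back-of-the-envelope estimate (my own numbers, not a measurement): 2,000,000 lines × ~400 characters ≈ 800 MB of raw text, or a few gigabytes once decoded into Python unicode objects, so the whole file should fit comfortably within 200 GB of RAM.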
Line-by-line
import codecs
for i in codecs.open('unicodefile','r','utf8'):
    print i
Reading into RAM
import codecs
for i in codecs.open('unicodefile','r','utf8').readlines():
    print i
I used cProfile on a ~1 MB dictionary-words file. I read the same file three times; the first read pulls in the whole file just to level the playing field in terms of OS caching. Here is the simple code:
import codecs
import cProfile

file = 'words.txt'  # placeholder path to the ~1MB dictionary-words file

def first_read():
    codecs.open(file, 'r', 'utf8').readlines()

def line_by_line():
    for i in codecs.open(file, 'r', 'utf8'):
        pass

def at_once():
    for i in codecs.open(file, 'r', 'utf8').readlines():
        pass

first_read()
cProfile.run('line_by_line()')
cProfile.run('at_once()')
And here are the results:
Line by line:
366959 function calls in 1.762 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.762 1.762 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 codecs.py:322(__init__)
1 0.000 0.000 0.000 0.000 codecs.py:395(__init__)
14093 0.087 0.000 0.131 0.000 codecs.py:424(read)
57448 0.285 0.000 0.566 0.000 codecs.py:503(readline)
57448 0.444 0.000 1.010 0.000 codecs.py:612(next)
1 0.000 0.000 0.000 0.000 codecs.py:651(__init__)
57448 0.381 0.000 1.390 0.000 codecs.py:681(next)
1 0.000 0.000 0.000 0.000 codecs.py:686(__iter__)
1 0.000 0.000 0.000 0.000 codecs.py:841(open)
1 0.372 0.372 1.762 1.762 test.py:9(line_by_line)
13316 0.011 0.000 0.023 0.000 utf_8.py:15(decode)
1 0.000 0.000 0.000 0.000 {_codecs.lookup}
27385 0.027 0.000 0.027 0.000 {_codecs.utf_8_decode}
98895 0.011 0.000 0.011 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
13316 0.099 0.000 0.122 0.000 {method 'endswith' of 'unicode' objects}
27 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
14069 0.027 0.000 0.027 0.000 {method 'read' of 'file' objects}
13504 0.020 0.000 0.020 0.000 {method 'splitlines' of 'unicode' objects}
1 0.000 0.000 0.000 0.000 {open}
All at once:
15 function calls in 0.023 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.023 0.023 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 codecs.py:322(__init__)
1 0.000 0.000 0.000 0.000 codecs.py:395(__init__)
1 0.000 0.000 0.003 0.003 codecs.py:424(read)
1 0.000 0.000 0.014 0.014 codecs.py:576(readlines)
1 0.000 0.000 0.000 0.000 codecs.py:651(__init__)
1 0.000 0.000 0.014 0.014 codecs.py:677(readlines)
1 0.000 0.000 0.000 0.000 codecs.py:841(open)
1 0.009 0.009 0.023 0.023 test.py:13(at_once)
1 0.000 0.000 0.000 0.000 {_codecs.lookup}
1 0.003 0.003 0.003 0.003 {_codecs.utf_8_decode}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.001 0.001 0.001 0.001 {method 'read' of 'file' objects}
1 0.010 0.010 0.010 0.010 {method 'splitlines' of 'unicode' objects}
1 0.000 0.000 0.000 0.000 {open}
As you can see from the results, reading the whole file in at once is much faster, but you run the risk of a MemoryError being thrown if the file is too large.
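If memory might still be a concern, one possible middle ground (my own sketch, not part of the measurements above; the function name and chunk size are arbitrary) is to read and decode the file in large chunks and split the lines yourself. That way you pay the decode/splitlines cost once per chunk rather than once per line, while never holding more than one chunk in memory:

import codecs

def iter_lines_chunked(path, chunk_size=16 * 1024 * 1024):
    # Read and decode roughly chunk_size bytes at a time, split the chunk
    # into lines, and carry any trailing partial line over to the next chunk.
    with codecs.open(path, 'r', 'utf8') as f:
        leftover = u''
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                if leftover:
                    yield leftover
                return
            lines = (leftover + chunk).splitlines(True)  # keep line endings
            # The last element may be an incomplete line; save it for later.
            if not lines[-1].endswith(u'\n'):
                leftover = lines.pop()
            else:
                leftover = u''
            for line in lines:
                yield line

Iterating over iter_lines_chunked('unicodefile') then behaves like the line-by-line version, but memory use is bounded by roughly one chunk at a time.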