
Why is it faster to read a file without line breaks?

In Python 3.6, it takes longer to read a file if it contains line breaks. If I have two files with the same text, one with line breaks and one without, the file with line breaks can take noticeably longer to read (in my tests, up to around twice as long). I have provided a specific example.

Step #1: Create the files

sizeMB = 128
sizeKB = 1024 * sizeMB

with open(r'C:\temp\bigfile_one_line.txt', 'w') as f:
    for i in range(sizeKB):
        f.write('Hello World!\t'*73)  # There are roughly 73 phrases in one KB

with open(r'C:\temp\bigfile_newlines.txt', 'w') as f:
    for i in range(sizeKB):  
        f.write('Hello World!\n'*73)

Step #2: Read the file with one single line and time performance

IPython

%%timeit
with open(r'C:\temp\bigfile_one_line.txt', 'r') as f:
    text = f.read()

Output

1 loop, best of 3: 368 ms per loop

Step #3: Read the file with many lines and time performance

IPython

%%timeit
with open(r'C:\temp\bigfile_newlines.txt', 'r') as f:
    text = f.read()

Output

1 loop, best of 3: 589 ms per loop

This is just one example. I have tested this in many different situations, and the result is the same:

  1. Different file sizes from 1MB to 2GB
  2. Using file.readlines() instead of file.read()
  3. Using a space instead of tab ('\t') in the single line file (i.e. 'Hello World! ')

My conclusion is that files with newline characters ('\n') take longer to read than files without them. However, I would expect all characters to be treated the same. This can have important consequences for performance when reading a lot of files. Does anyone know why this happens?

I am using Python 3.6.1, Anaconda 4.3.24, and Windows 10.

Asked by pwaivers on Sep 25 '17


1 Answer

When you open a file in Python in text mode (the default), it uses what it calls "universal newlines" (introduced with PEP 278, but somewhat changed later with the release of Python 3). What universal newlines means is that regardless of what kind of newline characters are used in the file, you'll see only \n in Python. So a file containing foo\nbar would appear the same as a file containing foo\r\nbar or foo\rbar (since \n, \r\n and \r are all line ending conventions used on some operating systems at some time).
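A minimal sketch of that translation, using a hypothetical temp file: writing the three line-ending conventions in binary mode (so the bytes hit disk untouched), then reading them back in text mode, shows everything collapsed to \n.

```python
import os
import tempfile

# Write a file containing each of the three line-ending conventions in
# binary mode, so the bytes land on disk exactly as written.
path = os.path.join(tempfile.gettempdir(), 'newline_demo.txt')
with open(path, 'wb') as f:
    f.write(b'unix\nwindows\r\nmac\rend')

# Read it back in text mode: universal newlines translates \r\n and \r to \n.
with open(path, 'r') as f:
    text = f.read()
print(repr(text))  # → 'unix\nwindows\nmac\nend'

# Read it in binary mode: the original bytes come back unchanged.
with open(path, 'rb') as f:
    raw = f.read()
print(raw)  # → b'unix\nwindows\r\nmac\rend'
```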

The logic that provides that support is probably what causes your performance differences. Even if the \n characters in the file are not being transformed, the code needs to examine them more carefully than it does non-newline characters.

I suspect the performance difference you see would disappear if you opened your files in binary mode, where no such newline support is provided. You can also pass a newline parameter to open in Python 3, which can have various meanings depending on exactly what value you give. I have no idea what impact any specific value would have on performance, but it might be worth testing whether the difference you're seeing actually matters to your program. I'd try passing newline="" and newline="\n" (or whatever your platform's conventional line ending is).
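To compare those options yourself, something like this small timing helper would work; the commented-out paths are the ones from the question and are only placeholders, so point it at your own files.

```python
import time

def time_read(path, **kwargs):
    """Time a single full read of `path` with the given open() options."""
    start = time.perf_counter()
    with open(path, **kwargs) as f:
        f.read()
    return time.perf_counter() - start

# Placeholder paths from the question; substitute your own files.
# path = r'C:\temp\bigfile_newlines.txt'
# print(time_read(path, mode='r'))               # text mode, universal newlines
# print(time_read(path, mode='r', newline=''))   # text mode, no translation
# print(time_read(path, mode='rb'))              # binary mode, raw bytes
```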

Answered by Blckknght on Sep 21 '22