
Python writelines() and write() huge time difference

I was working on a script that reads a folder of files (each ranging from 20 MB to 100 MB), modifies some data in each line, and writes the result to a copy of the file.

with open(inputPath, 'r+') as myRead:
    my_list = myRead.readlines()
    new_my_list = clean_data(my_list)
with open(outPath, 'w+') as myWrite:
    tempT = time.time()
    myWrite.writelines('\n'.join(new_my_list) + '\n')
    print(time.time() - tempT)
print(inputPath, 'Cleaning Complete.')

Running this code on a 90 MB file (~900,000 lines) printed 140 seconds as the time taken to write to the file. Here I used writelines(). I searched for ways to improve file writing speed, and most of the articles I read said that write() and writelines() should show no difference, since I am writing a single concatenated string. I also timed just the following statement:

new_string = '\n'.join(new_my_list) + '\n' 

It took only 0.4 seconds, so the long write time was not caused by building the joined string. To try out write(), I ran this code:

with open(inputPath, 'r+') as myRead:
    my_list = myRead.readlines()
    new_my_list = clean_data(my_list)
with open(outPath, 'w+') as myWrite:
    tempT = time.time()
    myWrite.write('\n'.join(new_my_list) + '\n')
    print(time.time() - tempT)
print(inputPath, 'Cleaning Complete.')

And it printed 2.5 seconds. Why is there such a large difference in the file writing time for write() and writelines() even though it is the same data? Is this normal behaviour or is there something wrong in my code? The output file seems to be the same for both cases, so I know that there is no loss in data.

Asked by Arjun Balgovind on Jun 15 '17 at 06:06




2 Answers

file.writelines() expects an iterable of strings. It then loops over the iterable and calls file.write() for each string. Expressed in Python, the method is equivalent to this:

def writelines(self, lines):
    for line in lines:
        self.write(line)

You are passing in a single large string, and a string is an iterable of strings too. When iterating you get individual characters, strings of length 1. So in effect you are making len(data) separate calls to file.write(). And that is slow, because you are building up a write buffer a single character at a time.
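The effect can be made visible by counting write() calls. The following is only a sketch: CountingStringIO is a hypothetical helper class (not part of the standard library) built on io.StringIO, and the sample lines are made up. It relies on writelines() calling write() once per item, as described above.

```python
import io


class CountingStringIO(io.StringIO):
    """Hypothetical helper: an in-memory text file that counts write() calls."""

    def __init__(self):
        super().__init__()
        self.write_calls = 0

    def write(self, s):
        self.write_calls += 1
        return super().write(s)


lines = ["alpha", "beta", "gamma"]      # made-up sample data
joined = "\n".join(lines) + "\n"

as_string = CountingStringIO()
as_string.writelines(joined)            # iterates over single characters

as_list = CountingStringIO()
as_list.writelines(line + "\n" for line in lines)   # iterates over whole lines

print(as_string.write_calls, "calls when passing the joined string")
print(as_list.write_calls, "calls when passing one item per line")
print("same contents:", as_string.getvalue() == as_list.getvalue())
```

Both variants produce the same contents; only the number of write() calls differs, one per character versus one per line.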

Don't pass in a single string to file.writelines(). Pass in a list or tuple or other iterable instead.

You could send in individual lines with added newline in a generator expression, for example:

 myWrite.writelines(line + '\n' for line in new_my_list) 

Now, if you could make clean_data() a generator, yielding cleaned lines, you could stream data from the input file, through your data cleaning generator, and out to the output file without using any more memory than is required for the read and write buffers and however much state is needed to clean your lines:

with open(inputPath, 'r+') as myRead, open(outPath, 'w+') as myWrite:
    myWrite.writelines(line + '\n' for line in clean_data(myRead))

In addition, I'd consider updating clean_data() to emit lines with newlines included.
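A minimal sketch of that streaming setup follows. The question does not show what clean_data() actually does, so the version here is a placeholder that only strips surrounding whitespace; the file names and sample contents are likewise assumptions for illustration.

```python
import os
import tempfile


def clean_data(lines):
    """Placeholder cleaner (assumption): strip surrounding whitespace and
    yield each line with exactly one trailing newline."""
    for line in lines:
        yield line.strip() + "\n"


with tempfile.TemporaryDirectory() as tmp:
    input_path = os.path.join(tmp, "input.txt")    # hypothetical file names
    out_path = os.path.join(tmp, "output.txt")

    # Create a small sample input file.
    with open(input_path, "w") as f:
        f.write("  first \nsecond\t\n third\n")

    # Stream: the file object yields lines lazily, clean_data() transforms
    # them one at a time, and writelines() writes each one as it arrives.
    with open(input_path) as my_read, open(out_path, "w") as my_write:
        my_write.writelines(clean_data(my_read))

    with open(out_path) as f:
        result = f.read()

print(repr(result))
```

Because the generator already yields newline-terminated lines, it can be handed to writelines() directly, and the whole file is never held in memory at once.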

Answered by Martijn Pieters on Oct 01 '22 at 02:10



As a complement to Martijn's answer: the best approach is to avoid building the joined string with join in the first place.

Just pass a generator expression to writelines, adding the newline at the end of each item: no unnecessary memory allocation and no explicit loop (besides the generator expression itself).

myWrite.writelines("{}\n".format(x) for x in my_list) 
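To compare the approaches on your own machine, a rough timing harness along these lines may help. It is a sketch only: timed() is a hypothetical helper, the data is made up, and it writes to an in-memory io.StringIO rather than a real file, so absolute numbers will differ from the timings reported in the question.

```python
import io
import time


def timed(label, action):
    """Hypothetical helper: run action(buffer) on a fresh in-memory file,
    print the elapsed time, and return the written contents."""
    buf = io.StringIO()
    start = time.perf_counter()
    action(buf)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s")
    return buf.getvalue()


my_list = [f"row {i}" for i in range(20_000)]   # made-up sample data

# One write() call with the whole joined string.
a = timed("write(joined string)",
          lambda f: f.write("\n".join(my_list) + "\n"))

# The pattern from the question: one write() call per character.
b = timed("writelines(joined string)",
          lambda f: f.writelines("\n".join(my_list) + "\n"))

# One write() call per line, with no intermediate joined string.
c = timed("writelines(generator)",
          lambda f: f.writelines("{}\n".format(x) for x in my_list))

print("identical output:", a == b == c)
```

All three produce the same contents, so any difference in the printed times comes from how many write() calls are made.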
Answered by Jean-François Fabre on Oct 01 '22 at 00:10
