
How to read a large file - line by line?

I want to iterate over each line of an entire file. One way to do this is to read the whole file into a list and then iterate over that list, but this uses a lot of memory, so I am looking for an alternative.

My code so far:

for each_line in fileinput.input(input_file):
    do_something(each_line)

for each_line_again in fileinput.input(input_file):
    do_something(each_line_again)

Executing this code gives an error message: device active.

Any suggestions?

The purpose is to calculate pair-wise string similarity: for each line in the file, I want to calculate the Levenshtein distance to every other line.
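A minimal sketch of that pair-wise computation, reading the file line by line rather than into a list: the outer loop walks the file once, and the inner loop re-opens it so that only two lines are held in memory at a time. The `levenshtein` helper here is a plain dynamic-programming edit distance written for illustration, not something from the question.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def pairwise_distances(path):
    # Re-open the file for the inner loop so only two lines
    # are in memory at any moment.
    with open(path) as outer:
        for i, line_a in enumerate(outer):
            with open(path) as inner:
                for j, line_b in enumerate(inner):
                    if j <= i:
                        continue  # skip self-pairs and already-seen pairs
                    yield i, j, levenshtein(line_a.rstrip("\n"),
                                            line_b.rstrip("\n"))
```

This trades extra I/O (the file is re-read once per line) for near-constant memory; if the file fits in memory, reading it once into a list is faster.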

384X21 asked Nov 04 '11


2 Answers

The correct, fully Pythonic way to read a file is the following:

with open(...) as f:
    for line in f:
        # Do something with 'line'

The with statement handles opening and closing the file, including when an exception is raised in the inner block. The for line in f loop treats the file object f as an iterable, which automatically uses buffered I/O and memory management, so you don't have to worry about large files.

There should be one -- and preferably only one -- obvious way to do it.

Katriel answered Oct 17 '22

Two memory-efficient ways, in ranked order (best first):

  1. use of with - supported in Python 2.5 and above
  2. use of yield if you really want control over how much to read

1. use of with

with is the nice and efficient Pythonic way to read large files. Advantages:

  1. The file object is automatically closed after exiting the with block.
  2. The file is closed even if an exception is raised inside the with block.
  3. The for loop iterates through the file object f line by line; internally it does buffered I/O (to optimize costly I/O operations) and memory management.

with open("x.txt") as f:
    for line in f:
        do_something(line)

2. use of yield

Sometimes one might want finer-grained control over how much to read in each iteration. In that case, use iter and yield. Note that with this method one explicitly needs to close the file at the end.

def readInChunks(fileObj, chunkSize=2048):
    """
    Lazy function to read a file piece by piece.
    Default chunk size: 2kB.
    """
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

f = open('bigFile')
for chunk in readInChunks(f):
    do_something(chunk)
f.close()
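The explicit close can be avoided by driving the same generator from inside a with block, since the generator never opens or closes anything itself. A small self-contained sketch (the generator is the one above in snake_case; `total_bytes` is a hypothetical helper added only to demonstrate the pattern):

```python
def read_in_chunks(file_obj, chunk_size=2048):
    """Lazily yield pieces of a file; same idea as readInChunks above."""
    while True:
        data = file_obj.read(chunk_size)
        if not data:
            break
        yield data

def total_bytes(path):
    # Hypothetical helper: sum chunk sizes. Because the generator is
    # driven from inside `with`, the file closes automatically, even
    # if processing a chunk raises an exception.
    total = 0
    with open(path) as f:
        for chunk in read_in_chunks(f):
            total += len(chunk)
    return total
```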

Pitfalls, and for the sake of completeness: the methods below are not as good or as elegant for reading large files, but read on for a rounded understanding.

In Python, the most common way to read lines from a file is to do the following:

for line in open('myfile', 'r').readlines():
    do_something(line)

When this is done, however, the readlines() function (the same applies to the read() function) loads the entire file into memory and then iterates over it. A slightly better approach for large files (though the first two methods above are best) is to use the fileinput module, as follows:

import fileinput

for line in fileinput.input(['myfile']):
    do_something(line)

The fileinput.input() call reads lines sequentially but doesn't keep them in memory after they've been read; this works because file objects in Python are iterable.
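Note that this relates to the error in the question: calling fileinput.input() a second time while a first pass is still active raises an "input() already active" error, because the module keeps global state. For a plain file object there is a simpler way to iterate twice: rewind with seek(0) instead of opening the file again. A minimal sketch:

```python
def iterate_twice(path):
    # Two full passes over the same open file: after the first pass
    # exhausts the iterator, seek(0) rewinds it for the second pass.
    passes = []
    with open(path) as f:
        passes.append([line.rstrip("\n") for line in f])
        f.seek(0)  # rewind instead of re-opening (or re-calling fileinput)
        passes.append([line.rstrip("\n") for line in f])
    return passes
```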

References

  1. Python with statement
Srikar Appalaraju answered Oct 17 '22