
Python Does Not Read Entire Text File

I'm running into a problem that I haven't seen anyone on StackOverflow encounter, or found anywhere on Google for that matter.

My main goal is to replace occurrences of a string in the file with another string. Is there a way to access all of the lines in the file?

The problem is that when I try to read in a large text file (1-2 GB), Python only reads a subset of it.

For example, I'll run a really simple snippet such as:

newfile = open("newfile.txt","w")
f = open("filename.txt","r")
for line in f:
    replaced = line.replace("string1", "string2")
    newfile.write(replaced)

It only writes the first 382 MB of the original file. Has anyone encountered this problem before?

I tried a few different solutions such as using:

import fileinput
import sys
for line in fileinput.input("filename.txt", inplace=1):
    sys.stdout.write(line.replace("string1", "string2"))

But it has the same effect. Reading the file in chunks does not help either, for example with:

f.read(10000)

I've narrowed it down to most likely being a reading problem and not a writing problem, because it also happens when simply printing out lines. I know that there are more lines: when I open the file in a full text editor such as Vim, I can see what the last line should be, and it is not the last line that Python prints.

Can anyone offer any advice or things to try?

I'm currently using a 32-bit version of Windows XP with 3.25 GB of RAM, and running Python 2.7.

Edit: Solution found (thanks, Lattyware). Reading the file in chunks with a generator:

def read_in_chunks(file, chunk_size=1000):
    # Yield the file's contents piece by piece until read() returns nothing.
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data
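
For the original goal of replacing a string, a chunk reader like the one above needs some care at chunk boundaries, because an occurrence of the search string can be split across two chunks. Below is a minimal sketch of one way to handle that, assuming both files are opened in binary mode and the search and replacement strings are passed as byte strings. `replace_in_file` and its parameters are hypothetical names for illustration, not something from the original post.

```python
def read_in_chunks(file_obj, chunk_size=1024 * 1024):
    """Yield successive chunks from file_obj until it is exhausted."""
    while True:
        data = file_obj.read(chunk_size)
        if not data:
            break
        yield data


def replace_in_file(src_path, dst_path, old, new, chunk_size=1024 * 1024):
    """Copy src_path to dst_path, replacing byte string old with new."""
    if not old:
        raise ValueError("old must be a non-empty byte string")
    # A match split across a boundary leaves at most len(old) - 1 of its
    # bytes at the end of the current buffer, so that many are held back.
    keep = len(old) - 1
    tail = b""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for chunk in read_in_chunks(src, chunk_size):
            parts = (tail + chunk).split(old)
            rest = parts.pop()  # bytes after the last complete match
            if parts:
                # Every remaining part was followed by a complete match.
                dst.write(new.join(parts) + new)
            cut = len(rest) - keep
            if cut > 0:
                dst.write(rest[:cut])
                tail = rest[cut:]
            else:
                tail = rest
        # Whatever is still held back cannot contain a complete match.
        dst.write(tail)
```

It could be called as `replace_in_file("filename.txt", "newfile.txt", b"string1", b"string2")`. Holding back the last few bytes of each buffer keeps memory use bounded by the chunk size while still catching matches that straddle two reads.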
user1297872 asked Mar 28 '12 10:03



3 Answers

Try:

f = open("filename.txt", "rb")

The rb mode opens the file in binary mode. According to the docs, text mode vs. binary mode only has an impact on end-of-line characters. But (if I remember correctly) opening a file in text mode on Windows also treats the EOF character (Ctrl-Z, hex 1A) specially, so a stray 1A byte in the data could make reading stop early.

You can also specify the mode when using fileinput:

fileinput.input("filename.txt", inplace=1, mode="rb")
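
One way to check whether a stray EOF byte is really the cause is to scan the file in binary mode for the first 0x1A byte and compare its offset with the point where reading stops (around 382 MB in the question). This is only a diagnostic sketch, and `find_first_ctrl_z` is a hypothetical helper name, not part of any library:

```python
def find_first_ctrl_z(path, chunk_size=1024 * 1024):
    """Return the byte offset of the first 0x1A byte in path, or -1 if none."""
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # Reached the real end of the file without finding 0x1A.
                return -1
            index = chunk.find(b"\x1a")
            if index != -1:
                return offset + index
            offset += len(chunk)
```

If `find_first_ctrl_z("filename.txt")` returns an offset close to where the output gets cut off, that supports the text-mode explanation. A result of -1 would point towards a different cause.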
codeape answered Oct 16 '22 16:10


Are you sure the problem is with reading and not with writing out? Do you close the file that is written to, either explicitly with newfile.close() or by using the with construct?

Not closing the output file is often the source of such problems when buffering is going on somewhere. If that's the case in your setup too, closing the file should fix your initial attempts.
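
As a rough sketch of what that could look like, the loop from the question can be wrapped in a with statement that covers both files, so the output is flushed and closed even if an error occurs part-way through. `replace_lines` is a hypothetical name used only for this example:

```python
def replace_lines(src_path, dst_path, old, new):
    """Copy src_path to dst_path line by line, replacing old with new."""
    # Both files are closed when the with block exits, which also
    # flushes any output still sitting in the write buffer.
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(line.replace(old, new))
```

For the files in the question this would be called as `replace_lines("filename.txt", "newfile.txt", "string1", "string2")`.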

benroth answered Oct 16 '22 17:10


If you use the file like this:

with open("filename.txt") as f:
    for line in f:
        newfile.write(line.replace("string1", "string2"))

It should only read one line into memory at a time, unless you keep a reference to that line elsewhere.
After each line is read, it is up to Python's garbage collector to get rid of it. Give this a try and see if it works for you :)

Serdalis answered Oct 16 '22 17:10