I'm running into a problem that I haven't seen anyone on StackOverflow encounter or even google for that matter.
My main goal is to be able to replace occurences of a string in the file with another string. Is there a way there a way to be able to acess all of the lines in the file.
The problem is that when I try to read in a large text file (1-2 gb) of text, python only reads a subset of it.
For example, I'll do a really simply command such as:
newfile = open("newfile.txt","w")
f = open("filename.txt","r")
for line in f:
replaced = line.replace("string1", "string2")
newfile.write(replaced)
And it only writes the first 382 mb of the original file. Has anyone encountered this problem previously?
I tried a few different solutions such as using:
import fileinput
for i, line in enumerate(fileinput.input("filename.txt", inplace=1)
sys.stdout.write(line.replace("string1", "string2")
But it has the same effect. Nor does reading the file in chunks such as using
f.read(10000)
I've narrowed it down to mostly likely being a reading in problem and not a writing problem because it happens for simply printing out lines. I know that there are more lines. When I open it in a full text editor such as Vim, I can see what the last line should be, and it is not the last line that python prints.
Can anyone offer any advice or things to try?
I'm currently using a 32-bit version of Windows XP with 3.25 gb of ram, and running Python 2.7
*Edit Solution Found (Thanks Lattyware). Using an Iterator
def read_in_chunks(file, chunk_size=1000):
while True:
data = file.read(chunk_size)
if not data: break
yield data
To read a text file in Python, you follow these steps: First, open a text file for reading by using the open() function. Second, read text from the text file using the file read() , readline() , or readlines() method of the file object. Third, close the file using the file close() method.
In Python, files are read by using the readlines() method. The readlines() method returns a list where each item of the list is a complete sentence in the file. This method is useful when the file size is small.
Method 1: fileobject.readlines() A file object can be created in Python and then readlines() method can be invoked on this object to read lines into a stream. This method is preferred when a single line or a range of lines from a file needs to be accessed simultaneously.
Try:
f = open("filename.txt", "rb")
On Windows, rb
means open file in binary mode. According to the docs, text mode vs. binary mode only has an impact on end-of-line characters. But (if I remember correctly) I believe opening files in text mode on Windows also does something with EOF (hex 1A).
You can also specify the mode when using fileinput
:
fileinput.input("filename.txt", inplace=1, mode="rb")
Are you sure the problem is with reading and not with writing out?
Do you close the file that is written to, either explicitly newfile.close()
or using the with
construct?
Not closing the output file is often the source of such problems when buffering is going on somewhere. If that's the case in your setting too, closing should fix your initial solutions.
If you use the file like this:
with open("filename.txt") as f:
for line in f:
newfile.write(line.replace("string1", "string2"))
It should only read into memory one line at a time, unless you keep a reference to that line in memory.
After each line is read it will be up to pythons garbage collector to get rid of it. Give this a try and see if it works for you :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With