Python Writing From Multiple Files Overwrites Previous Content

Question

I'm loading the files and formatting the output the way I want, except that each pass through the loop overwrites the previous output instead of appending to it, so I am left with only the data from the last file in the loop.

#!/usr/bin/env python3
import glob
import xml.etree.ElementTree as ET
filenames = glob.glob("C:\Users\####\Desktop\BNC2\[A00-A0B]*.xml")
for filename in filenames:
    with open(filename, 'r', encoding="utf-8") as content:
        tree = ET.parse(content)
        root = tree.getroot()
        outF = open("C:\Users\####\Desktop\bnc.txt", "w")
        for w in root.iter('w'):
            lemma = w.get('hw')
            pos = w.get('pos')
            tag = w.get('c5')

            outF.write(w.text + "," + lemma + "," + pos + "," + tag)
            outF.write("\n")

Example:

File 1 - a,b,c,d

File 2 - e,f,g,h

Desired Output:

a,b,c,d

e,f,g,h

Current Output:

e,f,g,h

sberry · Accepted Answer

The problem is that you are opening the output file outF with mode w, which truncates the file every time it is opened. Use mode a (append) instead.

changing

outF = open("C:\Users\####\Desktop\bnc.txt", "w")

to

outF = open("C:\Users\####\Desktop\bnc.txt", "a")

should solve the problem. You could also use a+, which, like a, does not truncate the file the way w does (note that w+ does truncate it). But here's another idea altogether (which will work with w):

#!/usr/bin/env python3
import glob
import xml.etree.ElementTree as ET

# Raw strings (r"...") keep Python from reading "\U" in the path as a Unicode escape.
filenames = glob.glob(r"C:\Users\####\Desktop\BNC2\[A00-A0B]*.xml")
out_lines = []  # one output line per <w> element, collected across all input files
for filename in filenames:
    with open(filename, 'r', encoding="utf-8") as content:
        tree = ET.parse(content)
        root = tree.getroot()
        for w in root.iter('w'):
            lemma = w.get('hw')
            pos = w.get('pos')
            tag = w.get('c5')

            out_lines.append(w.text + "," + lemma + "," + pos + "," + tag)

# Open the output file once, after every input file has been read.
with open(r"C:\Users\####\Desktop\bnc.txt", "w") as out_file:
    for line in out_lines:
        out_file.write("{}\n".format(line))
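To see how the modes discussed in this answer behave, here is a minimal sketch that writes two chunks with two separate open() calls per mode and reads back what is left. The helper name final_contents and the file name demo.txt are made up for illustration, and it runs in a temporary directory rather than on the paths from the question.

```python
import os
import tempfile


def final_contents(mode):
    """Write two chunks using two separate open() calls in `mode`
    and return what the file holds afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")  # hypothetical file name
        for chunk in ("first\n", "second\n"):
            with open(path, mode) as f:
                f.write(chunk)
        with open(path, "r") as f:
            return f.read()


# Compare the truncating modes (w, w+) with the appending modes (a, a+).
results = {mode: final_contents(mode) for mode in ("w", "w+", "a", "a+")}
for mode, text in results.items():
    print(mode, repr(text))
```

With w and w+ only the second chunk survives, while a and a+ keep both. One thing to keep in mind with a: the file also keeps growing across separate runs of the script unless it is deleted or truncated first.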

Jordan · Answer

The problem is that on this line:

outF = open("C:\Users\####\Desktop\bnc.txt", "w")

the same file is opened and closed over and over again, and every open with mode w truncates it, so whatever earlier iterations wrote is discarded.

Behind the scenes:

When you call open, the Python interpreter makes a system call to the operating system, asking the OS to look for the file with that name and return an integer (called a "file descriptor" or "FD") that refers to the file. If the system call succeeds, then the interpreter receives a FD, stores the FD in a new Python object, and returns that object from the open function.

When you call write, the interpreter takes your string and stores it in an internal buffer. When the buffer fills up, or when the outF object is destroyed (as we will see), the interpreter makes a system call asking the OS to write the contents of the buffer to the file that the FD refers to.
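The file descriptor and the buffering described above can be observed directly. This is a rough sketch, assuming the default buffering of a text file that is not attached to a terminal; the file name buffered.txt is made up, and everything happens in a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "buffered.txt")  # hypothetical file name

    out = open(path, "w")
    fd = out.fileno()  # the integer file descriptor stored in the file object

    out.write("a,b,c,d\n")
    # The text is still in the interpreter's buffer, so the file on disk is empty.
    size_before_flush = os.path.getsize(path)

    out.flush()  # ask for the buffer to be handed to the OS now
    size_after_flush = os.path.getsize(path)

    out.close()

print("file descriptor:", fd)
print("size before flush:", size_before_flush)
print("size after flush:", size_after_flush)
```

The size before the flush is 0 because such a short write does not fill the buffer. The exact size afterwards depends on the platform's newline translation.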

When there are no more references to a Python object, the interpreter is free to garbage collect it. But first, the interpreter needs to internally call the object's __del__ method, a.k.a. the object's destructor. A file object's destructor makes a final system call to tell the OS "I don't need this FD anymore, and you can close the file."

This next part is subtle. open creates and returns a new object (we'll call it f1); outF = open(...) assigns the identifier outF to f1. f1's reference count (the number of references to it) is now 1. On the next iteration, open(...) runs again and returns a new object (call it f2) that refers to the same file; because the mode is w, this call truncates the file. outF is then rebound to f2, which tells the interpreter that you no longer want outF to refer to f1. f1's reference count drops to 0, allowing the interpreter to destroy the object and close its file descriptor. f2's reference count is now 1.
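The effect of rebinding can be checked with a small experiment. This sketch relies on CPython's reference counting (other implementations may close the old file later), and it uses two different made-up file names so that truncation does not get in the way of the observation.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    first_path = os.path.join(tmp, "first.txt")    # hypothetical file names
    second_path = os.path.join(tmp, "second.txt")

    out = open(first_path, "w")  # this object plays the role of f1
    out.write("a,b,c,d\n")
    # Still buffered: nothing has reached the file yet.
    size_while_referenced = os.path.getsize(first_path)

    # Rebinding out drops the only reference to f1. On CPython the object is
    # destroyed right away, which flushes its buffer and closes the file.
    out = open(second_path, "w")  # this object plays the role of f2
    size_after_rebinding = os.path.getsize(first_path)

    out.close()

print("while f1 is referenced:", size_while_referenced)
print("after rebinding:", size_after_rebinding)
```

No close() is ever called on the first object, yet its data reaches the disk as soon as the name is rebound. Relying on this is fragile, which is one reason a with block is the safer habit.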

There is no need to open and close the file over and over again. I recommend opening it before the loop:

#!/usr/bin/env python3
import glob
import xml.etree.ElementTree as ET

# Raw strings (r"...") keep Python from reading "\U" in the path as a Unicode escape.
filenames = glob.glob(r"C:\Users\####\Desktop\BNC2\[A00-A0B]*.xml")
# Open the output file once, before the loop, so it is truncated only once.
with open(r"C:\Users\####\Desktop\bnc.txt", "w") as outF:
    for filename in filenames:
        with open(filename, 'r', encoding="utf-8") as content:
            tree = ET.parse(content)
            root = tree.getroot()
            for w in root.iter('w'):
                lemma = w.get('hw')
                pos = w.get('pos')
                tag = w.get('c5')

                outF.write(w.text + "," + lemma + "," + pos + "," + tag)
                outF.write("\n")

This has two advantages over building a list within the loop and then opening the file after the loop. This method iterates over the data once instead of twice, and the output needs only a constant amount of memory (the fixed size of the output buffer) instead of a list that grows with every line.
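For anyone who wants to try the open-once approach without the BNC files, here is a self-contained sketch. The two sample documents, their file names, and the doc root element are invented for illustration; only the w elements with the hw, pos and c5 attributes mirror the question. Substituting an empty string for a missing text or attribute is my own assumption, not something the original code does.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET

# Invented sample inputs that mimic the <w hw=... pos=... c5=...> elements.
SAMPLES = {
    "A00.xml": '<doc><w hw="b" pos="c" c5="d">a</w></doc>',
    "A01.xml": '<doc><w hw="f" pos="g" c5="h">e</w></doc>',
}

with tempfile.TemporaryDirectory() as tmp:
    for name, xml_text in SAMPLES.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(xml_text)

    out_path = os.path.join(tmp, "bnc.txt")
    pattern = os.path.join(glob.escape(tmp), "A*.xml")

    # The output file is opened (and truncated) exactly once, before the loop.
    with open(out_path, "w", encoding="utf-8") as out_file:
        for filename in sorted(glob.glob(pattern)):
            root = ET.parse(filename).getroot()
            for w in root.iter("w"):
                fields = (w.text, w.get("hw"), w.get("pos"), w.get("c5"))
                # Assumption: a missing text or attribute becomes an empty field.
                out_file.write(",".join(field or "" for field in fields) + "\n")

    with open(out_path, encoding="utf-8") as f:
        combined = f.read()

print(combined, end="")
```

The combined output holds one line per input file, in sorted file order, which matches the desired output from the question.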

Tags:

python

loops

pglove
