Large file not flushed to disk immediately after calling close()?

I'm creating large files with my Python script (each is more than 1GB, and there are actually 8 of them). Right after I create them I have to start a process that will use those files.

The script looks like:

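import os
import subprocess
import threading
import time

one_MB_chunk = b'\0' * (1 << 20)  # placeholder payload; the real data is not shown


# This is a more complex function, but it basically does this:
def use_file():
    subprocess.call(['C:\\use_file', 'C:\\foo.txt'])


f = open('C:\\foo.txt', 'wb')
for i in range(10000):  # range() is needed; iterating over a bare int raises TypeError
    f.write(one_MB_chunk)
f.flush()              # push Python's own buffer to the OS
os.fsync(f.fileno())   # ask the OS to commit its cache to disk
f.close()

time.sleep(5)  # With this line added it just works fine

t = threading.Thread(target=use_file)
t.start()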

But the application use_file acts as if foo.txt were empty. There are some weird things going on:

  • if I execute C:\use_file C:\foo.txt in a console (after the script has finished) I get correct results
  • if I manually execute use_file() in another Python console I get correct results
  • C:\foo.txt is visible on disk right after open() is called, but its size remains 0B until the end of the script
  • if I add time.sleep(5) it just starts working as expected (or rather, as required)

I've already found:

  • os.fsync() but it doesn't seem to work (result from use_file is as if C:\foo.txt was empty)
  • Using buffering=(1<<20) (when opening file) doesn't seem to work either

I'm more and more curious about this behaviour.

Questions:

  • Does Python fork the close() operation into the background? Where is this documented?
  • How can I work around this?
  • Am I missing something?
  • Since adding sleep fixes it: is this a Windows/Python bug?

Notes (in case something is wrong on the other side): the application use_file uses:

handle = CreateFile("foo.txt", GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, 0, NULL);
size = GetFileSize(handle, NULL);

It then processes size bytes from foo.txt.

Vyktor asked Dec 07 '12 11:12


1 Answer

f.close() calls f.flush(), which sends the data to the OS. That doesn't necessarily write the data to disk, because the OS buffers it. As you rightly worked out, if you want to force the OS to write it to disk, you need to call os.fsync().
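To make the two layers of buffering concrete, here is a minimal sketch of the write path described above. The helper name write_and_sync, the temporary path, and the small chunk size are assumptions made for illustration and are not part of the question's script.

```python
import os
import tempfile


def write_and_sync(path, chunk, count):
    """Write `count` copies of `chunk` to `path`, then push them through both buffers."""
    with open(path, 'wb') as f:
        for _ in range(count):
            f.write(chunk)
        f.flush()             # Python's internal buffer -> operating system
        os.fsync(f.fileno())  # operating system cache -> disk
    # Leaving the with block closes the file.


# Small demo with an assumed temporary file instead of C:\foo.txt.
demo_path = os.path.join(tempfile.mkdtemp(), 'foo.bin')
write_and_sync(demo_path, b'\0' * 1024, 4)
print(os.path.getsize(demo_path))
```

The order matters: os.fsync() only sees what has already left Python's buffer, so f.flush() has to come first.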

Have you considered just piping the data directly into use_file?
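A rough sketch of what piping could look like, assuming the consumer can read its input from stdin (the question does not say that use_file can, so this is an assumption). The CONSUMER command below is a hypothetical stand-in that just counts the bytes it receives; it is not the real use_file.

```python
import subprocess
import sys

# Hypothetical stand-in for use_file: reads all of stdin and prints the byte count.
CONSUMER = [sys.executable, '-c',
            'import sys; print(len(sys.stdin.buffer.read()))']


def pipe_chunks(cmd, chunk, count):
    """Stream `count` copies of `chunk` into the stdin of `cmd` and return its stdout."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    for _ in range(count):
        proc.stdin.write(chunk)
    proc.stdin.close()          # signals end of input to the consumer
    output = proc.stdout.read()
    proc.wait()
    return output


print(pipe_chunks(CONSUMER, b'\0' * 1024, 4))
```

This works here because the stand-in only writes output after it has read all of its input. A consumer that produces a lot of output while still reading could fill the stdout pipe and stall, so in that case the output would need to be drained concurrently.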


EDIT: you say that os.fsync() 'doesn't work'. To clarify, if you do

f = open(...)
# write data to f
f.flush()
os.fsync(f.fileno())
f.close()

import pdb; pdb.set_trace()

and then look at the file on disk, does it have data?
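The same check can be done without a debugger. The sketch below compares the size the writing process sees with the size a freshly started child process sees right after close(). The child command is a hypothetical stand-in for use_file, and the path and chunk size are assumptions.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for use_file: prints the size of the file named in argv[1].
CHILD = [sys.executable, '-c',
         'import os, sys; print(os.path.getsize(sys.argv[1]))']


def size_seen_by_child(path):
    """Start a separate process and return the file size it reports for `path`."""
    return int(subprocess.check_output(CHILD + [path]).strip())


def write_then_check(path, chunk, count):
    """Write, flush, fsync and close the file, then report (own size, child's size)."""
    with open(path, 'wb') as f:
        for _ in range(count):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())
    # The file is closed at this point.
    return os.path.getsize(path), size_seen_by_child(path)


demo_path = os.path.join(tempfile.mkdtemp(), 'foo.bin')
print(write_then_check(demo_path, b'\0' * 1024, 4))
```

If the two numbers differ, or the child reports 0, the problem is in how the file reaches the other process rather than in use_file itself.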

Katriel answered Sep 17 '22 14:09