
Python fails to open 11gb csv in r+ mode but opens in r mode

I'm having problems with some code that loops through a bunch of .csv files and deletes the final line if there's nothing in it (i.e. for files that end with the \n newline character).

My code works successfully on all files except one, which is the largest file in the directory at 11gb. The second largest file is 4.5gb.

The line it fails on is simply:

with open(path_str,"r+") as my_file:

and I get the following message:

IOError: [Errno 22] invalid mode ('r+') or filename: 'F:\\Shapefiles\\ab_premium\\processed_csvs\\a.csv'

I create path_str using os.path.join to avoid errors, and I tried renaming the file to a.csv just to make sure there wasn't anything odd going on with the filename. This made no difference.

Even more strangely, the file is happy to open in r mode. I.e. the following code works fine:

with open(path_str,"r") as my_file:

I have tried navigating around the file in read mode, and it's happy to read characters at the start, end, and in the middle of the file.

Does anyone know of any limits on the size of file that Python can deal with or why I might be getting this error? I'm on Windows 7 64bit and have 16gb of RAM.

RobinL asked Apr 19 '15 11:04


1 Answer

The default I/O stack in Python 2 is layered over CRT FILE streams. On Windows these are built on top of a POSIX emulation API that uses file descriptors (which in turn is layered over the user-mode Windows API, which is layered over the kernel-mode I/O system, which itself is a deeply layered system based on I/O request packets; the hardware is down there somewhere...). In the POSIX layer, opening a file with _O_RDWR | _O_TEXT mode (as in "r+") requires seeking to the end of the file to remove a trailing CTRL+Z, if one is present. Here's a quote from the CRT's fopen documentation:

Open in text (translated) mode. In this mode, CTRL+Z is interpreted as an end-of-file character on input. In files opened for reading/writing with "a+", fopen checks for a CTRL+Z at the end of the file and removes it, if possible. This is done because using fseek and ftell to move within a file that ends with a CTRL+Z, may cause fseek to behave improperly near the end of the file.

The problem here is that the above check calls the 32-bit _lseek instead of _lseeki64 (bear in mind that sizeof(long) is 4 bytes on 64-bit Windows, unlike most other 64-bit platforms). This fails for an 11 GB file. Specifically, SetFilePointer fails because it gets called with a NULL value for lpDistanceToMoveHigh. Here's the return value and LastErrorValue for the SetFilePointer call:

0:000> kc 2
Call Site
KERNELBASE!SetFilePointer
MSVCR90!lseek_nolock

0:000> r rax                       
rax=00000000ffffffff

0:000> dt _TEB @$teb LastErrorValue
ntdll!_TEB
   +0x068 LastErrorValue : 0x57

The error code 0x57 is ERROR_INVALID_PARAMETER. This is referring to lpDistanceToMoveHigh being NULL when trying to seek from the end of a large file.
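To make the size problem concrete, here is a small arithmetic sketch. It does not reproduce what the CRT does internally; it only shows that the size of this file (the value that f.seek(0, os.SEEK_END) reports in this answer) cannot be held in a 4-byte signed long, and how a 64-bit offset splits into the low and high halves that SetFilePointer takes through lDistanceToMove and lpDistanceToMoveHigh. The helper names fits_in_32bit_long and split_offset are made up for illustration.

```python
import ctypes

# Size of a.csv in bytes, as reported by seek() in this answer.
FILE_SIZE = 11811160064

# Range of a signed 32-bit C long (long is 4 bytes on 64-bit Windows).
LONG_MIN_32 = -2**31
LONG_MAX_32 = 2**31 - 1


def fits_in_32bit_long(offset):
    """Return True if offset is representable as a signed 32-bit long."""
    return LONG_MIN_32 <= offset <= LONG_MAX_32


def split_offset(offset):
    """Split a non-negative 64-bit offset into (low, high) 32-bit halves."""
    return offset & 0xFFFFFFFF, offset >> 32


print(fits_in_32bit_long(FILE_SIZE))       # False
print(split_offset(FILE_SIZE))             # (3221225472, 2)

# ctypes integer types do no overflow checking, so this shows what is left
# of the size once it is squeezed into 32 signed bits.
print(ctypes.c_int32(FILE_SIZE).value)     # -1073741824
```

The high half is nonzero, so a seek that only has room for the low 32 bits has no way to express a position near the end of this file.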

To work around this problem with CRT FILE streams, I recommend opening the file using io.open instead. This is a backported implementation of Python 3's I/O stack. It always opens files in raw binary mode (_O_BINARY), and it implements its own buffering and text-mode layers on top of the raw layer.

>>> import io, os
>>> f = io.open('a.csv', 'r+')
>>> f
<_io.TextIOWrapper name='a.csv' encoding='cp1252'>
>>> f.buffer
<_io.BufferedRandom name='a.csv'>
>>> f.buffer.raw
<_io.FileIO name='a.csv' mode='rb+'>
>>> f.seek(0, os.SEEK_END)
11811160064L
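For the task in the question, something along these lines might work with io.open. This is only a sketch: the question's own code isn't shown, so I'm assuming that "deleting the final line if there's nothing in it" means dropping a single trailing \n (or \r\n) from the file. The function name strip_final_newline is hypothetical. It opens the file in binary mode ("rb+") so that byte offsets are exact and no text-mode translation is involved.

```python
import io
import os


def strip_final_newline(path):
    """Remove one trailing newline (LF or CRLF) from the file, in place.

    Returns True if the file was shortened, False otherwise.
    """
    with io.open(path, "rb+") as f:
        # seek() on io objects returns the new absolute position,
        # which here is the file size in bytes.
        size = f.seek(0, os.SEEK_END)
        if size == 0:
            return False

        # Read at most the last two bytes to tell LF from CRLF.
        tail_len = min(2, size)
        f.seek(size - tail_len)
        tail = f.read(tail_len)

        if tail.endswith(b"\r\n"):
            f.truncate(size - 2)
        elif tail.endswith(b"\n"):
            f.truncate(size - 1)
        else:
            return False
        return True
```

Usage would be strip_final_newline(path_str) inside the loop over the files. Only the last two bytes are read, so the size of the file shouldn't matter for memory use.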
Eryk Sun answered Oct 27 '22 04:10