 

Change python file in place

Tags:

python

file

I have a large XML file (40 GB) that I need to split into smaller chunks. I am working with limited disk space, so is there a way to delete lines from the original file as I write them to new files?

Thanks!

Maulin asked Jul 17 '09 19:07


3 Answers

Say you want to split the file into N pieces, then simply start reading from the back of the file (more or less) and repeatedly call truncate:

Truncate the file's size. If the optional size argument is present, the file is truncated to (at most) that size. The size defaults to the current position. The current file position is not changed. ...

import os

BUF_SIZE = 4096
N = 10  # number of pieces (example value)
size = os.path.getsize("large_file")
chunk_size = max(1, size // N)
# or simply set a fixed chunk size based on your free disk space
c = 0

# Binary mode: Python 3 does not allow end-relative seeks on text-mode files.
in_ = open("large_file", "r+b")

while size > 0:
    in_.seek(-min(size, chunk_size), 2)
    # now you have to find a safe place to split the file at somehow
    # just read forward until you found one
    ...
    old_pos = in_.tell()
    # chunks are written back to front, so chunk 00 holds the end of the file
    with open("small_chunk%02d" % (c, ), "wb") as out:
        b = in_.read(BUF_SIZE)
        while len(b) > 0:
            out.write(b)
            b = in_.read(BUF_SIZE)
    in_.truncate(old_pos)
    size = old_pos
    c += 1

in_.close()

Be careful: I haven't tested any of this. You may need to call flush after the truncate call, and I don't know how quickly the file system will actually free up the space.
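As a rough illustration of the "find a safe place to split" step, here is a minimal sketch that assumes a newline is a safe split point (for XML you would want a closing tag instead). The function name split_from_back and its parameters are made up for this example and are not part of the answer above.

```python
import os
import shutil

BUF_SIZE = 4096

def split_from_back(path, chunk_size, prefix="small_chunk"):
    """Move the tail of `path` into new files, truncating `path` as it goes.

    Returns the chunk file names in original file order.
    """
    names = []
    size = os.path.getsize(path)
    with open(path, "r+b") as src:
        while size > 0:
            start = max(0, size - chunk_size)
            if start > 0:
                # Skip forward past the next newline so no line is cut in
                # half (assumption: a newline is a safe place to split).
                src.seek(start)
                src.readline()
                if src.tell() < size:
                    start = src.tell()
                # Otherwise the last line is longer than chunk_size and we
                # fall back to a plain byte split at `start`.
            name = "%s%03d" % (prefix, len(names))
            src.seek(start)
            with open(name, "wb") as out:
                # Copies from the current position to the end of the file.
                shutil.copyfileobj(src, out, BUF_SIZE)
            src.truncate(start)  # give the copied bytes back to the file system
            size = start
            names.append(name)
    names.reverse()  # chunk 000 was the tail, so reverse into file order
    return names
```

Calling split_from_back("large_file", 100 * 1024 * 1024) would leave "large_file" empty and return the chunk names in the order the data originally appeared. At any moment only about one chunk of extra disk space is needed.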

Torsten Marek answered Oct 18 '22 02:10


If you're on Linux/Unix, why not use the split command like this guy does?

split --bytes=100m /input/file /output/dir/prefix

EDIT: if you need to split at a pattern (such as a closing tag) rather than at a byte count, then use csplit.

plastic chris answered Oct 18 '22 03:10


I'm pretty sure there is, since I've even been able to read and edit the source files of scripts while they were running. The biggest problem would probably be all the shifting of data required if you started at the beginning of the file. If you instead go through the file and record the starting position of every line, you can copy the lines out in reverse order of position. Once that's done, go back through the new files one at a time and, if they're small enough, use readlines() to get a list of lines, reverse the list, seek to the beginning of the file, and overwrite the reversed lines with the lines in their original order.
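Recording the line starts could look something like this. It is only a sketch: line_starts is a name invented here, and for a 40 GB file you would probably record only every so many lines to keep the list small.

```python
def line_starts(f):
    """Return the byte offset at which each line of binary file `f` begins."""
    f.seek(0)
    starts = []
    pos = 0
    for line in f:
        starts.append(pos)
        pos += len(line)  # lines are bytes, so this counts bytes
    return starts
```

Walking the returned list with reversed() gives the positions in the back-to-front order described above.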

(After reading the first block of lines from the end, you would cut it off the original file with the truncate() method. Called with no arguments, truncate() discards all data past the current file position, assuming you are reading the file with one of the classes from the io package or a subclass of one. You just have to make sure the current file position is at the beginning of the last line written to a new file.)
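A small sketch of that truncate() behaviour, using io.BytesIO as a stand-in for a real file opened in "r+b" mode. The helper name pop_from is invented for this example.

```python
import io

def pop_from(f, pos):
    """Return everything from offset `pos` to the end and cut it off `f`."""
    f.seek(pos)
    tail = f.read()
    f.seek(pos)   # read() left the position at the end, so go back first
    f.truncate()  # no argument: truncate at the current position
    return tail

demo = io.BytesIO(b"first\nsecond\nthird\n")
tail = pop_from(demo, 13)    # 13 is where the last line starts
remaining = demo.getvalue()  # what is left in the "file"
```

Note the second seek: without it, truncate() would run at the end of the file and remove nothing.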

EDIT: Based on your comment about having to split at the proper closing tags, you'll probably also need an algorithm to detect those tags, possibly with a regular expression (and perhaps the peek method, to look ahead without moving the file position).
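For the tag detection, a regular expression over a block of bytes might be enough in simple cases. The sketch below works on a plain bytes buffer rather than peek, the function name and the tag name "item" are hypothetical, and it knows nothing about comments, CDATA sections or nested elements with the same name.

```python
import re

def last_closing_tag_end(buf, tag):
    """Return the offset just past the last </tag> in bytes `buf`, or -1."""
    pattern = re.compile(b"</" + re.escape(tag) + rb"\s*>")
    end = -1
    for match in pattern.finditer(buf):
        end = match.end()
    return end

block = b"<r><item>1</item><item>2</item><item>3"
cut = last_closing_tag_end(block, b"item")
safe_part = block[:cut]  # everything up to and including the last </item>
```

The returned offset is a candidate split point: bytes before it end on a complete element, and bytes after it belong to the next chunk.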

JAB answered Oct 18 '22 01:10