 

Efficient way to split a large text file in Python [duplicate]

This follows up on a previous question: to improve the time performance of a function in Python, I need to find an efficient way to split my text file.

I have the following text file (more than 32 GB), which is not sorted:

....................
0 274 593869.99 6734999.96 121.83 1,
0 273 593869.51 6734999.92 121.57 1,
0 273 593869.15 6734999.89 121.57 1,
0 273 593868.79 6734999.86 121.65 1,
0 272 593868.44 6734999.84 121.65 1,
0 273 593869.00 6734999.94 124.21 1,
0 273 593868.68 6734999.92 124.32 1,
0 274 593868.39 6734999.90 124.44 1,
0 275 593866.94 6734999.71 121.37 1,
0 273 593868.73 6734999.99 127.28 1,
.............................

The first and second columns are the ID (e.g. 0 273) giving the location of the x,y,z point in the grid.

def point_grid_id(x,y,minx,maxy,distx,disty):
    """give id (row,col)"""
    col = int((x - minx)/distx)
    row = int((maxy - y)/disty)
    return (row, col)

Here (minx, maxy) is the origin of my grid, with cell size distx, disty. The tile IDs are:

import numpy as np

# ny = number of rows, nx = number of columns
tiles_id = [j for j in np.ndindex(ny, nx)]  # [(0,0), (0,1), (0,2), ..., (ny-1,nx-1)]
n = len(tiles_id)

I need to slice the ~32 GB file into n (= len(tiles_id)) files.

I can do this without sorting, but it means reading the file n times. For this reason I wish to find an efficient way to split the file, starting from (0,0) (= tiles_id[0]). After that I only need to read each of the split files once.
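For example, this is how a single line would be mapped to its tile (the grid parameters below are only placeholders, not my real values):

minx, maxy = 593000.0, 6735000.0   # placeholder grid origin
distx, disty = 0.5, 0.5            # placeholder cell size

line = "0 274 593869.99 6734999.96 121.83 1,"
cols = line.split()
x, y = float(cols[2]), float(cols[3])
print(point_grid_id(x, y, minx, maxy, distx, disty))  # -> (row, col) tile id of this point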

Gianni Spear asked Mar 05 '13 15:03

2 Answers

Sorting a 32 GB file is hardly practical, whether you use Python or a command-line tool (sort). Databases may seem too heavyweight, but they could be used. However, if you don't want to use a database, I would suggest simply splitting the source file into per-tile files using the tile ID.

You read a line, build a file name from the tile ID, and append the line to that file, continuing until the source file is exhausted. It is not going to be very fast, but at least it has O(N) complexity, unlike sorting.

And, of course, you can then sort the individual files and concatenate them, as sketched below. The main bottleneck in sorting a 32 GB file should be memory, not CPU.
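A rough sketch of that later sort-and-concatenate step could look like this; the tile_*_*.tmp naming matches the splitting code below, and sorting each tile's lines by the x coordinate is just an assumption about the order you want:

import glob

def tile_sort_key(fn):
    # "tile_<row>_<col>.tmp" -> (row, col) as integers, so files concatenate in grid order
    row, col = fn[:-4].split("_")[1:3]
    return int(row), int(col)

def sort_and_concatenate(pattern="tile_*_*.tmp", out_name="sorted.txt"):
    # each per-tile file should be small enough to sort in memory
    with open(out_name, "w") as out:
        for fn in sorted(glob.glob(pattern), key=tile_sort_key):
            with open(fn) as f:
                lines = f.readlines()
            lines.sort(key=lambda l: float(l.split()[2]))  # sort by x (third column)
            out.writelines(lines)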

Here is the splitting itself, I think:

def temp_file_name(l):
    # the first two columns are the tile id; use them to build the file name
    id0, id1 = l.split()[:2]
    return "tile_%s_%s.tmp" % (id0, id1)

def split_file(name):
    ofiles = {}
    try:
        with open(name) as f:
            for l in f:
                if l.strip():  # skip blank lines
                    fn = temp_file_name(l)
                    if fn not in ofiles:
                        ofiles[fn] = open(fn, 'w')
                    ofiles[fn].write(l)
    finally:
        # close every per-tile file, even if reading fails halfway through
        for of in ofiles.values():
            of.close()

split_file('srcdata1.txt')

But if there are a lot of tiles, more than the number of files you can keep open at once, you can do this instead:

def split_file(name):
    with open(name) as f:
        for l in f:
            if l.strip():
                fn = temp_file_name(l)
                # open in append mode and close right away, so only one
                # output file is ever open at a time (slower, but safe)
                with open(fn, 'a') as of:
                    of.write(l)

And the most thorough approach is to close some files and remove them from the dictionary once you hit the limit on the number of open files.
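A minimal sketch of that idea, reusing the temp_file_name() helper from above (the cap of 200 open files is an arbitrary choice; keep it below the OS limit on file descriptors):

MAX_OPEN = 200  # arbitrary cap on simultaneously open per-tile files

def split_file_limited(name):
    # same idea as split_file(), but never keeps more than MAX_OPEN files open;
    # remove any stale tile_*.tmp files first, since files are (re)opened in append mode
    ofiles = {}
    try:
        with open(name) as f:
            for l in f:
                if l.strip():
                    fn = temp_file_name(l)
                    of = ofiles.get(fn)
                    if of is None:
                        if len(ofiles) >= MAX_OPEN:
                            # evict the most recently added handle to stay under the cap
                            # (a proper LRU cache would evict the least recently used one)
                            _, victim = ofiles.popitem()
                            victim.close()
                        of = ofiles[fn] = open(fn, 'a')
                    of.write(l)
    finally:
        for of in ofiles.values():
            of.close()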

Ellioh answered Nov 17 '22 08:11

A quick Google search led me to this recipe on ActiveState Code. It didn't give any performance comparison, but it seems to do the job.

In short, it seems to do what @Ellioh suggested, and you have a ready-made recipe, so you may not have to reinvent the wheel.

Abhijit answered Nov 17 '22 09:11