I have a text file of several GB in this format:
0 274 593869.99 6734999.96 121.83 1,
0 273 593869.51 6734999.92 121.57 1,
0 273 593869.15 6734999.89 121.57 1,
0 273 593868.79 6734999.86 121.65 1,
0 273 593868.44 6734999.84 121.65 1,
0 273 593869.00 6734999.94 124.21 1,
0 273 593868.68 6734999.92 124.32 1,
0 273 593868.39 6734999.90 124.44 1,
0 273 593866.94 6734999.71 121.37 1,
0 273 593868.73 6734999.99 127.28 1,
I have a simple filter function in Python 2.7 on Windows. It reads the entire file, selects the lines with a given idtile (first and second columns), and returns the list of points (x, y, z, and label) together with the idtile.
    import numpy as np

    tiles_id = [j for j in np.ndindex(ny, nx)]  # ny = number of rows, nx = number of columns
    idtile = tiles_id[0]

    def file_filter(name, idtile):
        lst = []
        for line in file(name, mode="r"):
            element = line.split()  # add value
            if (int(element[0]), int(element[1])) == idtile:
                lst.append(element[2:])
                dy, dx = int(element[0]), int(element[1])
        return (lst, dy, dx)
The file is more than 32 GB and the bottleneck is reading it. I am looking for suggestions or examples to speed up my function (e.g. parallel computing or other approaches).
My solution is to split the text file into tiles (using x and y location). The solution is not elegant and I am looking for an efficient approach.
Your 'idtile's appear to be in a certain order. That is, the sample data suggests that once you traverse through a certain 'idtile' and hit the next, there is no chance that a line with that 'idtile' will show up again. If this is the case, you may break the for loop once you finish dealing with the 'idtile' you want and hit a different one. Off the top of my head:
    loopkiller = False
    for line in file(name, mode="r"):
        element = line.split()
        if (int(element[0]), int(element[1])) == idtile:
            lst.append(element[2:])
            dy, dx = int(element[0]), int(element[1])
            loopkiller = True  # we are inside the wanted tile
        elif loopkiller:
            break  # first line after the wanted tile: stop reading
This way, once you are done with a certain 'idtile', you stop; whereas in your example, you keep on reading until the end of the file.
If your idtiles appear in a random order, maybe you could try writing an ordered version of your file first.
Also, evaluating the digits of your idtiles separately may help you traverse the file faster. Supposing your idtile is a two-tuple of a one-digit and a three-digit integer, perhaps something along the lines of:
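    row, col = str(idtile[0]), str(idtile[1])
    for line in file(name, mode="r"):
        element = line.split()
        # Reject the line at the first differing digit, before any int() call.
        if element[0][0] != row[0]:
            continue  # wrong row: the wanted tile is far away
        if element[1][0] != col[0]:
            continue  # first column digit differs
        if element[1][1] != col[1]:
            continue  # second column digit differs
        if element[1][2] != col[2]:
            continue  # last column digit differs: the wanted tile is close
        dy, dx = int(element[0]), int(element[1])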
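A related way to reject non-matching lines cheaply is to compare the start of each raw line against a prebuilt string, so that neither split() nor int() runs on lines that are skipped. This is a swapped-in variation on the digit-by-digit idea above, not the same code. The sketch assumes the first two fields are written without leading whitespace or zero padding and are separated by exactly one space, as in the sample data; the name file_filter_prefix is hypothetical.

```python
def file_filter_prefix(name, idtile):
    """Collect the points of one tile by matching the line prefix as text.

    Assumption: each line starts with "<row> <col> " using single spaces,
    as in the sample shown in the question.
    """
    # The trailing space stops (0, 27) from matching a line for tile (0, 273).
    prefix = "%d %d " % idtile
    lst = []
    with open(name) as f:
        for line in f:
            if line.startswith(prefix):
                # Only matching lines pay for the split.
                lst.append(line.split()[2:])
    return lst
```

Whether this is noticeably faster depends on how much of the total time is spent parsing rather than reading, so it is worth measuring on a sample of the file first.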
I would suggest comparing the time taken by your full reading procedure with the time taken just to read the lines and do nothing with them. If those times are close, the only thing you can really do is change the approach (splitting your files etc.), because what you can optimize in the function itself is the data processing time, not the file reading time.
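The comparison could be sketched as follows. The function names time_bare_read and time_read_and_parse are hypothetical, and the sketch assumes the same line layout as in the question. It is meant to be run on a sample of the file rather than on all 32 GB.

```python
import time

def time_bare_read(name):
    """Iterate over every line without touching it; return elapsed seconds."""
    start = time.time()
    with open(name) as f:
        for line in f:
            pass
    return time.time() - start

def time_read_and_parse(name, idtile):
    """Same loop plus the split/int/compare work.

    Returns (elapsed seconds, number of matching lines).
    """
    start = time.time()
    matches = 0
    with open(name) as f:
        for line in f:
            element = line.split()
            if (int(element[0]), int(element[1])) == idtile:
                matches += 1
    return time.time() - start, matches
```

Keep in mind that the operating system may cache the file after the first pass, so the second measurement can look faster than it would on a cold read. Running each measurement more than once, or in both orders, gives a fairer picture.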
I also see two points in your code that are worth fixing:
    with open(name) as f:
        for line in f:
            pass  # Here goes the loop body
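Use with so that the file is closed explicitly, even if an error occurs. Your version works in CPython, where the file object is closed as soon as nothing references it, but that behaviour depends on the implementation and is not guaranteed elsewhere.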
For every matching line you convert the same strings to int twice. That is a relatively slow operation. Remove the second conversion by reusing the result of the first.
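Putting both points together, a minimal sketch of the function might look like this. The grouped parameter is a hypothetical addition: it stops reading after the wanted tile, which is only safe if all lines of a tile are contiguous in the file. Unlike the original, the sketch returns the requested idtile directly, which also avoids an error when no line matches.

```python
def file_filter(name, idtile, grouped=False):
    """Return (points, dy, dx) for one tile.

    grouped=True is an assumption about the data: all lines of a tile
    are contiguous, so reading can stop after the tile ends.
    """
    lst = []
    found = False
    with open(name) as f:  # the file is closed when the block exits
        for line in f:
            element = line.split()
            if not element:
                continue  # skip blank lines
            # Convert once and reuse the result for the comparison.
            key = (int(element[0]), int(element[1]))
            if key == idtile:
                lst.append(element[2:])
                found = True
            elif found and grouped:
                break
    dy, dx = idtile
    return lst, dy, dx
```

With grouped left at False the function scans the whole file, as the original does.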
P.S. It looks like an array of depth or height values for a set of points on Earth surface, and the surface is split in tiles. :-)