How can I efficiently open a 30 GB file and process pieces of it without slowing down?

I have some large files (more than 30 GB) containing pieces of information on which I need to do some calculations, like averaging. The pieces I mention are slices of the file, and I know the beginning line number and the count of following lines for each slice.

So I have a dictionary whose keys are beginning line numbers and whose values are counts of following rows, and I use this dictionary to loop through the file and take slices of it. For each slice, I create a table, do some conversions and averaging, create a new table, and convert it into a dictionary. I use islice for slicing and a pandas DataFrame to create a table from each slice.

However, over time the process gets slower and slower, even though the slices are more or less the same size: the first 1k slices were processed in 1 hour, the second 1k in 4 hours, the third 1k in 8 hours, and the fourth 1k in 17 hours. I am now waiting days for the process to complete.

Right now I am doing this on a Windows 10 machine with a 1 TB SSD and 32 GB of RAM. Previously I also tried it on a Linux machine (Ubuntu 18.04) with a 250 GB SSD and 8 GB of RAM plus 8 GB of virtual memory. Both gave more or less the same result.

What I noticed on Windows is that 17% of the CPU and 11% of the memory are being used, but disk usage is at 100%. I do not fully understand what that disk usage means or how I can improve it.

As part of the code I was also importing the data into MongoDB while working on Linux, and I thought maybe the slowdown was caused by indexing in MongoDB. But when I printed the processing time and the import time, I noticed that almost all of the time is spent on processing; the import takes only a few seconds.
Also, to save time, I am now doing the processing on a stronger Windows machine and writing the results as txt files. I expect that writing to disk slows the process down a bit, but the txt files are no larger than 600 KB.

Below is the piece of code showing how I read the file:

from itertools import islice

with open(infile) as inp:
    starts = list(seg_ids.keys())    # beginning line number of each slice
    counts = list(seg_ids.values())  # number of following lines in each slice
    for i in range(len(seg_ids)):
        inp.seek(0)  # rewind to the start of the file before every slice
        segment_slice = islice(inp, starts[i], starts[i] + counts[i] + 1)
        segment = list(segment_slice)

        for line in segment[1:]:
            # create dataframe and perform calculations
            ...

So I want to know if there is a way to improve the processing time. I suspect my code reads the whole file from the beginning for each slice, and as the slices get closer to the end of the file, the reading takes longer and longer.

As a note, because of time constraints, I started with the most important slices first, so the remaining slices will be at more random positions in the files. The solution should therefore work for random slices, if there is one (I hope).

I am not experienced in scripting, so please forgive me if I am asking a silly question, but I really could not find an answer.

Asked Apr 23 '19 by E.Ergin

3 Answers

The problem here is that you are rereading a HUGE unindexed file multiple times, line by line, from the start of the file. No wonder this takes HOURS.

Each islice in your code starts from the very beginning of the file -- every time -- and reads and discards all the lines before it even reaches the start of the data of interest. That is very slow and inefficient.

The solution is to create a poor man's index for that file and then read smaller chunks for each slice.

Let's create a test file:

from pathlib import Path

p = Path('/tmp/file')

# write ~524 million numbered lines
with open(p, 'w') as f:
    for i in range(1024 * 1024 * 500):
        f.write(f'{i+1}\n')

print(f'{p} is {p.stat().st_size/(1024**3):.2f} GB')

That creates a file of roughly 4.78 GB. Not as large as 30 GB but big enough to be SLOW if you are not thoughtful.

Now try reading that entire file line by line with the Unix utility wc to count the total lines (wc is generally the fastest way to count lines):

$ time wc -l file
 524288000 file

real    0m3.088s
user    0m2.452s
sys 0m0.636s

Compare that with the speed of Python 3 reading the file line by line and printing the total:

$ time python3 -c 'with open("file","r") as f: print(sum(1 for l in f))'
524288000

real    0m53.880s
user    0m52.849s
sys 0m0.940s

Python is almost 18x slower than wc at reading the file line by line.

Now do a further comparison. Check out the speed of the Unix utility tail to print the last n lines of a file:

$ time tail -n 3 file 
524287998
524287999
524288000

real    0m0.007s
user    0m0.003s
sys 0m0.004s

The tail utility is 445x faster than wc (and roughly 8,000x faster than Python) at getting the last three lines of the file, because it uses a windowed buffer: tail reads some number of bytes at the end of the file and then extracts the last n lines from the buffer it read.

It is possible to apply the same tail approach to your application.

Consider this photo:

[photo: a 1950s computer room with a rack of magnetic tapes]

The approach you are using is the equivalent of reading every tape on that rack to find data that lives on only two of the middle tapes -- and doing it over and over and over...

In the 1950s (the era of the photo), each tape was roughly indexed by what it held. Computers would call for a specific tape from the rack -- not ALL the tapes in the rack.

The solution to your issue (in outline) is to build a tape-like indexing scheme:

  1. Run through the 30 GB file ONCE and create an index of sub-blocks, keyed by the starting line number of each block. Think of each sub-block as roughly one tape (except that it runs easily all the way to the end of the file...)
  2. Instead of calling inp.seek(0) before every read, you would seek to the block that contains the line number of interest (just like tail does) and then use islice with its offset adjusted for where the line of interest sits relative to that block's starting line number.
  3. You have a MASSIVE advantage compared with what they had to do in the '50s and '60s: you only need to calculate the starting block, since you have access to the entire remaining file. A 1950s tape index would call for tapes x, y, z, ... to read data larger than one tape could hold. You only need to find x, the block containing the starting line number of interest.
  4. BTW, since each IBM tape of this type held roughly 3 MB, your 30 GB file would be the equivalent of more than 10,000 of these tapes...

Correctly implemented (and this is not terribly hard to do), it would speed up your read performance by 100x or more.

Constructing a useful line-number-to-byte-offset index for a text file might be as easy as this:

def index_file(p, delimiter=b'\n', block_size=1024 * 1024):
    index = {0: 0}
    total_lines = 0
    with open(p, 'rb') as f:
        while buf := f.raw.read(block_size):
            cnt = buf.count(delimiter)   # complete lines in this block
            idx = buf.rfind(delimiter)   # position of the last newline in this block
            key = cnt + total_lines      # line number that begins right after that newline
            index[key] = f.tell() - (len(buf) - idx) + len(delimiter)
            total_lines += cnt

    return index

# this index is created in about 4.9 seconds on my computer...
# with a 4.8 GB file, there are ~4,800 index entries

That constructs an index correlating the starting line number of each block with its byte offset from the beginning of the file:

>>> idx=index_file(p)
>>> idx
{0: 0, 165668: 1048571, 315465: 2097150, 465261: 3145722,
...
 524179347: 5130682368, 524284204: 5131730938, 524288000: 5131768898}

Then if you want access to lines[524179347:524179500] you don't need to read 4.5 GB to get there; you can just do f.seek(5130682368) and start reading instantly.
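As a rough sketch of how such an index might be used to pull an arbitrary slice (the helper name read_slice, the use of bisect, and the binary-mode read are my own choices, not something the answer above prescribes):

from bisect import bisect_right
from itertools import islice

def read_slice(path, index, start_line, n_lines):
    # Return n_lines lines starting at start_line (0-indexed), using the block index.
    keys = sorted(index)                                    # line numbers where indexed blocks begin
    block_line = keys[bisect_right(keys, start_line) - 1]   # nearest indexed line <= start_line
    with open(path, 'rb') as f:
        f.seek(index[block_line])            # jump straight to that block
        skip = start_line - block_line       # only a few lines to skip within the block
        return [ln.decode() for ln in islice(f, skip, skip + n_lines)]

# e.g. grab 153 lines starting at line 524179347 without reading the preceding ~4.8 GB
rows = read_slice(p, idx, 524179347, 153)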

Answered Nov 15 '22 by dawg

A couple of things come to mind.

First, if you bring the data into a pandas DataFrame, there is a chunksize argument in read_csv for importing large data in pieces. It lets you process or drop exactly what you need while still giving you tools such as df.describe() for summary stats.
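A minimal sketch of that idea, assuming the file is CSV-like with a numeric column -- the file name and the column name value here are placeholders:

import pandas as pd

total, count = 0.0, 0
# stream the file in ~1M-row chunks instead of loading 30 GB at once
for chunk in pd.read_csv('bigfile.txt', chunksize=1_000_000):
    total += chunk['value'].sum()   # 'value' is a hypothetical column name
    count += len(chunk)

print('overall average:', total / count)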

Also, I hear great things about dask. It is a scalable platform for parallel, multi-core, multi-machine processing and is almost as simple to use as pandas and numpy, with very little resource management required.
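A hedged sketch of the equivalent in dask, again assuming a CSV-like file and the placeholder column name value:

import dask.dataframe as dd

df = dd.read_csv('bigfile.txt')        # lazily splits the file into partitions
print(df['value'].mean().compute())    # computes the average in parallel across partitions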

Answered Nov 15 '22 by tzujan


Use pandas or dask and pay attention to the options for read_csv(), mainly: chunksize, nrows, skiprows, usecols, engine (use the C engine), low_memory, memory_map.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
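For example, a single slice can be pulled straight from the file with skiprows and nrows; the start and count values below are placeholders, and note that skiprows still has to scan past the skipped lines:

import pandas as pd

start, count = 1_000_000, 500   # hypothetical slice boundaries
df = pd.read_csv(
    'bigfile.txt',
    skiprows=start,      # lines before the slice
    nrows=count,         # lines in the slice
    header=None,
    engine='c',
    memory_map=True,
)
print(df.mean())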

Answered Nov 15 '22 by Mo Huss