
Speeding up the light processing of a ~50GB CSV file

I have a ~50GB CSV file with which I need to:

  • Take several subsets of the columns of the CSV
  • Apply a different format string specification to each subset of columns of the CSV.
  • Output a new CSV for each subset with its own format specification.

I opted to use Pandas. My general approach is to iterate over the file in chunks of a convenient size (just over half a million lines), produce a DataFrame from each chunk, and append the relevant columns of that chunk to each output CSV. So something like this:

import pandas as pd

_chunk_size = 630100

column_mapping = {
    'first_output_specification' : ['Scen', 'MS', 'Time', 'CCF2', 'ESW10'],
    # ..... similar mappings for rest of output specifications
}
union_of_used_cols = ['Scen', 'MS', 'Time', 'CCF1', 'CCF2', 'VS', 'ESW 0.00397', 'ESW0.08',
                    'ESW0.25', 'ESW1', 'ESW 2', 'ESW3', 'ESW 5', 'ESW7', 'ESW 10', 'ESW12',
                    'ESW 15', 'ESW18', 'ESW 20', 'ESW22', 'ESW 25', 'ESW30', 'ESW 35', 
                    'ESW40']

chnk_iter = pd.read_csv('my_big_csv.csv', header=0, index_col=False,
                        iterator=True, na_filter=False, usecols=union_of_used_cols)

cnt = 0
while cnt < 100:
    chnk = chnk_iter.get_chunk(_chunk_size)
    chnk.to_csv('first_output_specification', float_format='%.8f',
                columns=column_mapping['first_output_specification'],
                mode='a',
                header=(cnt == 0),  # write each output file's header only once
                index=False)
    # ..... do the same thing for the rest of the output specifications

    cnt += 1

My problem is that this is really slow. Each chunk takes about a minute to read and append to the output CSV files, so I'm looking at almost 2 hours for the task to complete.

I have already tried a few optimizations, such as reading in only the union of the column subsets (via usecols) and setting na_filter=False, but the runtime still isn't acceptable.

I was wondering if there is a faster way to do this light processing of a CSV file in Python, either through an optimization or a correction to my approach, or perhaps a tool better suited to this kind of job than Pandas. To me (an inexperienced Pandas user) this looks as fast as it can get with Pandas, but I may very well be mistaken.

Eric Hansen asked Jul 25 '16 08:07


1 Answer

I don't think you're getting any advantage from a Pandas DataFrame here, so it is just adding overhead. Instead, you can use Python's own csv module, which is easy to use and nicely optimized in C.

Consider reading much larger chunks into memory (perhaps 10MB at a time), then writing out each of the reformatted column subsets before advancing to the next chunk. That way, the input file only gets read and parsed once.
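For instance, here is a minimal sketch of that approach with the standard csv module; the output file names, column positions, format strings, and batch size below are hypothetical placeholders rather than values from the question:

import csv

# Hypothetical specs: output file -> list of (input column index, format string or None).
output_specs = {
    'first_output.csv': [(0, None), (1, None), (2, None), (4, '%.8f'), (14, '%.8f')],
    # ... similar entries for the other output specifications
}

BATCH_ROWS = 100000  # rows per batch; tune so a batch is on the order of 10MB

def format_row(row, spec):
    # Keep only the wanted columns, applying a format string where one is given.
    return [fmt % float(row[i]) if fmt else row[i] for i, fmt in spec]

with open('my_big_csv.csv', newline='') as infile:
    reader = csv.reader(infile)
    header = next(reader)

    # Open every output file once, write its header row, and keep its writer around.
    sinks = []
    for out_name, spec in output_specs.items():
        f = open(out_name, 'w', newline='')
        writer = csv.writer(f)
        writer.writerow([header[i] for i, _ in spec])
        sinks.append((f, writer, spec))

    def flush(rows):
        # Write each reformatted column subset of the current batch to its file.
        for _, writer, spec in sinks:
            writer.writerows(format_row(row, spec) for row in rows)

    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= BATCH_ROWS:
            flush(batch)
            batch = []
    if batch:
        flush(batch)

    for f, _, _ in sinks:
        f.close()

Because every output file is opened once and written to incrementally, there is no repeated header handling and no per-chunk DataFrame construction.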

One other approach you could try is to preprocess the data with the Unix cut command to extract only the relevant columns (so that Python doesn't have to create objects and allocate memory for data in the unused columns): cut -d, -f1,3,5 somedata.csv
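If you take that route, the Python side can read the trimmed data straight from standard input instead of reopening the file; a small, hypothetical sketch (the script name and the formatting step are placeholders):

import csv
import sys

# Read rows piped in by a preprocessor, e.g.:
#   cut -d, -f1,3,5 somedata.csv | python reformat.py > first_output.csv
reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout)
writer.writerow(next(reader))  # pass the header row through
for row in reader:
    writer.writerow(row)       # apply any per-column formatting here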

Lastly, try running the code under PyPy so that the CPU-bound portion of your script gets optimized by its tracing JIT.

Raymond Hettinger answered Nov 10 '22 09:11