I have a ~50GB CSV file from which I have to produce several output CSV files, each containing a subset of the columns.
I opted to use Pandas, with the general approach of iterating over the file in chunks of a convenient size (just over half a million lines each) to produce a DataFrame, then appending each column subset to its output CSV. So something like this:
import pandas as pd

_chunk_size = 630100
column_mapping = {
    'first_output_specification': ['Scen', 'MS', 'Time', 'CCF2', 'ESW10'],
    # ..... similar mappings for rest of output specifications
}
union_of_used_cols = ['Scen', 'MS', 'Time', 'CCF1', 'CCF2', 'VS', 'ESW 0.00397', 'ESW0.08',
                      'ESW0.25', 'ESW1', 'ESW 2', 'ESW3', 'ESW 5', 'ESW7', 'ESW 10', 'ESW12',
                      'ESW 15', 'ESW18', 'ESW 20', 'ESW22', 'ESW 25', 'ESW30', 'ESW 35',
                      'ESW40']
chnk_iter = pd.read_csv('my_big_csv.csv', header=0, index_col=False,
                        iterator=True, na_filter=False, usecols=union_of_used_cols)
cnt = 0
while cnt < 100:
    chnk = chnk_iter.get_chunk(_chunk_size)
    chnk.to_csv('first_output_specification', float_format='%.8f',
                columns=column_mapping['first_output_specification'],
                mode='a',
                header=(cnt == 0),  # write the header only once, not per chunk
                index=False)
    # ..... do the same thing for the rest of the output specifications
    cnt += 1
My problem is that this is really slow: each chunk takes about a minute to read and append to the output CSVs, so I'm looking at almost two hours for the task to complete.
I have tried a few optimizations, such as reading in only the union of the column subsets and setting na_filter=False, but it still isn't acceptable.
I was wondering if there is a faster way to do this light processing of a CSV file in Python, either through an optimization or correction to my approach, or whether there is simply a better-suited tool for this kind of job than Pandas. To me (an inexperienced Pandas user) this looks as fast as it can get with Pandas, but I may very well be mistaken.
I don't think you're getting any advantage from a Pandas DataFrame here, so it is just adding overhead. Instead, you can use Python's own csv module, which is easy to use and nicely optimized in C.
Consider reading much larger chunks into memory (perhaps 10MB at a time), then writing out each of the reformatted column subsets before advancing to the next chunk. That way, the input file only gets read and parsed once.
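A minimal sketch of that single-pass approach with the csv module (the file name, output names, and column lists below are illustrative placeholders, not your real specifications):

```python
import csv

def split_csv(in_path, output_specs):
    """Split in_path into several CSVs, each holding a subset of columns.

    output_specs maps an output filename to the list of column names it
    should contain. The input file is read and parsed only once.
    """
    with open(in_path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        # Precompute the column indices for each output once, up front.
        indices = {out: [header.index(c) for c in cols]
                   for out, cols in output_specs.items()}
        files = {out: open(out, 'w', newline='') for out in output_specs}
        try:
            writers = {out: csv.writer(fh) for out, fh in files.items()}
            for out, cols in output_specs.items():
                writers[out].writerow(cols)  # header written once per file
            for row in reader:               # single pass over the big file
                for out, idx in indices.items():
                    writers[out].writerow([row[i] for i in idx])
        finally:
            for fh in files.values():
                fh.close()
```

You would call it as something like split_csv('my_big_csv.csv', {'first_output.csv': ['Scen', 'MS', 'Time', 'CCF2', 'ESW10'], ...}); since every output row is written while its input row is still in memory, no DataFrame is ever built.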
One other approach you could try is to preprocess the data with the Unix cut command to extract only the relevant columns (so that Python doesn't have to create objects and allocate memory for data in the unused columns): cut -d, -f1,3,5 somedata.csv
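For example (the sample file and the field numbers here are placeholders; point -f at the fields you actually need):

```shell
# Create a tiny sample file to demonstrate (stand-in for the real 50GB CSV):
printf 'a,b,c,d,e\n1,2,3,4,5\n' > somedata.csv

# Keep only fields 1, 3 and 5; Python then never sees the unused columns.
cut -d, -f1,3,5 somedata.csv > trimmed.csv
```

Note that cut splits on every comma, so this is only safe if none of the fields contain quoted, embedded commas.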
Lastly, try running the code under PyPy so that the CPU-bound portion of your script gets optimized by its tracing JIT.