How do I write out large data files to a CSV file in chunks?
I have a set of large data files (1M rows x 20 cols). However, only 5 or so columns of the data files are of interest to me.
I want to make things easier by making copies of these files with only the columns of interest, so I have smaller files to work with for post-processing. So I plan to read each file into a dataframe, then write it out to a new CSV file.
I've been looking into reading large data files in chunks into a dataframe. However, I haven't been able to find anything on how to write out the data to a csv file in chunks.
Here is what I'm trying now, but it doesn't append to the CSV file:
with open(os.path.join(folder, filename), 'r') as src:
    df = pd.read_csv(src, sep='\t', skiprows=(0, 1, 2), header=(0), chunksize=1000)
    for chunk in df:
        chunk.to_csv(os.path.join(folder, new_folder, "new_file_" + filename),
                     columns=[['TIME', 'STUFF']])
Use chunksize to read a large CSV file: call pandas.read_csv(file, chunksize=chunk) to read the file in pieces, where chunk is the number of rows to read per chunk.
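For instance, a minimal sketch of the chunked read (the file name and chunk size here are placeholders, not from the question):

import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames
# instead of loading the whole file into memory at once
for chunk in pd.read_csv("big_file.tsv", sep="\t", chunksize=1000):
    print(chunk.shape)  # each chunk has at most 1000 rows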
Use efficient datatypes: the default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.
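A hedged sketch of that idea, converting a repetitive text column to the categorical dtype (the column name 'STUFF' is borrowed from the question purely as an illustration):

import pandas as pd

df = pd.read_csv("big_file.tsv", sep="\t")
# A low-cardinality text column can be stored as a categorical dtype,
# which keeps one copy of each unique string plus small integer codes
df["STUFF"] = df["STUFF"].astype("category")
print(df.memory_usage(deep=True))  # per-column memory usage in bytes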
Solution:
header = True
for chunk in chunks:  # chunks is the iterator returned by pd.read_csv(..., chunksize=1000)
    chunk.to_csv(os.path.join(folder, new_folder, "new_file_" + filename),
                 header=header, columns=['TIME', 'STUFF'], mode='a')
    header = False
Notes:
mode='a'
tells pandas to append, so each chunk is added to the end of the file instead of overwriting it. The header flag ensures the column names are written only once, for the first chunk.
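Putting the pieces together, here is a minimal end-to-end sketch (assuming folder, new_folder, filename, and the column names are defined as in the question, and that the output file does not already exist, since mode='a' would otherwise keep appending to it):

import os
import pandas as pd

src_path = os.path.join(folder, filename)
dst_path = os.path.join(folder, new_folder, "new_file_" + filename)

# Read the tab-separated source in chunks of 1000 rows
chunks = pd.read_csv(src_path, sep='\t', skiprows=(0, 1, 2), header=0, chunksize=1000)

header = True
for chunk in chunks:
    # Append each chunk; write the column names only for the first one.
    # index=False avoids writing the DataFrame index as an extra column.
    chunk.to_csv(dst_path, header=header, columns=['TIME', 'STUFF'], mode='a', index=False)
    header = False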