Writing large Pandas Dataframes to CSV file in chunks


How do I write out large data files to a CSV file in chunks?

I have a set of large data files (1M rows x 20 cols). However, only 5 or so columns of the data files are of interest to me.

I want to make things easier by making copies of these files with only the columns of interest, so I have smaller files to work with for post-processing. My plan is to read each file into a dataframe and then write it out to a new CSV file.

I've been looking into reading large data files in chunks into a dataframe. However, I haven't been able to find anything on how to write out the data to a csv file in chunks.

Here is what I'm trying now, but it doesn't append to the CSV file:

    with open(os.path.join(folder, filename), 'r') as src:
        df = pd.read_csv(src, sep='\t', skiprows=(0, 1, 2), header=0, chunksize=1000)
        for chunk in df:
            chunk.to_csv(os.path.join(folder, new_folder,
                                      "new_file_" + filename),
                         columns=[['TIME', 'STUFF']])
asked Jul 22 '16 by Korean_Of_the_Mountain

People also ask

How big is too big for a pandas DataFrame?

There is no hard-coded size limit for a pandas DataFrame. Because pandas holds all of its data in memory, the practical limit is the RAM available on your machine; many operations also make temporary copies, so it is safest to keep a DataFrame well below total available memory and to process anything larger in chunks.
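As a quick way to see how close you are to that limit, you can check a DataFrame's actual memory footprint. A minimal sketch with made-up data, roughly the shape described in the question (1M rows x 20 cols):

    import numpy as np
    import pandas as pd

    # Hypothetical frame; the column names are placeholders.
    df = pd.DataFrame(np.random.rand(1_000_000, 20),
                      columns=[f"col{i}" for i in range(20)])

    # deep=True also counts the real memory of object (string) columns.
    print(df.memory_usage(deep=True).sum() / 1e6, "MB")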

How do I chunk a CSV file in Python?

Use chunksize to read a large CSV file: call pandas.read_csv(file, chunksize=chunk) to read the file, where chunk is the number of lines to be read in per chunk.
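For example, a minimal sketch of chunked reading (the file name, separator, and chunk size are placeholders):

    import pandas as pd

    # read_csv with chunksize returns an iterator instead of one big DataFrame.
    reader = pd.read_csv("data.tsv", sep="\t", chunksize=1000)

    for chunk in reader:
        # Each chunk is an ordinary DataFrame of up to 1000 rows.
        print(chunk.shape)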

Is pandas efficient for large data sets?

Use efficient data types. The default pandas data types are not the most memory-efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.
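A minimal sketch of this idea with made-up data: a low-cardinality text column stored as the category dtype usually takes far less memory than the default object dtype.

    import pandas as pd

    # 300,000 rows but only two distinct string values.
    df = pd.DataFrame({"status": ["open", "closed", "open"] * 100_000})

    print(df["status"].memory_usage(deep=True))                      # object dtype
    print(df["status"].astype("category").memory_usage(deep=True))   # typically much smaller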


1 Answer

Solution:

    header = True
    for chunk in chunks:   # chunks: the iterator returned by pd.read_csv(..., chunksize=...)
        chunk.to_csv(os.path.join(folder, new_folder, "new_file_" + filename),
                     header=header, columns=['TIME', 'STUFF'], mode='a')
        header = False

Notes:

  • mode='a' tells pandas to append to the file instead of overwriting it.
  • The column header is written only with the first chunk.
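Putting the question's reading code and this answer together, a complete sketch might look like the following. The paths, separator, skipped rows, and the TIME/STUFF columns come from the question; the concrete folder and file names here are placeholders, and index=False is an extra assumption to keep the row index out of the output.

    import os
    import pandas as pd

    # Placeholder values standing in for the question's variables.
    folder, new_folder, filename = "data", "trimmed", "run1.tsv"

    src_path = os.path.join(folder, filename)
    dst_path = os.path.join(folder, new_folder, "new_file_" + filename)

    header = True
    for chunk in pd.read_csv(src_path, sep='\t', skiprows=(0, 1, 2),
                             header=0, chunksize=1000):
        chunk.to_csv(dst_path, columns=['TIME', 'STUFF'],
                     header=header, mode='a', index=False)
        header = False   # only the first chunk writes the header row

Since only a few columns are of interest at all, passing usecols=['TIME', 'STUFF'] to read_csv would also avoid parsing the unwanted columns in the first place.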
answered Sep 17 '22 by Scratch'N'Purr