Divide .csv file into chunks with Python

Tags: python, pandas, csv

I have a large .csv file, well over 300 GB, that I would like to split into smaller files of 100,000,000 rows each (each row is roughly 55-60 bytes).

I wrote the following code:

import pandas as pd

# Stream the file in 100,000,000-row chunks instead of loading it all at once
df = pd.read_csv('/path/to/really/big.csv', header=None, chunksize=100000000)
count = 1
for chunk in df:
    name = '/output/to/this/directory/file_%s.csv' % count
    chunk.to_csv(name, header=False, index=False)
    print(count)
    count += 1

This code works fine, and I have plenty of disk space to store the roughly 5.5-6 GB output files, but it's slow.

Is there a better way?

EDIT

I have written the following iterative solution:

import csv

chunk_size = 100000000
with open('/path/to/really/big.csv', 'r', newline='') as csvfile:
    read_rows = csv.reader(csvfile)
    file_count = 1
    row_count = 0
    f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w', newline='')
    writer = csv.writer(f)
    for row in read_rows:
        writer.writerow(row)  # re-serialize the parsed fields as a CSV line
        row_count += 1
        if row_count % chunk_size == 0:  # start a new file every 100,000,000 rows
            f.close()
            file_count += 1
            f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w', newline='')
            writer = csv.writer(f)
    f.close()
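
Since every row is written back out unchanged, the csv parse/serialize step can be dropped entirely; a minimal sketch of the same loop over raw lines (same placeholder paths as above):

chunk_size = 100000000
file_count = 1
with open('/path/to/really/big.csv', 'r') as src:
    f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
    for row_count, line in enumerate(src, start=1):
        f.write(line)  # copy the raw line; delimiters and newline stay intact
        if row_count % chunk_size == 0:
            f.close()
            file_count += 1
            f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
    f.close()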

EDIT 2

I would like to call attention to Vor's comment about the Unix/Linux split command; it is the fastest solution I have found.

asked Sep 23 '15 by invoker

People also ask

How do you split a file into chunks in Python?

To split a big binary file into multiple files, read the original file in chunks of the size you want to create, write each chunk to its own file, and repeat until you reach the end of the original file.
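
A minimal sketch of that read-write loop, assuming placeholder file names and an arbitrary 64 MB chunk size:

chunk_size = 64 * 1024 * 1024  # 64 MB per part; pick any size
part = 0
with open('big.bin', 'rb') as src:
    while True:
        data = src.read(chunk_size)
        if not data:  # empty bytes means end of file
            break
        with open('big.bin.part%03d' % part, 'wb') as dst:
            dst.write(data)
        part += 1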


1 Answer

There is an existing tool for this in Unix/Linux:

split -l 100000 -d source destination

This splits source into files of 100,000 lines each, adding a two-digit numeric suffix to the destination prefix for each chunk.
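
For the 100,000,000-row chunks the question asks for, the same command works with a larger line count:

split -l 100000000 -d source destination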

answered Oct 30 '22 by karakfa