I have a large .csv file that is well over 300 GB. I would like to chunk it into smaller files of 100,000,000 rows each (each row is approximately 55-60 bytes).
I wrote the following code:
import pandas as pd
df = pd.read_csv('/path/to/really/big.csv', header=None, chunksize=100000000)
count = 1
for chunk in df:
    name = '/output/to/this/directory/file_%s.csv' % count
    chunk.to_csv(name, header=None, index=None)
    print(count)
    count += 1
This code works fine, and I have plenty of disk space to store the approximately 5.5-6 GB at a time, but it's slow.
Is there a better way?
EDIT
I have written the following iterative solution:
import csv
with open('/path/to/really/big.csv', 'r') as csvfile:
    read_rows = csv.reader(csvfile)
    file_count = 1
    row_count = 1
    f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
    for row in read_rows:
        # rejoin the parsed fields with commas and terminate the line
        f.write(','.join(row) + '\n')
        row_count += 1
        if row_count % 100000000 == 0:
            f.close()
            file_count += 1
            f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
    f.close()
EDIT 2
I would like to call attention to Vor's comment about using the Unix/Linux split command; it is the fastest solution I have found.
To split a big binary file into multiple smaller files, read a chunk of the size you want each output file to be, write that chunk to a new file, then read the next chunk, repeating until you reach the end of the original file.
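For illustration, here is a minimal sketch of that read-a-chunk, write-a-chunk loop in Python; the chunk size and the source/destination names are placeholders rather than values from the question:

chunk_size = 1024 * 1024 * 1024  # bytes per output file (placeholder value)
part = 0
with open('source', 'rb') as src:
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break  # reached the end of the original file
        with open('destination_%02d' % part, 'wb') as dst:
            dst.write(chunk)
        part += 1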
There is an existing tool for this on Unix/Linux:

split -l 100000 -d source destination

This splits source every 100,000 lines (use -l 100000000 for the 100,000,000-row chunks from the question) and adds a two-digit numeric suffix to the destination prefix for each chunk.
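Applied to the file from the question, the invocation would look roughly like this (the output prefix here is only an example):

split -l 100000000 -d /path/to/really/big.csv /output/to/this/directory/file_

This produces file_00, file_01, file_02, and so on, each containing 100,000,000 lines except possibly the last.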