
How can I split a large CSV file (7GB) in Python?

Tags: python, split, csv

I have a 7GB CSV file that I'd like to split into smaller chunks so that it is readable and faster to analyze in Python on a notebook. I would like to grab a small subset of it, maybe 250MB. How can I do this?

Sohail asked Nov 17 '13 17:11




3 Answers

You don't need Python to split a csv file. Using your shell:

$ split -l 100 data.csv

This splits data.csv into chunks of 100 lines each, written to files named xaa, xab, and so on. Only the first chunk contains the header row.
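
Since split -l counts lines rather than bytes, here is a rough Python sketch for the "grab about 250MB" part of the question. It copies whole lines from the start of the file until a byte budget is reached. The function name copy_head and the output name sample.csv are made up for illustration, and it assumes no field contains a line break inside quotes.

```python
def copy_head(src_path, dst_path, max_bytes=250 * 1024 * 1024):
    """Copy whole lines from src_path to dst_path until about max_bytes are written.

    Hypothetical helper, not part of any library. Returns the bytes written.
    """
    written = 0
    # Binary mode: line lengths are exact byte counts and nothing is decoded.
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        for line in src:
            # Always keep the first line (the header), then stop before
            # the budget would be exceeded.
            if written and written + len(line) > max_bytes:
                break
            dst.write(line)
            written += len(line)
    return written
```

Calling copy_head('data.csv', 'sample.csv') would give a sample of at most roughly 250MB that still starts with the header row. Note that it takes the first rows only, so it is not a random sample.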

Thomas Orozco answered Oct 10 '22 19:10


I had to do a similar task, and used the pandas package:

import pandas as pd

for i, chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):  # 500,000 rows per chunk
    chunk.to_csv('chunk{}.csv'.format(i), index=False)

Quentin Febvre answered Oct 10 '22 20:10


Here is a small Python script I used to split a file, data.csv, into several CSV part files. The number of part files is controlled by chunk_size (the number of data lines per part file).

The header line (column names) of the original file is copied into every part CSV file.

It works for big files because it iterates over the file one line at a time and keeps at most chunk_size lines in memory, instead of loading the complete file into memory at once.

#!/usr/bin/env python3

def main():
    chunk_size = 9998  # lines

    def write_chunk(part, lines):
        with open('data_part_'+ str(part) +'.csv', 'w') as f_out:
            f_out.write(header)
            f_out.writelines(lines)

    with open('data.csv', 'r') as f:
        count = 0
        header = f.readline()
        lines = []
        for line in f:
            count += 1
            lines.append(line)
            if count % chunk_size == 0:
                write_chunk(count // chunk_size, lines)
                lines = []
        # write remainder
        if len(lines) > 0:
            write_chunk((count // chunk_size) + 1, lines)

if __name__ == '__main__':
    main()
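
One caveat with splitting on raw lines: if a quoted field contains a line break, a record can be cut in half at a chunk boundary. Below is a sketch of a variant that uses the standard csv module to count records instead of lines. The function name split_csv and its parameters are my own invention, and it assumes the file uses the default comma dialect and your platform's default text encoding.

```python
import csv
from itertools import islice


def split_csv(src_path, rows_per_part, prefix='data_part_'):
    """Split src_path into numbered part files, repeating the header in each.

    Hypothetical helper for illustration. Returns the list of paths written.
    """
    part_paths = []
    # newline='' lets the csv module handle line endings inside quoted fields.
    with open(src_path, newline='') as f_in:
        reader = csv.reader(f_in)
        header = next(reader, None)
        if header is None:
            return part_paths  # empty input: nothing to write
        part = 1
        while True:
            # Pull at most rows_per_part records; only these are held in memory.
            rows = list(islice(reader, rows_per_part))
            if not rows:
                break
            path = '{}{}.csv'.format(prefix, part)
            with open(path, 'w', newline='') as f_out:
                writer = csv.writer(f_out)
                writer.writerow(header)
                writer.writerows(rows)
            part_paths.append(path)
            part += 1
    return part_paths
```

For example, split_csv('data.csv', 9998) would write data_part_1.csv, data_part_2.csv, and so on. Because the rows pass through csv.writer, quoting and line endings in the output may differ from the original even though the data is the same.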

Roberto answered Oct 10 '22 20:10