I have a 7GB csv
file which I'd like to split into smaller chunks, so it is readable and faster for analysis in Python on a notebook. I would like to grab a small set from it, maybe 250MB, so how can I do this?
So, how do you open large CSV files in Excel? Essentially, there are two options: Split the CSV file into multiple smaller files that do fit within the 1,048,576 row limit; or, Find an Excel add-in that supports CSV files with a higher number of rows.
To split a big binary file in multiple files, you should first read the file by the size of chunk you want to create, then write that chunk to a file, read the next chunk and repeat until you reach the end of original file.
You don't need Python to split a csv file. Using your shell:
$ split -l 100 data.csv
Would split data.csv
in chunks of 100 lines.
I had to do a similar task, and used the pandas package:
for i,chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
chunk.to_csv('chunk{}.csv'.format(i), index=False)
Here is a little python script I used to split a file data.csv
into several CSV part files. The number of part files can be controlled with chunk_size
(number of lines per part file).
The header line (column names) of the original file is copied into every part CSV file.
It works for big files because it reads one line at a time with readline()
instead of loading the complete file into memory at once.
#!/usr/bin/env python3
def main():
chunk_size = 9998 # lines
def write_chunk(part, lines):
with open('data_part_'+ str(part) +'.csv', 'w') as f_out:
f_out.write(header)
f_out.writelines(lines)
with open('data.csv', 'r') as f:
count = 0
header = f.readline()
lines = []
for line in f:
count += 1
lines.append(line)
if count % chunk_size == 0:
write_chunk(count // chunk_size, lines)
lines = []
# write remainder
if len(lines) > 0:
write_chunk((count // chunk_size) + 1, lines)
if __name__ == '__main__':
main()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With