
Pandas - split large excel file

I have an Excel file with about 500,000 rows and I want to split it into several Excel files, each with 50,000 rows.

I want to do it with pandas, since that should be the quickest and easiest way.

Any ideas how to do it?

Thank you for your help.

TheDaJon asked Dec 25 '16 12:12




3 Answers

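Assuming that your Excel file has only one (first) sheet containing data, my first idea was to use the chunksize parameter. Note that pd.read_excel does not implement chunksize (unlike pd.read_csv), so this loop raises an error instead of iterating over chunks:

import pandas as pd
import numpy as np

i = 0
# NOTE: does not work, read_excel has no chunked reading
for df in pd.read_excel(file_name, chunksize=50000):
    df.to_excel('/path/to/file_{:02d}.xlsx'.format(i), index=False)
    i += 1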

UPDATE:

chunksize = 50000
df = pd.read_excel(file_name)
# np.split needs len(df) to be an exact multiple of chunksize
for i, chunk in enumerate(np.split(df, len(df) // chunksize)):
    chunk.to_excel('/path/to/file_{:02d}.xlsx'.format(i), index=False)
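
If the row count is not an exact multiple of the chunk size, one way around it is to slice the DataFrame by position instead of calling np.split. This is only a sketch under a few assumptions: chunk_bounds and split_excel are hypothetical helper names (not part of pandas), the data sits in the first sheet, and the whole sheet fits in memory. It swaps np.split for plain df.iloc slicing, so the last file simply holds whatever rows are left over.

```python
from pathlib import Path


def chunk_bounds(n_rows, chunksize):
    """Return (start, stop) row positions covering n_rows in steps of chunksize."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    return [(start, min(start + chunksize, n_rows))
            for start in range(0, n_rows, chunksize)]


def split_excel(src, dest_dir, chunksize=50000):
    """Read the first sheet of src and write files of at most chunksize rows."""
    # Imported here so chunk_bounds can be used without pandas installed.
    import pandas as pd

    df = pd.read_excel(src)  # loads the whole sheet into memory
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for i, (start, stop) in enumerate(chunk_bounds(len(df), chunksize)):
        out = dest_dir / "file_{:02d}.xlsx".format(i)
        # Positional slice; the last chunk may be shorter than chunksize.
        df.iloc[start:stop].to_excel(out, index=False)
        written.append(out)
    return written
```

Usage would look something like split_excel("path/to/file.xlsx", "path/to/destination/folder"), which returns the list of files it wrote. Keeping the boundary arithmetic in its own function makes it easy to check without touching any Excel files.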

MaxU - stop WAR against UA answered Oct 11 '22 17:10


Use np.array_split, as per this answer https://stackoverflow.com/a/17315875/1394890, if you get the error:

array split does not result in an equal division
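
To see the difference on something small, here is a sketch using a plain NumPy array of 10 items as a stand-in for row positions (the array and the section count of 3 are made up for illustration). np.split refuses an uneven division, while np.array_split accepts it and makes the pieces as equal as it can.

```python
import numpy as np

rows = np.arange(10)  # stand-in for 10 row positions

# np.split insists on equal-sized pieces: 10 is not divisible by 3.
try:
    np.split(rows, 3)
    split_error = None
except ValueError as exc:
    split_error = str(exc)

# np.array_split allows a remainder; the first pieces get the extra items.
parts = np.array_split(rows, 3)
sizes = [len(part) for part in parts]

print(split_error)
print(sizes)
```

Keep in mind that np.array_split takes the number of pieces, not the rows per piece, so the resulting files come out nearly equal in size rather than exactly 50,000 rows each with a short last file.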

wild answered Oct 11 '22 16:10


As explained by MaxU, I would also use a chunksize variable and divide the total number of rows in the large file by the required number of rows per file to get the number of pieces.

import pandas as pd
import numpy as np

chunksize = 50000
i=0
df = pd.read_excel("path/to/file.xlsx")
for chunk in np.split(df, len(df) // chunksize):
    chunk.to_excel('path/to/destination/folder/file_{:02d}.xlsx'.format(i), index=True)
    i += 1

Hope this helps.

Tarun Balani answered Oct 11 '22 17:10