I have a 14MB Excel file with five worksheets that I'm reading into a Pandas dataframe, and although the code below works, it takes 9 minutes!
Does anyone have suggestions for speeding it up?
import pandas as pd def OTT_read(xl,site_name): df = pd.read_excel(xl.io,site_name,skiprows=2,parse_dates=0,index_col=0, usecols=[0,1,2],header=None, names=['date_time','%s_depth'%site_name,'%s_temp'%site_name]) return df def make_OTT_df(FILEDIR,OTT_FILE): xl = pd.ExcelFile(FILEDIR + OTT_FILE) site_names = xl.sheet_names df_list = [OTT_read(xl,site_name) for site_name in site_names] return site_names,df_list FILEDIR='c:/downloads/' OTT_FILE='OTT_Data_All_stations.xlsx' site_names_OTT,df_list_OTT = make_OTT_df(FILEDIR,OTT_FILE)
Importing csv files in Python is 100x faster than Excel files. We can now load these files in 0.63 seconds. That's nearly 10 times faster! Python loads CSV files 100 times faster than Excel files.
To read an excel file as a DataFrame, use the pandas read_excel() method. You can read the first sheet, specific sheets, multiple sheets or all sheets. Pandas converts this to the DataFrame structure, which is a tabular like structure.
To read all excel files in a directory, use the Glob module and the read_excel() method.
As others have suggested, csv reading is faster. So if you are on windows and have Excel, you could call a vbscript to convert the Excel to csv and then read the csv. I tried the script below and it took about 30 seconds.
# create a list with sheet numbers you want to process sheets = map(str,range(1,6)) # convert each sheet to csv and then read it using read_csv df={} from subprocess import call excel='C:\\Users\\rsignell\\OTT_Data_All_stations.xlsx' for sheet in sheets: csv = 'C:\\Users\\rsignell\\test' + sheet + '.csv' call(['cscript.exe', 'C:\\Users\\rsignell\\ExcelToCsv.vbs', excel, csv, sheet]) df[sheet]=pd.read_csv(csv)
Here's a little snippet of python to create the ExcelToCsv.vbs script:
#write vbscript to file vbscript="""if WScript.Arguments.Count < 3 Then WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>" Wscript.Quit End If csv_format = 6 Set objFSO = CreateObject("Scripting.FileSystemObject") src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0)) dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1)) worksheet_number = CInt(WScript.Arguments.Item(2)) Dim oExcel Set oExcel = CreateObject("Excel.Application") Dim oBook Set oBook = oExcel.Workbooks.Open(src_file) oBook.Worksheets(worksheet_number).Activate oBook.SaveAs dest_file, csv_format oBook.Close False oExcel.Quit """; f = open('ExcelToCsv.vbs','w') f.write(vbscript.encode('utf-8')) f.close()
This answer benefited from Convert XLS to CSV on command line and csv & xlsx files import to pandas data frame: speed issue
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With