Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster way to read Excel files to pandas dataframe

I have a 14MB Excel file with five worksheets that I'm reading into a Pandas dataframe, and although the code below works, it takes 9 minutes!

Does anyone have suggestions for speeding it up?

import pandas as pd  def OTT_read(xl,site_name):     df = pd.read_excel(xl.io,site_name,skiprows=2,parse_dates=0,index_col=0,                        usecols=[0,1,2],header=None,                        names=['date_time','%s_depth'%site_name,'%s_temp'%site_name])     return df  def make_OTT_df(FILEDIR,OTT_FILE):     xl = pd.ExcelFile(FILEDIR + OTT_FILE)     site_names = xl.sheet_names     df_list = [OTT_read(xl,site_name) for site_name in site_names]     return site_names,df_list  FILEDIR='c:/downloads/' OTT_FILE='OTT_Data_All_stations.xlsx' site_names_OTT,df_list_OTT = make_OTT_df(FILEDIR,OTT_FILE) 
like image 477
jsignell Avatar asked Feb 27 '15 13:02

jsignell


People also ask

Do you read Excel files with Python there is a 1000x faster way?

Importing csv files in Python is 100x faster than Excel files. We can now load these files in 0.63 seconds. That's nearly 10 times faster! Python loads CSV files 100 times faster than Excel files.

How do I read an Excel file to pandas DataFrame?

To read an excel file as a DataFrame, use the pandas read_excel() method. You can read the first sheet, specific sheets, multiple sheets or all sheets. Pandas converts this to the DataFrame structure, which is a tabular like structure.

How do I view multiple Excel files in pandas?

To read all excel files in a directory, use the Glob module and the read_excel() method.


1 Answers

As others have suggested, csv reading is faster. So if you are on windows and have Excel, you could call a vbscript to convert the Excel to csv and then read the csv. I tried the script below and it took about 30 seconds.

# create a list with sheet numbers you want to process sheets = map(str,range(1,6))  # convert each sheet to csv and then read it using read_csv df={} from subprocess import call excel='C:\\Users\\rsignell\\OTT_Data_All_stations.xlsx' for sheet in sheets:     csv = 'C:\\Users\\rsignell\\test' + sheet + '.csv'      call(['cscript.exe', 'C:\\Users\\rsignell\\ExcelToCsv.vbs', excel, csv, sheet])     df[sheet]=pd.read_csv(csv) 

Here's a little snippet of python to create the ExcelToCsv.vbs script:

#write vbscript to file vbscript="""if WScript.Arguments.Count < 3 Then     WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"     Wscript.Quit End If  csv_format = 6  Set objFSO = CreateObject("Scripting.FileSystemObject")  src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0)) dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1)) worksheet_number = CInt(WScript.Arguments.Item(2))  Dim oExcel Set oExcel = CreateObject("Excel.Application")  Dim oBook Set oBook = oExcel.Workbooks.Open(src_file) oBook.Worksheets(worksheet_number).Activate  oBook.SaveAs dest_file, csv_format  oBook.Close False oExcel.Quit """;  f = open('ExcelToCsv.vbs','w') f.write(vbscript.encode('utf-8')) f.close() 

This answer benefited from Convert XLS to CSV on command line and csv & xlsx files import to pandas data frame: speed issue

like image 194
Rich Signell Avatar answered Sep 29 '22 22:09

Rich Signell