I have a 140 MB Excel file I need to analyze using pandas. The problem is that if I open this file as xlsx, it takes Python 5 minutes simply to read it. When I manually save the same file as CSV, Python opens and reads it in about a second! The various 2012-2014 solutions floating around don't really work on my end, presumably because they predate Python 3.
Can somebody suggest how to very quickly convert the file 'C:\master_file.xlsx' to 'C:\master_file.csv'?
Create a variable to store the path of the input Excel file. Pass that path to the openpyxl module's load_workbook() function to create/load a workbook object. Then open an output CSV file in write mode with open() and use the csv module's writer() to write the worksheet rows out, converting the input Excel file into a CSV file.
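A minimal sketch of that approach (the file names are placeholders, and I'm assuming the first/active sheet is the one you want):

import csv
import openpyxl

# Path of the input Excel file (hypothetical name)
input_path = "master_file.xlsx"

# Load the workbook and take its active (first) sheet
wb = openpyxl.load_workbook(input_path)
ws = wb.active

# Open the output CSV in write mode and stream the rows into it
with open("master_file.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in ws.iter_rows(values_only=True):
        writer.writerow(row)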
There is a project called "rows" that aims to be very Pythonic about dealing with data. It relies on "openpyxl" for xlsx, though, so I don't know if this will be faster than Pandas, but anyway:
$ pip install rows openpyxl
And:
import rows
data = rows.import_from_xlsx("my_file.xlsx")
rows.export_to_csv(data, open("my_file.csv", "wb"))
I faced the same problem as you; Pandas and openpyxl didn't work for me.
I came across this solution and it worked great for me:
import win32com.client

# Start Excel via COM and suppress its dialog boxes
xl = win32com.client.Dispatch("Excel.Application")
xl.DisplayAlerts = False

# Open the workbook read-only and save it back out as CSV
xl.Workbooks.Open(Filename=your_file_path, ReadOnly=1)
wb = xl.Workbooks(1)
wb.SaveAs(Filename='new_file.csv', FileFormat=6)  # 6 = xlCSV
wb.Close(False)

xl.Application.Quit()
wb = None
xl = None
Here you convert the file to CSV by means of Excel itself (so Excel must be installed, and this is Windows-only). All the other ways I tried refused to work.
Use read-only mode in openpyxl. Something like the following should work.
import csv
import openpyxl

wb = openpyxl.load_workbook("myfile.xlsx", read_only=True)
ws = wb['sheetname']

# In Python 3 the CSV file must be opened in text mode with newline=''
with open("myfile.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in ws:
        values = (cell.value for cell in row)
        writer.writerow(values)
Fastest way that pops to mind:
As an added benefit, you'll be able to do cleanup of the data before saving it to csv.
import pandas as pd

# Raw strings avoid surprises with backslashes in Windows paths
df = pd.read_excel(r'C:\master_file.xlsx', header=0)  # , sheet_name='<your sheet>'
df.to_csv(r'C:\master_file.csv', index=False, quotechar="'")
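For instance, the cleanup mentioned above could look something like this (the specific operations are just illustrative assumptions, not part of the original answer):

# Illustrative cleanup before writing the CSV
df = df.dropna(how='all')                      # drop completely empty rows
df.columns = [c.strip() for c in df.columns]   # tidy up header whitespace
df.to_csv(r'C:\master_file.csv', index=False, quotechar="'")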
At some point, dealing with lots of data will take lots of time. Just a fact of life. Good to look for options if it's a problem, though.