Reading data (just 20000 numbers) from an .xlsx file takes forever:
import pandas as pd
xlsxfile = pd.ExcelFile("myfile.xlsx")
data = xlsxfile.parse('Sheet1', index_col = None, header = None)
This takes about 9 seconds.
If I save the same file in CSV format, it takes ~25 ms:
import pandas as pd
csvfile = "myfile.csv"
data = pd.read_csv(csvfile, index_col = None, header = None)
Is this an issue with openpyxl, or am I missing something? Are there any alternatives?
A CSV (comma-separated values) file is a plain-text file whose format stores data in a tabular structure.
The difference between the CSV and XLS file formats is that CSV is a plain-text format in which values are separated by commas, while XLS is a binary Excel format that holds information about all the worksheets in a file, including both content and formatting.
A CSV file is a list of data separated by commas. For instance, it may look like the following:
Name,email,phone number,address
Example,[email protected],555-555-5555,Example Address
Example2,[email protected],555-555-5551,Example2 Address
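As a rough illustration of how simple this format is to read, the sample rows above can be parsed with Python's standard csv module. This is only a sketch: the e-mail addresses are hypothetical placeholders, and SAMPLE and parse_rows are names made up for this example.

```python
import csv
import io

# Sample data modeled on the rows above; the e-mail addresses are
# hypothetical placeholders, not values from the original text.
SAMPLE = (
    "Name,email,phone number,address\n"
    "Example,example@example.com,555-555-5555,Example Address\n"
    "Example2,example2@example.com,555-555-5551,Example2 Address\n"
)


def parse_rows(text):
    """Return one dict per data row, keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text)))


rows = parse_rows(SAMPLE)
for row in rows:
    print(row["Name"], row["phone number"])
```

Because every record is just one line of text split on commas, there is no workbook structure or formatting to decode, which is part of why reading CSV tends to be much cheaper than reading a spreadsheet file.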
xlrd has support for .xlsx files, and this answer suggests that at least the beta version of xlrd with .xlsx support was quicker than openpyxl.
The current stable version of pandas (0.11.0) uses openpyxl for .xlsx files, but this has been changed for the next release. If you want to give it a go, you can download the dev version from GitHub.
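If you want to check on your own machine which reader is slow, a small timing helper along these lines may be enough. It is a minimal sketch: time_reader is a hypothetical name, and the commented usage assumes pandas is installed and that the files from the question exist.

```python
import time


def time_reader(reader, *args, repeats=3, **kwargs):
    """Call reader(*args, **kwargs) `repeats` times; return the best time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        reader(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


# Example usage (assumes pandas and the question's files are available):
#   import pandas as pd
#   print(time_reader(pd.read_csv, "myfile.csv", header=None))
#   print(time_reader(pd.read_excel, "myfile.xlsx", header=None))
```

Taking the best of a few runs rather than a single measurement reduces noise from disk caching and other processes. In recent pandas versions, read_excel also accepts an engine argument (for example engine="openpyxl"), so the same helper can be used to compare Excel backends.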