
csv & xlsx files import to pandas data frame: speed issue

Reading data (just 20000 numbers) from an xlsx file takes forever:

import pandas as pd
xlsxfile = pd.ExcelFile("myfile.xlsx")
data = xlsxfile.parse('Sheet1', index_col = None, header = None)

takes about 9 seconds.

If I save the same file in csv format it takes ~25ms:

import pandas as pd
csvfile = "myfile.csv"
data = pd.read_csv(csvfile, index_col = None, header = None)

Is this an issue with openpyxl, or am I missing something? Are there any alternatives?
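For anyone who wants to reproduce the comparison, here is a minimal timing sketch. The `best_time` helper, the 2000 x 10 layout of the 20000 numbers, and the temporary file location are assumptions made for illustration; they are not part of the original measurement.

```python
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd


def best_time(func, *args, repeat=3, **kwargs):
    """Return the fastest wall-clock time (in seconds) over `repeat` calls."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Hypothetical test data: 20000 random numbers arranged as 2000 rows x 10 columns.
frame = pd.DataFrame(np.random.rand(2000, 10))
csv_path = Path(tempfile.mkdtemp()) / "myfile.csv"
frame.to_csv(csv_path, index=False, header=False)

# Time the CSV read with the same options used in the question.
csv_seconds = best_time(pd.read_csv, csv_path, index_col=None, header=None)
print(f"read_csv: {csv_seconds * 1000:.1f} ms")
```

The same helper can time the Excel path, for example `best_time(pd.read_excel, "myfile.xlsx", sheet_name="Sheet1", header=None)`, provided an Excel reader such as openpyxl is installed. Absolute numbers will differ by machine and library version.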

asked Apr 24 '13 03:04 by sashkello



1 Answer

xlrd has support for .xlsx files, and this answer suggests that at least the beta version of xlrd with .xlsx support was quicker than openpyxl.

The current stable version of pandas (0.11.0) uses openpyxl for .xlsx files, but this has been changed for the next release. If you want to give it a go, you can download the dev version from GitHub.
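In more recent pandas releases the reader can be chosen explicitly through the `engine` argument of `pd.read_excel`, so trying a different backend does not require a dev build. A minimal sketch, assuming a recent pandas and that the chosen engine package is installed; `read_sheet` is a hypothetical wrapper name, not a pandas function. Note that, as far as I know, xlrd 2.0 and later no longer read .xlsx files, so the xlrd suggestion above applies to older versions only.

```python
import pandas as pd


def read_sheet(path, sheet="Sheet1", engine=None):
    """Read one sheet with no header row and no index column.

    engine=None lets pandas pick a reader from the file type;
    pass a name such as "openpyxl" to force a specific backend.
    """
    return pd.read_excel(
        path,
        sheet_name=sheet,
        header=None,
        index_col=None,
        engine=engine,
    )
```

For example, `read_sheet("myfile.xlsx", engine="openpyxl")` forces openpyxl, which makes it easy to time one backend against another on the same file.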

answered Sep 30 '22 14:09 by Matti John