Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to open a huge excel file efficiently

I have a 150MB one-sheet excel file that takes about 7 minutes to open on a very powerful machine using the following:

# using python import xlrd wb = xlrd.open_workbook(file) sh = wb.sheet_by_index(0) 

Is there any way to open the excel file quicker? I'm open to even very outlandish suggestions (such as hadoop, spark, c, java, etc.). Ideally I'm looking for a way to open the file in under 30 seconds if that's not a pipe dream. Also, the above example is using python, but it doesn't have to be python.


Note: this is an Excel file from a client. It cannot be converted into any other format before we receive it. It is not our file


UPDATE: Answer with a working example of code that will open the following 200MB excel file in under 30 seconds will be rewarded with bounty: https://drive.google.com/file/d/0B_CXvCTOo7_2VW9id2VXRWZrbzQ/view?usp=sharing. This file should have string (col 1), date (col 9), and number (col 11).

like image 874
David542 Avatar asked Sep 22 '17 22:09

David542


People also ask

Why do large Excel files take so long to open?

Styles: When a workbook has too many cell styles, shapes, or formatting, it can cause Excel Online to take many times longer to open a file than usual. To see the cell styles for an Excel workbook, locate the Home tab, and then click the Down arrow in the lower-right corner of the Styles group.

How do you open an Excel file with more than 1 million rows?

So, how do you open large CSV files in Excel? Essentially, there are two options: Split the CSV file into multiple smaller files that do fit within the 1,048,576 row limit; or, Find an Excel add-in that supports CSV files with a higher number of rows.


1 Answers

Most programming languages that work with Office products have some middle layer and this is usually where the bottleneck is, a good example is using PIA's/Interop or Open XML SDK.

One way to get the data at a lower level (bypassing the middle layer) is using a Driver.

150MB one-sheet excel file that takes about 7 minutes.

The best I could do is a 130MB file in 135 seconds, roughly 3 times faster:

Stopwatch sw = new Stopwatch(); sw.Start();  DataSet excelDataSet = new DataSet();  string filePath = @"c:\temp\BigBook.xlsx";  // For .XLSXs we use =Microsoft.ACE.OLEDB.12.0;, for .XLS we'd use Microsoft.Jet.OLEDB.4.0; with  "';Extended Properties=\"Excel 8.0;HDR=YES;\""; string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + filePath + "';Extended Properties=\"Excel 12.0;HDR=YES;\"";  using (OleDbConnection conn = new OleDbConnection(connectionString)) {     conn.Open();     OleDbDataAdapter objDA = new System.Data.OleDb.OleDbDataAdapter     ("select * from [Sheet1$]", conn);     objDA.Fill(excelDataSet);     //dataGridView1.DataSource = excelDataSet.Tables[0]; } sw.Stop(); Debug.Print("Load XLSX tool: " + sw.ElapsedMilliseconds + " millisecs. Records = "  + excelDataSet.Tables[0].Rows.Count); 

enter image description here

Win 7x64, Intel i5, 2.3ghz, 8GB ram, SSD250GB.

If I could recommend a hardware solution as well, try to resolve it with an SSD if you're using standard HDD's.

Note: I cant download your Excel spreadsheet example as I'm behind a corporate firewall.

PS. See MSDN - Fastest Way to import xlsx files with 200 MB of Data, the consensus being OleDB is the fastest.

PS 2. Here's how you can do it with python: http://code.activestate.com/recipes/440661-read-tabular-data-from-excel-spreadsheets-the-fast/

like image 102
Jeremy Thompson Avatar answered Sep 18 '22 05:09

Jeremy Thompson