I have a 150MB one-sheet excel file that takes about 7 minutes to open on a very powerful machine using the following:
# using python import xlrd wb = xlrd.open_workbook(file) sh = wb.sheet_by_index(0)
Is there any way to open the excel file quicker? I'm open to even very outlandish suggestions (such as hadoop, spark, c, java, etc.). Ideally I'm looking for a way to open the file in under 30 seconds if that's not a pipe dream. Also, the above example is using python, but it doesn't have to be python.
Note: this is an Excel file from a client. It cannot be converted into any other format before we receive it. It is not our file
UPDATE: Answer with a working example of code that will open the following 200MB excel file in under 30 seconds will be rewarded with bounty: https://drive.google.com/file/d/0B_CXvCTOo7_2VW9id2VXRWZrbzQ/view?usp=sharing. This file should have string (col 1), date (col 9), and number (col 11).
Styles: When a workbook has too many cell styles, shapes, or formatting, it can cause Excel Online to take many times longer to open a file than usual. To see the cell styles for an Excel workbook, locate the Home tab, and then click the Down arrow in the lower-right corner of the Styles group.
So, how do you open large CSV files in Excel? Essentially, there are two options: Split the CSV file into multiple smaller files that do fit within the 1,048,576 row limit; or, Find an Excel add-in that supports CSV files with a higher number of rows.
Most programming languages that work with Office products have some middle layer and this is usually where the bottleneck is, a good example is using PIA's/Interop or Open XML SDK.
One way to get the data at a lower level (bypassing the middle layer) is using a Driver.
150MB one-sheet excel file that takes about 7 minutes.
The best I could do is a 130MB file in 135 seconds, roughly 3 times faster:
Stopwatch sw = new Stopwatch(); sw.Start(); DataSet excelDataSet = new DataSet(); string filePath = @"c:\temp\BigBook.xlsx"; // For .XLSXs we use =Microsoft.ACE.OLEDB.12.0;, for .XLS we'd use Microsoft.Jet.OLEDB.4.0; with "';Extended Properties=\"Excel 8.0;HDR=YES;\""; string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + filePath + "';Extended Properties=\"Excel 12.0;HDR=YES;\""; using (OleDbConnection conn = new OleDbConnection(connectionString)) { conn.Open(); OleDbDataAdapter objDA = new System.Data.OleDb.OleDbDataAdapter ("select * from [Sheet1$]", conn); objDA.Fill(excelDataSet); //dataGridView1.DataSource = excelDataSet.Tables[0]; } sw.Stop(); Debug.Print("Load XLSX tool: " + sw.ElapsedMilliseconds + " millisecs. Records = " + excelDataSet.Tables[0].Rows.Count);
Win 7x64, Intel i5, 2.3ghz, 8GB ram, SSD250GB.
If I could recommend a hardware solution as well, try to resolve it with an SSD if you're using standard HDD's.
Note: I cant download your Excel spreadsheet example as I'm behind a corporate firewall.
PS. See MSDN - Fastest Way to import xlsx files with 200 MB of Data, the consensus being OleDB is the fastest.
PS 2. Here's how you can do it with python: http://code.activestate.com/recipes/440661-read-tabular-data-from-excel-spreadsheets-the-fast/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With