
How to read large worksheets from large Excel files (27MB+) with PHPExcel?

Tags:

php

phpexcel

I have large Excel worksheets that I want to be able to read into MySQL using PHPExcel.

I am using the recent patch which allows you to read in Worksheets without opening the whole file. This way I can read one worksheet at a time.
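For reference, a minimal sketch of that per-worksheet approach (assuming PHPExcel with the patch applied; `big_file.xlsx` is a placeholder name):

```php
// Sketch: load one worksheet at a time instead of the whole workbook.
$objReader = PHPExcel_IOFactory::createReader('Excel2007');
$worksheetNames = $objReader->listWorksheetNames('big_file.xlsx');

foreach ($worksheetNames as $sheetName) {
    // Restrict the load to a single named sheet
    $objReader->setLoadSheetsOnly($sheetName);
    $objPHPExcel = $objReader->load('big_file.xlsx');
    // ... process $objPHPExcel->getSheetByName($sheetName) here ...
    $objPHPExcel->disconnectWorksheets();   // free memory before the next sheet
    unset($objPHPExcel);
}
```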

However, one Excel file is 27 MB. I can successfully read the first worksheet since it is small, but the second worksheet is so large that the cron job that started the process at 22:00 was still not finished at 8:00 the next morning; the worksheet is simply too big.

Is there any way to read in a worksheet line by line, e.g. something like this:

```php
$inputFileType = 'Excel2007';
$inputFileName = 'big_file.xlsx';
$objReader = PHPExcel_IOFactory::createReader($inputFileType);
$worksheetNames = $objReader->listWorksheetNames($inputFileName);

foreach ($worksheetNames as $sheetName) {
    // BELOW IS "WISH CODE":
    for ($row = 1; $row <= $max_rows; $row += 100) {
        $dataset = $objReader->getWorksheetWithRows($row, $row + 100);
        save_dataset_to_database($dataset);
    }
}
```

Addendum

@Mark, I used the code you posted to create the following example:

```php
function readRowsFromWorksheet()
{
    $file_name = htmlentities($_POST['file_name']);
    $file_type = htmlentities($_POST['file_type']);

    echo 'Read rows from worksheet:<br />';
    debug_log('----------start');
    $objReader = PHPExcel_IOFactory::createReader($file_type);
    $chunkSize = 20;
    $chunkFilter = new ChunkReadFilter();
    $objReader->setReadFilter($chunkFilter);

    for ($startRow = 2; $startRow <= 240; $startRow += $chunkSize) {
        $chunkFilter->setRows($startRow, $chunkSize);
        $objPHPExcel = $objReader->load('data/' . $file_name);
        debug_log('reading chunk starting at row ' . $startRow);
        $sheetData = $objPHPExcel->getActiveSheet()->toArray(null, true, true, true);
        var_dump($sheetData);
        echo '<hr />';
    }
    debug_log('end');
}
```

As the following log file shows, it runs fine on a small 8 KB Excel file, but on a 3 MB Excel file it never gets past the first chunk. Is there any way I can optimize this code for performance? Otherwise it does not look performant enough to extract chunks from a large Excel file:

```
2011-01-12 11:07:15: ----------start
2011-01-12 11:07:15: reading chunk starting at row 2
2011-01-12 11:07:15: reading chunk starting at row 22
2011-01-12 11:07:15: reading chunk starting at row 42
2011-01-12 11:07:15: reading chunk starting at row 62
2011-01-12 11:07:15: reading chunk starting at row 82
2011-01-12 11:07:15: reading chunk starting at row 102
2011-01-12 11:07:15: reading chunk starting at row 122
2011-01-12 11:07:15: reading chunk starting at row 142
2011-01-12 11:07:15: reading chunk starting at row 162
2011-01-12 11:07:15: reading chunk starting at row 182
2011-01-12 11:07:15: reading chunk starting at row 202
2011-01-12 11:07:15: reading chunk starting at row 222
2011-01-12 11:07:15: end
2011-01-12 11:07:52: ----------start
2011-01-12 11:08:01: reading chunk starting at row 2
(...at 11:18, CPU usage at 93%, still running...)
```

Addendum 2

When I comment out:

```php
//$sheetData = $objPHPExcel->getActiveSheet()->toArray(null, true, true, true);
//var_dump($sheetData);
```

Then it parses at an acceptable speed (about 2 rows per second). Is there any way to increase the performance of toArray()?

```
2011-01-12 11:40:51: ----------start
2011-01-12 11:40:59: reading chunk starting at row 2
2011-01-12 11:41:07: reading chunk starting at row 22
2011-01-12 11:41:14: reading chunk starting at row 42
2011-01-12 11:41:22: reading chunk starting at row 62
2011-01-12 11:41:29: reading chunk starting at row 82
2011-01-12 11:41:37: reading chunk starting at row 102
2011-01-12 11:41:45: reading chunk starting at row 122
2011-01-12 11:41:52: reading chunk starting at row 142
2011-01-12 11:42:00: reading chunk starting at row 162
2011-01-12 11:42:07: reading chunk starting at row 182
2011-01-12 11:42:15: reading chunk starting at row 202
2011-01-12 11:42:22: reading chunk starting at row 222
2011-01-12 11:42:22: end
```
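One thing that might help (an assumption on my part, not something benchmarked here): restrict the conversion to the chunk's own range with rangeToArray() instead of calling toArray() on the whole sheet, so PHPExcel does not walk the sheet's full calculated dimension on every iteration. The column bound `'F'` below is a placeholder for the real last column:

```php
// Sketch: convert only the chunk's range rather than the whole sheet.
// $startRow and $chunkSize match the loop variables used above.
$endRow = $startRow + $chunkSize - 1;
$sheetData = $objPHPExcel->getActiveSheet()
    ->rangeToArray('A' . $startRow . ':F' . $endRow, null, true, true, true);
```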

Addendum 3

This seems to work adequately, at least on the 3 MB file:

```php
for ($startRow = 2; $startRow <= 240; $startRow += $chunkSize) {
    echo 'Loading WorkSheet using configurable filter for headings row 1 and for rows ',
         $startRow, ' to ', ($startRow + $chunkSize - 1), '<br />';
    $chunkFilter->setRows($startRow, $chunkSize);
    $objPHPExcel = $objReader->load('data/' . $file_name);
    debug_log('reading chunk starting at row ' . $startRow);
    foreach ($objPHPExcel->getActiveSheet()->getRowIterator() as $row) {
        $cellIterator = $row->getCellIterator();
        $cellIterator->setIterateOnlyExistingCells(false);
        echo '<tr>';
        foreach ($cellIterator as $cell) {
            if (!is_null($cell)) {
                //$value = $cell->getCalculatedValue();
                $rawValue = $cell->getValue();
                debug_log($rawValue);
            }
        }
    }
}
```
Asked by Edward Tanguay, Jan 12 '11



2 Answers

It is possible to read a worksheet in "chunks" using Read Filters, although I can make no guarantees about efficiency.

```php
$inputFileType = 'Excel5';
$inputFileName = './sampleData/example2.xls';

/** Define a Read Filter class implementing PHPExcel_Reader_IReadFilter */
class chunkReadFilter implements PHPExcel_Reader_IReadFilter
{
    private $_startRow = 0;
    private $_endRow   = 0;

    /** Set the list of rows that we want to read */
    public function setRows($startRow, $chunkSize)
    {
        $this->_startRow = $startRow;
        $this->_endRow   = $startRow + $chunkSize;
    }

    public function readCell($column, $row, $worksheetName = '')
    {
        // Only read the heading row, and the rows configured by $this->_startRow and $this->_endRow
        if (($row == 1) || ($row >= $this->_startRow && $row < $this->_endRow)) {
            return true;
        }
        return false;
    }
}

echo 'Loading file ', pathinfo($inputFileName, PATHINFO_BASENAME),
     ' using IOFactory with a defined reader type of ', $inputFileType, '<br />';
/** Create a new Reader of the type defined in $inputFileType **/
$objReader = PHPExcel_IOFactory::createReader($inputFileType);

echo '<hr />';

/** Define how many rows we want to read for each "chunk" **/
$chunkSize = 20;
/** Create a new Instance of our Read Filter **/
$chunkFilter = new chunkReadFilter();
/** Tell the Reader that we want to use the Read Filter that we've instantiated **/
$objReader->setReadFilter($chunkFilter);

/** Loop to read our worksheet in "chunk size" blocks **/
/** $startRow is set to 2 initially because we always read the headings in row #1 **/
for ($startRow = 2; $startRow <= 240; $startRow += $chunkSize) {
    echo 'Loading WorkSheet using configurable filter for headings row 1 and for rows ',
         $startRow, ' to ', ($startRow + $chunkSize - 1), '<br />';
    /** Tell the Read Filter the limits on which rows we want to read this iteration **/
    $chunkFilter->setRows($startRow, $chunkSize);
    /** Load only the rows that match our filter from $inputFileName into a PHPExcel object **/
    $objPHPExcel = $objReader->load($inputFileName);

    // Do some processing here

    $sheetData = $objPHPExcel->getActiveSheet()->toArray(null, true, true, true);
    var_dump($sheetData);
    echo '<br /><br />';
}
```

Note that this Read Filter will always read the first row of the worksheet, as well as the rows defined by the chunk rule.

When using a read filter, PHPExcel still parses the entire file, but only loads those cells that match the defined read filter, so it only uses the memory required by that number of cells. However, it will parse the file multiple times, once for each chunk, so it will be slower. This example reads 20 rows at a time: to read line by line, simply set $chunkSize to 1.

This can also cause problems if you have formulae that reference cells in different "chunks", because the data simply isn't available for cells outside of the current "chunk".

Answered by Mark Baker

Currently, the best option for reading .xlsx, .csv and .ods files is spreadsheet-reader (https://github.com/nuovo/spreadsheet-reader), because it can read the files without loading everything into memory. For the .xls extension it has limitations, because it uses PHPExcel for reading.
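As a rough illustration of its streaming style of API (the file name is a placeholder, and `save_row_to_database()` is a hypothetical helper; check the project's README for exact usage):

```php
// Sketch: spreadsheet-reader iterates rows without loading the whole file.
require 'spreadsheet-reader/php-excel-reader/excel_reader2.php'; // needed for .xls
require 'spreadsheet-reader/SpreadsheetReader.php';

$reader = new SpreadsheetReader('big_file.xlsx');
foreach ($reader as $rowIndex => $row) {
    // $row is an array of cell values for the current row
    save_row_to_database($row);
}
```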

Answered by Leonardo Delfino