Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a way to get the worksheet names from 20MB+ Excel files with PHPExcel?

I am using PHPExcel to read data out of Excel files.

With the following code, I am able to read one particular worksheet out of a 3MB Excel file in just a couple seconds. Works nicely.

However, I now have 27MB and 88MB Excel files which I need to get data out of. They are so large that even OpenOffice cannot open them.

I have found that I can use an index number instead of a name when I load a sheet, but this seems inconsistent, e.g. in one particular Excel file setLoadSheetsOnly(0) gave me the third sheet while setLoadSheetsOnly(1) gave me an error even though there were four worksheets in the file. So that seems unreliable for some reason.

Is there a way I can read out the names of the worksheets from a large file so that I can access only one of its worksheets at a time?

        $objReader = PHPExcel_IOFactory::createReaderForFile("data/" . $file_name);
        $objReader->setLoadSheetsOnly(array($sheet_name));
        $objReader->setReadDataOnly(true);
        $objPHPExcel = $objReader->load("data/" . $file_name);

        echo '<table border="1">';
        for ($row = 1; $row < $number_of_rows; $row++) {
            echo '<tr>';
            for ($column = 0; $column < $number_of_columns; $column++) {
                $value = $objPHPExcel->setActiveSheetIndex(0)->getCellByColumnAndRow($column, $row)->getValue();
                echo '<td>';
                echo $value . '&nbsp;';
                echo '</td>';
            }
            echo '</tr>';
        }
        echo '</table>';
        die;

ADDENDUM:

I found some code that gets close, but it doesn't seem to always be accurate, e.g. here it missed the second worksheet in a 27MB file:

alt text

and here it only got the third worksheet and missed 3 others:

alt text

$objReader = PHPExcel_IOFactory::createReaderForFile("data/" . $file_name);
$objReader->setLoadSheetsOnly(0);
$objReader->setReadDataOnly(true);
$objPHPExcel = $objReader->load("data/" . $file_name);

echo $objPHPExcel->getSheetCount(), ' worksheets<hr/>';
$loadedSheetNames = $objPHPExcel->getSheetNames();
foreach ($loadedSheetNames as $sheetIndex => $loadedSheetName) {
    echo $sheetIndex, ' -> ', $loadedSheetName, '<br />';
}
die;
like image 567
Edward Tanguay Avatar asked Dec 28 '10 12:12

Edward Tanguay


2 Answers

Unfortunately, it isn't possible to read the names of the worksheets without loading the whole file.

Using an index number rather than a name when calling setLoadSheetsOnly() won't give a predictable result: the code logic that performs that check uses in_array() to test if the sheetname that it is about to read is in the array of sheetnames to read. e.g.

// check if sheet should be skipped
if (isset($this->_loadSheetsOnly) && !in_array($sheet['name'], $this->_loadSheetsOnly)) {
    continue;
}

I would suspect that the comparison of a string against a numeric value will be giving a true result for 0 == "mySheetName" when executing this test (based on PHP's loose typing and comparison casting rules).

I could probably provide a Reader method that would return a list of worksheet names without actually loading the entire file, although there would be a performance hit involved.

EDIT

If you add the following method to Classes/PHPExcel/Reader/Excel2007.php

/**
 * Reads names of the worksheets from a file, without loading the whole file to a PHPExcel object
 *
 * @param   string      $pFilename
 * @throws  Exception
 */
public function listWorksheetNames($pFilename)
{
    // Check if file exists
    if (!file_exists($pFilename)) {
        throw new Exception("Could not open " . $pFilename . " for reading! File does not exist.");
    }

    $worksheetNames = array();

    $zip = new ZipArchive;
    $zip->open($pFilename);

    $rels = simplexml_load_string($this->_getFromZipArchive($zip, "_rels/.rels")); //~ http://schemas.openxmlformats.org/package/2006/relationships");
    foreach ($rels->Relationship as $rel) {
        switch ($rel["Type"]) {
            case "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument":
                $xmlWorkbook = simplexml_load_string($this->_getFromZipArchive($zip, "{$rel['Target']}"));  //~ http://schemas.openxmlformats.org/spreadsheetml/2006/main");

                if ($xmlWorkbook->sheets) {
                    foreach ($xmlWorkbook->sheets->sheet as $eleSheet) {
                        // Check if sheet should be skipped
                        $worksheetNames[] = (string) $eleSheet["name"];
                    }
                }
        }
    }

    $zip->close();

    return $worksheetNames;
}

You can call it by using:

$inputFileType = 'Excel2007';
$inputFileName = 'biostat-behfisk-2005.xlsx';
$objReader = PHPExcel_IOFactory::createReader($inputFileType);
$worksheetNames = $objReader->listWorksheetNames($inputFileName);

foreach ($worksheetNames as $sheetName) {
    echo $sheetName, '<br />';
}

the returned $worksheetNames should contain an array of all worksheet names as UTF-8 strings. Because it's only reading the absolute minimum from the .xlsx to retrieve these names, it should be fairly fast. I'll do some more tests before checking it into the PHPExcel SVN, but (for now) it seems to do what you need.

EDIT2

Equivalent method for the Excel5 Reader

/**
 * Reads names of the worksheets from a file, without loading the whole file to a PHPExcel object
 *
 * @param   string      $pFilename
 * @throws  Exception
 */
public function listWorksheetNames($pFilename)
{
    // Check if file exists
    if (!file_exists($pFilename)) {
        throw new Exception("Could not open " . $pFilename . " for reading! File does not exist.");
    }

    $worksheetNames = array();

    // Read the OLE file
    $this->_loadOLE($pFilename);

    // total byte size of Excel data (workbook global substream + sheet substreams)
    $this->_dataSize = strlen($this->_data);

    $this->_pos     = 0;
    $this->_sheets  = array();

    // Parse Workbook Global Substream
    while ($this->_pos < $this->_dataSize) {
        $code = self::_GetInt2d($this->_data, $this->_pos);

        switch ($code) {
            case self::XLS_Type_BOF:    $this->_readBof();      break;
            case self::XLS_Type_SHEET:  $this->_readSheet();    break;
            case self::XLS_Type_EOF:    $this->_readDefault();  break 2;
            default:                    $this->_readDefault();  break;
        }
    }

    foreach ($this->_sheets as $sheet) {
        if ($sheet['sheetType'] != 0x00) {
            // 0x00: Worksheet, 0x02: Chart, 0x06: Visual Basic module
            continue;
        }

        $worksheetNames[] = $sheet['name'];
    }

    return $worksheetNames;
}

Not as efficient as the Excel2007 Reader version, but should still be faster than parsing the whole .xls file just for the sheet names because I'm only parsing the global stream.

like image 192
Mark Baker Avatar answered Oct 17 '22 23:10

Mark Baker


I didn't want to amend phpexcel so I went with this:

public function getWorksheetNames($pFilename) {

    $worksheetNames = array ();

    $zip = zip_open ( $pFilename );
    while ( $entry = zip_read ( $zip ) ) {

        $entry_name = zip_entry_name ( $entry );
        if ($entry_name == 'xl/workbook.xml') {
            if (zip_entry_open ( $zip, $entry, "r" )) {
                $buf = zip_entry_read ( $entry, zip_entry_filesize ( $entry ) );
                $workbook = simplexml_load_string ( $buf );
                foreach ( $workbook->sheets as $sheets ) {
                    foreach( $sheets as $sheet) {
                        $attributes=$sheet->attributes();
                        $worksheetNames[]=$attributes['name'];
                    }
                }
                zip_entry_close ( $entry );
            }
            break;
        }

    }
    zip_close ( $zip );
    return $worksheetNames;
}

It only works on excel 2007 or later but did what I needed

like image 39
Mark Avatar answered Oct 17 '22 23:10

Mark