
How to get the number of pages in a Word Document on linux?

Tags:

php

ms-word

I saw the question "PHP - Get number of pages in a Word document". I also need to determine the page count of a given Word file (doc/docx). I tried to investigate phplivedocx/ZF (@hobodave linked to those in the answers to the original post), but I got completely lost there. I can't use any external web service either (such as DOC2PDF sites, then counting the pages in the resulting PDF).

Simply: is there any PHP code (using ZF or anything else in PHP, but not COM objects or external executables such as AbiWord; I'm on a shared Linux server without exec or similar functions) that finds the page count of a Word file?

EDIT: The Word versions that need to be supported are Microsoft Word 2003 and 2007.

Yaakov Shoham asked Jan 24 '12 11:01



3 Answers

Getting the number of pages for docx files is very easy, because Word stores the count in docProps/app.xml inside the archive:

function get_num_pages_docx($filename)
{
    $zip = new ZipArchive();

    if($zip->open($filename) === true)
    {  
        if(($index = $zip->locateName('docProps/app.xml')) !== false)
        {
            $data = $zip->getFromIndex($index);
            $zip->close();

            $xml = new SimpleXMLElement($data);
            // Cast to int; return false when app.xml has no Pages element.
            return isset($xml->Pages) ? (int) $xml->Pages : false;
        }

        $zip->close();
    }

    return false;
}
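Since the question covers both .doc and .docx, a caller first has to decide which of the two approaches applies. A minimal sketch, assuming the file extension cannot be trusted: it checks the leading bytes of the file for the OLE compound file signature or a ZIP local file header. The function name detect_word_container is hypothetical, and a ZIP signature only shows that the file is an archive, not that it is a valid docx.

```php
<?php
// Hypothetical helper (not from the answer): guess the container format
// from the first bytes of the file.
function detect_word_container($filename)
{
    $handle = @fopen($filename, 'rb');
    if($handle === false)
    {
        return false;
    }

    $magic = fread($handle, 8);
    fclose($handle);

    // OLE compound file signature, used by Word 97-2003 .doc files.
    if($magic === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1")
    {
        return 'doc';
    }

    // ZIP local file header; a .docx is a ZIP archive, but so are many other files.
    if(substr($magic, 0, 4) === "PK\x03\x04")
    {
        return 'docx';
    }

    return false;
}
```

A result of 'docx' can be passed on to the ZipArchive approach above; 'doc' needs the OLE parsing discussed next.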

For the 97-2003 format it's certainly challenging, but by no means impossible. The number of pages is stored in the SummaryInformation stream of the document, but the OLE format of the files makes it a pain to find. The structure is defined extremely thoroughly (though badly imo) here and more simply here. I looked at this for an hour today but didn't get very far (it's not a level of abstraction I'm used to). I did output the hex to better understand the structure:

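// Despite its name, this function does not return a page count yet: it only
// prints a hex dump of the file so the OLE structure can be inspected.
function get_num_pages_doc($filename)
{
    // Binary-safe read; a .doc is an OLE compound file, not text.
    $data = file_get_contents($filename);
    if($data === false)
    {
        return false;
    }

    echo '<div style="font-family: courier new;">';

        $hex_array = str_split(bin2hex($data), 4);
        $i = 0;
        $line = 0;
        $collection = '';
        foreach($hex_array as $string)
        {
            $collection .= hex_ascii($string);
            $i++;

            if($i == 1)
            {
                // Each row holds 16 bytes, so the offset is the row number plus a trailing 0.
                echo '<b>'.sprintf('%05X', $line).'0:</b> ';
            }

            echo strtoupper($string).' ';

            if($i == 8)
            {
                echo ' '.$collection.' <br />'."\n";
                $collection = '';
                $i = 0;

                $line += 1;
            }
        }

        // Flush a final, partially filled row.
        if($i > 0)
        {
            echo ' '.$collection.' <br />'."\n";
        }

    echo '</div>';

    exit();
}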

function hex_ascii($string, $html_safe = true)
{
    $return = '';

    $conv = array($string);
    if(strlen($string) > 2)
    {
        $conv = str_split($string, 2);
    }

    foreach($conv as $string)
    {
        $num = hexdec($string);

        $ascii = '.';
        if($num > 32)
        {   
            $ascii = unichr($num);
        }

        // Escape every HTML-special character (including &), not only < and >.
        $return .= $html_safe ? htmlspecialchars($ascii) : $ascii;
    }

    return $return;
}

function unichr($intval)
{
    return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}

This outputs a dump in which you can find sections such as:

007000: 0500 5300 7500 6D00 6D00 6100 7200 7900 ..S.u.m.m.a.r.y.
007010: 4900 6E00 6600 6F00 7200 6D00 6100 7400 I.n.f.o.r.m.a.t.
007020: 6900 6F00 6E00 0000 0000 0000 0000 0000 i.o.n...........
007030: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 

Right after it you can see the referencing info:

007040: 2800 0201 FFFF FFFF FFFF FFFF FFFF FFFF (...ÿÿÿÿÿÿÿÿÿÿÿÿ
007050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
007060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
007070: 0000 0000 2500 0000 0010 0000 0000 0000 ....%...........

Read as little-endian values, these bytes give the directory entry fields described in the specification:

_ab = ("SummaryInformation") 
_cb = 0028
_mse = 02 (STGTY_STREAM) 
_bflags = 01 (DE_BLACK) 
_sidLeftSib = FFFF FFFF 
_sidRightSib = FFFF FFFF (none) 
_sidChild = FFFF FFFF (n/a for STGTY_STREAM) 
_clsid = 0000 0000 0000 0000 0000 0000 0000 0000 (n/a) 
_dwUserFlags = 0000 0000 (n/a) 
_time[0] = CreateTime = 0000 0000 0000 0000 (n/a) 
_time[1] = ModifyTime = 0000 0000 0000 0000 (n/a)
_startSect = 0000 0025 
_ulSize = 0000 1000 
_dptPropType = 0000 (n/a)
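The manual decoding above can be sketched with unpack(). This is only an illustration, assuming the usual 128-byte directory entry layout (64-byte UTF-16LE name buffer, then the fixed fields at the offsets seen in the dump); the function name parse_directory_entry is hypothetical, and the sample bytes are copied from rows 007000 to 007070 of the dump.

```php
<?php
// Sketch: decode one 128-byte OLE directory entry.
// Array keys follow the field names in the listing above.
function parse_directory_entry($entry)
{
    if(strlen($entry) < 128)
    {
        return false;
    }

    // Fixed-size fields after the 64-byte name buffer, all little-endian.
    $head = unpack('vcb/Cmse/Cbflags/VsidLeftSib/VsidRightSib/VsidChild', substr($entry, 64, 16));
    // Start sector and stream size sit at offset 0x74 of the entry.
    $tail = unpack('VstartSect/VulSize', substr($entry, 116, 8));

    // _cb counts bytes including the 2-byte terminator.
    $name_bytes = substr($entry, 0, min(64, max(0, $head['cb'] - 2)));

    return array(
        'name'      => mb_convert_encoding($name_bytes, 'UTF-8', 'UTF-16LE'),
        'mse'       => $head['mse'],
        'bflags'    => $head['bflags'],
        'startSect' => $tail['startSect'],
        'ulSize'    => $tail['ulSize'],
    );
}

// The 128 bytes shown in the dump (007000-00707F).
$zeros = str_repeat('00', 16);
$entry = hex2bin(
    '0500530075006D006D00610072007900'.
    '49006E0066006F0072006D0061007400'.
    '69006F006E00'.str_repeat('00', 10).
    $zeros.
    '28000201FFFFFFFFFFFFFFFFFFFFFFFF'.
    $zeros.$zeros.
    '0000000025000000'.'0010000000000000'
);

print_r(parse_directory_entry($entry));
```

The decoded name keeps its leading 0x05 character, and startSect and ulSize come from the last row of the dump. The start sector is an index into the file's sectors, not a byte offset, and the stream's sectors are not guaranteed to be contiguous.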

From _startSect and _ulSize you can find the relevant sectors of the file, unpack the stream and get the page count. Of course this is the hard bit that I just don't have time for, but it should set you in the right direction.
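For the last step, here is a rough sketch of reading the page count once the raw bytes of the SummaryInformation stream are in hand. It assumes the standard OLE property set layout (a header ending in the offset of the first section, then a table of property id/offset pairs) and that the page count is property 0x0E stored as a 32-bit integer. Extracting the stream by following the sector chain is not shown, and the function name summary_info_page_count is hypothetical.

```php
<?php
// Sketch: read property 0x0E (page count) from the raw bytes of a
// SummaryInformation stream. All values are little-endian.
function summary_info_page_count($stream)
{
    // Header: byte order (2), version (2), system id (4), CLSID (16),
    // number of property sets (4), then FMTID (16) + section offset (4).
    if(strlen($stream) < 48)
    {
        return false;
    }

    $header = unpack('vbyteOrder', substr($stream, 0, 2));
    if($header['byteOrder'] !== 0xFFFE)
    {
        return false;
    }

    $section = unpack('Voffset', substr($stream, 44, 4));
    $base = $section['offset'];
    if($base + 8 > strlen($stream))
    {
        return false;
    }

    // Section header: total size, then number of properties.
    $info = unpack('Vsize/Vcount', substr($stream, $base, 8));

    for($i = 0; $i < $info['count']; $i++)
    {
        $pos = $base + 8 + $i * 8;
        if($pos + 8 > strlen($stream))
        {
            return false;
        }

        $prop = unpack('Vid/Voffset', substr($stream, $pos, 8));
        if($prop['id'] !== 0x0E)
        {
            continue;
        }

        // Property offsets are relative to the start of the section.
        $value_pos = $base + $prop['offset'];
        if($value_pos + 8 > strlen($stream))
        {
            return false;
        }

        $value = unpack('Vtype/Vvalue', substr($stream, $value_pos, 8));

        // Type 3 is a 32-bit integer; anything else is treated as unreadable here.
        return (($value['type'] & 0xFFFF) === 3) ? $value['value'] : false;
    }

    return false;
}
```

As with the docx case, the number is whatever Word stored when the document was last saved, so treat it as metadata rather than a freshly computed layout.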

M$ don't make it easy!

Paul Norman answered Oct 23 '22 17:10



Have a look at PHPWord, hosted on Microsoft's CodePlex: http://phpword.codeplex.com/

It will allow you to open and read Word files in PHP and do whatever processing you require.

iWantSimpleLife answered Oct 23 '22 17:10



To get metadata properties of doc, docx, ppt and pptx files, such as the number of pages or slides, using PHP, I followed the process below. It worked like a charm; I hope it helps someone.

Download and configure Apache Tika.

Once that's done, try executing the following commands; they print all the metadata for your file:

java -jar tika-app-1.5.jar -m test.docx
java -jar tika-app-1.5.jar -m test.doc
java -jar tika-app-1.5.jar -m test.pptx
java -jar tika-app-1.5.jar -m test.ppt

Once tested, you can execute the same command from a PHP script (which requires exec() or a similar function to be enabled). Thanks.
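Calling the command from PHP could look roughly like this. It is a sketch only: it needs Java plus shell_exec(), which the question's shared host does not allow, and the metadata key names it looks for (xmpTPg:NPages and Page-Count) are assumptions that should be checked against the output of your own Tika version. Both function names are hypothetical.

```php
<?php
// Sketch: pull a page count out of the "key: value" text printed by
// `tika-app -m`. The key names checked here are assumptions.
function tika_parse_page_count($metadata)
{
    foreach(preg_split('/\R/', $metadata) as $line)
    {
        if(preg_match('/^(xmpTPg:NPages|Page-Count):\s*(\d+)\s*$/', $line, $m))
        {
            return (int) $m[2];
        }
    }

    return false;
}

// Hypothetical wrapper around the command line shown above.
function tika_page_count($file, $jar = 'tika-app-1.5.jar')
{
    $cmd = 'java -jar '.escapeshellarg($jar).' -m '.escapeshellarg($file);
    $out = shell_exec($cmd);

    return is_string($out) ? tika_parse_page_count($out) : false;
}
```

Keeping the parsing separate from the shell call makes it possible to test the parser on captured output without Java installed.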

opensource-developer answered Oct 23 '22 15:10
