
Can OpenOffice count words from the console?

I have a small problem: I need to count words from the console in doc, docx, pptx, ppt, xls, xlsx, odt, pdf ... files. Please don't suggest | wc -w or grep: they only work on plain text or console output and they only split on spaces, while Japanese, Chinese, Arabic, Hindi and Hebrew use different delimiters, so the word count comes out wrong. I tried counting with these:

pdftotext file.pdf -| wc -w
/usr/local/bin/docx2txt.pl < file.docx | wc -w
/usr/local/bin/pptx2txt.pl < file.pptx | wc -w
antiword file.doc -| wc -w 
antiword file.word -| wc -w

In some cases Microsoft Word and OpenOffice say 1000 words while these counters return 10 or 300 words, if the language is Japanese, Chinese, Hindi, etc. With normal characters I have no issue: the biggest error is a count that is 3 less in some cases, which is OK.

I tried to convert with soffice or openoffice and then run wc -w on the result, but I can't even get the conversion to work:

soffice --headless --nofirststartwizard --accept=socket,host=127.0.0.1,port=8100; --convert-to pdf some.pdf /var/www/domains/vocabridge.com/devel/temp_files/23/0/东京_1000_words_Docx.docx 

OR

 openoffice.org  --headless  --convert-to  ........

OR

openoffice.org3 --invisible 

So if anyone knows a way to count words correctly, or to display document statistics, with OpenOffice or anything else from the Linux console, please share it.

thanks.

ddjikic asked Oct 22 '22 16:10


2 Answers

If you have Microsoft Word (and Windows, obviously) you can write a VBA macro or if you want to run straight from the command line you can write a VBScript script with something like the following:

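' Object references must be assigned with Set in VBScript
Set wordApp = CreateObject("Word.Application")
Set doc = wordApp.Documents.Open("C:\path\to\file.docx") ' placeholder path
docWordCount = doc.Words.Count
WScript.Echo docWordCount
doc.Close False ' close without saving
' Rinse and repeat for each document, then:
wordApp.Quit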

If you have OpenOffice.org/LibreOffice you have similar (but more) options. If you want to stay in the office app and run a macro you can probably do that. I don't know the StarBasic API well enough to tell you how but I can give you the alternative: creating a Python script to get the word count from the command line. Roughly speaking, you do the following:

  • Start up your copy of OOo/LibO from the command line with the appropriate parameters to accept incoming socket connections. http://www.openoffice.org/udk/python/python-bridge.html has instructions on how to do that. Go there and use the browser's in-page find feature to search for 'accept=socket'

  • Write a Python script to use the OOo/LibO UNO bridge (basically equivalent to the VBScript example above) to open up your Word/ODT documents one at a time and get the word count from each. The above page should give you a good start to doing that.

  • You get the word count from a document model object's WordCount property: http://www.openoffice.org/api/docs/common/ref/com/sun/star/text/GenericTextDocument.html#WordCount
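The steps above can be sketched roughly as follows. This is a minimal, untested-against-your-setup sketch, not a definitive implementation. It assumes the script runs under the Python that ships with the office suite (so the uno module is importable) and that the office is already listening on 127.0.0.1:8100. The helper names soffice_command and word_count are made up for illustration. WordCount is a property of text documents, so spreadsheets and presentations would need different handling.

```python
import os


def soffice_command(host="127.0.0.1", port=8100):
    """Build the command line that starts a headless office listening on a socket."""
    accept = f"socket,host={host},port={port};urp;"
    return ["soffice", "--headless", "--invisible", f"--accept={accept}"]


def word_count(path, host="127.0.0.1", port=8100):
    """Open a text document through the UNO bridge and return its WordCount."""
    # Imported lazily: these modules only exist in the office suite's Python.
    import uno
    from com.sun.star.beans import PropertyValue

    # Connect to the running office instance.
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(
        f"uno:socket,host={host},port={port};urp;StarOffice.ComponentContext")
    desktop = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)

    # Load the document without showing a window.
    hidden = PropertyValue()
    hidden.Name = "Hidden"
    hidden.Value = True
    url = uno.systemPathToFileUrl(os.path.abspath(path))
    doc = desktop.loadComponentFromURL(url, "_blank", 0, (hidden,))
    try:
        return doc.getPropertyValue("WordCount")
    finally:
        doc.close(True)
```

To try it, start the office with the command returned by soffice_command() (for example via subprocess.Popen), wait a few seconds for it to come up, then call word_count("some.docx") for each file.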

Yawar answered Oct 25 '22 17:10


I found the answer: create a service that converts incoming files to PDF.

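#!/bin/sh
#
# chkconfig: 345 99 01
#
# description: your script is a test service
#

# Poll the input directory once a second, in the background.
(while sleep 1; do
  for file in pathwithfiles/in/*; do
    [ -f "$file" ] || continue   # skip the unexpanded glob when the directory is empty
    # Note the double dash: the option is --convert-to, not -convert-to
    libreoffice --headless --convert-to pdf --outdir pathwithfiles/out "$file"
    rm "$file"
  done
done) &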

Then the PHP script that I needed counted everything:

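// Body of a function that returns the word count for $absolute_file_path
$ext = pathinfo($absolute_file_path, PATHINFO_EXTENSION);
if ($ext !== 'txt' && $ext !== 'pdf') {
    // Convert to pdf: drop a copy into the service's input directory
    // and wait for the converted file to show up in the output directory.
    $tb = time() . mt_rand(); // time() instead of the argument-less mktime(), which PHP 8 rejects
    $tempfile = 'locationofpdfs/in/' . $tb . '.' . $ext;
    copy($absolute_file_path, $tempfile);
    $absolute_file_path = 'locationofpdfs/out/' . $tb . '.pdf';
    $ext = 'pdf';
    while (!is_file($absolute_file_path)) sleep(1);
}
if ($ext !== 'txt') {
    // Convert to txt (arguments are escaped so odd file names cannot break the command)
    $tempfile = tempnam(sys_get_temp_dir(), '');
    shell_exec('pdftotext ' . escapeshellarg($absolute_file_path) . ' ' . escapeshellarg($tempfile));
    $absolute_file_path = $tempfile;
    $ext = 'txt';
}
if ($ext === 'txt') {
    $seq = '/[\s\.,;:!\? ]+/mu'; // word delimiters: whitespace and basic punctuation
    $plain = file_get_contents($absolute_file_path);
    $plain = preg_replace('#\{{{.*?\}}}#su', "", $plain); // drop {{{...}}} blocks
    $str = preg_replace($seq, '', $plain);
    $chars = count(preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY));
    $words = count(preg_split($seq, $plain, -1, PREG_SPLIT_NO_EMPTY));
    if ($words === 0) return $chars;
    // More than 10 characters per "word" means the text has few delimiters
    // (e.g. Chinese or Japanese), so count characters instead.
    if ($chars / $words > 10) $words = $chars;
    return $words;
}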
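For readers following the Python route from the other answer, the counting heuristic in the last branch could be ported roughly like this. It is a sketch under assumptions: the function name estimate_words and the ratio parameter are made up for illustration, the delimiter set is a plain-ASCII approximation of the PHP pattern, and the {{{...}}} stripping step is left out.

```python
import re

# Approximation of the PHP delimiter pattern: whitespace and basic punctuation.
DELIMS = re.compile(r"[\s.,;:!?]+")


def estimate_words(text, ratio=10):
    """Count delimiter-separated tokens, falling back to a character count
    when the average token is longer than `ratio` characters."""
    tokens = [t for t in DELIMS.split(text) if t]
    chars = sum(len(t) for t in tokens)  # characters left once delimiters are removed
    if not tokens:
        return chars
    if chars / len(tokens) > ratio:
        # Very long "words" suggest a script without spaces between words,
        # so each character is counted as one word instead.
        return chars
    return len(tokens)
```

The fallback is a crude estimate, not a real word segmentation: it treats every character as a word, which is how many word processors count Chinese and Japanese text, but it will also misfire on any text with unusually long tokens.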
ddjikic answered Oct 25 '22 17:10