
RTF / doc / docx text extraction in program written in C++/Qt

Tags: c++, windows, qt

I am writing a program in Qt/C++ and need to read text from Microsoft Word (doc), RTF and docx files.

I am looking for a command-line program that can do the extraction. Using several different programs is fine.

The closest thing I have found is DocToText, but it has several bugs, so I can't use it. I also have Microsoft Word installed on the PC. Maybe there is some way to read the text through it (I have no idea how to use COM)?

Night Walker asked Dec 10 '22 20:12


2 Answers

Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously to use this in a Qt program you'd have to spawn a process for it etc, but the command line I've hacked together is:

unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'

So that's:

unzip -p file.docx: -p means "unzip to stdout"

grep '<w:t': Keep only the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)

sed 's/<[^<]*>//g': Remove the tags themselves, leaving only the text between them

grep -v '^[[:space:]]*$': Remove blank lines
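For the "spawn a process" part, here is a rough sketch of running that pipeline from C++ and capturing its output. In a Qt program QProcess would be the natural tool; this uses plain popen as a stand-in so it has no Qt dependency. The names runCommand and docxPipeline are made up for the example, and the quoting assumes the command is handed to a POSIX-style shell (with the Windows ports under cmd.exe the quoting would need adjusting).

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

#ifdef _WIN32
#define POPEN _popen
#define PCLOSE _pclose
#else
#define POPEN popen
#define PCLOSE pclose
#endif

// Hypothetical helper: run a shell command and return everything it
// writes to stdout.
std::string runCommand(const std::string& command) {
    FILE* pipe = POPEN(command.c_str(), "r");
    if (!pipe) {
        throw std::runtime_error("could not start: " + command);
    }
    std::string output;
    char buffer[4096];
    std::size_t n = 0;
    while ((n = std::fread(buffer, 1, sizeof buffer, pipe)) > 0) {
        output.append(buffer, n);
    }
    PCLOSE(pipe);
    return output;
}

// Hypothetical helper: build the pipeline from this answer for one file.
// The path is only wrapped in double quotes, so this is an assumption
// that it contains no quotes or other shell metacharacters.
std::string docxPipeline(const std::string& path) {
    return "unzip -p \"" + path + "\""
           " | grep '<w:t'"
           " | sed 's/<[^<]*>//g'"
           " | grep -v '^[[:space:]]*$'";
}
```

Something like runCommand(docxPipeline("file.docx")) would then hand back the extracted text as one string.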

There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.

As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform, despite being a bit of an ugly hack ;)
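If you'd rather not depend on grep and sed at all, the tag-stripping step can probably be done inside the program. Below is a minimal sketch, not a real XML parser: it assumes you already have the contents of word/document.xml in a string (for example from unzip -p file.docx word/document.xml), keeps only the text inside <w:t> elements, and starts a new line at each closing </w:p>. The function names are invented for the example, and only the five predefined XML entities are decoded.

```cpp
#include <string>

// Decode the five predefined XML entities in a single pass; any other
// character reference is left as-is.
static std::string decodeEntities(const std::string& in) {
    static const struct { const char* name; char ch; } table[] = {
        {"&lt;", '<'}, {"&gt;", '>'}, {"&quot;", '"'},
        {"&apos;", '\''}, {"&amp;", '&'}
    };
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        bool matched = false;
        if (in[i] == '&') {
            for (const auto& e : table) {
                const std::string name(e.name);
                if (in.compare(i, name.size(), name) == 0) {
                    out += e.ch;
                    i += name.size();
                    matched = true;
                    break;
                }
            }
        }
        if (!matched) out += in[i++];
    }
    return out;
}

// Hypothetical helper: pull the visible text out of a document.xml string.
std::string extractWordText(const std::string& xml) {
    std::string text;
    std::size_t pos = 0;
    while ((pos = xml.find('<', pos)) != std::string::npos) {
        const std::size_t close = xml.find('>', pos);
        if (close == std::string::npos) break;  // truncated tag: stop here
        const std::string tag = xml.substr(pos + 1, close - pos - 1);
        pos = close + 1;
        if (tag == "/w:p") {
            // End of paragraph: add a line break, but never a blank line.
            if (!text.empty() && text.back() != '\n') text += '\n';
        } else if (tag == "w:t" || tag.compare(0, 4, "w:t ") == 0) {
            // Text run: copy everything up to the matching close tag.
            const std::size_t end = xml.find("</w:t>", pos);
            if (end == std::string::npos) break;
            text += decodeEntities(xml.substr(pos, end - pos));
            pos = end + 6;  // skip past "</w:t>"
        }
    }
    return text;
}
```

Unlike the grep/sed version this gives one line per paragraph, but tabs, footnotes, headers and anything else stored outside <w:t> runs in document.xml are ignored.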

Ben Williams answered Dec 30 '22 07:12


Try Apache Tika

raven answered Dec 30 '22 06:12