Best Way to Process a Word Document [closed]

I receive Word documents with specified formatting corresponding to the data that is in them. For example, all headers have the exact same formatting (Times New Roman, 14 pt, bold).

What is the best way to process such MS Word documents (.doc or .docx) into XML documents? Language is not an issue (I'll use Lisp/Boost.Spirit if I have to!).

Mikhail asked Nov 24 '10 02:11


2 Answers

Take a look at the python-docx library.

Etienne answered Oct 08 '22 21:10


So I think you're saying that the structure of the document is encoded in the formatting, and you want to produce XML files that capture that structure, whilst keeping the content in plain text?

If that is so, you will need to parse the documents, build a data structure that can be processed, and then dump it out as XML.

For parsing, there are a few options. Microsoft have published the specifications for their binary .doc format, which you will need to read if you want to write a parser for it. With .docx you're a little luckier: the file is a ZIP archive whose parts are already XML (the main text lives in word/document.xml), so you can unzip it, read it with any XML parsing library, and search the resulting tree for the data you are interested in. XML parsers are available for pretty much any language; one easy-to-use option that comes to mind is minidom (xml.dom.minidom) for Python.
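As a rough sketch of the .docx route, the following reads word/document.xml out of the archive with minidom and labels each paragraph as a header or body text. It assumes the formatting is applied directly to each run; if the document uses named styles instead, the font details live in word/styles.xml and this check will not see them. The names HEADER_FORMAT, run_format, and classify_paragraphs are made up for illustration.

```python
import zipfile
from xml.dom import minidom

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Header formatting from the question: Times New Roman, 14 pt, bold.
HEADER_FORMAT = ("Times New Roman", 14, True)


def first(node, name):
    """Return the first descendant element w:<name>, or None if absent."""
    found = node.getElementsByTagNameNS(W_NS, name)
    return found[0] if found else None


def run_format(run):
    """Return (font, size in points, bold) from a run's direct formatting."""
    props = first(run, "rPr")
    if props is None:
        return (None, None, False)
    fonts, size, bold = first(props, "rFonts"), first(props, "sz"), first(props, "b")
    font = fonts.getAttributeNS(W_NS, "ascii") if fonts is not None else ""
    # w:sz holds the size in half-points, so 28 means 14 pt.
    points = int(size.getAttributeNS(W_NS, "val")) / 2 if size is not None else None
    # A bare <w:b/> means bold; w:val="0" or "false" switches it off.
    is_bold = bold is not None and bold.getAttributeNS(W_NS, "val") not in ("0", "false")
    return (font or None, points, is_bold)


def run_text(run):
    """Concatenate the text of all w:t elements inside a run."""
    return "".join(
        child.data
        for t in run.getElementsByTagNameNS(W_NS, "t")
        for child in t.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def classify_paragraphs(docx_file):
    """Yield ('header' | 'body', text) for each non-empty paragraph."""
    with zipfile.ZipFile(docx_file) as archive:
        dom = minidom.parseString(archive.read("word/document.xml"))
    for para in dom.getElementsByTagNameNS(W_NS, "p"):
        # Only runs that carry text take part in the formatting check.
        text_runs = [
            r for r in para.getElementsByTagNameNS(W_NS, "r")
            if r.getElementsByTagNameNS(W_NS, "t")
        ]
        text = "".join(run_text(r) for r in text_runs)
        if not text.strip():
            continue
        is_header = all(run_format(r) == HEADER_FORMAT for r in text_runs)
        yield ("header" if is_header else "body", text)
```

Calling list(classify_paragraphs("report.docx")), where report.docx is a placeholder file name, would give a flat list of (kind, text) pairs to build the output from.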

For generating your output XML, a library that maps an object representation to XML again seems to be the way to go; minidom, for example, does that too.
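A minimal sketch of the output side with minidom, assuming the parsed content is already available as a list of (kind, text) pairs where kind is "header" or "body". The element names document, section, and para are invented here; use whatever schema your XML actually needs.

```python
from xml.dom import minidom


def build_xml(paragraphs):
    """Build a DOM in which each header opens a <section> holding the body text after it."""
    doc = minidom.Document()
    root = doc.createElement("document")
    doc.appendChild(root)
    section = None
    for kind, text in paragraphs:
        if kind == "header":
            section = doc.createElement("section")
            section.setAttribute("title", text)
            root.appendChild(section)
        else:
            para = doc.createElement("para")
            # createTextNode stores raw text; minidom escapes it on output.
            para.appendChild(doc.createTextNode(text))
            # Body text before the first header hangs directly off the root.
            (section if section is not None else root).appendChild(para)
    return doc


if __name__ == "__main__":
    sample = [("header", "Intro"), ("body", "First paragraph.")]
    print(build_xml(sample).toprettyxml(indent="  "))
```

Building the tree through the DOM rather than concatenating strings means that characters such as & and < in the document text are escaped for you.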

If you don't want to deal with writing your own .doc parser, you could first run the documents through a converter that produces a more accessible format: use Word itself to convert the .doc files to .docx, use a tool that produces RDFs from .docs, or use an existing Word parser such as the one in OpenOffice.
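One way to script the conversion step is sketched below. It assumes LibreOffice (the successor to OpenOffice) is installed and that its soffice binary is on the PATH; the function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def conversion_command(doc_path, out_dir, soffice="soffice"):
    """Build the headless LibreOffice command that converts one file to .docx."""
    return [
        soffice, "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_to_docx(doc_path, out_dir, soffice="soffice"):
    """Run the conversion and return the path where the .docx is expected."""
    subprocess.run(conversion_command(doc_path, out_dir, soffice), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```

The resulting .docx files can then go through the same XML-based processing as documents that arrived in that format.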

David Claridge answered Oct 08 '22 21:10