I receive Word documents with specified formatting corresponding to the data that is in them. For example, all headers have exactly the same formatting (Times New Roman, 14 pt, bold). What is the best way to process such MS Word documents (.doc or .docx) into XML documents? Language is not an issue (I'll use Lisp/Boost.Spirit if I have to!).

Best Way to Process a Word Document [closed]

2 Answers

Take a look at the python-docx library.
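
A minimal sketch of how that might look, assuming the header formatting is applied directly to the runs rather than through a named style. The helper names (`is_header_run`, `classify_paragraphs`, `load_paragraphs`) are made up for illustration; the python-docx attributes used are `Document(...).paragraphs`, `paragraph.runs`, `run.bold`, `run.font.name` and `run.font.size.pt`.

```python
def is_header_run(run, font_name="Times New Roman", size_pt=14.0):
    """Return True if the run directly carries the header formatting."""
    font = run.font
    # font.size is a Length object, or None when the size is inherited
    # from a style; .pt converts it to points.
    size = font.size.pt if font.size is not None else None
    return bool(run.bold) and font.name == font_name and size == size_pt


def classify_paragraphs(paragraphs):
    """Yield (kind, text) pairs, where kind is 'header' or 'body'."""
    for para in paragraphs:
        text = para.text.strip()
        if not text:
            continue  # skip empty paragraphs
        # Ignore whitespace-only runs when deciding what the paragraph is.
        visible = [r for r in para.runs if r.text.strip()]
        is_header = bool(visible) and all(is_header_run(r) for r in visible)
        yield ("header" if is_header else "body"), text


def load_paragraphs(path):
    """Open a .docx file with python-docx and return its paragraphs."""
    # Imported here so the helpers above work without the package installed.
    import docx  # third-party: pip install python-docx
    return docx.Document(path).paragraphs
```

Usage would be something like `list(classify_paragraphs(load_paragraphs("report.docx")))`, where the file name is a placeholder. If the headers get their look from a style instead, `run.font.name` and `run.font.size` will be `None`, and checking `paragraph.style.name` is likely the better test. Note that python-docx reads .docx only, not the binary .doc format.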

192

answered Oct 08 '22 21:10

Etienne

So I think you're saying that the structure of the document is encoded in the formatting, and you want to produce XML files that capture that structure, whilst keeping the content in plain text?

If so, you will need to parse the documents and build a data structure that can be processed and then dumped out as XML.

For parsing, there are a few options. Microsoft have published the specification for their binary .doc format, which you would need to read in order to write a parser for it. With .docx you're a little luckier: the file is a ZIP archive whose parts are already XML (the main body is in word/document.xml), so you can unzip it, read that part with any XML parsing library, and search the resulting tree for the data you are interested in. XML parsers are available for pretty much any language; one easy-to-use option that comes to mind is minidom (xml.dom.minidom) for Python.
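
As a rough sketch of the .docx route using only the Python standard library (`zipfile` plus `xml.etree.ElementTree` rather than minidom, purely for brevity): it opens the archive, parses `word/document.xml`, and marks a paragraph as a header when every visible run is directly formatted as Times New Roman, 14 pt, bold. The function names are illustrative, and it assumes direct formatting; formatting inherited from styles lives in `word/styles.xml` and is not resolved here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _attr(elem, name):
    """Read a w:-namespaced attribute; None if the element or attribute is absent."""
    return None if elem is None else elem.get(f"{{{W}}}{name}")


def run_text(run):
    """Concatenate the text of all w:t children of a run."""
    return "".join(t.text or "" for t in run.findall("w:t", NS))


def run_is_header(run):
    """Check a run's direct properties for Times New Roman, 14 pt, bold."""
    props = run.find("w:rPr", NS)
    if props is None:
        return False
    bold = props.find("w:b", NS)
    # <w:b/> means bold; <w:b w:val="0"/> or "false" switches it off.
    is_bold = bold is not None and _attr(bold, "val") not in ("0", "false")
    # w:sz is stored in half-points, so 14 pt appears as "28".
    size = _attr(props.find("w:sz", NS), "val")
    font = _attr(props.find("w:rFonts", NS), "ascii")
    return is_bold and size == "28" and font == "Times New Roman"


def read_docx(source):
    """Return a list of (kind, text) tuples from a .docx path or file object."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    result = []
    for para in root.iter(f"{{{W}}}p"):
        runs = list(para.iter(f"{{{W}}}r"))
        text = "".join(run_text(r) for r in runs).strip()
        if not text:
            continue  # skip empty paragraphs
        visible = [r for r in runs if run_text(r).strip()]
        kind = "header" if all(run_is_header(r) for r in visible) else "body"
        result.append((kind, text))
    return result
```

Calling `read_docx("report.docx")` (placeholder file name) would give a flat list of header and body entries in document order, which is a convenient intermediate structure before writing any output.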

For generating your output XML, a library that maps an object representation to XML again seems to be the way to go; minidom, for example, does that too.
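
A small sketch of the output side with `xml.dom.minidom`, assuming the parsing step has produced a list of `(kind, text)` pairs where `kind` is `"header"` or `"body"`. That input shape and the element names (`document`, `section`, `para`) are invented for illustration; use whatever schema your XML needs.

```python
from xml.dom import minidom


def to_xml(items):
    """Turn (kind, text) pairs into an XML string, one <section> per header."""
    doc = minidom.Document()
    root = doc.createElement("document")
    doc.appendChild(root)
    # Body text that appears before the first header goes directly under the root.
    current = root
    for kind, text in items:
        if kind == "header":
            current = doc.createElement("section")
            current.setAttribute("title", text)
            root.appendChild(current)
        else:
            para = doc.createElement("para")
            # createTextNode stores raw text; escaping happens on serialization.
            para.appendChild(doc.createTextNode(text))
            current.appendChild(para)
    return doc.toprettyxml(indent="  ")
```

For example, `to_xml([("header", "Intro"), ("body", "First paragraph.")])` yields a `document` element containing one `section` titled "Intro" with a single `para` inside it.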

If you don't want to deal with writing your own .doc parser, you could first run the documents through a converter that produces a more accessible format, such as using Word itself to convert the .doc files to .docx, or a tool that produces RDFs from .docs, or you could use an existing Word parser such as the one in OpenOffice.
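
One way to script the conversion step is sketched below. It swaps in LibreOffice (the OpenOffice successor) and its `soffice --headless --convert-to` command line, which is an assumption on top of the answer above; the function names and paths are placeholders, and `soffice` must be installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts one file to .docx."""
    return [
        soffice,
        "--headless",              # run without opening a window
        "--convert-to", "docx",    # target format
        "--outdir", str(out_dir),  # where the converted file is written
        str(doc_path),
    ]


def convert_to_docx(doc_path, out_dir, soffice="soffice"):
    """Convert a .doc file and return the expected path of the .docx result."""
    if shutil.which(soffice) is None:
        raise FileNotFoundError(f"{soffice!r} was not found on PATH")
    subprocess.run(build_convert_command(doc_path, out_dir, soffice), check=True)
    # LibreOffice keeps the base name and changes the extension.
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```

After that, every input can be treated as .docx, so only one parsing path needs to be maintained.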

answered Oct 08 '22 21:10

David Claridge

Tags:

python

parsing

ms-word

xml-serialization

Mikhail
