Extracting text from MS Word files in Python

2 Answers

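Use the native Python docx module (the python-docx package, imported as docx). It reads .docx files, not the older binary .doc format. Here's how to extract all the text from a document: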

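    import docx

    # filename is the path to a .docx file
    document = docx.Document(filename)
    # Join the text of every paragraph, separated by blank lines
    docText = '\n\n'.join(
        paragraph.text for paragraph in document.paragraphs
    )
    print(docText)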

See Python DocX site

Also check out Textract, which pulls out tables etc.
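
The paragraph loop above only walks document.paragraphs, so text inside tables is skipped. If you would rather stay with python-docx than add another dependency, here is a minimal sketch of collecting table text. The helper name table_text is made up for illustration; it assumes it is given a document object exposing tables, rows, cells and text the way python-docx does.

```python
def table_text(document):
    """Return every table as a list of rows, each row a list of cell strings.

    `document` is assumed to be a python-docx Document (or anything with
    the same .tables / .rows / .cells / .text attributes).
    """
    tables = []
    for table in document.tables:
        rows = []
        for row in table.rows:
            # cell.text joins the paragraphs inside the cell
            rows.append([cell.text.strip() for cell in row.cells])
        tables.append(rows)
    return tables
```

You would call it as table_text(docx.Document(filename)). Note that merged cells may show up more than once in a row.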

Parsing XML with regexes invokes Cthulhu. Don't do it!
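
If you do end up reading the XML inside a .docx yourself, use a real XML parser. The following is a rough sketch using only the standard library, assuming the usual .docx layout (a zip archive whose main body lives in word/document.xml). The function name docx_paragraphs is hypothetical, and this is not how python-docx itself is implemented.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file as a list of strings."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    # iter() walks the whole tree, so paragraphs inside tables are included too
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs
```

This ignores headers, footers, footnotes and tab or line-break elements, so treat it as a starting point rather than a full extractor.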

answered Sep 20 '22 23:09

mikemaccana

You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping text out of a Word doc. It works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as an RPM, or you could compile it yourself.
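
A minimal sketch of that subprocess call, assuming antiword is installed and prints the document text to standard output when given a file path. The function name doc_to_text and its binary parameter are made up for illustration, and the UTF-8 decoding is an assumption, since the output encoding depends on how antiword is configured.

```python
import shutil
import subprocess


def doc_to_text(path, binary="antiword"):
    """Return the plain text of a .doc file by running antiword on it."""
    # Fail early with a clear message if the tool is not on PATH
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    result = subprocess.run(
        [binary, path],
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    # Assumed encoding; undecodable bytes are replaced rather than raising
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.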

answered Sep 22 '22 23:09

John Fouhy


Tags: python, linux, ms-word

Badri
