How to extract text from an existing docx file using python-docx

Tags: python, python-3.x, python-2.7, python-docx

I'm trying to use the python-docx module (pip install python-docx), but it is confusing: the test sample in the GitHub repo uses an opendocx function, while the Read the Docs documentation uses a Document class. And the documentation only shows how to add text to a docx file, not how to read an existing one.

The first one (opendocx) does not work; it may be deprecated. For the second, I tried:

    from docx import Document

    document = Document('test_doc.docx')
    print(document.paragraphs)

It returned a list of <docx.text.Paragraph object at 0x... >

Then I did:

    for p in document.paragraphs:
        print(p.text)

It returned all the text, but a few things were missing: none of the URLs (CTRL+CLICK to go to URL) appeared in the console output.

What is the issue? Why are the URLs missing?

How can I get the complete text without iterating in a loop (something like open().read())?

204

asked Aug 10 '14 11:08

Nancy

1 Answer

You can try this:

    import docx

    def getText(filename):
        doc = docx.Document(filename)
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        return '\n'.join(fullText)
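The function above relies on para.text, which in older python-docx releases may leave out text that sits inside a hyperlink element. That would explain the missing URLs. As a rough sketch of one workaround, the following does not use python-docx at all: it reads word/document.xml straight from the .docx archive with the standard library and collects every text node in each paragraph, including those nested in hyperlinks. The name get_text_with_links is hypothetical. The sketch assumes a standard WordprocessingML layout and, unlike doc.paragraphs, also picks up paragraphs inside tables. Text boxes nested in a paragraph may be reported twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def get_text_with_links(docx_file):
    """Return the document text, one line per paragraph.

    Hypothetical helper: docx_file may be a path or a binary file-like object.
    """
    # A .docx file is a zip archive; the main body lives in word/document.xml
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for para in root.iter(W_NS + "p"):
        # iter() walks all descendants, so w:t nodes nested inside
        # w:hyperlink are collected along with ordinary runs
        lines.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(lines)
```

Calling get_text_with_links('test_doc.docx') returns a single string, which is close to the open().read() style the question asks for. Tabs and manual line breaks are not reproduced by this sketch.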

119

answered Sep 23 '22 22:09

Chinmoy Panda
