I'm trying to use the python-docx module (pip install python-docx), but it is confusing: the test sample in the GitHub repo uses an opendocx function, while the readthedocs documentation uses a Document class. Even then, the documentation only shows how to add text to a docx file, not how to read an existing one.
The first one (opendocx) is not working; it may be deprecated. For the second case I tried:
from docx import Document

document = Document('test_doc.docx')
print(document.paragraphs)
It returned a list of <docx.text.Paragraph object at 0x... >
Then I did:
for p in document.paragraphs:
    print(p.text)
It returned all the text, but a few things were missing. None of the URLs (the links you CTRL+CLICK to open) appeared in the text printed to the console.
What is the issue? Why are the URLs missing?
How can I get the complete text without iterating in a loop (something like open().read())?
To read a Word .docx file in Python, call docx.Document() and pass the filename, for example demo.docx. This returns a Document object, which has a paragraphs attribute that is a list of Paragraph objects.
To extract text from MS Word files in Python, we can also use the zipfile library. A .docx file is a ZIP archive, so we create a ZipFile object with the path string to the Word file, then call read with 'word/document.xml' to get the XML that holds the document body.
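The zipfile approach described above can be sketched roughly as follows, using only the standard library. The function name docx_to_text is hypothetical, and this is a minimal sketch rather than a complete parser: it assumes the body text lives in w:t elements of word/document.xml and joins them per paragraph. Because it walks every w:t element, it also picks up the visible text of runs nested inside w:hyperlink elements. The link target address itself is stored separately in the package's relationships part, so it is not returned here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the text of a .docx file, one line per paragraph.

    `path` may be a filename or a binary file-like object.
    (Hypothetical helper name, not part of python-docx.)
    """
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)

    lines = []
    # iter() walks all descendants, so paragraphs inside tables are included.
    for para in root.iter(W_NS + "p"):
        # w:t elements hold the text, including runs wrapped in w:hyperlink.
        lines.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(lines)
```

With the file from the question, this would be called as print(docx_to_text('test_doc.docx')). It returns the whole text as one string, although a loop still runs inside the helper.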
python-docx allows you to create new documents as well as make changes to existing ones. Actually, it only lets you make changes to existing documents; it's just that if you start with a document that doesn't have any content, it might feel at first like you're creating one from scratch.
You can try this:
import docx

def getText(filename):
    # Collect the text of every paragraph and join it with newlines
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)