
How to extract text from an existing docx file using python-docx

I'm trying to use the python-docx module (pip install python-docx), but it is confusing: the test sample in the GitHub repo uses an opendocx function, while the Read the Docs documentation uses a Document class. On top of that, the documentation only shows how to add text to a docx file, not how to read an existing one.

The first approach (opendocx) does not work and may be deprecated. For the second, I tried:

    from docx import Document

    document = Document('test_doc.docx')
    print(document.paragraphs)

It returned a list of <docx.text.Paragraph object at 0x...> objects.

Then I did:

    for p in document.paragraphs:
        print(p.text)

It returned all the text, but a few things were missing: none of the URLs (the hyperlinks you CTRL+CLICK to follow) appeared in the console output.

What is the issue? Why are the URLs missing?

How can I get the complete text without iterating in a loop (something like open().read())?

Nancy asked Aug 10 '14 11:08



People also ask

How do I read the contents of a docx file in python?

To read a Word .docx file in Python, call docx.Document() and pass the filename, for example demo.docx. This returns a Document object, which has a paragraphs attribute that is a list of Paragraph objects.

How do I extract text from a file in python?

To extract text from MS Word files in Python, we can use the zipfile library: create a ZipFile object with the path to the Word file, then call read with 'word/document.xml' to get the document's XML.
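The zipfile approach described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not python-docx's own method; the function name docx_text_via_zip is hypothetical. It assumes the body text lives in w:t elements inside w:p paragraphs of word/document.xml, and it ignores tabs, line breaks, headers, and footers.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text_via_zip(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper: reads word/document.xml directly instead of
    going through python-docx.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    # iter() walks the whole tree, so paragraphs inside tables are visited too.
    for para in root.iter(W_NS + "p"):
        # Join every w:t descendant, including runs nested in w:hyperlink.
        lines.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(lines)
```

Because the text is collected from every w:t descendant of a paragraph, the visible text of hyperlinks is included as well. Calling docx_text_via_zip('test_doc.docx') returns one string, which is close to the open().read() style asked about in the question.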

How does python-docx work?

python-docx allows you to create new documents as well as make changes to existing ones. Actually, it only lets you make changes to existing documents; it's just that if you start with a document that doesn't have any content, it might feel at first like you're creating one from scratch.


1 Answer

You can try this:

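    import docx

    def getText(filename):
        # Open the existing document and collect the text of each paragraph.
        doc = docx.Document(filename)
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        # Join the paragraphs into a single newline-separated string.
        return '\n'.join(fullText)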
Chinmoy Panda answered Sep 23 '22 22:09

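On the missing URLs, one possible explanation (an assumption, not something this thread confirms) is that the paragraph text in some python-docx versions is built only from runs that sit directly in the paragraph, while hyperlink text is wrapped in a separate hyperlink element. The link targets themselves are normally stored outside the body, in the package's relationships part. The sketch below reads that part with the standard library; the function name docx_hyperlink_targets is hypothetical, and the targets come back in relationship-file order, not document order.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the OPC relationships part (word/_rels/document.xml.rels).
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
# Relationship type that marks a hyperlink target.
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def docx_hyperlink_targets(path):
    """Return the hyperlink URLs referenced by the main document part.

    Hypothetical helper: returns an empty list when the document has no
    relationships part.
    """
    with zipfile.ZipFile(path) as archive:
        try:
            rels_xml = archive.read("word/_rels/document.xml.rels")
        except KeyError:
            # No relationships part, so there are no hyperlink targets.
            return []

    root = ET.fromstring(rels_xml)
    return [
        rel.get("Target")
        for rel in root.iter(REL_NS + "Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    ]
```

For example, docx_hyperlink_targets('test_doc.docx') would list the URLs that the getText function above does not return.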