Problem statement: Extract sections from a .docx file, including the autonumbering.
I tried python-docx to extract the text from the .docx file, but it leaves out the autonumbering.
from docx import Document

document = Document("wadali.docx")

# Style-name prefixes of the paragraphs to extract.
STYLE_PREFIXES = ('Agt', 'TOC', 'Heading', 'Title', 'Table Normal', 'List')

def iter_items(paragraphs):
    for paragraph in paragraphs:
        # str.startswith accepts a tuple, so each paragraph is yielded at most once.
        if paragraph.style.name.startswith(STYLE_PREFIXES):
            yield paragraph

for item in iter_items(document.paragraphs):
    print(item.text)
To read a Word .docx file in Python, call docx.Document() and pass it the filename, for example demo.docx. This returns a Document object whose paragraphs attribute is a list of Paragraph objects.
Word documents contain formatted text organized in three levels of objects: Run objects at the lowest level, Paragraph objects in the middle, and the Document object at the top. A .docx file is therefore not something you can work with in a plain text editor, but you can manipulate it in Python with the python-docx module.
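To make the three levels concrete, here is a minimal sketch that uses only the standard library rather than python-docx. It assumes the usual WordprocessingML layout, in which the document body holds w:p (paragraph) elements and each paragraph holds w:r (run) elements. The helper name paragraph_runs and the sample XML are made up for illustration; neither comes from python-docx.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def paragraph_runs(document_xml):
    """Return one list per top-level paragraph, holding the text of each run."""
    body = ET.fromstring(document_xml).find("w:body", NS)   # document level
    result = []
    for p in body.findall("w:p", NS):                       # paragraph level
        runs = []
        for r in p.findall("w:r", NS):                      # run level
            runs.append("".join(t.text or "" for t in r.findall("w:t", NS)))
        result.append(runs)
    return result


# Hand-written sample, not taken from a real file.
SAMPLE = (
    '<w:document xmlns:w="' + W_NS + '"><w:body>'
    '<w:p><w:r><w:t>Hello, </w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>world</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    '</w:body></w:document>'
)

for runs in paragraph_runs(SAMPLE):
    print(runs)
```

A run is a stretch of text that shares the same character formatting, which is why the first sample paragraph splits into two runs (the second one is bold).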
python-docx allows you to create new documents as well as make changes to existing ones. Actually, it only lets you make changes to existing documents; it's just that if you start with a document that doesn't have any content, it might feel at first like you're creating one from scratch.
It appears that currently python-docx v0.8 does not fully support numbering. You need to do some hacking.
First, for the demo, to iterate the document paragraphs, you need to write your own iterator. Here is something functional:
import docx.document
import docx.oxml.table
import docx.oxml.text.paragraph
import docx.table
import docx.text.paragraph


def iter_paragraphs(parent, recursive=True):
    """
    Yield each paragraph within *parent*, in document order, including the
    paragraphs nested in table cells when *recursive* is true.
    Each returned value is an instance of Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, docx.document.Document):
        parent_elm = parent.element.body
    elif isinstance(parent, docx.table._Cell):
        parent_elm = parent._tc
    else:
        raise TypeError(repr(type(parent)))

    for child in parent_elm.iterchildren():
        if isinstance(child, docx.oxml.text.paragraph.CT_P):
            yield docx.text.paragraph.Paragraph(child, parent)
        elif isinstance(child, docx.oxml.table.CT_Tbl):
            if recursive:
                # Descend into the table and yield the paragraphs of each cell.
                table = docx.table.Table(child, parent)
                for row in table.rows:
                    for cell in row.cells:
                        for child_paragraph in iter_paragraphs(cell):
                            yield child_paragraph
You can use it to find all document paragraphs including paragraphs in table cells.
For instance:
import docx

document = docx.Document("sample.docx")
for paragraph in iter_paragraphs(document):
    print(paragraph.text)
To access the numbering property, you need to search in the "protected" members paragraph._p.pPr.numPr, which is a docx.oxml.numbering.CT_NumPr object:
for paragraph in iter_paragraphs(document):
    p_pr = paragraph._p.pPr  # None when the paragraph has no properties element
    num_pr = p_pr.numPr if p_pr is not None else None
    if num_pr is not None:
        print(num_pr)  # type: docx.oxml.numbering.CT_NumPr
Note that numPr only refers, through its numId and ilvl values, to a numbering definition; the definitions themselves are stored in the numbering.xml file (inside the docx), if it exists.
To access it, you need to read your docx file like a package. For instance:
import docx.package
import docx.parts.document
import docx.parts.numbering

package = docx.package.Package.open("sample.docx")

main_document_part = package.main_document_part
assert isinstance(main_document_part, docx.parts.document.DocumentPart)

numbering_part = main_document_part.numbering_part
assert isinstance(numbering_part, docx.parts.numbering.NumberingPart)

ct_numbering = numbering_part._element
print(ct_numbering)  # CT_Numbering
for num in ct_numbering.num_lst:
    print(num)  # CT_Num
    print(num.abstractNumId)  # CT_DecimalNumber
More information is available in the Office Open XML documentation.
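To show how the pieces fit together, here is a rough sketch that rebuilds the number labels from the raw XML with only the standard library. It is not python-docx's API and not Word's full algorithm. It assumes the conventional part names word/document.xml and word/numbering.xml, treats every level as decimal, keeps one set of counters per numId, and ignores level overrides, style-based numbering and other number formats. All function names here are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
VAL = "{%s}val" % W_NS


def _val(parent, tag, default=None):
    """Return the w:val attribute of the child element *tag*, or *default*."""
    child = parent.find(tag, NS)
    return child.get(VAL) if child is not None else default


def load_levels(numbering_xml):
    """Map numId -> {ilvl: (lvlText, start)} from the content of numbering.xml."""
    root = ET.fromstring(numbering_xml)
    abstract = {}
    for abstract_num in root.findall("w:abstractNum", NS):
        levels = {}
        for lvl in abstract_num.findall("w:lvl", NS):
            ilvl = int(lvl.get("{%s}ilvl" % W_NS))
            levels[ilvl] = (_val(lvl, "w:lvlText", ""), int(_val(lvl, "w:start", "1")))
        abstract[abstract_num.get("{%s}abstractNumId" % W_NS)] = levels
    return {
        num.get("{%s}numId" % W_NS): abstract.get(_val(num, "w:abstractNumId"), {})
        for num in root.findall("w:num", NS)
    }


def iter_labelled_paragraphs(document_xml, numbering_xml):
    """Yield (label, text) per paragraph; label is '' for unnumbered paragraphs."""
    nums = load_levels(numbering_xml)
    counters = {}  # numId -> {ilvl: current counter value}
    body = ET.fromstring(document_xml).find("w:body", NS)
    for p in body.iter("{%s}p" % W_NS):
        text = "".join(t.text or "" for t in p.iter("{%s}t" % W_NS))
        label = ""
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is not None:
            num_id = _val(num_pr, "w:numId")
            ilvl = int(_val(num_pr, "w:ilvl", "0"))
            levels = nums.get(num_id, {})
            if ilvl in levels:
                state = counters.setdefault(num_id, {})
                # Advance this level and restart every deeper level.
                state[ilvl] = state[ilvl] + 1 if ilvl in state else levels[ilvl][1]
                for deeper in [k for k in state if k > ilvl]:
                    del state[deeper]
                # "%1" stands for level 0's counter, "%2" for level 1's, and so on.
                label = levels[ilvl][0]
                for k in levels:
                    label = label.replace("%%%d" % (k + 1), str(state.get(k, levels[k][1])))
        yield label, text


def labelled_paragraphs_from_docx(path):
    """Read both parts from a .docx (a zip archive) and label its paragraphs."""
    with zipfile.ZipFile(path) as archive:
        # Conventional part names; strictly, the package relationships define them.
        document_xml = archive.read("word/document.xml")
        try:
            numbering_xml = archive.read("word/numbering.xml")
        except KeyError:  # the document has no numbering definitions
            numbering_xml = '<w:numbering xmlns:w="%s"/>' % W_NS
    return list(iter_labelled_paragraphs(document_xml, numbering_xml))
```

Calling labelled_paragraphs_from_docx("sample.docx") would return pairs such as ("1.1", "Scope") for a decimal multilevel list. Bullet levels come back as whatever character their lvlText holds, and formats such as lowerLetter or upperRoman would still show plain digits until a conversion step is added.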
There is a package, docx2python, which does this much more simply: pypi.org/project/docx2python/
The following code:
from docx2python import docx2python
document = docx2python("C:/input/MyDoc.docx")
print(document.body)
produces a list which contains the contents including bullet lists in a nice parse-able fashion.
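Since the result is a nested list, a small helper can flatten it into plain lines for further parsing. The sketch below assumes that document.body is made of lists nested inside lists with strings at the bottom (tables, rows, cells and paragraphs, as far as I know); iter_strings is a hypothetical helper, and sample_body is a hand-written stand-in rather than real docx2python output.

```python
def iter_strings(nested):
    """Yield every non-blank string found in arbitrarily nested lists, in order."""
    for item in nested:
        if isinstance(item, str):
            if item.strip():
                yield item
        else:
            # Anything that is not a string is assumed to be another list.
            yield from iter_strings(item)


# Hand-written stand-in for document.body (an assumption, not real output).
sample_body = [[[["1)\tFirst item", "2)\tSecond item"]], [["", "--\tA bullet"]]]]

for line in iter_strings(sample_body):
    print(line)
```

With a real file you would pass document.body instead of sample_body. Because the helper recurses to any depth, it should not matter exactly how many levels of nesting your docx2python version produces.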