Using OpenXML, can I read the document content by page number?
wordDocument.MainDocumentPart.Document.Body
gives content of full document.
public void OpenWordprocessingDocumentReadonly()
{
string filepath = @"C:\...\test.docx";
// Open a WordprocessingDocument based on a filepath.
using (WordprocessingDocument wordDocument =
WordprocessingDocument.Open(filepath, false))
{
// Assign a reference to the existing document body.
Body body = wordDocument.MainDocumentPart.Document.Body;
int pageCount = 0;
if (wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text != null)
{
pageCount = Convert.ToInt32(wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text);
}
for (int i = 1; i <= pageCount; i++)
{
//Read the content by page number
}
}
}
MSDN Reference
Update 1:
it looks like page breaks are set as below
<w:p w:rsidR="003328B0" w:rsidRDefault="003328B0">
<w:r>
<w:br w:type="page" />
</w:r>
</w:p>
So now I need to split the XML with above check and take InnerTex
for each, that will give me page vise text.
Now question becomes how can I split the XML with above check?
Update 2:
Page breaks are set only when you have page breaks, but if text is floating from one page to other pages, then there is no page break XML element is set, so it revert back to same challenge how o identify the page separations.
Go to your Solution Explorer > right click on references and then click Manage NuGet Packages . Then search in tab "Online" for DocumentFormat. OpenXml and install it. After you can use DocumentFormat.
0 on c:\program files\open xml sdk\v2\lib\DocumentFormat. OpenXml. dll.
The Open XML SDK provides tools for working with Office Word, Excel, and PowerPoint documents. It supports scenarios such as: - High-performance generation of word-processing documents, spreadsheets, and presentations. - Populating content in Word files from an XML data source.
Today MS Open Tech has announced the release of the Open XML SDK version 2.5 as open source software (Apache 2.0 license) under the stewardship of the . NET Foundation.
You cannot reference OOXML content via page numbering at the OOXML data level alone.
What about w:lastRenderedPageBreak
, which is a record of the position of a soft page break at the time the document was last rendered? No, w:lastRenderedPageBreak
does not help in general either because:
w:lastRenderedPageBreak
position is stale when content has
been changed since last opened by a program that paginates its
content.w:lastRenderedPageBreak
is known to be unreliable in various circumstances including
If you're willing to accept a dependence on Word Automation, with all of its inherent licensing and server operation limitations, then you have a chance of determining page boundaries, page numberings, page counts, etc.
Otherwise, the only real answer is to move beyond page-based referencing frameworks that are dependent upon proprietary, implementation-specific pagination algorithms.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With