How to access OpenXML content by page number?

Tags: c#, xml, docx, openxml, openxml-sdk

Using OpenXML, can I read the document content by page number?

wordDocument.MainDocumentPart.Document.Body gives content of full document.

    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    public void OpenWordprocessingDocumentReadonly()
    {
        string filepath = @"C:\...\test.docx";
        // Open a WordprocessingDocument based on a filepath (false = read-only).
        using (WordprocessingDocument wordDocument =
            WordprocessingDocument.Open(filepath, false))
        {
            // Assign a reference to the existing document body (the whole document).
            Body body = wordDocument.MainDocumentPart.Document.Body;
            // The extended properties part or its Pages element may be missing,
            // so guard against null before parsing the stored page count.
            int pageCount = 0;
            string pagesText =
                wordDocument.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;
            if (!string.IsNullOrEmpty(pagesText))
            {
                int.TryParse(pagesText, out pageCount);
            }
            for (int i = 1; i <= pageCount; i++)
            {
                // Read the content by page number
            }
        }
    }

MSDN Reference

Update 1:

it looks like page breaks are set as below

    <w:p w:rsidR="003328B0" w:rsidRDefault="003328B0">
        <w:r>
            <w:br w:type="page" />
        </w:r>
    </w:p>

So now I need to split the XML on the above element and take the InnerText of each part, which would give me the text page by page.

The question now becomes: how can I split the XML on that element?

Update 2:

A page break element is present only where an explicit page break was inserted. When text simply flows from one page to the next, no page break element is written, so I am back to the same challenge: how to identify the page separations.

asked Oct 12 '16 07:10

HaBo

1 Answer

You cannot reference OOXML content via page numbering at the OOXML data level alone.

Hard page breaks are not the problem; hard page breaks can be counted.
Soft page breaks are the problem. These are calculated by line-breaking and pagination algorithms that are implementation dependent; they are not intrinsic to the OOXML data. There is nothing to count.
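
To illustrate the first point, here is a minimal sketch of splitting body text on hard page breaks. It works on the raw WordprocessingML with System.Xml.Linq; the class and method names are hypothetical, and it deliberately checks only `<w:br w:type="page"/>`. Other constructs that force a new page (for example the w:pageBreakBefore paragraph property or next-page section breaks) are not handled, and soft page breaks are invisible to it.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

public static class HardPageBreakSplitter
{
    // Main WordprocessingML namespace (the "w:" prefix).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one text segment per stretch of content delimited by
    // <w:br w:type="page"/>. Segments are NOT real pages: soft page
    // breaks are not represented in the data this method inspects.
    public static List<string> SplitOnHardPageBreaks(XElement body)
    {
        var segments = new List<string>();
        var current = new StringBuilder();

        // Descendants() walks the tree in document order.
        foreach (XElement el in body.Descendants())
        {
            if (el.Name == W + "p")
            {
                // Separate paragraphs with a newline inside a segment.
                if (current.Length > 0) current.Append('\n');
            }
            else if (el.Name == W + "t")
            {
                current.Append(el.Value);
            }
            else if (el.Name == W + "br" &&
                     (string)el.Attribute(W + "type") == "page")
            {
                // Hard page break: close the current segment.
                segments.Add(current.ToString().Trim());
                current.Clear();
            }
        }

        segments.Add(current.ToString().Trim());
        return segments;
    }
}
```

With the Open XML SDK, the input could presumably be obtained with `XElement.Parse(wordDocument.MainDocumentPart.Document.Body.OuterXml)`.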

What about w:lastRenderedPageBreak, which is a record of the position of a soft page break at the time the document was last rendered? No, w:lastRenderedPageBreak does not help in general either because:

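By definition, the w:lastRenderedPageBreak position is stale whenever content has changed since the document was last opened by a program that paginates it.
In MS Word's implementation, w:lastRenderedPageBreak is known to be unreliable in various circumstances, including:
1. when a table spans two pages
2. when the next page starts with an empty paragraph
3. for multi-column layouts with text boxes starting a new column (https://social.msdn.microsoft.com/Forums/office/en-US/3d0a3027-3062-4ecf-9e74-50cea3ae9a60/lastrenderedpagebreak-and-columns-problem?forum=oxmlsdk)
4. for large images or long sequences of blank lines (https://social.msdn.microsoft.com/Forums/office/en-US/3d0a3027-3062-4ecf-9e74-50cea3ae9a60/lastrenderedpagebreak-and-columns-problem?forum=oxmlsdk)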
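
If a best-effort estimate is still acceptable despite these caveats, the markers can be read as in the sketch below. It is only an approximation under stated assumptions: the names are hypothetical, it counts only w:lastRenderedPageBreak elements (hard page breaks are not combined in), it labels each paragraph with the page on which it appears to start, and it inherits every staleness and reliability problem listed above.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

public static class RenderedPageEstimator
{
    // Main WordprocessingML namespace (the "w:" prefix).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns (estimated start page, text) for each non-empty paragraph,
    // based solely on w:lastRenderedPageBreak markers saved in the file.
    public static List<(int Page, string Text)> EstimateParagraphPages(XElement body)
    {
        var result = new List<(int Page, string Text)>();
        int page = 1;

        foreach (XElement p in body.Descendants(W + "p"))
        {
            int startPage = page;
            var text = new StringBuilder();

            foreach (XElement el in p.Descendants())
            {
                if (el.Name == W + "t")
                {
                    text.Append(el.Value);
                }
                else if (el.Name == W + "lastRenderedPageBreak")
                {
                    page++;
                    // Marker before any text: the paragraph itself
                    // starts on the new page.
                    if (text.Length == 0) startPage = page;
                }
            }

            if (text.Length > 0) result.Add((startPage, text.ToString()));
        }

        return result;
    }
}
```

A paragraph that spans a page boundary is reported once, under its starting page; paragraphs nested in text boxes would be visited twice by this simple traversal.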

If you're willing to accept a dependence on Word Automation, with all of its inherent licensing and server operation limitations, then you have a chance of determining page boundaries, page numberings, page counts, etc.

Otherwise, the only real answer is to move beyond page-based referencing frameworks that are dependent upon proprietary, implementation-specific pagination algorithms.
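
For the Word Automation route mentioned above, a rough sketch might look like the following. It assumes Word is installed and the project references Microsoft.Office.Interop.Word; the class and method names are hypothetical. It asks Word itself for the page on which each paragraph ends, so the result reflects Word's own pagination on that machine.

```csharp
using System.Collections.Generic;
using Word = Microsoft.Office.Interop.Word;

public static class WordPageReader
{
    // Groups paragraph text by the page number Word reports for the
    // END of each paragraph (wdActiveEndPageNumber).
    public static SortedDictionary<int, List<string>> ReadParagraphsByPage(string path)
    {
        var pages = new SortedDictionary<int, List<string>>();
        var app = new Word.Application();
        Word.Document doc = null;
        try
        {
            doc = app.Documents.Open(path, ReadOnly: true);
            foreach (Word.Paragraph para in doc.Paragraphs)
            {
                Word.Range range = para.Range;
                int page = (int)range.Information[
                    Word.WdInformation.wdActiveEndPageNumber];

                if (!pages.TryGetValue(page, out List<string> list))
                {
                    list = new List<string>();
                    pages[page] = list;
                }
                list.Add(range.Text);
            }
        }
        finally
        {
            // Casts avoid the method/event name ambiguity on Close and Quit.
            if (doc != null) ((Word._Document)doc).Close(SaveChanges: false);
            ((Word._Application)app).Quit();
        }
        return pages;
    }
}
```

This is subject to the licensing and server-side limitations noted above, and the page numbers can differ between Word versions, printers, and installed fonts.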

answered Sep 23 '22 06:09

kjhughes
