
Why does Office OpenXML split text between tags, and how can I prevent it?

I'm currently trying to work with docx files using the PHPWord library and its templating system. I found and updated someone's patch for this library (I can't remember whose, but that's not important) that can work with tables: it replicates a table's rows and then uses the standard setValue() from PHPWord on each row.

If I create my own document, the XML has a normal structure, so the variable to be replaced, ${variable}, sits in its own tag like this:

<w:tbl>
    <w:tr>
        ...
         ${variable}
    </w:tr>
</w:tbl>

I simplified the code; the actual XML contains a number of other tags describing sizes, styles, etc.

My problem is that I have to process documents from other people, and I am not allowed to make big changes to them. I get a document that at some point contains a table with one blank row. I add the ${variable} placeholders and run it through PHPWord, and it fails. After doing some research, I found out that the source XML looks like this:

    ....
        ...
         ${va

        ...
         riab

        ...
         le}
    ....

(again heavily simplified, but you get the picture)

This structure is a problem for me because the row-cloning function relies on strpos(), substr() and regular expressions, which do not work with this structure (and I can't imagine an elegant way to make them work).

So the question is: does anybody know why docx does this and how to prevent it? I am looking for a solution in Word, not PHP (I need the current functions to work without much editing).

j0hny asked Mar 23 '23 03:03


2 Answers

I have worked with this problem a lot.

In Word, the document can be saved like this:

  <w:t>{</w:t>...
  <w:t>variable</w:t>
  <w:t>}</w:t>

I have therefore created a JS library that works even if variable names are split: Docxtemplater (it works server-side too). Here is what I found out during development about when variable names get split:

  • Variable names are not split when the text to find consists only of a-zA-Z characters (no {, $ or }).
  • The text may be split if it wasn't written in one stroke. For example, if you make a spelling mistake and write ${varuable}, then edit it to ${variable}, the text inside the XML is very likely to be split. Basically, you have to write your variable names in one stroke, and if you wish to edit one, retype the variable name completely.

I don't think there's a way to fix a docx document with one command in Word, but retyping each variable in one stroke should work.
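If retyping is not practical, the split pieces can also be rejoined programmatically before the template is processed. The following is only a minimal Python sketch of that idea, not how Docxtemplater or PHPWord work internally. It assumes placeholders have the form ${name}, that a placeholder never crosses a paragraph boundary, and that it is acceptable for the whole placeholder to take the formatting of the run it starts in. The function name merge_split_placeholders and the SAMPLE fragment are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
PLACEHOLDER = re.compile(r"\$\{[^}]*\}")  # assumed placeholder syntax: ${name}

# Hypothetical paragraph in which Word has split ${variable} over three runs.
SAMPLE = (
    f'<w:p xmlns:w="{W_NS}">'
    '<w:r><w:t>${va</w:t></w:r>'
    '<w:proofErr w:type="spellStart"/>'
    '<w:r><w:t>riab</w:t></w:r>'
    '<w:proofErr w:type="spellEnd"/>'
    '<w:r><w:t>le}</w:t></w:r>'
    '</w:p>'
)


def merge_split_placeholders(paragraph):
    """Move each ${...} that spans several <w:t> nodes into the node where it starts.

    Returns the number of placeholders that had to be merged.
    """
    nodes = paragraph.findall(".//w:t", NS)
    joined = "".join(node.text or "" for node in nodes)

    # owner[i] is the index of the <w:t> node that holds character i of `joined`.
    owner = []
    for idx, node in enumerate(nodes):
        owner.extend([idx] * len(node.text or ""))

    merged = 0
    for match in PLACEHOLDER.finditer(joined):
        first = owner[match.start()]
        if owner[match.end() - 1] != first:
            merged += 1
        # Reassign every character of the placeholder to its first node.
        for i in range(match.start(), match.end()):
            owner[i] = first

    # Rebuild the text of each node from the (possibly reassigned) characters.
    pieces = [[] for _ in nodes]
    for char, idx in zip(joined, owner):
        pieces[idx].append(char)
    for node, chars in zip(nodes, pieces):
        node.text = "".join(chars)
    return merged


if __name__ == "__main__":
    paragraph = ET.fromstring(SAMPLE)
    print(merge_split_placeholders(paragraph))
    print([t.text for t in paragraph.findall(".//w:t", NS)])
```

The sketch leaves the now-empty runs in place and does not touch xml:space attributes, so a real document would need more care than this.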

edi9999 answered May 07 '23 10:05


The primary cause of this is the proofErr element: when Word identifies something that it deems misspelled, it marks it with <w:proofErr> elements, inevitably splitting the original text.

If this happens to you, I recommend the following. It's tedious, but it's the only sure-fire way:

  1. Rename .docx to .zip.
  2. Extract the contents of the archive.
  3. Open word\document.xml.
  4. Make the corrections (i.e. put the split text back together) and save.
  5. Compress the contents back into a .zip archive and rename it to .docx.
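The unzip, edit and repackage round trip above can also be scripted. What follows is a rough Python sketch using only the standard zipfile module; it assumes the main document part is stored as word/document.xml and is UTF-8 encoded. The function name rewrite_document_xml and the transform callback are hypothetical: the callback receives the XML as a string and must return the corrected XML, so the actual fix is still up to you.

```python
import zipfile

DOCUMENT_PART = "word/document.xml"  # main body part inside the .docx package


def rewrite_document_xml(src_path, dst_path, transform):
    """Copy a .docx to dst_path, passing word/document.xml through transform().

    All other parts of the package are copied unchanged.
    """
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == DOCUMENT_PART:
                # Assumption: the part is UTF-8, as Word normally writes it.
                data = transform(data.decode("utf-8")).encode("utf-8")
            dst.writestr(item, data)
```

For example, rewrite_document_xml("in.docx", "out.docx", fix) writes a corrected copy and leaves the original file untouched, where fix is whatever function puts the split text back together.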

EDIT

This Visual Studio Extension lets you edit the contents of the OpenXML package directly. This allows you to skip steps 1 & 2.

pim answered May 07 '23 11:05