I have been using Apache POI to manipulate Microsoft Word .docx files — ie open a document that was originally created in Microsoft Word, modify it, save it to a new document.
I notice that new paragraphs created by Apache POI are missing a Revision Save ID, often known as an RSID or rsidR. This is used by Word to identify changes made to a document in one session, say between saves. It is optional — users could turn it off in Microsoft Word if they want — but in reality almost everyone has it on so almost every document is fulls of RSIDs. Read this excellent explanation of RSIDs for more about that.
In a Microsoft Word document, word/document.xml
contains paragraphs like this:
<w:p w:rsidR="007809A1" w:rsidRDefault="007809A1" w:rsidP="00191825">
<w:r>
<w:t>Paragraph of text here.</w:t>
</w:r>
</w:p>
However the same paragraph created by POI will look like this in word/document.xml
:
<w:p>
<w:r>
<w:t>Paragraph of text here.</w:t>
</w:r>
</w:p>
I've figured out that I can force POI to add an RSID to each paragraph using code like this:
byte[] rsid = ???;
XWPFParagraph paragraph = document.createParagraph();
paragraph.getCTP().setRsidR(rsid);
paragraph.getCTP().setRsidRDefault(rsid);
However I don't know how I should be generating the RSIDs.
Does POI have a way or generate and/or keep track of RSIDs? If not, is there any way I can ensure that an RSID that I generate doesn't conflict with one that's already in the document?
DOCX was originally developed by Microsoft as an XML-based format to replace the proprietary binary format that uses the . doc file extension. Since Word 2007, DOCX has been the default format for the Save operation.
The DOCX is a smaller document file format than the DOC, making it convenient to send via email and store on a hard drive. The DOCX is a compressed file, meaning it's shrunken in size to reduce its impact on storage space. A DOCX is opened either using Microsoft Word or alternative, third-party programs.
Starting with the 2007 Microsoft Office system, Microsoft Office uses the XML-based file formats, such as . docx, . xlsx, and .
It looks like the list of valid rsid entries is held in word/settings.xml in the <w:rsids>
entry. XWPF should be able to give you access to that already.
You'd probably want to generate a 8 hex digit long random number, check if that's in there, and re-generate if it is. Once you have a unique one, add it into that list, then tag your paragraphs with it.
What I'd suggest is that you join the poi dev list (mailing list details), and we can give you a hand on working up a patch for it. I think the things to do are:
We should take this to the dev list though :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With