
How to generate RSID attributes correctly in Word .docx files using Apache POI?

I have been using Apache POI to manipulate Microsoft Word .docx files: I open a document that was originally created in Microsoft Word, modify it, and save it to a new document.

I notice that new paragraphs created by Apache POI are missing a Revision Save ID, often known as an RSID or rsidR. Word uses this to identify changes made to a document in one session, say between saves. It is optional (users can turn it off in Microsoft Word if they want), but in practice almost everyone has it on, so almost every document is full of RSIDs. Read this excellent explanation of RSIDs for more about that.

In a Microsoft Word document, word/document.xml contains paragraphs like this:

<w:p w:rsidR="007809A1" w:rsidRDefault="007809A1" w:rsidP="00191825">
  <w:r>
    <w:t>Paragraph of text here.</w:t>
  </w:r>
</w:p>

However the same paragraph created by POI will look like this in word/document.xml:

<w:p>
  <w:r>
    <w:t>Paragraph of text here.</w:t>
  </w:r>
</w:p>

I've figured out that I can force POI to add an RSID to each paragraph using code like this:

    byte[] rsid = ???;
    XWPFParagraph paragraph = document.createParagraph();
    paragraph.getCTP().setRsidR(rsid);
    paragraph.getCTP().setRsidRDefault(rsid);

However I don't know how I should be generating the RSIDs.

Does POI have a way to generate and/or keep track of RSIDs? If not, is there any way I can ensure that an RSID that I generate doesn't conflict with one that's already in the document?

asked Feb 11 '11 06:02 by gutch



1 Answer

It looks like the list of valid rsid entries is held in word/settings.xml in the <w:rsids> entry. XWPF should be able to give you access to that already.
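As a rough illustration of reading that list without going through XWPF, here is a minimal sketch that parses the XML of word/settings.xml with the JDK's own DOM parser and collects the w:val of each <w:rsidRoot> and <w:rsid> child of <w:rsids>. The class and method names (RsidReader, readRsids) are made up for this example and are not part of POI.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper, not a POI class: collects the RSIDs listed in settings.xml.
final class RsidReader {
    // WordprocessingML main namespace (the "w:" prefix in the question's XML).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the RSIDs found under <w:rsids>, as upper-case hex strings.
    static Set<String> readRsids(InputStream settingsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(settingsXml);

            Set<String> rsids = new LinkedHashSet<>();
            NodeList containers = doc.getElementsByTagNameNS(W_NS, "rsids");
            for (int i = 0; i < containers.getLength(); i++) {
                NodeList children = containers.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (!(child instanceof Element)) {
                        continue; // skip whitespace text nodes
                    }
                    Element el = (Element) child;
                    String name = el.getLocalName();
                    boolean isRsid = "rsid".equals(name) || "rsidRoot".equals(name);
                    String val = el.getAttributeNS(W_NS, "val");
                    if (W_NS.equals(el.getNamespaceURI()) && isRsid && !val.isEmpty()) {
                        rsids.add(val.toUpperCase(Locale.ROOT));
                    }
                }
            }
            return rsids;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse settings.xml", e);
        }
    }
}
```

Since a .docx is a zip archive, the input stream could come from java.util.zip.ZipFile, e.g. zip.getInputStream(zip.getEntry("word/settings.xml")).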

You'd probably want to generate a random number 8 hex digits long, check whether it's already in that list, and re-generate if it is. Once you have a unique one, add it to the list, then tag your paragraphs with it.
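The generate-check-retry loop could look something like the sketch below. It uses only the JDK and assumes you have already collected the document's existing RSIDs as hex strings. RsidGenerator, next and toHex are names invented for this example, not POI API. Four random bytes correspond to the 8 hex digits, and a byte[] is what the setRsidR call in the question takes.

```java
import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not a POI class: hands out RSIDs not seen before.
final class RsidGenerator {
    private final Set<String> known = new HashSet<>();
    private final SecureRandom random = new SecureRandom();

    // existing: RSIDs already present in the document, as 8-digit hex strings.
    RsidGenerator(Set<String> existing) {
        for (String rsid : existing) {
            known.add(rsid.toUpperCase(Locale.ROOT));
        }
    }

    // Returns a fresh 4-byte RSID and remembers it so it is never returned again.
    byte[] next() {
        byte[] candidate = new byte[4];
        String hex;
        do {
            random.nextBytes(candidate);
            hex = toHex(candidate);
        } while (!known.add(hex)); // add() returns false on a collision, so retry
        return candidate;
    }

    // Formats bytes as upper-case hex, two digits per byte (e.g. 007809A1).
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}
```

The byte[] from next() would go to setRsidR and setRsidRDefault as in the question, and toHex gives the string form to add to the <w:rsids> list in word/settings.xml. This sketch does not write that list for you.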

What I'd suggest is that you join the poi dev list (mailing list details), and we can give you a hand on working up a patch for it. I think the things to do are:

  • A wrapper around the RSids entry in word/settings.xml, to let you easily fetch the list and generate a new (unique) one
  • A wrapper around the different RSid entries on a paragraph and a run
  • Methods on paragraphs and runs to get the RSid wrapper, add a new one, or clear the existing one

We should take this to the dev list though :)

answered Oct 26 '22 19:10 by Gagravarr