The XHTMLImporter from docx4j is not converting &nbsp; into MS Word non-breaking spaces.
The following code is used:
import java.io.File;

import org.apache.commons.io.FileUtils;
import org.docx4j.XmlUtils;
import org.docx4j.convert.in.xhtml.XHTMLImporter;
import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.jaxb.Context;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.NumberingDefinitionsPart;
import org.docx4j.wml.RFonts;

public void convert() throws Exception {
    String stringFromFile = FileUtils.readFileToString(new File("tmp.xhtml"), "UTF-8");
    String unescaped = stringFromFile;
    System.out.println("Unescaped: " + unescaped);

    // Set up font mapping
    RFonts rfonts = Context.getWmlObjectFactory().createRFonts();
    rfonts.setAscii("Century Gothic");
    XHTMLImporterImpl.addFontMapping("Century Gothic", rfonts);

    // Create an empty docx package with a default numbering part
    WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
    NumberingDefinitionsPart ndp = new NumberingDefinitionsPart();
    wordMLPackage.getMainDocumentPart().addTargetPart(ndp);
    ndp.unmarshalDefaultNumbering();

    // Convert the XHTML, and add it to the empty docx we made
    XHTMLImporter importer = new XHTMLImporterImpl(wordMLPackage);
    importer.setHyperlinkStyle("Hyperlink");
    wordMLPackage.getMainDocumentPart().getContent().addAll(
            importer.convert(unescaped, null));

    // Print the resulting document XML, then save the docx
    System.out.println(
            XmlUtils.marshaltoString(wordMLPackage.getMainDocumentPart().getJaxbElement(), true, true));
    wordMLPackage.save(new File("OUT_from_XHTML.docx"));
}
When the XHTML input looks like this (the mso-spacerun spans contain &nbsp; entities):
<p style="LINE-HEIGHT: 120%; MARGIN: 0in 0in 0pt"
class="MsoNormal"><span
style="LINE-HEIGHT: 120%; FONT-FAMILY: 'Courier New'; FONT-SIZE: 10pt; mso-fareast-font-family: 'Times New Roman'">Up
to Age 30<span
style="mso-spacerun: yes"> </span>
2.30<span
style="mso-spacerun: yes"> </span>
3.30</span></p>
then the docx output looks like this:
<w:r>
    <w:rPr>
        <w:rFonts w:ascii="Courier New"/>
        <w:b w:val="false"/>
        <w:i w:val="false"/>
        <w:color w:val="000000"/>
        <w:sz w:val="20"/>
    </w:rPr>
    <w:t>
2.30</w:t>
</w:r>
<w:r>
    <w:rPr>
        <w:rFonts w:ascii="Courier New"/>
        <w:b w:val="false"/>
        <w:i w:val="false"/>
        <w:color w:val="000000"/>
        <w:sz w:val="20"/>
    </w:rPr>
    <w:t>
3.30</w:t>
</w:r>
When the document is opened in Word 2013, there are no spaces at all.
I haven't dug too deep into the docx4j sources; I just replace the entity before passing the string to the importer:
String escaped = unescaped.replace("&nbsp;", "\u00A0");
Unfortunately, in the Word document it ended up as a regular space, but that wasn't critical in my case.
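For illustration, the replacement can be wrapped in a small helper that runs before the string is handed to XHTMLImporter.convert. This is only a sketch of the workaround above, not a fix inside docx4j: the class and method names (NbspPreprocessor, replaceNbspEntities) are made up for this example, and it only handles the literal &nbsp; entity.

```java
// Hypothetical helper (not part of docx4j): pre-processes XHTML text
// before it is passed to the importer.
class NbspPreprocessor {

    /**
     * Replaces every literal "&nbsp;" entity with the actual
     * non-breaking space character (U+00A0). Everything else,
     * including ordinary spaces, is left untouched.
     */
    public static String replaceNbspEntities(String xhtml) {
        return xhtml.replace("&nbsp;", "\u00A0");
    }

    /** Counts U+00A0 characters, handy for checking the result. */
    public static int countNbsp(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\u00A0') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample fragment modelled on the question's input (assumed content).
        String input = "<span style=\"mso-spacerun: yes\">&nbsp;&nbsp; </span>2.30";
        String escaped = replaceNbspEntities(input);
        System.out.println("Entities left: " + escaped.contains("&nbsp;"));
        System.out.println("Non-breaking spaces: " + countNbsp(escaped));
    }
}
```

In the convert() method from the question, the result of replaceNbspEntities(unescaped) would be passed to the importer instead of unescaped. Whether Word then shows a true non-breaking space still depends on how the importer writes the run.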