I have a searchable XPS file which I convert to PDF like this:
gxps -sOutputFile=C:\temp\foo.pdf -sDEVICE=pdfwrite \
-dNOPAUSE C:\temp\foo.xps
The resulting PDF is not searchable.
How can I get gxps to generate searchable PDFs?
Edit:
gxps version: 9.15
Build date: Mon Sep 22 12:35:05 2014
Sample input XPS file: https://www.dropbox.com/s/01rd7apzjb1kwuo/forSO.xps?dl=0
Sample output PDF file: https://www.dropbox.com/s/pefslcyznns5gim/forSO.pdf?dl=0
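One way to confirm the symptom independently of a PDF viewer is to extract the text layer and look for a phrase that is known to be in the document. The following is only a minimal sketch, assuming Poppler's pdftotext is installed and on the PATH; the helper names extract_text and contains_phrase are made up for illustration.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Return the text layer of a PDF as extracted by Poppler's pdftotext.

    Assumes the pdftotext binary is on the PATH; "-" sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def contains_phrase(text: str, phrase: str) -> bool:
    """Case-insensitive search that ignores differences in whitespace and line breaks."""
    normalized = " ".join(text.split()).casefold()
    return " ".join(phrase.split()).casefold() in normalized
```

Calling contains_phrase(extract_text(r"C:\temp\foo.pdf"), "a phrase from the XPS") should return True for a searchable PDF. If it returns False for text that is clearly visible on the page, the text layer is missing or mapped wrongly.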
I've looked at the PDF and quickly investigated the fonts used by GXPS for the resulting PDF file, using pdffonts:
$ pdffonts forSO.pdf
name                    type         encoding         emb sub uni object ID
----------------------- ------------ ---------------- --- --- --- ---------
RFGWZI+Arial            TrueType     WinAnsi          yes yes yes     11  0
Superficially, it looks OK:
- The font is embedded (see emb column).
- The font is of type TrueType (see type column).
- The font uses WinAnsi encoding (see encoding column).
- The font has a /ToUnicode map (see uni column).
Looking more closely, however, the actual /ToUnicode map that gxps embedded into the PDF seems to be heavily b0rken. Here it is, extracted as a complete indirect object from the PDF, with the stream uncompressed:
41 0 obj
<<
/Length 863
>>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R41 def
1 begincodespacerange
<00><ff>
endcodespacerange
42 beginbfrange
<04><04><0004>
<05><05><0004>
<06><06><0006>
<07><07><0006>
<08><08><0006>
<09><09><0006>
<0a><0a><000a>
<0b><0b><000a>
<0c><0c><000c>
<0d><0d><000c>
<0e><11><000e>
<12><12><000c>
<13><13><000c>
<14><14><000c>
<15><15><000c>
<16><16><0004>
<17><17><0004>
<18><18><0004>
<19><1a><0019>
<1b><1b><001a>
<1c><1c><001a>
<1d><1d><001a>
<1e><1e><001a>
<1f><1f><001a>
<20><20><0044>
<21><21><001a>
<22><22><001a>
<23><23><001a>
<24><24><0024>
<25><25><000c>
<26><26><001d>
<27><27><0023>
<28><28><0023>
<29><29><0028>
<41><41><0044>
<44><44><0044>
<49><49><0044>
<63><63><0044>
<69><69><0044>
<74><74><0044>
<76><76><0044>
<79><79><0044>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
endstream
endobj
As one can see, the /ToUnicode table contains 42 keys, but these map to only 12 different character values.
Some of these 12 values appear multiple times in the table, so multiple glyphs are reverse-mapped to the same character (and the mapping does not seem to be correct even for a single one):
     no. of | char
occurrences | value
------------+-----------
          1 | <000e>
          1 | <0019>
          1 | <001d>
          1 | <0024>
          1 | <0028>
          2 | <000a>
          2 | <0023>
          4 | <0006>
          5 | <0004>
          7 | <000c>
          8 | <001a>
          9 | <0044>
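The tally above can be reproduced with a short script. This is only a sketch: it assumes the 42 bfrange entries have been copied out of the uncompressed stream by hand, and it counts each entry once, by the destination value written in that entry, which is how the table counts them.

```python
import re
from collections import Counter

# The 42 bfrange entries from the /ToUnicode CMap shown earlier,
# each in the form <first code><last code><destination value>.
BFRANGE = """
<04><04><0004> <05><05><0004> <06><06><0006> <07><07><0006>
<08><08><0006> <09><09><0006> <0a><0a><000a> <0b><0b><000a>
<0c><0c><000c> <0d><0d><000c> <0e><11><000e> <12><12><000c>
<13><13><000c> <14><14><000c> <15><15><000c> <16><16><0004>
<17><17><0004> <18><18><0004> <19><1a><0019> <1b><1b><001a>
<1c><1c><001a> <1d><1d><001a> <1e><1e><001a> <1f><1f><001a>
<20><20><0044> <21><21><001a> <22><22><001a> <23><23><001a>
<24><24><0024> <25><25><000c> <26><26><001d> <27><27><0023>
<28><28><0023> <29><29><0028> <41><41><0044> <44><44><0044>
<49><49><0044> <63><63><0044> <69><69><0044> <74><74><0044>
<76><76><0044> <79><79><0044>
"""

ENTRY = re.compile(r"<([0-9a-f]{2})><([0-9a-f]{2})><([0-9a-f]{4})>")

# Each match is a (first code, last code, destination value) tuple.
entries = ENTRY.findall(BFRANGE)

# Count how many entries share the same destination value.
counts = Counter(dst for _first, _last, dst in entries)

print(f"{len(entries)} entries, {len(counts)} distinct destination values")
# Sort by number of occurrences, then by value, like the table.
for dst, n in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
    print(f"{n:>11} | <{dst}>")
```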
For example, the glyphs with the codes 06, 07, 08 and 09 all map to the same character value, <0006>.
This doesn't look right.
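To illustrate what such a map does to text extraction, here is a small sketch. It assumes that an extractor trusts the /ToUnicode map and that the page content uses single-byte character codes; the sample string is hypothetical and was chosen only because all of its codes appear in the table.

```python
# Subset of the /ToUnicode map shown earlier: character code -> Unicode code point.
TO_UNICODE = {
    0x20: 0x0044,
    0x41: 0x0044,
    0x44: 0x0044,
    0x49: 0x0044,
    0x63: 0x0044,
    0x69: 0x0044,
    0x74: 0x0044,
    0x76: 0x0044,
    0x79: 0x0044,
}


def extract(codes: bytes) -> str:
    """Translate character codes to text the way a /ToUnicode-driven extractor would.

    Codes without an entry become U+FFFD (the replacement character).
    """
    return "".join(chr(TO_UNICODE.get(code, 0xFFFD)) for code in codes)


# Hypothetical string from a page: the WinAnsi codes for "Activity".
sample = b"Activity"
print(extract(sample))  # DDDDDDDD
```

Every code in the sample ends up as U+0044 ("D"), so a search for the visible word cannot match the extracted text.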
IMHO, this deserves a bug report in Ghostscript's Bugzilla (though I'm not sure whether the GXPS component is still actively maintained).
Update: I found an existing entry in the Ghostscript/GXPS Bugzilla database here: