Ghostscript: producing searchable PDF with 'gxps'

I have a searchable XPS file which I convert to PDF like this:

gxps -sOutputFile=C:\temp\foo.pdf -sDEVICE=pdfwrite \
     -dNOPAUSE C:\temp\foo.xps

The resulting PDF is not searchable.

  • Is there a way for gxps to generate searchable PDFs?
  • If not, is there a similar app that can convert searchable XPS to searchable PDF on the command line?

Edit:

gxps version: 9.15
Build date: Mon Sep 22 12:35:05 2014
  • Sample input XPS file: https://www.dropbox.com/s/01rd7apzjb1kwuo/forSO.xps?dl=0

  • Sample output PDF file: https://www.dropbox.com/s/pefslcyznns5gim/forSO.pdf?dl=0

asked Feb 27 '26 04:02 by dijxtra


1 Answer

I've looked at the PDF and quickly investigated the fonts used by GXPS for the resulting PDF file, using pdffonts:

 $ pdffonts forSO.pdf

   name                    type         encoding         emb sub uni object ID
   ----------------------- ------------ ---------------- --- --- --- ---------
   RFGWZI+Arial            TrueType     WinAnsi          yes yes yes     11  0

Superficially, it looks OK:

  1. The only font used is embedded (see emb column).
  2. The font type is a common one (see type column).
  3. The font encoding is a standard one (see encoding column).
  4. The font seems to have a companion /ToUnicode map (see uni column).
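The same four checks can be scripted. The following is only a minimal sketch: `parse_pdffonts` is a hypothetical helper (not part of pdffonts or Poppler) that slices each row at the column positions given by the dashed ruler line, and it assumes the table layout shown above. Note that the `uni` column only says that a /ToUnicode map exists, not that its contents are correct.

```python
import re

# Output copied from the pdffonts run shown above (leading indent removed).
PDFFONTS_OUTPUT = """\
name                    type         encoding         emb sub uni object ID
----------------------- ------------ ---------------- --- --- --- ---------
RFGWZI+Arial            TrueType     WinAnsi          yes yes yes     11  0
"""


def parse_pdffonts(text):
    """Split a pdffonts table into one dict per font.

    Column boundaries are taken from the dashed ruler line, so values
    that contain spaces stay inside their column.
    """
    lines = text.splitlines()
    header, ruler, rows = lines[0], lines[1], lines[2:]
    spans = [m.span() for m in re.finditer(r"-+", ruler)]
    names = [header[a:b].strip() for a, b in spans]
    return [
        {name: row[a:b].strip() for name, (a, b) in zip(names, spans)}
        for row in rows
        if row.strip()
    ]


fonts = parse_pdffonts(PDFFONTS_OUTPUT)
for font in fonts:
    print(font["name"])
    print("  embedded:       ", font["emb"] == "yes")
    print("  type:           ", font["type"])
    print("  encoding:       ", font["encoding"])
    print("  has /ToUnicode: ", font["uni"] == "yes")
```

All four checks pass for this file, which is why the problem only shows up when looking at the map itself.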

Looking more closely, however, the actual /ToUnicode map which gxps embedded into the PDF seems to be badly broken. Here it is, extracted as a complete indirect object from the PDF, with the stream uncompressed:

41 0 obj
<<
  /Length 863
>>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R41 def
1 begincodespacerange
<00><ff>
endcodespacerange
42 beginbfrange
<04><04><0004>
<05><05><0004>
<06><06><0006>
<07><07><0006>
<08><08><0006>
<09><09><0006>
<0a><0a><000a>
<0b><0b><000a>
<0c><0c><000c>
<0d><0d><000c>
<0e><11><000e>
<12><12><000c>
<13><13><000c>
<14><14><000c>
<15><15><000c>
<16><16><0004>
<17><17><0004>
<18><18><0004>
<19><1a><0019>
<1b><1b><001a>
<1c><1c><001a>
<1d><1d><001a>
<1e><1e><001a>
<1f><1f><001a>
<20><20><0044>
<21><21><001a>
<22><22><001a>
<23><23><001a>
<24><24><0024>
<25><25><000c>
<26><26><001d>
<27><27><0023>
<28><28><0023>
<29><29><0028>
<41><41><0044>
<44><44><0044>
<49><49><0044>
<63><63><0044>
<69><69><0044>
<74><74><0044>
<76><76><0044>
<79><79><0044>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
endstream
endobj

As one can see, the /ToUnicode table contains 42 entries, but these map to only 12 different character values:

  • Some of these 12 character values appear multiple times in the table, so multiple glyphs are reverse-mapped to the same character (which in turn does not seem to be the correct character for even a single one of them):

         no. of |   char 
    occurrences |   value
    ------------+-----------
             1  |   <000e>
             1  |   <0019>
             1  |   <001d>
             1  |   <0024>
             1  |   <0028>
             2  |   <000a>
             2  |   <0023>
             4  |   <0006>
             5  |   <0004>
             7  |   <000c>
             8  |   <001a>
             9  |   <0044>
    
  • For example, the glyphs with the codes 06, 07, 08 and 09 all map back to the single character value <0006>.

This doesn't look right.
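The tally above can be reproduced with a few lines of Python. This is only a sketch for cross-checking the numbers: the bfrange entries are copied by hand from the stream shown above, and the helper names (`parse_bfrange`, `tally_targets`, `expand`) are made up for this example. The last loop additionally assumes that the WinAnsi encoding reported by pdffonts really applies, in which case printable ASCII codes should map to themselves.

```python
import re
from collections import Counter

# bfrange entries copied from the /ToUnicode stream shown above:
# <first code> <last code> <first Unicode value>
BFRANGE = """
<04><04><0004> <05><05><0004> <06><06><0006> <07><07><0006>
<08><08><0006> <09><09><0006> <0a><0a><000a> <0b><0b><000a>
<0c><0c><000c> <0d><0d><000c> <0e><11><000e> <12><12><000c>
<13><13><000c> <14><14><000c> <15><15><000c> <16><16><0004>
<17><17><0004> <18><18><0004> <19><1a><0019> <1b><1b><001a>
<1c><1c><001a> <1d><1d><001a> <1e><1e><001a> <1f><1f><001a>
<20><20><0044> <21><21><001a> <22><22><001a> <23><23><001a>
<24><24><0024> <25><25><000c> <26><26><001d> <27><27><0023>
<28><28><0023> <29><29><0028> <41><41><0044> <44><44><0044>
<49><49><0044> <63><63><0044> <69><69><0044> <74><74><0044>
<76><76><0044> <79><79><0044>
"""

ENTRY = re.compile(r"<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>")


def parse_bfrange(text):
    """Return a list of (first_code, last_code, first_unicode) integer triples."""
    return [tuple(int(h, 16) for h in m) for m in ENTRY.findall(text)]


def tally_targets(ranges):
    """Count how many bfrange entries start at each Unicode value."""
    return Counter(uni for _, _, uni in ranges)


def expand(ranges):
    """Expand each range into individual code -> Unicode pairs."""
    mapping = {}
    for lo, hi, uni in ranges:
        for offset in range(hi - lo + 1):
            mapping[lo + offset] = uni + offset
    return mapping


ranges = parse_bfrange(BFRANGE)
counts = tally_targets(ranges)
mapping = expand(ranges)

print(len(ranges), "entries,", len(counts), "distinct target values")
for uni, n in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
    print(f"{n:2d}  <{uni:04x}>")

# Assumption: with WinAnsi, printable ASCII codes should map to themselves.
# List the ones that this table sends elsewhere, e.g. 0x41 'A' -> U+0044 'D'.
for code in sorted(mapping):
    if 0x20 <= code < 0x7F and mapping[code] != code:
        print(f"code 0x{code:02x} {chr(code)!r} -> U+{mapping[code]:04X}")
```

Most of the target values fall in the control-character range below U+0020, and every letter code in the table ends up at U+0044 ('D'), which would be enough to break text search and copy-and-paste.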

IMHO, this deserves a bug report in Ghostscript's Bugzilla (but I'm not sure whether the GXPS component is still actively maintained).

Update: I found an existing entry in the Ghostscript/GXPS Bugzilla database:

  • Bug 693945 - Incorrect Unicode Map Generated by gxps/pdfwrite

answered Mar 02 '26 14:03 by Kurt Pfeifle