 

Parsing a PostScript file

I’m translating PDFMiner (an open source PDF-to-text program written in Python) into Objective-C. Some fonts have a PostScript file embedded in them with character names and their encoded values. For example:

/Encoding 256 array
0 1 255 {1 index exch /.notdef put} for
dup 65 /A put
dup 67 /C put
dup 70 /F put
dup 45 /hyphen put

I don’t actually know what the above code does. I’m guessing it puts those pairs into a dictionary. I’m not sure what dup does at all. What the above code means to me is that if I see a 45 in the PDF then I’ll look it up and convert it to a hyphen, or if I see a 70 then I convert it to an F etc.

The code I’m copying uses a full-blown PostScript tokenizer to parse all put commands in a PostScript file. For each put command it creates a dictionary with the key-value pair corresponding to the put operands.

My question is, do I really need to build an entire PostScript tokenizer to parse these things?

A far simpler alternative would be to scan for every occurrence of the string “put” and then look at the two preceding words. If the two preceding words are a number followed by a /x then I assume this is what I want; otherwise I ignore it.

I don’t know PostScript at all, but I figure anyone who does can tell me if my simpler alternative has any corner cases that will screw things up.

Thanks!

asked Jun 05 '26 11:06 by user2444342


1 Answer

Lengthy explanation, jump to the end for the short answer.

PostScript is a stack-based programming language (somewhat akin to Forth). 'dup' is an operator which duplicates the top object on the operand stack.

In the case of your example, it creates an array of 256 elements, fills the array with the name /.notdef at every location in the array, then replaces names at particular array indices with other names (dup duplicates the reference to the array, not the array itself, and put consumes its three operands, including that duplicate reference). Not shown above, but later that array will be associated with the name /Encoding, and the key-value pair is stored in a dictionary, which in this case is the font dictionary.
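To make the stack behaviour concrete, here is a minimal sketch in Python that mimics the operand stack for the snippet in the question. It is not a PostScript interpreter; the helper names (op_dup, op_index, op_exch, op_put, build_encoding) are made up for illustration, and names are represented as plain Python strings.

```python
# Toy model of the PostScript operand stack (illustrative only).

def op_dup(stack):
    """dup: push a second reference to the top object (no copy is made)."""
    stack.append(stack[-1])

def op_index(stack):
    """index: pop n, then push the object n places below the top."""
    n = stack.pop()
    stack.append(stack[-1 - n])

def op_exch(stack):
    """exch: swap the top two objects."""
    stack[-1], stack[-2] = stack[-2], stack[-1]

def op_put(stack):
    """put: pop value, index and array, then store value at array[index]."""
    value = stack.pop()
    index = stack.pop()
    array = stack.pop()
    array[index] = value

def build_encoding():
    stack = []
    stack.append("/Encoding")            # /Encoding (just a name, left on the stack)
    stack.append([None] * 256)           # 256 array
    for i in range(0, 256):              # 0 1 255 {...} for
        stack.append(i)                  # 'for' pushes the loop counter
        stack.append(1)
        op_index(stack)                  # 1 index   -> array i array
        op_exch(stack)                   # exch      -> array array i
        stack.append(".notdef")          # /.notdef
        op_put(stack)                    # put       -> array
    for code, name in [(65, "A"), (67, "C"), (70, "F"), (45, "hyphen")]:
        op_dup(stack)                    # dup       -> array array
        stack.append(code)
        stack.append(name)
        op_put(stack)                    # put       -> array
    # A later 'def' (not shown in the question) would consume the name and array.
    return stack[-1]

encoding = build_encoding()
print(encoding[65], encoding[45], encoding[0])
```

Because dup pushes a reference rather than a copy, every put writes into the same array, which is why one array is still on the stack at the end.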

When a character code is drawn, the interpreter looks up the font dictionary and retrieves the object associated with the key /Encoding. It then uses the character code as an index in that array, and retrieves the object found there. The interpreter then retrieves the CharStrings dictionary from the font, and uses the object extracted from the Encoding array as a key in the CharStrings dictionary. The object associated with that key is then used further. In the case of Type 1/2/3 fonts that object is the glyph program used to draw the glyph shape. In a Type 42 font that object is an integer which is then used as the GID to retrieve the glyph program from the glyf table in the /sfnts array of the font dictionary.
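The two-step lookup described above (character code to name via Encoding, then name to glyph via CharStrings) can be sketched with ordinary Python containers. The toy font and the function name glyph_for_code are hypothetical; the placeholder strings stand in for real glyph programs.

```python
def glyph_for_code(font, code):
    """Resolve a character code the way the text above describes."""
    # Step 1: the character code indexes the Encoding array and yields a name.
    name = font["Encoding"][code]
    # Step 2: the name is the key into CharStrings.
    charstrings = font["CharStrings"]
    # Assumption in this sketch: a name missing from CharStrings falls back to .notdef.
    return charstrings.get(name, charstrings[".notdef"])

# Toy font dictionary: the values would be glyph programs in a Type 1 font,
# or integer GIDs in a Type 42 font.
toy_encoding = [".notdef"] * 256
toy_encoding[65] = "A"
toy_encoding[45] = "hyphen"
toy_font = {
    "Encoding": toy_encoding,
    "CharStrings": {
        ".notdef": "<notdef program>",
        "A": "<program for A>",
        "hyphen": "<program for hyphen>",
    },
}

print(glyph_for_code(toy_font, 65))
print(glyph_for_code(toy_font, 66))
```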

Now the first thing to note is that PostScript is a programming language. Instead of the simple array setup you have above, I could write a PostScript routine which takes a font dictionary, a name and an index, and inserts the name into the Encoding array of that font. A simple approach of scanning for the pattern wouldn't work on such a font, because my program doesn't do it that way.

Further, the 'put' operator is used extensively in PostScript programming, so a simple search for 'put' wouldn't be sufficient.

All of which is a long-winded way of saying that if you want to work with PostScript programs, you are going to need a full PostScript interpreter (a tokeniser isn't sufficient).

Now as to your specific case, your idea will work for PostScript Type 1 fonts (subject to the caveats above) because these are generally well-formed and follow some simple guidelines, and up to a point for Type 3 PDF fonts, but it won't work for TrueType fonts or CIDFonts, and really won't work for any of the above which are subset. (It also won't work for CFF fonts, because they are binary coded, not interpreted as PostScript.)
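For the well-formed Type 1 case only, the pattern scan from the question might look like the sketch below. It assumes the font uses the literal form "dup <integer> /<name> put" with decimal integers; the regular expression and the function name scan_encoding_puts are illustrative, and anything written differently (a helper procedure, an abbreviated operator, radix numbers) is silently missed.

```python
import re

# Matches "dup 65 /A put": dup, a decimal integer, a literal name, then put.
# The name is any run of characters that are not whitespace or PostScript delimiters.
PUT_PATTERN = re.compile(r"\bdup\s+(\d+)\s*/([^\s/\[\]{}()<>%]+)\s+put\b")

def scan_encoding_puts(font_text):
    """Return {character code: glyph name} for every matching put."""
    return {int(code): name for code, name in PUT_PATTERN.findall(font_text)}

sample = """/Encoding 256 array
0 1 255 {1 index exch /.notdef put} for
dup 65 /A put
dup 67 /C put
dup 70 /F put
dup 45 /hyphen put
"""

print(scan_encoding_puts(sample))
```

Requiring the leading dup keeps the scan from picking up the put inside the /.notdef loop, but it does nothing about put operators used elsewhere in the font program.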

You should first check to see if the font has an associated ToUnicode CMap, if it does you are very much in luck, use it! This will map character codes to Unicode code points, job done.
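As a rough illustration of what using a ToUnicode CMap involves, the sketch below reads only the beginbfchar ... endbfchar sections of an already decompressed CMap stream. It assumes each entry is a pair of hex strings and that the destination is UTF-16BE with an even number of hex digits; beginbfrange sections, which real files also use, are deliberately left out. The function name parse_bfchar and the sample data are made up.

```python
import re

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(cmap_text):
    """Return {character code: text} from the bfchar sections of a ToUnicode CMap."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in HEX_PAIR.findall(block):
            # The destination is one or more UTF-16BE code units written in hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

sample_cmap = """2 beginbfchar
<0003> <0020>
<0024> <0041>
endbfchar
"""

print(parse_bfchar(sample_cmap))
```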

In the absence of a ToUnicode CMap there is no guaranteed way to extract text; it may be completely impossible. If the font is a TrueType font, you can reverse engineer the Encoding and extract the character code->GID mapping, then look in the TrueType font's cmap table to see whether it has a Unicode subtable (most do), in which case you can use that.

If all that fails, you can check to see if the font has a standard Encoding in the font object; these are listed in the PDF Reference.

So what this tells you is that the Encoding in the font program is of no great use to you. It's the default ordering of the font if no other Encoding is applied. PDF files always apply their own Encoding (it's a required entry in the font object). So the Encoding in the actual PostScript font isn't useful to you; it gets overridden by the entry from the PDF file. So there's no need for you to interpret the font at all, for a type 1, 2 or 3 font.
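When the PDF font object's Encoding is a dictionary, it can carry a Differences array that overrides individual codes of a base encoding: an integer sets the current code, and each name that follows is assigned to consecutive codes. A minimal sketch of applying one, assuming the array has already been parsed into Python ints and strings (the function name apply_differences and the sample values are hypothetical):

```python
def apply_differences(base_encoding, differences):
    """Return a copy of base_encoding with a Differences array applied.

    differences is a flat list such as [45, "hyphen", 65, "A", "B"]:
    an int sets the current code, each following name fills successive codes.
    """
    table = list(base_encoding)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("Differences array must start with a code")
            table[code] = item
            code += 1
    return table

# Placeholder base table; a real one would be one of the standard encodings.
base = [".notdef"] * 256
table = apply_differences(base, [45, "hyphen", 65, "A", "B"])
print(table[45], table[65], table[66])
```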

Only if all of the above fails should you start looking at the names in the Encoding (the Encoding in the PDF file, not the font!). You need to be aware that just because someone sticks a name in the Encoding, that doesn't mean that this is the actual glyph shape that is drawn by the glyph program.
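If you do end up relying on names, the usual route is a glyph-name table such as the Adobe Glyph List plus the "uniXXXX" naming convention. The sketch below is heavily simplified: SMALL_GLYPH_TABLE is a tiny made-up subset, only the single four-digit uniXXXX form is handled, and, as noted above, the result is only as trustworthy as the names themselves.

```python
import re

# Tiny illustrative subset; a real table has thousands of entries.
SMALL_GLYPH_TABLE = {"A": "A", "C": "C", "F": "F", "hyphen": "-", "space": " "}

UNI_NAME = re.compile(r"uni([0-9A-F]{4})")

def glyph_name_to_text(name):
    """Best-effort mapping from a glyph name to text, or None if unknown."""
    if name in SMALL_GLYPH_TABLE:
        return SMALL_GLYPH_TABLE[name]
    match = UNI_NAME.fullmatch(name)
    if match:
        # "uni0041" names the code point U+0041 directly.
        return chr(int(match.group(1), 16))
    return None

print(glyph_name_to_text("hyphen"), glyph_name_to_text("uni0041"))
```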

For CIDFonts, if it doesn't have a ToUnicode CMap you're probably stuck, but you can examine the CIDSystemInfo for clues given the Registry and Ordering.

It all depends how thorough you want to be, in effect how well you want the resulting program to perform, bearing in mind that there is no possibility of 100% accuracy without using something like an OCR solution.

Anyway, the short answer is 'you don't need to do that, Dave'. The Encoding in the font program is of no use, as it's overridden by the Encoding in the PDF file.

answered Jun 07 '26 23:06 by KenS


