
Can Boost.Spirit handle PostScript/PDF-like languages?

I've noticed that Boost.Spirit has some limits. In another question here on SO, a user asked for help with Boost.Spirit, and the person who answered pointed out that it works well with statements but not with "generic text" (I'm sorry if I don't recall it correctly).

Now I would like to think about PostScript and PDF in terms of tokens and simplify my approach to these formats that way. The problem is that PDF is a mix between a markup language and a programming language, with jumps and tables in it, and I can't think of anything similar among the most popular file formats and languages, such as XML or C++ code.

There is also another fact: I can't really find anyone with experience writing a PDF parser or writer with boost::spirit. So I'm asking: is boost::spirit capable of parsing a PDF file and outputting the elements as tokens?

asked Apr 08 '13 17:04 by user2244984



1 Answer

Although this has nothing to do with Boost, let me assure you that parsing PDF (and PostScript) is about as trivial as you could want. Let's say that you have a scanner object that returns a series of tokens. The token types you will get from the scanner are:

  • String
  • Dict begin (<<)
  • Dict End (>>)
  • Name (/whatever)
  • Number
  • Hex array
  • Left Angle (<)
  • Right Angle (>)
  • Array begin ([)
  • Array end (])
  • Procedure begin ({)
  • Procedure end (})
  • Comment (%foo)
  • Word

My scanner is a finite-state automaton with states for Start, Comment, String, HexArray, Token, DictEnd, and Done.
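To make the scanner idea concrete, here is a minimal sketch in Python (not the C# scanner described above). The token names, the scan function, and the choice to fold the angle-bracket tokens into a single hex token are all assumptions made for illustration. It ignores details such as #-escapes in names and escape decoding in strings.

```python
import re

WHITESPACE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%"
BRACKETS = {ord("["): "array_begin", ord("]"): "array_end",
            ord("{"): "proc_begin", ord("}"): "proc_end"}
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def scan(data):
    """Yield (kind, value) tokens from PDF-style bytes (illustrative only)."""
    i, n = 0, len(data)
    while i < n:
        ch = data[i]
        if ch in WHITESPACE:
            i += 1
        elif ch == ord("%"):
            # Comment: runs to the end of the line.
            j = i
            while j < n and data[j] not in b"\r\n":
                j += 1
            yield ("comment", data[i + 1:j].decode("latin-1"))
            i = j
        elif ch == ord("("):
            # String: parentheses may nest, a backslash escapes one byte.
            depth, j = 1, i + 1
            while j < n and depth > 0:
                if data[j] == ord("\\"):
                    j += 1
                elif data[j] == ord("("):
                    depth += 1
                elif data[j] == ord(")"):
                    depth -= 1
                j += 1
            if depth > 0:
                raise ValueError("unterminated string")
            yield ("string", data[i + 1:j - 1])
            i = j
        elif data[i:i + 2] == b"<<":
            yield ("dict_begin", "<<")
            i += 2
        elif data[i:i + 2] == b">>":
            yield ("dict_end", ">>")
            i += 2
        elif ch == ord("<"):
            # Hex string: everything up to the closing '>'.
            j = data.index(b">", i)
            yield ("hex", data[i + 1:j].decode("ascii"))
            i = j + 1
        elif ch in BRACKETS:
            yield (BRACKETS[ch], chr(ch))
            i += 1
        elif ch in b")>":
            raise ValueError("unexpected delimiter at offset %d" % i)
        else:
            # Name (/foo), number, or word: runs to whitespace or a delimiter.
            j = i + 1
            while j < n and data[j] not in WHITESPACE and data[j] not in DELIMITERS:
                j += 1
            text = data[i:j].decode("latin-1")
            if text.startswith("/"):
                yield ("name", text[1:])
            elif NUMBER.fullmatch(text):
                yield ("number", float(text) if "." in text else int(text))
            else:
                yield ("word", text)
            i = j
```

Calling list(scan(b"14 0 obj << /Type /Annot >> endobj")) gives two numbers, the word obj, a dictionary begin, two names, a dictionary end, and the word endobj.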

The way you parse PDF is not by parsing it, but by executing it. Given these tokens, my "parser" looks like this (in C#):

while (true) {
    MLPdfToken token = scanner.GetToken();
    if (token == null)
        return MachineExit.EndOfFile;
    PdfObject obj = PdfObject.FromToken(token);
    PdfProcedure proc = obj as PdfProcedure;

    if (proc != null)
    {
        if (IsExecuting())
        {
            if (token.Type == PdfTokenType.RBrace)
                proc.Execute(this);
            else
                Push(obj);
        }
        else {
            proc.Execute(this);
        }
        if (proc.IsTerminal)
            return MachineExit.ParseComplete;
    }
    else {
        Push(obj);
    }
}

I'll also add that if you give every PdfObject an Execute() method whose base class implementation is machine.Push(this), and an IsTerminal property that returns false, the REPL gets easier:

while (true) {
    MLPdfToken token = scanner.GetToken();
    if (token == null)
        return MachineExit.EndOfFile;
    PdfObject obj = PdfObject.FromToken(token);

    if (IsExecuting())
    {
        if (token.Type == PdfTokenType.RBrace)
           obj.Execute(this);
        else
           Push(obj);
    }
    else {
        obj.Execute(this);
        if (obj.IsTerminal)
            return MachineExit.ParseComplete;
    }
}

There's more support in Machine - Machine has a Stack of PdfObject and a few methods for accessing it (Push, Pop, Mark, CountToMark, Index, Dup, Swap), as well as ExecProcBegin and ExecProcEnd.
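As a rough illustration of those stack helpers, here is a Python sketch (not the original C# Machine). The method names mirror the ones listed above, but their exact semantics here, and the Mark class, are assumptions.

```python
class Mark:
    """A stack marker; kind distinguishes dictionary marks from others."""

    def __init__(self, kind):
        self.kind = kind


class Machine:
    def __init__(self):
        self.stack = []

    def push(self, obj):
        self.stack.append(obj)

    def pop(self):
        if not self.stack:
            raise IndexError("stack underflow")
        return self.stack.pop()

    def mark(self, kind):
        self.push(Mark(kind))

    def count_to_mark(self, kind):
        # Number of objects sitting above the nearest mark of this kind.
        for depth, obj in enumerate(reversed(self.stack)):
            if isinstance(obj, Mark) and obj.kind == kind:
                return depth
        raise ValueError("no %s mark on the stack" % kind)

    def index(self, n):
        # Peek at the n-th object from the top without popping it.
        return self.stack[-1 - n]

    def dup(self):
        self.push(self.index(0))

    def swap(self):
        self.stack[-1], self.stack[-2] = self.stack[-2], self.stack[-1]

    def pop_through_mark(self, kind):
        # Return the objects above the mark in push order and drop the mark.
        count = self.count_to_mark(kind)
        split = len(self.stack) - count
        items = self.stack[split:]
        del self.stack[split - 1:]
        return items
```

The slice is computed from the stack length rather than with a negative index, so an empty dictionary (mark immediately followed by >>) still works.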

Beyond that, it's very light. The only thing that is slightly odd is that PdfObject.FromToken takes a token and, if it is a primitive type (number, string, name, hex, bool), returns a corresponding PdfObject. Otherwise, it takes the given token and looks it up in a "proc set" dictionary of procedure names associated with PdfProcedure objects. So when you encounter the token <<, it gets looked up in the proc set and comes up with this code:

void DictBegin(PdfMachine machine)
{
    machine.Push(new PdfMark(PdfMarkType.Dictionary));
}

So << really means "mark the stack as the start of a dictionary." >> gets more interesting:

void DictEnd(PdfMachine machine)
{
    PdfDict dict = new PdfDict();
    // PopThroughMark pops the entire stack up to the first matching mark,
    // throws an exception if it fails.
    PdfObject[] arr = machine.PopThroughMark(PdfMarkType.Dictionary);
    if ((arr.Length & 1) != 0)
        throw new PdfException("dictionaries need an even number of objects.");
    for (int i=0; i < arr.Length; i += 2)
    {
        PdfObject key = arr[i], val = arr[i + 1];
        if (key.Type != PdfObjectType.Name)
            throw new PdfException("dictionaries need a /name for the key.");
        dict.put((PdfName)key, val);
    }
    machine.Push(dict);
}

So >> pops up to the nearest dictionary mark into an array, then puts each pair into the dictionary. Now, I could have done this without allocating the array: I could just pop pairs, putting them into the dictionary until I either hit the mark, fail to get a name, or underflow the stack.
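The allocation-free variant mentioned above could look roughly like this in Python. The Name class, the DICT_MARK sentinel, and the use of a plain list as the stack are assumptions for the sake of the sketch.

```python
class Name(str):
    """A PDF name such as /Type, stored without the leading slash."""


DICT_MARK = object()  # hypothetical sentinel pushed by the << procedure


def dict_end(stack):
    """Pop value/key pairs until the dictionary mark, then push the dict."""
    result = {}
    while True:
        if not stack:
            raise ValueError("stack underflow: no dictionary mark")
        value = stack.pop()
        if value is DICT_MARK:
            break
        if not stack:
            raise ValueError("stack underflow: no dictionary mark")
        key = stack.pop()
        if key is DICT_MARK:
            raise ValueError("dictionaries need an even number of objects")
        if not isinstance(key, Name):
            raise ValueError("dictionaries need a /name for the key")
        # Pairs arrive last-to-first, so keeping the first one seen means a
        # duplicate key resolves to its last occurrence in the file.
        result.setdefault(key, value)
    stack.append(result)
```

One difference from the array version: entries are inserted in reverse file order, which only matters if you care about preserving key order when writing the dictionary back out.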

The important takeaway is that there really isn't any syntax in PDF, nor is there any in PostScript. At least not so much as you'd notice. The only real syntax (and the read-eval-(push) loop shows it) is '}'.

So when you see this in a PDF: 14 0 obj << /Type /Annot /SubType /Square >> endobj, what you're really seeing is a series of procedures:

  1. Push 14
  2. Push 0
  3. Execute obj (Pop two numbers and push a "definition" object).
  4. Execute dictionary begin
  5. Push /Type
  6. Push /Annot
  7. Push /SubType
  8. Push /Square
  9. Execute dictionary end
  10. Execute endobj (pop the top object and then get (not pop) the next one. If the second is a definition, set its "value" to the first object, else throw).

Since "endobj" is terminal, parsing ends and the top of the stack is the result.
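The ten steps above can be reproduced with a very small read-eval-push loop. This Python sketch is a simplification under stated assumptions: tokens are obtained by splitting on whitespace instead of using a real scanner, names stay as plain strings with their slash, procedure bodies ({ }) are not handled, and PROC_SET, Definition, and run are hypothetical names.

```python
class Definition:
    """The object pushed by obj; endobj fills in its value."""

    def __init__(self, number, generation):
        self.number = number
        self.generation = generation
        self.value = None


MARK = object()


def op_obj(stack):
    generation = stack.pop()
    number = stack.pop()
    stack.append(Definition(number, generation))


def op_dict_begin(stack):
    stack.append(MARK)


def op_dict_end(stack):
    items = []
    while True:
        top = stack.pop()
        if top is MARK:
            break
        items.append(top)
    items.reverse()
    if len(items) % 2:
        raise ValueError("dictionaries need an even number of objects")
    stack.append(dict(zip(items[0::2], items[1::2])))


def op_endobj(stack):
    value = stack.pop()
    if not stack or not isinstance(stack[-1], Definition):
        raise ValueError("endobj without a matching obj")
    stack[-1].value = value


# name -> (procedure, is_terminal)
PROC_SET = {"obj": (op_obj, False), "<<": (op_dict_begin, False),
            ">>": (op_dict_end, False), "endobj": (op_endobj, True)}


def run(text):
    stack = []
    for token in text.split():
        if token in PROC_SET:
            proc, is_terminal = PROC_SET[token]
            proc(stack)
            if is_terminal:
                return stack[-1]  # the top of the stack is the result
        elif token.startswith("/"):
            stack.append(token)
        else:
            stack.append(int(token))
    raise EOFError("end of input before a terminal procedure")
```

Running run("14 0 obj << /Type /Annot /SubType /Square >> endobj") returns a Definition for object 14, generation 0, whose value is the two-entry dictionary.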

So when you are asked to look up object 14 in the PDF, the cross-reference table tells you where to seek to, you make a new Machine with the stream pointer at that location and run it. If the top of the stack is a "definition" object, you've succeeded.

About now you should be nodding but not trusting me, since you're thinking about PDF streams, which look like this:

<< [/key value]* >> stream ...raw data... endstream endobj

Again, there is no syntax. The proc stream looks at the top of the stack, which should be a PdfDict. If it is, it consumes characters until the next newline (scanner does this), stores the current file position in the stream as data start, reads the stream length from the dict (which may cause another Machine to get newed up), and skips past the end of stream and pushes the new stream object on the stack. endstream is a no-op. The only difference between a PdfDict and a PdfStream is that a PdfStream has a start position and a bool saying that it's a stream, otherwise I dual-purpose the object.
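A sketch of what the stream procedure does, in Python. It assumes /Length is a direct integer (resolving an indirect /Length would need another machine run, as noted above), and read_stream with its return shape is a hypothetical helper, not part of any library.

```python
def read_stream(data, pos, stream_dict):
    """Handle the 'stream' keyword.

    data is the whole file as bytes, pos is the offset just past the
    keyword, and stream_dict is the dictionary found on top of the stack.
    Returns (stream_object, resume_offset).
    """
    # The keyword is followed by CRLF or a lone LF; data starts right after.
    if data[pos:pos + 2] == b"\r\n":
        start = pos + 2
    elif data[pos:pos + 1] == b"\n":
        start = pos + 1
    else:
        raise ValueError("expected an end-of-line after 'stream'")

    length = stream_dict["Length"]
    if not isinstance(length, int):
        raise NotImplementedError("indirect /Length is not handled in this sketch")

    # Skip the raw bytes without reading them; the scanner resumes here and
    # will see 'endstream', which is a no-op.
    end = start + length
    if not data[end:].lstrip(b"\r\n ").startswith(b"endstream"):
        raise ValueError("/Length does not lead to 'endstream'")

    stream = {"dict": stream_dict, "start": start, "length": length}
    return stream, end
```

Checking for endstream after the skip is optional, but it is a cheap way to notice a wrong /Length, which is one of the more common defects in real files.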

PostScript is almost identical except that the execution environment is a little more complex. For example, you need several stacks in your machine: an operand (parameter) stack, a dictionary stack, and an execution stack. From there, you more or less just bind your tokenizer into the set of primitive procedures as well as the word exec, and then most of your interpreter is written in PS itself.

If you're talking about Boost, you're looking at C++, which means that you can't be as fast and loose with memory as I am. You'll want to either use smart pointers or figure out where your scope is and be careful to dispose of objects instead of blithely throwing them away, but that's just the normal C++ stuff.

Currently, I make PDF tools for my company in .NET, but in a former life I worked on Acrobat versions 1-4, and most of what I described is exactly what Acrobat did under the hood (well, more or less - it was C, not C++, but it's the same approach).

With respect to the xref table (or xref stream), you read that first. The spec tells you that if you jump to EOF and scan back, you find the startxref keyword, which gives the byte offset of the xref table. You parse that (which is a CS 101 assignment), parse the trailer, seek to the /Prev if any, and repeat until there are no more /Prev entries. That gives you a complete xref for looking up objects.
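Here is roughly what that looks like in Python for classic xref tables only (xref streams and hybrid files are not handled). The function names are hypothetical, and /Prev is pulled out of the trailer with a regular expression as a shortcut, where a real reader would run the trailer dictionary through the machine.

```python
import re


def find_startxref(data):
    """Scan back from EOF for 'startxref' and return the offset after it."""
    tail = data[-1024:]
    at = tail.rfind(b"startxref")
    if at < 0:
        raise ValueError("no startxref near the end of the file")
    return int(tail[at + len(b"startxref"):].split()[0])


def parse_xref_section(data, offset):
    """Parse one xref table; return (entries, prev_offset_or_None)."""
    if not data.startswith(b"xref", offset):
        raise ValueError("not a classic xref table (possibly an xref stream)")
    trailer_at = data.index(b"trailer", offset)
    tokens = data[offset:trailer_at].split()
    entries = {}
    i = 1  # skip the 'xref' keyword
    while i < len(tokens):
        first, count = int(tokens[i]), int(tokens[i + 1])
        i += 2
        for k in range(count):
            position, generation, kind = tokens[i:i + 3]
            i += 3
            # 'n' entries are in use; 'f' entries are free and map to None.
            in_use = kind == b"n"
            entries[first + k] = (int(position), int(generation)) if in_use else None
    trailer = data[trailer_at:data.index(b"startxref", trailer_at)]
    prev = re.search(rb"/Prev\s+(\d+)", trailer)
    return entries, int(prev.group(1)) if prev else None


def load_xref(data):
    """Follow the /Prev chain; entries from newer sections win."""
    table, seen = {}, set()
    offset = find_startxref(data)
    while offset is not None and offset not in seen:
        seen.add(offset)
        entries, offset = parse_xref_section(data, offset)
        for number, entry in entries.items():
            table.setdefault(number, entry)
    return {number: entry for number, entry in table.items() if entry is not None}
```

The seen set guards against a /Prev chain that loops back on itself, which a damaged or malicious file could contain.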

As for writing - there are a number of approaches that you can take. The most obvious one is that when an object is meant to be referenced, you create a new reference object by assigning the newest available xref entry to it. Whenever objects refer to other objects for writing, they ask if these objects are referenced. If they are, they write the reference (i.e., 14 0 R). When it comes time to write a referenced object, you get the current stream pointer and store it in the xref, then write <objnum> <generation> obj <object contents> endobj. For example, my code to write a dictionary looks like this:

public override void ToStream(PdfStreamingContext context)
{
    if (context.HasReference(this)) // is object referenced in xref
    {
        PdfUtils.WriteObjectDefinitionBegin(this, context);
    }
    context.Writer.Indent();
    context.Writer.WriteLine("<<");
    WriteContents(context);
    context.Writer.Exdent();
    context.Writer.WriteLine(">>");
    if (context.HasReference(this))
    {
        PdfUtils.WriteObjectDefinitionEnd(this, context);
    }
}

I've chopped out some chaff so you can see the wheat underneath. The context is an object that holds a new xref table as well as an object for writing to streams that automagically handles appropriate newline discipline, indentation, line wrapping, and so on.
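To show the reference bookkeeping end to end, here is a small Python sketch. WriteContext, serialize, and write_object are hypothetical names; the sketch only handles dictionaries, names (as Python strings), and integers, always uses generation 0, and does no indentation or line wrapping.

```python
import io


class WriteContext:
    """Holds the output stream and the new xref being built."""

    def __init__(self):
        self.out = io.BytesIO()
        self.numbers = {}  # id(object) -> object number
        self.offsets = {}  # object number -> byte offset in the output

    def reference(self, obj):
        # Assign the next free object number the first time obj is seen.
        return self.numbers.setdefault(id(obj), len(self.numbers) + 1)


def serialize(value, ctx, top_level=False):
    # A referenced object is written as 'N 0 R' everywhere except at its
    # own definition.
    if not top_level and id(value) in ctx.numbers:
        return b"%d 0 R" % ctx.numbers[id(value)]
    if isinstance(value, dict):
        parts = [b"/" + key.encode("ascii") + b" " + serialize(item, ctx)
                 for key, item in value.items()]
        return b"<< " + b" ".join(parts) + b" >>"
    if isinstance(value, str):
        return b"/" + value.encode("ascii")
    if isinstance(value, int):
        return b"%d" % value
    raise TypeError("unsupported type in this sketch: %r" % type(value))


def write_object(obj, ctx):
    number = ctx.reference(obj)
    ctx.offsets[number] = ctx.out.tell()  # store the stream pointer in the xref
    ctx.out.write(b"%d 0 obj\n" % number)
    ctx.out.write(serialize(obj, ctx, top_level=True) + b"\n")
    ctx.out.write(b"endobj\n")
```

For example, after registering a page dictionary and an annotation dictionary that points at it, writing both produces "1 0 obj ... endobj" followed by "2 0 obj << /Type /Annot /P 1 0 R >> endobj", and ctx.offsets holds the byte offset of each definition, ready to be emitted as the xref table.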

What you should see is that the basics here are straightforward, if not trivial. And now's when you should be asking yourself the question, "if it's trivial, how come there isn't more (serious) competition for Acrobat in the market?" The answer is that even though it's trivial, it's still easy to write PDFs that aren't spec compliant, and Acrobat handles most of those. The real challenge is to be able to honor the spec and make sure that you include all required values in a dictionary and that they are in range and semantically correct. Hell, even the date time format--which is pretty well-specified--is a mound of special case code in my library to manage where other people have screwed it up royally. Being able to generate consistently correct PDF is hard, and consuming the garbage in the sea of PDFs in the world is harder.

I could (and probably should) write a book about how to do this. While a lot of the fringe code is grubby, the overall structure can be very pretty.

tl;dr - If you're thinking of a recursive descent parser for PDF, you're thinking too hard. All you need is a tokenizer and a simple REPL.

answered Sep 19 '22 15:09 by plinth
