
Getting line locations with iText

Tags:

itext

How can one find where lines are located in a document with iText?

Suppose I have a table in a PDF document and want to read its contents. I would like to find exactly where the cells are located, and I thought I might do that by finding the intersections of the lines.

ipavlic asked Dec 16 '22 05:12


1 Answer

I think your only option using iText will be to parse the PDF tokens manually. Before doing that I would have a copy of the PDF spec handy.

(I'm a .Net guy so I use iTextSharp but other than some capitalization differences and property declarations they're almost 100% the same.)

You can get the individual tokens using a PRTokeniser object, which you construct from the bytes returned by calling getPageContent(pageNum) on your PdfReader.

//Get bytes for page 1
byte[] pageBytes = reader.getPageContent(1);
//Get the tokens for page 1
PRTokeniser tokeniser = new PRTokeniser(pageBytes);

Then just loop through the PRTokeniser:

//iTextSharp (C#) casing, matching the full listing further down
PRTokeniser.TokType tokenType;
string tokenValue;

while (tokeniser.NextToken()) {
    tokenType = tokeniser.TokenType;
    tokenValue = tokeniser.StringValue;
    //...check tokenValue, do something with it
}

As for tokenValue, you'd probably want to look for re and l, the rectangle and line operators. If you see an re, look at the previous 4 values (x, y, width and height); if you see an l, look at the previous 2 values (the x and y of the line's end point). This also means that you need to store each tokenValue in an array so you can look back later.
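To illustrate that look-back idea, here is a minimal Java sketch (Java because the question is about iText). It does not call iText at all: the tokens are plain strings standing in for what PRTokeniser would hand you, and PathOperatorScanner, Shape and scan are hypothetical names made up for this example. Two things differ from the description above and are my assumptions: it also tracks the m (move-to) operator, because an l only carries the end point of a segment, and it clears the operand buffer after every operator instead of keeping the whole history. It ignores cm transformations, so the coordinates are exactly the numbers written in the content stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of iText.
class PathOperatorScanner {

    // One detected shape: "re" holds {x, y, width, height}, "l" holds {x1, y1, x2, y2}.
    static final class Shape {
        final String operator;
        final double[] values;

        Shape(String operator, double a, double b, double c, double d) {
            this.operator = operator;
            this.values = new double[] {a, b, c, d};
        }
    }

    // Walks the tokens in order, buffering numbers until an operator consumes them.
    static List<Shape> scan(List<String> tokens) {
        List<Shape> shapes = new ArrayList<>();
        List<Double> operands = new ArrayList<>(); // numbers seen since the last operator
        double curX = 0, curY = 0;                 // current point, updated by m, l and re

        for (String token : tokens) {
            Double number = parseOrNull(token);
            if (number != null) {
                operands.add(number);
                continue;
            }
            int n = operands.size();
            if (token.equals("re") && n >= 4) {
                double x = operands.get(n - 4), y = operands.get(n - 3);
                double w = operands.get(n - 2), h = operands.get(n - 1);
                shapes.add(new Shape("re", x, y, w, h));
                curX = x;
                curY = y;
            } else if (token.equals("m") && n >= 2) {
                curX = operands.get(n - 2);
                curY = operands.get(n - 1);
            } else if (token.equals("l") && n >= 2) {
                double x = operands.get(n - 2), y = operands.get(n - 1);
                shapes.add(new Shape("l", curX, curY, x, y));
                curX = x;
                curY = y;
            }
            operands.clear(); // operands belong to the operator that follows them
        }
        return shapes;
    }

    // Returns the numeric value of a token, or null if the token is not a number.
    static Double parseOrNull(String token) {
        try {
            return Double.valueOf(token);
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

With real iText you would feed each token's string value into the same logic instead of building a list of strings first.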

Depending on what you used to create the PDF, you might get some interesting results. For instance, I created a 4-cell table with Microsoft Word and saved it as a PDF. For some reason there are two sets of 10 rectangles with many duplicates, but the general idea still works.
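If the duplicates get in the way, one option is to drop rectangles that are effectively the same. The following is only a sketch under my own assumptions: each rectangle is a double[] of {x, y, width, height}, values are compared after rounding to 1/100 of a unit, and a rectangle written with a negative width or height is treated as equal to the same area written with positive values. RectangleDeduplicator and distinct are hypothetical names, not iText API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of iText.
class RectangleDeduplicator {

    // Rounds to 1/100 of a unit so that near-identical values produce the same key.
    static long quantise(double value) {
        return Math.round(value * 100.0);
    }

    // Returns the rectangles in their original order with duplicates removed.
    // Each rectangle is {x, y, width, height}; the result is normalised so that
    // width and height are non-negative.
    static List<double[]> distinct(List<double[]> rects) {
        Set<String> seen = new HashSet<>();
        List<double[]> result = new ArrayList<>();
        for (double[] r : rects) {
            double x = Math.min(r[0], r[0] + r[2]);
            double y = Math.min(r[1], r[1] + r[3]);
            double w = Math.abs(r[2]);
            double h = Math.abs(r[3]);
            String key = quantise(x) + ":" + quantise(y) + ":" + quantise(w) + ":" + quantise(h);
            if (seen.add(key)) {
                result.add(new double[] {x, y, w, h});
            }
        }
        return result;
    }
}
```

The rounding step is a judgment call: pick a tolerance that is smaller than the thinnest border you expect in your documents.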

Below is C# code targeting iTextSharp 5.1.1.0. You should be able to convert it to Java and iText very easily. I noted the one line that has .Net-specific code, which needs to be changed from a generic list (List<string>) to a Java equivalent, probably an ArrayList. You'll also need to adjust some casing: .Net uses Object.Method() whereas Java uses Object.method(). Lastly, .Net accesses properties without gets and sets, so Object.Property is both the getter and the setter, compared to Java's Object.getProperty() and Object.setProperty().

Hopefully this gets you started at least!

        //Source file to read from
        string sourceFile = "c:\\Hello.pdf";

        //Bind a reader to our PDF
        PdfReader reader = new PdfReader(sourceFile);

        //Create our buffer for previous token values. For Java users, List<string> is a generic list, probably most similar to an ArrayList
        List<string> buf = new List<string>();

        //Get the raw bytes for the page
        byte[]  pageBytes = reader.GetPageContent(1);
        //Get the raw tokens from the bytes
        PRTokeniser tokeniser = new PRTokeniser(pageBytes);

        //Create some variables to set later
        PRTokeniser.TokType tokenType;
        string tokenValue;

        //Loop through each token
        while (tokeniser.NextToken()) {
            //Get the types and value
            tokenType = tokeniser.TokenType;
            tokenValue = tokeniser.StringValue;
            //If the type is a numeric type
            if (tokenType == PRTokeniser.TokType.NUMBER) {
                //Store it in our buffer for later use
                buf.Add(tokenValue);
            //Otherwise we only care about raw commands which are categorized as "OTHER"
            } else if (tokenType == PRTokeniser.TokType.OTHER) {
                //Look for a rectangle token
                if (tokenValue == "re") {
                    //Sanity check, make sure we have enough items in the buffer
                    if (buf.Count < 4) throw new Exception("Not enough elements in buffer for a rectangle");
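                    //Read and convert the values. PDF numbers always use '.' as the decimal separator,
                    //so parse with the invariant culture instead of the machine's current culture
                    System.Globalization.CultureInfo inv = System.Globalization.CultureInfo.InvariantCulture;
                    float x = float.Parse(buf[buf.Count - 4], inv);
                    float y = float.Parse(buf[buf.Count - 3], inv);
                    float w = float.Parse(buf[buf.Count - 2], inv);
                    float h = float.Parse(buf[buf.Count - 1], inv);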
                    //..do something with them here
                }
            }
        }
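As for what "do something with them" could look like for the table case in the question, here is one possible Java sketch. It assumes the rectangles you collected are cell outlines of a regular grid (no merged cells), each stored as a double[] of {x, y, width, height}, and that no cm transformation changes the coordinates; if your PDF draws borders as thin filled rectangles instead, you would first have to merge edges that lie very close together. TableGridSketch, edges and intersections are hypothetical names, not iText API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of iText.
class TableGridSketch {

    // Collects the distinct edge coordinates along one axis, sorted ascending.
    // axis 0 reads x and width (vertical grid lines); axis 1 reads y and height (horizontal grid lines).
    static List<Double> edges(List<double[]> rects, int axis) {
        TreeSet<Double> sorted = new TreeSet<>();
        for (double[] r : rects) {
            sorted.add(r[axis]);               // start edge
            sorted.add(r[axis] + r[axis + 2]); // end edge
        }
        return new ArrayList<>(sorted);
    }

    // Pairs every vertical grid line with every horizontal one, giving the {x, y} crossing points.
    static List<double[]> intersections(List<double[]> rects) {
        List<double[]> points = new ArrayList<>();
        for (double x : edges(rects, 0)) {
            for (double y : edges(rects, 1)) {
                points.add(new double[] {x, y});
            }
        }
        return points;
    }
}
```

Each pair of neighbouring x edges combined with each pair of neighbouring y edges then bounds one cell.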
Chris Haas answered Dec 28 '22 08:12