Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do extract text layer and background layer from pdf?

In my project I've to do a PDF Viewer in HTML5/CSS3 and the application has to allow user to add comments and annotation. Actually, I've to do something very similar to crocodoc.com.

At the beginning I was thinking to create images from the PDF and allow user create area and post comments associates to this area. Unfortunately, the client wants also navigate in this PDF and add only comments on allowed sections (for example, paragraphs or selected text).

And now I'm in front of one problem that is to get the text and the best way to do it. If any body has some clues how I can reach it, I would appreciate.

I tried pdftohtml, but output doesn't look like the original document whom is really complex (example of document). Even this one doesn't reflect really the output, but is much better than pdftohtml.

I'm open to any solutions, with preference for command line under linux.

like image 755
yvan Avatar asked Sep 08 '11 18:09

yvan


People also ask

How do I extract text from a PDF layer?

To extract information from a PDF in Acrobat DC, choose Tools > Export PDF and select an option. To extract text, export the PDF to a Word format or rich text format, and choose from several advanced options that include: Retain Flowing Text.


1 Answers

I've been down the same road as you, with even much more complex tasks.

After trying out everything I ended up using C# under Mono (so it runs on linux) with iTextSharp.

Even with a very complete library such as iTextSharp, some tasks required allot of trial-and-error :)

To extract the text from a page is easy (check the below snipper), however if you intend to keep the text coordinates, fonts and sizes, you will have more work to do.

int pdf_page = 5;
string page_text = "";

PdfReader reader = new PdfReader("path/to/pdf/file.pdf");
PRTokeniser token = new PRTokeniser(reader.GetPageContent(pdf_page));
while(token.NextToken())
{
    if(token.TokenType == PRTokeniser.TokType.STRING)
    {
        page_text += token.StringValue;
    }
    else if(token.StringValue == "Tj")
    {
        page_text += " ";
    }
}

Do a Console.WriteLine(token.StringValue) on all tokens to see how paragraphs of text are structured in PDFs. This way you can detect coordinates, font, font size, etc.

Addition:

Given the task you are required to do, I have a suggestion for you:

Extract the text with coordinates and font families and sizes - all information about each paragraph. Then, to a PDF-to-images, and in your online viewer, apply invisible selectable text over the paragraphs on the image where needed.

This way your users can select a part of the text where needed, without the need of reconstructing the whole PDF in html :)

like image 126
Tom Avatar answered Sep 18 '22 23:09

Tom