How do I extract the text layer and background layer from a PDF?

In my project I have to build a PDF viewer in HTML5/CSS3, and the application has to let users add comments and annotations. In fact, I have to build something very similar to crocodoc.com.

At first I was thinking of creating images from the PDF and letting users define areas and post comments associated with those areas. Unfortunately, the client also wants to navigate within the PDF and to add comments only on allowed sections (for example, paragraphs or selected text).

So now my problem is how to get the text, and what the best way to do that is. If anybody has any clues on how I can achieve this, I would appreciate it.

I tried pdftohtml, but the output doesn't look like the original document, which is really complex (example of document). Even this one doesn't really reflect the output, but it is much better than pdftohtml.

I'm open to any solution, with a preference for command-line tools under Linux.

asked Sep 08 '11 18:09

yvan

1 Answer

I've been down the same road as you, with even more complex tasks.

After trying out everything, I ended up using C# under Mono (so it runs on Linux) with iTextSharp.

Even with a very complete library such as iTextSharp, some tasks required a lot of trial and error :)

Extracting the text from a page is easy (see the snippet below); however, if you intend to keep the text coordinates, fonts and sizes, you will have more work to do.

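using System.Text;
using iTextSharp.text.pdf;

int pdf_page = 5; // page numbers start at 1
StringBuilder page_text = new StringBuilder();

PdfReader reader = new PdfReader("path/to/pdf/file.pdf");
// Tokenise the raw content stream of the chosen page
PRTokeniser token = new PRTokeniser(reader.GetPageContent(pdf_page));
while (token.NextToken())
{
    if (token.TokenType == PRTokeniser.TokType.STRING)
    {
        // String operands carry the text that will be shown
        page_text.Append(token.StringValue);
    }
    else if (token.StringValue == "Tj")
    {
        // "Tj" (show text) closes a run of text, so separate it from the next one
        page_text.Append(" ");
    }
}
reader.Close();
// page_text.ToString() now holds the extracted text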

Call Console.WriteLine(token.StringValue) on every token to see how paragraphs of text are structured in PDFs. This way you can detect coordinates, fonts, font sizes, etc.
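As a rough illustration of that token dump, the sketch below groups operands with the operator that follows them and prints only the text-related operators. It assumes the same iTextSharp version as the snippet above (one where PRTokeniser accepts a byte array); the class name TokenDump and the command-line arguments are made up for this example. Keep in mind that Td offsets are relative and Tm matrices are further affected by the page's transformation matrix, so the printed numbers are raw operands, not final page coordinates.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper: prints the text-related operators of one page
// together with their operands.
class TokenDump
{
    static void Main(string[] args)
    {
        // args[0]: path to the PDF, args[1]: page number (starting at 1)
        PdfReader reader = new PdfReader(args[0]);
        PRTokeniser token = new PRTokeniser(reader.GetPageContent(int.Parse(args[1])));

        // In a content stream the operands come first and the operator last,
        // so collect operands until an operator shows up.
        List<string> operands = new List<string>();
        while (token.NextToken())
        {
            switch (token.TokenType)
            {
                case PRTokeniser.TokType.NUMBER:
                case PRTokeniser.TokType.STRING:
                case PRTokeniser.TokType.NAME:
                    operands.Add(token.StringValue);
                    break;
                case PRTokeniser.TokType.OTHER:
                    string op = token.StringValue;
                    // Tf = font and size, Td/Tm = position, Tj/TJ = show text
                    if (op == "Tf" || op == "Td" || op == "Tm" || op == "Tj" || op == "TJ")
                    {
                        Console.WriteLine(op + ": " + string.Join(" | ", operands.ToArray()));
                    }
                    operands.Clear();
                    break;
            }
        }
        reader.Close();
    }
}
```

Reading the output top to bottom, a Tf line gives the font resource name and size in effect for the Tj/TJ lines that follow it, which is the information needed to rebuild each paragraph's style.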

Addition:

Given the task you are required to do, I have a suggestion for you:

Extract the text with its coordinates, font families and sizes, that is, all the information about each paragraph. Then do a PDF-to-images conversion and, in your online viewer, place invisible selectable text over the paragraphs on the image where needed.

This way your users can select part of the text where needed, without you having to reconstruct the whole PDF in HTML :)
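A minimal sketch of the overlay step, assuming the text runs have already been extracted into a simple structure. The TextRun class, the Overlay.ToSpan helper and the sample numbers are hypothetical, not part of iTextSharp. The only real work is converting PDF coordinates (points, origin at the bottom-left) into CSS pixels (origin at the top-left) for the rendered page image, and approximating the top of a run as its baseline plus the font size.

```csharp
using System;
using System.Globalization;
using System.Net;

// Hypothetical container for one extracted run of text.
// Positions are in PDF points, measured from the bottom-left page corner.
class TextRun
{
    public string Text;
    public double X;        // left edge of the run
    public double Y;        // baseline of the run
    public double FontSize; // font size in points
}

static class Overlay
{
    // pageHeight: page height in points
    // scale: rendered image width in pixels / page width in points
    public static string ToSpan(TextRun run, double pageHeight, double scale)
    {
        double left = run.X * scale;
        // CSS measures from the top and PDF from the bottom, so flip the axis.
        // The top of the run is approximated as baseline + font size.
        double top = (pageHeight - run.Y - run.FontSize) * scale;
        double size = run.FontSize * scale;
        return string.Format(CultureInfo.InvariantCulture,
            "<span style=\"position:absolute;left:{0:0.##}px;top:{1:0.##}px;" +
            "font-size:{2:0.##}px;color:transparent\">{3}</span>",
            left, top, size, WebUtility.HtmlEncode(run.Text));
    }
}

class Program
{
    static void Main()
    {
        // A US Letter page (612 x 792 points) rendered at twice its size
        TextRun run = new TextRun { Text = "Hello", X = 72, Y = 700, FontSize = 12 };
        Console.WriteLine(Overlay.ToSpan(run, 792, 2));
        // prints a span with left:144px, top:160px and font-size:24px
    }
}
```

The generated spans would go into a container that is positioned relative and has the page image as its background; the transparent color keeps the text invisible while still letting the browser select it.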

answered Sep 18 '22 23:09

Tom

Tags: html, linux, php, pdf, ghostscript
