
Tika - how to extract text from PDF text: underlined, highlighted, crossed out

I'm using Tika* to parse a PDF file. Retrieving the document's text is no problem, but I can't figure out how to extract text that is:

  • underlined
  • highlighted
  • crossed out

Adobe Writer offers several text-markup options, but I can't see where they are "hidden".

Is there a way to extract this metadata (underline, highlight, ...)?

Do you know whether Tika is able to extract this data?

*http://tika.apache.org/

Bronn asked Sep 09 '25 11:09


1 Answer

Wow. Four years is a long time to wait for an answer, and I expect you have found a solution by now. Still, for the sake of anyone else who lands on this page, the answer is yes. Apache Tika can extract not just the text of a document but also its formatting (e.g. bold, italic). This was my scenario:

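    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.apache.tika.sax.ToXMLContentHandler;
    import org.xml.sax.ContentHandler;

    // inputStream is the document you wish to parse.
    AutoDetectParser parser = new AutoDetectParser();
    // Serialize the document body as XHTML so that formatting tags are kept.
    ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());
    Metadata metadata = new Metadata();

    parser.parse(inputStream, handler, metadata);
    System.out.println(handler.toString());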

The print statement outputs your document as XML. After a little cleanup of the XML (really HTML tags), you are left with tags like <b>text</b> for bold text and <i>text</i> for italic text, which you can then render however you like. Good luck.
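
The cleanup step could be sketched roughly as follows. This is only an illustration, not part of Tika: the class name StyledRuns and its extract method are hypothetical, and the code assumes the printed output is a well-formed XHTML fragment with no XML declaration that uses plain <b> and <i> element names. It relies on the JDK's built-in SAX parser alone, and whether those tags appear at all depends on the document and the parser that produced the XHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text found inside <b> and <i> elements.
public class StyledRuns {

    /** Returns entries such as "b:some text" for each outermost <b> or <i> element. */
    public static List<String> extract(String xhtmlFragment) throws Exception {
        List<String> runs = new ArrayList<>();
        DefaultHandler collector = new DefaultHandler() {
            private String openTag; // "b" or "i" while inside such an element, else null
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Only the outermost styled element starts a run; nested ones are folded into it.
                if (openTag == null && (qName.equals("b") || qName.equals("i"))) {
                    openTag = qName;
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (openTag != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals(openTag)) {
                    runs.add(openTag + ":" + text.toString().trim());
                    openTag = null;
                }
            }
        };
        // Wrap in a synthetic root so a fragment with several top-level elements parses as XML.
        String wrapped = "<root>" + xhtmlFragment + "</root>";
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(wrapped)), collector);
        return runs;
    }

    public static void main(String[] args) throws Exception {
        // prints [b:bold words, i:italic words]
        System.out.println(extract("<p>plain <b>bold words</b> and <i>italic words</i></p>"));
    }
}
```

In practice you would pass handler.toString() from the snippet above to extract. Other element names (for example <u>, if your output contains it) could be added to the check in startElement in the same way.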

jaletechs answered Sep 11 '25 21:09