Tika - how to extract text from PDF text: underlined, highlighted, crossed out

Question

I'm using Tika* to parse a PDF file. Retrieving the document's text works fine, but I can't figure out how to extract text that is:

underlined
highlighted
crossed out

Adobe Writer offers several text markup options, but I can't see where they are "hidden".

Is there a way to extract this markup information (underline, highlight, ...)?

Do you know if Tika is able to extract this data?

*http://tika.apache.org/

jaletechs · Accepted Answer

Wow. Four years is a long time to wait for an answer, and I expect you have found a solution by now. Still, for the sake of anyone who lands on this question, the answer is yes. Apache Tika can extract not just the text in a document but also formatting such as bold and italics. This was my scenario:

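    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.apache.tika.sax.ToXMLContentHandler;
    import org.xml.sax.ContentHandler;

    // inputStream is the document you wish to parse.
    AutoDetectParser parser = new AutoDetectParser();
    // ToXMLContentHandler keeps the markup; BodyContentHandler restricts output to the body.
    ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());
    Metadata metadata = new Metadata();

    parser.parse(inputStream, handler, metadata);
    System.out.println(handler.toString());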

The print statement outputs your document as XML (really XHTML tags). After a little cleanup, you are left with tags such as <b>text</b> for bold text and <i>text</i> for italic text, which you can then render however you like. Good luck.
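As a rough illustration of the cleanup step, here is a minimal sketch that pulls the bold and italic runs out of an XHTML string using only the Java standard library. The class `FormattedRunExtractor` and its `extract` method are hypothetical names, not part of Tika. The sketch assumes the output really contains plain `<b>` and `<i>` tags without attributes, which may not hold for every input format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: collects bold/italic runs from an XHTML string.
final class FormattedRunExtractor {

    // Matches <b>...</b> or <i>...</i>; DOTALL lets a run span line breaks.
    // Assumption: the tags carry no attributes.
    private static final Pattern RUN =
            Pattern.compile("<(b|i)>(.*?)</\\1>", Pattern.DOTALL);

    // Returns entries of the form "b:some text" or "i:some text", in document order.
    static List<String> extract(String xhtml) {
        List<String> runs = new ArrayList<>();
        Matcher m = RUN.matcher(xhtml);
        while (m.find()) {
            // Drop any tags nested inside the run; a nested run is reported
            // only under its outer tag.
            String text = m.group(2).replaceAll("<[^>]+>", "").trim();
            runs.add(m.group(1) + ":" + text);
        }
        return runs;
    }

    public static void main(String[] args) {
        String sample = "<p>Plain <b>bold words</b> and <i>italic words</i>.</p>";
        System.out.println(extract(sample)); // prints [b:bold words, i:italic words]
    }
}
```

In practice you would pass `handler.toString()` from the snippet above instead of the hard-coded sample. A regular expression is used here because the body-only output may be a fragment rather than a complete XML document; for anything beyond simple cases a real XML or HTML parser is likely the safer choice.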

Tags: text, pdf, markup, apache-tika
