How to get raw text from pdf file using java

Tags:

I have some pdf files, Using pdfbox i have converted them into text and stored into text files, Now from the text files i want to remove

Hyperlinks
All special characters
Blank lines
headers footers of pdf files
“1)”,“2)”, “a)”, “bullets”, etc.

I want to get valid text line by line like this:

We propose OntoGain, a method for ontology learning from multi-word concept terms extracted from plain text. OntoGain follows an ontology learning process dened by distinct processing layers. Building upon plain term extraction a con-cept hierarchy is formed by clustering the extracted concepts. The derived term taxonomy is then enriched with non-taxonomic relations. Several dierent state-of-the-art methods have been examined for implementing each layer. OntoGain is based upon multi-word term concepts, as multi-word or compound terms are vested with more solid and distinctive semantics than plain single word terms. We opted for a hierarchical clustering method and Formal Concept Analysis (FCA) algorithm for building the term taxonomy. Furthermore an association rule algorithm is applied for revealing non-taxonomic relations. A method which tries to carry out the most appropriate generalization level between a relation's concepts is also implemented. To show proof of concept, a system prototype is implemented. The OntoGain allows transformation of the derived ontology into OWL using Jena Semantic Web Frame-work1. OntoGain is applied on two separate data sources a medical and computer corpus and its results are compared with similar results obtained by Text2Onto, a state-of-the-art-ontology learning method. The analysis of 11.5 CCD1.1 results indicates that OntoGain performs better than Text2Onto in terms of precision extracts more correct concepts while being more selective extracts fewer but more reasonable concepts.

How can I achieve this?

649

asked Aug 07 '13 08:08

user2609542

3 Answers

Using pdfbox we can achive this

Example :

public static void main(String args[]) {

    PDFParser parser = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    PDFTextStripper pdfStripper;

    String parsedText;
    String fileName = "E:\\Files\\Small Files\\PDF\\JDBC.pdf";
    File file = new File(fileName);
    try {
        parser = new PDFParser(new FileInputStream(file));
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
        System.out.println(parsedText.replaceAll("[^A-Za-z0-9. ]+", ""));
    } catch (Exception e) {
        e.printStackTrace();
        try {
            if (cosDoc != null)
                cosDoc.close();
            if (pdDoc != null)
                pdDoc.close();
        } catch (Exception e1) {
            e1.printStackTrace();
        }

    }
}

199

answered Oct 10 '22 03:10

SANN3

Hi we can extract the pdf files using Apache Tika

The Example is :

import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class WebPagePdfExtractor {

    public Map<String, Object> processRecord(String url) {
        DefaultHttpClient httpclient = new DefaultHttpClient();
        Map<String, Object> map = new HashMap<String, Object>();
        try {
            HttpGet httpGet = new HttpGet(url);
            HttpResponse response = httpclient.execute(httpGet);
            HttpEntity entity = response.getEntity();
            InputStream input = null;
            if (entity != null) {
                try {
                    input = entity.getContent();
                    BodyContentHandler handler = new BodyContentHandler();
                    Metadata metadata = new Metadata();
                    AutoDetectParser parser = new AutoDetectParser();
                    ParseContext parseContext = new ParseContext();
                    parser.parse(input, handler, metadata, parseContext);
                    map.put("text", handler.toString().replaceAll("\n|\r|\t", " "));
                    map.put("title", metadata.get(TikaCoreProperties.TITLE));
                    map.put("pageCount", metadata.get("xmpTPg:NPages"));
                    map.put("status_code", response.getStatusLine().getStatusCode() + "");
                } catch (Exception e) {
                    e.printStackTrace();
                } finally {
                    if (input != null) {
                        try {
                            input.close();
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                }
            }
        } catch (Exception exception) {
            exception.printStackTrace();
        }
        return map;
    }

    public static void main(String arg[]) {
        WebPagePdfExtractor webPagePdfExtractor = new WebPagePdfExtractor();
        Map<String, Object> extractedMap = webPagePdfExtractor.processRecord("http://math.about.com/library/q20.pdf");
        System.out.println(extractedMap.get("text"));
    }

}

answered Oct 10 '22 03:10

Johnsa Philip

You can use iText for do such things

//iText imports

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

for example:

try {     
    PdfReader reader = new PdfReader(INPUTFILE);
    int n = reader.getNumberOfPages(); 
    String str=PdfTextExtractor.getTextFromPage(reader, 2); //Extracting the content from a particular page.
    System.out.println(str);
    reader.close();
} catch (Exception e) {
    System.out.println(e);
}

another one

try {

    PdfReader reader = new PdfReader("c:/temp/test.pdf");
    System.out.println("This PDF has "+reader.getNumberOfPages()+" pages.");
    String page = PdfTextExtractor.getTextFromPage(reader, 2);
    System.out.println("Page Content:\n\n"+page+"\n\n");
    System.out.println("Is this document tampered: "+reader.isTampered());
    System.out.println("Is this document encrypted: "+reader.isEncrypted());
} catch (IOException e) {
    e.printStackTrace();
}

the above examples can only extract the text, but you need to do some more to remove hyperlinks, bullets, heading & numbers.

answered Oct 10 '22 02:10

Ulaga

Related questions
                            
                                Use Java lambda instead of 'if else'
                            
                                Proguard and reflection in Android
                            
                                Create prototype scoped Spring bean with annotations?
                            
                                How to do function composition?
                            
                                Does log4j support JSON format?
                            
                                Why is the square root of -Infinity +Infinity in Java? [duplicate]
                            
                                Simulation of Service using Mockito 2 leads to stubbing error
                            
                                What's an example of duck typing in Java?
                            
                                How to check if a parameter contains two substrings using Mockito?
                            
                                Create file name using date and time
                            
                                Java - Hash algorithms - Fastest implementations
                            
                                JFrame.dispose() vs System.exit()
                            
                                why cannot we create spy for Parameterized Constructor using Mockito
                            
                                Have you used Quickcheck in a real project [closed]
                            
                                ojdbc14.jar vs. ojdbc6.jar
                            
                                Java ProcessBuilder: Resultant Process Hangs
                            
                                Starting a process in Java?
                            
                                Creating zip archive in Java
                            
                                Hibernate: CRUD Generic DAO
                            
                                FindBugs warning: Inefficient use of keySet iterator instead of entrySet iterator

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to get raw text from pdf file using java

Tags:

java

pdf

pdfbox