
Converting a PDF to a Word document using Java

Tags: java

I've successfully converted a JPEG to PDF using Java, but I don't know how to convert a PDF to Word. The code I used for the JPEG-to-PDF conversion is given below.

Can anyone tell me how to convert a PDF to Word (.doc/.docx) using Java?

import java.io.FileOutputStream;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.Document;

public class JpegToPDF {
    public static void main(String[] args) {
        try {
            Document convertJpgToPdf = new Document();
            PdfWriter.getInstance(convertJpgToPdf, new FileOutputStream(
                    "c:\\java\\ConvertImagetoPDF.pdf"));
            convertJpgToPdf.open();
            Image convertJpg = Image.getInstance("c:\\java\\test.jpg");
            convertJpgToPdf.add(convertJpg);
            convertJpgToPdf.close();
            System.out.println("Successfully Converted JPG to PDF in iText");
        } catch (Exception i1) {
            i1.printStackTrace();
        }
    }
}

1 Answer

In fact, you need two libraries, both of them open source. The first is iText, which is used to extract the text from the PDF file. The second is Apache POI, which is used to create the Word document.

The code is quite simple:

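import java.io.FileOutputStream;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import org.apache.poi.xwpf.usermodel.BreakType;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class PdfToWord {
    public static void main(String[] args) throws Exception {
        // Create the Word document
        XWPFDocument doc = new XWPFDocument();

        // Open the PDF file
        String pdf = "myfile.pdf";
        PdfReader reader = new PdfReader(pdf);
        try {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            int pageCount = reader.getNumberOfPages();

            // Read the PDF page by page (iText page numbers start at 1)
            for (int i = 1; i <= pageCount; i++) {
                TextExtractionStrategy strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
                // Extract the text
                String text = strategy.getResultantText();
                // Create a new paragraph in the Word document and add the extracted text
                XWPFParagraph p = doc.createParagraph();
                XWPFRun run = p.createRun();
                run.setText(text);
                // Add a page break between pages, but not after the last one
                if (i < pageCount) {
                    run.addBreak(BreakType.PAGE);
                }
            }

            // Write the Word document; try-with-resources closes the stream
            try (FileOutputStream out = new FileOutputStream("myfile.docx")) {
                doc.write(out);
            }
        } finally {
            // Close the PDF even if extraction or writing fails
            reader.close();
        }
    }
}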

Beware: the extraction strategy used here discards all formatting. You can address this by plugging in your own, more complex extraction strategy.
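Short of writing a full extraction strategy, one small step is to keep at least the line structure of each page. The following is only a sketch using the standard library: PageTextLines is a hypothetical helper (it is not part of iText or POI), and it assumes the extracted page text separates lines with ordinary line breaks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of iText or POI: splits the text extracted
// from one PDF page into its non-empty lines.
final class PageTextLines {
    private PageTextLines() {
    }

    // Returns the trimmed, non-empty lines of pageText in their original order.
    // A null or empty input yields an empty list.
    static List<String> split(String pageText) {
        List<String> lines = new ArrayList<>();
        if (pageText == null) {
            return lines;
        }
        // \R matches any line break sequence (\n, \r\n, \r, ...); needs Java 8+
        for (String raw : pageText.split("\\R")) {
            String line = raw.trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Inside the page loop, each string returned by PageTextLines.split(text) could then get its own paragraph via doc.createParagraph().createRun().setText(line), instead of putting the whole page into a single run. Fonts, tables and images are still lost; only the line breaks survive.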

stefan.schwetschke answered Dec 01 '25 06:12