
How to Build PDFBox for .Net

I've seen examples for extracting text from PDF files that use either iTextSharp or PDFBox. PDFBox seems to be the most "reliable" method for extracting text, but it requires many additional steps.

I've tried to build the DLLs using the instructions found here, but I have no idea how to correctly build the required files for .Net.

I'm pretty lost. Could someone provide an "Include PDFBox in your .Net application for Dummies" step-by-step guide?

SharpBarb asked Dec 09 '11 06:12


1 Answer

I finally got it to work. I've outlined the steps I followed to get a working example. I hope someone finds this helpful.

Download the Java JDK
Download IKVM 0.42.0.6
Download PDFBox 1.6.0-src.zip

You also need Apache Ant to run the build. The Ant Manual was helpful.

I renamed the Ant and PDFBox folders to shorten their names and moved them to my C: drive.

You have to set up your environment variables. (Windows 7) Right-click My Computer->Properties->Advanced System Settings->Environment Variables

I used the settings below, but yours will vary depending on where you installed Java and where you put the Ant and PDFBox folders.

Variable    Value
ANT_HOME    C:\apache-ant\
JAVA_HOME   C:\Program Files (x86)\Java\jdk1.7.0_01
Path        ;C:\apache-ant\bin\     (Append semi-colon and path)

Once the above is done, type “ant” in a new command window. If everything is set up correctly, you should get a “build.xml does not exist!” message.
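If the “ant” check fails, it can help to see what the variables actually contain. Below is a minimal sketch of such a check as a small C# console program. The class name BuildEnvCheck and its helper methods are hypothetical names made up for this illustration; it only uses standard .NET calls and assumes the Windows variable names from the table above.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper for illustration; not part of Ant, IKVM or PDFBox.
static class BuildEnvCheck
{
    // True when the named variable is set and points at an existing directory.
    public static bool PointsToDirectory(string variableName)
    {
        string value = Environment.GetEnvironmentVariable(variableName);
        return !string.IsNullOrWhiteSpace(value) && Directory.Exists(value);
    }

    // True when one entry of a semicolon-separated Path value equals the
    // given directory, ignoring case, quotes and trailing slashes.
    public static bool PathContains(string pathValue, string directory)
    {
        if (string.IsNullOrEmpty(pathValue) || string.IsNullOrEmpty(directory))
        {
            return false;
        }
        string wanted = Normalize(directory);
        return pathValue.Split(';')
            .Select(Normalize)
            .Any(entry => entry.Length > 0 &&
                          string.Equals(entry, wanted, StringComparison.OrdinalIgnoreCase));
    }

    static string Normalize(string dir)
    {
        return dir.Trim().Trim('"').TrimEnd('\\', '/');
    }

    static void Main()
    {
        Console.WriteLine("ANT_HOME ok:  " + PointsToDirectory("ANT_HOME"));
        Console.WriteLine("JAVA_HOME ok: " + PointsToDirectory("JAVA_HOME"));

        string antHome = Environment.GetEnvironmentVariable("ANT_HOME") ?? "";
        string pathValue = Environment.GetEnvironmentVariable("Path") ?? "";
        bool antOnPath = antHome.Length > 0 &&
                         PathContains(pathValue, Path.Combine(antHome, "bin"));
        Console.WriteLine("Ant bin folder on Path: " + antOnPath);
    }
}
```

Run it from a command window opened after you changed the variables; a window that was already open will still see the old values.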

Edit the build.xml file inside the “pdfbox-1.6.0\pdfbox” folder. Find the line that sets the IKVM folder and replace “.” with your IKVM folder path.

I moved IKVM to “C:\IKVM”, so that is the path I used.

Open a command window, cd to “C:\pdfbox-1.6.0\pdfbox” and type “ant”.

…and then a miracle occurs.

A bunch of new folders should now exist in the pdfbox folder. The required DLLs are in the bin folder. I don’t know why, but I got a “-SNAPSHOT” at the end of all my file names (pdfbox-1.6.0-SNAPSHOT.dll).

IKVM.GNU.Classpath (also called IKVM.OpenJDK.Classpath) no longer exists. It has been modularized since the 0.40 release and is now available as several IKVM.OpenJDK DLLs. You only need a few of them.

Create a new C# project in Visual Studio.

Copy these files from the PDFBox bin folder to the bin folder of your Visual C# project:

pdfbox-1.6.0-SNAPSHOT.dll
fontbox-1.6.0-SNAPSHOT.dll
commons-logging.dll

Copy these files from the IKVM bin folder to the bin folder of your Visual C# project:

IKVM.OpenJDK.Core.dll
IKVM.OpenJDK.SwingAWT.dll
IKVM.OpenJDK.Text.dll
IKVM.OpenJDK.Util.dll
IKVM.Runtime.dll

Add references to the IKVM DLLs above and build your project.

Add a reference to the pdfbox DLL and build your project again.
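A missing DLL usually only shows up at run time as a load error, so it can be worth checking the output folder first. The following is a rough sketch of such a check. DependencyCheck and FindMissing are hypothetical names invented here, and the file names are simply the ones listed above, so adjust them if your build produced different names (for example without the “-SNAPSHOT” suffix).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper for illustration; not part of IKVM or PDFBox.
static class DependencyCheck
{
    // File names copied from the two lists above (assumption: your build
    // produced the same names).
    public static readonly string[] RequiredFiles =
    {
        "pdfbox-1.6.0-SNAPSHOT.dll",
        "fontbox-1.6.0-SNAPSHOT.dll",
        "commons-logging.dll",
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "IKVM.OpenJDK.Text.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the names from fileNames that do not exist in the directory.
    public static List<string> FindMissing(string directory, IEnumerable<string> fileNames)
    {
        return fileNames
            .Where(name => !File.Exists(Path.Combine(directory, name)))
            .ToList();
    }

    static void Main()
    {
        // The folder the running program was started from, normally bin\Debug or bin\Release.
        string outputDir = AppDomain.CurrentDomain.BaseDirectory;
        List<string> missing = FindMissing(outputDir, RequiredFiles);

        if (missing.Count == 0)
        {
            Console.WriteLine("All required DLLs were found in " + outputDir);
        }
        else
        {
            Console.WriteLine("Missing from " + outputDir + ":");
            foreach (string name in missing)
            {
                Console.WriteLine("  " + name);
            }
        }
    }
}
```

The sketch only checks that the files exist; it does not verify that they were built against matching IKVM and PDFBox versions.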

You are now ready to write some code. The simple example below produced a nice text file from the input PDF.

using System;
using System.IO;

using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

namespace testPDF
{
    class Program
    {
        static void Main()
        {
            PDFtoText pdf = new PDFtoText();

            // Extract the text of the whole PDF.
            string pdfText = pdf.parsePDF(@"C:\Sample.pdf");

            // Write the extracted text to a plain text file.
            using (StreamWriter writer = new StreamWriter(@"C:\Sample.txt"))
            {
                writer.Write(pdfText);
            }
        }

        class PDFtoText
        {
            public string parsePDF(string filepath)
            {
                PDDocument document = PDDocument.load(filepath);
                try
                {
                    PDFTextStripper stripper = new PDFTextStripper();
                    return stripper.getText(document);
                }
                finally
                {
                    // Release the resources held by the loaded document.
                    document.close();
                }
            }
        }
    }
}
SharpBarb answered Sep 29 '22 16:09