
Ghostscript convert a PDF and output in a textfile

1. I need to convert a PDF file into a .txt file. My command seems to work, since I get the converted text on the screen, but somehow I'm unable to direct the output into a text file.

public static string[] GetArgs(string inputPath, string outputPath)
{ 
    return new[] {
                "-q", "-dNODISPLAY", "-dSAFER",
                "-dDELAYBIND", "-dWRITESYSTEMDICT", "-dSIMPLE",
                "-c", "save", "-f",
                "ps2ascii.ps", inputPath, "-sDEVICE=txtwrite",
                String.Format("-sOutputFile={0}", outputPath),
                "-c", "quit"
    }; 
}

2. Is there a Unicode-specific .ps?

Update: Posting my complete code; maybe the error is somewhere else.

public static string[] GetArgs(string inputPath, string outputPath)
{
    return new[]    
    {   "-o c:/test.txt",    
        "-dSIMPLE",
        "-sFONTPATH=c:/windows/fonts",
        "-dNODISPLAY",
        "-dDELAYBIND",
        "-dWRITESYSTEMDICT",
        "-f",
        "C:/Program Files/gs/gs9.05/lib/ps2ascii.ps",               
        inputPath,
    };
}

[DllImport("gsdll64.dll", EntryPoint = "gsapi_new_instance")]
private static extern int CreateAPIInstance(out IntPtr pinstance, IntPtr caller_handle);

[DllImport("gsdll64.dll", EntryPoint = "gsapi_init_with_args")]
private static extern int InitAPI(IntPtr instance, int argc, string[] argv);

[DllImport("gsdll64.dll", EntryPoint = "gsapi_exit")]
private static extern int ExitAPI(IntPtr instance);

[DllImport("gsdll64.dll", EntryPoint = "gsapi_delete_instance")]
private static extern void DeleteAPIInstance(IntPtr instance);

private static object resourceLock = new object();

private static void Cleanup(IntPtr gsInstancePtr)
{
    ExitAPI(gsInstancePtr);
    DeleteAPIInstance(gsInstancePtr);
}

public static void ConvertPdfToText(string inputPath, string outputPath) 
{ 
    CallAPI(GetArgs(inputPath, outputPath));
}

private static void CallAPI(string[] args)      
{       
    // Get a pointer to an instance of the Ghostscript API and run the API with the current arguments       
    IntPtr gsInstancePtr;   
    lock (resourceLock)     
    {           
        CreateAPIInstance(out gsInstancePtr, IntPtr.Zero);      
        try
        {
            int result = InitAPI(gsInstancePtr, args.Length, args);                    
            if (result < 0)     
            {
                throw new ExternalException("Ghostscript conversion error", result);        
            }       
        }           
        finally     
        {               
            Cleanup(gsInstancePtr);     
        }       
    }   
}
Thomas Kurn asked Aug 01 '12 07:08



1 Answer

2 questions, 2 answers:

  1. To get output into a file, use -sOutputFile=/path/to/file on the command line, or add the line

     "-sOutputFile=/where/it/should/go",

     to your C# code (it can be the first argument, but it should come before your first "-c"). But first get rid of the other -sOutputFile stuff you already have in there... :-)

  2. No, PostScript isn't aware of Unicode.


Update

(Remark: Extracting text from PDF reliably is (for various technical reasons) notoriously difficult. And it may not work at all, whichever tool you try...)

On the commandline, the following two should work for recent releases of Ghostscript (current version is v9.05). It would be your own job...

  • ...to test which command works better for your use case, and
  • ...to translate these into c# code.

1. txtwrite device:

gswin32c.exe ^
   -o c:/path/to/output.txt ^
   -dTextFormat=3 ^
   -sDEVICE=txtwrite ^
    input.pdf

Notes:

  1. You may want to use gswin64c.exe (if available) on your system if it is 64bit.
  2. The -o syntax for the output works only with recent versions of Ghostscript.
  3. The -o syntax also implicitly sets the -dBATCH and -dNOPAUSE parameters.
  4. If your Ghostscript is too old and the -o shorthand doesn't work, replace it with -dBATCH -dNOPAUSE -sOutputFile=....
  5. Ghostscript can handle forward slashes inside path arguments even on Windows.
  6. The -dTextFormat is by default set to 3 anyway, so it is not required here. 'Legal' values for it are:
    • 0 : This outputs XML-escaped Unicode along with info related to the format of the text (position, font name, point size, etc). Intended for developers only.
    • 1 : Same as 0, but will output blocks of text.
    • 2 : This outputs Unicode (UCS2) text with BOM (Byte Order Mark); tries to approximate layout of text in original document.
    • 3 : (default) Same as 2, but the text is encoded in UTF-8.
  7. The txtwrite device with this -dTextFormat modifier is a rather new asset of Ghostscript, so please report bugs if you find any.
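
As a rough starting point for the C# translation, the txtwrite command above might map onto an argument array like the one below. This is only a sketch: the class and method names (GsTxtWriteArgs, GetTxtWriteArgs) are made up for illustration. It assumes that gsapi_init_with_args follows the C main convention and skips argv[0], which is why a placeholder comes first; check the API documentation of your Ghostscript version. Note that -o and its path are passed as two separate array elements.

```csharp
public static class GsTxtWriteArgs
{
    // Hypothetical helper: builds the argument list for the txtwrite device.
    public static string[] GetTxtWriteArgs(string inputPath, string outputPath)
    {
        return new[]
        {
            "gs",                 // placeholder for argv[0] (assumed to be ignored by the API)
            "-o", outputPath,     // -o also implies -dBATCH and -dNOPAUSE
            "-dTextFormat=3",     // UTF-8 text; this is the default anyway
            "-sDEVICE=txtwrite",
            inputPath
        };
    }
}
```

The resulting array could then be handed to a wrapper such as the CallAPI method from the question.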

2. Using ps2ascii.ps

gswin32c.exe ^
   -sstdout=c:/path/to/output.txt ^
   -dSIMPLE ^
   -sFONTPATH=c:/windows/fonts ^
   -dNODISPLAY ^
   -dDELAYBIND ^
   -dWRITESYSTEMDICT ^
   -f /path/to/ps2ascii.ps ^
    input.pdf

Notes:

  1. This is a completely different method from the txtwrite device one and cannot be mixed with it!
  2. ps2ascii.ps is a file, a PostScript program that Ghostscript invokes to extract the text. It is usually located in the Ghostscript installdir's /lib subdirectory. Go and see if it is really there.
  3. -dSIMPLE may be replaced by -dCOMPLEX in order to print out extra info lines (current color, presence of an image, rectangular fills).
  4. -sstdout=... is required because the ps2ascii.ps PostScript program does print to stdout only and can't be told to write to a file. So -sstdout=... tells Ghostscript to redirect its stdout to a file.
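
The ps2ascii.ps variant might be sketched in C# along the same lines. Again, the names (GsPs2AsciiArgs, GetPs2AsciiArgs, ps2AsciiPath) are hypothetical, the leading placeholder assumes that the API skips argv[0], and the font path is simply copied from the command above.

```csharp
public static class GsPs2AsciiArgs
{
    // Hypothetical helper: builds the argument list for the ps2ascii.ps method.
    public static string[] GetPs2AsciiArgs(string inputPath, string outputPath, string ps2AsciiPath)
    {
        return new[]
        {
            "gs",                           // placeholder for argv[0] (assumed to be ignored by the API)
            "-sstdout=" + outputPath,       // redirect Ghostscript's stdout into the output file
            "-dSIMPLE",
            "-sFONTPATH=c:/windows/fonts",
            "-dNODISPLAY",
            "-dDELAYBIND",
            "-dWRITESYSTEMDICT",
            "-f", ps2AsciiPath,             // full path to ps2ascii.ps in the Ghostscript lib directory
            inputPath
        };
    }
}
```

There is deliberately no -o or -sDEVICE=txtwrite in this list, since the two methods cannot be mixed.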

3. Non-Ghostscript methods

Do not ignore other, non-Ghostscript methods that may be easier to work with. All of the following are cross-platform and should be available on Windows too:

  • mudraw -t
    GPL licensed (or commercial, if you need). Commandline utility from MuPDF (which is developed by the same group of developers that do Ghostscript) to extract text from PDF.
  • pdftotext
    GPL licensed. Commandline utility from Poppler (which is a fork of XPDF, which also provides a pdftotext).
  • podofotxtextract
    GPL licensed. Commandline utility based on the PoDoFo PDF processing library.
  • TET
    The Text Extraction Toolkit from PDFlib.com (commercial, but may be gratis for personal use -- I didn't check recent news). Probably the most powerful text extraction tool of them all...
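
If one of these external tools fits better, it can be started from C# as a child process instead of going through the Ghostscript DLL. The following is a minimal sketch for pdftotext, assuming the executable is on the PATH; the class and method names are invented for illustration, and error handling is left out.

```csharp
using System.Diagnostics;

public static class PdfToTextRunner
{
    // Hypothetical wrapper: runs "pdftotext -enc UTF-8 input output" and returns its exit code.
    public static int Run(string inputPath, string outputPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "pdftotext",   // assumed to be found on the PATH
            Arguments = "-enc UTF-8 \"" + inputPath + "\" \"" + outputPath + "\"",
            UseShellExecute = false,  // start the executable directly, without the shell
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

A call like PdfToTextRunner.Run("input.pdf", "output.txt") would then be expected to leave the extracted text in output.txt if the tool succeeds.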
Kurt Pfeifle answered Nov 10 '22 18:11
