1. I need to convert a PDF file into a .txt file. My command seems to work, since I get the converted text on the screen, but somehow I am unable to direct the output into a text file.
public static string[] GetArgs(string inputPath, string outputPath)
{
return new[] {
"-q", "-dNODISPLAY", "-dSAFER",
"-dDELAYBIND", "-dWRITESYSTEMDICT", "-dSIMPLE",
"-c", "save", "-f",
"ps2ascii.ps", inputPath, "-sDEVICE=txtwrite",
String.Format("-sOutputFile={0}", outputPath),
"-c", "quit"
};
}
2. Is there a Unicode-specific .ps?
Update: Posting my complete code, maybe the error is somewhere else.
public static string[] GetArgs(string inputPath, string outputPath)
{
return new[]
{ "-o c:/test.txt",
"-dSIMPLE",
"-sFONTPATH=c:/windows/fonts",
"-dNODISPLAY",
"-dDELAYBIND",
"-dWRITESYSTEMDICT",
"-f",
"C:/Program Files/gs/gs9.05/lib/ps2ascii.ps",
inputPath,
};
}
[DllImport("gsdll64.dll", EntryPoint = "gsapi_new_instance")]
private static extern int CreateAPIInstance(out IntPtr pinstance, IntPtr caller_handle);
[DllImport("gsdll64.dll", EntryPoint = "gsapi_init_with_args")]
private static extern int InitAPI(IntPtr instance, int argc, string[] argv);
[DllImport("gsdll64.dll", EntryPoint = "gsapi_exit")]
private static extern int ExitAPI(IntPtr instance);
[DllImport("gsdll64.dll", EntryPoint = "gsapi_delete_instance")]
private static extern void DeleteAPIInstance(IntPtr instance);
private static object resourceLock = new object();
private static void Cleanup(IntPtr gsInstancePtr)
{
ExitAPI(gsInstancePtr);
DeleteAPIInstance(gsInstancePtr);
}
public static void ConvertPdfToText(string inputPath, string outputPath)
{
CallAPI(GetArgs(inputPath, outputPath));
}
private static void CallAPI(string[] args)
{
// Get a pointer to an instance of the Ghostscript API and run the API with the current arguments
IntPtr gsInstancePtr;
lock (resourceLock)
{
CreateAPIInstance(out gsInstancePtr, IntPtr.Zero);
try
{
int result = InitAPI(gsInstancePtr, args.Length, args);
if (result < 0)
{
throw new ExternalException("Ghostscript conversion error", result);
}
}
finally
{
Cleanup(gsInstancePtr);
}
}
}
2 questions, 2 answers:
1. To get output to a file, use -sOutputFile=/path/to/file on the commandline, or add the line "-sOutputFile=/where/it/should/go", to your c# code (it can be the first argument, but it should come before your first "-c"). But first get rid of the other -sOutputFile stuff you already have in there... :-)
2. No, PostScript isn't aware of Unicode.
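For the first point, here is a minimal sketch of what such an argument array could look like in c#. The class and method names (GhostscriptArgs, ForTxtWrite) are made up for illustration. One thing worth checking against the Ghostscript API documentation: gsapi_init_with_args treats the array like the argv of a C main function, so the element at index 0 plays the role of the program name and is ignored. The sketch therefore puts a placeholder there, and it passes every switch as its own array element.

```csharp
using System;

public static class GhostscriptArgs
{
    // Hypothetical helper: builds an argument array for the txtwrite device.
    public static string[] ForTxtWrite(string inputPath, string outputPath)
    {
        return new[]
        {
            "gs",                         // argv[0]: placeholder, ignored by gsapi_init_with_args
            "-q",                         // suppress startup messages
            "-dBATCH", "-dNOPAUSE",       // exit when done, never wait for a keypress
            "-sDEVICE=txtwrite",          // text extraction device
            "-sOutputFile=" + outputPath, // where the extracted text goes
            "-f", inputPath               // the PDF to read
        };
    }
}
```

With the code from the question, this would be used as CallAPI(GhostscriptArgs.ForTxtWrite(inputPath, outputPath)). Note that no ps2ascii.ps appears here, since the txtwrite device does the extraction itself.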
Update
(Remark: Extracting text from PDF reliably is (for various technical reasons) notoriously difficult. And it may not work at all, whichever tool you try...)
On the commandline, the following two should work for recent releases of Ghostscript (current version is v9.05). It would be your own job to translate them into c# code.
1. txtwrite device:
gswin32c.exe ^
-o c:/path/to/output.txt ^
-dTextFormat=3 ^
-sDEVICE=txtwrite ^
input.pdf
Notes:
* Use gswin64c.exe (if available) on your system if it is 64bit.
* The -o syntax for the output works only with recent versions of Ghostscript.
* The -o syntax does implicitly also set the -dBATCH and -dNOPAUSE parameters.
* If the -o shorthand doesn't work, replace it with -dBATCH -dNOPAUSE -sOutputFile=...
* -dTextFormat is by default set to 3 anyway, so it is not required here. 'Legal' values for it are:
  * 0: This outputs XML-escaped Unicode along with info related to the format of the text (position, font name, point size, etc). Intended for developers only.
  * 1: Same as 0, but will output blocks of text.
  * 2: This outputs Unicode (UCS2) text with BOM (Byte Order Mark); tries to approximate layout of text in original document.
  * 3: (default) Same as 2, but the text is encoded in UTF-8.
* The txtwrite device with this -dTextFormat modifier is a rather new asset of Ghostscript, so please report bugs if you find any.
2. ps2ascii.ps:
gswin32c.exe ^
-sstdout=c:/path/to/output.txt ^
-dSIMPLE ^
-sFONTPATH=c:/windows/fonts ^
-dNODISPLAY ^
-dDELAYBIND ^
-dWRITESYSTEMDICT ^
-f /path/to/ps2ascii.ps ^
input.pdf
Notes:
* This is a completely different method from the txtwrite device one and cannot be mixed with it!
* ps2ascii.ps is a file, a PostScript program that Ghostscript invokes to extract the text. It is usually located in the Ghostscript installdir's /lib subdirectory. Go and see if it is really there.
* -dSIMPLE may be replaced by -dCOMPLEX in order to print out extra info lines (current color, presence of an image, rectangular fills).
* -sstdout=... is required because the ps2ascii.ps PostScript program does print to stdout only and can't be told to write to a file. So -sstdout=... tells Ghostscript to redirect its stdout to a file.
Do not ignore other, non-Ghostscript methods that may be easier to work with. All of the following are cross-platform and should be available on Windows too:
* mudraw -t (part of MuPDF)
* pdftotext (from Poppler; Xpdf ships its own pdftotext)
* podofotxtextract (part of PoDoFo)
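If you go with one of these external tools, calling it from c# can be as simple as starting a process. The following is only a sketch, assuming pdftotext is installed and reachable via PATH; the class and method names (PdfToTextRunner, BuildArguments, Convert) are made up for illustration. It relies on the usual pdftotext invocation of an input file followed by an output file, plus the -enc UTF-8 option to request UTF-8 output.

```csharp
using System;
using System.Diagnostics;

public static class PdfToTextRunner
{
    // Builds the argument string: -enc UTF-8 "<input>" "<output>"
    public static string BuildArguments(string inputPath, string outputPath)
    {
        return string.Format("-enc UTF-8 \"{0}\" \"{1}\"", inputPath, outputPath);
    }

    // Runs pdftotext and waits for it to finish.
    public static void Convert(string inputPath, string outputPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "pdftotext",   // assumption: the executable is on PATH
            Arguments = BuildArguments(inputPath, outputPath),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "pdftotext failed with exit code " + process.ExitCode);
            }
        }
    }
}
```

Quoting the paths in BuildArguments keeps file names with spaces (such as anything under "C:/Program Files") intact.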