I am trying to read text from a PDF into a string using the iTextSharp library.
iTextSharp.text.pdf.PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(@"C:\mypdf.pdf");
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
string text = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
pdfReader.Close();
Console.WriteLine(text);
This normally works OK, but every few lines the whitespace is omitted, leaving me with output like: "thisismyoutputwithoutwhitespace". The text that parses correctly looks no different from the text that doesn't, and the same passages are consistently parsed incorrectly, which makes me think the cause is something within the PDFs.
In the content stream of a PDF there's no notion of "words". So in iText(Sharp)'s text extraction implementation there are some heuristics to determine how to group characters into words. When the distance between 2 characters is larger than half the width of a space in the current font, whitespace is inserted.
Most likely, the text that gets extracted without whitespace has distances between the words that are smaller than "spacewidth / 2".
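To see the heuristic in isolation, here is a minimal sketch that does not use iTextSharp at all. The Glyph type, the Join method and the factor parameter are made up for illustration; only the "gap larger than a fraction of the space width" rule comes from the description above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SpaceHeuristicDemo
{
    // Hypothetical type (not an iTextSharp class): one character plus the
    // x-coordinates where it starts and ends on the baseline.
    public sealed class Glyph
    {
        public char Character { get; private set; }
        public float StartX { get; private set; }
        public float EndX { get; private set; }

        public Glyph(char character, float startX, float endX)
        {
            Character = character;
            StartX = startX;
            EndX = endX;
        }
    }

    // Concatenates the glyphs, inserting a space whenever the gap to the
    // previous glyph is larger than spaceWidth * factor.
    public static string Join(IList<Glyph> glyphs, float spaceWidth, float factor)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < glyphs.Count; i++)
        {
            if (i > 0)
            {
                float gap = glyphs[i].StartX - glyphs[i - 1].EndX;
                if (gap > spaceWidth * factor)
                {
                    sb.Append(' ');
                }
            }
            sb.Append(glyphs[i].Character);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // "ab" and "cd" are separated by a gap of 1 unit; a space is 3 units wide.
        var glyphs = new List<Glyph>
        {
            new Glyph('a', 0f, 5f),
            new Glyph('b', 5f, 10f),
            new Glyph('c', 11f, 16f),
            new Glyph('d', 16f, 21f)
        };

        Console.WriteLine(Join(glyphs, 3f, 0.5f));  // threshold 1.5:  prints "abcd"
        Console.WriteLine(Join(glyphs, 3f, 0.25f)); // threshold 0.75: prints "ab cd"
    }
}
```

With the default factor of one half, the 1-unit gap falls below the threshold and the two words run together, which is the symptom described in the question.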
In SimpleTextExtractionStrategy.RenderText():
if (spacing > renderInfo.GetSingleSpaceWidth()/2f){
    AppendTextChunk(' ');
}
You can extend SimpleTextExtractionStrategy and adjust RenderText().
In LocationTextExtractionStrategy it is more convenient. You only need to override IsChunkAtWordBoundary():
protected bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk) {
    float dist = chunk.DistanceFromEndOf(previousChunk);
    if(dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth / 2.0f)
        return true;
    return false;
}
You'll have to experiment a bit to get good results for your PDFs. "spacewidth / 2" is apparently too large in your case. But if you adjust it to be too small, you'll get false positives: whitespace will be inserted within words.
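To make that experimenting easier, you could put the threshold in a constructor parameter. This is only a sketch: the class name TunableLocationTextExtractionStrategy and the spaceFactor parameter are my own invention, and it assumes your iTextSharp version declares IsChunkAtWordBoundary() as virtual and exposes CharSpaceWidth on TextChunk, as in the snippet above.

```csharp
using iTextSharp.text.pdf.parser;

// Hypothetical subclass: spaceFactor is an illustrative knob, not an iTextSharp setting.
public class TunableLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
    private readonly float spaceFactor;

    // spaceFactor is the fraction of a space width that counts as a word gap;
    // 0.5f reproduces the "spacewidth / 2" rule shown above.
    public TunableLocationTextExtractionStrategy(float spaceFactor)
    {
        this.spaceFactor = spaceFactor;
    }

    // Assumes the base method is virtual in the iTextSharp version in use.
    protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk)
    {
        float dist = chunk.DistanceFromEndOf(previousChunk);
        // Same test as the original, with the fixed 1/2 replaced by spaceFactor.
        return dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth * spaceFactor;
    }
}
```

You would then pass it in place of the original strategy, for example PdfTextExtractor.GetTextFromPage(pdfReader, 1, new TunableLocationTextExtractionStrategy(0.25f)), and try a few values against the pages that currently lose their spaces.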