I need to parse a PDF document. I have already implemented the parser using the iText library, and until now it worked without any problems.
But now I need to parse another document which gets very strange whitespace in the middle of words. For example, I get:
Vo rber eitung auf die Motorr adsaison. Viele Motorr adf ahr er
The split words (for example "Motorr adf ahr er") should each be a single word, but somehow the PDF parser inserts whitespace into them. When I copy and paste the content from the PDF into a text file, I don't get these spaces.
At first I thought it was caused by the PDF parsing library I'm using, but I get exactly the same issue with another library.
I had a look at the singleSpaceWidth of the parsed words and noticed that it changes whenever a whitespace is added. I tried to join the pieces manually, but since there isn't really a pattern for recombining the words, that is almost impossible.
Has anyone else had a similar issue, or even found a solution to this problem?
As requested, here is some more information:
Parsing with SemTextExtractionStrategy:
PdfReader reader = new PdfReader("data/SpecialTests/SuedostSchweiz/" + src);
SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Set the page number on the strategy; it is used by the parsing strategies.
    semTextExtractionStrategy.pageNumber = i;
    // Parse the text from the page
    PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy);
}
Here is the SemTextExtractionStrategy method which actually parses the text. There I manually add a whitespace after every parsed word, but somehow the words are already split when they are detected:
@Override
public void parseText(TextRenderInfo renderInfo, int pageNumber) {
    this.pageNumber = pageNumber;
    String text = renderInfo.getText();
    // A space is appended after every text chunk, whether or not the chunk ends a word
    currTextBlock.getText().append(text + " ");
    ....
}
Here is the whole SemTextExtractionStrategy class, but it only calls the method from above (parseText):
public class SemTextExtractionStrategy implements TextExtractionStrategy {
    // Text Extraction Strategies
    public ColumnDetecter columnDetecter = new ColumnDetecter();
    // Image Extraction Strategies
    public ImageRetriever imageRetriever = new ImageRetriever();
    public int pageNumber = -1;
    public ArrayList<TextParsingStrategy> textParsingStrategies = new ArrayList<TextParsingStrategy>();
    public ArrayList<ImageParsingStrategy> imageParsingStrategies = new ArrayList<ImageParsingStrategy>();

    public SemTextExtractionStrategy() {
        // Add all text parsing strategies which are later on applied on the extracted text
        // textParsingStrategies.add(fontSizeMatcher);
        textParsingStrategies.add(columnDetecter);
        // Add all image parsing strategies which are later on applied on the extracted images
        imageParsingStrategies.add(imageRetriever);
    }

    @Override
    public void beginTextBlock() {
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        // TEXT PARSING
        for (TextParsingStrategy strategy : textParsingStrategies) {
            strategy.parseText(renderInfo, pageNumber);
        }
    }

    @Override
    public void endTextBlock() {
    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
        for (ImageParsingStrategy strategy : imageParsingStrategies) {
            strategy.parseImage(renderInfo);
        }
    }
}
Spurious whitespace in text extracted from PDFs is a known issue, as described in Roland's answer here and in the first comment of https://issues.apache.org/jira/browse/TIKA-724
The fix that also worked for me is the one posted by huuhungus at https://github.com/smalot/pdfparser/issues/72
It is specific to PDFParser: if you know you will have this problem, change the code that actually adds the extra space.
In src/Smalot/PdfParser/Object.php, comment out this line:
$text .= ' ';
This does not fix it completely, but the result is acceptable.
Other libraries may have similar temporary workarounds, so this approach could help in some cases.
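For iText there is no single line to comment out, but a comparable idea is to stop appending a space after every chunk and to add one only when the horizontal gap to the previous chunk is large enough. The sketch below is plain Java with no iText dependency. The names `GapJoiner` and `Chunk`, the coordinates, and the "half a space width" threshold are all assumptions made for illustration; in a real strategy the start and end positions would have to be taken from each chunk's baseline, and the threshold would need tuning for the document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of iText: joins text chunks on one line
// and inserts a space only where the gap between chunks is wide enough.
final class GapJoiner {

    // Hypothetical chunk type: the text plus its start/end x position on the baseline.
    static final class Chunk {
        final String text;
        final double startX;
        final double endX;

        Chunk(String text, double startX, double endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    // spaceWidth is the width of a space in the same units as the coordinates.
    static String join(List<Chunk> chunks, double spaceWidth) {
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : chunks) {
            // Assumed rule: a gap wider than half a space counts as a word boundary.
            // Small or negative gaps (kerning) do not produce a space.
            if (prev != null && c.startX - prev.endX > spaceWidth / 2.0) {
                sb.append(' ');
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> line = new ArrayList<Chunk>();
        line.add(new Chunk("Motorr", 0.0, 30.0));
        line.add(new Chunk("adf", 29.7, 45.0));   // slightly overlaps the previous chunk
        line.add(new Chunk("ahr", 44.7, 60.0));
        line.add(new Chunk("er", 59.7, 70.0));
        line.add(new Chunk("freuen", 72.9, 100.0)); // real word gap
        System.out.println(join(line, 2.5));
    }
}
```

With the made-up coordinates in `main`, the kerned pieces are joined and only the wide gap before the last chunk becomes a space.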
I have processed the given PDF file with the following Ghostscript command:
gs -o out.pdf -q -sDEVICE=pdfwrite -dOptimize=false -dUseFlateCompression=false -dCompressPages=false -dCompressFonts=false whitespacesProblem.pdf
This command created a file out.pdf, which does not have the stream encodings, so it is more readable. The interesting part is in line 52 of out.pdf, which I split into multiple lines for readability:
[
(&;&)-287.988
(672744)29.9906
(+\(%)30.01
(+!4)29.9876
(&4)-287.989
(%4)30.0039
(&1&8)-287.975
(3=\)!)-288.021
(*&4)30.0212
(&=23)-287.996
(+1%)-287.99
(\(=&)-288.011
(8&1&)-287.974
(672744)29.9906
(+\(3+=378$)-250.977
(#7\)!)
]TJ
The text characters are between the parentheses. I changed some of them and looked at the rendered PDF file to see which character represents which glyph. Then I decoded the text:
[
(ele)-287.988
(Motorr)29.9906 ***
(adf)30.01 ***
(ahr)29.9876 ***
(er)-287.989
(fr)30.0039
(euen)-287.975
(sich)-288.021
...
]
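So the text is indeed drawn in separate pieces with position adjustments between them. The large negative values (about -288) sit between words and are the real word gaps, while the small positive values (about 30, marked *** above) sit inside words and are probably the kerning of the font. The question is now how your PDF library interprets these adjustments, and it seems to me that even "negative whitespace" is turned into a space in the resulting string.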
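To make the distinction concrete, here is a minimal sketch in plain Java that joins the decoded strings from the array above and inserts a space only after a strongly negative adjustment. The class name `TjJoiner` and the cutoff of 200 are assumptions chosen to separate the two groups of values in this particular file; this is not how iText or any other library actually decides, and another document may need a different cutoff.

```java
// Hypothetical helper: rebuilds a line of text from the operands of a TJ array.
final class TjJoiner {

    // Assumed cutoff, in the same units as the TJ numbers. Adjustments at or
    // below -WORD_GAP are treated as word gaps; everything else as kerning.
    static final double WORD_GAP = 200.0;

    // adjustments[i] is the number that follows strings[i] in the TJ array;
    // the last string may have no number after it.
    static String join(String[] strings, double[] adjustments) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < strings.length; i++) {
            sb.append(strings[i]);
            boolean hasNext = i + 1 < strings.length;
            boolean hasAdjustment = i < adjustments.length;
            // Only a large negative adjustment between two strings becomes a space.
            if (hasNext && hasAdjustment && adjustments[i] <= -WORD_GAP) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Decoded strings and adjustments copied from the array shown above.
        String[] strings = {"ele", "Motorr", "adf", "ahr", "er", "fr", "euen", "sich"};
        double[] adjustments = {-287.988, 29.9906, 30.01, 29.9876, -287.989, 30.0039, -287.975, -288.021};
        System.out.println(join(strings, adjustments)); // prints "ele Motorradfahrer freuen sich"
    }
}
```

An extractor that instead emits a space for every adjustment, regardless of sign or size, would produce exactly the "Motorr adf ahr er" output from the question.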