 

Strange whitespaces when parsing a PDF

I need to parse a PDF document. I have already implemented the parser using the iText library, and until now it worked without any problems.

But now I need to parse another document, and the extracted text has very strange whitespace in the middle of words. For example, I get:

Vo rber eitung auf die Motorr adsaison. Viele Motorr adf ahr er

The fragments should be joined into whole words (Vorbereitung, Motorradsaison, Motorradfahrer), but somehow the PDF parser is adding whitespace inside the words. When I copy and paste the content from the PDF into a text file, I don't get these spaces.

At first I thought it was caused by the PDF parsing library I'm using, but I get exactly the same issue with another library.

I looked at the singleSpaceWidth of the parsed chunks and noticed that it changes exactly where a whitespace gets added. I tried to join the pieces manually, but since there is no real pattern for recombining the words, that is almost impossible.

Has anyone else had a similar issue, or even found a solution to this problem?

As requested, here is some more information:

  • iText Version 5.2.1
  • http://prine.ch/whitespacesProblem.pdf (Link to the pdf)

Parsing with SemTextExtractionStrategy:

PdfReader reader = new PdfReader("data/SpecialTests/SuedostSchweiz/" + src);

SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();

for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Set the page number on the strategy; it is used by the parsing strategies.
    semTextExtractionStrategy.pageNumber = i;

    // Parse text from page
    PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy);
}

Here is the SemTextExtractionStrategy method that actually parses the text. In it I manually add a whitespace after every parsed word, but somehow the words are already split during detection:

@Override
public void parseText(TextRenderInfo renderInfo, int pageNumber) {      

    this.pageNumber = pageNumber;

    String text = renderInfo.getText();

    currTextBlock.getText().append(text + " ");

    ....
}

Here is the whole SemTextExtractionStrategy class, but all it does is call the method shown above (parseText):

public class SemTextExtractionStrategy implements TextExtractionStrategy {

    // Text Extraction Strategies
    public ColumnDetecter columnDetecter = new ColumnDetecter();

    // Image Extraction Strategies
    public ImageRetriever imageRetriever = new ImageRetriever();

    public int pageNumber = -1;

    public ArrayList<TextParsingStrategy> textParsingStrategies = new ArrayList<TextParsingStrategy>();
    public ArrayList<ImageParsingStrategy> imageParsingStrategies = new ArrayList<ImageParsingStrategy>();

    public SemTextExtractionStrategy() {

        // Add all text parsing strategies which are later on applied on the extracted text
        // textParsingStrategies.add(fontSizeMatcher);
        textParsingStrategies.add(columnDetecter);

        // Add all image parsing strategies which are later on applied on the extracted text
        imageParsingStrategies.add(imageRetriever);
    }

    @Override
    public void beginTextBlock() {

    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        // TEXT PARSING
        for(TextParsingStrategy strategy : textParsingStrategies) {
            strategy.parseText(renderInfo, pageNumber);
        }
    }

    @Override
    public void endTextBlock() {

    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
        for(ImageParsingStrategy strategy : imageParsingStrategies) {
            strategy.parseImage(renderInfo);
        }
    }
}
Prine asked Aug 10 '12 12:08



2 Answers

Spurious whitespace in text extracted from PDFs is a known issue, as described in Roland's answer here and in the first comment on https://issues.apache.org/jira/browse/TIKA-724

The fix that also worked for me is the one posted by huuhungus at https://github.com/smalot/pdfparser/issues/72

It is specific to PDFParser: if you know you will run into this problem, change the library code that adds the extra space.

In src/Smalot/PdfParser/Object.php, comment out this line:

   $text .= ' ';

This does not fix the problem completely, but the result is acceptable.

Other libraries may allow similar temporary fixes, which could help with this issue in some cases.
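For a Java listener like the one in the question, the same idea (do not append a space unconditionally) might be sketched as follows. This is only a minimal sketch under assumptions: Chunk is a hypothetical stand-in for whatever the parser reports, the coordinates in main are invented, and the "half a space width" threshold is a heuristic chosen for illustration. With iText 5 the corresponding values would presumably come from renderInfo.getBaseline() and renderInfo.getSingleSpaceWidth().

```java
import java.util.ArrayList;
import java.util.List;

public class GapBasedJoiner {

    /** Hypothetical stand-in for one text chunk reported by a PDF parser. */
    static final class Chunk {
        final String text;
        final float startX;     // x coordinate where the chunk begins
        final float endX;       // x coordinate where the chunk ends
        final float spaceWidth; // width of a space character in the chunk's font

        Chunk(String text, float startX, float endX, float spaceWidth) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.spaceWidth = spaceWidth;
        }
    }

    /** Joins chunks of a single line, adding a space only across a wide gap. */
    static String join(List<Chunk> chunks) {
        StringBuilder out = new StringBuilder();
        Chunk previous = null;
        for (Chunk current : chunks) {
            if (previous != null) {
                float gap = current.startX - previous.endX;
                // Assumed heuristic: more than half a space width means a word break.
                if (gap > current.spaceWidth / 2f) {
                    out.append(' ');
                }
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Invented coordinates, for illustration only.
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("Motorr", 0f, 30f, 3f));
        chunks.add(new Chunk("adf", 29.7f, 45f, 3f)); // slight overlap, as with kerning
        chunks.add(new Chunk("ahr", 44.7f, 60f, 3f));
        chunks.add(new Chunk("er", 59.7f, 70f, 3f));
        chunks.add(new Chunk("fr", 73f, 80f, 3f));    // gap of a full space width
        System.out.println(join(chunks));
    }
}
```

With these invented values the program prints "Motorradfahrer fr". The sketch only handles chunks on one line; a real strategy would also have to detect line changes.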

user3134164 answered Sep 30 '22 20:09



I have processed the given PDF file with the following Ghostscript command:

gs -o out.pdf -q -sDEVICE=pdfwrite -dOptimize=false -dUseFlateCompression=false -dCompressPages=false -dCompressFonts=false whitespacesProblem.pdf

This command created a file out.pdf whose streams are not compressed, so it is easier to read. The interesting part is in line 52, which I split into multiple lines for readability:

[
  (&;&)-287.988
  (672744)29.9906
  (+\(%)30.01
  (+!4)29.9876
  (&4)-287.989
  (%4)30.0039
  (&1&8)-287.975
  (3=\)!)-288.021
  (*&4)30.0212
  (&=23)-287.996
  (+1%)-287.99
  (\(=&)-288.011
  (8&1&)-287.974
  (672744)29.9906
  (+\(3+=378$)-250.977
  (#7\)!)
]TJ

Between the parentheses are the text characters. I changed some of them and watched the rendered PDF file to see which character represents which glyph. Then I decoded the text:

[
  (ele)-287.988
  (Motorr)29.9906 ***
  (adf)30.01 ***
  (ahr)29.9876 ***
  (er)-287.989
  (fr)30.0039
  (euen)-287.975
  (sich)-288.021
  ...
]

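So there are indeed position adjustments between the chunks of text. In a TJ array each number is subtracted from the current horizontal position (in thousandths of a text space unit), so the values around -288 open up the gap between words, while the values around 30 (some of them marked *** above) pull the next chunk slightly closer. In your case the latter is probably the kerning of the font. The question is now how your PDF library interprets these adjustments, and it seems to me that even this "negative whitespace" is turned into a space in the resulting string.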
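To illustrate how a library could tell the two kinds of adjustment apart, here is a minimal sketch that rebuilds the text from the decoded array above. The Element class and the threshold of -200 are my own assumptions for this example; they are not taken from iText or any other library, and a robust cut-off would depend on the font and font size.

```java
import java.util.ArrayList;
import java.util.List;

public class TjArrayDecoder {

    /** One string of a TJ array plus the numeric adjustment that follows it. */
    static final class Element {
        final String text;
        final double adjustment; // thousandths of a text space unit

        Element(String text, double adjustment) {
            this.text = text;
            this.adjustment = adjustment;
        }
    }

    // Assumed cut-off: adjustments at or below this value count as word gaps.
    static final double WORD_GAP_THRESHOLD = -200.0;

    static String decode(List<Element> elements) {
        StringBuilder out = new StringBuilder();
        for (Element e : elements) {
            out.append(e.text);
            // A strongly negative number widens the gap; small positive
            // numbers only tighten it (kerning) and must not become a space.
            if (e.adjustment <= WORD_GAP_THRESHOLD) {
                out.append(' ');
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Values copied from the decoded TJ array above.
        List<Element> elements = new ArrayList<>();
        elements.add(new Element("ele", -287.988));
        elements.add(new Element("Motorr", 29.9906));
        elements.add(new Element("adf", 30.01));
        elements.add(new Element("ahr", 29.9876));
        elements.add(new Element("er", -287.989));
        elements.add(new Element("fr", 30.0039));
        elements.add(new Element("euen", -287.975));
        elements.add(new Element("sich", -288.021));
        System.out.println(decode(elements));
    }
}
```

This prints "ele Motorradfahrer freuen sich", which is the text you get when copying from the PDF viewer.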

Roland Illig answered Sep 30 '22 20:09
