Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PDFKitten is highlighting on wrong position

I am using PDFKitten for searching strings within PDF documents with highlighting of the results. FastPDFKit or any other commercial library is no option so i sticked to the most close one for my requirements.

Wrong coordinate

As you can see in the screenshot i searched for the string "in" which is always correctly highlighted except the last one. I got a more complex PDF document where the highlighted box for "in" is nearly 40% wrong.

I read the whole syntax and checked the issues tracker but except line height problems i found nothing regarding the width calculation. For the moment i dont see any pattern where the calculation goes or could be wrong and i hope that maybe someone else had a close problem to mine.

My current expectation is that the coordinates and character width is wrong calculated somewhere in the font classes or RenderingState.m. The project is very complex and maybe someone of you had a similar problem with PDFKitten in the past.

I have used the original sample PDF document from PDFKitten for my screenshot.

like image 717
SMP Avatar asked Oct 06 '22 07:10

SMP


1 Answers

This might be a bug in PDFKitten when calculating the width of characters whose character identifier does not coincide with its unicode character code.

appendPDFString in StringDetector works with two strings when processing some string data:

// Use CID string for font-related computations.
NSString *cidString = [font stringWithPDFString:string];

// Use Unicode string to compare with user input.
NSString *unicodeString = [[font stringWithPDFString:string] lowercaseString];

stringWithPDFString in Font transforms the sequence of character identifiers of its argument into a unicode string.

Thus, in spite of the name of the variable, cidString is not a sequence of character identifiers but instead of unicode chars. Nonetheless its entries are used as argument of didScanCharacter which in Scanner is implemented to forward the position by the character width: It is using the value as parameter of widthOfCharacter in Font to determine the character width, and that method (according to the comment "Width of the given character (CID) scaled to fontsize") expects its argument to be a character identifier.

So, if CID and unicode character code don't coincide, the wrong character widths is determined and the position of any following character cannot be trusted. In the case at hand, the /fi ligature has a CID of 12 which is way different from its Unicode code 0xfb01.

I would propose PDFKitten to be enhanced to also define a didScanCID method in StringDetector which in appendPDFString should be called next to didScanCharacter for each processed character forwarding its CID. Scanner then should make use of this new method instead to calculate the width to forward its cursor.

This should be triple-checked first, though. Maybe some widthOfCharacter implementations (there are different ones for different font types) in spite of the comment expect the argument to be a unicode code after all...

(Sorry if I used the wrong vocabulary here or there, I'm a 'Java guy... :))

like image 184
mkl Avatar answered Oct 17 '22 22:10

mkl