I have a PDF file that contains UTF-8 characters (İ, ğ, ı, Arabic letters, etc.). How can I parse this file?
I used iText and PDFBox, but I see "çekti¤i k夛da" instead of "çektiği kağıda". How can I resolve this?
As no sample has yet been provided, I created Arabic test data myself (well, actually I borrowed the code for creating the test data from some posts on the itext-questions mailing list) and a test which parses those data:
package itext.parsing;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Font;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.Phrase;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfPCell;
import com.itextpdf.text.pdf.PdfPTable;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

import junit.framework.TestCase;

public class TextExtractingArabic extends TestCase
{
    public void testExtractArabicChars() throws DocumentException, IOException
    {
        createTestFile(TEST_FILE);
        PdfReader reader = new PdfReader(TEST_FILE.toString());
        String text = PdfTextExtractor.getTextFromPage(reader, 1);
        for (char c : text.toCharArray())
        {
            // char is unsigned in Java, so the value can be used directly
            System.out.print("\\u");
            System.out.print(Integer.toHexString(c));
        }
    }

    void createTestFile(File file) throws DocumentException, IOException
    {
        Document document = new Document();
        OutputStream os = new FileOutputStream(file);
        PdfWriter.getInstance(document, os);
        document.open();

        BaseFont bfArialUni = BaseFont.createFont("C:\\Windows\\Fonts\\ARIALUNI.TTF",
                BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
        Font fontArialUni = new Font(bfArialUni, 12f);
        Phrase myPhrase = new Phrase(LAWRENCE_OF_ARABIA, fontArialUni);

        PdfPTable table = new PdfPTable(1);
        PdfPCell cell = new PdfPCell(new Paragraph(myPhrase));
        cell.setColspan(3);
        cell.setPaddingRight(15f);
        cell.setBorder(PdfPCell.NO_BORDER);
        cell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
        table.addCell(cell);
        document.add(table);

        document.close();
        os.close();
    }

    final static File TEST_FILE = new File("arabic-test.pdf");
    final static String LAWRENCE_OF_ARABIA =
            "\u0644\u0648\u0631\u0627\u0646\u0633\u0627\u0644\u0639\u0631\u0628";
}
The String LAWRENCE_OF_ARABIA phonetically somewhat approximates "Lawrence of Arabia".
The output of the text is:
\ufe8f\ufeae\ufecc\ufedf\ufe8e\ufeb4\ufee7\ufe8d\ufead\ufeee\ufedf
While this is not identical to the input, a quick look into the unicode tables reveals that the input is from the Unicode Range "Arabic" and the output is from the Unicode Range "Arabic Presentation Forms-B". Additionally the output is left-to-right while the input is right-to-left.
I don't know Arabic myself and thus cannot say how accurate the output is, but the parsed characters are definitely from an appropriate Unicode range.
As far as can be told without access to the PDF the original poster works with, therefore, the problem does not seem to be the parsing itself but rather the proper handling of the parsers' output.
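If a consumer needs the extracted text back in the original logical, right-to-left form, the two observations above can be addressed with the JDK alone. This is my own sketch, not part of the original answer: NFKC compatibility normalization maps the "Arabic Presentation Forms-B" characters back to the base "Arabic" range, and reversing the string restores logical order (the naive reversal is only safe here because the sample contains no combining marks):

```java
import java.text.Normalizer;

public class LogicalOrderDemo {
    public static void main(String[] args) {
        // Text as extracted: Arabic Presentation Forms-B, in left-to-right visual order
        String extracted = "\ufe8f\ufeae\ufecc\ufedf\ufe8e\ufeb4\ufee7\ufe8d\ufead\ufeee\ufedf";
        // NFKC maps each presentation form to its base letter in the "Arabic" range
        String normalized = Normalizer.normalize(extracted, Normalizer.Form.NFKC);
        // Reverse to turn visual (LTR) order back into logical (RTL) order
        String logical = new StringBuilder(normalized).reverse().toString();
        // Compare against the original LAWRENCE_OF_ARABIA input string
        System.out.println(logical.equals(
                "\u0644\u0648\u0631\u0627\u0646\u0633\u0627\u0644\u0639\u0631\u0628"));
    }
}
```

On this sample the round trip reproduces the original input string exactly; for real-world text with combining marks or mixed-direction runs, a proper bidi-aware reordering would be needed instead of a plain reversal.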
As Bobrovsky mentions, the text may look good while the underlying encoding is not entirely correct. A glyph that looks like an X in the PDF viewer may not be encoded internally as the character X. You can easily test this by copy-pasting the text from Adobe Reader into a text editor that supports the character set. If it copy-pastes correctly, then extraction is possible; otherwise it is not (without taking manual measures such as a customized mapping).
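A programmatic variant of that copy-paste test (again my own sketch, not from the original answer) is to inspect which Unicode block each extracted character falls into. Presentation-form or private-use characters are a hint that the font's ToUnicode mapping is missing or non-standard and the text won't round-trip cleanly:

```java
public class ExtractionSanityCheck {
    public static void main(String[] args) {
        // Sample extracted text from the test above (Arabic Presentation Forms-B)
        String extracted = "\ufe8f\ufeae\ufecc\ufedf\ufe8e\ufeb4\ufee7\ufe8d\ufead\ufeee\ufedf";
        for (char c : extracted.toCharArray()) {
            // PRIVATE_USE_AREA or *_PRESENTATION_FORMS_* blocks suggest the
            // extracted codes are glyph-level rather than proper characters
            System.out.printf("\\u%04x  %s%n", (int) c, Character.UnicodeBlock.of(c));
        }
    }
}
```

For the sample above, every character reports ARABIC_PRESENTATION_FORMS_B, confirming the range observation made earlier.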