
How to parse a PDF file that contains UTF-8 characters with Java or C#

Tags: java, c#, parsing, pdf

I have a PDF file that contains UTF-8 characters (İ, ğ, ı, Arabic letters, etc.). How can I parse this file?
I have tried iText and PDFBox, but I get "çekti¤i k夛da" instead of "çektiği kağıda". How can I resolve this?

katsu asked Oct 20 '12 18:10


2 Answers

As no sample has been provided yet, I created Arabic test data myself (actually, I borrowed the code for creating the test data from some posts on the itext-questions mailing list) and wrote a test which parses that data:

package itext.parsing;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Font;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.Phrase;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfPCell;
import com.itextpdf.text.pdf.PdfPTable;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

import junit.framework.TestCase;

public class TextExtractingArabic extends TestCase
{
    public void testExtractArabicChars() throws DocumentException, IOException
    {
        createTestFile(TEST_FILE);

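        PdfReader reader = new PdfReader(TEST_FILE.toString());
        String text = PdfTextExtractor.getTextFromPage(reader, 1);
        reader.close();
        // Print every extracted char as a hex escape so the result does not depend on the console encoding.
        for (char c: text.toCharArray())
        {
            // A Java char is unsigned (0..65535), so no sign correction is needed.
            System.out.print("\\u");
            System.out.print(Integer.toHexString(c));
        }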
    }

    void createTestFile(File file) throws DocumentException, IOException
    {
        Document document = new Document();
        OutputStream os = new FileOutputStream(file);
        PdfWriter.getInstance(document, os);
        document.open();

        BaseFont bfArialUni = BaseFont.createFont("C:\\Windows\\Fonts" + "\\ARIALUNI.TTF",
                                            BaseFont.IDENTITY_H, BaseFont.EMBEDDED);            
        Font fontArialUni = new Font(bfArialUni, 12f);
        Phrase myPhrase = new Phrase(LAWRENCE_OF_ARABIA, fontArialUni);

        PdfPTable table = new PdfPTable(1);
        PdfPCell cell = new PdfPCell(new Paragraph(myPhrase));
        cell.setColspan(3);
        cell.setPaddingRight(15f);
        cell.setBorder(PdfPCell.NO_BORDER);
        cell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
        table.addCell(cell);

        document.add(table);
        document.close();
        os.close();
    }

    final static File TEST_FILE = new File("arabic-test.pdf");
    final static String LAWRENCE_OF_ARABIA =
        "\u0644\u0648\u0631\u0627\u0646\u0633\u0627\u0644\u0639\u0631\u0628";
}

The string LAWRENCE_OF_ARABIA is a rough phonetic approximation of "Lawrence of Arabia".

The test prints:

\ufe8f\ufeae\ufecc\ufedf\ufe8e\ufeb4\ufee7\ufe8d\ufead\ufeee\ufedf

While this is not identical to the input, a quick look at the Unicode tables reveals that the input is from the Unicode block "Arabic" while the output is from the block "Arabic Presentation Forms-B". Additionally, the output is in left-to-right order while the input is right-to-left.
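
The correspondence between the two can be checked by mapping the extracted characters back. This is only a minimal sketch, assuming the extracted text is a single, purely right-to-left run in visual order, as in the test above; the class and method names are hypothetical. It reverses the string to restore logical order and then applies the JDK's NFKC normalization, which folds presentation forms into their base letters. Text with mixed directions would need proper bidi handling instead of a plain reversal.

```java
import java.text.Normalizer;

class PresentationFormsDemo
{
    // Hypothetical helper: assumes the input is one purely right-to-left run in visual order.
    static String toLogicalOrder(String visual)
    {
        // Restore logical (reading) order first, so that ligatures decompose in the right order.
        String reversed = new StringBuilder(visual).reverse().toString();
        // NFKC replaces presentation forms by the base letters of the "Arabic" block.
        return Normalizer.normalize(reversed, Normalizer.Form.NFKC);
    }

    public static void main(String[] args)
    {
        // The extracted text and the original input from the test above.
        String extracted = "\ufe8f\ufeae\ufecc\ufedf\ufe8e\ufeb4\ufee7\ufe8d\ufead\ufeee\ufedf";
        String input = "\u0644\u0648\u0631\u0627\u0646\u0633\u0627\u0644\u0639\u0631\u0628";
        System.out.println(toLogicalOrder(extracted).equals(input));
    }
}
```

For the output shown above, this mapping gives back the LAWRENCE_OF_ARABIA input, which suggests the extraction itself preserved the text.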

I don't know Arabic myself and thus cannot say how accurate the output is, but the parsed characters are definitely from an appropriate Unicode block.

Therefore, as far as can be told without access to the PDF the original poster is working with, the problem does not seem to be the parsing itself but the way the parsers' output is used afterwards.
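
One way the output can get damaged after extraction is by passing it through a console or file that uses the platform default encoding. The following is a minimal sketch, assuming that this is the cause (which cannot be confirmed without the asker's file): it writes the extracted string with an explicit UTF-8 charset and reads it back the same way. The class name, method names, and file name are hypothetical; the sample string is the expected phrase from the question, written with Unicode escapes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class Utf8OutputDemo
{
    // Hypothetical helper: stores text as UTF-8 regardless of the platform default charset.
    static void writeUtf8(Path target, String text) throws IOException
    {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    // Reads the file back, again with an explicit charset.
    static String readUtf8(Path source) throws IOException
    {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException
    {
        Path target = Paths.get("extracted.txt"); // hypothetical file name
        String text = "\u00e7ekti\u011fi ka\u011f\u0131da"; // expected phrase from the question
        writeUtf8(target, text);
        // True if the text survived the round trip unchanged.
        System.out.println(readUtf8(target).equals(text));
    }
}
```

The resulting file then has to be opened as UTF-8 in the editor as well, otherwise the same kind of garbage can reappear at display time.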

mkl answered Oct 25 '22 07:10


As Bobrovsky mentions, the text may look fine while the underlying encoding is not entirely correct. A glyph that looks like an X in the PDF viewer is not necessarily encoded internally as the character X. You can easily test this by copying the text from Adobe Reader and pasting it into a text editor that supports the character set. If it pastes correctly, extraction is possible; otherwise it is not (short of manual measures such as a customized mapping).
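
The copy-paste test can be roughly approximated in code once the text has been extracted. The following is a heuristic sketch, not a reliable detector: it assumes that a missing or broken glyph-to-character mapping tends to show up as U+FFFD replacement characters, private-use code points, or unassigned code points in the extracted string. The class and method names are hypothetical.

```java
class ExtractionSanityCheck
{
    // Counts code points that rarely occur in genuine text.
    static int countSuspicious(String text)
    {
        int suspicious = 0;
        for (int i = 0; i < text.length(); )
        {
            int cp = text.codePointAt(i);
            int type = Character.getType(cp);
            if (cp == 0xFFFD || type == Character.PRIVATE_USE || type == Character.UNASSIGNED)
            {
                suspicious++;
            }
            // Advance by one or two chars, depending on whether cp is a supplementary code point.
            i += Character.charCount(cp);
        }
        return suspicious;
    }

    public static void main(String[] args)
    {
        // One replacement character and one private-use code point.
        System.out.println(countSuspicious("a\ufffdb\ue000"));
    }
}
```

A count of zero does not prove the text is right: characters that are valid but wrong, like those in the question, pass this check unnoticed, so the manual copy-paste comparison remains the more telling test.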

Frank answered Oct 25 '22 08:10