Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Writing Arabic with PDFBOX with correct characters presentation form without being separated

I'm trying to generate a PDF that contains Arabic text using PDFBox Apache but the text is generated as separated characters because Apache parses given Arabic string to a sequence of general 'official' Unicode characters that is equivalent to the isolated form of Arabic characters.

Here is an example:
Target text to Write in PDF "Should be expected output in PDF File" -> جملة بالعربي
What I get in PDF File ->

incorrect text

I tried some methods but it's no use here are some of them:
1. Converting String to Stream of bits and trying to extract right values
2. Treating String a sequence of bytes with UTF-8 && UTF-16 and extracting values from them

There is some approach seems very promising to get the value "Unicode" of each character But it generate general "official Unicode" Here is what I mean

System.out.println( Integer.toHexString( (int)(new String("كلمة").charAt(1))) );  

output is 644 but fee0 was the expected output because this character is in middle from then I should get the middle Unicode fee0

so what I want is some method that generates the correct Unicode not the just the official one

The very Left column in the first table in the following link represents the general Unicode
Arabic Unicode Tables Wikipedia

like image 334
Mina Gerges Avatar asked Jan 16 '18 15:01

Mina Gerges


2 Answers

Here is a code that works. Download a sample font, e.g. trado.ttf

Make sure the pdfbox-app and icu4j jar files are in your classpath.

import java.io.File;
import java.io.IOException;

import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import com.ibm.icu.text.Bidi;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.*;

public class Main {
    public static void main(String[] args) throws IOException , ArabicShapingException
    {
    File f = new File("trado.ttf");
        PDDocument doc = new PDDocument();
        PDPage Page = new PDPage();
        doc.addPage(Page);
        PDPageContentStream Writer = new PDPageContentStream(doc, Page);
        Writer.beginText();
        Writer.setFont(PDType0Font.load(doc, f), 20);
        Writer.newLineAtOffset(0, 700);
        String s ="جملة بالعربي لتجربة الكلاس اللذي يساعد علي وصل الحروف بشكل صحيح";
        Writer.showText(bidiReorder(s));
        Writer.endText();
        Writer.close();
        doc.save(new File("File_Test.pdf"));
        doc.close();
    }

    private static String bidiReorder(String text)
    {
        try {
        Bidi bidi = new Bidi((new ArabicShaping(ArabicShaping.LETTERS_SHAPE)).shape(text), 127);
            bidi.setReorderingMode(0);
            return bidi.writeReordered(2);
        }
        catch (ArabicShapingException ase3) {
        return text;
    }
    }

}
like image 67
h q Avatar answered Oct 25 '22 23:10

h q


Notice:

The sample code in this answer might be outdated please refer to h q's answer for the working sample code


At First I will thank Tilman Hausherr and M.Prokhorov for showing me the library that made writing Arabic possible using PDFBox Apache.

This Answer will be divided into two Sections:
  1. Downloading the library and installing it
  2. How to use the library

Downloading the library and installing it

We are going to use ICU Library.
ICU stands for International Components for Unicode and it is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.

To download the Library go to the downloads page from here.
Choose the latest version of ICU4J as shown in the following image.
Downloads Page
You will be transferred to another page and you will find a box with direct links of the needed components .Go ahead and download three Files you will find the highlighted in next image.

  1. icu4j-docs.jar
  2. icu4j-src.jar
  3. icu4j.jar

Files

The following explanation for creating and adding a library in Netbeans IDE

  1. Navigate to the Toolbar and Click tools
  2. Choose Libraries
  3. At the bottom left you will find new Library button Create yours
  4. Navigate to the library that you created in libraries list
  5. Click it and add jar folders like that
  6. Add icu4j.jar in class path
  7. Add icu4j-src.jar in Sources
  8. Add icu4j-docs.jar in Javadoc
  9. View your opened projects from the very right
  10. Expand the project that you want to use the library in
  11. Right Click on the libraries folder and choose add library
  12. Finally choose the library that you had just created.

Now you are ready to use the library just import what you want like that

import com.ibm.icu.What_You_Want_To_Import;


How to use the library

With ArabicShaping Class and reversing the String we can write a correct attached Arabic LINE
Here is the Code Notice the comments in the following code

import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.*;

public class Main {
    public static void main(String[] args) throws IOException , ArabicShapingException
{
        File f = new File("Arabic Font File of format.ttf");
        PDDocument doc = new PDDocument();
        PDPage Page = new PDPage();
        doc.addPage(Page);
        PDPageContentStream Writer = new PDPageContentStream(doc, Page);
        Writer.beginText();
        Writer.setFont(PDType0Font.load(doc, f), 20);
        Writer.newLineAtOffset(0, 700);
        //The Trick in the next Line of Code But Here is some few Notes first  
        //We have to reverse the string because PDFBox is Writting from the left but Arabic is RTL Language  
        //The output will be perfect except every line will be justified to the left "It's not hard to resolve this"
        // So we have to write arabic string to pdf line by line..It will be like this
        String s ="جملة بالعربي لتجربة الكلاس اللذي يساعد علي وصل الحروف بشكل صحيح";
        Writer.showText(new StringBuilder(new ArabicShaping(reverseNumbersInString(ArabicShaping.LETTERS_SHAPE).shape(s))).reverse().toString());
        // Note the previous line of code throws ArabicShapingExcpetion 
        Writer.endText();
        Writer.close();
        doc.save(new File("File_Test.pdf"));
        doc.close();
    }
}

Here is the output

Output

I hope that I had gone over everything.

Update : After reversing make sure to reverse the numbers again in order to get the same proper number
Here is a couple of functions that could help

public static boolean isInt(String Input)
{
    try{Integer.parseInt(Input);return true;}
    catch(NumberFormatException e){return false;}
}
public static String reverseNumbersInString(String Input)
{
    char[] Separated = Input.toCharArray();int i = 0;
    String Result = "",Hold = "";
    for(;i<Separated.length;i++ )
    {
        if(isInt(Separated[i]+"") == true)
        {
            while(i < Separated.length && (isInt(Separated[i]+"") == true ||  Separated[i] == '.' ||  Separated[i] == '-'))
            {
                Hold += Separated[i];
                i++;
            }
            Result+=reverse(Hold);
            Hold="";
        }
        else{Result+=Separated[i];}
    }
    return Result;
}
like image 21
Mina Gerges Avatar answered Oct 25 '22 23:10

Mina Gerges