My Android app has OCR functionality using the tess-two library. I have an issue reading a String that contains "fi". After calling baseApi.getUTF8Text(), the method that returns the text recognized by the OCR, the "fi" in the returned String comes back as "ﬁ". Look very closely at that string: it is not two characters but a single character. You can reproduce this by copying and pasting it. I suspect it is a UTF-8 encoding issue or something similar, which I don't know enough about. When I tried string.replace("ﬁ", "fi"), the Android Studio build failed with the error "unmappable character for encoding utf-8". I tried searching on Google, but it treats "ﬁ" as a regular "fi".
Is there any way I can fix this character?
You can avoid recognizing the ﬁ ligature by blacklisting it before calling baseApi.setImage:
baseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, "\uFB01"); // U+FB01 is the single-character fi ligature
To prevent Android Studio from throwing the "unmappable character" error on your Java code, convert the file's encoding to UTF-8 by choosing "UTF-8" from the encoding selector near the bottom-right corner of the Android Studio window.
Here's what I found, FWIW: the character 'ﬁ' is a ligature (more at: Unicode Character 'LATIN SMALL LIGATURE FI' (U+FB01)).
Here's a quick and dirty program to find 'ﬁ' and replace it with any other characters:
public class LigatureFI
{
    // U+FB01, LATIN SMALL LIGATURE FI
    static char ligature_fi = 0xFB01;

    public static void main(String[] args)
    {
        String sligature_fi = Character.toString(ligature_fi);
        // Unicode escapes keep this source file pure ASCII, whatever its encoding
        String string = "\uFB01\uFB01\uFB01";
        System.out.println(string);
        // replace() matches the text literally; no regex is needed here
        string = string.replace(sligature_fi, "FI");
        System.out.println(string);
    }
}
If your IDE complains about 'ﬁ' not being in the cp1252 charset, save the file as UTF-8.
HTH.
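If other ligatures (such as the fl ligature, U+FB02) might also show up in the OCR output, one alternative to replacing them one at a time is Unicode compatibility normalization. The following is only a minimal sketch, assuming the standard java.text.Normalizer class; the class name LigatureNormalizer and the method expand are made up for illustration. Note that NFKC also rewrites other compatibility characters (full-width letters, superscripts and so on), so it may change more than you want.

```java
import java.text.Normalizer;

public class LigatureNormalizer {
    // Hypothetical helper: expands compatibility characters, including
    // the fi ligature (U+FB01), into their plain equivalents via NFKC.
    public static String expand(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Escapes are used so the source file stays pure ASCII.
        String ocrText = "\uFB01nd the \uFB02oor";
        System.out.println(expand(ocrText));
    }
}
```

You would call expand on the String returned by baseApi.getUTF8Text() before doing anything else with it.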