how can I detect farsi web pages by tika?

Question

I need a sample code to help me detect farsi language web pages by apache tika toolkit.

 LanguageIdentifier identifier = new LanguageIdentifier("فارسی");
        String language = identifier.getLanguage();

I have download apache.tika jar files and add them to the classpath. but this code gives error for Farsi language but it works for english. how can I add Farsi to languageIdentifier package of tika?

Kai Sternad · Accepted Answer

Tika doesn't ship with a language profile for the Farsi language yet. As of version 1.0 27 languages are supported out of the box:

languages=be,ca,da,de,eo,et,el,en,es,fi,fr,gl,hu,is,it,lt,nl,no,pl,pt,ro,ru,sk,sl,sv,th,uk

In your example the input is misdetected as li(Lithuanian) with a distance of 0.41, which is above the certainty threshold of 0.022. See the source code for more information on the inner works of LanguageIdentifier.

The Farsi language (Persian, ISO 639-1 2-letter code fa) is not recognized by default. If you want Tika to recognize another language, you have to create a language profile first.

For this the following steps are necessary:

Find a text corpus for your language. I found the Hamshahri Collection. This should be sufficient. Download the corpus or parts of it and create a plain text file out of the XML.
Create an ngram file for the language identifier. This can be done using TikaCLI:

java -jar tika-app-1.0.jar --create-profile=fa -eUTF-8 fa-corpus.txt This will a file called fa.ngp which contains the n-grams.
Configure Tika so that it recognizes the new language. Either do this programmatically using LanguageIdentifier.initProfiles() or put a property file with the name tika.language.override.properties into the classpath. Make sure the ngram file is in the classpath as well.

If you now run Tika, it should correctly detect your language.

Update: Detailed the steps necessary to create a language profile.

how can I detect farsi web pages by tika?

Tags:

java

apache

language-detection

farsi

apache-tika

aliakbarian

1 Answers

Kai Sternad

Recent Activity

Donate For Us

how can I detect farsi web pages by tika?

Tags:

java

apache

language-detection

farsi

apache-tika

aliakbarian

1 Answers

Kai Sternad

Related questions

Recent Activity

Donate For Us