
Java language detection with langdetect - how to load profiles?

I'm trying to use a Java library called langdetect hosted here. It couldn't be easier to use:

Detector detector;
String langDetected = "";
try {
    String path = "C:/Users/myUser/Desktop/jars/langdetect/profiles";
    DetectorFactory.loadProfile(path);
    detector = DetectorFactory.create();
    detector.append(text);
    langDetected = detector.detect();
} 
catch (LangDetectException e) {
    throw e;
}

return langDetected;

The one sticking point is the DetectorFactory.loadProfile method. The library works great when I pass it an absolute file path, but ultimately I think I need to package my code and langdetect's companion profiles directory inside the same JAR file:

myapp.jar/
    META-INF/
    langdetect/
        profiles/
            af
            bn
            en
            ...etc.
    com/
        me/
            myorg/
                LangDetectAdaptor --> is what actually uses the code above

I will make sure that LangDetectAdaptor, which lives inside myapp.jar, is supplied with both the langdetect.jar and jsonic.jar dependencies that langdetect needs at runtime. However, I'm confused about what I need to pass to DetectorFactory.loadProfile for this to work:

  • The langdetect JAR ships with the profiles directory, but you need to initialize it from inside your own JAR. So do I copy the profiles directory into my JAR (as I describe above), or is there a way to keep it inside langdetect.jar but access it from my code?

Thanks in advance for any help here!

Edit: I think the problem here is that langdetect ships with this profiles directory, but then wants you to initialize it from inside your JAR. The API would probably benefit from a small change: treat the profiles as the library's own configuration, and provide methods like DetectorFactory.loadProfiles().except("fr") for cases where you don't want it to initialize French, etc. But this still doesn't solve my problem!

IAmYourFaja asked Aug 17 '12 14:08


3 Answers

I have the same problem. You can load the profiles from the langdetect JAR using JarURLConnection and JarEntry. Note that this example uses Java 7 try-with-resources and IOUtils from Apache Commons IO.

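    import java.io.InputStream;
    import java.net.JarURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.*;
    import java.util.jar.JarEntry;
    import java.util.jar.JarFile;
    import org.apache.commons.io.IOUtils;
    import com.cybozu.labs.langdetect.Detector;
    import com.cybozu.labs.langdetect.DetectorFactory;

    // Locate the "profiles/" directory inside the langdetect JAR on the classpath.
    String dirname = "profiles/";
    Enumeration<URL> en = Detector.class.getClassLoader().getResources(dirname);
    List<String> profiles = new ArrayList<>();
    if (en.hasMoreElements()) {
        URL url = en.nextElement();
        JarURLConnection urlcon = (JarURLConnection) url.openConnection();
        // Ask for an uncached JarFile so that closing it below does not
        // close a copy shared with the class loader.
        urlcon.setUseCaches(false);
        try (JarFile jar = urlcon.getJarFile()) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                // Skip the "profiles/" directory entry itself; only files are profiles.
                if (entry.getName().startsWith(dirname) && !entry.isDirectory()) {
                    try (InputStream in = jar.getInputStream(entry)) {
                        // Each profile is a JSON document; read it as UTF-8.
                        profiles.add(IOUtils.toString(in, StandardCharsets.UTF_8));
                    }
                }
            }
        }
    }

    // Hand the JSON strings to langdetect and run detection.
    DetectorFactory.loadProfile(profiles);
    Detector detector = DetectorFactory.create();
    detector.append(text);
    String langDetected = detector.detect();
    System.out.println(langDetected);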
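
The edit to the question wishes for a way to skip some languages. Here is a minimal sketch of that idea, assuming the JAR entry names have already been collected as in the loop above. It keeps only the files directly under profiles/ and drops the excluded language codes. ProfileFilter and selectProfiles are hypothetical names used for illustration; they are not part of langdetect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ProfileFilter {

    /**
     * Hypothetical helper (not part of langdetect): returns the entry names
     * that are files directly under dirname and whose file name, taken as
     * the language code, is not in the excluded set.
     */
    public static List<String> selectProfiles(List<String> entryNames,
                                              String dirname,
                                              Set<String> excluded) {
        List<String> selected = new ArrayList<>();
        for (String name : entryNames) {
            // Ignore entries outside the profiles directory and directory entries.
            if (!name.startsWith(dirname) || name.endsWith("/")) {
                continue;
            }
            String lang = name.substring(dirname.length());
            // Ignore the directory itself and anything in nested subdirectories.
            if (lang.isEmpty() || lang.contains("/")) {
                continue;
            }
            if (!excluded.contains(lang)) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList(
                "META-INF/MANIFEST.MF", "profiles/", "profiles/en",
                "profiles/fr", "profiles/zh-cn");
        Set<String> excluded = new HashSet<>(Arrays.asList("fr"));
        // prints [profiles/en, profiles/zh-cn]
        System.out.println(selectProfiles(entries, "profiles/", excluded));
    }
}
```

Only the selected entries would then be read and added to the list passed to DetectorFactory.loadProfile.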
Mark Butler answered Oct 22 '22 05:10


Since no Maven support was available, and the mechanism for loading profiles was not ideal (you need to point it at files instead of classpath resources), I created a fork which solves that problem:

https://github.com/galan/language-detector

I emailed the original author so that he could fork/maintain the changes, but no luck; the project seems to be abandoned.

Here is an example of how to use it now (you can write your own profiles where necessary):

DetectorFactory.loadProfile(new DefaultProfile()); // SmProfile is also available
Detector detector = DetectorFactory.create();
detector.append(input);
String result = detector.detect();
// maybe work with detector.getProbabilities()

I don't like the static approach that DetectorFactory uses, but I won't rewrite the whole project; for that you'll have to create your own fork or pull request :)

Dag answered Oct 22 '22 07:10


It looks like the library only accepts files. You can either change the code and try submitting the changes upstream, or write your resources out to temporary files and have the library load those.
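
A rough sketch of the temporary-file route, using only the standard library. It assumes the profiles sit on the classpath under a directory such as profiles/ and that you know which language codes you want, since a classpath directory cannot be listed portably. ProfileExtractor and extractProfiles are hypothetical names, not part of langdetect.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class ProfileExtractor {

    /**
     * Hypothetical helper (not part of langdetect): copies
     * resourceDir/<lang> for each listed language code from the classpath
     * into a fresh temporary directory and returns that directory.
     */
    public static Path extractProfiles(ClassLoader loader,
                                       String resourceDir,
                                       List<String> langs) throws IOException {
        Path target = Files.createTempDirectory("langdetect-profiles");
        for (String lang : langs) {
            String resource = resourceDir + "/" + lang;
            try (InputStream in = loader.getResourceAsStream(resource)) {
                // getResourceAsStream returns null when the resource is missing.
                if (in == null) {
                    throw new IOException("Profile not found on classpath: " + resource);
                }
                Files.copy(in, target.resolve(lang), StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return target;
    }
}
```

The returned directory could then be handed to the library the same way the question passes an absolute path, for example DetectorFactory.loadProfile(dir.toString()). The temporary directory is not cleaned up here; whether to delete it on exit is left to the caller.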

vickirk answered Oct 22 '22 07:10