Suppose we have a text file with the content: "Je suis un beau homme ..."
another with: "I am a brave man"
and a third with text in German: "Guten morgen. Wie geht's ?"
How do we write a function that would tell us, with some probability, that the text in the first file is in French, the second is in English, the third is in German, etc.?
Links to books / out-of-the-box solutions are welcome. I write in Java, but I can learn Python if needed.
There is a package called JLangDetect which seems to do exactly what you want:
langof("un texte en français") = fr : OK
langof("a text in english") = en : OK
langof("un texto en español") = es : OK
langof("un texte un peu plus long en français") = fr : OK
langof("a text a little longer in english") = en : OK
langof("a little longer text in english") = en : OK
langof("un texto un poco mas largo en español") = es : OK
langof("J'aime les bisounours !") = fr : OK
langof("Bienvenue à Montmartre !") = fr : OK
langof("Welcome to London !") = en : OK
// ...
Edit: as Kevin pointed out, there is similar functionality in the Nutch project provided by the package org.apache.nutch.analysis.lang.
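
If you want to see roughly how this kind of detection can work before pulling in a library, here is a minimal sketch of one common technique: compare the character trigrams of the input against a trigram profile per language. This is not JLangDetect's or Nutch's API. The class `NGramLanguageGuesser`, its methods, and the tiny training sentences are all made up for illustration, and the scores are relative similarities normalized to sum to 1, not calibrated probabilities. A real detector would train on far larger corpora.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy language guesser based on character trigram profiles (illustrative only). */
final class NGramLanguageGuesser {
    private static final int N = 3;
    // language code -> trigram counts learned from that language's sample text
    private final Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();

    /** Builds a guesser from tiny hand-written samples (an assumption; far too small for real use). */
    static NGramLanguageGuesser withToySamples() {
        NGramLanguageGuesser g = new NGramLanguageGuesser();
        g.train("en", "the man is here and i am a good friend of the brave woman "
                + "this is what we know about the morning in english");
        g.train("fr", "je suis un homme et tu es une belle femme "
                + "nous avons le texte dans la maison ce matin bonjour");
        g.train("de", "guten morgen wie geht es dir ich bin ein mann und das ist "
                + "der text auf deutsch wir haben heute keine zeit");
        return g;
    }

    /** Counts overlapping character trigrams in lower-cased text; non-letters become spaces. */
    static Map<String, Integer> trigrams(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim();
        String padded = " " + cleaned + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + N <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + N), 1, Integer::sum);
        }
        return counts;
    }

    void train(String language, String sample) {
        profiles.put(language, trigrams(sample));
    }

    /** Cosine similarity between two trigram count vectors; 0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Similarity per language, normalized so the scores sum to 1 (all 0 if nothing matches). */
    Map<String, Double> scores(String text) {
        Map<String, Integer> query = trigrams(text);
        Map<String, Double> raw = new LinkedHashMap<>();
        double total = 0;
        for (Map.Entry<String, Map<String, Integer>> p : profiles.entrySet()) {
            double sim = cosine(query, p.getValue());
            raw.put(p.getKey(), sim);
            total += sim;
        }
        Map<String, Double> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : raw.entrySet()) {
            normalized.put(e.getKey(), total == 0 ? 0.0 : e.getValue() / total);
        }
        return normalized;
    }

    /** Returns the language with the highest score, or null if no language was trained. */
    String guess(String text) {
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Double> e : scores(text).entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NGramLanguageGuesser g = withToySamples();
        String[] texts = {"Je suis un beau homme", "I am a brave man", "Guten morgen. Wie geht's ?"};
        for (String text : texts) {
            System.out.println(text + " -> " + g.guess(text) + " " + g.scores(text));
        }
    }
}
```

With training text this small the scores are only meaningful relative to each other, and short or unusual inputs can easily be misclassified. For anything beyond experimentation, a maintained library such as the ones mentioned above is the safer choice.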