 

How do I tell what language a plain-text file is written in? [closed]


Suppose we have a text file with the content: "Je suis un beau homme ..."

another with: "I am a brave man"

the third with a text in German: "Guten morgen. Wie geht's ?"

How do we write a function that tells us, with some probability, that the text in the first file is in French, the text in the second is in English, and so on?

Links to books / out-of-the-box solutions are welcome. I write in Java, but I can learn Python if needed.

My comments

  1. One small addition: the text may contain phrases in different languages, either as part of the whole or as the result of a mistake. Classic literature has many examples, because members of the aristocracy were multilingual. So a probability describes the situation better: most of the text is in one language, while other parts may be written in another.
  2. Google API requires an Internet connection. I would prefer not to use remote functions/services, as I need to either do it myself or use a downloadable library. I'd like to do some research on the topic.
EugeneP asked Feb 24 '10 12:02





1 Answer

There is a package called JLangDetect which seems to do exactly what you want:

    langof("un texte en français") = fr : OK
    langof("a text in english") = en : OK
    langof("un texto en español") = es : OK
    langof("un texte un peu plus long en français") = fr : OK
    langof("a text a little longer in english") = en : OK
    langof("a little longer text in english") = en : OK
    langof("un texto un poco mas largo en español") = es : OK
    langof("J'aime les bisounours !") = fr : OK
    langof("Bienvenue à Montmartre !") = fr : OK
    langof("Welcome to London !") = en : OK
    // ...

Edit: as Kevin pointed out, there is similar functionality in the Nutch project provided by the package org.apache.nutch.analysis.lang.
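
If you want to research the idea yourself before picking a library, one common approach is to compare character trigram counts of the text against a per-language profile. The sketch below is not the API of JLangDetect or Nutch; the class name TrigramLanguageGuesser, its methods, and the three tiny seed sentences are all made up for illustration. The "probabilities" are just cosine similarities normalized to sum to 1, and a usable profile would need a much larger corpus per language.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy character-trigram language guesser (hypothetical names, illustrative seed data). */
final class TrigramLanguageGuesser {

    // Assumption: one short seed sentence per language stands in for a real training corpus.
    private static final Map<String, String> SEED_TEXTS = new LinkedHashMap<>();
    static {
        SEED_TEXTS.put("en", "the man is here and this is what we have with the other people in the house");
        SEED_TEXTS.put("fr", "je suis dans la maison et nous avons les autres avec une question pour vous");
        SEED_TEXTS.put("de", "ich bin in dem haus und wir haben die anderen mit einer frage an sie guten tag");
    }

    // Trigram counts per language, built once from the seed texts.
    private static final Map<String, Map<String, Integer>> PROFILES = new LinkedHashMap<>();
    static {
        for (Map.Entry<String, String> e : SEED_TEXTS.entrySet()) {
            PROFILES.put(e.getKey(), trigrams(e.getValue()));
        }
    }

    /** Lowercases, turns every run of non-letters into one space, and counts 3-character windows. */
    static Map<String, Integer> trigrams(String text) {
        String normalized = " " + text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}]+", " ").trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram count vectors; 0 if either is empty. */
    private static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int value = e.getValue();
            normA += value * value;
            dot += value * b.getOrDefault(e.getKey(), 0);
        }
        for (int value : b.values()) {
            normB += value * value;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns language code -> share of the total similarity (equal shares if nothing matches). */
    static Map<String, Double> guess(String text) {
        Map<String, Integer> sample = trigrams(text);
        Map<String, Double> scores = new LinkedHashMap<>();
        double total = 0;
        for (Map.Entry<String, Map<String, Integer>> e : PROFILES.entrySet()) {
            double score = cosine(sample, e.getValue());
            scores.put(e.getKey(), score);
            total += score;
        }
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            e.setValue(total == 0 ? 1.0 / scores.size() : e.getValue() / total);
        }
        return scores;
    }

    /** Returns the language code with the highest share. */
    static String best(String text) {
        String bestLang = null;
        double bestScore = -1;
        for (Map.Entry<String, Double> e : guess(text).entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                bestLang = e.getKey();
            }
        }
        return bestLang;
    }

    public static void main(String[] args) {
        System.out.println(guess("Je suis un beau homme ..."));
        System.out.println(guess("I am a brave man"));
        System.out.println(guess("Guten morgen. Wie geht's ?"));
    }
}
```

For text that mixes languages, the same guess method could be called per sentence or per paragraph instead of once per file, which gives a rough picture of which parts are in which language.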

Otto Allmendinger answered Nov 01 '22 07:11
