How do I tell what language a plain-text file is written in? [closed]

Suppose we have a text file with the content: "Je suis un beau homme ..."

another with: "I am a brave man"

the third with text in German: "Guten Morgen. Wie geht's?"

How do we write a function that would tell us, with some probability, that the text in the first file is French, the text in the second is English, and so on?

Links to books / out-of-the-box solutions are welcome. I write in Java, but I can learn Python if needed.

My comments

There's one small comment I need to add. The text may contain phrases in different languages, either as part of the whole or as a result of a mistake. Classic literature has many examples, because members of the aristocracy were multilingual. So a probability describes the situation better: most of the text is in one language, while other parts may be written in another.
Google API: it requires an Internet connection. I would prefer not to use remote functions/services, as I need to either do it myself or use a downloadable library. I'd like to do some research on the topic.

466

asked Feb 24 '10 12:02

EugeneP

1 Answer

There is a package called JLangDetect which seems to do exactly what you want:

langof("un texte en français") = fr : OK
langof("a text in english") = en : OK
langof("un texto en español") = es : OK
langof("un texte un peu plus long en français") = fr : OK
langof("a text a little longer in english") = en : OK
langof("a little longer text in english") = en : OK
langof("un texto un poco mas largo en español") = es : OK
langof("J'aime les bisounours !") = fr : OK
langof("Bienvenue à Montmartre !") = fr : OK
langof("Welcome to London !") = en : OK
// ...

Edit: as Kevin pointed out, there is similar functionality in the Nutch project provided by the package org.apache.nutch.analysis.lang.
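
If you would rather experiment with the idea yourself before picking a library, here is a minimal sketch of one common approach: build a character-trigram profile per language from sample text, then compare a new text against each profile with cosine similarity. This is not how JLangDetect or Nutch are implemented; the class name TrigramLanguageGuesser, its methods, and the tiny training sentences are all made up for illustration. The returned numbers are similarity shares normalized to sum to 1, not calibrated probabilities, and with so little training data the guesses will be unreliable.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of JLangDetect or Nutch.
class TrigramLanguageGuesser {

    // Trigram counts per language code, accumulated from sample text.
    private final Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();

    /** Adds the trigrams of the sample text to the profile of the given language. */
    public void train(String language, String sample) {
        Map<String, Integer> profile = profiles.computeIfAbsent(language, k -> new HashMap<>());
        for (Map.Entry<String, Integer> e : trigrams(sample).entrySet()) {
            profile.merge(e.getKey(), e.getValue(), Integer::sum);
        }
    }

    /** Returns, per language, its share of the total cosine similarity (shares sum to 1). */
    public Map<String, Double> guess(String text) {
        Map<String, Integer> query = trigrams(text);
        Map<String, Double> scores = new LinkedHashMap<>();
        double total = 0.0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double similarity = cosine(query, e.getValue());
            scores.put(e.getKey(), similarity);
            total += similarity;
        }
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            // If nothing matched at all, fall back to a uniform split.
            e.setValue(total == 0.0 ? 1.0 / scores.size() : e.getValue() / total);
        }
        return scores;
    }

    /** Counts overlapping 3-character windows after lowercasing and collapsing non-letters to spaces. */
    static Map<String, Integer> trigrams(String text) {
        String normalized = " " + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two sparse count vectors; 0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        TrigramLanguageGuesser guesser = new TrigramLanguageGuesser();
        // Toy training data; a real profile needs far more text per language.
        guesser.train("en", "the quick brown fox jumps over the lazy dog and this is a text in english");
        guesser.train("fr", "ceci est un texte en français et le renard brun saute par-dessus le chien paresseux");
        guesser.train("de", "dies ist ein text in deutscher sprache und der braune fuchs springt über den faulen hund");
        System.out.println(guesser.guess("Guten Morgen. Wie geht's?"));
    }
}
```

For the mixed-language texts mentioned in the question, one option would be to split the text into sentences or paragraphs and call guess on each piece, so that every part gets its own set of shares.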

134

answered Nov 01 '22 07:11

Otto Allmendinger
