How to determine the (natural) language of a document?

I have a set of documents in two languages: English and German. There is no usable meta information about these documents, a program can look at the content only. Based on that, the program has to decide which of the two languages the document is written in.

Is there any "standard" algorithm for this problem that can be implemented in a few hours' time? Or alternatively, a free .NET library or toolkit that can do this? I know about LingPipe, but it is

  1. Java
  2. Not free for "semi-commercial" usage

This problem seems to be surprisingly hard. I checked out the Google AJAX Language API (which I found by searching this site first), but it was ridiculously bad. Of the six German web pages I pointed it at, it guessed only one correctly; the other guesses were Swedish, English, Danish and French...

A simple approach I came up with is to use a list of stop words. My app already uses such a list for German documents in order to analyze them with Lucene.Net. If my app scans a document for occurrences of stop words from either language, the language with more occurrences would win. A very naive approach, to be sure, but it might be good enough. Unfortunately I don't have the time to become an expert at natural-language processing, although it is an intriguing topic.
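Sketched out in Python, the idea looks roughly like this (the word lists here are tiny illustrative placeholders; a real implementation would reuse the full stop-word lists my app already has for Lucene.Net):

    import re

    # Tiny illustrative stop-word lists; substitute the full lists
    # your indexing setup already uses.
    STOP_WORDS = {
        "en": {"the", "and", "of", "to", "is", "that", "with"},
        "de": {"der", "die", "das", "und", "ist", "nicht", "mit"},
    }

    def guess_language(text):
        """Return the language whose stop words occur most often in text."""
        tokens = re.findall(r"[a-zäöüß]+", text.lower())
        scores = {lang: sum(t in words for t in tokens)
                  for lang, words in STOP_WORDS.items()}
        return max(scores, key=scores.get)

    print(guess_language("Das ist nicht der einzige Grund."))  # -> "de"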

asked Sep 05 '09 by Robert Petermeier




2 Answers

Try measuring the occurrences of each letter in the text. Compute the letter frequencies (and perhaps their full distributions) from known English and German texts. Having obtained that data, you can judge which language's frequency distribution your text is closest to.

You could use Bayesian inference to pick the most likely language (with a certain error probability), or there may be other statistical methods suited to such tasks.
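A minimal sketch of that in Python, treating the per-letter frequencies as a unigram naive Bayes model (english_sample, german_sample and document stand in for text you supply yourself):

    import math
    from collections import Counter

    ALPHABET = set("abcdefghijklmnopqrstuvwxyzäöüß")

    def letter_probs(sample_text):
        """Estimate per-letter probabilities from a known-language sample,
        with add-one smoothing so unseen letters keep nonzero probability."""
        counts = Counter(c for c in sample_text.lower() if c in ALPHABET)
        total = sum(counts.values()) + len(ALPHABET)
        return {c: (counts[c] + 1) / total for c in ALPHABET}

    def log_likelihood(text, probs):
        """Sum of log P(letter | language); higher means a better fit."""
        return sum(math.log(probs[c]) for c in text.lower() if c in probs)

    # english_sample, german_sample and document are placeholders for
    # your own sample texts and the document to classify.
    models = {"en": letter_probs(english_sample),
              "de": letter_probs(german_sample)}
    guess = max(models, key=lambda lang: log_likelihood(document, models[lang]))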

answered by P Shved


The problem with using a list of stop words is one of robustness. Stop word lists are basically a set of rules, one rule per word. Rule-based methods tend to be less robust to unseen data than statistical methods. Some problems you will encounter are documents that contain equal counts of stop words from each language, documents that have no stop words, documents that have stop words from the wrong language, etc. Rule-based methods can't do anything their rules don't specify.

One approach that doesn't require you to implement Naive Bayes or any other complicated math or machine-learning algorithm yourself is to count character bigrams and trigrams (bigrams will work with less training data, so use them if you have little to start with). Run the counts on a handful of documents of known source language (the more the better), and then construct an ordered list for each language by the number of counts. For example, English would have "th" as the most common bigram. With your ordered lists in hand, count the bigrams in a document you wish to classify and put them in order. Then go through each one and compare its rank in the sorted unknown-document list to its rank in each of the training lists. Give each bigram a score for each language as

1 / (ABS(RankInUnknown - RankInLanguage) + 1)

Whichever language ends up with the highest score is the winner. It's simple, doesn't require a lot of coding, and doesn't require a lot of training data. Even better, you can keep adding data to it as you go on and it will improve. Plus, you don't have to hand-create a list of stop words and it won't fail just because there are no stop words in a document.
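A rough sketch of the whole procedure in Python (english_corpus, german_corpus and document are placeholders for your own training data and the document to classify; pass n=3 for trigrams):

    from collections import Counter

    def ngram_ranks(text, n=2):
        """Ordered character n-gram list: n-gram -> rank (0 = most frequent)."""
        t = "".join(c for c in text.lower() if c.isalpha() or c == " ")
        counts = Counter(t[i:i + n] for i in range(len(t) - n + 1))
        return {g: rank for rank, (g, _) in enumerate(counts.most_common())}

    def score(unknown_ranks, lang_ranks):
        """Per the formula above: 1 / (|rank difference| + 1), summed
        over n-grams that appear in both lists."""
        return sum(1 / (abs(unknown_ranks[g] - lang_ranks[g]) + 1)
                   for g in unknown_ranks if g in lang_ranks)

    # english_corpus, german_corpus and document are placeholders for
    # your own training texts and the document to classify.
    profiles = {"en": ngram_ranks(english_corpus),
                "de": ngram_ranks(german_corpus)}
    unknown = ngram_ranks(document)
    guess = max(profiles, key=lambda lang: score(unknown, profiles[lang]))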

It will still be confused by documents whose bigram counts match both languages about equally well. If you can get enough training data, using trigrams will make this less likely, but trigrams also need the unknown document to be longer. Really short documents may require you to drop down to single-character (unigram) counts.

All this said, you're going to have errors. There's no silver bullet. Combining methods and choosing the language with the highest combined confidence across methods may be the smartest thing to do.
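As a crude sketch of such a combination: normalize each method's scores so they're comparable, then add them up (each dict here would come from one of the detectors above):

    def combine(scores_per_method):
        """scores_per_method: one {language: score} dict per detector.
        Normalizes each dict to sum to 1, sums across detectors, and
        returns the language with the highest combined score."""
        totals = {}
        for scores in scores_per_method:
            norm = sum(scores.values()) or 1.0
            for lang, s in scores.items():
                totals[lang] = totals.get(lang, 0.0) + s / norm
        return max(totals, key=totals.get)

    print(combine([{"en": 5, "de": 2}, {"en": 0.4, "de": 0.6}]))  # -> "en"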

answered by ealdent