
Determine if text is in English?

I am using both NLTK and scikit-learn to do some text processing. However, my list of documents contains some that are not in English. For example, the following could be true:

[ "this is some text written in English", 
  "this is some more text written in English", 
  "Ce n'est pas en anglais" ] 

For the purposes of my analysis, I want all sentences that are not in English removed during pre-processing. Is there a good way to do this? I have been Googling but cannot find anything specific that will let me recognize whether a string is in English. Is this functionality not offered in either NLTK or scikit-learn? EDIT: I've seen questions like this and this, but both are about individual words, not a "document". Would I have to loop through every word in a sentence to check whether the whole sentence is in English?

I'm using Python, so Python libraries would be preferable, but I can switch languages if needed; I just thought Python would be best for this.

asked Apr 12 '17 by ocean800



6 Answers

There is a library called langdetect. It is ported from Google's language-detection available here:

https://pypi.python.org/pypi/langdetect

It supports 55 languages out of the box.
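For the list of documents in the question, a minimal sketch with langdetect (assuming pip install langdetect) might look like this; note that langdetect is non-deterministic by default, so the seed is fixed for reproducible results:

from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded

docs = ["this is some text written in English",
        "this is some more text written in English",
        "Ce n'est pas en anglais"]

english_only = [doc for doc in docs if detect(doc) == 'en']
print(english_only)  # the French sentence should be filtered out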

answered Oct 22 '22 by salehinejad


You might be interested in my paper The WiLI benchmark dataset for written language identification. I also benchmarked a couple of tools.

TL;DR:

  • CLD-2 is pretty good and extremely fast
  • lang-detect is a tiny bit better, but much slower
  • langid is good, but CLD-2 and lang-detect are much better
  • NLTK's Textcat is neither efficient nor effective.

You can install lidtk and classify languages:

$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"                  
fra
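Since the question prefers Python, a rough in-Python equivalent using the pycld2 bindings to CLD-2 could look like the sketch below (my own addition, assuming pip install pycld2; the answer itself only shows the lidtk CLI):

import pycld2 as cld2

for doc in ["this is some text written in English",
            "this is some more text written in English",
            "Ce n'est pas en anglais"]:
    is_reliable, _, details = cld2.detect(doc)
    # details holds up to three (languageName, languageCode, percent, score) tuples
    print(doc, '->', details[0][1], '' if is_reliable else '(unreliable)')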
answered Oct 22 '22 by Martin Thoma


Pretrained fastText Model Worked Best for My Similar Needs

I arrived at your question with a very similar need. I appreciated Martin Thoma's answer, but I found the most help in part 7 of Rabash's answer HERE.

After experimenting to find what worked best for my need, which was verifying that 60,000+ text files were in English, I found that fastText was an excellent tool.

With a little work, I had a tool that ran very quickly over many files. Below is the code with comments. I believe that you and others will be able to modify it for your more specific needs.

import fasttext

class English_Check:
    def __init__(self):
        # No need to train a model to detect languages; a very good
        #    pretrained one already exists. Let's use it.
        pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
        self.model = fasttext.load_model(pretrained_model_path)

    def predict_languages(self, text_file):
        this_D = {}
        with open(text_file, 'r') as f:
            fla = f.readlines()  # fla = file line array.
            # fasttext doesn't like newline characters, but it can take
            #    an array of lines from a file. The two list comprehensions
            #    below just clean up the lines in fla.
            fla = [line.rstrip('\n').strip(' ') for line in fla]
            fla = [line for line in fla if len(line) > 0]

            for line in fla:  # Language-predict each line of the file.
                language_tuple = self.model.predict(line)
                # The next two lines pull out the top language prediction
                #    string AND the confidence value for that prediction.
                prediction = language_tuple[0][0].replace('__label__', '')
                value = language_tuple[1][0]

                # Each top language prediction for the lines in the file
                #    becomes a unique key in the this_D dictionary.
                #    Every time that language is found, add the confidence
                #    score to the running tally for that language.
                if prediction not in this_D:
                    this_D[prediction] = 0
                this_D[prediction] += value

        self.this_D = this_D

    def determine_if_file_is_english(self, text_file):
        self.predict_languages(text_file)

        # Find the max tallied confidence and the sum of all confidences.
        max_value = max(self.this_D.values())
        sum_of_values = sum(self.this_D.values())
        # Calculate the relative confidence of the max confidence against
        #    all confidence scores, then find the key with the max confidence.
        confidence = max_value / sum_of_values
        max_key = [key for key in self.this_D
                   if self.this_D[key] == max_value][0]

        # We only want to know whether this is English or not.
        return max_key == 'en'

Below is how I instantiate and use the class for my needs.

file_list = # some tool to get my specific list of files to check for English

en_checker = English_Check()
for file in file_list:
    check = en_checker.determine_if_file_is_english(file)
    if not check:
        print(file)
answered Oct 22 '22 by Thom Ives


This is what I used some time ago. It works for texts of at least three words with no more than four unrecognized words (per the settings below). Of course, you can play with the settings, but for my use case (website scraping) they worked pretty well.

from enchant.checker import SpellChecker

max_error_count = 4
min_text_length = 3

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]  # words the spell checker flags
    return len(errors) <= max_error_count and len(quote.split()) >= min_text_length

print(is_in_english('“中文”'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))

> False
> True
answered Oct 22 '22 by grizmin


Use the enchant library

import enchant

dictionary = enchant.Dict("en_US")  # also available: en_GB, fr_FR, etc.

dictionary.check("Hello")  # returns True
dictionary.check("Helo")   # returns False

This example is taken directly from their website.
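Since the question is about whole sentences rather than single words, one rough way to extend this word-level check (my own sketch, not from the answer; mostly_english and its 0.7 threshold are hypothetical) is to require that most words in a sentence pass the dictionary check:

import enchant

dictionary = enchant.Dict("en_US")

def mostly_english(sentence, threshold=0.7):
    # Hypothetical helper: treat a sentence as English if most of its
    # words pass the en_US dictionary check.
    words = [w.strip('.,!?;:"\'') for w in sentence.split()]
    words = [w for w in words if w]
    if not words:
        return False
    hits = sum(dictionary.check(w) for w in words)
    return hits / len(words) >= threshold

print(mostly_english("this is some text written in English"))  # True
print(mostly_english("Ce n'est pas en anglais"))               # most likely False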

answered Oct 22 '22 by lordingtar


If you want something lightweight, letter trigrams are a popular approach. Every language has a different "profile" of common and uncommon trigrams. You can google around for it, or code your own. Here's a sample implementation I came across, which uses "cosine similarity" as a measure of distance between the sample text and the reference data:

http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/

If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to anticipate sentences from languages for which you don't have trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents, and choose a suitable threshold for the English cosine score.
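For illustration, here is a minimal self-contained sketch of the idea (my own, not the linked recipe; english_corpus is a placeholder for a real, much larger reference text):

from collections import Counter
import math

def trigram_profile(text):
    # Count overlapping character trigrams, padded with spaces.
    text = ' ' + text.lower() + ' '
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_similarity(p, q):
    dot = sum(p[t] * q[t] for t in set(p) & set(q))
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

english_corpus = "this is some text written in English"  # use a much larger corpus
reference = trigram_profile(english_corpus)

for sentence in ["this is some more text written in English",
                 "Ce n'est pas en anglais"]:
    score = cosine_similarity(reference, trigram_profile(sentence))
    print(sentence, '->', round(score, 3))  # choose a threshold from your own data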

answered Oct 22 '22 by alexis