Well, I know this question has been asked many times, but I still couldn't fix it with the "available" solutions. I'm hoping to get further ideas or concepts on how to detect whether my sentences are English in Python. The available solutions:
- Language Detector (in ruby not in python :/)
- Google Translate API v2 (no longer free; I'd have to pay 20 bucks a month, and I'm doing this project for academic purposes. Courtesy limit: 0 characters/day)
- Language identification for Python (source code not found, link below: automatic-language-identification)
- Enchant (is it not for Python 2.7? I'm new to Python, any guide? I bet this would be the one I need)
- WordNet from NLTK (I have no idea why "wordnet.synsets" is missing and only "wordnet.Synset" is available. The sample code in the solution doesn't work for me either T_T, probably a versioning issue again?)
- Store English words in a list and check whether each word exists in it (yeah, it's kind of a bad approach when the sentences are from Twitter and.. you know that :P)
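For what it's worth, the word-list idea from the last bullet could look roughly like the sketch below. It assumes Python 3, and the names ENGLISH_WORDS, tokenize_tweet and english_ratio are made up for illustration. The tiny vocabulary is only a placeholder for a real word list, and the Twitter cleanup (dropping URLs, @mentions and #hashtags) is an assumption about what preprocessing would be needed.

```python
import re

# Placeholder vocabulary: a real check would load a full English word list,
# for example from a file with one word per line.
ENGLISH_WORDS = {"i", "love", "this", "song", "the", "is", "a", "good", "day"}


def tokenize_tweet(text):
    """Strip Twitter-specific noise and return the lowercase words."""
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)       # drop @mentions and #hashtags
    return re.findall(r"[a-z']+", text.lower())


def english_ratio(text, vocabulary=ENGLISH_WORDS):
    """Return the fraction of words in the text found in the vocabulary."""
    tokens = tokenize_tweet(text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)
```

A sentence could then be treated as English when the ratio is above some threshold (say 0.5, a guess that would need tuning), which is more forgiving of slang and typos than requiring every word to match.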
WORKING SOLUTION
Finally, after a series of attempts, the following are the working solutions (alternatives to the list above):
- Wiktionary API (using urllib2, with simplejson to parse the response. If the page key is -1, the word doesn't exist; otherwise it's treated as English. For Twitter, of course, you have to preprocess each word first to strip special characters such as @#,?!. For how to find the key, see the reference here: Simplejson and random key value)
- Answer from Dogukan Tufekci (ticked as accepted) (Weakness: if the sentence is shorter than 20 characters, you have to install PyEnchant or it returns UNKNOWN. Since PyEnchant doesn't support Python 2.7, I couldn't install it, so this doesn't work for sentences shorter than 20 characters)
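As a rough sketch of the Wiktionary lookup described above: this version assumes Python 3, so it uses urllib.request and the standard json module in place of urllib2 and simplejson. The function names and the User-Agent string are made up for illustration. The "-1" check relies on the MediaWiki API reporting a missing title under a negative page id.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wiktionary.org/w/api.php"


def page_exists(response):
    """Return True if the parsed API response lists at least one real page."""
    pages = response.get("query", {}).get("pages", {})
    # A missing title comes back under a negative page id such as "-1".
    return any(not page_id.startswith("-") for page_id in pages)


def fetch_page_info(word):
    """Query the API for one title and return the decoded JSON as a dict."""
    params = urllib.parse.urlencode(
        {"action": "query", "titles": word, "format": "json"}
    )
    request = urllib.request.Request(
        API_URL + "?" + params,
        headers={"User-Agent": "english-check-sketch/0.1"},  # hypothetical name
    )
    with urllib.request.urlopen(request, timeout=10) as reply:
        return json.loads(reply.read().decode("utf-8"))


def in_wiktionary(word):
    """Return True if the lowercased word has an entry page."""
    return page_exists(fetch_page_info(word.lower()))
```

One caveat: the English Wiktionary also has entries for words from other languages, so a hit only means the word exists there, not that it is definitely English.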
References
- Detecting whether or not text is English (in bulk)
- How to check if a word is an English word with Python?
- How to retrieve Wiktionary word content?
How do you check if a word is in a Python dictionary?
To simply check if a key exists in a Python dictionary, you can use the in operator to search through the dictionary keys like this:

    pets = {'cats': 1, 'dogs': 2, 'fish': 3}
    if 'dogs' in pets:
        print('Dogs found!')  # Dogs found!
What is Langdetect in Python?
langdetect (installed with "pip install langdetect", from pypi.org) is a re-implementation of Google's language-detection library, ported from Java to Python. Simply pass your text to the imported detect function and it will output the two-letter ISO 639-1 code of the language for which the model gave the highest confidence score.
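A minimal sketch of how that could be wrapped into an English check, assuming Python 3 and that langdetect is installed. The is_english helper and its detector parameter are made up for illustration; the parameter is only there so the function can be tried with a stand-in when the library is not available.

```python
def is_english(text, detector=None):
    """Return True if the detector labels the text as English ("en")."""
    if detector is None:
        # Third-party dependency, imported lazily: pip install langdetect
        from langdetect import detect as detector
    if not text.strip():
        # langdetect raises an exception on empty input, so bail out early.
        return False
    return detector(text) == "en"
```

With the library installed, is_english("some tweet text") is all that is needed. Keep in mind that results on very short texts tend to be unreliable, and the langdetect README notes that detection is non-deterministic unless DetectorFactory.seed is set.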
You can try the guess_language library, which I found through Miguel Grinberg's The Flask Mega-Tutorial. It looks like it supports both Python 2 and 3, so it should be OK.
You might be able to use hidden Markov models to detect languages, since each language has its own characteristics.
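To give a feel for the "own characteristics" idea, here is a much simpler relative of that approach: a plain character-bigram Markov chain per language with add-one smoothing, not a full hidden Markov model. It assumes Python 3. All names and the two tiny training samples are made up for illustration, and a usable model would need far more training text.

```python
import math
import re
from collections import Counter

ALPHABET_SIZE = 27  # a-z plus space, after normalisation

# Toy training samples (unaccented on purpose); real models need much more.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat "
          "on the mat with the other things that they have",
    "es": "el rapido zorro marron salta sobre el perro perezoso y el gato "
          "se sento en la alfombra con las otras cosas que tienen",
}


def normalise(text):
    """Lowercase the text and reduce everything but a-z to single spaces."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def train(text):
    """Count character bigrams and how often each character starts one."""
    text = normalise(text)
    return Counter(zip(text, text[1:])), Counter(text[:-1])


def log_likelihood(text, model):
    """Sum the smoothed log-probabilities of each bigram in the text."""
    bigrams, starts = model
    text = normalise(text)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (starts[a] + ALPHABET_SIZE))
        for a, b in zip(text, text[1:])
    )


def guess_language(text, models):
    """Return the name of the model that scores the text highest."""
    return max(models, key=lambda name: log_likelihood(text, models[name]))


MODELS = {name: train(sample) for name, sample in SAMPLES.items()}
```

Calling guess_language("the dog sat with the cat", MODELS) picks whichever model finds the character sequence most familiar, which is the same intuition an HMM-based detector would build on.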