
NLTK and language detection

How do I detect what language a text is written in using NLTK?

The examples I've seen use nltk.detect, but after installing NLTK on my Mac, I cannot find this package.

niklassaers asked Jul 05 '10 21:07





2 Answers

Have you come across the following code snippet?

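import nltk  # needs the "words" corpus, e.g. via nltk.download("words")

# `text` is assumed to be a list of word tokens from the document being checked
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
text_vocab = set(w.lower() for w in text if w.lower().isalpha())
unusual = text_vocab.difference(english_vocab)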

from http://groups.google.com/group/nltk-users/browse_thread/thread/a5f52af2cbc4cfeb?pli=1&safe=active
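The snippet only collects the words that are missing from the English word list; to turn that into a rough "is this English?" signal, one option is to look at what fraction of the text's vocabulary is unusual. The following is a minimal sketch of that idea, not something from the linked thread: the helper name `unusual_ratio` and the tiny demo vocabulary are made up for illustration, and the vocabulary is passed in as a parameter so the function works with any lowercase word set.

```python
def unusual_ratio(tokens, vocab):
    """Fraction of distinct alphabetic tokens that are not in `vocab`.

    `vocab` is expected to be a set of lowercase words. Returns None when
    the tokens contain no alphabetic words at all.
    """
    text_vocab = {t.lower() for t in tokens if t.isalpha()}
    if not text_vocab:
        return None
    unusual = text_vocab - vocab
    return len(unusual) / len(text_vocab)


# Hypothetical toy vocabulary, only to keep the example self-contained.
demo_vocab = {"the", "cat", "sat", "on", "mat"}

print(unusual_ratio("The cat sat on the mat".split(), demo_vocab))
print(unusual_ratio("Die Katze sitzt auf der Matte".split(), demo_vocab))
```

With NLTK installed and the "words" corpus downloaded, the same function could be called with `set(w.lower() for w in nltk.corpus.words.words())` as `vocab`. A low ratio suggests English and a high one suggests something else; where to draw the line is a judgment call and depends on the text.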

Or the following demo file?

https://web.archive.org/web/20120202055535/http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/misc/langid.py

William Niu answered Sep 25 '22 07:09



This library is not from NLTK either but certainly helps.

$ sudo pip install langdetect

Supported Python versions 2.6, 2.7, 3.x.

>>> from langdetect import detect
>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'

https://pypi.python.org/pypi/langdetect?

P.S.: Don't expect this to always work correctly, especially on short inputs:

>>> detect("today is a good day")
'so'
>>> detect("today is a good day.")
'so'
>>> detect("la vita e bella!")
'it'
>>> detect("khoobi? khoshi?")
'so'
>>> detect("wow")
'pl'
>>> detect("what a day")
'en'
>>> detect("yay!")
'so'
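One way to guard against misfires like these is to refuse to answer when the input is very short or when the top guess has a low probability. The sketch below assumes langdetect's `detect_langs` (which returns candidates with `.lang` and `.prob`) and `DetectorFactory.seed` (which makes results repeatable). The wrapper name `guess_language`, the `detector` parameter, and both thresholds are illustrative choices, not part of the library.

```python
def guess_language(text, min_prob=0.9, min_letters=20, detector=None):
    """Return a language code, or None when the guess looks unreliable.

    `min_prob` and `min_letters` are arbitrary cut-offs chosen for this
    sketch; tune them on your own data.
    """
    # Very short inputs ("wow", "yay!") give the detector too little evidence.
    if sum(ch.isalpha() for ch in text) < min_letters:
        return None

    if detector is None:
        # Imported lazily so the helper can also be tried with a stub detector.
        from langdetect import DetectorFactory, detect_langs
        DetectorFactory.seed = 0  # make repeated runs give the same result
        detector = detect_langs

    # Note: langdetect can raise LangDetectException for text with no usable
    # features; that case is not handled in this sketch.
    candidates = detector(text)
    if not candidates:
        return None

    best = max(candidates, key=lambda c: c.prob)
    return best.lang if best.prob >= min_prob else None
```

Called as `guess_language("wow")` this returns None instead of a guess, because the input has fewer letters than the cut-off. Longer sentences are passed to langdetect and only accepted if the best candidate clears `min_prob`.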
SVK answered Sep 26 '22 07:09
