Automatically determine the natural language of a website page given its URL

Tags: python, url, web, nlp

I'm looking for a way to automatically determine the natural language used by a website page, given its URL.

In Python, a function like:

    def LanguageUsed(url):
        # stuff

which returns a language specifier (e.g. 'en' for English, 'ja' for Japanese, etc.).

Summary of results: I have a reasonable solution working in Python, using the oice.langdet package from PyPI. It does a decent job of discriminating English from non-English, which is all I require at the moment. Note that you have to fetch the HTML yourself, e.g. with Python's urllib. Also note that oice.langdet is GPL-licensed.
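For reference, the fetch step can be done with the standard library alone. The sketch below uses Python 3's urllib.request (the original used Python 2's urllib, so take the exact calls as an updated assumption); the returned text is what you would hand to oice.langdet or any other detector:

    # Sketch of the fetch step, standard library only (Python 3).
    import re
    import urllib.request

    def fetch_visible_text(url):
        """Download a page and crudely strip markup, leaving visible text."""
        with urllib.request.urlopen(url) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            html = response.read().decode(charset, errors="replace")
        html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)  # drop scripts/CSS
        text = re.sub(r"(?s)<[^>]+>", " ", html)                   # drop remaining tags
        return re.sub(r"\s+", " ", text).strip()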

For a more general solution using trigrams in Python, as others have suggested, see this Python Cookbook recipe from ActiveState.
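As a toy illustration of what such a recipe does (not the recipe itself): build a character-trigram frequency profile per language, then score new text by cosine similarity. The one-sentence training strings below are placeholders; real profiles need far more text.

    # Toy trigram classifier: frequency profiles + cosine similarity.
    import math
    from collections import Counter

    def trigrams(text):
        text = " " + text.lower() + " "
        return Counter(text[i:i + 3] for i in range(len(text) - 2))

    def cosine(a, b):
        dot = sum(a[g] * b[g] for g in set(a) & set(b))
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    profiles = {  # placeholder training text; real profiles need much more
        "en": trigrams("the quick brown fox jumps over the lazy dog"),
        "fr": trigrams("le renard brun rapide saute par-dessus le chien"),
    }

    def guess_language(text):
        sample = trigrams(text)
        return max(profiles, key=lambda lang: cosine(profiles[lang], sample))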

The Google AJAX Language API's detection works very well (possibly the best I've seen). However, it is JavaScript, and Google's terms of service forbid automating its use.

Asked Jul 22 '09 by Travis

4 Answers

This is usually accomplished by using character n-gram models. You can find a state-of-the-art language identifier for Java here. If you need some help converting it to Python, just ask. Hope it helps.
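If the Java identifier referred to here is Nakatani Shuyo's language-detection project (an assumption on my part, since the link text isn't preserved), a Python port already exists as the third-party langdetect package (pip install langdetect), so the conversion work may already be done. A minimal sketch:

    from langdetect import DetectorFactory, detect, detect_langs

    DetectorFactory.seed = 0  # pin the internal sampling so results are repeatable

    print(detect("This is clearly an English sentence."))    # -> 'en'
    print(detect_langs("Ceci est une phrase en français."))  # -> [fr:0.99...]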

Answered by João Silva


Your best bet really is to use Google's natural language detection API. It returns an ISO code for the page language, along with a probability index.

See http://code.google.com/apis/ajaxlanguage/documentation/

Answered by Vincent Buck


There is nothing about the URL itself that will indicate language.

One option would be to use a natural language toolkit to try to identify the language based on the content, but even if you can get the NLP part of it working, it'll be pretty slow. Also, it may not be reliable. Remember, most user agents pass something like

Accept-Language: en-US

with each request, and many large websites will serve different content based on that header. Smaller sites will be more reliable because they won't pay attention to the language headers.
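If you fetch pages yourself, you can pin that header so content-negotiating sites return a predictable variant. A standard-library sketch, with a placeholder URL:

    import urllib.request

    req = urllib.request.Request(
        "http://example.com/",  # placeholder URL
        headers={"Accept-Language": "en-US,en;q=0.9"},
    )
    with urllib.request.urlopen(req) as response:
        print(response.headers.get("Content-Language"))  # may be None
        html = response.read().decode("utf-8", errors="replace")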

You could also use the server's location (i.e., which country the server is in) as a proxy for language, using GeoIP. It's obviously not perfect, but it is much better than guessing from the URL's top-level domain.
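A minimal sketch of that idea, assuming the modern geoip2 package and a local GeoLite2-Country database file (both postdate this answer and are named here only for illustration):

    import socket
    from urllib.parse import urlparse

    import geoip2.database  # pip install geoip2

    def country_for_url(url, db_path="GeoLite2-Country.mmdb"):
        """Map a URL to the server's country code as a crude language proxy."""
        ip = socket.gethostbyname(urlparse(url).hostname)
        with geoip2.database.Reader(db_path) as reader:
            return reader.country(ip).country.iso_code  # e.g. 'DE'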

Answered by tghw


You might want to try n-gram-based detection.

TextCat DEMO (LGPL) seems to work pretty well (it recognizes almost 70 languages). There is a Python port, provided by Thomas Mangin here, that uses the same corpus.

Edit: the TextCat competitors page provides some interesting links too.

Edit 2: I wonder how difficult it would be to write a Python wrapper for http://www.mnogosearch.org/guesser/...
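For the curious: TextCat implements Cavnar & Trenkle's rank-order ("out-of-place") n-gram measure. A bare-bones version of it, with placeholder training text, looks roughly like this:

    # Bare-bones Cavnar & Trenkle "out-of-place" measure: rank n-grams
    # by frequency, then sum rank displacements against each profile.
    from collections import Counter

    def ranked_ngrams(text, n_max=3, top=300):
        counts = Counter()
        padded = " " + text.lower() + " "
        for n in range(1, n_max + 1):
            counts.update(padded[i:i + n] for i in range(len(padded) - n + 1))
        ordered = [g for g, _ in counts.most_common(top)]
        return {g: rank for rank, g in enumerate(ordered)}

    def out_of_place(profile, sample, penalty=300):
        # Unseen n-grams pay the maximum penalty.
        return sum(abs(rank - profile.get(g, penalty))
                   for g, rank in sample.items())

    profiles = {  # placeholder training text; real profiles need much more
        "en": ranked_ngrams("the quick brown fox jumps over the lazy dog"),
        "de": ranked_ngrams("der schnelle braune fuchs springt über den hund"),
    }

    def classify(text):
        sample = ranked_ngrams(text)
        return min(profiles, key=lambda lang: out_of_place(profiles[lang], sample))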

Answered by Wojciech Bederski