Google Translate: to use it, copy some text in the unknown language and head to Google Translate. Paste your text into the box on the left. As soon as you do, it should detect the language of the pasted text, showing [Language] - Detected above it, and translate it to English for you.
The idea behind language detection is based on the characters, expressions and words that appear in the text. The main principle is to detect commonly used words, like to and of in English. Python provides various modules for language detection; the first module covered here is langdetect.
First, import the detect method from langdetect and pass the text to it. For a Swahili sample, it detects that the provided text is in Swahili ('sw'). You can also find the probabilities for the top languages by using the detect_langs method.
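For example (a minimal sketch; the Swahili sample sentence is just an arbitrary illustration):
from langdetect import detect, detect_langs
text = "Habari ya asubuhi, leo ni siku nzuri"  # an arbitrary Swahili sample
print(detect(text))        # expected 'sw' (short samples can be less reliable)
print(detect_langs(text))  # e.g. [sw:0.99...] - candidate languages with probabilities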
Textblob: requires the NLTK package and uses Google.
from textblob import TextBlob
b = TextBlob("bonjour")
b.detect_language()
pip install textblob
Note: this solution requires internet access, since Textblob uses Google Translate's language detector by calling its API.
Polyglot: requires numpy and some arcane libraries; it is unlikely that you will get it working on Windows. (For Windows, get appropriate versions of PyICU, Morfessor and PyCLD2 from here, then just pip install downloaded_wheel.whl.) It is able to detect texts with mixed languages.
from polyglot.detect import Detector
mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state
located in East Asia.
"""
for language in Detector(mixed_text).languages:
    print(language)
# name: English code: en confidence: 87.0 read bytes: 1154
# name: Chinese code: zh_Hant confidence: 5.0 read bytes: 1755
# name: un code: un confidence: 0.0 read bytes: 0
pip install polyglot
To install the dependencies, run:
sudo apt-get install python-numpy libicu-dev
Note: Polyglot is using pycld2; see https://github.com/aboSamoor/polyglot/blob/master/polyglot/detect/base.py#L72 for details.
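For reference, a minimal sketch of calling pycld2 directly (the detector that polyglot wraps); the Spanish sample string is an arbitrary illustration:
import pycld2 as cld2
# detect() returns (is_reliable, bytes_found, details); details is a tuple of
# (language_name, language_code, percent, score) entries, best guess first.
is_reliable, bytes_found, details = cld2.detect("El sol brilla sobre la ciudad")
print(is_reliable)  # True when the guess is considered reliable
print(details[0])   # best guess, e.g. ('SPANISH', 'es', ...)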
Chardet also has a feature of detecting languages if there are character bytes in the range (127-255]:
>>> chardet.detect("Я люблю вкусные пампушки".encode('cp1251'))
{'encoding': 'windows-1251', 'confidence': 0.9637267119204621, 'language': 'Russian'}
pip install chardet
langdetect: requires large portions of text. It uses a non-deterministic approach under the hood, which means you can get different results for the same text sample. The docs say you have to use the following code to make it deterministic:
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
detect('今一はお前さん')
pip install langdetect
guess_language: can detect very short samples by using this spell checker with dictionaries.
pip install guess_language-spirit
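A minimal usage sketch, assuming the guess_language function exported by the spirit fork:
from guess_language import guess_language
print(guess_language("Ces eaux regorgent de renégats et de voleurs."))  # expected 'fr'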
langid.py provides both a module:
import langid
langid.classify("This is a test")
# ('en', -54.41310358047485)
and a command-line tool:
$ langid < README.md
pip install langid
FastText is a text classifier that can be used to recognize 176 languages with proper models for language classification. Download this model, then:
import fasttext
model = fasttext.load_model('lid.176.ftz')
print(model.predict('الشمس تشرق', k=2)) # top 2 matching languages
(('__label__ar', '__label__fa'), array([0.98124713, 0.01265871]))
pip install fasttext
pycld3 is a neural network model for language identification. This package contains the inference code and a trained model.
import cld3
cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)
pip install pycld3
Have you had a look at langdetect?
from langdetect import detect
lang = detect("Ein, zwei, drei, vier")
print(lang)
#output: de
If you are looking for a library that is fast with long texts, polyglot and fasttext are doing the best job here.
I sampled 10000 documents from a collection of dirty and random HTMLs, and here are the results:
+------------+----------+
| Library    | Time     |
+------------+----------+
| polyglot   | 3.67 s   |
+------------+----------+
| fasttext   | 6.41 s   |
+------------+----------+
| cld3       | 14 s     |
+------------+----------+
| langid     | 1min 8s  |
+------------+----------+
| langdetect | 2min 53s |
+------------+----------+
| chardet    | 4min 36s |
+------------+----------+
I have noticed that a lot of the methods focus on short texts, probably because that is the hard problem to solve: if you have a lot of text, it is really easy to detect languages (e.g. one could just use a dictionary!). However, this makes it difficult to find an easy and suitable method for long texts.
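To illustrate the "just use a dictionary" remark above, here is a purely illustrative sketch; the stop-word sets are tiny hand-picked samples, not real dictionaries:
# Count hits against small per-language stop-word sets and pick the best match.
STOPWORDS = {
    "en": {"the", "of", "and", "to", "in", "is"},
    "de": {"der", "die", "und", "den", "ist", "mit"},
    "fr": {"le", "la", "et", "les", "des", "est"},
}

def guess_by_stopwords(text):
    words = text.lower().split()
    scores = {lang: sum(w in vocab for w in words) for lang, vocab in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(guess_by_stopwords("The sun is shining and the birds are singing in the park"))  # 'en'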
@Rabash had a good list of tools on https://stackoverflow.com/a/47106810/610569
And @toto_tico did a nice job in presenting the speed comparison.
Here's a summary to complete the great answers above (as of 2021):
| Language ID software | Used by | Open Source / Model | Rule-based | Stats-based | Can train/tune |
|---|---|---|---|---|---|
| Google Translate Language Detection | TextBlob (limited usage) | ✕ | - | - | ✕ |
| Chardet | - | ✓ | ✓ | ✕ | ✕ |
| Guess Language (non-active development) | spirit-guess (updated rewrite) | ✓ | ✓ | Minimally | ✕ |
| pyCLD2 | Polyglot | ✓ | Somewhat | ✓ | Not sure |
| CLD3 | - | ✓ | ✕ | ✓ | Possibly |
| langid-py | - | ✓ | Not sure | ✓ | ✓ |
| langdetect | SpaCy-langdetect | ✓ | ✕ | ✓ | ✓ |
| FastText | What The Lang | ✓ | ✕ | ✓ | Not sure |
There is an issue with langdetect when it is used for parallelization: it fails. spacy_langdetect is a wrapper around it, and you can use it for that purpose. You can use the following snippet as well:
import spacy
from spacy_langdetect import LanguageDetector
nlp = spacy.load("en")
nlp.add_pipe(LanguageDetector(), name="language_detector", last=True)
text = "This is English text Er lebt mit seinen Eltern und seiner Schwester in Berlin. Yo me divierto todos los días en el parque. Je m'appelle Angélica Summer, j'ai 12 ans et je suis canadienne."
doc = nlp(text)
# document level language detection. Think of it like average language of document!
print(doc._.language['language'])
# sentence level language detection
for i, sent in enumerate(doc.sents):
    print(sent, sent._.language)