What is a relatively simple way to determine the probability that a sentence is in English?

Question

I have a number of strings (collections of characters) that represent sentences in different languages, say:

Hello, my name is George.

Das brot ist gut.

... etc.

I want to assign each of them a score (from 0 to 1) indicating the likelihood that it is an English sentence. Is there an accepted algorithm (or Python library) for doing this?

Note: I don't care if the grammar of the English sentence is perfect.

Raymond Hettinger · Accepted Answer

A Bayesian classifier would be a good choice for this task:

>>> from reverend.thomas import Bayes
>>> g = Bayes()    # guesser
>>> g.train('french','La souris est rentrée dans son trou.')
>>> g.train('english','my tailor is rich.')
>>> g.train('french','Je ne sais pas si je viendrai demain.')
>>> g.train('english','I do not plan to update my website soon.')

>>> print g.guess('Jumping out of cliffs it not a good idea.')
[('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]

>>> print g.guess('Demain il fera très probablement chaud.')
[('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]
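
The session above uses Python 2 syntax (the print statement). To show the underlying idea without a third-party package, here is a minimal Python 3 sketch of a naive Bayes guesser over character trigrams. It is not a description of how reverend works internally: the class name NaiveBayesGuesser, the trigrams helper, the add-one smoothing, and the uniform prior over languages are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Return the overlapping character trigrams of the lowercased text.

    Two spaces of padding on each side let word starts and ends count too.
    """
    padded = "  " + text.lower() + "  "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class NaiveBayesGuesser:
    """Hypothetical toy classifier; not the reverend API."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # label -> trigram counts
        self.vocab = set()                  # every trigram seen in training

    def train(self, label, text):
        grams = trigrams(text)
        self.counts[label].update(grams)
        self.vocab.update(grams)

    def guess(self, text):
        """Return (label, probability) pairs, most likely label first."""
        if not self.counts:
            return []
        grams = trigrams(text)
        size = len(self.vocab) + 1  # one extra slot for unseen trigrams
        log_scores = {}
        for label, counter in self.counts.items():
            total = sum(counter.values())
            # Uniform prior over labels, add-one (Laplace) smoothing.
            log_scores[label] = sum(
                math.log((counter[g] + 1) / (total + size)) for g in grams
            )
        # Normalise in log space so the scores sum to 1 without underflow.
        top = max(log_scores.values())
        weights = {k: math.exp(s - top) for k, s in log_scores.items()}
        norm = sum(weights.values())
        return sorted(
            ((k, w / norm) for k, w in weights.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )


g = NaiveBayesGuesser()
g.train('french', 'La souris est rentrée dans son trou.')
g.train('english', 'my tailor is rich.')
g.train('french', 'Je ne sais pas si je viendrai demain.')
g.train('english', 'I do not plan to update my website soon.')

english_guess = g.guess('Jumping out of cliffs is not a good idea.')
french_guess = g.guess('Demain il fera très probablement chaud.')
print(english_guess)
print(french_guess)
```

The probability attached to 'english' is the 0-to-1 score the question asks for. With only four training sentences the numbers are not meaningful in themselves; a usable model would need far more text per language.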

Tags:

python

string

nlp

bayesian

sdasdadas
