I'm wondering if it is possible to use Stanford CoreNLP to detect which language a sentence is written in? If so, how precise can those algorithms be?
Almost certainly there is no language identification in Stanford CoreNLP at the moment. I say 'almost' because nonexistence is much harder to prove.
EDIT: Nevertheless, here is some circumstantial evidence:
There are Language classes, but nothing related to language identification; you can check manually all 84 occurrences of the word 'language' here.
Try Tika, TextCat, or the Language Detection Library for Java (they report "99% over precision for 53 languages").
In general, quality depends on the size of the input text: if it is long enough (say, at least several words, and not specially chosen), then precision can be pretty good, about 95%.
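To get a feel for why input length matters, here is a toy sketch of rank-ordered character n-gram matching, the general idea behind TextCat-style tools. It is not the code of any library mentioned above: the names `ngram_profile`, `out_of_place`, `detect`, and `SAMPLES` are made up for illustration, and the training text is a single sentence per language, so treat it as a demonstration of the technique only.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k character n-grams (lengths 1..n_max), most frequent first."""
    # Normalise whitespace and pad so word boundaries show up in the n-grams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )

# Hypothetical, tiny training samples; a real tool trains on far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then it runs away "
          "from the house because the weather is bad and there is nothing else to do",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann läuft er "
          "vom haus weg weil das wetter schlecht ist und es nichts anderes zu tun gibt",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et puis il "
          "s'enfuit de la maison parce que le temps est mauvais et qu'il n'y a rien d'autre à faire",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

def detect(text):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(PROFILES, key=lambda lang: out_of_place(doc_profile, PROFILES[lang]))

if __name__ == "__main__":
    print(detect("where is the nearest train station"))
```

A one- or two-word input yields only a handful of n-grams, so the distances to each language profile end up close together and the guess becomes unreliable; longer inputs give a richer profile to compare.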
Stanford CoreNLP doesn't have language ID (at least not yet), see http://nlp.stanford.edu/software/corenlp.shtml
There are loads more language detection/identification tools. But do take the reported precision with a pinch of salt. It is usually evaluated narrowly, bounded by:
Notable language ID tools include:
An exhaustive list from meta-guide.com, see http://meta-guide.com/software-meta-guide/100-best-github-language-identification/
Noteworthy language identification shared tasks (with training/testing data) include:
Also take a look at: