 

How to detect language

Are there any good open-source engines for detecting what language a text is in, ideally with a probability metric? It should run locally and not query Google or Bing. I'd like to detect the language of each page in about 15 million pages of OCR'ed text.

Not all of the documents are in languages that use the Latin alphabet.

niklassaers asked Jul 03 '10 22:07




1 Answer

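You can surely build your own, given some statistics about letter frequencies, digraph (letter-pair) frequencies, etc., for each of your target languages. Compare the same statistics from an unknown text against each language's profile and pick the closest match.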

Then release it as open source. And voilà, you have an open-source engine for detecting the language of text!
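As a rough illustration of the digraph-frequency idea, here is a minimal Python sketch. Everything in it is an assumption made for the example: the tiny `SAMPLES` texts, the `bigram_profile` and `detect` helper names, and the choice of cosine similarity as the comparison. The normalized scores are only a crude confidence measure, not a calibrated probability, and a usable detector would need far more training text per language.

```python
from collections import Counter
import math

def bigram_profile(text):
    """Relative frequencies of character pairs (digraphs), spaces included."""
    # Keep letters from any alphabet; turn everything else into a single space.
    cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
    cleaned = " ".join(cleaned.split())
    pairs = Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))
    total = sum(pairs.values())
    return {p: n / total for p, n in pairs.items()} if total else {}

def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy training data; real profiles need much larger corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
    "ru": "быстрая коричневая лиса прыгает через ленивую собаку и потом собака спит на солнце",
}
PROFILES = {lang: bigram_profile(text) for lang, text in SAMPLES.items()}

def detect(text):
    """Return (best_language, {language: normalized_score}), or (None, {}) if nothing matches."""
    query = bigram_profile(text)
    scores = {lang: cosine(query, prof) for lang, prof in PROFILES.items()}
    total = sum(scores.values())
    if total == 0:
        return None, {}
    normalized = {lang: s / total for lang, s in scores.items()}
    return max(normalized, key=normalized.get), normalized

if __name__ == "__main__":
    for page in ["the dog is in the sun", "der hund ist in der sonne", "собака на солнце"]:
        print(page, "->", detect(page))
```

Because the profiles are built from Unicode characters rather than ASCII letters, the same approach carries over to non-Latin scripts, although noisy OCR output would probably call for longer n-grams and more training data.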

Dolph answered Sep 20 '22 17:09
