Are there any good, open source engines out there for detecting what language a text is in, perhaps with a probability metric? One that I can run locally and doesn't query Google or Bing? I'd like to detect language for each page in about 15 million pages of OCR'ed text.
Not all documents will contain languages which use the Latin alphabet.
Google Translate is simple and easy to use. It supports translation between 103 languages by typing, and you can download 52 of those languages for offline use when you don't have access to the Internet.
Google Translate's camera can also detect languages automatically, so you can point your camera at a flyer or sign and get results in your native tongue even if you don't know what language you're reading.
If you need to determine the language of an entire web page or an online document, paste the URL of that page into the Google Translate box and choose "Detect Language" as the source language.
You could build your own from statistics about letter frequencies, digraph frequencies, etc., for each of your target languages.
Then release it as open source. And voilà, you have an open-source engine for detecting the language of a text!
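For what it's worth, the letter/digraph-frequency idea can be sketched in a few lines of Python. This is only a minimal illustration, not a production detector: the function names (`ngram_profile`, `cosine`, `detect`) and the tiny `SAMPLES` texts are made up for this example, and real profiles would have to be built from a large corpus per language. The scores are cosine similarities normalised to sum to 1, so treat them as a rough confidence ranking rather than a calibrated probability.

```python
import math
from collections import Counter


def ngram_profile(text, n=2):
    """Relative frequencies of character n-grams (n=1: letters, n=2: digraphs)."""
    # Keep letters of any script; turn everything else into single spaces
    # so that word boundaries still show up in the n-grams.
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    cleaned = " ".join(cleaned.split())
    grams = Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def detect(text, profiles):
    """Return (language, score) pairs, best first; scores sum to 1.

    Returns an empty list when the text shares no n-grams with any profile.
    """
    target = ngram_profile(text)
    sims = {lang: cosine(target, prof) for lang, prof in profiles.items()}
    total = sum(sims.values())
    if total == 0:
        return []
    return sorted(((lang, s / total) for lang, s in sims.items()),
                  key=lambda pair: -pair[1])


# Hypothetical, far-too-small training samples, for illustration only.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then it runs "
          "through the forest with the others",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann "
          "läuft er durch den wald mit den anderen",
    "ru": "быстрая коричневая лиса прыгает через ленивую собаку и затем "
          "бежит через лес вместе с другими",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    for page in ("the dog runs through the forest",
                 "der hund läuft durch den wald",
                 "лиса бежит через лес"):
        print(page, "->", detect(page, PROFILES))
```

Because the profiles are keyed on Unicode characters, non-Latin scripts need no special handling. With noisy OCR output you would probably want longer n-grams (trigrams are a common choice) and a minimum text length below which a page is reported as undetermined.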