My product is weaker than a competitor's at auto-detecting the encoding of SRT subtitle files. I can auto-detect the encoding of SMI files, since they carry language info in their header, but I cannot do that for SRT. How can I add this auto-detection for SRT files? Any good references on the algorithm that I could study as a first step would be appreciated. FYI, my product should support Western European, Central European, Cyrillic, Greek, Turkish, Hebrew, Arabic, Baltic, Korean, Simplified Chinese, Traditional Chinese, Vietnamese, and Thai.
There are plenty of tools to detect the charset of a text file (e.g. SRT files). For example, on the command line of a Linux machine you can use chardet:
chardet subtitle_file_name.srt
This utility must first be installed with pip (the Python package installer). On Ubuntu:
sudo apt-get install python-pip
pip install chardet
If you need to integrate a detector into your application, there are also open-source libraries that do the job. For example, in my tool DualSub, which is implemented in Java, I used juniversalchardet.
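As a rough illustration of what integrating a detector can look like, here is a minimal Python sketch. The function names `guess_encoding` and `read_subtitle` and the `cp1252` default are my own assumptions for the example, not part of chardet or DualSub. It checks for a byte-order mark, then tries strict UTF-8, and only then falls back to the statistical guess from `chardet.detect` if chardet is installed.

```python
import codecs

# Byte-order marks to check first. UTF-32 must come before UTF-16 because
# the UTF-32-LE BOM begins with the same bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw: bytes, default: str = "cp1252") -> str:
    """Return a best-guess codec name for raw subtitle bytes (hypothetical helper)."""
    # 1. A BOM, if present, is the most reliable signal.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # 2. Legacy single-byte and CJK text rarely decodes as valid UTF-8 by accident.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Fall back to chardet's statistical guess when the package is available.
    try:
        import chardet
    except ImportError:
        return default  # assumption: Western European as a last resort
    result = chardet.detect(raw)  # dict with "encoding" and "confidence" keys
    return result["encoding"] or default


def read_subtitle(path: str) -> str:
    """Read a subtitle file and decode it with the guessed encoding."""
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode(guess_encoding(raw), errors="replace")
```

The statistical step is only a guess: short files and closely related code pages (for example the various Windows-125x sets) can be misidentified, so it may be worth checking the `confidence` value and letting the user override the result.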