I have some subtitle files in UTF-8. Occasionally these files contain stray multibyte characters, which cause problems in some applications.
How do I check on Linux whether a given file contains any multibyte characters (and, if possible, locate them)?
A multibyte character is a character encoded as a sequence of one or more bytes. Each byte sequence represents a single character in the extended character set. Multibyte characters are used in character sets such as Kanji. Wide characters, by contrast, are fixed-width multilingual character codes (16 bits wide on Windows, typically 32 bits wide on Linux).
A multibyte character set can consist of both 1-byte and 2-byte characters. A multibyte-character string can contain a mixture of single-byte and double-byte characters. A two-byte multibyte character has a lead byte and a trail byte.
A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character). Each character stored in the string may occupy more than one byte.
The single-byte code sets have at most 256 characters and the multibyte code sets have more than 256 (without any theoretical limit).
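In UTF-8 specifically, ASCII characters occupy one byte and every other character occupies two to four bytes. A quick way to see this from the shell is to count bytes with wc -c. The sketch below is only an illustration: it writes the Thai character as octal escapes so it does not depend on the terminal's encoding, and the variable names are arbitrary.

```shell
# Count the bytes used by one ASCII character and one Thai character in UTF-8.
ascii_bytes=$(printf 't' | wc -c)
thai_bytes=$(printf '\340\270\201' | wc -c)   # U+0E01 "ก", encoded as bytes E0 B8 81

echo "ASCII 't' uses $ascii_bytes byte(s); Thai 'ก' uses $thai_bytes byte(s)"
```

Any character that reports more than one byte here is a multibyte character in the sense of the question.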
You can use the file command:
chalet16$ echo test > a.txt
chalet16$ echo testก > b.txt # ก is a Thai character
chalet16$ file *.txt
a.txt: ASCII text
b.txt: UTF-8 Unicode text
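Note that file only reports on the file as a whole. To locate the offending lines, one option is to search for any byte outside the 7-bit ASCII range, since every byte of a multibyte UTF-8 sequence has its high bit set. The following is a minimal sketch, assuming GNU grep built with PCRE support (the -P option); the function name find_multibyte and the file name sample.txt are made up for illustration.

```shell
# find_multibyte FILE...
# Print "line-number:line" for every line that contains a non-ASCII byte.
# Assumes GNU grep with -P (PCRE) support.
#   LC_ALL=C  makes grep match raw bytes instead of decoded characters
#   -a        treats the input as text even if grep would consider it binary
#   -n        prefixes each matching line with its line number
find_multibyte() {
    LC_ALL=C grep -a -n -P '[\x80-\xFF]' -- "$@"
}

# Example: build a two-line sample file and scan it.
# The second line ends with the Thai character "ก" (bytes E0 B8 81).
printf 'plain line\ntest\340\270\201\n' > sample.txt
find_multibyte sample.txt
# prints: 2:testก
```

Adding grep's -o option would print only the matching bytes rather than the whole line, and --color=auto would highlight them in a terminal. When no line matches, grep exits with status 1, which makes the function usable as a yes/no check in scripts.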
You can use the file or chardet command.