
Check if file contains multibyte character

I have some subtitle files in UTF-8. Occasionally these files contain stray multibyte characters, which cause problems in some applications.

How do I check in Linux whether a given file contains any multibyte characters, and if possible locate them?

Masroor asked Apr 29 '12 15:04




2 Answers

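You can use the file command, which reports a pure-ASCII file as ASCII text and a file containing multibyte characters as UTF-8 text: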

chalet16$ echo test > a.txt
chalet16$ echo testก > b.txt  # ก is a Thai character
chalet16$ file *.txt
a.txt: ASCII text
b.txt: UTF-8 Unicode text
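
file only classifies the file as a whole. To find where the multibyte characters are, one option is grep. The following is a minimal sketch, assuming GNU grep built with PCRE support (the -P option); the function names has_non_ascii and non_ascii_lines are made up for illustration. Every byte of a multibyte UTF-8 sequence is 0x80 or above, so matching that byte range under LC_ALL=C should find them:

```shell
# Hypothetical helpers; assumes GNU grep with -P (PCRE) support.
# LC_ALL=C makes grep treat the input as raw bytes rather than decoded characters.

# Succeeds (exit status 0) if the file contains at least one non-ASCII byte.
has_non_ascii() {
    LC_ALL=C grep -q -P '[\x80-\xFF]' "$1"
}

# Prints every line containing a non-ASCII byte, prefixed with its line number.
non_ascii_lines() {
    LC_ALL=C grep -n -P '[\x80-\xFF]' "$1"
}
```

For example, non_ascii_lines b.txt lists the offending lines of b.txt with their line numbers, and has_non_ascii a.txt && echo found prints nothing for a pure-ASCII file. Adding --color=auto to the grep call highlights the matched bytes on a terminal.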
chalet16 answered Sep 24 '22 13:09


You can use the file command or the chardet command; both report the encoding they detect for a file.

kev answered Sep 23 '22 13:09