
Elegant way to search for UTF-8 files with BOM?

For debugging purposes, I need to recursively search a directory for all files which start with a UTF-8 byte order mark (BOM). My current solution is a simple shell script:

find -type f |
while read file
do
    if [ "`head -c 3 -- "$file"`" == $'\xef\xbb\xbf' ]
    then
        echo "found BOM in: $file"
    fi
done

Or, if you prefer short, unreadable one-liners:

find -type f|while read file;do [ "`head -c3 -- "$file"`" == $'\xef\xbb\xbf' ] && echo "found BOM in: $file";done

It doesn't work with filenames that contain a line break, but such files are not to be expected anyway.

Is there any shorter or more elegant solution?

Are there any interesting text editors or macros for text editors?

asked Oct 15 '08 13:10 by vog


People also ask

How do I find BOM files?

To check whether a file has a BOM, open it in Notepad++ and look at the bottom-right corner. If it says UTF-8-BOM, the file starts with a BOM.

Should you use UTF-8 with BOM?

The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM.

What is the difference between UTF-8 and UTF-8 BOM?

The UTF-8 BOM is a sequence of bytes at the start of a text stream ( 0xEF, 0xBB, 0xBF ) that allows the reader to more reliably guess a file as being encoded in UTF-8. Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

How do I add BOM to UTF-8?

To add a BOM to a UTF-8 file, write the Unicode character \ufeff, or equivalently the three bytes 0xEF, 0xBB, 0xBF, at the very beginning of the file. The code point U+FEFF is encoded in UTF-8 as exactly those three bytes.
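As a rough illustration of the step described above, the following sketch prepends the three BOM bytes to a file from the shell. The function name `add_bom` is made up for this example, and the sketch assumes `mktemp` is available. It does not check whether the file already starts with a BOM.

```shell
# add_bom FILE: prepend the UTF-8 BOM (EF BB BF) to FILE.
# Hypothetical helper; it does not check for an existing BOM.
add_bom() {
    file=$1
    tmp=$(mktemp) || return 1
    # printf with octal escapes writes the raw bytes 0xEF 0xBB 0xBF,
    # followed by the original content.
    { printf '\357\273\277'; cat -- "$file"; } > "$tmp" &&
        cat -- "$tmp" > "$file"   # write back in place, keeping permissions
    rm -f -- "$tmp"
}
```

Usage would look like `add_bom notes.txt`, where `notes.txt` is a placeholder filename.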


1 Answer

What about this one simple command which not just finds but clears the nasty BOM? :)

find . -type f -exec sed '1s/^\xEF\xBB\xBF//' -i {} \; 

I love "find" :)

Warning: the above will also modify binary files that begin with those three bytes.
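One way to limit that risk is to strip the BOM only from files you name explicitly, and only when their first three bytes really are EF BB BF. The following is a minimal sketch, not the answer's method: the function name `strip_bom` is invented here, and it assumes `head -c`, `tail -c +N` and `mktemp` are available.

```shell
# strip_bom FILE...: remove a leading UTF-8 BOM from each FILE that has one.
# Hypothetical helper; files without a leading BOM are left untouched.
strip_bom() {
    for file do
        # Dump the first three bytes as hex and compare with the BOM.
        if [ "$(head -c 3 -- "$file" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
            tmp=$(mktemp) || return 1
            # tail -c +4 prints everything from the fourth byte onward.
            tail -c +4 -- "$file" > "$tmp" && cat -- "$tmp" > "$file"
            rm -f -- "$tmp"
        fi
    done
}
```

It could be called as `strip_bom ./*.php`, for example, so that only files matching a pattern you trust to be text are ever rewritten.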

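If you just want to list the files that contain the BOM bytes, use this one (note that it matches the bytes anywhere in a file, not only at the start):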

grep -rl $'\xEF\xBB\xBF' . 
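
The grep command above reports a file if the three bytes occur anywhere in it. If only files that start with the BOM should be listed, a sketch along the following lines may help. The function name `list_bom_files` is made up for this example, and it assumes a `head` that supports `-c`. Because `find -exec ... {} +` passes filenames as arguments, names containing spaces or line breaks are handled as well.

```shell
# list_bom_files [DIR]: print every regular file under DIR (default: .)
# whose first three bytes are the UTF-8 BOM. Hypothetical helper.
list_bom_files() {
    find "${1:-.}" -type f -exec sh -c '
        for file do
            # Dump the first three bytes as hex and compare with EF BB BF.
            if [ "$(head -c 3 -- "$file" | od -An -tx1 | tr -d " \n")" = "efbbbf" ]; then
                printf "%s\n" "$file"
            fi
        done
    ' sh {} +
}
```

Running `list_bom_files .` would then search the current directory recursively.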
answered Sep 27 '22 22:09 by Denis