
Adding BOM to UTF-8 files

I'm searching (without success) for a script that would work as a batch file and prepend a BOM to a UTF-8 text file if it doesn't already have one.

Neither the language it is written in (Perl, Python, C, bash) nor the OS it runs on matters to me. I have access to a wide range of computers.

I've found a lot of scripts that do the reverse (strip the BOM), which seems kind of silly to me, since many Windows programs have trouble reading UTF-8 text files that don't have a BOM.

Did I miss the obvious?

Thanks!

asked Jun 27 '10 13:06 by Stephane


People also ask

Should you use UTF-8 with BOM?

The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM.

How do I save a UTF-8 file as BOM?

Select “Save As” from the File menu, open the dropdown menu on the Save button, select “Save with Encoding…”, and choose “Unicode (UTF-8 with signature)”.

What is a UTF-8 BOM file?

The UTF-8 file signature (commonly also called a "BOM") identifies the encoding format rather than the byte order of the document. UTF-8 is a linear sequence of bytes, not a sequence of 2-byte or 4-byte units where the byte order is important.

What is the difference between UTF-8 and UTF-8 BOM?

The UTF-8 BOM is a sequence of bytes at the start of a text stream ( 0xEF, 0xBB, 0xBF ) that allows the reader to more reliably guess a file as being encoded in UTF-8. Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.
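To make that byte sequence concrete, here is a minimal shell sketch that checks whether a file starts with 0xEF 0xBB 0xBF. The function name `has_utf8_bom` is made up for this example, and the sketch assumes a POSIX-style `od` is available.

```shell
#!/bin/sh
# has_utf8_bom FILE (hypothetical helper name):
# succeeds (exit status 0) if FILE begins with the bytes EF BB BF.
has_utf8_bom() {
    # od dumps at most the first three bytes as hex;
    # tr strips the spaces and newline that od adds around them.
    first3=$(od -An -tx1 -N3 "$1" | tr -d ' \t\n')
    [ "$first3" = "efbbbf" ]
}
```

It could be used as `if has_utf8_bom notes.txt; then echo "has BOM"; fi`, where `notes.txt` stands for any file. An empty or shorter-than-three-bytes file is reported as having no BOM.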


1 Answer

I wrote this addbom.sh using the 'file' command and ICU's 'uconv' command.

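    #!/bin/sh
    # addbom.sh: add a UTF-8 BOM to each file that does not already have one.

    if [ $# -eq 0 ]
    then
            echo usage $0 files ...
            exit 1
    fi

    for file in "$@"
    do
            echo "# Processing: $file" 1>&2
            if [ ! -f "$file" ]
            then
                    echo Not a file: "$file" 1>&2
                    exit 1
            fi
            # file(1) describes the content and reports "(with BOM)" when a signature is present.
            TYPE=`file - < "$file" | cut -d: -f2`
            if echo "$TYPE" | grep -q '(with BOM)'
            then
                    echo "# $file already has BOM, skipping." 1>&2
            else
                    # Keep the original as "$file~" and write the BOM-prefixed copy in its place.
                    # Braces (not a subshell) so that "exit 1" really stops the script on failure.
                    ( mv "${file}" "${file}"~ && uconv -f utf-8 -t utf-8 --add-signature < "${file}~" > "${file}" ) || { echo Error processing "$file" 1>&2 ; exit 1; }
            fi
    done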

edit: Added quotes around the mv arguments. Thanks @DirkR and glad this script has been so helpful!
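If ICU's `uconv` isn't installed, roughly the same effect can be had with standard tools only. The following is a sketch under stated assumptions, not a replacement for the script above: the helper name `add_bom` is made up, the BOM check reads the first three bytes with `od` instead of relying on `file`, and, unlike `uconv`, nothing verifies that the input really is valid UTF-8.

```shell
#!/bin/sh
# add_bom FILE (hypothetical helper name): prepend EF BB BF unless already there.
# Assumption: FILE is already UTF-8; this sketch does not validate the encoding.
add_bom() {
    f=$1
    [ -f "$f" ] || { echo "Not a file: $f" 1>&2; return 1; }

    # Read the first three bytes as hex and compare with the UTF-8 signature.
    first3=$(od -An -tx1 -N3 "$f" | tr -d ' \t\n')
    if [ "$first3" = "efbbbf" ]
    then
        echo "# $f already has BOM, skipping." 1>&2
        return 0
    fi

    # Keep a backup as "$f~" (as addbom.sh does), then rewrite the file
    # in place as: signature bytes followed by the original content.
    cp -p "$f" "$f~" || return 1
    { printf '\357\273\277' && cat "$f~"; } > "$f"
}

# Process every file named on the command line; stop at the first failure.
for file in "$@"
do
    add_bom "$file" || exit 1
done
```

Because the file is rewritten in place rather than replaced, it keeps its permissions, but the write is not atomic; the `~` backup is there in case it is interrupted.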

answered Oct 01 '22 02:10 by Steven R. Loomis