I'm trying to write a script that will automatically remove UTF-8 BOMs from a file. I'm having trouble detecting whether the file has one in the first place or not. Here is my code:
function has-bom {
    # Test if the file starts with 0xEF, 0xBB, and 0xBF
    head -c 3 "$1" | grep -P '\xef\xbb\xbf'
    return $?
}
For some reason, head seems to be ignoring the BOM at the front of the file. As an example, running this
printf '\xef\xbb\xbf' > file
head -c 3 file
won't print anything.
I tried looking for an option in head --help that would let me work around this, but no luck. Is there anything I can do to make this work?
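First, let's demonstrate that head is actually working correctly. The three bytes are there; they just do not show up as visible characters when printed to a terminal: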
$ printf '\xef\xbb\xbf' >file
$ head -c 3 file
$ head -c 3 file | hexdump -C
00000000  ef bb bf                                          |...|
00000003
Now, let's create a working function has_bom. If your grep supports -P, then one option is:
$ has_bom() { head -c3 "$1" | LC_ALL=C grep -qP '\xef\xbb\xbf'; }
$ has_bom file && echo yes
yes
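Currently, only GNU grep supports -P. The LC_ALL=C prefix matters here: it makes grep match raw bytes, whereas in a UTF-8 locale the pattern \xef would be read as the character U+00EF rather than the byte 0xEF, which is likely why the version in the question finds no match.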
Another option is to use bash's $'...':
$ has_bom() { head -c3 "$1" | grep -q $'\xef\xbb\xbf'; }
$ has_bom file && echo yes
yes
ksh and zsh also support $'...' but this construct is not POSIX and dash does not support it.
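If neither grep -P nor $'...' is available, one possible fallback is to compare the first three bytes as hex text. The following is only a sketch, not part of the original answer: it assumes od and tr behave as POSIX describes, uses mktemp for the demo file (widely available, but not POSIX), and writes the BOM with octal escapes because \x in printf is not guaranteed outside bash.

```shell
# has_bom FILE: exit status 0 if FILE starts with the bytes EF BB BF.
has_bom() {
    # od -N3 reads at most 3 bytes, -An drops the offset column,
    # -tx1 prints each byte as two hex digits; tr removes the spacing.
    [ "$(od -An -tx1 -N3 < "$1" | tr -d '[:space:]')" = "efbbbf" ]
}

# Demo on a throwaway file (hypothetical content, for illustration only).
demo=$(mktemp)
printf '\357\273\277hello\n' > "$demo"
has_bom "$demo" && echo yes
rm -f "$demo"
```

Reading the file through a redirect rather than passing its name to od means a file name that starts with - cannot be mistaken for an option.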
Notes:
The use of an explicit return $? is optional. The function will, by default, return with the exit code of the last command run.
I have used the POSIX form for defining functions. This is equivalent to the bash form but gives you one less problem to deal with if you ever have to run the function under another shell.
bash does accept the use of the character - in a function name but this is a controversial feature. I replaced it with _ which is more widely accepted. (For more on this issue, see this answer.)
The -q option to grep makes it quiet, meaning that it still sets a proper exit code but it does not send any characters to stdout.
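Since the stated goal was to remove the BOM automatically, here is a rough sketch of how a detection function could be put to use. The strip_bom function, the _tmp and _rc variables, and the od-based has_bom are all assumptions made for illustration so that the block runs under plain sh; any of the has_bom versions above could be substituted. mktemp is assumed to be available.

```shell
# has_bom FILE: exit status 0 if FILE starts with the bytes EF BB BF.
has_bom() {
    [ "$(od -An -tx1 -N3 < "$1" | tr -d '[:space:]')" = "efbbbf" ]
}

# strip_bom FILE: rewrite FILE without its leading BOM, if it has one.
strip_bom() {
    has_bom "$1" || return 0        # no BOM, nothing to do
    _tmp=$(mktemp) || return 1
    # tail -c +4 outputs everything from the 4th byte onward.
    if tail -c +4 "$1" > "$_tmp"; then
        cat "$_tmp" > "$1"          # overwrite contents, keep the original inode and mode
        _rc=$?
    else
        _rc=1
    fi
    rm -f "$_tmp"
    return "$_rc"
}

# Demo on a throwaway file.
demo=$(mktemp)
printf '\357\273\277hello\n' > "$demo"
strip_bom "$demo"
od -An -tx1 -N3 < "$demo"           # the file now starts with 68 65 6c ("hel")
rm -f "$demo"
```

Because has_bom only reports through its exit code, it slots directly into || and if without any output to discard. Writing through a temporary file avoids reading and truncating the same file in one pipeline.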