How to detect if a file has a UTF-8 BOM in Bash?

Tags: linux, bash, unix, encoding, utf-8

I'm trying to write a script that will automatically remove UTF-8 BOMs from a file. I'm having trouble detecting whether the file has one in the first place or not. Here is my code:

    function has-bom {
        # Test if the file starts with 0xEF, 0xBB, and 0xBF
        head -c 3 "$1" | grep -P '\xef\xbb\xbf'
        return $?
    }

For some reason, head seems to be ignoring the BOM in front of the file. As an example, running this

    printf '\xef\xbb\xbf' > file
    head -c 3 file

won't print anything.

I tried looking for an option in head --help that would let me work around this, but no luck. Is there anything I can do to make this work?

asked Nov 28 '15 23:11

James Ko

1 Answer

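First, let's demonstrate that head is actually working correctly. It does write the three bytes to the terminal, but the BOM (U+FEFF) has no visible glyph, so nothing appears until the output is piped through hexdump: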

    $ printf '\xef\xbb\xbf' >file
    $ head -c 3 file
    $ head -c 3 file | hexdump -C
    00000000  ef bb bf                                          |...|
    00000003

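Now, let's create a working function has_bom. If your grep supports -P, then one option is the following, where LC_ALL=C makes grep treat the pattern and the input as raw bytes rather than as UTF-8 characters: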

    $ has_bom() { head -c3 "$1" | LC_ALL=C grep -qP '\xef\xbb\xbf'; }
    $ has_bom file && echo yes
    yes

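The -P option (Perl-compatible regular expressions) is a GNU grep extension and is not specified by POSIX, so do not expect it from other grep implementations.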

Another option is to use bash's $'...':

    $ has_bom() { head -c3 "$1" | grep -q $'\xef\xbb\xbf'; }
    $ has_bom file && echo yes
    yes

ksh and zsh also support $'...' but this construct is not POSIX and dash does not support it.
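
For shells without $'...', one possible workaround is to compare a hex dump of the first three bytes instead of matching raw bytes. This is a minimal sketch, not part of the original answer: it assumes the POSIX options of od (-An to drop offsets, -tx1 for one-byte hex, -N3 to read three bytes), and it reuses the name has_bom purely for illustration.

```shell
# Sketch of a has_bom that avoids both grep -P and $'...'.
has_bom() {
    # od prints the first three bytes as hex, e.g. " ef bb bf";
    # tr strips the whitespace so the result can be compared as a string.
    [ "$(od -An -tx1 -N3 "$1" | tr -d ' \t\n')" = "efbbbf" ]
}
```

It is called the same way as the versions above, for example has_bom file && echo yes. A file shorter than three bytes simply produces a shorter string, so the comparison fails and the function returns non-zero.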

Notes:

1. The use of an explicit return $? is optional. The function will, by default, return with the exit code of the last command run.
2. I have used the POSIX form for defining functions. This is equivalent to the bash form but gives you one less problem to deal with if you ever have to run the function under another shell.
3. bash does accept the use of the character - in a function name but this is a controversial feature. I replaced it with _ which is more widely accepted. (For more on this issue, see this answer.)
4. The -q option to grep makes it quiet, meaning that it still sets a proper exit code but it does not send any characters to stdout.
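
Since the stated goal is to remove the BOM once it is detected, the check can be combined with tail -c +4, which starts output at the fourth byte. The following is only a sketch under assumptions: strip_bom is a hypothetical name, mktemp is assumed to be available (it is common but not required by POSIX), and has_bom is repeated here in an od-based form so that the block is self-contained.

```shell
# has_bom: true if the file starts with the bytes EF BB BF.
has_bom() {
    [ "$(od -An -tx1 -N3 "$1" | tr -d ' \t\n')" = "efbbbf" ]
}

# strip_bom (hypothetical helper): remove a leading UTF-8 BOM in place.
strip_bom() {
    has_bom "$1" || return 0      # no BOM, nothing to do
    tmp=$(mktemp) || return 1     # assumes mktemp exists; tmp is not local
    # Copy everything from byte 4 onward, then write it back over the
    # original so the file keeps its inode and permissions.
    tail -c +4 "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Calling strip_bom file on a file without a BOM leaves it untouched, so running it repeatedly over the same set of files should be harmless.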

answered Sep 21 '22 20:09

John1024
