How to make a pre-commit hook that prevents non-UTF-8 file encodings

Is it possible to make a pre-commit hook for Git or SVN that rejects files not committed in a specific encoding?

I have worked on several projects where sticking to a single file encoding (UTF-8, for instance) turned out to be a problem.

Jesper Rønn-Jensen asked Jun 30 '10 11:06


3 Answers

Your iconv may be able to tell you whether something is not UTF-8, but other encodings are harder to verify, especially 8-bit, single-byte encodings like ISO-8859-1, where nearly any byte sequence decodes without error.
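
To make the difference concrete, here is a minimal sketch, assuming an iconv that rejects invalid UTF-8 input (the usual behaviour, but worth testing on your system). The helper names is_utf8 and is_latin1 are made up for illustration. The UTF-8 check can fail, while the ISO-8859-1 check passes for essentially any input, so it tells you nothing.

```shell
#!/bin/sh
# Sketch only: is_utf8 and is_latin1 are hypothetical helper names.

# Succeeds only if the file named by $1 decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}

# The same test for ISO-8859-1. It accepts almost anything, because
# iconv maps every byte value to a character in that encoding.
is_latin1() {
    iconv -f ISO-8859-1 -t UTF-8 < "$1" > /dev/null 2>&1
}

# Report both results for each file named on the command line.
for f in "$@"; do
    if is_utf8 "$f"; then u=yes; else u=no; fi
    if is_latin1 "$f"; then l=yes; else l=no; fi
    echo "$f: valid UTF-8: $u, valid ISO-8859-1: $l"
done
```

Saved as a script, it can be run on a few files of known encoding to see which checks are actually able to fail.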

For Git, you may actually want an update hook instead of a pre-commit hook (so that it can be run in a central repository to enforce the rule).

Git pre-commit hook:

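#!/bin/sh
# Check the staged (index) version of every tracked file.
git ls-files -z -- |
xargs -0 sh -c '
    e=""
    for f; do
        # ":$f" names the staged copy of the file, not the working-tree copy.
        if ! git show :"$f" |
             iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            e=1
            echo "Not UTF-8: $f"
            #exit 255 # to abort after first non-UTF-8 file
        fi
    done
    # A non-zero status here makes xargs, and so the hook, fail.
    test -z "$e"
' -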

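Put one or more Git pathspecs after the -- on the git ls-files command line to limit the pathnames that are checked. Without pathspecs, every tracked file is checked, including binary files, which will usually be reported as not UTF-8.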
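
As a rough illustration of limiting the checked paths, the sketch below shows two ways to produce the file list. The function names and the *.java / *.properties patterns are assumptions for the example, not part of the hook above. The second function goes a step further than pathspecs and lists only files that are staged as added, copied, or modified.

```shell
#!/bin/sh
# Sketch only: the function names and the *.java / *.properties
# patterns are illustrative assumptions.

# NUL-separated list of tracked files matching the pathspecs.
list_tracked_sources() {
    git ls-files -z -- '*.java' '*.properties'
}

# NUL-separated list of files added, copied or modified in the index
# relative to HEAD, optionally limited by pathspecs given as arguments.
list_staged() {
    git diff --cached --name-only --diff-filter=ACM -z -- "$@"
}
```

Either function could take the place of git ls-files -z -- at the head of the pipeline, since both emit NUL-separated names suitable for xargs -0.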

To check the tip of the updated ref in an update hook, generate the pathnames with git ls-tree --name-only -r -z "$3" -- | and extract each file's contents with git show "$3:$f". Git passes the update hook the ref name, the old object name, and the new object name as $1, $2, and $3. Note that git ls-tree does not handle pattern pathspecs the way git ls-files does, so do any pattern-based filtering in the shell code. You might also want to check every new commit rather than only the tip: loop over each commit in git rev-list ^"$2" "$3" instead of using just $3.
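
Put together, an update hook along these lines might look like the sketch below. It is untested against a real server setup, and check_rev, check_push, and is_zero are hypothetical helper names. It relies on Git passing an all-zero object name as the old value when a ref is created and as the new value when a ref is deleted; for a newly created ref it checks only commits not already reachable from an existing ref.

```shell
#!/bin/sh
# Sketch of an update hook (hooks/update in the central repository).
# Git runs it as: update <refname> <old-id> <new-id>
# check_rev, check_push and is_zero are hypothetical helper names.

# Succeeds if the object id consists only of zeros (ref created or deleted).
is_zero() {
    case $1 in *[!0]*) return 1 ;; *) return 0 ;; esac
}

# Prints every path in commit $1 whose content is not valid UTF-8 and
# fails if there was at least one.
check_rev() {
    git ls-tree --name-only -r -z "$1" |
    xargs -0 sh -c '
        rev=$1; shift
        e=""
        for f; do
            if ! git show "$rev:$f" |
                 iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
                e=1
                echo "Not UTF-8 in $rev: $f"
            fi
        done
        test -z "$e"
    ' - "$1"
}

# Checks each commit that the push would add to the ref.
check_push() {
    old=$1
    new=$2
    is_zero "$new" && return 0    # ref deletion: nothing to check
    if is_zero "$old"; then
        # New ref: only commits not reachable from any existing ref.
        commits=$(git rev-list "$new" --not --all)
    else
        commits=$(git rev-list "$old..$new")
    fi
    status=0
    for c in $commits; do
        check_rev "$c" || status=1
    done
    return $status
}

# When installed as the hook, Git supplies exactly three arguments.
if [ "$#" -eq 3 ]; then
    check_push "$2" "$3" || exit 1
fi
```

This checks every file in every new commit, which can be slow for large pushes; filtering the pathnames inside the for loop (for example with a case statement on "$f") would cut that down.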

Chris Johnsen answered Oct 19 '22 03:10


Pre-commit hooks are just scripts. So if you can determine the encoding in a script, you can use that information to reject the wrong sort of file.

You could search the file for characters outside of the normal character range. If there's a magic number or a tag to tell you the encoding for a file, you can check that. Otherwise ask yourself "how would I know this file is in the wrong encoding?" Can you code that up?
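
Two such checks are cheap to script. The sketch below is one possible reading of "outside of the normal character range" (anything other than printable ASCII and ASCII whitespace) and of a "magic number" (the UTF-8 byte order mark); both interpretations, and the helper names, are assumptions. Note that the first check also flags perfectly valid non-ASCII UTF-8 text, so it only suits projects that want ASCII-only files.

```shell
#!/bin/sh
# Sketch only: has_non_ascii and has_utf8_bom are hypothetical helper names.

# Succeeds if the file contains any byte outside printable ASCII and
# ASCII whitespace. LC_ALL=C makes grep treat the input as single bytes.
has_non_ascii() {
    LC_ALL=C grep -q '[^[:print:][:space:]]' "$1"
}

# Succeeds if the file starts with the UTF-8 byte order mark (EF BB BF).
has_utf8_bom() {
    [ "$(od -An -tx1 -N3 "$1" | tr -d ' \n')" = "efbbbf" ]
}

# Report findings for each file named on the command line.
for f in "$@"; do
    if has_non_ascii "$f"; then
        echo "$f: contains non-ASCII or control bytes"
    fi
    if has_utf8_bom "$f"; then
        echo "$f: starts with a UTF-8 BOM"
    fi
done
```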

Peter answered Oct 19 '22 02:10


You could use the iconv utility to convert the file from UTF-8 to another encoding, for example UTF-16. If the conversion fails, the source file is not valid UTF-8:

$ iconv -f UTF-8 -t UTF-16 Strings.java 
ÿþ
testing = iconv: illegal input sequence at position 11

oherrala answered Oct 19 '22 04:10