
How to make sure all my source files stay UTF-8 with Unix line endings?

I'm looking for command-line tools for Linux that can help me detect and convert files from character sets like ISO-8859-1 and Windows-1252 to UTF-8, and from Windows line endings to Unix line endings.

The reason I need this is that I'm working on projects on Linux servers via SFTP with editors on Windows (like Sublime Text) that constantly screw these things up. Right now I'm guessing about half my files are UTF-8 and the rest are ISO-8859-1 or Windows-1252, as it seems Sublime Text just picks the character set based on which symbols the file contains when I save it. The line endings are ALWAYS Windows line endings even though I've set the default line endings to LF in the options, so about half of my files have LF and half have CRLF.

So I would need at least a tool that recursively scans my project folder and alerts me of files that deviate from UTF-8 with LF line endings, so I can manually fix them before I commit my changes to Git.

Any comments and personal experiences on the topic would also be welcome.

Thanks
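(Aside: Git itself can normalize line endings at commit time, which covers at least half of this problem. A minimal `.gitattributes` sketch, assuming Git 2.10 or newer:)

```
* text=auto eol=lf
```

With this in place, Git converts CRLF to LF for files it considers text when they are checked in, regardless of what the editor wrote to disk.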


Edit: I have a temporary solution in place where I use tree and file to output information about every file in my project, but it's kinda wonky. If I don't include the -i option for file, a lot of my files get different output like ASCII C++ program text, HTML document text, English text, etc.:

$ tree -f -i -a -I node_modules --noreport -n | xargs file | grep -v directory
./config.json:              ASCII C++ program text
./debugserver.sh:           ASCII text
./.gitignore:               ASCII text, with no line terminators
./lib/config.js:            ASCII text
./lib/database.js:          ASCII text
./lib/get_input.js:         ASCII text
./lib/models/stream.js:     ASCII English text
./lib/serverconfig.js:      ASCII text
./lib/server.js:            ASCII text
./package.json:             ASCII text
./public/index.html:        HTML document text
./src/config.coffee:        ASCII English text
./src/database.coffee:      ASCII English text
./src/get_input.coffee:     ASCII English text, with CRLF line terminators
./src/jtv.coffee:           ASCII English text
./src/models/stream.coffee: ASCII English text
./src/server.coffee:        ASCII text
./src/serverconfig.coffee:  ASCII text
./testserver.sh:            ASCII text
./vendor/minify.json.js:    ASCII C++ program text, with CRLF line terminators

But if I do include -i it doesn't show me line terminators:

$ tree -f -i -a -I node_modules --noreport -n | xargs file -i | grep -v directory
./config.json:              text/x-c++; charset=us-ascii
./debugserver.sh:           text/plain; charset=us-ascii
./.gitignore:               text/plain; charset=us-ascii
./lib/config.js:            text/plain; charset=us-ascii
./lib/database.js:          text/plain; charset=us-ascii
./lib/get_input.js:         text/plain; charset=us-ascii
./lib/models/stream.js:     text/plain; charset=us-ascii
./lib/serverconfig.js:      text/plain; charset=us-ascii
./lib/server.js:            text/plain; charset=us-ascii
./package.json:             text/plain; charset=us-ascii
./public/index.html:        text/html; charset=us-ascii
./src/config.coffee:        text/plain; charset=us-ascii
./src/database.coffee:      text/plain; charset=us-ascii
./src/get_input.coffee:     text/plain; charset=us-ascii
./src/jtv.coffee:           text/plain; charset=us-ascii
./src/models/stream.coffee: text/plain; charset=us-ascii
./src/server.coffee:        text/plain; charset=us-ascii
./src/serverconfig.coffee:  text/plain; charset=us-ascii
./testserver.sh:            text/plain; charset=us-ascii
./vendor/minify.json.js:    text/x-c++; charset=us-ascii

Also, why does it display charset=us-ascii and not utf-8? And what's text/x-c++? Is there a way I could output just the charset and the line terminators for each file?
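For what it's worth, something like the following gets part of the way there (a sketch assuming GNU file, find, grep, and a bash-like shell; it builds its own sample files for demonstration):

```shell
set -e
dir=$(mktemp -d)
printf 'plain ascii\n'    > "$dir/unix.txt"
printf 'windows line\r\n' > "$dir/dos.txt"

# Charset for every regular file (MIME style, like file -i above):
find "$dir" -type f -exec file -i {} +

# Files that contain at least one carriage return, i.e. CRLF endings:
crlf_files=$(grep -rlU $'\r' "$dir")
echo "CRLF: $crlf_files"
```

The grep part only proves a file contains a CR somewhere; it can't distinguish CRLF from lone CR, but for flagging files to inspect it's usually enough.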

asked Jan 22 '12 by Hubro

3 Answers

The solution I ended up with is two Sublime Text 2 plugins: "EncodingHelper" and "LineEndings". I now get both the file encoding and the line endings in the status bar:

Sublime Text 2 status bar

If the encoding is wrong, I can File->Save with Encoding. If the line endings are wrong, the latter plugin comes with commands for changing the line endings:

Sublime Text 2 commands

answered Jan 01 '23 by Hubro


If a file has no BOM, and no 'interesting' characters within the amount of text that file examines, file concludes that it is ASCII (ISO 646) -- a strict subset of UTF-8. You might find that putting BOMs on all your files encourages all these Windows tools to behave; the convention of a BOM on a UTF-8 file originated on Windows. Or it might make things worse. As for text/x-c++, that's just file trying to be helpful, and failing. Your JavaScript has something in it that looks like C++.
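The ASCII-subset behaviour is easy to see in isolation (a small demonstration; the exact wording of file's output can vary between versions):

```shell
set -e
d=$(mktemp -d)
printf 'hello\n'        > "$d/ascii.txt"
printf 'h\xc3\xa9llo\n' > "$d/utf8.txt"   # "héllo" encoded as UTF-8

file -i "$d/ascii.txt"   # pure 7-bit bytes: reported as charset=us-ascii
file -i "$d/utf8.txt"    # contains a multibyte sequence: reported as charset=utf-8
```

So a us-ascii result doesn't mean the file isn't UTF-8; it means file found nothing that required deciding.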

Apache Tika has an encoding detector; you could even use the command-line driver that comes with it as an alternative to file. It will stick to MIME types and not wander off to C++.

answered Jan 01 '23 by bmargulies


Instead of file, try a custom program that checks just the things you care about. Here is a quick hack, based mainly on some Google hits (which, incidentally, were written by @ikegami).

#!/usr/bin/perl

use strict;
use warnings;

use Encode qw( decode );

@ARGV > 0 or die "Usage: $0 files ...\n";

for my $filename (@ARGV)
{
    my $terminator = 'CRLF';
    my $charset = 'UTF-8';
    local $/;   # slurp mode: read the whole file at once
    my $file;
    if (open (F, "<", $filename))
    {
        $file = <F>;
        close F;    
        # Don't print bogus data e.g. for directories
        unless (defined $file)
        {
            warn "$0: Skipping $filename: $!\n";
            next;
        }
    }
    else
    {
        warn "$0: Could not open $filename: $!\n";
        next;
    }

    my $have_crlf = ($file =~ /\r\n/)      ? 1 : 0;
    my $have_cr   = ($file =~ /\r(?!\n)/)  ? 1 : 0;  # lone CR
    my $have_lf   = ($file =~ /(?<!\r)\n/) ? 1 : 0;  # lone LF
    my $sum = $have_crlf + $have_cr + $have_lf;
    if ($sum == 0)
    {
        $terminator = "no";
    }
    elsif ($sum > 1)
    {
        $terminator = "mixed";
    }
    elsif ($have_cr)
    {
        $terminator = "CR";
    }
    elsif ($have_lf)
    {
        $terminator = "LF";
    }

    $charset = 'ASCII' unless ($file =~ /[^\000-\177]/);

    $charset = 'unknown'
        unless eval { decode('UTF-8', $file, Encode::FB_CROAK); 1 };

    print "$filename: charset $charset, $terminator line endings\n";
}

Note that this has no concept of legacy 8-bit encodings; it will simply report unknown if the contents are neither pure 7-bit ASCII nor valid UTF-8.
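Once the deviating files are identified, the conversion itself is mechanical; a sketch using iconv plus sed (GNU sed's -i assumed; dos2unix would do equally well for the line endings, and the filenames here are just examples):

```shell
set -e
d=$(mktemp -d)
printf 'caf\xe9\r\n' > "$d/legacy.txt"   # "café" in ISO-8859-1, with a CRLF ending

iconv -f ISO-8859-1 -t UTF-8 "$d/legacy.txt" > "$d/fixed.txt"  # re-encode to UTF-8
sed -i 's/\r$//' "$d/fixed.txt"                                # strip CR from CRLF
```

Note that iconv will fail loudly on input that isn't valid in the source encoding, which is exactly what you want when you're only guessing at the original charset.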

answered Jan 01 '23 by tripleee