I'm looking for some command-line tools for Linux that can help me detect and convert files from character sets like iso-8859-1 and windows-1252 to utf-8 and from Windows line endings to Unix line endings.
The reason I need this is that I'm working on projects on Linux servers via SFTP with editors on Windows (like Sublime Text) that just constantly screw these things up. Right now I'm guessing about half my files are utf-8 and the rest are iso-8859-1 and windows-1252, as it seems Sublime Text picks the character set based on which symbols the file contains when I save it. The line endings on save are ALWAYS Windows line endings even though I've set the default line endings to LF in the options, so by now about half of my files have LF and half have CRLF.
So I would need at least a tool that recursively scans my project folder and alerts me to files that deviate from utf-8 with LF line endings, so I can fix them manually before committing my changes to Git.
Any comments and personal experiences on the topic would also be welcome.
Thanks
Edit: I have a temporary solution in place where I use tree and file to output information about every file in my project, but it's kinda wonky. If I don't include the -i option for file, then a lot of my files get varying output like "ASCII C++ program text", "HTML document text", "English text", etc.:
$ tree -f -i -a -I node_modules --noreport -n | xargs file | grep -v directory
./config.json: ASCII C++ program text
./debugserver.sh: ASCII text
./.gitignore: ASCII text, with no line terminators
./lib/config.js: ASCII text
./lib/database.js: ASCII text
./lib/get_input.js: ASCII text
./lib/models/stream.js: ASCII English text
./lib/serverconfig.js: ASCII text
./lib/server.js: ASCII text
./package.json: ASCII text
./public/index.html: HTML document text
./src/config.coffee: ASCII English text
./src/database.coffee: ASCII English text
./src/get_input.coffee: ASCII English text, with CRLF line terminators
./src/jtv.coffee: ASCII English text
./src/models/stream.coffee: ASCII English text
./src/server.coffee: ASCII text
./src/serverconfig.coffee: ASCII text
./testserver.sh: ASCII text
./vendor/minify.json.js: ASCII C++ program text, with CRLF line terminators
But if I do include -i, it doesn't show me line terminators:
$ tree -f -i -a -I node_modules --noreport -n | xargs file -i | grep -v directory
./config.json: text/x-c++; charset=us-ascii
./debugserver.sh: text/plain; charset=us-ascii
./.gitignore: text/plain; charset=us-ascii
./lib/config.js: text/plain; charset=us-ascii
./lib/database.js: text/plain; charset=us-ascii
./lib/get_input.js: text/plain; charset=us-ascii
./lib/models/stream.js: text/plain; charset=us-ascii
./lib/serverconfig.js: text/plain; charset=us-ascii
./lib/server.js: text/plain; charset=us-ascii
./package.json: text/plain; charset=us-ascii
./public/index.html: text/html; charset=us-ascii
./src/config.coffee: text/plain; charset=us-ascii
./src/database.coffee: text/plain; charset=us-ascii
./src/get_input.coffee: text/plain; charset=us-ascii
./src/jtv.coffee: text/plain; charset=us-ascii
./src/models/stream.coffee: text/plain; charset=us-ascii
./src/server.coffee: text/plain; charset=us-ascii
./src/serverconfig.coffee: text/plain; charset=us-ascii
./testserver.sh: text/plain; charset=us-ascii
./vendor/minify.json.js: text/x-c++; charset=us-ascii
Also, why does it display charset=us-ascii and not utf-8? And what's text/x-c++? Is there a way I could output only charset=utf-8 and line-terminators=LF for each file?
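The closest I can get to that with this pipeline is to grep for the deviations instead: file only mentions line terminators and charsets when they differ from plain ASCII/LF, so something like this (the patterns are a guess at GNU file's wording for CRLF and legacy 8-bit files) should list only the offenders:
$ tree -f -i -a -I node_modules --noreport -n | xargs file | grep -E 'CRLF|CR line|ISO-8859|Non-ISO'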
The solution I ended up with is the two Sublime Text 2 plugins "EncodingHelper" and "LineEndings". I now get both the file encoding and the line endings in the status bar. If the encoding is wrong, I can File->Save with Encoding; if the line endings are wrong, the latter plugin comes with commands for changing them.
If a file has no BOM, and no 'interesting characters' within the amount of text that file looks at, file concludes that it is ASCII ISO-646 -- a strict subset of UTF-8. You might find that putting BOMs on all your files encourages all these Windows tools to behave; the convention of a BOM on a UTF-8 file originated on Windows. Or it might make things worse. As for text/x-c++, well, that's just file tryin' to be helpful, and failing: your JavaScript has something in it that looks like C++.
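If you want to try the BOM route, prepending the three UTF-8 BOM bytes (EF BB BF) is all it takes; a sketch with octal escapes so plain printf handles them, using one of your files as a stand-in:
$ printf '\357\273\277' | cat - lib/config.js > lib/config.js.tmp && mv lib/config.js.tmp lib/config.js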
Apache Tika has an encoding detector; you could even use the command-line driver that comes with it as an alternative to file. It will stick to MIME types and not wander off to C++.
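A sketch of driving it from the shell: tika-app is a single downloadable jar (the exact file name depends on the version, and the flags here are from memory of its --help, so verify them for your release). --detect prints just the MIME type, and --metadata includes the detected encoding:
$ java -jar tika-app.jar --detect src/config.coffee
$ java -jar tika-app.jar --metadata src/config.coffee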
Instead of file, try a custom program to check just the things you want. Here is a quick hack, mainly based on some Google hits, which were incidentally written by @ikegami.
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode );

@ARGV > 0 or die "Usage: $0 files ...\n";

for my $filename (@ARGV)
{
    my $terminator = 'CRLF';
    my $charset    = 'UTF-8';

    # Slurp the whole file at once.
    local $/;

    my $file;
    if (open(my $fh, '<', $filename))
    {
        $file = <$fh>;
        close $fh;

        # Don't print bogus data e.g. for directories.
        unless (defined $file)
        {
            warn "$0: Skipping $filename: $!\n";
            next;
        }
    }
    else
    {
        warn "$0: Could not open $filename: $!\n";
        next;
    }

    my $have_crlf = ($file =~ /\r\n/);      # CRLF pair
    my $have_cr   = ($file =~ /\r(?!\n)/);  # lone CR
    my $have_lf   = ($file =~ /(?<!\r)\n/); # lone LF
    my $sum       = $have_crlf + $have_cr + $have_lf;

    if ($sum == 0)
    {
        $terminator = "no";
    }
    elsif ($sum > 1)
    {
        # More than one style present in the same file.
        $terminator = "mixed";
    }
    elsif ($have_cr)
    {
        $terminator = "CR";
    }
    elsif ($have_lf)
    {
        $terminator = "LF";
    }
    # Otherwise only CRLF was seen and the default stands.

    # Pure 7-bit files are reported as ASCII rather than UTF-8.
    $charset = 'ASCII' unless ($file =~ /[^\000-\177]/);

    # If a strict UTF-8 decode fails, we don't know what it is.
    $charset = 'unknown'
        unless eval { decode('UTF-8', $file, Encode::FB_CROAK); 1 };

    print "$filename: charset $charset, $terminator line endings\n";
}
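Save it as, say, check-encodings.pl (the name is arbitrary), make it executable, and drive it with the same tree pipeline you used with file; directories just produce a "Skipping" warning. For the listing above, the CRLF offenders would show up like this:
$ tree -f -i -a -I node_modules --noreport -n | xargs ./check-encodings.pl
./src/get_input.coffee: charset ASCII, CRLF line endings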
Note that this has no concept of legacy 8-bit encodings: it will simply report unknown if a file is neither pure 7-bit ASCII nor proper UTF-8.
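Once a file is flagged, conversion is a separate step: iconv rewrites the charset (you have to tell it the source encoding; it won't guess) and dos2unix fixes the line endings in place. A sketch, assuming the source really is windows-1252:
$ iconv -f WINDOWS-1252 -t UTF-8 lib/config.js > lib/config.js.tmp && mv lib/config.js.tmp lib/config.js
$ dos2unix src/get_input.coffee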