The question says it all: I've got a 500,000-line file that gets generated as part of an automated build process on a Windows box, and it's riddled with ^M's. When it goes out the door it needs to be *nix friendly. What's the best approach here? Is there a handy snippet of code that could do this for me, or do I need to write a little C# or Java app?
Sometimes you will need to move files between Windows and Unix systems. Windows files use the same format as DOS, where the end of a line is signified by two characters: Carriage Return (CR, \r) followed by Line Feed (LF, \n). Unix files, on the other hand, use only Line Feed (\n).
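Before converting anything, it can help to check which line endings a file actually contains. The following is a minimal sketch, not part of any tool mentioned here; the names eol_counts and count_eols are made up for illustration, and it assumes the file contents are already in a memory buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Tally of line-ending styles found in a buffer (hypothetical helper type). */
struct eol_counts {
    size_t crlf;   /* "\r\n" pairs (DOS/Windows) */
    size_t lf;     /* bare "\n" (Unix) */
    size_t cr;     /* stray "\r" not followed by "\n" */
};

/* Scan len bytes of buf and count each kind of line ending. */
struct eol_counts count_eols(const char *buf, size_t len)
{
    struct eol_counts n = {0, 0, 0};
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; ++i) {
        if (buf[i] == '\r') {
            if (i + 1 < len && buf[i + 1] == '\n') {
                ++n.crlf;
                ++i;           /* skip the LF that belongs to this pair */
            } else {
                ++n.cr;
            }
        } else if (buf[i] == '\n') {
            ++n.lf;
        }
    }
    return n;
}
```

A file with a nonzero crlf count and a zero lf count is a plain DOS file; a mix of both suggests it has been edited on both kinds of system.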
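The simplest way to convert line breaks in a text file is to use the dos2unix tool. By default the command converts the file in place, so the original CRLF version is not kept. If you want to keep the original file, some versions accept -b before the file name to write a backup copy. Check your version's man page first: in the widely packaged dos2unix, -b means --keep-bom, and new-file mode (dos2unix -n infile outfile) is the way to leave the original untouched.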
If you want to run dos2unix on hidden files and folders as well, you can use find or dos2unix ** **/.* (in bash, ** only recurses when the globstar option is enabled). The **/.* will expand only the hidden files and folders, including .
dos2unix and unix2dos with Unicode UTF-16 support can read little and big endian UTF-16 encoded text files. To see if dos2unix was built with UTF-16 support, type "dos2unix -V". The Windows versions of dos2unix and unix2dos always convert UTF-16 encoded files to UTF-8 encoded files.
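UTF-16 needs explicit support because every character occupies at least two bytes: in little-endian UTF-16 a CR LF pair is stored as the four bytes 0D 00 0A 00, so a filter that simply deletes every 0x0D byte would leave a stray 0x00 behind and corrupt everything after it. The sketch below is only an illustration of that point, not how dos2unix is implemented; the function name utf16le_strip_crlf is hypothetical, and it assumes the whole little-endian file is already in memory.

```c
#include <assert.h>
#include <stddef.h>

/* Remove the CR of every CR-LF pair from UTF-16LE bytes, in place.
   Works on 2-byte code units so a 0x0D byte inside another character
   (for example U+010D) is left alone. Returns the new length in bytes. */
size_t utf16le_strip_crlf(unsigned char *bytes, size_t len)
{
    size_t r, w = 0;

    assert(bytes != NULL || len == 0);
    assert(len % 2 == 0);      /* UTF-16 data is a whole number of units */
    for (r = 0; r + 1 < len; r += 2) {
        int is_cr = bytes[r] == 0x0D && bytes[r + 1] == 0x00;
        int lf_next = r + 3 < len &&
                      bytes[r + 2] == 0x0A && bytes[r + 3] == 0x00;
        if (is_cr && lf_next)
            continue;          /* drop this CR, keep the LF that follows */
        bytes[w++] = bytes[r];
        bytes[w++] = bytes[r + 1];
    }
    return w;
}
```

A byte order mark at the start of the buffer (FF FE) passes through unchanged, since it is just another code unit.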
Here is a Perl one-liner, taken from http://www.technocage.com/~caskey/dos2unix/
#!/usr/bin/perl -pi
s/\r\n/\n/;
You can run it as follows:
perl dos2unix.pl < file.dos > file.unix
Or, you can run it also in this way (the conversion is done in-place):
perl -pi dos2unix.pl file.dos
And here is my (naive) C version:
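#include <stdio.h>

/* Naive filter: copies stdin to stdout and drops every CR byte.
   Note: with the Microsoft C runtime, stdin/stdout default to text mode,
   which re-adds CR on output; switch both to binary mode there. */
int main(void)
{
    int c;
    while ((c = fgetc(stdin)) != EOF)
        if (c != '\r')
            fputc(c, stdout);
    return 0;
}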
You should run it with input and output redirection:
dos2unix.exe < file.dos > file.unix
If installing a base Cygwin is too heavy, there are a number of standalone console-based dos2unix and unix2dos programs for Windows on the net, many with C/C++ source available. If I'm understanding the requirement correctly, either of these solutions would fit nicely into an automated build script.
If you're on Windows and need something run in a batch script, you can compile a simple C program to do the trick.
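#include <stdio.h>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif

int main(void) {
#ifdef _WIN32
    /* Binary mode, so the C runtime does not translate LF back to CR-LF. */
    _setmode(_fileno(stdin), _O_BINARY);
    _setmode(_fileno(stdout), _O_BINARY);
#endif
    while(1) {
        int c = fgetc(stdin);
        if(c == EOF)
            break;
        if(c == '\r')
            continue;   /* drop every CR */
        fputc(c, stdout);
    }
    return 0;
}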
Usage:
myprogram.exe < input > output
Editing in-place would be a bit more difficult. Besides, you may want to keep backups of the originals for some reason (in case you accidentally strip a binary file, for example).
That version removes all CR characters; if you only want to remove the ones that are in a CR-LF pair, you can use (this is the classic one-character-back method :-):
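/* Holds each character back by one step so a CR can be dropped
   only when the next character is LF. */
#include <stdio.h>
int main(void) {
    int lastc = EOF;   /* nothing held back yet */
    int c;
    while ((c = fgetc(stdin)) != EOF) {
        /* Emit the held-back character unless it is the CR of a CR-LF pair.
           The EOF check avoids writing a junk byte on the first pass. */
        if (lastc != EOF && (lastc != '\r' || c != '\n')) {
            fputc(lastc, stdout);
        }
        lastc = c;
    }
    /* Flush the final character, if the input was not empty. */
    if (lastc != EOF) {
        fputc(lastc, stdout);
    }
    return 0;
}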
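On the earlier point about accidentally stripping a binary file: a cheap guard is to look at the data before converting it. The sketch below uses a common heuristic (a NUL byte means "probably binary"); that rule is an assumption on my part, and the name looks_binary is hypothetical. Note that it would also flag UTF-16 text, which is full of NUL bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Heuristic (an assumption, not a guarantee): a buffer that contains
   a NUL byte is treated as binary. Returns 1 for binary, 0 for text. */
int looks_binary(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; ++i) {
        if (buf[i] == 0)
            return 1;
    }
    return 0;
}
```

In a build script you might run this check on the first few kilobytes of each file and skip the conversion when it returns 1.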
You can edit the file in place by opening it in update mode ("rb+" below). Below is a general myd2u program, which accepts file names as arguments. NOTE: This program uses ftruncate (POSIX, not standard C) to chop off the extra characters at the end. If there's any better (standard) way to do this, please edit or comment. Thanks!
#include <stdio.h>
#include <unistd.h> /* ftruncate: POSIX, not in C89/C99/ANSI! */

/* Convert one file in place. Returns 0 on success, 2 on failure. */
static int convert(const char *path) {
    FILE *file = fopen(path, "rb+");
    if(!file) {
        perror(path);
        return 2;
    }
    long readPos = 0, writePos = 0;
    int lastC = EOF;
    while(1) {
        fseek(file, readPos, SEEK_SET);
        int c = fgetc(file);
        readPos = ftell(file); /* For good measure. */
        if(c == EOF)
            break;
        if(c == '\n' && lastC == '\r') {
            /* Move back so we overwrite the \r with the \n. */
            --writePos;
        }
        fseek(file, writePos, SEEK_SET);
        fputc(c, file);
        writePos = ftell(file);
        lastC = c;
    }
    /* Chop off the leftover bytes at the end. */
    if(ftruncate(fileno(file), writePos) != 0) {
        perror(path);
        fclose(file);
        return 2;
    }
    fclose(file);
    return 0;
}

int main(int argc, char **argv) {
    int i;
    if(argc < 2) {
        fprintf(stderr, "Usage: myd2u <files>\n");
        return 1;
    }
    /* Process every file named on the command line. */
    for(i = 1; i < argc; ++i) {
        int rc = convert(argv[i]);
        if(rc != 0)
            return rc;
    }
    return 0;
}
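The read-position/write-position idea above can also be applied to a buffer in memory, which avoids one fseek per byte on a large file. This is a minimal sketch under the assumption that the whole file has already been read into buf; the function name strip_crlf_inplace is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Remove the CR of every CR-LF pair from buf, in place.
   r is the read index, w the write index (w never passes r).
   Returns the new length in bytes. */
size_t strip_crlf_inplace(char *buf, size_t len)
{
    size_t r, w = 0;
    char prev = 0;

    assert(buf != NULL || len == 0);
    for (r = 0; r < len; ++r) {
        char c = buf[r];
        if (c == '\n' && r > 0 && prev == '\r')
            --w;               /* step back so the LF overwrites the CR */
        buf[w++] = c;
        prev = c;
    }
    return w;
}
```

The caller would then write the first w bytes back out, either to a new file or over the original followed by a truncate.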
tr -d '^M' < infile > outfile
Type the ^M as Ctrl+V followed by Enter (or Ctrl+M).
Edit: You can use '\r' instead of manually entering a carriage return, [thanks to @strager]
tr -d '\r' < infile > outfile
Edit 2: 'tr' is a Unix utility; you can download a native Windows version from http://unxutils.sourceforge.net [thanks to @Rob Kennedy] or use Cygwin's Unix emulation.
FTP it from the DOS box to the Unix box as an ASCII file instead of a binary file. FTP will strip the CRLF and insert an LF. Transfer it back to the DOS box as a binary file, and the LF will be retained.
Some text editors, such as UltraEdit/UEStudio, have this functionality built in.
File > Conversions > DOS to UNIX