 

What's the best way of doing dos2unix on a 500k line file, in Windows? [closed]

The question says it all: I've got a 500,000-line file that gets generated as part of an automated build process on a Windows box, and it's riddled with ^M's. When it goes out the door it needs to be *nix friendly. What's the best approach here? Is there a handy snippet of code that could do this for me, or do I need to write a little C# or Java app?

ninesided asked Nov 24 '08 00:11




6 Answers

Here is a Perl one-liner, taken from http://www.technocage.com/~caskey/dos2unix/

#!/usr/bin/perl -pi
s/\r\n/\n/;

You can run it as follows:

perl dos2unix.pl < file.dos > file.unix

Or you can run it this way, which does the conversion in place:

perl -pi dos2unix.pl file.dos

And here is my (naive) C version:

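#include <stdio.h>
#ifdef _WIN32
#include <fcntl.h>  /* _O_BINARY */
#include <io.h>     /* _setmode */
#endif

int main(void)
{
   int c;
#ifdef _WIN32
   /* Binary mode, so the C runtime does not turn each \n back into \r\n. */
   _setmode(_fileno(stdin), _O_BINARY);
   _setmode(_fileno(stdout), _O_BINARY);
#endif
   /* Copy stdin to stdout, dropping every carriage return. */
   while( (c = fgetc(stdin)) != EOF )
      if(c != '\r')
         fputc(c, stdout);
   return 0;
}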

You should run it with input and output redirection:

dos2unix.exe < file.dos > file.unix
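
For a file of this size the character-at-a-time loop is usually fast enough, because stdio buffers the reads and writes. If you would rather work on whole blocks, the same "drop every CR" filter can be sketched as below. The names `strip_cr` and `filter_stream` and the 64 KiB block size are my own choices for illustration, not part of any library, and both streams are assumed to be opened in binary mode.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: remove every '\r' from buf[0..n) in place and
   return the new length. */
size_t strip_cr(char *buf, size_t n)
{
    size_t r, w = 0;
    for (r = 0; r < n; ++r)
        if (buf[r] != '\r')
            buf[w++] = buf[r];
    return w;
}

/* Hypothetical helper: copy 'in' to 'out' in 64 KiB blocks, dropping all
   carriage returns. Returns 0 on success, -1 on a read or write error.
   Both streams should be opened in binary mode ("rb"/"wb") on Windows. */
int filter_stream(FILE *in, FILE *out)
{
    static char buf[65536];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        size_t kept = strip_cr(buf, n);
        if (fwrite(buf, 1, kept, out) != kept)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

Because every CR is dropped regardless of what follows it, a CR that happens to fall on a block boundary needs no special handling.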

Federico A. Ramponi answered Sep 25 '22 03:09


If installing a base Cygwin is too heavy, there are a number of standalone, console-based dos2unix and unix2dos programs for Windows on the net, many with C/C++ source available. If I'm understanding the requirement correctly, either approach (Cygwin or a standalone tool) would fit nicely into an automated build script.

Ken Gentle answered Sep 23 '22 03:09


If you're on Windows and need something run in a batch script, you can compile a simple C program to do the trick.

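#include <stdio.h>
#ifdef _WIN32
#include <fcntl.h>  /* _O_BINARY */
#include <io.h>     /* _setmode */
#endif

int main() {
#ifdef _WIN32
    /* Binary mode stops the C runtime from translating line endings itself. */
    _setmode(_fileno(stdin), _O_BINARY);
    _setmode(_fileno(stdout), _O_BINARY);
#endif

    while(1) {
        int c = fgetc(stdin);

        if(c == EOF)
            break;

        /* Drop every carriage return, whether or not a \n follows it. */
        if(c == '\r')
            continue;

        fputc(c, stdout);
    }

    return 0;
}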

Usage:

myprogram.exe < input > output

Editing in-place would be a bit more difficult. Besides, you may want to keep backups of the originals for some reason (in case you accidentally strip a binary file, for example).
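
To guard against the binary-file accident mentioned above, a converter can sample the file first and refuse to touch anything that does not look like text. The sketch below uses one common heuristic: a NUL byte in the first few kilobytes means "binary". The names `looks_binary` and `file_looks_binary` and the 8 KiB sample size are assumptions for illustration. Note that UTF-16 text contains NUL bytes and would also be rejected, which is arguably what you want, since byte-wise CR stripping would corrupt it.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: report whether the first n bytes contain a NUL byte,
   which plain 8-bit text files almost never do. Returns 1 if so, else 0. */
int looks_binary(const unsigned char *buf, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i)
        if (buf[i] == 0)
            return 1;
    return 0;
}

/* Hypothetical helper: sample up to the first 8 KiB of the file at 'path'.
   Returns 1 if it looks binary, 0 if it looks like text, -1 if the file
   cannot be opened. */
int file_looks_binary(const char *path)
{
    unsigned char buf[8192];
    size_t n;
    FILE *f = fopen(path, "rb");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof buf, f);
    fclose(f);
    return looks_binary(buf, n);
}
```

This is only a heuristic; keeping a backup of the original is still the safer habit.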

That version removes all CR characters; if you only want to remove the ones that are in a CR-LF pair, you can use (this is the classic one-character-back method :-):

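/* Removes a \r only when it is immediately followed by \n. */

#include <stdio.h>
#ifdef _WIN32
#include <fcntl.h>  /* _O_BINARY */
#include <io.h>     /* _setmode */
#endif

int main() {
    int lastc = EOF;
    int c;
#ifdef _WIN32
    _setmode(_fileno(stdin), _O_BINARY);
    _setmode(_fileno(stdout), _O_BINARY);
#endif
    while ((c = fgetc(stdin)) != EOF) {
        /* Emit the previous character unless it is the \r of a \r\n pair.
           The EOF check keeps the initial sentinel out of the output. */
        if ((lastc != EOF) && ((lastc != '\r') || (c != '\n'))) {
            fputc (lastc, stdout);
        }
        lastc = c;
    }
    /* Emit the final character, unless the input was empty. */
    if (lastc != EOF) {
        fputc (lastc, stdout);
    }
    return 0;
}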
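
To make the difference between "remove every CR" and "remove only the CR of a CR-LF pair" concrete, here is the pair-only rule sketched on an in-memory buffer. It looks one character ahead rather than one character back, so it is a variation on the method above, not the same code. The function name `crlf_to_lf` is hypothetical, and the sketch assumes the whole input is in one buffer (a pair split across two buffers would not be detected).

```c
#include <stddef.h>

/* Hypothetical helper: copy in[0..n) to out, writing "\r\n" pairs as "\n"
   and leaving any other '\r' untouched. 'out' must have room for n bytes
   (it may be the same buffer as 'in'). Returns the number of bytes written. */
size_t crlf_to_lf(const char *in, size_t n, char *out)
{
    size_t r, w = 0;
    for (r = 0; r < n; ++r) {
        /* Skip a '\r' only when the next byte is '\n'. */
        if (in[r] == '\r' && r + 1 < n && in[r + 1] == '\n')
            continue;
        out[w++] = in[r];
    }
    return w;
}
```

Given the five bytes `a \r b \r \n`, this keeps the first CR (no LF follows it) and drops the second, leaving `a \r b \n`. The "remove every CR" programs would produce `a b \n` instead.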

You can edit the file in-place using mode "r+". Below is a general myd2u program, which accepts file names as arguments. NOTE: This program uses ftruncate to chop off extra characters at the end. If there's any better (standard) way to do this, please edit or comment. Thanks!

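#include <stdio.h>
#include <unistd.h>  /* ftruncate() is POSIX, not ISO C; MSVC's closest equivalent is _chsize() in <io.h>. */

/* Convert one file in place. Returns 0 on success, nonzero on failure. */
static int convert(const char *path) {
    FILE *file = fopen(path, "rb+");
    long readPos = 0, writePos = 0;
    int lastC = EOF;

    if(!file) {
        perror(path);
        return 2;
    }

    while(1) {
        int c;

        fseek(file, readPos, SEEK_SET);
        c = fgetc(file);
        readPos = ftell(file);  /* For good measure. */

        if(c == EOF)
            break;

        if(c == '\n' && lastC == '\r') {
            /* Move back so we overwrite the \r with the \n. */
            --writePos;
        }

        fseek(file, writePos, SEEK_SET);
        fputc(c, file);
        writePos = ftell(file);

        lastC = c;
    }

    /* Chop off the leftover bytes at the end of the file. */
    if(ftruncate(fileno(file), writePos) != 0) {
        perror(path);
        fclose(file);
        return 3;
    }

    fclose(file);
    return 0;
}

int main(int argc, char **argv) {
    int i, status = 0;

    if(argc < 2) {
        fprintf(stderr, "Usage: myd2u <files>\n");
        return 1;
    }

    /* Convert every file named on the command line. */
    for(i = 1; i < argc; ++i) {
        int rc = convert(argv[i]);
        if(rc != 0)
            status = rc;
    }

    return status;
}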
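
On the question of a more standard way: as far as I know ISO C has no call that shortens a file, but you can avoid truncation altogether by writing the converted data to a temporary file and renaming it over the original. The sketch below uses only standard library calls. The function name `convert_via_temp` and the ".tmp" suffix are assumptions for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert "\r\n" to "\n" in the file at 'path' by
   writing a sibling temporary file and renaming it over the original.
   The ".tmp" suffix is an assumption for illustration.
   Returns 0 on success, -1 on failure. */
int convert_via_temp(const char *path)
{
    FILE *in, *out;
    char *tmp;
    int c, pending_cr = 0, ok;

    tmp = malloc(strlen(path) + 5);           /* room for ".tmp" and '\0' */
    if (!tmp)
        return -1;
    strcpy(tmp, path);
    strcat(tmp, ".tmp");

    in = fopen(path, "rb");
    out = in ? fopen(tmp, "wb") : NULL;
    if (!in || !out) {
        if (in)
            fclose(in);
        free(tmp);
        return -1;
    }

    while ((c = fgetc(in)) != EOF) {
        /* A held-back '\r' is dropped only if '\n' follows it. */
        if (pending_cr && c != '\n')
            fputc('\r', out);
        pending_cr = (c == '\r');
        if (!pending_cr)
            fputc(c, out);
    }
    if (pending_cr)
        fputc('\r', out);                     /* lone '\r' at end of file */

    ok = !ferror(in) && !ferror(out);
    fclose(in);
    if (fclose(out) != 0)
        ok = 0;

    /* rename() does not replace an existing file on Windows, so the
       original is removed first. */
    if (!ok || remove(path) != 0) {
        remove(tmp);
        free(tmp);
        return -1;
    }
    /* If rename() fails here, the converted data is still in the .tmp file. */
    ok = (rename(tmp, path) == 0);
    free(tmp);
    return ok ? 0 : -1;
}
```

The trade-off is that this needs enough disk space for a second copy, and there is a brief moment between `remove` and `rename` when no file exists under the original name.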

strager answered Sep 22 '22 03:09


tr -d '^M' < infile > outfile

Type ^M as Ctrl+V followed by Enter.

Edit: You can use '\r' instead of manually entering a carriage return, [thanks to @strager]

tr -d '\r' < infile > outfile

Edit 2: 'tr' is a Unix utility; you can download a native Windows version from http://unxutils.sourceforge.net [thanks to @Rob Kennedy] or use Cygwin's Unix emulation.
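
Since this runs inside an automated build, it may be worth checking the result after the conversion. A minimal sketch of such a check is below: it tallies the line endings in a stream so the build can fail if any CR is left. The struct `eol_counts` and the function `count_eols` are hypothetical names, and the stream is assumed to be opened in binary mode.

```c
#include <stdio.h>

/* Hypothetical record of what was found in a stream. */
struct eol_counts {
    unsigned long crlf;     /* "\r\n" pairs */
    unsigned long lone_cr;  /* '\r' not followed by '\n' */
    unsigned long lone_lf;  /* '\n' not preceded by '\r' */
};

/* Hypothetical helper: tally line endings in 'in', which should be opened
   in binary mode ("rb") so the C runtime does not translate them. */
struct eol_counts count_eols(FILE *in)
{
    struct eol_counts k = {0, 0, 0};
    int c, prev = EOF;

    while ((c = fgetc(in)) != EOF) {
        if (c == '\n') {
            if (prev == '\r')
                ++k.crlf;
            else
                ++k.lone_lf;
        } else if (prev == '\r') {
            ++k.lone_cr;
        }
        prev = c;
    }
    if (prev == '\r')
        ++k.lone_cr;        /* trailing '\r' at end of input */
    return k;
}
```

After `tr -d '\r'`, both `crlf` and `lone_cr` should be zero for the output file.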

hayalci answered Sep 26 '22 03:09


FTP the file from the DOS box to the Unix box in ASCII mode rather than binary mode. FTP will replace each CRLF with a single LF. Then transfer it back to the DOS box in binary mode, and the LFs will be retained.

EvilTeach answered Sep 23 '22 03:09


Some text editors, such as UltraEdit/UEStudio, have this functionality built in.

File > Conversions > DOS to UNIX

nickf answered Sep 23 '22 03:09