
Splitting a large, complex one-column file into several columns with awk

Tags: bash, field, awk, rows

I have a text file produced by some commercial software that looks like the sample below. It consists of bracket-delimited sections, each containing several million elements; the exact count varies from one case to another.

(1
 2
 3
...
)
(11
22
33
...
)
(111
222
333
...
)

I need to achieve an output like:

 1;  11;   111
 2;  22;   222
 3;  33;   333
...  ...  ...

I found a convoluted way to do it:

  • perform sed operations to get

    1
    2
    3
    ...
    #
    11
    22
    33
    ...
    #
    111
    222
    333
    ...
    
  • use awk as follows to split the file into several sub-files

    awk -v RS="#" '{print > ("splitted-" NR ".txt")}'
    
  • remove the blank lines from the sub-files, again with sed

    sed -i '/^[[:space:]]*$/d' splitted*.txt
    
  • join everything together:

    paste splitted*.txt > out.txt
    
  • add a field separator (defined in my bash script)

    awk -v sep="$my_sep" 'BEGIN{OFS=sep}{$1=$1; print }' out.txt > formatted.txt
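
The steps above can be sketched end to end on a tiny sample, which may help when comparing alternatives. The sed expressions in step 1 are an assumption, since the original commands are not shown; `input.txt`, `marked.txt` and the value of `my_sep` are placeholders, and `sed -i` is used in its GNU form.

```shell
# Work in a scratch directory so the splitted-*.txt glob only sees our files.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Small stand-in for the real multi-million-line file.
printf '(1\n 2\n 3\n)\n(11\n22\n33\n)\n(111\n222\n333\n)\n' > input.txt
my_sep=';'   # placeholder for the separator defined in the calling script

# Step 1 (assumed sed expressions): strip "(" and spaces, drop the final ")",
# and turn every remaining ")" line into a "#" marker.
sed -e 's/[( ]//g' -e '$d' -e 's/^)$/#/' input.txt > marked.txt

# Step 2: one sub-file per "#"-separated record.
awk -v RS="#" '{print > ("splitted-" NR ".txt")}' marked.txt

# Step 3: delete the blank lines left around each record (GNU sed -i).
sed -i '/^[[:space:]]*$/d' splitted-*.txt

# Step 4: put the sub-files side by side (tab-separated).
paste splitted-*.txt > out.txt

# Step 5: rebuild each line with the chosen separator.
awk -v sep="$my_sep" 'BEGIN{OFS=sep}{$1=$1; print}' out.txt > formatted.txt
cat formatted.txt
```

One thing to watch with this approach: with ten or more sections, the shell glob sorts `splitted-10.txt` before `splitted-2.txt`, so `paste` would place the columns out of order.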
    

This feels clumsy, since it passes over millions of lines several times. Although the run time is acceptable (~80 s), I'd like a pure awk solution but can't work one out. Something like:

awk 'BEGIN{RS="(\\n)"; OFS=";"} { print something } '

I found some related questions, especially this one: "row to column conversion with awk". However, it assumes a constant number of lines between brackets, which I can't rely on.

Any help would be appreciated.

asked Dec 05 '22 10:12 by EdouardIFP


2 Answers

Using GNU awk, for its multi-character RS and true multidimensional arrays (arrays of arrays):

$ cat tst.awk
BEGIN {
    RS  = "(\\s*[()]\\s*)+"
    OFS = ";"
}
NR>1 {
    cell[NR][1]
    split($0,cell[NR])
}
END {
    for (rowNr=1; rowNr<=NF; rowNr++) {
        for (colNr=2; colNr<=NR; colNr++) {
            printf "%6s%s", cell[colNr][rowNr], (colNr<NR ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
     1;    11;   111
     2;    22;   222
     3;    33;   333
   ...;   ...;   ...
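
To see why the script above skips the first record (`NR>1`) and starts its columns at 2, it can help to look at how that RS splits the input. The following is only a sketch: it assumes GNU awk is installed as `gawk`, and `rs_demo.txt` and `rs_demo.out` are placeholder file names.

```shell
# Tiny sample in the question's format.
printf '(1\n 2\n 3\n)\n(11\n22\n33\n)\n(111\n222\n333\n)\n' > rs_demo.txt

if command -v gawk >/dev/null 2>&1; then
    # Same RS as in tst.awk: any run of brackets plus surrounding whitespace
    # ends a record, so each bracketed section becomes one record.
    gawk 'BEGIN { RS = "(\\s*[()]\\s*)+" }
          { n = split($0, v); printf "record %d: %d fields\n", NR, n }' \
        rs_demo.txt > rs_demo.out
    cat rs_demo.out
else
    echo "gawk not found; this sketch needs GNU awk" >&2
fi
```

With GNU awk this should report an empty record 1 (the text before the leading bracket) followed by three records of 3 fields each, one per section.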
answered Dec 31 '22 14:12 by Ed Morton


If you know you have 3 columns, you can do it in a quick-and-dirty way as follows:

pr -3ts <file>

All that needs to be done then is to remove your brackets:

$ pr -3ts ~/tmp/f | awk 'BEGIN{OFS="; "}{gsub(/[()]/,"")}(NF){$1=$1; print}'
1; 11; 111
2; 22; 222
3; 33; 333
...; ...; ...

You can also do it in a single awk line, but it just complicates things. The above is quick and easy.

The following awk program handles the general case, with any number of sections:

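awk 'BEGIN{r=c=0}
     /\)/{r=0; c++; next}               # closing bracket: start the next column
     {gsub(/[( ]/,"")}                  # strip the opening bracket and spaces
     (NF){a[r++,c]=$1; rm=rm>r?rm:r}    # store the value; track the longest column
     END{ for(i=0;i<rm;++i) {
            printf "%s", a[i,0];        # explicit format, so "%" in data is safe
            for(j=1;j<c;++j) printf "; %s", a[i,j];
            print ""
          }
     }' <file>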
answered Dec 31 '22 14:12 by kvantour