Splitting a large, complex one column file into several columns with awk

Question

I have a text file produced by commercial software that looks like the sample below. It consists of sections delimited by parentheses; each section contains several million elements, but the exact count changes from one case to another.

(1
 2
 3
...
)
(11
22
33
...
)
(111
222
333
...
)

I need to achieve an output like:

 1;  11;   111
 2;  22;   222
 3;  33;   333
...  ...  ...

I found a complicated way to do it:

perform sed operations to get

1
2
3
...
#
11
22
33
...
#
111
222
333
...
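
The sed commands themselves are not shown in the question. One possible sketch follows, assuming each opening parenthesis starts a section's first line, each closing parenthesis sits alone on its line, and the file ends with a closing parenthesis. The file names `sample.txt` and `marked.txt` and the sample data are made up for illustration.

```shell
# Hypothetical sample input in the bracketed format described above.
printf '(1\n 2\n 3\n)\n(11\n22\n33\n)\n' > sample.txt

# 1) strip leading blanks
# 2) strip the opening parenthesis
# 3) delete the last line (assumed to be the final closing parenthesis)
# 4) turn every remaining ")" line into the "#" marker
sed -e 's/^[[:space:]]*//' \
    -e 's/^[(]//' \
    -e '$d' \
    -e 's/^[)]$/#/' sample.txt > marked.txt
cat marked.txt
```

Dropping the final `)` instead of converting it avoids a trailing `#`, which would otherwise produce an extra, empty record in the awk step that follows.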

use awk as follows to split my file into several sub-files
```
awk -v RS="#" '{print > ("splitted-" NR ".txt")}'
```
remove blank lines from my sub-files, again with sed
```
sed -i '/^[[:space:]]*$/d' splitted*.txt
```
join everything together:
```
paste splitted*.txt > out.txt
```

add a field separator (defined in my bash script)

awk -v sep="$my_sep" 'BEGIN{OFS=sep}{$1=$1; print }' out.txt > formatted.txt
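
If the separator is a single character, the last two steps could probably be merged, since `paste` takes a delimiter through `-d`. A minimal sketch, with made-up contents standing in for the real sub-files:

```shell
# Stand-ins for the sub-files produced by the earlier steps.
printf '1\n2\n3\n' > splitted-1.txt
printf '11\n22\n33\n' > splitted-2.txt
printf '111\n222\n333\n' > splitted-3.txt

my_sep=';'   # assumed to be exactly one character
paste -d "$my_sep" splitted-*.txt > formatted.txt
cat formatted.txt
```

Two caveats: `paste -d` treats a longer argument as a list of delimiters used in turn, not as one multi-character separator, and the glob expands in lexical order, so `splitted-10.txt` would sort before `splitted-2.txt` once there are ten or more sections.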

I feel this is crappy, as I loop over millions of lines several times. Even if the run time is quite OK (~80 sec), I'd like to find a full awk solution but can't work one out. Something like:

awk 'BEGIN{RS="(\n)"; OFS=";"} { print something } '

I found some related questions, especially "row to column conversion with awk", but it assumes a constant number of lines between brackets, which I can't assume.

Any help would be appreciated.

Ed Morton · Accepted Answer

With GNU awk for multi-char RS and true multi dimensional arrays:

$ cat tst.awk
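BEGIN {
    # The regexp is given as a string, so the backslashes must be doubled.
    # A separator is any run of parentheses plus the whitespace around them.
    RS  = "(\\s*[()]\\s*)+"
    OFS = ";"
}
NR>1 {                      # record 1 is the empty text before the first "("
    cell[NR][1]             # create cell[NR] as a subarray so split() can fill it
    split($0,cell[NR])      # one array element per value in this section
}
END {
    # NF is the number of values in the last section; columns are records 2..NR
    for (rowNr=1; rowNr<=NF; rowNr++) {
        for (colNr=2; colNr<=NR; colNr++) {
            printf "%6s%s", cell[colNr][rowNr], (colNr<NR ? OFS : ORS)
        }
    }
}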

$ awk -f tst.awk file
     1;    11;   111
     2;    22;   222
     3;    33;   333
   ...;   ...;   ...

kvantour · Answer

If you know you have 3 columns, you can do it in a rather ugly way as follows:

pr -3ts <file>

All that needs to be done then is to remove your brackets:

$ pr -3ts ~/tmp/f | awk 'BEGIN{OFS="; "}{gsub(/[()]/,"")}(NF){$1=$1; print}'
1; 11; 111
2; 22; 222
3; 33; 333
...; ...; ...

You can also do it in a single awk line, but it just complicates things. The above is quick and easy.

The following awk program is the fully generic version:

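awk 'BEGIN{r=c=0}
     /[)]/{r=0; c++; next}                  # closing bracket: start the next column
     {gsub(/[( ]/,"")}                      # strip the opening bracket and spaces
     (NF){a[r++,c]=$1; rm=(rm>r?rm:r)}      # store the value; rm = longest column
     END{ for(i=0;i<rm;++i) {
            printf "%s", a[i,0];            # explicit format, so data is never used as one
            for(j=1;j<c;++j) printf "; %s", a[i,j];
            print ""
          }
     }' <file>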
