I have a text file produced by some commercial software, which looks like the sample below. It consists of bracket-delimited sections, each containing several million elements; the exact count changes from one case to another.
(1
2
3
...
)
(11
22
33
...
)
(111
222
333
...
)
I need to achieve an output like:
1; 11; 111
2; 22; 222
3; 33; 333
... ... ...
I found a complicated way to do it:
Perform sed operations to get:
1
2
3
...
#
11
22
33
...
#
111
222
333
...
Use awk as follows to split the file into several sub-files:
awk -v RS="#" '{print > ("splitted-" NR ".txt")}'
Remove blank lines from the sub-files, again with sed:
sed -i '/^[[:space:]]*$/d' splitted*.txt
Join everything together:
paste splitted*.txt > out.txt
Add a field separator (defined in my bash script):
awk -v sep="$my_sep" 'BEGIN{OFS=sep}{$1=$1; print }' out.txt > formatted.txt
I feel this is crappy, as I loop over millions of lines several times. Even though the run time is quite OK (~80 sec), I'd like to find a full awk solution but can't work it out. Something like:
awk 'BEGIN{RS="(\\n)"; OFS=";"} { print something } '
I found some related questions, especially "row to column conversion with awk", but it assumes a constant number of lines between brackets, which I can't rely on.
Any help would be appreciated.
With GNU awk, for its multi-char RS and true multidimensional arrays:
$ cat tst.awk
BEGIN {
    RS = "(\\s*[()]\\s*)+"
    OFS = ";"
}
NR>1 {
    cell[NR][1]
    split($0,cell[NR])
}
END {
    for (rowNr=1; rowNr<=NF; rowNr++) {
        for (colNr=2; colNr<=NR; colNr++) {
            printf "%6s%s", cell[colNr][rowNr], (colNr<NR ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file
1; 11; 111
2; 22; 222
3; 33; 333
...; ...; ...
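To see why the script above skips `NR==1` and starts `colNr` at 2, it can help to look at what the regex `RS` actually produces. The sketch below is not part of the answer: `list_sections` is a hypothetical helper name, and it assumes an awk that treats a multi-character `RS` as a regex (gawk and mawk do; POSIX leaves this unspecified). It spells `\s` out as `[ \t\n]` so it does not rely on the gawk-only escape, and it prints one line per non-empty record.

```shell
#!/bin/sh
# Sketch only: show how a regex RS carves the file into one record per section.
# list_sections is a hypothetical helper, not something from the answer above.
list_sections() {
    awk '
    # Any run of brackets and surrounding whitespace ends a record.
    BEGIN { RS = "([ \t\n]*[()][ \t\n]*)+" }
    # The text before the leading "(" is an empty record, so NF filters it out.
    NF    { printf "section %d: %d values, first=%s, last=%s\n", ++n, NF, $1, $NF }
    ' "$1"
}

# Small stand-in for the real multi-million-line file.
sample=$(mktemp)
printf '(1\n2\n3\n)\n(11\n22\n33\n)\n(111\n222\n333\n)\n' > "$sample"
list_sections "$sample"
rm -f "$sample"
```

Each section arrives as a single record whose fields are the values, because the default FS also splits on newlines. That is what lets `split($0,cell[NR])` fill one column per record.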
If you know you have 3 columns, you can do it in a very ugly way, as follows:
pr -3ts <file>
All that needs to be done then is to remove your brackets:
$ pr -3ts ~/tmp/f | awk 'BEGIN{OFS="; "}{gsub(/[()]/,"")}(NF){$1=$1; print}'
1; 11; 111
2; 22; 222
3; 33; 333
...; ...; ...
You can also do it in a single awk line, but it just complicates things. The above is quick and easy.
This awk program does the full generic version:
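awk 'BEGIN{r=c=0}
  /\)/{r=0; c++; next}                 # closing bracket: start the next column
  {gsub(/[( ]/,"")}                    # strip the opening bracket and spaces
  (NF){a[r++,c]=$1; rm=rm>r?rm:r}      # store the value; track the longest column
  END{ for(i=0;i<rm;++i) {
         printf "%s", a[i,0]           # explicit format, so data is never used as a format string
         for(j=1;j<c;++j) printf "; %s", a[i,j]
         print ""
       }
  }' <file>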