My input tab-delimited file is like this:
13435 830169 830264 a 95 y 16
09433 835620 835672 x 46
30945 838405 838620 a 21 c 19
94853 850475 850660 y 15
04958 865700 865978 c 16 a 98
After first three columns, the file shows variable and its value in the next column. I need to change data structure so that after first three columns, there are columns for variables like this:
a x y c
13435 830169 830264 95 16
09433 835620 835672 46
30945 838405 838620 21 19
94853 850475 850660 15
04958 865700 865978 98 16
Is there any code to do this on linux? The file size is 7.6 MB and the total number of lines are around 450,000. The total number of variables are four.
Thank you
Assumptions:
a/c/x/y
in the sample input) are not known beforehandawk
arrays); this allows for a single pass of the input file; if memory were to be an issue (ie, input file cannot fit in memory) then a different coding/design would be needed (not addressed in this answer)Another awk
idea ... requiring GNU awk
for the use of multi-dimensional arrays as well as the PROCINFO["sorted_in"]
construct:
awk '
BEGIN { FS=OFS="\t" } # input/output field delimiters = <tab>
{ first3[FNR]=$1 OFS $2 OFS $3 # store first 3 fields
for (i=4;i<=NF;i=i+2) { # loop through rest of fields, 2 at a time
vars[$i] # keep track of variable names
values[FNR][$i]=$(i+1) # store the value for this line/variable combo
}
}
END { PROCINFO["sorted_in"]="@ind_str_asc" # sort vars[] indexes in ascending order
printf "%s%s", OFS, OFS # start printing header line ...
for (v in vars) # loop through variable names ...
printf "%s%s", OFS, v # printing to header line
printf "\n" # terminate header line
for (i=1;i<=FNR;i++) { # loop through our set of lines ...
printf "%s",first3[i] # print the 1st 3 fields and then ...
for (v in vars) # loop through list of all variables ...
printf "%s%s",OFS,values[i][v] # printing the associated value; non-existent values default to the empty string ""
printf "\n" # terminate the current line of output
}
}
' inputfile
NOTE: This design allows for a variable number of variables to be processed.
For demonstration purposes we'll use the following tab-delimited input files:
$ cat input4 # OP's sample input file w/ 4 variables
13435 830169 830264 a 95 y 16
09433 835620 835672 x 46
30945 838405 838620 a 21 c 19
94853 850475 850660 y 15
04958 865700 865978 c 16 a 98
$ cat input6 # 2 additional variables added to OP's original input file
13435 830169 830264 a 95 y 16
09433 835620 835672 x 46 t 375
30945 838405 838620 a 21 c 19
94853 850475 850660 y 15 j 127 t 453
04958 865700 865978 c 16 a 98
Running these through the awk
script generates:
############# input4
a c x y
13435 830169 830264 95 16
09433 835620 835672 46
30945 838405 838620 21 19
94853 850475 850660 15
04958 865700 865978 98 16
############# input6
a c j t x y
13435 830169 830264 95 16
09433 835620 835672 375 46
30945 838405 838620 21 19
94853 850475 850660 127 453 15
04958 865700 865978 98 16
In pure bash (requires bash 4.0 or newer):
#!/bin/bash
declare -A var
printf '\t\t\ta\tx\ty\tc\n'
while IFS=$'\t' read -ra fld; do
var[a]="" var[x]="" var[y]="" var[c]=""
for ((i = 3; i < ${#fld[@]}; i += 2)); do
var["${fld[i]}"]=${fld[i + 1]}
done
printf '%s\t' "${fld[@]:0:3}"
printf '%s\t%s\t%s\t%s\n' "${var[a]}" "${var[x]}" "${var[y]}" "${var[c]}"
done < file
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With