Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Converting data in rows to columns

My input tab-delimited file is like this:

13435    830169  830264  a    95   y    16
09433    835620  835672  x    46
30945    838405  838620  a    21   c    19
94853    850475  850660  y    15
04958    865700  865978  c    16   a    98

After first three columns, the file shows variable and its value in the next column. I need to change data structure so that after first three columns, there are columns for variables like this:

                         a    x    y    c   
13435    830169  830264  95        16
09433    835620  835672       46
30945    838405  838620  21             19
94853    850475  850660            15
04958    865700  865978  98             16

Is there any code to do this on linux? The file size is 7.6 MB and the total number of lines are around 450,000. The total number of variables are four.

Thank you

like image 471
Ali_A423 Avatar asked Jan 25 '23 04:01

Ali_A423


2 Answers

Assumptions:

  • the four variable names (a/c/x/y in the sample input) are not known beforehand
  • a variable always has a non-blank value following it
  • the number of variable/value pairs (on a single input line) is not known beforehand
  • OP is ok with printing the variable columns in alphabetical order (OP's desired output does not designate if/how the four variable columns are to be sorted)
  • ordering of rows is to remain the same (input order == output order)
  • host has enough memory to hold entire input file in memory (via awk arrays); this allows for a single pass of the input file; if memory were to be an issue (ie, input file cannot fit in memory) then a different coding/design would be needed (not addressed in this answer)

Another awk idea ... requiring GNU awk for the use of multi-dimensional arrays as well as the PROCINFO["sorted_in"] construct:

awk '
BEGIN { FS=OFS="\t" }                             # input/output field delimiters = <tab>

      { first3[FNR]=$1 OFS $2 OFS $3              # store first 3 fields

        for (i=4;i<=NF;i=i+2) {                   # loop through rest of fields, 2 at a time
            vars[$i]                              # keep track of variable names
            values[FNR][$i]=$(i+1)                # store the value for this line/variable combo
        }
      }

END   { PROCINFO["sorted_in"]="@ind_str_asc"      # sort vars[] indexes in ascending order

        printf "%s%s", OFS, OFS                   # start printing header line ...
        for (v in vars)                           # loop through variable names ...
            printf "%s%s", OFS, v                 # printing to header line
        printf "\n"                               # terminate header line

        for (i=1;i<=FNR;i++) {                    # loop through our set of lines ...
            printf "%s",first3[i]                 # print the 1st 3 fields and then ...
            for (v in vars)                       # loop through list of all variables ...
                printf "%s%s",OFS,values[i][v]    # printing the associated value; non-existent values default to the empty string ""
            printf "\n"                           # terminate the current line of output
        }
      }
' inputfile

NOTE: This design allows for a variable number of variables to be processed.

For demonstration purposes we'll use the following tab-delimited input files:

$ cat input4                                         # OP's sample input file w/ 4 variables
13435   830169  830264  a       95      y       16
09433   835620  835672  x       46
30945   838405  838620  a       21      c       19
94853   850475  850660  y       15
04958   865700  865978  c       16      a       98

$ cat input6                                         # 2 additional variables added to OP's original input file
13435   830169  830264  a       95      y       16
09433   835620  835672  x       46      t       375
30945   838405  838620  a       21      c       19
94853   850475  850660  y       15      j       127     t       453
04958   865700  865978  c       16      a       98

Running these through the awk script generates:

############# input4
                        a       c       x       y
13435   830169  830264  95                      16
09433   835620  835672                  46
30945   838405  838620  21      19
94853   850475  850660                          15
04958   865700  865978  98      16

############# input6
                        a       c       j       t       x       y
13435   830169  830264  95                                      16
09433   835620  835672                          375     46
30945   838405  838620  21      19
94853   850475  850660                  127     453             15
04958   865700  865978  98      16
like image 188
markp-fuso Avatar answered Jan 29 '23 09:01

markp-fuso


In pure bash (requires bash 4.0 or newer):

#!/bin/bash

declare -A var

printf '\t\t\ta\tx\ty\tc\n'
while IFS=$'\t' read -ra fld; do
    var[a]=""  var[x]=""  var[y]=""  var[c]=""
    for ((i = 3; i < ${#fld[@]}; i += 2)); do
        var["${fld[i]}"]=${fld[i + 1]}
    done
    printf '%s\t' "${fld[@]:0:3}"
    printf '%s\t%s\t%s\t%s\n' "${var[a]}" "${var[x]}" "${var[y]}" "${var[c]}"
done < file
like image 41
M. Nejat Aydin Avatar answered Jan 29 '23 09:01

M. Nejat Aydin