Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to awk every nth line starting from different lines each iteration

Tags:

awk

I would like awk to print every nth line out of a file starting from line 0. Then, after awk has gone through the whole file, I would like it to print every nth line starting from line 1...then print every nth line starting from line 2...etc, up to printing every nth line starting from line n-1. My sad attempt thus far:

#!/bin/bash

rm *.sad *.sadd *.out

#Create loop index
for i in $(seq 20 1 36);
do
        listm+=($i)
done

#Create input file
for j in "${listm[@]}"
do
        if [ $j -eq 20 ];
        then
                awk 'NR % 20 == 0' vel_VMDout > atomvel.dat
                awk '{print $2,$3,$4}' atomvel.dat > velocity.dat
        else
                awk 'NR % 20 == 1' vel_VMDout  > $j.sad
                egrep -v "^[[:space:]]*$|^#" $j.sad > $j.sadd
                awk '{print $2, $3, $4}' $j.sadd > $j.out
                paste velocity.dat $j.out  > taste
        fi
done

Let me try to clarify this by providing the input and what the output should look like. Th input is an xyz file of an MD simulation consisting of frames of the atoms' xyz coordinates.

INPUT:

input

This image shows the 1st snapshot and part of the second snapshot. Because these are snapshot, the ordering of the atoms do not change. Thus, I am trying to print the xyz coordinates from each snapshot for each specific atom in their own columns as shown below. This would eventually make a file consisting of 3N columns, where N is the number of atoms.

OUTPUT:

output

As you can see, the each atoms' coordinates are in their own columns and the total file is a Nx3N array. My bash script was me trying to do this, but could only do the first two atoms. I wanted to print every nth line (coordinates of the nth atom) so they look like the output. I really appreciate your patience all.

like image 645
Jack_Bandit Avatar asked Nov 01 '22 00:11

Jack_Bandit


1 Answers

Generating sample data

This is a step that should not be necessary; the question should have included usable sample data and the required output from that sample data.

At one level, it won't help much because you don't have my random number generator program, but the script below shows how I generated the data that follows, and it illustrates the lengths to which it might be necessary to go when the question doesn't supply readable data. I generated some data that looks similar to the data in the question (at least superficially):

18
 Generated by VMD in absentia
 C     0.979485   -6.665347   0.575383
 C     1.191999   -3.002386   2.859484
 C     3.151517   -5.610077   0.429413
 C     3.439828   -6.454984   1.319724
 C     3.726201   -0.123038   2.096854
 C     1.363325   -3.031238   0.016019
 C     6.090283   -3.915340   2.396358
 C     0.407755   -7.957784   -0.846842
 C     0.203074   -0.796428   2.659573
 O     2.600610   -2.259674   -0.260378
 O     4.773839   -6.765097   0.588508
 H     2.743424   -2.890016   2.906452
 H     2.810233   -6.641054   -0.797672
 H     6.854169   -3.191721   -0.925670
 O     2.914233   -1.060001   0.776983
 H     3.803923   -1.497032   2.908799
 H     5.669443   -7.227666   -0.647552
 H     0.092455   -5.850637   2.959987
18
 Generated by VMD in absentia
 C     6.042840   -7.254720   2.093573
 C     2.551942   -6.044322   2.061072
 C     3.523150   -6.167163   2.451689
 C     5.197316   -3.429866   -0.412062
 C     2.548777   -6.422851   1.282846
 C     3.775197   -2.012031   1.377440
 C     3.405112   -3.206415   -0.879886
 C     1.448359   -5.419629   0.467291
 C     3.661964   -2.789234   2.644294
 O     4.214854   -2.439574   -0.951704
 O     5.297609   -2.320418   2.709898
 H     2.653940   -4.431080   -0.511743
 H     5.040635   -0.676199   -0.590970
 H     1.546725   -1.294582   2.562937
 O     4.231461   -7.180908   1.629901
 H     3.297836   -1.557133   -0.133280
 H     3.442481   -4.489962   2.111930
 H     1.423611   -7.982655   0.715618
18
 Generated by VMD in absentia
 C     1.432495   -7.686243   2.525734
 C     5.038409   -4.976270   2.826846
 C     6.184137   -7.303094   2.711561
 C     3.208125   -0.606556   1.978725
 C     2.171859   -6.792060   0.678988
 C     6.521124   -5.622797   -0.773797
 C     1.725619   -5.768633   -0.223397
 C     3.602427   -2.325680   1.762008
 C     1.937521   -1.686895   1.743159
 O     0.745526   -0.114246   -0.949490
 O     4.754360   -6.531145   1.998913
 H     1.114732   -1.158810   1.486939
 H     6.410490   -5.411647   0.062737
 H     4.164330   -6.743763   1.802804
 O     2.587841   -3.979700   2.609748
 H     2.192073   -2.815376   -0.809569
 H     5.501795   -2.326438   1.325829
 H     3.285032   -1.212541   1.284453
18
 Generated by VMD in absentia
 C     3.564424   -3.117406   -0.032879
 C     2.894745   -0.632591   0.532311
 C     3.384916   -5.383135   1.179585
 C     0.793488   -0.894539   -0.886891
 C     1.348785   -6.501867   1.648604
 C     2.189941   -2.438067   0.616090
 C     2.043378   -4.966472   0.691603
 C     3.124161   -5.792896   0.545362
 C     5.741472   -0.640590   2.825374
 O     0.300550   -7.149663   0.942726
 O     1.344387   -0.121382   2.169401
 H     4.963296   -0.964665   -0.230523
 H     6.651423   -4.905053   2.509626
 H     5.059694   -6.166516   0.102255
 O     5.046864   -3.288883   0.853948
 H     2.389007   -3.057664   1.806301
 H     2.365876   -0.956860   1.458959
 H     2.892502   -0.097422   -0.531714

The script I used to do it was:

random -n $((4 * 18)) -T '%8:6[0:7]F   %8:6[-8:0]F   %8:6[-1:3]F' |
awk 'BEGIN { n = split("CCCCCCCCCOOHHHOHHH", atoms, ""); atoms[0] = atoms[n] }
     NR % n == 1 { print n; print " Generated by VMD in absentia" }
     { print "", atoms[NR%18], "   ", $0 }'

The -n option to random says how many rows to generate; I chose 72. The -T option is a template, and the notation %8:6[0:7]F means use %8.6F format to print uniformly distributed random numbers between 0 and 7. The awk script takes the data that is so generated and interpolates the noise (the number of atoms and a variant on the 'generated by VMD' line), as well as tagging the lines with the appropriate atomic symbol.

Processing the sample data

Given some data, you then need to munge it to get the required output. This script more or less does the job. There are endless ways it should be improved, of course, such as taking file names as command line arguments, using temporary file names instead of fixed names, cleaning up the intermediate files, different compounds, different atoms (nitrogen, phosphorous, etc), and so on. However, it should adapt reasonably easily.

input="data"
output="output"
n=$(sed 1q "$input")
n2=$(($n+2))

for ((i = 3; i <= n2; i++))
do
    colno=$(printf "%.2d" $(($i-2)))
    awk -v N=$n2 -v R=$i \
        '   BEGIN { name["C"] = "Carbon"; name["H"] = "Hydrogen"; name["O"] = "Oxygen";
                    R0 = R % N }
            NR > 2 && NR <= R { count[$1]++; }
            NR == R { printf "%-32.32s\n", name[$1] " " count[$1]; }
            NR % N == R0 { xyz = sprintf("%s %s %s", $2, $3, $4); printf "%-32.32s\n", xyz }
        ' "$input" > "column.$colno"
done

paste -d ' ' column.* > "$output"

The first four lines set up the control parameters, collecting the number of lines per unit of data from the input file, and adjusting things accordingly. The for loop iterates over offsets 3 to $n2 inclusive (skipping the two header lines), and runs the awk script. That encodes atom types (BEGIN), determines which atom it is processing this time (NR > 2 && NR <= R and NR == R), and then arranges to print the triplets of data for the relevant atom. The formatting is carefully organized so that the column headings and the actual xyz-triplets are uniformly spaced. These are written to a file column.$colno. When all's done, the column.* files are pasted to generate a single output file, which looks like this:

Carbon 1                         Carbon 2                         Carbon 3                         Carbon 4                         Carbon 5                         Carbon 6                         Carbon 7                         Carbon 8                         Carbon 9                         Oxygen 1                         Oxygen 2                         Hydrogen 1                       Hydrogen 2                       Hydrogen 3                       Oxygen 3                         Hydrogen 4                       Hydrogen 5                       Hydrogen 6                      
0.979485 -6.665347 0.575383      1.191999 -3.002386 2.859484      3.151517 -5.610077 0.429413      3.439828 -6.454984 1.319724      3.726201 -0.123038 2.096854      1.363325 -3.031238 0.016019      6.090283 -3.915340 2.396358      0.407755 -7.957784 -0.846842     0.203074 -0.796428 2.659573      2.600610 -2.259674 -0.260378     4.773839 -6.765097 0.588508      2.743424 -2.890016 2.906452      2.810233 -6.641054 -0.797672     6.854169 -3.191721 -0.925670     2.914233 -1.060001 0.776983      3.803923 -1.497032 2.908799      5.669443 -7.227666 -0.647552     0.092455 -5.850637 2.959987     
6.042840 -7.254720 2.093573      2.551942 -6.044322 2.061072      3.523150 -6.167163 2.451689      5.197316 -3.429866 -0.412062     2.548777 -6.422851 1.282846      3.775197 -2.012031 1.377440      3.405112 -3.206415 -0.879886     1.448359 -5.419629 0.467291      3.661964 -2.789234 2.644294      4.214854 -2.439574 -0.951704     5.297609 -2.320418 2.709898      2.653940 -4.431080 -0.511743     5.040635 -0.676199 -0.590970     1.546725 -1.294582 2.562937      4.231461 -7.180908 1.629901      3.297836 -1.557133 -0.133280     3.442481 -4.489962 2.111930      1.423611 -7.982655 0.715618     
1.432495 -7.686243 2.525734      5.038409 -4.976270 2.826846      6.184137 -7.303094 2.711561      3.208125 -0.606556 1.978725      2.171859 -6.792060 0.678988      6.521124 -5.622797 -0.773797     1.725619 -5.768633 -0.223397     3.602427 -2.325680 1.762008      1.937521 -1.686895 1.743159      0.745526 -0.114246 -0.949490     4.754360 -6.531145 1.998913      1.114732 -1.158810 1.486939      6.410490 -5.411647 0.062737      4.164330 -6.743763 1.802804      2.587841 -3.979700 2.609748      2.192073 -2.815376 -0.809569     5.501795 -2.326438 1.325829      3.285032 -1.212541 1.284453     
3.564424 -3.117406 -0.032879     2.894745 -0.632591 0.532311      3.384916 -5.383135 1.179585      0.793488 -0.894539 -0.886891     1.348785 -6.501867 1.648604      2.189941 -2.438067 0.616090      2.043378 -4.966472 0.691603      3.124161 -5.792896 0.545362      5.741472 -0.640590 2.825374      0.300550 -7.149663 0.942726      1.344387 -0.121382 2.169401      4.963296 -0.964665 -0.230523     6.651423 -4.905053 2.509626      5.059694 -6.166516 0.102255      5.046864 -3.288883 0.853948      2.389007 -3.057664 1.806301      2.365876 -0.956860 1.458959      2.892502 -0.097422 -0.531714    

Your task is to understand why all the bits of the awk script are present. For example, why is R0 needed (hint, experiment without the R0 calculation, and use R in its place).

like image 178
Jonathan Leffler Avatar answered Nov 08 '22 04:11

Jonathan Leffler