I would like awk
to print every nth line out of a file starting from line 0. Then, after awk has gone through the whole file, I would like it to print every nth line starting from line 1...then print every nth line starting from line 2...etc, up to printing every nth line starting from line n-1. My sad attempt thus far:
#!/bin/bash
rm *.sad *.sadd *.out
#Create loop index
for i in $(seq 20 1 36);
do
listm+=($i)
done
#Create input file
for j in "${listm[@]}"
do
if [ $j -eq 20 ];
then
awk 'NR % 20 == 0' vel_VMDout > atomvel.dat
awk '{print $2,$3,$4}' atomvel.dat > velocity.dat
else
awk 'NR % 20 == 1' vel_VMDout > $j.sad
egrep -v "^[[:space:]]*$|^#" $j.sad > $j.sadd
awk '{print $2, $3, $4}' $j.sadd > $j.out
paste velocity.dat $j.out > taste
fi
done
Let me try to clarify this by providing the input and what the output should look like. Th input is an xyz file of an MD simulation consisting of frames of the atoms' xyz coordinates.
INPUT:
This image shows the 1st snapshot and part of the second snapshot. Because these are snapshot, the ordering of the atoms do not change. Thus, I am trying to print the xyz coordinates from each snapshot for each specific atom in their own columns as shown below. This would eventually make a file consisting of 3N columns, where N is the number of atoms.
OUTPUT:
As you can see, the each atoms' coordinates are in their own columns and the total file is a Nx3N array. My bash script was me trying to do this, but could only do the first two atoms. I wanted to print every nth line (coordinates of the nth atom) so they look like the output. I really appreciate your patience all.
This is a step that should not be necessary; the question should have included usable sample data and the required output from that sample data.
At one level, it won't help much because you don't have my random number generator program, but the script below shows how I generated the data that follows, and it illustrates the lengths to which it might be necessary to go when the question doesn't supply readable data. I generated some data that looks similar to the data in the question (at least superficially):
18
Generated by VMD in absentia
C 0.979485 -6.665347 0.575383
C 1.191999 -3.002386 2.859484
C 3.151517 -5.610077 0.429413
C 3.439828 -6.454984 1.319724
C 3.726201 -0.123038 2.096854
C 1.363325 -3.031238 0.016019
C 6.090283 -3.915340 2.396358
C 0.407755 -7.957784 -0.846842
C 0.203074 -0.796428 2.659573
O 2.600610 -2.259674 -0.260378
O 4.773839 -6.765097 0.588508
H 2.743424 -2.890016 2.906452
H 2.810233 -6.641054 -0.797672
H 6.854169 -3.191721 -0.925670
O 2.914233 -1.060001 0.776983
H 3.803923 -1.497032 2.908799
H 5.669443 -7.227666 -0.647552
H 0.092455 -5.850637 2.959987
18
Generated by VMD in absentia
C 6.042840 -7.254720 2.093573
C 2.551942 -6.044322 2.061072
C 3.523150 -6.167163 2.451689
C 5.197316 -3.429866 -0.412062
C 2.548777 -6.422851 1.282846
C 3.775197 -2.012031 1.377440
C 3.405112 -3.206415 -0.879886
C 1.448359 -5.419629 0.467291
C 3.661964 -2.789234 2.644294
O 4.214854 -2.439574 -0.951704
O 5.297609 -2.320418 2.709898
H 2.653940 -4.431080 -0.511743
H 5.040635 -0.676199 -0.590970
H 1.546725 -1.294582 2.562937
O 4.231461 -7.180908 1.629901
H 3.297836 -1.557133 -0.133280
H 3.442481 -4.489962 2.111930
H 1.423611 -7.982655 0.715618
18
Generated by VMD in absentia
C 1.432495 -7.686243 2.525734
C 5.038409 -4.976270 2.826846
C 6.184137 -7.303094 2.711561
C 3.208125 -0.606556 1.978725
C 2.171859 -6.792060 0.678988
C 6.521124 -5.622797 -0.773797
C 1.725619 -5.768633 -0.223397
C 3.602427 -2.325680 1.762008
C 1.937521 -1.686895 1.743159
O 0.745526 -0.114246 -0.949490
O 4.754360 -6.531145 1.998913
H 1.114732 -1.158810 1.486939
H 6.410490 -5.411647 0.062737
H 4.164330 -6.743763 1.802804
O 2.587841 -3.979700 2.609748
H 2.192073 -2.815376 -0.809569
H 5.501795 -2.326438 1.325829
H 3.285032 -1.212541 1.284453
18
Generated by VMD in absentia
C 3.564424 -3.117406 -0.032879
C 2.894745 -0.632591 0.532311
C 3.384916 -5.383135 1.179585
C 0.793488 -0.894539 -0.886891
C 1.348785 -6.501867 1.648604
C 2.189941 -2.438067 0.616090
C 2.043378 -4.966472 0.691603
C 3.124161 -5.792896 0.545362
C 5.741472 -0.640590 2.825374
O 0.300550 -7.149663 0.942726
O 1.344387 -0.121382 2.169401
H 4.963296 -0.964665 -0.230523
H 6.651423 -4.905053 2.509626
H 5.059694 -6.166516 0.102255
O 5.046864 -3.288883 0.853948
H 2.389007 -3.057664 1.806301
H 2.365876 -0.956860 1.458959
H 2.892502 -0.097422 -0.531714
The script I used to do it was:
random -n $((4 * 18)) -T '%8:6[0:7]F %8:6[-8:0]F %8:6[-1:3]F' |
awk 'BEGIN { n = split("CCCCCCCCCOOHHHOHHH", atoms, ""); atoms[0] = atoms[n] }
NR % n == 1 { print n; print " Generated by VMD in absentia" }
{ print "", atoms[NR%18], " ", $0 }'
The -n
option to random
says how many rows to generate; I chose 72. The -T
option is a template, and the notation %8:6[0:7]F
means use %8.6F
format to print uniformly distributed random numbers between 0 and 7. The awk
script takes the data that is so generated and interpolates the noise (the number of atoms and a variant on the 'generated by VMD' line), as well as tagging the lines with the appropriate atomic symbol.
Given some data, you then need to munge it to get the required output. This script more or less does the job. There are endless ways it should be improved, of course, such as taking file names as command line arguments, using temporary file names instead of fixed names, cleaning up the intermediate files, different compounds, different atoms (nitrogen, phosphorous, etc), and so on. However, it should adapt reasonably easily.
input="data"
output="output"
n=$(sed 1q "$input")
n2=$(($n+2))
for ((i = 3; i <= n2; i++))
do
colno=$(printf "%.2d" $(($i-2)))
awk -v N=$n2 -v R=$i \
' BEGIN { name["C"] = "Carbon"; name["H"] = "Hydrogen"; name["O"] = "Oxygen";
R0 = R % N }
NR > 2 && NR <= R { count[$1]++; }
NR == R { printf "%-32.32s\n", name[$1] " " count[$1]; }
NR % N == R0 { xyz = sprintf("%s %s %s", $2, $3, $4); printf "%-32.32s\n", xyz }
' "$input" > "column.$colno"
done
paste -d ' ' column.* > "$output"
The first four lines set up the control parameters, collecting the number of lines per unit of data from the input file, and adjusting things accordingly. The for
loop iterates over offsets 3
to $n2
inclusive (skipping the two header lines), and runs the awk
script. That encodes atom types (BEGIN
), determines which atom it is processing this time (NR > 2 && NR <= R
and NR == R
), and then arranges to print the triplets of data for the relevant atom. The formatting is carefully organized so that the column headings and the actual xyz-triplets are uniformly spaced. These are written to a file column.$colno
. When all's done, the column.*
files are pasted to generate a single output file, which looks like this:
Carbon 1 Carbon 2 Carbon 3 Carbon 4 Carbon 5 Carbon 6 Carbon 7 Carbon 8 Carbon 9 Oxygen 1 Oxygen 2 Hydrogen 1 Hydrogen 2 Hydrogen 3 Oxygen 3 Hydrogen 4 Hydrogen 5 Hydrogen 6
0.979485 -6.665347 0.575383 1.191999 -3.002386 2.859484 3.151517 -5.610077 0.429413 3.439828 -6.454984 1.319724 3.726201 -0.123038 2.096854 1.363325 -3.031238 0.016019 6.090283 -3.915340 2.396358 0.407755 -7.957784 -0.846842 0.203074 -0.796428 2.659573 2.600610 -2.259674 -0.260378 4.773839 -6.765097 0.588508 2.743424 -2.890016 2.906452 2.810233 -6.641054 -0.797672 6.854169 -3.191721 -0.925670 2.914233 -1.060001 0.776983 3.803923 -1.497032 2.908799 5.669443 -7.227666 -0.647552 0.092455 -5.850637 2.959987
6.042840 -7.254720 2.093573 2.551942 -6.044322 2.061072 3.523150 -6.167163 2.451689 5.197316 -3.429866 -0.412062 2.548777 -6.422851 1.282846 3.775197 -2.012031 1.377440 3.405112 -3.206415 -0.879886 1.448359 -5.419629 0.467291 3.661964 -2.789234 2.644294 4.214854 -2.439574 -0.951704 5.297609 -2.320418 2.709898 2.653940 -4.431080 -0.511743 5.040635 -0.676199 -0.590970 1.546725 -1.294582 2.562937 4.231461 -7.180908 1.629901 3.297836 -1.557133 -0.133280 3.442481 -4.489962 2.111930 1.423611 -7.982655 0.715618
1.432495 -7.686243 2.525734 5.038409 -4.976270 2.826846 6.184137 -7.303094 2.711561 3.208125 -0.606556 1.978725 2.171859 -6.792060 0.678988 6.521124 -5.622797 -0.773797 1.725619 -5.768633 -0.223397 3.602427 -2.325680 1.762008 1.937521 -1.686895 1.743159 0.745526 -0.114246 -0.949490 4.754360 -6.531145 1.998913 1.114732 -1.158810 1.486939 6.410490 -5.411647 0.062737 4.164330 -6.743763 1.802804 2.587841 -3.979700 2.609748 2.192073 -2.815376 -0.809569 5.501795 -2.326438 1.325829 3.285032 -1.212541 1.284453
3.564424 -3.117406 -0.032879 2.894745 -0.632591 0.532311 3.384916 -5.383135 1.179585 0.793488 -0.894539 -0.886891 1.348785 -6.501867 1.648604 2.189941 -2.438067 0.616090 2.043378 -4.966472 0.691603 3.124161 -5.792896 0.545362 5.741472 -0.640590 2.825374 0.300550 -7.149663 0.942726 1.344387 -0.121382 2.169401 4.963296 -0.964665 -0.230523 6.651423 -4.905053 2.509626 5.059694 -6.166516 0.102255 5.046864 -3.288883 0.853948 2.389007 -3.057664 1.806301 2.365876 -0.956860 1.458959 2.892502 -0.097422 -0.531714
Your task is to understand why all the bits of the awk
script are present. For example, why is R0
needed (hint, experiment without the R0
calculation, and use R
in its place).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With