I have huge binary matrices with many columns, and I am trying to get a count of the zeroes and ones in each column, for each file, while keeping track of the file name and the headers. Each file has the same headers and number of columns (but a variable number of rows) and looks like this:
File 1:
Header1 Header2 Header3 Header4
0 1 0 1
0 1 0 1
1 0 0 1
0 1 0 1
File 2:
Header1 Header2 Header3 Header4
0 1 0 0
0 0 0 0
0 0 0 1
Desired output, with the count of 1s in each column and the number of data rows per file in the Total column:
Header1 Header2 Header3 Header4 Total
File1 1 3 0 4 4
File2 0 1 0 1 3
For now I have a count of the values equal to one, but only for file1, and the output has one row per header instead of keeping the original headers as column headers; it also doesn't print a 0 when a column has no ones, and it doesn't contain the original file name, so it is not right overall. Can you please direct me to the correct way to do this?
awk 'NF>0{
  for (i=1; i<=NF; i++)
    if (NR==1) h[i]=$i; else if ($i==1) a[i]++;
} END{ for (i=1; i<=length(a); i++) print h[i], a[i], NR }' file1
The following seems to work for me:
awk '
# Gather headers, only from the first line of the first file.
NR==1{
    for(i=1;i<=NF;i++){
        h[i]=$i;
    }
}
# Do not process header lines as if they were data.
FNR==1{ next; }
NF>limit{ limit=NF; }
# Step through the data.
{
    f[FILENAME]++;
    for(i=1;i<=NF;i++){
        a[FILENAME,i]+=$i;
    }
}
# Display what we found.
END{
    # Headers...
    printf("File\t");
    for(i=1;i<=length(h);i++){
        printf("%s\t",h[i])
    }
    print "Total";
    # And the data.
    for(file in f){
        printf("%s",file);
        for(i=1;i<=limit;i++){
            printf("\t%d",a[file,i])
        }
        printf("\t%d\n",f[file]);
    }
}' file1 file2
Note that we're keeping an array f[] for filenames because awk doesn't really support multidimensional arrays. The script above should work in just about any old awk. (I tested it in FreeBSD.) You might run into a problem if you're processing millions of files, though, as the arrays use a non-zero amount of memory. On the other hand, the number of files is also limited by your shell's command-line length. :-)
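As an aside, this is why f[] is needed at all: an index like a[FILENAME,i] is not a true two-dimensional lookup; awk joins the two subscripts with SUBSEP into a single string key, so the file names cannot be enumerated from a[] directly. A minimal standalone sketch of that behaviour (not part of the script above):
awk 'BEGIN{
    a["file1",2] = 3              # really stored under the key "file1" SUBSEP "2"
    for (k in a) {
        split(k, idx, SUBSEP)     # recover the two original subscripts
        print idx[1], idx[2], a[k]
    }
}'
This prints "file1 2 3"; the main script sidesteps the splitting by keeping the file names in f[], which doubles as the per-file row counter.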
One thing I'm not sure of is whether your header count always matches the number of fields in your data; if it ever doesn't, the length(h) loop and the limit loop above will disagree. But perhaps this is sufficient for you to take it the rest of the way.
Assuming that the example files do, in fact, have as many headers as there are data columns, the following code works for me:
#!/bin/sh
awk '
# Print one line of output and reset the per-column counters: without a
# filename, print the header line (taken from the current record); with a
# filename, print the per-column counts of 1s for that file followed by
# its data-row count.
function pr(filename) {
    if (filename) printf ("%s", filename)
    for (i=1; i<=NF; i++) {
        if (filename)
            printf ("%s%s", OFS, a[i])
        else
            printf ("%s%s", OFS, $i)
        a[i] = 0
    }
    if (filename)
        printf ("%s%s", OFS, prevFNR-1)
    else
        printf ("%sTotal", OFS)
    printf ("\n")
}
# A new file is starting: flush the results for the previous file
# (or print the header line, the first time through).
FNR==1 {
    pr(prevFileName)
    prevFileName = FILENAME
    next
}
# Count the 1s in every column of a data row.
NF>0 {
    for (i=1; i<=NF; i++)
        if ($i==1) a[i]++
    prevFNR = FNR
}
# Flush the results for the last file.
END {
    pr(FILENAME)
}' file1 file2
There is no way to know that the program has reached the last line of any given file, but FNR==1 is true at the start of the next file, so I've used that to trigger printing each line. For that reason, the program uses prevFNR and prevFileName to remember the number of records in the previous file and the name of that file for display. The printing code is called from two different places, so I've put it in a function, pr(), which uses the fact that prevFileName doesn't have a value the first time FNR==1 to indicate that it should print the header line rather than the computed summary information.
The output is:
Header1 Header2 Header3 Header4 Total
file1 1 3 0 4 4
file2 0 1 0 1 3
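For reference, the per-file boundary trick can be seen on its own in a sketch like the following, which only counts data lines per file (assuming the sample files file1 and file2 from the question):
awk '
FNR==1 && NR>1 { print prev, n; n=0 }   # a later file is starting: report the previous one
FNR>1          { n++ }                  # count data lines, skipping each header line
               { prev = FILENAME }
END            { print prev, n }        # the last file never hits another FNR==1, so report it here
' file1 file2
With the sample input it prints "file1 4" and "file2 3", the same values as the Total column above.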