Print a count of unique values for each column in many files

Question

I have huge binary matrices of many columns, and I am trying to get a count of zeroes and ones in each field, for each file, while keeping track of the file and the headers. Each file has the same headers and number of columns (but variable number of rows) and it is like this:

File 1:
Header1 Header2 Header3 Header4
0 1 0 1 
0 1 0 1
1 0 0 1
0 1 0 1

File 2:
Header1 Header2 Header3 Header4
0 1 0 0 
0 0 0 0
0 0 0 1

Desired output with counts of 0/1

    Header1 Header2 Header3 Header4 Total
File1 1 3 0 4  4
File2 0 1 0 1  3

For now I have a count of values equal to ones for only file1, but it comes out with each row as a header while I would like the original headers to stay as headers and also this doesn't print a 0 if there are no ones... And it does not contain the original file name, so it is not right overall! Can you please direct me to the correct way to do this?

awk 'NF>0{
  for (i=1; i<=NF; i++) 
      if(NR==1)h[i]=$i;else if($i==1) a[i]++;
  } END{for(i=1; i<=length(a); i++) print h[i], a[i], NR}' file1

ghoti · Accepted Answer

The following seems to work for me:

awk '
  # Gather headers, only from the first line of the first file.
  NR==1{
    for(i=1;i<=NF;i++){
      h[i]=$i;
    }
  }
  # Do not process header as if they were data.
  FNR==1{ next; }

  NF>limit{ limit=NF; }

  # Step through data 
  {
    f[FILENAME]++;
    for(i=1;i<=NF;i++){
      a[FILENAME,i]+=$i;
    }
  }

  # Display what we found.
  END{
    # Headers...
    printf("File	");
    for(i=1;i<=length(h);i++){
      printf("%s	",h[i])
    }
    print "Total";

    # And data.
    for(file in f){
      printf("%s",file);
      for(i=1;i<=limit;i++){
        printf("	%d",a[file,i])
      }
      printf("	%d
",f[file]);
    }
  }' file1 file2

Note that we're keeping an array f[] for filenames because awk doesn't really support multidimensional arrays. The script above should work in just about any old awk. (I tested it in FreeBSD.) Though you might run in to a problem if you're processing millions of files, as the arrays use a non-zero amount of memory. On the other hand, the number of files is also limited by your shell's command line length. :-)

One thing I'm not sure of is why your header count does not match the number of fields in your data. But perhaps this is sufficient for you to take it the rest of the way.

Simon · Answer

Assuming that the example files should, in fact, have as many headers as there are columns (the example has four columns but only three headers), the following code works for me:

#!/bin/sh
awk '
    function pr(filename) {
        if (filename) printf ("%s",filename)
        for (i=1; i<=NF; i++) {
            if (filename)
                printf ("%s%s",OFS,a[i])
            else
                printf ("%s%s",OFS,$i) 
            a[i] = 0
            }
        if (filename)
            printf ("%s%s",OFS,prevFNR-1) 
        else 
            printf ("%sTotal",OFS)
        printf ("
")
        }

    FNR==1  {
            pr(prevFileName)
            prevFileName = FILENAME
            next
            }

    NF>0    {
            for (i=1; i<=NF; i++) 
                if ($i==1) a[i]++
            prevFNR = FNR
            } 

    END {
        pr(FILENAME)
        }' file1 file2

There is no way to know that the program has reached the last line of any given file but FNR==1 is true at the start of the next file, so I've used that to trigger printing each line. For that reason, the program uses prevFNR and prevFileName to remember the number of records in the previous file and the name of that file for display. The printing code is called from two different places so I've put it in a function, pr(), which uses the fact that prevFileName doesn't have a value the first time FNR==1 to indicate that it should print the header line rather than the computed summary information.

The output is:

 Header1 Header2 Header3 Header4 Total
file1 1 3 0 4 4
file2 0 1 0 1 3

Print a count of unique values for each column in many files

Tags:

linux

awk

user971102

2 Answers

ghoti

Simon

Recent Activity

Donate For Us

Print a count of unique values for each column in many files

Tags:

linux

awk

user971102

2 Answers

ghoti

Simon

Related questions

Recent Activity

Donate For Us