Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Print a count of unique values for each column in many files

Tags:

linux

awk

I have huge binary matrices of many columns, and I am trying to get a count of zeroes and ones in each field, for each file, while keeping track of the file and the headers. Each file has the same headers and number of columns (but variable number of rows) and it is like this:

File 1:
Header1 Header2 Header3 Header4
0 1 0 1 
0 1 0 1
1 0 0 1
0 1 0 1

File 2:
Header1 Header2 Header3 Header4
0 1 0 0 
0 0 0 0
0 0 0 1

Desired output with counts of 0/1

    Header1 Header2 Header3 Header4 Total
File1 1 3 0 4  4
File2 0 1 0 1  3

For now I have a count of values equal to ones for only file1, but it comes out with each row as a header while I would like the original headers to stay as headers and also this doesn't print a 0 if there are no ones... And it does not contain the original file name, so it is not right overall! Can you please direct me to the correct way to do this?

awk 'NF>0{
  for (i=1; i<=NF; i++) 
      if(NR==1)h[i]=$i;else if($i==1) a[i]++;
  } END{for(i=1; i<=length(a); i++) print h[i], a[i], NR}' file1
like image 567
user971102 Avatar asked Dec 18 '22 23:12

user971102


2 Answers

The following seems to work for me:

awk '
  # Gather headers, only from the first line of the first file.
  NR==1{
    for(i=1;i<=NF;i++){
      h[i]=$i;
    }
  }
  # Do not process header as if they were data.
  FNR==1{ next; }

  NF>limit{ limit=NF; }

  # Step through data 
  {
    f[FILENAME]++;
    for(i=1;i<=NF;i++){
      a[FILENAME,i]+=$i;
    }
  }

  # Display what we found.
  END{
    # Headers...
    printf("File\t");
    for(i=1;i<=length(h);i++){
      printf("%s\t",h[i])
    }
    print "Total";

    # And data.
    for(file in f){
      printf("%s",file);
      for(i=1;i<=limit;i++){
        printf("\t%d",a[file,i])
      }
      printf("\t%d\n",f[file]);
    }
  }' file1 file2

Note that we're keeping an array f[] for filenames because awk doesn't really support multidimensional arrays. The script above should work in just about any old awk. (I tested it in FreeBSD.) Though you might run in to a problem if you're processing millions of files, as the arrays use a non-zero amount of memory. On the other hand, the number of files is also limited by your shell's command line length. :-)

One thing I'm not sure of is why your header count does not match the number of fields in your data. But perhaps this is sufficient for you to take it the rest of the way.

like image 99
ghoti Avatar answered Mar 06 '23 09:03

ghoti


Assuming that the example files should, in fact, have as many headers as there are columns (the example has four columns but only three headers), the following code works for me:

#!/bin/sh
awk '
    function pr(filename) {
        if (filename) printf ("%s",filename)
        for (i=1; i<=NF; i++) {
            if (filename)
                printf ("%s%s",OFS,a[i])
            else
                printf ("%s%s",OFS,$i) 
            a[i] = 0
            }
        if (filename)
            printf ("%s%s",OFS,prevFNR-1) 
        else 
            printf ("%sTotal",OFS)
        printf ("\n")
        }

    FNR==1  {
            pr(prevFileName)
            prevFileName = FILENAME
            next
            }

    NF>0    {
            for (i=1; i<=NF; i++) 
                if ($i==1) a[i]++
            prevFNR = FNR
            } 

    END {
        pr(FILENAME)
        }' file1 file2

There is no way to know that the program has reached the last line of any given file but FNR==1 is true at the start of the next file, so I've used that to trigger printing each line. For that reason, the program uses prevFNR and prevFileName to remember the number of records in the previous file and the name of that file for display. The printing code is called from two different places so I've put it in a function, pr(), which uses the fact that prevFileName doesn't have a value the first time FNR==1 to indicate that it should print the header line rather than the computed summary information.

The output is:

 Header1 Header2 Header3 Header4 Total
file1 1 3 0 4 4
file2 0 1 0 1 3
like image 20
Simon Avatar answered Mar 06 '23 11:03

Simon