How to count the number of instances of entries in column 1 and print the value to a new column

Tags: bash, awk

I have a tab-delimited file that looks like this:

cluster.1   Adult.1
cluster.2   Comp.1
cluster.3   Adult.2
cluster.3   Pre.3
cluster.4   Pre.1
cluster.4   Juv.2
cluster.4   Comp.4
cluster.4   Adult.3
cluster.5   Adult.2
cluster.6   Pre.5

I would like to count how many times each entry occurs in column 1 and print that count in a new third column, so that the output looks like this:

cluster.1   Adult.1 1
cluster.2   Comp.1  1
cluster.3   Adult.2 2
cluster.3   Pre.3   2
cluster.4   Pre.1   4
cluster.4   Juv.2   4
cluster.4   Comp.4  4
cluster.4   Adult.3 4
cluster.5   Adult.2 1
cluster.6   Pre.5   1

In the end I plan to delete the rows where column 3 equals 1, but I figured that will probably be a two-step process. Thanks.

acalcino asked Dec 06 '13 08:12


4 Answers

With awk you can read the file twice as follows:

$ awk 'NR==FNR {a[$1]++; next} {print $0, a[$1]}' file file
cluster.1   Adult.1 1
cluster.2   Comp.1 1
cluster.3   Adult.2 2
cluster.3   Pre.3 2
cluster.4   Pre.1 4
cluster.4   Juv.2 4
cluster.4   Comp.4 4
cluster.4   Adult.3 4
cluster.5   Adult.2 1
cluster.6   Pre.5 1

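The condition NR==FNR is true only while awk reads the first copy of the file. During that pass, a[$1]++ counts how often each column-1 value occurs, and next skips the rest of the script. On the second pass the condition is false, so the second {} block runs and prints each line followed by its count. Note that print separates its arguments with OFS, which defaults to a single space; pass -v OFS='\t' if the new column should be tab-separated.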
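
The two-pass idea can also cover the follow-up step from the question (dropping rows whose count is 1) in a single command. The following is only a sketch under a few assumptions: the function names add_count and add_count_skip_unique are made up for illustration, the input is a regular tab-delimited file (it is read twice, so a pipe will not work), and the count is appended with a tab.

```shell
#!/usr/bin/env bash

# Hypothetical helper: append the number of occurrences of column 1
# to every row, tab-separated. $1 is the path to the input file.
add_count() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { count[$1]++; next }   # first pass: count column-1 values
    { print $0, count[$1] }           # second pass: print row plus count
  ' "$1" "$1"
}

# Hypothetical helper: same as add_count, but rows whose column-1 value
# occurs only once are skipped instead of printed.
add_count_skip_unique() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { count[$1]++; next }
    count[$1] > 1 { print $0, count[$1] }
  ' "$1" "$1"
}
```

With the sample file from the question, add_count_skip_unique file would keep only the cluster.3 and cluster.4 rows.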

fedorqui 'SO stop harming' answered Nov 02 '22 23:11


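Using join (join expects both inputs to be sorted on the join field, so this assumes the file is already sorted on column 1; \t in the sed replacement is a GNU sed extension):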

cut -f1 input | sort | uniq -c | sed 's/^ *\([0-9]*\) */\1\t/' | \
      join -t $'\t'  -1 2 -2 1 -o '2.1 2.2 1.1' - input

Output:

cluster.1   Adult.1 1
cluster.2   Comp.1  1
cluster.3   Adult.2 2
cluster.3   Pre.3   2
cluster.4   Pre.1   4
cluster.4   Juv.2   4
cluster.4   Comp.4  4
cluster.4   Adult.3 4
cluster.5   Adult.2 1
cluster.6   Pre.5   1

perreal answered Nov 03 '22 00:11


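A Bash solution using an associative array (requires Bash 4 or later):

infile=$1           # assumption: the input path is passed as the first argument
declare -A array    # maps each column-1 value to its number of occurrences

# First pass: count how often each column-1 value occurs.
while IFS=$'\t' read -r col1 col2 ; do
  ((array[$col1]++))
done < "$infile"

# Second pass: print each row followed by its count.
while IFS=$'\t' read -r col1 col2 ; do
  printf '%s\t%s\t%s\n' "$col1" "$col2" "${array[$col1]}"
done < "$infile"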

The output:

cluster.1       Adult.1 1
cluster.2       Comp.1  1
cluster.3       Adult.2 2
cluster.3       Pre.3   2
cluster.4       Pre.1   4
cluster.4       Juv.2   4
cluster.4       Comp.4  4
cluster.4       Adult.3 4
cluster.5       Adult.2 1
cluster.6       Pre.5   1

Fritz G. Mehner answered Nov 02 '22 23:11


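Perl solution (it reads the input once and assumes that rows belonging to the same cluster are adjacent, as in the sample):

#!/usr/bin/perl
use warnings;
use strict;

# Print every buffered line followed by the buffer size,
# i.e. the number of rows in that cluster.
sub output {
    my $buffer_ref = shift;
    print "$_\t", scalar(@$buffer_ref), "\n" for @$buffer_ref;
}

my $previous_cluster = q();
my @buffer;

while (<>) {
    chomp;
    my ($cluster) = split /\t/;
    # A new cluster starts: flush the rows collected for the previous one.
    if ($cluster ne $previous_cluster) {
        output(\@buffer);
        @buffer = ();
        $previous_cluster = $cluster;
    }
    push @buffer, $_;
}
# Do not forget to output the last cluster.
output(\@buffer);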
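
For use in a shell pipeline, the same buffer-and-flush idea could be sketched in awk roughly as follows. This is an illustration rather than a drop-in replacement: the names count_grouped and emit_group are made up here, and, like the Perl version, it assumes that rows of the same cluster are adjacent. Because it reads its input only once, it also works on a pipe.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read tab-delimited rows from stdin, buffer the rows
# of the current cluster, and print each row plus the cluster size once the
# cluster ends. Assumes rows with the same column-1 value are adjacent.
count_grouped() {
  awk -F'\t' -v OFS='\t' '
    # Print all buffered rows with the group size n, then reset the buffer.
    function emit_group(    i) {
      for (i = 1; i <= n; i++) print buf[i], n
      n = 0
    }
    NR > 1 && $1 != prev { emit_group() }   # column 1 changed: flush old group
    { buf[++n] = $0; prev = $1 }            # buffer the current row
    END { emit_group() }                    # flush the last group
  '
}
```

A call such as count_grouped < file should give the same rows as the Perl script for grouped input.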

choroba answered Nov 02 '22 23:11