I have a tab delimited file that looks like the following:
cluster.1 Adult.1
cluster.2 Comp.1
cluster.3 Adult.2
cluster.3 Pre.3
cluster.4 Pre.1
cluster.4 Juv.2
cluster.4 Comp.4
cluster.4 Adult.3
cluster.5 Adult.2
cluster.6 Pre.5
I would like to count how many times each entry occurs in column one and print that count in a new third column, so that the output looks like this:
cluster.1 Adult.1 1
cluster.2 Comp.1 1
cluster.3 Adult.2 2
cluster.3 Pre.3 2
cluster.4 Pre.1 4
cluster.4 Juv.2 4
cluster.4 Comp.4 4
cluster.4 Adult.3 4
cluster.5 Adult.2 1
cluster.6 Pre.5 1
In the end I plan to delete the rows where column 3 equals 1, but I figured that would probably be a two-step process. Thanks.
With awk you can read the file twice as follows:
$ awk 'NR==FNR {a[$1]++; next} {print $0, a[$1]}' file file
cluster.1 Adult.1 1
cluster.2 Comp.1 1
cluster.3 Adult.2 2
cluster.3 Pre.3 2
cluster.4 Pre.1 4
cluster.4 Juv.2 4
cluster.4 Comp.4 4
cluster.4 Adult.3 4
cluster.5 Adult.2 1
cluster.6 Pre.5 1
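During the first pass over the file, NR==FNR is true, so the first block counts each value in column one and skips to the next line. During the second pass, the second {} block prints each line followed by the count stored for its first column.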
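Since the stated goal is to drop the rows whose count is 1, the same two-pass idea can do the filtering directly. The following is a minimal sketch, not part of the original answer: the temporary sample file and the variable names `sample` and `filtered` are made up for illustration, and the tab separators are set explicitly on the assumption that the real file is tab-delimited.

```shell
# Build a small tab-delimited sample in a temporary file (hypothetical data).
sample=$(mktemp)
printf 'cluster.1\tAdult.1\ncluster.3\tAdult.2\ncluster.3\tPre.3\ncluster.5\tAdult.2\n' > "$sample"

# Pass 1 counts column one; pass 2 prints only rows whose count exceeds 1,
# with the count appended as a third tab-separated column.
filtered=$(awk -F'\t' -v OFS='\t' \
    'NR==FNR {a[$1]++; next} a[$1] > 1 {print $0, a[$1]}' "$sample" "$sample")

printf '%s\n' "$filtered"
rm -f "$sample"
```

If the count column is not needed in the final file, replacing the second block with the bare condition `a[$1] > 1` prints the surviving rows unchanged.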
Using join:
cut -f1 input | sort | uniq -c | sed 's/^ *\([0-9]*\) */\1\t/' | \
join -t $'\t' -1 2 -2 1 -o '2.1 2.2 1.1' - input
Output:
cluster.1 Adult.1 1
cluster.2 Comp.1 1
cluster.3 Adult.2 2
cluster.3 Pre.3 2
cluster.4 Pre.1 4
cluster.4 Juv.2 4
cluster.4 Comp.4 4
cluster.4 Adult.3 4
cluster.5 Adult.2 1
cluster.6 Pre.5 1
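To see what join receives on its first input, it can help to run the counting half of the pipeline on its own. This is a rough sketch under a few assumptions: the sample file and the names `sample` and `counts` are hypothetical, and awk is swapped in for the sed expression only to reformat the `uniq -c` output as "count, tab, key". Note that join expects both inputs to be sorted on the join field, which holds for the example data.

```shell
# Small tab-delimited sample in a temporary file (hypothetical data).
sample=$(mktemp)
printf 'cluster.3\tAdult.2\ncluster.3\tPre.3\ncluster.5\tAdult.2\n' > "$sample"

# Count table: one line per key in the form "count<TAB>key".
# awk stands in here for the sed step of the original pipeline.
counts=$(cut -f1 "$sample" | sort | uniq -c | awk -v OFS='\t' '{print $1, $2}')

printf '%s\n' "$counts"
rm -f "$sample"
```

The key sits in field 2 of this table, which is why the join command uses `-1 2` for the first input and `-2 1` for the original file.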
A Bash solution using an associative array:
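declare -A array
# First pass: count how often each value occurs in column one.
# $infile must hold the path to the input file.
while read -r col1 col2 ; do
    ((array[$col1]++))
done < "$infile"
# Second pass: print each line followed by the count for its key.
while read -r col1 col2 ; do
    printf '%s\t%s\t%s\n' "$col1" "$col2" "${array[$col1]}"
done < "$infile"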
The output:
cluster.1 Adult.1 1
cluster.2 Comp.1 1
cluster.3 Adult.2 2
cluster.3 Pre.3 2
cluster.4 Pre.1 4
cluster.4 Juv.2 4
cluster.4 Comp.4 4
cluster.4 Adult.3 4
cluster.5 Adult.2 1
cluster.6 Pre.5 1
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;

# Print every buffered line followed by the number of lines in the buffer.
sub output {
    my $buffer_ref = shift;
    print "$_\t", 0 + @$buffer_ref, "\n" for @$buffer_ref;
}

# Reads the input once; assumes lines with the same cluster are adjacent.
my $previous_cluster = q();
my @buffer;
while (<>) {
    chomp;
    my ($cluster, $val) = split /\t/;
    if ($cluster ne $previous_cluster) {
        output(\@buffer);
        undef @buffer;
        $previous_cluster = $cluster;
    }
    push @buffer, $_;
}
# Do not forget to output the last cluster.
output(\@buffer);
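For comparison, the same buffer-and-flush idea can be sketched in awk, reading the file only once. This is an illustrative translation rather than part of the original answer: the function name `emit`, the variables `buf`, `n`, and `prev`, and the sample file are all invented here, and like the Perl version it assumes rows with the same cluster are adjacent.

```shell
# Small tab-delimited sample in a temporary file (hypothetical data).
sample=$(mktemp)
printf 'cluster.1\tAdult.1\ncluster.3\tAdult.2\ncluster.3\tPre.3\n' > "$sample"

buffered=$(awk -F'\t' -v OFS='\t' '
    # Print every buffered line followed by the size of its group.
    function emit(   i) {
        for (i = 1; i <= n; i++) print buf[i], n
        n = 0
    }
    $1 != prev { emit(); prev = $1 }   # key changed: output the finished group
    { buf[++n] = $0 }                  # buffer the current line
    END { emit() }                     # output the last group
' "$sample")

printf '%s\n' "$buffered"
rm -f "$sample"
```

Compared with the two-pass awk command, this holds only one group in memory at a time, which may matter for very large files or when the input comes from a pipe that cannot be read twice.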