Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select matrix first row based on 1st, 8th and 9th column value with awk or sed

Tags:

r

sed

awk

I have some rows where the 1st, 8th and 9th columns are mostly the same. The total number of rows is well over 60K. Now I want to simplify keeping only the first rows for which the 1st,8th and 9th column are same.

Input file:

chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    861281  861490  3   101 117 1.29744744  762097  6706109 1.297328502
chr1    7868860 7869039 2   78  119 1.123385189 7796356 8921423 1.088752407
chr1    7869841 7870041 2   140 169 1.123385189 7796356 8921423 1.088752407
chr1    7870411 7870596 2   83  163 1.123385189 7796356 8921423 1.088752407
chr1    7879297 7879467 2   290 360 1.024742732 7796356 8921423 1.088752407
chr1    21012415    21012609    3   89  135 1.230421209 19536504    21054539    1.247494175
chr1    21013924    21014512    3   234 219 1.359224182 19536504    21054539    1.247494175
chr1    21016588    21016803    3   172 179 1.230421209 19536504    21054539    1.247494175
chr1    21024895    21025101    3   147 120 1.230421209 19536504    21054539    1.247494175
chr14   20920169    20920704    3   211 214 1.254261327 20840851    20923828    1.288877208
chr14   20922716    20922919    3   253 262 1.228396526 20840851    20923828    1.288877208
chr14   20923634    20923828    3   188 201 1.206226522 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038
chr14   20924787    20925701    2   314 306 1.305351797 20924141    21465086    1.088234038
chr14   20926636    20926836    2   134 136 1.206226522 20924141    21465086    1.088234038

Desired output:

chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    7869841 7870041 2   140 169 1.123385189 7796356 8921423 1.088752407
chr1    21024895    21025101    3   147 120 1.230421209 19536504    21054539    1.247494175
chr14   20922716    20922919    3   253 262 1.228396526 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038

I only keep one row for each distinct categories which have similar column1, column 8th and column 9, best would be to just keep the first row whenever there is a change.

How can I achieve this in awk, sed or in R?

like image 583
ivivek_ngs Avatar asked Apr 22 '15 14:04

ivivek_ngs


2 Answers

Just one awk line:

awk '!seen[$1,$8,$9]++' file

This keeps track in the array seen[] of how many times a given tuple (field1, field8, field9) appeared so far. When it sees one, the counter increments by one. When the value was already 1 or bigger, !value will be evaluated to False, so that awk won't print the line.

First time:

  • seen[$1,$8,$9] is 0 (default value when not set).
  • !0 evaluates to True, so the line is printed.
  • seen[$1,$8,$9] increments by one.

Next times:

  • seen[$1,$8,$9] is 1 or more.
  • !1 evaluates to False, so the line is not printed.
  • seen[$1,$8,$9] increments by one.

Test

$ awk '!seen[$1,$8,$9]++' a
chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    7868860 7869039 2   78  119 1.123385189 7796356 8921423 1.088752407
chr1    21012415    21012609    3   89  135 1.230421209 19536504    21054539    1.247494175
chr14   20920169    20920704    3   211 214 1.254261327 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038
like image 188
fedorqui 'SO stop harming' Avatar answered Oct 20 '22 20:10

fedorqui 'SO stop harming'


Import your data into R (you would specify the file):

DF <- read.table(text = "chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    861281  861490  3   101 117 1.29744744  762097  6706109 1.297328502
chr1    7868860 7869039 2   78  119 1.123385189 7796356 8921423 1.088752407
chr1    7869841 7870041 2   140 169 1.123385189 7796356 8921423 1.088752407
chr1    7870411 7870596 2   83  163 1.123385189 7796356 8921423 1.088752407
chr1    7879297 7879467 2   290 360 1.024742732 7796356 8921423 1.088752407
chr1    21012415    21012609    3   89  135 1.230421209 19536504    21054539    1.247494175
chr1    21013924    21014512    3   234 219 1.359224182 19536504    21054539    1.247494175
chr1    21016588    21016803    3   172 179 1.230421209 19536504    21054539    1.247494175
chr1    21024895    21025101    3   147 120 1.230421209 19536504    21054539    1.247494175
chr14   20920169    20920704    3   211 214 1.254261327 20840851    20923828    1.288877208
chr14   20922716    20922919    3   253 262 1.228396526 20840851    20923828    1.288877208
chr14   20923634    20923828    3   188 201 1.206226522 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038
chr14   20924787    20925701    2   314 306 1.305351797 20924141    21465086    1.088234038
chr14   20926636    20926836    2   134 136 1.206226522 20924141    21465086    1.088234038", header = TRUE)

Extract rows where columns 1, 8, 9 are not duplicates of earlier rows:

DF[!duplicated(DF[, c(1,8,9)]),]
#     chr exon_start exon_end cnv tumor_DOC control_DOC rationormalized_after_smoothing CNV_start  CNV_end seg_mean
#1   chr1     762097   762270   3       821         717                       1.4566102    762097  6706109 1.297329
#3   chr1    7868860  7869039   2        78         119                       1.1233852   7796356  8921423 1.088752
#7   chr1   21012415 21012609   3        89         135                       1.2304212  19536504 21054539 1.247494
#11 chr14   20920169 20920704   3       211         214                       1.2542613  20840851 20923828 1.288877
#14 chr14   20924141 20924329   2       244         344                       0.9022995  20924141 21465086 1.088234
like image 36
Roland Avatar answered Oct 20 '22 20:10

Roland