I have some rows where the 1st, 8th and 9th columns are mostly the same. The total number of rows is well over 60K. Now I want to simplify keeping only the first rows for which the 1st,8th and 9th column are same.
Input file:
chr exon_start exon_end cnv tumor_DOC control_DOC rationormalized_after_smoothing CNV_start CNV_end seg_mean
chr1 762097 762270 3 821 717 1.456610215 762097 6706109 1.297328502
chr1 861281 861490 3 101 117 1.29744744 762097 6706109 1.297328502
chr1 7868860 7869039 2 78 119 1.123385189 7796356 8921423 1.088752407
chr1 7869841 7870041 2 140 169 1.123385189 7796356 8921423 1.088752407
chr1 7870411 7870596 2 83 163 1.123385189 7796356 8921423 1.088752407
chr1 7879297 7879467 2 290 360 1.024742732 7796356 8921423 1.088752407
chr1 21012415 21012609 3 89 135 1.230421209 19536504 21054539 1.247494175
chr1 21013924 21014512 3 234 219 1.359224182 19536504 21054539 1.247494175
chr1 21016588 21016803 3 172 179 1.230421209 19536504 21054539 1.247494175
chr1 21024895 21025101 3 147 120 1.230421209 19536504 21054539 1.247494175
chr14 20920169 20920704 3 211 214 1.254261327 20840851 20923828 1.288877208
chr14 20922716 20922919 3 253 262 1.228396526 20840851 20923828 1.288877208
chr14 20923634 20923828 3 188 201 1.206226522 20840851 20923828 1.288877208
chr14 20924141 20924329 2 244 344 0.902299535 20924141 21465086 1.088234038
chr14 20924787 20925701 2 314 306 1.305351797 20924141 21465086 1.088234038
chr14 20926636 20926836 2 134 136 1.206226522 20924141 21465086 1.088234038
Desired output:
chr exon_start exon_end cnv tumor_DOC control_DOC rationormalized_after_smoothing CNV_start CNV_end seg_mean
chr1 762097 762270 3 821 717 1.456610215 762097 6706109 1.297328502
chr1 7869841 7870041 2 140 169 1.123385189 7796356 8921423 1.088752407
chr1 21024895 21025101 3 147 120 1.230421209 19536504 21054539 1.247494175
chr14 20922716 20922919 3 253 262 1.228396526 20840851 20923828 1.288877208
chr14 20924141 20924329 2 244 344 0.902299535 20924141 21465086 1.088234038
I only keep one row for each distinct categories which have similar column1, column 8th and column 9, best would be to just keep the first row whenever there is a change.
How can I achieve this in awk, sed or in R?
Just one awk
line:
awk '!seen[$1,$8,$9]++' file
This keeps track in the array seen[]
of how many times a given tuple (field1, field8, field9)
appeared so far. When it sees one, the counter increments by one. When the value was already 1 or bigger, !value
will be evaluated to False, so that awk
won't print the line.
First time:
seen[$1,$8,$9]
is 0
(default value when not set).!0
evaluates to True, so the line is printed.seen[$1,$8,$9]
increments by one.Next times:
seen[$1,$8,$9]
is 1
or more.!1
evaluates to False, so the line is not printed.seen[$1,$8,$9]
increments by one.$ awk '!seen[$1,$8,$9]++' a
chr exon_start exon_end cnv tumor_DOC control_DOC rationormalized_after_smoothing CNV_start CNV_end seg_mean
chr1 762097 762270 3 821 717 1.456610215 762097 6706109 1.297328502
chr1 7868860 7869039 2 78 119 1.123385189 7796356 8921423 1.088752407
chr1 21012415 21012609 3 89 135 1.230421209 19536504 21054539 1.247494175
chr14 20920169 20920704 3 211 214 1.254261327 20840851 20923828 1.288877208
chr14 20924141 20924329 2 244 344 0.902299535 20924141 21465086 1.088234038
Import your data into R (you would specify the file):
DF <- read.table(text = "chr exon_start exon_end cnv tumor_DOC control_DOC rationormalized_after_smoothing CNV_start CNV_end seg_mean
chr1 762097 762270 3 821 717 1.456610215 762097 6706109 1.297328502
chr1 861281 861490 3 101 117 1.29744744 762097 6706109 1.297328502
chr1 7868860 7869039 2 78 119 1.123385189 7796356 8921423 1.088752407
chr1 7869841 7870041 2 140 169 1.123385189 7796356 8921423 1.088752407
chr1 7870411 7870596 2 83 163 1.123385189 7796356 8921423 1.088752407
chr1 7879297 7879467 2 290 360 1.024742732 7796356 8921423 1.088752407
chr1 21012415 21012609 3 89 135 1.230421209 19536504 21054539 1.247494175
chr1 21013924 21014512 3 234 219 1.359224182 19536504 21054539 1.247494175
chr1 21016588 21016803 3 172 179 1.230421209 19536504 21054539 1.247494175
chr1 21024895 21025101 3 147 120 1.230421209 19536504 21054539 1.247494175
chr14 20920169 20920704 3 211 214 1.254261327 20840851 20923828 1.288877208
chr14 20922716 20922919 3 253 262 1.228396526 20840851 20923828 1.288877208
chr14 20923634 20923828 3 188 201 1.206226522 20840851 20923828 1.288877208
chr14 20924141 20924329 2 244 344 0.902299535 20924141 21465086 1.088234038
chr14 20924787 20925701 2 314 306 1.305351797 20924141 21465086 1.088234038
chr14 20926636 20926836 2 134 136 1.206226522 20924141 21465086 1.088234038", header = TRUE)
Extract rows where columns 1, 8, 9 are not duplicates of earlier rows:
DF[!duplicated(DF[, c(1,8,9)]),]
# chr exon_start exon_end cnv tumor_DOC control_DOC rationormalized_after_smoothing CNV_start CNV_end seg_mean
#1 chr1 762097 762270 3 821 717 1.4566102 762097 6706109 1.297329
#3 chr1 7868860 7869039 2 78 119 1.1233852 7796356 8921423 1.088752
#7 chr1 21012415 21012609 3 89 135 1.2304212 19536504 21054539 1.247494
#11 chr14 20920169 20920704 3 211 214 1.2542613 20840851 20923828 1.288877
#14 chr14 20924141 20924329 2 244 344 0.9022995 20924141 21465086 1.088234
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With