I used to process CSV files with awk. Here is my first script:
tail -n +2 shifted_final.csv | awk -F, 'BEGIN {old=$2} {if($2!=old){print $0; old=$2;}}' | less
This script looks for runs of repeated values in the 2nd column (the value on line n is the same as on lines n+1, n+2, ...) and prints only the first occurrence in each run. For example, given the following input:
ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0
Then the output will be:
1,0,0,1.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0
EDIT: I've made this a bit more challenging by adding a second script.
The second script does the same, but prints the last occurrence in each run of duplicates:
tail -n +2 shifted_final.csv | awk -F, 'NR>1 && $2!=old {print line} {old=$2; line=$0} END {if (NR) print line}' | less
Its output will be:
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0
I suppose R is a powerful language that should handle such tasks, but I've only found questions about calling awk scripts from R and the like. How can I do this in R?
Regarding the update to your question, here is a more general solution, thanks to @nicola:
Idx.first <- c(TRUE, tbl$orig[-1] != tbl$orig[-nrow(tbl)])
##
R> tbl[Idx.first,]
#    ord orig pred as o.p
# 1    1    0    0  1   0
# 23  23    4    0  0   4
# 24  24  402    0  1 402
# 25  25    0    0  1   0
If you want to use the last occurrence of a value in a run, rather than the first, just append TRUE to @nicola's indexing expression instead of prepending it:
Idx.last <- c(tbl$orig[-1] != tbl$orig[-nrow(tbl)], TRUE)
##
R> tbl[Idx.last,]
#    ord orig pred as o.p
# 22  22    0    0  0   0
# 23  23    4    0  0   4
# 24  24  402    0  1 402
# 25  25    0    0  1   0
In either case, tbl$orig[-1] != tbl$orig[-nrow(tbl)] compares the 2nd through nth values in column 2 with the 1st through (n-1)th values in column 2. The result is a logical vector whose TRUE elements mark a change between consecutive values. Since the comparison has length n-1, pushing an extra TRUE onto the front (case 1) selects the first occurrence in each run, whereas adding an extra TRUE to the back (case 2) selects the last occurrence in each run.
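To make the two cases concrete, here is a minimal sketch on a made-up vector. The names x, changed, idx_first and idx_last are hypothetical and stand in for tbl$orig and the index vectors above; they are not part of the original answer.

```r
# Hypothetical toy vector standing in for tbl$orig
x <- c(0, 0, 0, 4, 402, 0)

# TRUE wherever a value differs from its predecessor (length n - 1)
changed <- x[-1] != x[-length(x)]

# Prepending TRUE flags the first element of every run
idx_first <- c(TRUE, changed)
# Appending TRUE flags the last element of every run
idx_last <- c(changed, TRUE)

which(idx_first)  # positions 1, 4, 5, 6
which(idx_last)   # positions 3, 4, 5, 6
```

Both index vectors have the same length as x, so they can be used directly to subset the rows of a data frame.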
Data:
tbl <- read.table(text = "ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0",
header = TRUE,
sep = ",")
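As a cross-check, base R's rle() should give the same row positions. This is a different technique from the indexing shown above, sketched on a small made-up data frame; df, first_pos and last_pos are hypothetical names chosen for illustration.

```r
# Hypothetical data frame; only the 'orig' column matters here
df <- data.frame(ord = 1:6, orig = c(0, 0, 0, 4, 402, 0))

r <- rle(df$orig)                       # lengths of runs of equal consecutive values
last_pos  <- cumsum(r$lengths)          # row index where each run ends
first_pos <- last_pos - r$lengths + 1   # row index where each run starts

df[first_pos, ]  # first row of every run
df[last_pos, ]   # last row of every run
```

The rle() route also exposes the run lengths in r$lengths, which may be handy if you need to know how long each run was.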