
Is preprocessing a file with awk needed, or can it be done directly in R?

Tags: r, csv, awk

I used to process CSV files with awk. Here is my first script:

tail -n +2 shifted_final.csv | awk -F, 'BEGIN {old=$2} {if($2!=old){print $0; old=$2;}}' | less

This script looks for runs of repeated values in the 2nd column (the value on line n is the same as on lines n+1, n+2, ...) and prints only the first occurrence in each run. For example, given the following input:

ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0

Then the output will be:

1,0,0,1.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0

EDIT: To make this a bit more challenging, I've added a second script. It does the same, but prints the last occurrence in each run of duplicates:

tail -n +2 shifted_final.csv | awk -F, 'NR>1 && $2!=old {print line} {old=$2; line=$0} END {if (NR) print line}' | less

Its output will be:

22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0

I suppose R is a powerful language that should handle such tasks, but I've only found questions about calling awk scripts from R, etc. How can I do this in R?

Wakan Tanka asked Jan 07 '23 04:01



1 Answer

Regarding the update to your question, a more general solution, thanks to @nicola:

Idx.first <- c(TRUE, tbl$orig[-1] != tbl$orig[-nrow(tbl)])
##
R> tbl[Idx.first,]
#    ord orig pred as o.p
# 1    1    0    0  1   0
# 23  23    4    0  0   4
# 24  24  402    0  1 402
# 25  25    0    0  1   0

If you want to use the last occurrence of a value in a run, rather than the first, just append TRUE to @nicola's indexing expression instead of prepending it:

Idx.last <- c(tbl$orig[-1] != tbl$orig[-nrow(tbl)], TRUE)
##
R> tbl[Idx.last,]
#    ord orig pred as o.p
# 22  22    0    0  0   0
# 23  23    4    0  0   4
# 24  24  402    0  1 402
# 25  25    0    0  1   0

In either case, tbl$orig[-1] != tbl$orig[-nrow(tbl)] compares the 2nd through nth values in column 2 with the 1st through (n-1)th values in column 2. The result is a logical vector of length n-1, where TRUE elements indicate a change between consecutive values. Prepending an extra TRUE (case 1) lines each element up with the row after the change, so it selects the first occurrence in a run; appending an extra TRUE (case 2) lines each element up with the row before the change, so it selects the last occurrence in a run.
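
To see the alignment on something smaller, here is a minimal sketch on a toy vector. The vector x and the names changed, first_idx and last_idx are made up for illustration and are not part of the data above:

```r
# Toy vector (an assumption for illustration): one run of three 0s,
# then 4, 402 and a final 0.
x <- c(0, 0, 0, 4, 402, 0)
n <- length(x)

# Compare elements 2..n with elements 1..(n-1); the result has length n-1.
changed <- x[-1] != x[-n]
# changed: FALSE FALSE TRUE TRUE TRUE

# Prepending TRUE marks the first element of every run.
first_idx <- c(TRUE, changed)
which(first_idx)   # 1 4 5 6

# Appending TRUE marks the last element of every run.
last_idx <- c(changed, TRUE)
which(last_idx)    # 3 4 5 6
```

Note that this relies on != returning TRUE or FALSE; if the column contains NA, the comparison yields NA and the indexing will need extra handling.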


Data:

tbl <- read.table(text = "ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0",
header = TRUE,
sep = ",")
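
The same indexing can be applied to a file read straight into R, with no awk step in between. The following is only a sketch: run_edges is a hypothetical helper name, and the temporary file is a small stand-in for shifted_final.csv so that the example runs on its own. It assumes the column has no NA values.

```r
# Hypothetical helper (not from the answer above): keep the first or the
# last row of each run of equal consecutive values in one column.
run_edges <- function(df, col, which = c("first", "last")) {
  which <- match.arg(which)
  n <- nrow(df)
  if (n == 0) return(df)
  changed <- df[[col]][-1] != df[[col]][-n]
  keep <- if (which == "first") c(TRUE, changed) else c(changed, TRUE)
  df[keep, , drop = FALSE]
}

# Stand-in for shifted_final.csv (made-up rows in the same layout).
path <- tempfile(fileext = ".csv")
writeLines(c("ord,orig,pred,as,o-p",
             "1,0,0,1.0,0",
             "2,0,0,1.0,0",
             "3,4,0,0.0,4",
             "4,0,0,1.0,0"), path)

# read.csv defaults to header = TRUE and sep = ",";
# the header "o-p" becomes the column name o.p.
tbl2 <- read.csv(path)

first_rows <- run_edges(tbl2, "orig", "first")  # rows with ord 1, 3, 4
last_rows  <- run_edges(tbl2, "orig", "last")   # rows with ord 2, 3, 4
```

With the real file, replacing path with "shifted_final.csv" should be enough, provided it has the header row shown in the question.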
nrussell answered Jan 31 '23 11:01
