I have a large data table, Divvy (over 2.4 million records), that looks like this (some columns removed):
X trip_id from_station_id.x to_station_id.x
1 1109420 94 69
2 1109421 69 216
3 1109427 240 245
4 1109431 113 94
5 1109433 127 332
3 1109429 240 245
I would like to find the number of trips from each station to each destination station. For example:
From To Sum
94 69 1
240 245 2
and so on. I would then join it back to the initial table using dplyr to get something like the table below, and limit it to distinct from_station_id/to_station_id combinations, which I'll use to map routes (I have lat/long for each station):
X trip_id from_station_id.x to_station_id.x Sum
1 1109420 94 69 1
2 1109421 69 216 1
3 1109427 240 245 2
4 1109431 113 94 1
5 1109433 127 332 1
3 1109429 240 245 2
I successfully used count to get some of this, such as:
count(Divvy$from_station_id.x==94 & Divvy$to_station_id.x == 69)
x freq
1 FALSE 2454553
2 TRUE 81
But this is obviously labor-intensive, as there are 300 unique stations and so well over 44k possible combinations. I created a helper table, thinking I could loop over it.
n <- select(Divvy, from_station_id.x)
from_station_id.x
1 94
2 69
3 240
4 113
5 113
6 127
count(Divvy$from_station_id.x==n[1,1] & Divvy$to_station_id.x == n[2,1])
x freq
1 FALSE 2454553
2 TRUE 81
I felt like a loop such as
output <- matrix(ncol=variables, nrow=iterations)
output <- matrix()
for(i in 1:n)(output[i, count(Divvy$from_station_id.x==n[1,1] & Divvy$to_station_id.x == n[2,1]))
should work. Come to think of it, though, that will still only return 300 rows, not 44k, so it would then have to loop back and do n[2] & n[1], and so on.
I felt like there might also be a quicker dplyr solution that would let me return a count of each combo and append it directly without the extra steps/table creation, but I haven't found it.
I'm new to R. I have searched around and think I'm close, but I can't quite connect the last dot: joining that result back to Divvy. Any help is appreciated.
Here is a data.table solution, which is useful if you are working with large data:
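library(data.table)
# DF is your data frame; setDT() converts it to a data.table by reference.
# .N is the number of rows in each from/to group; := adds it as the column `sum`.
setDT(DF)[, sum := .N, by = .(from_station_id.x, to_station_id.x)][]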
X trip_id from_station_id.x to_station_id.x sum
1: 1 1109420 94 69 1
2: 2 1109421 69 216 1
3: 3 1109427 240 245 2
4: 4 1109431 113 94 1
5: 5 1109433 127 332 1
6: 3 1109429 240 245 2
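Since the question asks about dplyr, here is a minimal sketch of the same idea using dplyr's `add_count()` and `count()`. The toy `Divvy` data frame is built from the six sample rows in the question, and the names `with_counts` and `routes` are made up for illustration. The calls are namespaced with `dplyr::` on the assumption that plyr may also be loaded (the `x`/`freq` output in the question looks like `plyr::count`), which would otherwise mask `count()`. It also assumes a dplyr version recent enough to support the `name` argument.

```r
library(dplyr)

# Toy data copied from the sample rows in the question
Divvy <- data.frame(
  X = c(1, 2, 3, 4, 5, 3),
  trip_id = c(1109420, 1109421, 1109427, 1109431, 1109433, 1109429),
  from_station_id.x = c(94, 69, 240, 113, 127, 240),
  to_station_id.x = c(69, 216, 245, 94, 332, 245)
)

# Append the per-route trip count to every row; no separate join is needed
with_counts <- Divvy %>%
  dplyr::add_count(from_station_id.x, to_station_id.x, name = "Sum")

# One row per distinct from/to pair with its trip count (e.g. for mapping routes)
routes <- Divvy %>%
  dplyr::count(from_station_id.x, to_station_id.x, name = "Sum")

print(with_counts)
print(routes)
```

`add_count()` keeps every trip row, while `count()` collapses to the distinct station pairs, so the second result is probably the one to join against the station lat/long table.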