
Aggregating sequential and grouped data in R

I have a dataset that looks like the toy example below. Each row records the location a person moved to and the number of days since that move. For example, person 1 started out in a rural area, moved to a city 463 days ago (2nd row), then moved from that city to a town 415 days ago (3rd row), and so on.

set.seed(123)
df <- as.data.frame(sample.int(1000, 10))
colnames(df) <- "time"
df$destination <- as.factor(sample(c("city", "town", "rural"), size = 10, replace = TRUE, prob = c(.50, .25, .25)))
df$user <- sample.int(3, 10, replace = TRUE)
df[order(df[,"user"], -df[,"time"]), ]

The data:

time destination user
 526       rural    1
 463        city    1
 415        town    1
 299        city    1
 179       rural    1
 938        town    2
 229        town    2
 118        city    2
 818        city    3
 195        city    3

I want to aggregate this data into the format below: that is, count each type of relocation (from one location type to the next) within each user, and sum those counts across users into one table. How can I do this, preferably without writing loops?

from  to     count
city  city   1
city  town   1
city  rural  1
town  city   2
town  town   1
town  rural  0
rural city   1
rural town   0
rural rural  0
Joshua asked Jul 22 '21 19:07



1 Answer

One possible approach, using the data.table package:

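library(data.table)

# All destination types, used to build every from/to combination
cases <- unique(df$destination)

# Sort by user and descending time so consecutive rows are consecutive moves
# (the question's order() call only prints the sorted data, it does not store it)
setorder(setDT(df), user, -time)

# Pair each destination with the next one per user, then count each pair;
# joining on CJ() keeps combinations that never occur (count 0)
df[, .(from=destination, to=shift(destination, -1)), by=user
   ][CJ(from=cases, to=cases), .(count=.N), by=.EACHI, on=c("from", "to")]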


#      from     to count
#    <char> <char> <int>
# 1:   city   city     1
# 2:   city  rural     1
# 3:   city   town     1
# 4:  rural   city     1
# 5:  rural  rural     0
# 6:  rural   town     0
# 7:   town   city     2
# 8:   town  rural     0
# 9:   town   town     1
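
For comparison, the same counting idea can be sketched in base R with table(). This is only a rough alternative, not part of the approach above: the data frame is typed in by hand from the table in the question, and the helper names lv, same_user and counts are made up for illustration.

```r
# Toy data copied from the table shown in the question
df <- data.frame(
  time = c(526, 463, 415, 299, 179, 938, 229, 118, 818, 195),
  destination = c("rural", "city", "town", "city", "rural",
                  "town", "town", "city", "city", "city"),
  user = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3)
)

# Order rows by user, then from the oldest to the most recent move
df <- df[order(df$user, -df$time), ]

# Fix the level order so all nine combinations appear, even with count 0
lv <- c("city", "town", "rural")
from <- factor(df$destination, levels = lv)

# Next destination within the same user; NA on each user's last row
n <- nrow(df)
same_user <- c(df$user[-1] == df$user[-n], FALSE)
to <- factor(ifelse(same_user, c(df$destination[-1], NA), NA), levels = lv)

# Cross-tabulate; rows whose "to" is NA are dropped by table()
counts <- as.data.frame(table(from = from, to = to), responseName = "count")
counts <- counts[order(counts$from, counts$to), ]
counts
```

Because from and to are factors with explicit levels, table() reports combinations that never occur with a count of 0, and the final order() call arranges the rows in the city, town, rural layout used in the question.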
B. Christian Kamgang answered Sep 27 '22 21:09