I have data like this:
ID ATTRIBUTE START END
1 A 01-01-2000 15-03-2010
1 B 05-11-2001 06-02-2002
2 B 01-02-2002 08-05-2008
2 B 01-06-2008 01-07-2008
I now want to count the number of different IDs having a certain attribute per year.
A result could look like this:
YEAR count(A) count(B)
2000 1 0
2001 1 1
2002 1 2
2003 1 1
2004 1 1
2005 1 1
2006 1 1
2007 1 1
2008 1 1
2009 1 0
2010 1 0
I the second step of counting the occurences is probably easy.
But how would I split my data into years?
Thank you in advance!
Here is an approach using a few of Hadley's packages.
library(lubridate); library(reshape2); library(plyr)
# extract years from start and end dates after converting them to date
dfr2 = transform(dfr, START = year(dmy(START)), END = year(dmy(END)))
# for every row, construct a sequence of years from start to end
dfr2 = adply(dfr2, 1, transform, YEAR = START:END)
# create pivot table of year vs. attribute with number of unique values of ID
dcast(dfr2, YEAR ~ ATTRIBUTE, function(x) length(unique(x)), value_var = 'ID')
EDIT: If the original data.frame
is large, then adply
might take a lot of time. A useful alternate in such cases is to use the data.table
package. Here is how we can replace the adply
call using data.table
.
require(data.table)
dfr2 = data.table(dfr2)[,list(YEAR = START:END),'ID, ATTRIBUTE']
Here is a solution that only uses the core of R. First we show the input data to keep this all self contained:
DF <- data.frame(ID = c(1, 1, 2, 2),
ATTRIBUTE = c("A", "B", "B", "B"),
START = c("01-01-2000", "05-11-2001", "01-02-2002", "01-06-2008"),
END = c("15-03-2010", "06-02-2002", "08-05-2008", "01-07-2008"))
Now that we have the input the solution follows: yr
is defined to be a function which extracts the year. The guts of the calculation is the statement following the definition of yr
. For each row of DF
the anonymous function produces a data frame having the years spanned in column 1 and the ATTRIBUTE
and ID
in columns 2 and 3. For example, the data frame corresponding to the first row of DF
is the 11 row data.frame(YEAR = 2000:2010, ATTRIBUTE = 1, ID = "A")
and the data frame corresponding to the second row of DF
is the two row data.frame(YEAR = 2001:2002, ATTRIBUTE = 1, ID = "B")
. The lapply
produces a list of such data frames, one for each row of DF
so in the example input above it produces a list with 4 components. Using do.call
we rbind
the components of that list, i.e. the individuals data frames, producing a single large data frame. We eliminate duplicate rows (using unique
) from this large data frame, drop the ID
column (the third column) and run table
on the result:
yr <- function(d) as.numeric(sub(".*-", "", d))
out <- table(unique(do.call(rbind, lapply(1:nrow(DF), function(r) with(DF[r, ],
data.frame(YEAR = seq(yr(START), yr(END)), ATTRIBUTE, ID)))))[, -3])
The resulting table is:
> out
ATTRIBUTE
YEAR A B
2000 1 0
2001 1 1
2002 1 2
2003 1 1
2004 1 1
2005 1 1
2006 1 1
2007 1 1
2008 1 1
2009 1 0
2010 1 0
EDIT:
Poster has later indicated that memory might be a problem so here is an sqldf solution which processes the key large intermediate results in sqlite outside of R (the dbname = tempfile()
tells it to do that) so any memory limitation of R will not affect it. It uses the same input and the same yr
function shown above and returns the same result, tab
is the same asout
above. Also try it without the dbname = tempfile()
in case it actually does fit in memory.
library(sqldf)
DF2 <- transform(DF, START = yr(START), END = yr(END))
years <- data.frame(year = min(DF2$START):max(DF2$END))
tab.df <- sqldf("select year, ATTRIBUTE, count(*) as count from
(select distinct year, ATTRIBUTE, ID
from years, DF2
where year between START and END)
group by year, ATTRIBUTE", dbname = tempfile())
tab <- xtabs(count ~., tab.df)
I did not intend to give an answer here as the problem seemed a bit tricky, so I could have made up only an ugly solution, but after reading @Roman Luštrik's comment I could not escape this challenge :)
Anyway, I am not sure if you will like this solution, so be prepared!
Loading your demo data:
dfr <- structure(list(ID = c(1, 1, 2, 2), ATTRIBUTE = structure(c(1L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), START = c("01-01-2000", "05-11-2001", "01-02-2002", "01-06-2008"), END = c("15-03-2010", "06-02-2002", "08-05-2008", "01-07-2008")), .Names = c("ID", "ATTRIBUTE", "START", "END"), row.names = c(NA, -4L), class = "data.frame")
We are not dealing with months and so, just keeping the year in the table:
> dfr$START <- as.numeric(substr(dfr$START, 7, 10))
> dfr$END <- as.numeric(substr(dfr$END, 7, 10))
> dfr
ID ATTRIBUTE START END
1 1 A 2000 2010
2 1 B 2001 2002
3 2 B 2002 2008
4 2 B 2008 2008
Clear out duplicated rows (by merging years based on ID
and ATTRIBUTE
):
> dfr <- merge(aggregate(START ~ ID + ATTRIBUTE, dfr, min), aggregate(END ~ ID + ATTRIBUTE, dfr, max), by=c('ID', 'ATTRIBUTE'))
> dfr
ID ATTRIBUTE START END
1 1 A 2000 2010
2 1 B 2001 2002
3 2 B 2002 2008
And run a one-liner with some apply
, lapply
, do.call
and friends to show the beauty of R! :)
> t(table(do.call(rbind, lapply(apply(dfr, 1, function(x) cbind(x[2], x[3]:x[4])), function(x) as.data.frame(x)))))
V1
V2 A B
2000 1 0
2001 1 1
2002 1 2
2003 1 1
2004 1 1
2005 1 1
2006 1 1
2007 1 1
2008 1 1
2009 1 0
2010 1 0
Slighty convoluted, but try this:
dfr <- data.frame(ID=c(1,1,2,2),ATTRIBUTE=c("A","B","B","B"),START=c("01-01-2000","05-11-2001","01-02-2002","01-06-2008"),END=c("15-03-2010","06-02-2002","08-05-2008","01-07-2008"),stringsAsFactors=F)
dfr$ATTRIBUTE <- factor(dfr$ATTRIBUTE)
actYears <- mapply(":",as.numeric(substr(dfr$START,7,10)),as.numeric(substr(dfr$END,7,10)))
yrRng <- ":"(range(actYears)[1],range(actYears)[2])
yrTable <- sapply(actYears,function(x) yrRng %in% x)
rownames(yrTable) <- yrRange
colnames(yrTable) <- dfr$ATTRIBUTE
Which gives:
yrTable
A B B B
2000 TRUE FALSE FALSE FALSE
2001 TRUE TRUE FALSE FALSE
2002 TRUE TRUE TRUE FALSE
2003 TRUE FALSE TRUE FALSE
2004 TRUE FALSE TRUE FALSE
2005 TRUE FALSE TRUE FALSE
2006 TRUE FALSE TRUE FALSE
2007 TRUE FALSE TRUE FALSE
2008 TRUE FALSE TRUE TRUE
2009 TRUE FALSE FALSE FALSE
2010 TRUE FALSE FALSE FALSE
Now we can build the table:
t(apply(yrTable,1,function(x) table(dfr$ATTRIBUTE[x])))
A B
2000 1 0
2001 1 1
2002 1 2
2003 1 1
2004 1 1
2005 1 1
2006 1 1
2007 1 1
2008 1 2
2009 1 0
2010 1 0
Its still double counting the IDs, but it would probably be easier to merge overlapping ranges in the original data.frame
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With