Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split data by year

Tags:

split

dataframe

r

I have data like this:

ID    ATTRIBUTE        START          END
 1            A   01-01-2000   15-03-2010
 1            B   05-11-2001   06-02-2002
 2            B   01-02-2002   08-05-2008
 2            B   01-06-2008   01-07-2008

I now want to count the number of different IDs having a certain attribute per year.

A result could look like this:

YEAR    count(A)    count(B)
2000          1           0
2001          1           1
2002          1           2
2003          1           1
2004          1           1
2005          1           1
2006          1           1
2007          1           1
2008          1           1
2009          1           0
2010          1           0

I the second step of counting the occurences is probably easy.

But how would I split my data into years?

Thank you in advance!

like image 603
speendo Avatar asked Oct 24 '11 08:10

speendo


4 Answers

Here is an approach using a few of Hadley's packages.

library(lubridate); library(reshape2); library(plyr)

# extract years from start and end dates after converting them to date
dfr2 = transform(dfr, START = year(dmy(START)), END = year(dmy(END)))

# for every row, construct a sequence of years from start to end
dfr2 = adply(dfr2, 1, transform, YEAR = START:END)

# create pivot table of year vs. attribute with number of unique values of ID
dcast(dfr2, YEAR ~ ATTRIBUTE, function(x) length(unique(x)), value_var = 'ID')

EDIT: If the original data.frame is large, then adply might take a lot of time. A useful alternate in such cases is to use the data.table package. Here is how we can replace the adply call using data.table.

require(data.table)
dfr2 = data.table(dfr2)[,list(YEAR = START:END),'ID, ATTRIBUTE']
like image 182
Ramnath Avatar answered Nov 16 '22 02:11

Ramnath


Here is a solution that only uses the core of R. First we show the input data to keep this all self contained:

DF <- data.frame(ID = c(1, 1, 2, 2), 
    ATTRIBUTE = c("A", "B", "B", "B"), 
    START = c("01-01-2000", "05-11-2001", "01-02-2002", "01-06-2008"), 
    END = c("15-03-2010", "06-02-2002", "08-05-2008", "01-07-2008"))

Now that we have the input the solution follows: yr is defined to be a function which extracts the year. The guts of the calculation is the statement following the definition of yr. For each row of DF the anonymous function produces a data frame having the years spanned in column 1 and the ATTRIBUTE and ID in columns 2 and 3. For example, the data frame corresponding to the first row of DF is the 11 row data.frame(YEAR = 2000:2010, ATTRIBUTE = 1, ID = "A") and the data frame corresponding to the second row of DF is the two row data.frame(YEAR = 2001:2002, ATTRIBUTE = 1, ID = "B"). The lapply produces a list of such data frames, one for each row of DF so in the example input above it produces a list with 4 components. Using do.call we rbind the components of that list, i.e. the individuals data frames, producing a single large data frame. We eliminate duplicate rows (using unique) from this large data frame, drop the ID column (the third column) and run table on the result:

yr <- function(d) as.numeric(sub(".*-", "", d))
out <- table(unique(do.call(rbind, lapply(1:nrow(DF), function(r) with(DF[r, ],
    data.frame(YEAR = seq(yr(START), yr(END)), ATTRIBUTE, ID)))))[, -3])

The resulting table is:

> out
      ATTRIBUTE
YEAR   A B
  2000 1 0
  2001 1 1
  2002 1 2
  2003 1 1
  2004 1 1
  2005 1 1
  2006 1 1
  2007 1 1
  2008 1 1
  2009 1 0
  2010 1 0

EDIT:

Poster has later indicated that memory might be a problem so here is an sqldf solution which processes the key large intermediate results in sqlite outside of R (the dbname = tempfile() tells it to do that) so any memory limitation of R will not affect it. It uses the same input and the same yr function shown above and returns the same result, tab is the same asout above. Also try it without the dbname = tempfile() in case it actually does fit in memory.

library(sqldf)

DF2 <- transform(DF, START = yr(START), END = yr(END))
years <- data.frame(year = min(DF2$START):max(DF2$END))

tab.df <- sqldf("select year, ATTRIBUTE, count(*) as count from
    (select distinct year, ATTRIBUTE, ID
    from years, DF2
    where year between START and END)
    group by year, ATTRIBUTE", dbname = tempfile())

tab <- xtabs(count ~., tab.df)
like image 27
G. Grothendieck Avatar answered Nov 16 '22 02:11

G. Grothendieck


I did not intend to give an answer here as the problem seemed a bit tricky, so I could have made up only an ugly solution, but after reading @Roman Luštrik's comment I could not escape this challenge :)

Anyway, I am not sure if you will like this solution, so be prepared!

Loading your demo data:

dfr <- structure(list(ID = c(1, 1, 2, 2), ATTRIBUTE = structure(c(1L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), START = c("01-01-2000", "05-11-2001", "01-02-2002", "01-06-2008"), END = c("15-03-2010", "06-02-2002", "08-05-2008", "01-07-2008")), .Names = c("ID", "ATTRIBUTE", "START", "END"), row.names = c(NA, -4L), class = "data.frame")

We are not dealing with months and so, just keeping the year in the table:

> dfr$START <- as.numeric(substr(dfr$START, 7, 10))
> dfr$END <- as.numeric(substr(dfr$END, 7, 10))
> dfr
  ID ATTRIBUTE START  END
1  1         A  2000 2010
2  1         B  2001 2002
3  2         B  2002 2008
4  2         B  2008 2008

Clear out duplicated rows (by merging years based on ID and ATTRIBUTE):

> dfr <- merge(aggregate(START ~ ID + ATTRIBUTE, dfr, min), aggregate(END ~ ID + ATTRIBUTE, dfr, max), by=c('ID', 'ATTRIBUTE'))
> dfr
  ID ATTRIBUTE START  END
1  1         A  2000 2010
2  1         B  2001 2002
3  2         B  2002 2008

And run a one-liner with some apply, lapply, do.call and friends to show the beauty of R! :)

> t(table(do.call(rbind, lapply(apply(dfr, 1, function(x) cbind(x[2], x[3]:x[4])), function(x) as.data.frame(x)))))
      V1
V2     A B
  2000 1 0
  2001 1 1
  2002 1 2
  2003 1 1
  2004 1 1
  2005 1 1
  2006 1 1
  2007 1 1
  2008 1 1
  2009 1 0
  2010 1 0
like image 44
daroczig Avatar answered Nov 16 '22 01:11

daroczig


Slighty convoluted, but try this:

dfr <- data.frame(ID=c(1,1,2,2),ATTRIBUTE=c("A","B","B","B"),START=c("01-01-2000","05-11-2001","01-02-2002","01-06-2008"),END=c("15-03-2010","06-02-2002","08-05-2008","01-07-2008"),stringsAsFactors=F)
dfr$ATTRIBUTE <- factor(dfr$ATTRIBUTE)

actYears <- mapply(":",as.numeric(substr(dfr$START,7,10)),as.numeric(substr(dfr$END,7,10)))

yrRng <- ":"(range(actYears)[1],range(actYears)[2])

yrTable <- sapply(actYears,function(x) yrRng %in% x)
rownames(yrTable) <- yrRange
colnames(yrTable) <- dfr$ATTRIBUTE

Which gives:

yrTable
        A     B     B     B
2000 TRUE FALSE FALSE FALSE
2001 TRUE  TRUE FALSE FALSE
2002 TRUE  TRUE  TRUE FALSE
2003 TRUE FALSE  TRUE FALSE
2004 TRUE FALSE  TRUE FALSE
2005 TRUE FALSE  TRUE FALSE
2006 TRUE FALSE  TRUE FALSE
2007 TRUE FALSE  TRUE FALSE
2008 TRUE FALSE  TRUE  TRUE
2009 TRUE FALSE FALSE FALSE
2010 TRUE FALSE FALSE FALSE

Now we can build the table:

t(apply(yrTable,1,function(x) table(dfr$ATTRIBUTE[x])))
     A B
2000 1 0
2001 1 1
2002 1 2
2003 1 1
2004 1 1
2005 1 1
2006 1 1
2007 1 1
2008 1 2
2009 1 0
2010 1 0

Its still double counting the IDs, but it would probably be easier to merge overlapping ranges in the original data.frame.

like image 37
James Avatar answered Nov 16 '22 02:11

James