Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr does not group data by date

Tags:

r

dplyr

strptime

I am trying to calculate the frequency of bikes that are taken by people using a dataset provided by Leada.

Here is the code:

library(dplyr)

setAs("character", "POSIXlt", function(from) strptime(from, format = "%m/%d/%y %H:%M"))
d <- read.csv("http://mandrillapp.com/track/click/30315607/s3-us-west-1.amazonaws.com?p=eyJzIjoiemxlVjNUREczQ2l5UFVPeEFCalNUdmlDYTgwIiwidiI6MSwicCI6IntcInVcIjozMDMxNTYwNyxcInZcIjoxLFwidXJsXCI6XCJodHRwczpcXFwvXFxcL3MzLXVzLXdlc3QtMS5hbWF6b25hd3MuY29tXFxcL2RhdGF5ZWFyXFxcL2Jpa2VfdHJpcF9kYXRhLmNzdlwiLFwiaWRcIjpcImEyODNiNjMzOWJkOTQxMGM5ZjlkYzE0MmQ0NDQ5YmU4XCIsXCJ1cmxfaWRzXCI6W1wiMTVlYzMzNWM1NDRlMTM1ZDI0YjAwODE4ZjI5YTdkMmFkZjU2NWQ2MVwiXX0ifQ",
              colClasses = c("numeric", "numeric", "POSIXlt", "factor", "numeric", "POSIXlt", "factor", "numeric", "numeric", "factor", "character"),
              stringsAsFactors = T)
names(d)[9] <- "BikeNo"

d <- tbl_df(d)

d <- d %>% mutate(Weekday = factor(weekdays(Start.Date)))
d %>% group_by(Weekday) 
  %>% summarise(Total = n()) 
  %>% select(Weekday, Total)

It is strange but dplyr does not want to group data by Weekday saying:

Error: column 'Start.Date' has unsupported type

Why it cares about Start.Date column where I group by a factor? You can run the code locally to reproduce the error: it will download the data automatically.

P.S. I am using dplyr version: dplyr_0.3.0.2

like image 870
Sergei Avatar asked Jan 07 '15 21:01

Sergei


2 Answers

The lubridate package is useful when dealing with dates. Here is the code to parse Start.Date and End.Date, extract week days, then group by week days:

Read dates as character vectors

library(dplyr)
library(lubridate)
# For some reason your instruction to load the csv directly from a url
# didn't work. I save the csv to a temporary directory.
d <- read.csv("/tmp/bike_trip_data.csv", colClasses = c("numeric", "numeric", "character", "factor", "numeric", "character", "factor", "numeric", "numeric", "factor", "character"), stringsAsFactors = T)

names(d)[9] <- "BikeNo"
d <- tbl_df(d)

Use lubridate to convert start date and end date

d <- d %>% 
  mutate(
    Start.Date = parse_date_time(Start.Date,"%m/%d/%y %H:%M"),
    End.Date = parse_date_time(End.Date,"%m/%d/%y %H:%M"),
    Weekday = wday(Start.Date, label=TRUE, abbr=FALSE))

Number of lines per week day

d %>%
  group_by(Weekday) %>%
  summarise(Total = n())

#     Weekday Total
# 1    Sunday 10587
# 2    Monday 23138
# 3   Tuesday 24678
# 4 Wednesday 23651
# 5  Thursday 25265
# 6    Friday 24283
# 7  Saturday 12413
like image 167
Paul Rougieux Avatar answered Nov 16 '22 09:11

Paul Rougieux


I am sorry if this issue is long forgotten, but it weirds me out to see everyone recommending to convert to POSIX.ct or character when I have been using the much simpler solution of calling the arrange function from the plyr package using plyr::arrange, as it doesn't seem to have issues with the POSIXlt formats. As I am usually not the one finding the easiest solution for a problem in R, I am starting to think that there is something wrong with it. Does it not do the same as the dplyr version?

like image 45
Virginia Morera Pujol Avatar answered Nov 16 '22 09:11

Virginia Morera Pujol