Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reorganizing a unique (NYC MTA turnstile) dataset in R

Tags:

r

dataformat

I have a unique looking dataset (NYC MTA turnstile data) that I need to re-organize in a certain way to perform some analyses. I have written code that works but is not very efficient as it is a very large dataset. I'm hoping someone can suggest a better way.

The dataset in question has 43 columns. Columns 1-3 are unique identifiers (i.e. turnstile at a specific station). Then columns 4-8 identifies metered time, metered type, entries and then exit. 9-13 and then the rest of the columns up to 43 follow the same pattern. The dataset is ugly so I don't want to post it here but you can find it in the below link. You will have to look at pre 10/18/14 data.

http://web.mta.info/developers/turnstile.html

#Vector of column numbers that identifies the break
a <- c(4, 9, 14, 19, 24, 29, 34, 39)
#The actual loop to re-sort the data
for (i in 1:nrow(data)) {
    for (j in 1:length(a)) {
        if (j == 8 ){ all <- rbind(all, cbind(data[i, 1:3], data[i, a[j]:43])) }
        else { all <- rbind(all, cbind(data[i, 1:3], data[i,a[j]:(a[j+1]-1)])) } } }

The end result of all of this is something that looks like this.

        1    2        3        1        2       3       4        5
5083  H026 R137 00-00-00 10-04-14 00:00:00 REGULAR 4072851 10491832
50831 H026 R137 00-00-00 10-04-14 04:00:00 REGULAR 4072918 10492356
50832 H026 R137 00-00-00 10-04-14 08:00:00 REGULAR 4073125 10492613
50833 H026 R137 00-00-00 10-04-14 12:00:00 REGULAR 4073511 10493116
50834 H026 R137 00-00-00 10-04-14 16:00:00 REGULAR 4073820 10493877
50835 H026 R137 00-00-00 10-04-14 20:00:00 REGULAR 4074140 10494817

It works, but I know there's a much more efficient way out there to do this. Any help would be very much appreciated!

edit:

I should add a bit more to this as I left out some critical pieces that may change the approach to this. After I read in the data with read.csv I subset the data with only a couple of the meters(column 2). Since I liked the suggestion I converted the subsetted data into a string as seen below. This actually performs pretty decently but any further suggestion would be appreciated!

out1 <- function() {
  data <- read.csv(name, header=FALSE)

##Isolate data for stations included in network area
  station <- subset(data, V2%in% station_names)
  data <- apply(station, 1, paste, collapse=",")
  starts <- seq(from=4, to=43, by=5)
  new_data <- rbindlist(lapply(strsplit(data, ","), function(x) {
  rbindlist(lapply(starts, function(y) {
  as.list(x[c(1:3, y:(y+4))])
  }))
})) 
setnames(new_data, colnames(new_data), c("C.A", "UNIT", "SCP", "DATE",  "TIME","DESC", "ENTRIES", "EXIT"))
new_data <- as.data.frame(new_data)
}
like image 781
Sho Ohata Avatar asked Jan 08 '15 22:01

Sho Ohata


1 Answers

If you don't mind doing the processing on data load:

# data via http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description_pre-10-18-2014.txt

data <- readLines(textConnection("A002,R051,02-00-00,03-21-10,00:00:00,REGULAR,002670738,000917107,03-21-10,04:00:00,REGULAR,002670738,000917107,03-21-10,08:00:00,REGULAR,002670746,000917117,03-21-10,12:00:00,REGULAR,002670790,000917166,03-21-10,16:00:00,REGULAR,002670932,000917204,03-21-10,20:00:00,REGULAR,002671164,000917230,03-22-10,00:00:00,REGULAR,002671181,000917231,03-22-10,04:00:00,REGULAR,002671181,000917231
A002,R051,02-00-00,03-22-10,08:00:00,REGULAR,002671220,000917324,03-22-10,12:00:00,REGULAR,002671364,000917640,03-22-10,16:00:00,REGULAR,002671651,000917719,03-22-10,20:00:00,REGULAR,002672430,000917789,03-23-10,00:00:00,REGULAR,002672473,000917795,03-23-10,04:00:00,REGULAR,002672474,000917795,03-23-10,08:00:00,REGULAR,002672516,000917876,03-23-10,12:00:00,REGULAR,002672652,000917934
A002,R051,02-00-00,03-23-10,16:00:00,REGULAR,002672879,000917996,03-23-10,20:00:00,REGULAR,002673636,000918073,03-24-10,00:00:00,REGULAR,002673683,000918079,03-24-10,04:00:00,REGULAR,002673683,000918079,03-24-10,08:00:00,REGULAR,002673722,000918171,03-24-10,12:00:00,REGULAR,002673876,000918514,03-24-10,16:00:00,REGULAR,002674221,000918594,03-24-10,20:00:00,REGULAR,002675082,000918671
A002,R051,02-00-00,03-25-10,00:00:00,REGULAR,002675153,000918675,03-25-10,04:00:00,REGULAR,002675153,000918675,03-25-10,08:00:00,REGULAR,002675190,000918752,03-25-10,12:00:00,REGULAR,002675345,000919053,03-25-10,16:00:00,REGULAR,002675676,000919118,03-25-10,20:00:00,REGULAR,002676557,000919179,03-26-10,00:00:00,REGULAR,002676688,000919207,03-26-10,04:00:00,REGULAR,002676694,000919208
A002,R051,02-00-00,03-26-10,08:00:00,REGULAR,002676735,000919287,03-26-10,12:00:00,REGULAR,002676887,000919607,03-26-10,16:00:00,REGULAR,002677213,000919680,03-26-10,20:00:00,REGULAR,002678039,000919743,03-27-10,00:00:00,REGULAR,002678144,000919756,03-27-10,04:00:00,REGULAR,002678145,000919756,03-27-10,08:00:00,REGULAR,002678155,000919777,03-27-10,12:00:00,REGULAR,002678247,000919859
A002,R051,02-00-00,03-27-10,16:00:00,REGULAR,002678531,000919908,03-27-10,20:00:00,REGULAR,002678892,000919964,03-28-10,00:00:00,REGULAR,002678929,000919966,03-28-10,04:00:00,REGULAR,002678929,000919966,03-28-10,08:00:00,REGULAR,002678935,000919982,03-28-10,12:00:00,REGULAR,002679003,000920006,03-28-10,16:00:00,REGULAR,002679231,000920059,03-28-10,20:00:00,REGULAR,002679475,000920098"))


library(data.table)

starts <- seq(from=4, to=43, by=5)

new_data <- rbindlist(lapply(strsplit(data, ","), function(x) {

  rbindlist(lapply(starts, function(y) {
    as.list(x[c(1:3, y:(y+4))])
  }))

}))

setnames(new_data, colnames(new_data), c("control_area", "unit", "scp", "date", "time", "description", "entries", "exits"))

dplyr::glimpse(new_data)

## Observations: 48
## Variables:
## $ control_area (fctr) A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A0...
## $ unit         (fctr) R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R0...
## $ scp          (fctr) 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, ...
## $ date         (fctr) 03-21-10, 03-21-10, 03-21-10, 03-21-10, 03-21-10, 03-21-10, 03-22-10, 03-22-10, ...
## $ time         (fctr) 00:00:00, 04:00:00, 08:00:00, 12:00:00, 16:00:00, 20:00:00, 00:00:00, 04:00:00, ...
## $ description  (fctr) REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR,...
## $ entries      (fctr) 002670738, 002670738, 002670746, 002670790, 002670932, 002671164, 002671181, 002...
## $ exits        (fctr) 000917107, 000917107, 000917117, 000917166, 000917204, 000917230, 000917231, 000...
like image 192
hrbrmstr Avatar answered Nov 18 '22 06:11

hrbrmstr