I have a data frame that has the median salary of every US zip code as well as the all fast food restaurants located within that zip code. Here is a sample of that data frame:
Row_NUM ZIP MEDIAN RESTAURANT
26800 1001 56663 McDonald's
33161 1007 79076 McDonald's
23706 1008 63980 McDonald's
23709 1008 63980 McDonald's
30007 1008 63980 Taco Bell
30008 1008 63980 McDonald's
30009 1011 63476 McDonald's
24429 1013 36578 McDonald's
15323 1020 50058 KFC
29196 1020 50058 McDonald's
33127 1020 50058 McDonald's
39362 1020 50058 Wendy's
44914 1020 50058 Taco Bell
2542 1027 58573 Burger King
35242 1027 58573 McDonald's
I want to do two things. First, I want to create a new data frame that has only unique zip codes, the median salary of that zip code, and the total number of restaurants in that zip code. So, for this data frame sample:
ZIP MEDIAN TOTAL_RESTAURANTS
1001 56663 1
1007 79076 1
1008 63980 4
Second, I want to create a data frame that has only the unique zip code and a column with the total number of each type of fast food restaurant. So for this data frame sample:
ZIP MEDIAN TOTAL_MCDONALDS TOTAL_TACOBELL TOTAL_KFC
1001 56663 1 0 0
1007 79076 1 0 0
1008 63980 3 1 0
I thought the below code would work, but it only gives me the zip code and total restaurant, and I am not sure how to modify it to include the other three columns.
df <- ddply(df,~ZIP, summarise,TOTAL_RESTAURANTS=length(RESTAURANT))
Any help would be greatly appreciated.
Edit: Here is what data types I am working with in my data frame.
str(df)
data.frame': 50002 obs. of 3 variables:
$ ZIP : int 44126 24014 77011 2190 48509 21061 43213 70130 31907 19422 ...
$ MEDIAN : int 54496 50175 27113 74205 50895 62408 36734 47591 38710 103683 ...
$ RESTAURANT: Factor w/ 10 levels "McDonald's","Burger King",..: 2 2 2 2 2 2 2 2 2 2 ...
Here is my session info:
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] graphics grDevices utils datasets grid stats methods base
other attached packages:
[1] lubridate_1.3.3 extrafont_0.17 jsonlite_0.9.16 dplyr_0.4.2 tidyr_0.2.0 tableplot_0.3-5 reshape2_1.4.1 RCurl_1.95-4.7
[9] bitops_1.0-6 gplots_2.17.0 ggthemes_2.2.1 ggplot2_1.0.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.0 Rttf2pt1_1.3.3 magrittr_1.5 MASS_7.3-43 munsell_0.4.2 colorspace_1.2-6 R6_2.1.1
[8] stringr_1.0.0 plyr_1.8.3 caTools_1.17.1 tools_3.2.2 parallel_3.2.2 gtable_0.1.2 KernSmooth_2.23-15
[15] DBI_0.3.1 extrafontdb_1.0 gtools_3.5.0 lazyeval_0.1.10 assertthat_0.1 digest_0.6.8 memoise_0.2.1
[22] labeling_0.3 gdata_2.17.0 stringi_0.5-5 scales_0.3.0 proto_0.3-10
Uisng dplyr
and reshape2
, this would be one way to go. Since you want to summarise the data by ZIP
, you can use the variable to group the data. One thing I am not clear is whether MEDIAN values are identical for each ZIP or not. Here, I assumed that you may have different values. Hence, I used median()
. Using n()
, you can find out how many fast food shops exist.
summarize(group_by(mydf, ZIP), mid = median(MEDIAN), total = n())
#If you have an identical MEDIAN for each ZIP, you could do;
summarize(group_by(mydf, ZIP), mid = first(MEDIAN), total = n())
# ZIP mid total
# (int) (dbl) (int)
#1 1001 56663 1
#2 1007 79076 1
#3 1008 63980 4
#4 1011 63476 1
#5 1013 36578 1
#6 1020 50058 5
#7 1027 58573 2
For the second part, you can use dcast()
. You want to see how many fast food stores exist by the types of fast food store. By the combination of ZIP
and MEDIAN
, you ask R to check how many shops (RESTAURANT) exist.
dcast(mydf, ZIP + MEDIAN ~ RESTAURANT, length, value.var = "RESTAURANT")
# ZIP MEDIAN BurgerKing KFC McDonald's TacoBell Wendy's
#1 1001 56663 0 0 1 0 0
#2 1007 79076 0 0 1 0 0
#3 1008 63980 0 0 3 1 0
#4 1011 63476 0 0 1 0 0
#5 1013 36578 0 0 1 0 0
#6 1020 50058 0 1 2 1 1
#7 1027 58573 1 0 1 0 0
If you use data.table
, you could do the following.
library(data.table)
setDT(mydf)[, list(mid = first(MEDIAN), total = .N), by = ZIP][]
# If you calculate median
setDT(mydf)[, list(mid = as.double(median(MEDIAN)), total = .N), by = ZIP][]
dcast(setDT(mydf), ZIP + MEDIAN ~ RESTAURANT, fun = length, value.var = "RESTAURANT")
DATA
mydf <-structure(list(Row_NUM = c(26800L, 33161L, 23706L, 23709L, 30007L,
30008L, 30009L, 24429L, 15323L, 29196L, 33127L, 39362L, 44914L,
2542L, 35242L), ZIP = c(1001L, 1007L, 1008L, 1008L, 1008L, 1008L,
1011L, 1013L, 1020L, 1020L, 1020L, 1020L, 1020L, 1027L, 1027L
), MEDIAN = c(56663L, 79076L, 63980L, 63980L, 63980L, 63980L,
63476L, 36578L, 50058L, 50058L, 50058L, 50058L, 50058L, 58573L,
58573L), RESTAURANT = structure(c(3L, 3L, 3L, 3L, 4L, 3L, 3L,
3L, 2L, 3L, 3L, 5L, 4L, 1L, 3L), .Label = c("BurgerKing", "KFC",
"McDonald's", "TacoBell", "Wendy's"), class = "factor")), .Names = c("Row_NUM",
"ZIP", "MEDIAN", "RESTAURANT"), class = "data.frame", row.names = c(NA,
-15L))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With