Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Count Distinct Restaurants by Zip Code in R

Tags:

r

I have a data frame that has the median salary of every US zip code as well as the all fast food restaurants located within that zip code. Here is a sample of that data frame:

Row_NUM ZIP   MEDIAN RESTAURANT
26800   1001  56663  McDonald's
33161   1007  79076  McDonald's
23706   1008  63980  McDonald's
23709   1008  63980  McDonald's
30007   1008  63980  Taco Bell
30008   1008  63980  McDonald's
30009   1011  63476  McDonald's
24429   1013  36578  McDonald's
15323   1020  50058  KFC
29196   1020  50058  McDonald's
33127   1020  50058  McDonald's
39362   1020  50058  Wendy's
44914   1020  50058  Taco Bell
2542    1027  58573  Burger King 
35242   1027  58573  McDonald's

I want to do two things. First, I want to create a new data frame that has only unique zip codes, the median salary of that zip code, and the total number of restaurants in that zip code. So, for this data frame sample:

ZIP    MEDIAN  TOTAL_RESTAURANTS
1001   56663   1
1007   79076   1
1008   63980   4

Second, I want to create a data frame that has only the unique zip code and a column with the total number of each type of fast food restaurant. So for this data frame sample:

ZIP    MEDIAN  TOTAL_MCDONALDS  TOTAL_TACOBELL  TOTAL_KFC
1001   56663   1                0               0
1007   79076   1                0               0
1008   63980   3                1               0

I thought the below code would work, but it only gives me the zip code and total restaurant, and I am not sure how to modify it to include the other three columns.

df <- ddply(df,~ZIP, summarise,TOTAL_RESTAURANTS=length(RESTAURANT))

Any help would be greatly appreciated.

Edit: Here is what data types I am working with in my data frame.

    str(df)
    data.frame':    50002 obs. of  3 variables:
 $ ZIP       : int  44126 24014 77011 2190 48509 21061 43213 70130 31907 19422 ...
 $ MEDIAN    : int  54496 50175 27113 74205 50895 62408 36734 47591 38710 103683 ...
 $ RESTAURANT: Factor w/ 10 levels "McDonald's","Burger King",..: 2 2 2 2 2 2 2 2 2 2 ...

Here is my session info:

    R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] graphics  grDevices utils     datasets  grid      stats     methods   base     

other attached packages:
 [1] lubridate_1.3.3 extrafont_0.17  jsonlite_0.9.16 dplyr_0.4.2     tidyr_0.2.0     tableplot_0.3-5 reshape2_1.4.1  RCurl_1.95-4.7 
 [9] bitops_1.0-6    gplots_2.17.0   ggthemes_2.2.1  ggplot2_1.0.1  

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.0        Rttf2pt1_1.3.3     magrittr_1.5       MASS_7.3-43        munsell_0.4.2      colorspace_1.2-6   R6_2.1.1          
 [8] stringr_1.0.0      plyr_1.8.3         caTools_1.17.1     tools_3.2.2        parallel_3.2.2     gtable_0.1.2       KernSmooth_2.23-15
[15] DBI_0.3.1          extrafontdb_1.0    gtools_3.5.0       lazyeval_0.1.10    assertthat_0.1     digest_0.6.8       memoise_0.2.1     
[22] labeling_0.3       gdata_2.17.0       stringi_0.5-5      scales_0.3.0       proto_0.3-10 
like image 289
Spencer Avatar asked Oct 14 '15 23:10

Spencer


1 Answers

Uisng dplyr and reshape2, this would be one way to go. Since you want to summarise the data by ZIP, you can use the variable to group the data. One thing I am not clear is whether MEDIAN values are identical for each ZIP or not. Here, I assumed that you may have different values. Hence, I used median(). Using n(), you can find out how many fast food shops exist.

summarize(group_by(mydf, ZIP), mid = median(MEDIAN), total = n())

#If you have an identical MEDIAN for each ZIP, you could do;
summarize(group_by(mydf, ZIP), mid = first(MEDIAN), total = n())

#    ZIP   mid total
#  (int) (dbl) (int)
#1  1001 56663     1
#2  1007 79076     1
#3  1008 63980     4
#4  1011 63476     1
#5  1013 36578     1
#6  1020 50058     5
#7  1027 58573     2

For the second part, you can use dcast(). You want to see how many fast food stores exist by the types of fast food store. By the combination of ZIP and MEDIAN, you ask R to check how many shops (RESTAURANT) exist.

dcast(mydf, ZIP + MEDIAN ~ RESTAURANT, length, value.var = "RESTAURANT")

#   ZIP MEDIAN BurgerKing KFC McDonald's TacoBell Wendy's
#1 1001  56663          0   0          1        0       0
#2 1007  79076          0   0          1        0       0
#3 1008  63980          0   0          3        1       0
#4 1011  63476          0   0          1        0       0
#5 1013  36578          0   0          1        0       0
#6 1020  50058          0   1          2        1       1
#7 1027  58573          1   0          1        0       0

If you use data.table, you could do the following.

library(data.table)
setDT(mydf)[, list(mid = first(MEDIAN), total = .N), by = ZIP][]
# If you calculate median
setDT(mydf)[, list(mid = as.double(median(MEDIAN)), total = .N), by = ZIP][]

dcast(setDT(mydf), ZIP + MEDIAN ~ RESTAURANT, fun = length, value.var = "RESTAURANT")

DATA

mydf <-structure(list(Row_NUM = c(26800L, 33161L, 23706L, 23709L, 30007L, 
30008L, 30009L, 24429L, 15323L, 29196L, 33127L, 39362L, 44914L, 
2542L, 35242L), ZIP = c(1001L, 1007L, 1008L, 1008L, 1008L, 1008L, 
1011L, 1013L, 1020L, 1020L, 1020L, 1020L, 1020L, 1027L, 1027L
), MEDIAN = c(56663L, 79076L, 63980L, 63980L, 63980L, 63980L, 
63476L, 36578L, 50058L, 50058L, 50058L, 50058L, 50058L, 58573L, 
58573L), RESTAURANT = structure(c(3L, 3L, 3L, 3L, 4L, 3L, 3L, 
3L, 2L, 3L, 3L, 5L, 4L, 1L, 3L), .Label = c("BurgerKing", "KFC", 
"McDonald's", "TacoBell", "Wendy's"), class = "factor")), .Names = c("Row_NUM", 
"ZIP", "MEDIAN", "RESTAURANT"), class = "data.frame", row.names = c(NA, 
-15L))
like image 68
jazzurro Avatar answered Nov 11 '22 13:11

jazzurro