I have a data frame "a" of 59,720 obs. that looks like the one below. I want to assign a MARKETNAME to each observation from a lookup table.
> data (a)
DAY HOUR LEAD Row.Count DATE ITIME HOMEPHONE CITY STATE ZIPCODE ZONENAME
1 Monday 13:00 1 9430 7/1/2013 13:42:51 FORT LAUDERDALE FL 33315 68
2 Monday 13:00 1 9432 7/1/2013 13:43:50 xxxxx9802x PLEASANT GROVE AL 35127 82
3 Monday 13:00 1 9434 7/1/2013 13:46:18 5xxxx85x10 ORO VALLEY AZ 85737 54
4 Monday 0:00 1 9435 7/1/2013 0:04:34 50xxxx1x364 SPOKANE WA 99204 211
5 Monday 11:00 1 9436 7/1/2013 11:45:43 951xxxxx20 RIVERSIDE CA 92507 31
6 Monday 11:00 1 9437 7/1/2013 11:46:26 760xxxxx679 VISTA CA 92081 539
My lookup table "b" has 43,126 unique zip codes and looks like this:
> data (b)
MARKETNAME ZIPCODE
NEW YORK 00501
NEW YORK 00544
SPRINGFIELD-HOLYOKE 01001
SPRINGFIELD-HOLYOKE 01002
SPRINGFIELD-HOLYOKE 01003
SPRINGFIELD-HOLYOKE 01004
I simply wanted to assign the MARKETNAME to my dataset "a" by comparing its ZIPCODE with the ZIPCODE in "b". So I used:
> c <- merge(a, b, by="ZIPCODE")
It returned 58,972 obs., which meant I lost 748 obs. I did not want to lose any records from "a", so I changed my code as follows:
> c <- merge(a, b, by="ZIPCODE", all.x=TRUE)
Strangely, this returned 61,652 obs. instead of the 59,720 obs. I expected, i.e. the original data frame "a" with some NAs in MARKETNAME.
As per the documentation for the all.x argument of merge:
"if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output."
My interpretation of this is definitely wrong. Can someone please explain what I am doing wrong and how I can accomplish this simple task?
I referred to "How to merge data frames and change element values based on certain conditions?", "Subsetting and Merging from 2 Related Data Frames in r", and "how to merge two unequal size data frame in R", but none of them matches my problem.
I prefer join from plyr, which by default is a left join, returning all records in the first data frame:
library(plyr)
c <- join(a, b, by="ZIPCODE")
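A left join can only return more rows than "a" if some ZIPCODE values occur more than once in "b", because every matching pair of rows produces one output row. The sketch below uses small made-up data frames (the ID column, the HARTFORD entry, and the names n_dup, inflated, b_unique and fixed are illustrative assumptions, not part of the original data) to show the effect in base R and one possible way around it.

```r
# Toy stand-ins for the real data; all values here are made up for illustration.
a <- data.frame(ID = 1:4,
                ZIPCODE = c("01001", "01002", "99999", "01001"),
                stringsAsFactors = FALSE)

# Lookup in which ZIPCODE "01001" appears twice (the assumed cause of the extra rows).
b <- data.frame(MARKETNAME = c("SPRINGFIELD-HOLYOKE", "HARTFORD", "SPRINGFIELD-HOLYOKE"),
                ZIPCODE = c("01001", "01001", "01002"),
                stringsAsFactors = FALSE)

# Count lookup rows whose ZIPCODE already appeared in an earlier row.
n_dup <- sum(duplicated(b$ZIPCODE))

# A left merge emits one row per matching pair, so duplicated keys in b inflate the result.
inflated <- merge(a, b, by = "ZIPCODE", all.x = TRUE)

# Keep only the first row per ZIPCODE in the lookup (an arbitrary tie-break), then merge again.
b_unique <- b[!duplicated(b$ZIPCODE), ]
fixed <- merge(a, b_unique, by = "ZIPCODE", all.x = TRUE)

# Alternative that also preserves the row order of a: positional lookup with match().
a$MARKETNAME <- b_unique$MARKETNAME[match(a$ZIPCODE, b_unique$ZIPCODE)]
```

Here nrow(inflated) is larger than nrow(a), while nrow(fixed) equals nrow(a) and the unmatched zip code gets NA. On the real data, sum(duplicated(b$ZIPCODE)) would show whether this is actually what is happening. Note that join from plyr matches all rows by default as well; its match = "first" argument keeps only the first match per key.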