I've been preparing my data, and somehow I have far fewer observations after merging my data sets.
Since I don't have longitude and latitude in my data, I've been using the following code after installing and loading the zipcode package (tel1 is my data frame containing zip codes):
data(zipcode)  # load the zip-code reference table from the package
result <- merge(zipcode, tel1, by.x = 'zip', by.y = 'zip_code')
Before merging I had 195956 observations; after merging it dropped to 180090, and I don't understand why.
My understanding was that I just merged them where zip was equal to zip_code, adding the information from the zipcode data set to my data frame tel1.
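For example, with two tiny made-up data frames, I would expect only the matching keys to survive:

a <- data.frame(zip = c('04330', '04401'), city = c('Augusta', 'Bangor'))
b <- data.frame(zip_code = c('04330', '99999'), calls = c(10, 5))
merge(a, b, by.x = 'zip', by.y = 'zip_code')  # keeps only the row where zip == zip_code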
Afterward I wanted to remove the rows containing NA, since the merge couldn't fill in values for them. I used this code:
final <- result[complete.cases(result),]
Then my number of observations dropped to 51006, which I just can't believe. There can't be that many mismatches in my data.
Is there any other code that I should use?
Afterwards I tried to delete the duplicates with this code:
last <- with(final, final[order(state, latitude, longitude),])
but the number of observations stayed the same (51006).
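From what I've read, order() only sorts the rows; to actually drop duplicates, would I need something like duplicated() instead?

last <- final[!duplicated(final), ]  # keeps the first occurrence of each duplicated row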
What did I do wrong, or is there a way to get my data into an Excel file after merging so I can manually check whether there really are so many mismatches? Thanks
You can use the all argument to merge:
result <- merge(zipcode, tel1, by.x='zip', by.y='zip_code', all.y=TRUE)
However, for rows where no match is found in the zipcode data, the added columns will be NA. So if you then drop incomplete rows (with na.omit() or complete.cases()), you will end up with the same "data loss".
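Before dropping anything, it can help to count how many rows are incomplete and which columns are responsible (result here standing for your merged data frame):

sum(!complete.cases(result))  # how many rows contain at least one NA
colSums(is.na(result))        # NA count per column, to see which column drives the loss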
Check the zip codes for the rows where there are NAs in the lat and long columns after the merge:
result[is.na(result$latitude) | is.na(result$longitude),]
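To inspect those rows by hand in Excel, you can export them to a CSV file with base R's write.csv (the file name here is just an example):

unmatched <- result[is.na(result$latitude) | is.na(result$longitude), ]
write.csv(unmatched, 'unmatched_zips.csv', row.names = FALSE)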
My guess is they aren't valid zip codes, or the list of zip codes you have is incomplete.
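One common culprit: if the zip codes were read in as numbers, leading zeros are lost ('07030' becomes 7030) and they no longer match. The zipcode package includes a clean.zipcodes() helper that pads short zips back to five digits and strips ZIP+4 suffixes; a sketch, assuming your column is tel1$zip_code:

library(zipcode)
data(zipcode)                                    # reference table of zips with lat/long
tel1$zip_code <- clean.zipcodes(tel1$zip_code)   # restore leading zeros, drop ZIP+4
result <- merge(zipcode, tel1, by.x = 'zip', by.y = 'zip_code', all.y = TRUE)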