When I merge two data frames, the result has more rows than the original data.
In this instance, the all data frame has 104956 rows, koppen has 3968 rows, and the alltest data frame has 130335 rows. Ordinarily, alltest should have no more rows than all.
Why is this inflation happening? I am not sure whether a reproducible example would help, since the same code has worked in previous instances where I have used it.
alltest <- merge(all, koppen, by = "fips", sort = F)
A left join in R is a merge operation between two data frames where the merge returns all of the rows from one table (the left side) and any matching rows from the second table. A left join in R will NOT return values of the second table which do not already exist in the first table.
First, from ?merge:
The rows in the two data frames that match on the specified columns are extracted, and joined together. If there is more than one match, all possible matches contribute one row each.
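To see that rule in action, here is a minimal sketch on made-up data. The data frames left and right and their values are purely illustrative and are not taken from the question.

```r
# Toy data: the key 1 appears twice on each side.
left  <- data.frame(FIPS = c(1, 1, 2), value = c(10, 20, 30))
right <- data.frame(FIPS = c(1, 1, 2), CLS = c("Dfc", "ET", "Cfa"))

merged <- merge(left, right, by = "FIPS")

nrow(left)    # 3
nrow(merged)  # 5: FIPS 1 gives 2 * 2 = 4 rows, FIPS 2 gives 1 * 1 = 1 row
```

Each left row is paired with every right row that shares its key, so duplicated keys on the right multiply the rows on the left.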
Using your link in the comments:
url <- "http://koeppen-geiger.vu-wien.ac.at/data/KoeppenGeiger.UScounty.txt"
koppen <- read.table(url, header=T, sep="\t")
nrow(koppen)
# [1] 3594
length(unique(koppen$FIPS))
# [1] 2789
So clearly koppen has duplicated FIPS codes. Examining the dataset and the website, it appears that many counties fall into more than one climate class. For example, Anchorage, Alaska has three climate classes:
koppen[koppen$FIPS==2020,]
# STATE COUNTY FIPS CLS PROP
# 73 Alaska Anchorage 2020 Dsc 0.010
# 74 Alaska Anchorage 2020 Dfc 0.961
# 75 Alaska Anchorage 2020 ET 0.029
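A quick way to diagnose this before merging is to list the duplicated keys and predict how many rows the merge will produce. The sketch below uses made-up data, and expected_merge_rows is a hypothetical helper written for this illustration, not a base R function.

```r
# Hypothetical helper: predicted row count of an inner merge on a single key.
# For each shared key, the merge yields (count in x) * (count in y) rows.
expected_merge_rows <- function(x_key, y_key) {
  x_counts <- table(x_key)
  y_counts <- table(y_key)
  shared <- intersect(names(x_counts), names(y_counts))
  sum(as.numeric(x_counts[shared]) * as.numeric(y_counts[shared]))
}

# Toy stand-ins for all and koppen (values are illustrative only).
all_toy <- data.frame(FIPS = c(2020, 2020, 1001), yield = c(1.2, 3.4, 5.6))
koppen_toy <- data.frame(FIPS = c(2020, 2020, 2020, 1001),
                         CLS  = c("Dsc", "Dfc", "ET", "Cfa"))

# Keys that occur more than once in the lookup table.
dup_keys <- unique(koppen_toy$FIPS[duplicated(koppen_toy$FIPS)])
dup_keys                                             # 2020

expected_merge_rows(all_toy$FIPS, koppen_toy$FIPS)   # 2 * 3 + 1 * 1 = 7
nrow(merge(all_toy, koppen_toy, by = "FIPS"))        # 7
```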
The solution depends on what you are trying to accomplish. If you want to extract all rows in all with any FIPS that appears in koppen, either of these should work:
merge(all, data.frame(FIPS = unique(koppen$FIPS)), by = "FIPS")
all[all$FIPS %in% unique(koppen$FIPS),]
If you need to append the county and state name to all, use this:
merge(all,unique(koppen[c("STATE","COUNTY","FIPS")]),by="FIPS")
EDIT: Based on the exchange in the comments below.
So, since there are sometimes multiple rows in koppen with the same FIPS but different CLS, we need a way to decide which of the rows (e.g., which CLS) to pick. Here are two ways:
# this extracts the row with the largest value of PROP, for that FIPS
url <- "http://koeppen-geiger.vu-wien.ac.at/data/KoeppenGeiger.UScounty.txt"
koppen <- read.csv(url, header=T, sep="\t")
koppen <- with(koppen,koppen[order(FIPS,-PROP),])
sub.koppen <- aggregate(koppen,by=list(koppen$FIPS),head,n=1)
result <- merge(all, sub.koppen, by="FIPS")
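# this extracts a whole row at random for each FIPS
# (sampling a row index keeps the columns of the chosen row together)
idx <- tapply(seq_len(nrow(koppen)), koppen$FIPS,
              function(i) i[sample.int(length(i), 1)])
sub.koppen <- koppen[idx, ]
result <- merge(all, sub.koppen, by="FIPS")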
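Whichever rule you use to pick a row, it is worth checking that the lookup table really has one row per FIPS before merging. The following is a sketch on made-up data. It uses duplicated() as an alternative way to keep the largest-PROP row, and the names all_toy, koppen_toy and lookup are illustrative only.

```r
# Toy stand-ins for koppen and all (values are illustrative only).
koppen_toy <- data.frame(
  FIPS = c(2020, 2020, 2020, 1001),
  CLS  = c("Dsc", "Dfc", "ET", "Cfa"),
  PROP = c(0.010, 0.961, 0.029, 1.000)
)
all_toy <- data.frame(FIPS = c(2020, 2020, 1001, 9999),
                      yield = c(1.2, 3.4, 5.6, 7.8))

# Sort by FIPS, then by descending PROP, and keep the first row per FIPS.
ordered <- koppen_toy[order(koppen_toy$FIPS, -koppen_toy$PROP), ]
lookup  <- ordered[!duplicated(ordered$FIPS), ]

# The lookup must have unique keys, otherwise the merge can inflate again.
stopifnot(anyDuplicated(lookup$FIPS) == 0)

result_toy <- merge(all_toy, lookup, by = "FIPS")
stopifnot(nrow(result_toy) <= nrow(all_toy))
```

Note that merge() does an inner join by default, so the row with FIPS 9999 is dropped here. Passing all.x = TRUE keeps unmatched rows from the first data frame and fills the added columns with NA.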