Why does merge result in more rows than original data?

Tags: join, r

When I merge two data frames, the result has more rows than the original data.

In this instance, the all data frame has 104956 rows, koppen has 3968 rows, and the alltest data frame has 130335 rows. I expected alltest to have at most as many rows as all.

Why is this inflation happening? I am not sure a reproducible example would help, since the same call worked in the previous cases where I used it.

alltest <- merge(all, koppen, by = "fips", sort = F)


asked Jun 10 '14 21:06

Geekuna Matata

1 Answer

First, from ?merge:

The rows in the two data frames that match on the specified columns are extracted, and joined together. If there is more than one match, all possible matches contribute one row each.
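
To make that concrete, here is a minimal sketch on made-up data. The data frames x and y and all their values are invented for illustration; they are not taken from the question.

```r
# Toy data, invented for illustration only
x <- data.frame(FIPS = c(1, 2, 3), value = c(10, 20, 30))
y <- data.frame(FIPS = c(1, 1, 1, 2, 3),
                CLS  = c("A", "B", "C", "D", "E"))

# Inner join on FIPS: the single x row with FIPS 1 matches three y rows
m <- merge(x, y, by = "FIPS")

nrow(x)  # 3
nrow(m)  # 5
```

The merged result is longer than x because the key is repeated in y, not because anything went wrong inside merge.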

Using your link in the comments:

url    <- "http://koeppen-geiger.vu-wien.ac.at/data/KoeppenGeiger.UScounty.txt"
koppen <- read.table(url, header=T, sep="\t")
nrow(koppen)
# [1] 3594
length(unique(koppen$FIPS))
# [1] 2789

So clearly koppen has duplicated FIPS codes, and merge returns one row for every matching pair, which is why alltest has more rows than all. Examining the dataset and the website, it appears that many counties fall into more than one climate class. For example, Anchorage, Alaska has three climate classes:

koppen[koppen$FIPS==2020,]
#     STATE    COUNTY FIPS CLS  PROP
# 73 Alaska Anchorage 2020 Dsc 0.010
# 74 Alaska Anchorage 2020 Dfc 0.961
# 75 Alaska Anchorage 2020  ET 0.029
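
If you want to check for this before merging, one possible approach is to count how often each key occurs on both sides. The sketch below uses toy data, and expected_merge_rows is a hypothetical helper written for this answer, not a base R function. It assumes an inner join on a single key column and ignores NA keys.

```r
# Toy data, invented for illustration only
x <- data.frame(FIPS = c(1, 1, 2, 4), v = 1:4)
y <- data.frame(FIPS = c(1, 1, 2, 3), CLS = c("A", "B", "C", "D"))

# A non-zero value means the key column contains at least one duplicate
anyDuplicated(y$FIPS)

# Hypothetical helper: predicts nrow(merge(x, y, by = key)) for an inner join
# by summing (count in x) * (count in y) over the keys present in both.
# Assumption: NA keys are ignored here.
expected_merge_rows <- function(x, y, key) {
  nx <- table(x[[key]])
  ny <- table(y[[key]])
  common <- intersect(names(nx), names(ny))
  sum(as.numeric(nx[common]) * as.numeric(ny[common]))
}

expected_merge_rows(x, y, "FIPS")  # 5 (2 * 2 for FIPS 1, plus 1 * 1 for FIPS 2)
nrow(merge(x, y, by = "FIPS"))     # 5
```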

The solution depends on what you are trying to accomplish. If you want to extract all rows in all whose FIPS appears in koppen, either of these should work (the key column must have the same name in both data frames; otherwise use by.x and by.y):

merge(all, unique(koppen["FIPS"]), by="FIPS")

all[all$FIPS %in% unique(koppen$FIPS),]

If you need to append the county and state name to all, use this:

merge(all,unique(koppen[c("STATE","COUNTY","FIPS")]),by="FIPS")
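
As a sanity check on a small scale, the sketch below compares the naive merge with the de-duplicated lookup. The Anchorage rows are the ones shown above; the Autauga row and the all_toy data frame are invented for illustration.

```r
# Toy stand-ins for the real data (all_toy and the Autauga row are invented)
all_toy <- data.frame(FIPS = c(2020, 2020, 1001), value = c(1.5, 2.5, 3.5))
koppen_toy <- data.frame(
  STATE  = c("Alaska", "Alaska", "Alaska", "Alabama"),
  COUNTY = c("Anchorage", "Anchorage", "Anchorage", "Autauga"),
  FIPS   = c(2020, 2020, 2020, 1001),
  CLS    = c("Dsc", "Dfc", "ET", "Cfa"),
  PROP   = c(0.010, 0.961, 0.029, 1)
)

# Naive merge: each all_toy row with FIPS 2020 matches three koppen_toy rows
naive <- merge(all_toy, koppen_toy, by = "FIPS")
nrow(naive)  # 7

# De-duplicated lookup: one row per FIPS, so the row count is preserved
lookup <- unique(koppen_toy[c("STATE", "COUNTY", "FIPS")])
res <- merge(all_toy, lookup, by = "FIPS")
nrow(res)    # 3
```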

EDIT: Based on the exchange in the comments.

Since koppen sometimes has multiple rows with the same FIPS but different CLS, we need a way to decide which row (i.e., which CLS) to pick. Here are two ways:

# this extracts the row with the largest value of PROP, for that FIPS
url        <- "http://koeppen-geiger.vu-wien.ac.at/data/KoeppenGeiger.UScounty.txt"
koppen     <- read.csv(url, header=T, sep="\t")
koppen     <- with(koppen,koppen[order(FIPS,-PROP),])
sub.koppen <- aggregate(koppen,by=list(koppen$FIPS),head,n=1)
result     <- merge(all, sub.koppen, by="FIPS")

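# this extracts a row at random; one row index is sampled per FIPS,
# so all columns of the result come from the same original row
idx        <- vapply(split(seq_len(nrow(koppen)), koppen$FIPS),
                     function(i) i[sample.int(length(i), 1)], integer(1))
sub.koppen <- koppen[idx,]
result     <- merge(all, sub.koppen, by="FIPS")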
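
As an alternative to aggregate for the "largest PROP" case, sorting and then dropping repeated keys with duplicated() should give the same one-row-per-FIPS table. This is a different technique from the one above, sketched here on toy data (the Autauga row is invented for illustration).

```r
# Toy data: the Anchorage rows are from the output above, Autauga is invented
koppen_toy <- data.frame(
  FIPS = c(2020, 2020, 2020, 1001),
  CLS  = c("Dsc", "Dfc", "ET", "Cfa"),
  PROP = c(0.010, 0.961, 0.029, 1)
)

# Sort so that the largest PROP comes first within each FIPS ...
ord <- koppen_toy[order(koppen_toy$FIPS, -koppen_toy$PROP), ]

# ... then keep only the first row of each FIPS
dominant <- ord[!duplicated(ord$FIPS), ]

dominant$CLS[dominant$FIPS == 2020]  # "Dfc"
anyDuplicated(dominant$FIPS)         # 0, so merging on FIPS cannot add rows
```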


answered Sep 21 '22 06:09

jlhoward
