R - Compare two data frames of different length for same values in two columns

Tags: r, compare

This is a question about how to compare several columns of two different data frames with varying length.

I have two data frames (data from receiver1 (rec1) and receiver2 (rec2)) of different lengths containing positions for 4 different ships:

rec1 <- data.frame(name = sample (c("Nina", "Doug", "Alli", "Steve"), 20, replace = TRUE), 
                lon = sample (1:20), 
                lat = sample (1:10)
                )    
rec2 <- data.frame(name = sample (c("Nina", "Doug", "Alli", "Steve"), 30, replace = TRUE), 
                lon = sample (1:30),
                lat = sample (1:30)
                )

They contain varying names (ship names, same names for both) and longitude (lon) as well as latitude (lat) coordinates.

I am attempting to compare the two data frames to see how many values in both "lon" AND "lat" match for each vessel (i.e. how often the two receivers picked up the same locations).

Basically I am trying to find out how good each receiver is and how many of the data points overlap (e.g. as a percentage).

I am not sure how this is best done and am open to any suggestions. Thanks a lot!


asked May 01 '15 00:05

Kristina

3 Answers

Here is a modified and reproducible test case together with my answer. I designed the test set to include combinations that will match and some that will not match.

rec1 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 5), 
                lon = rep.int(c(1:5), 4), 
                lat = rep.int(c(11:15), 4)
                )    
rec2 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 7), 
                lon = rep.int(c(2, 3, 4, 4, 5, 5, 6), 4),
                lat = rep.int(c(12, 13, 14, 14, 15, 15, 16), 4)
                )

print(rec1)
print(rec2)

#Merge the two data frames together, keeping only those combinations that match
m <- merge(rec1, rec2, by = c("shipName", "lon", "lat"), all = FALSE)

print(m)

If you want to count how many times each combination appears, try the following. There are different ways to aggregate; below is my preferred method, which requires the data.table package. It's a great tool, so you may want to install it if you haven't yet.

library(data.table)

#Convert to a data table and optionally set the sort key for faster processing
m <- data.table(m)
setkey(m, shipName, lon, lat)

#Aggregate and create a new column called "Count" with the number of
#observations in each group (.N)
m <- m[, j = list("Count" = .N), by = list(shipName, lon, lat)]

print(m)

#If you want to return to a standard data frame rather than a data table:
m <- data.frame(m)
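
For readers who would rather not install a package, the same per-combination count can be sketched in base R with aggregate. This is only one possible alternative, not the method used above; the helper column Count and the result name counts are chosen here for illustration.

```r
# Same test data as in the answer above
rec1 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 5),
                   lon = rep.int(1:5, 4),
                   lat = rep.int(11:15, 4))
rec2 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 7),
                   lon = rep.int(c(2, 3, 4, 4, 5, 5, 6), 4),
                   lat = rep.int(c(12, 13, 14, 14, 15, 15, 16), 4))

# Keep only the rows whose ship, lon and lat appear in both data frames
m <- merge(rec1, rec2, by = c("shipName", "lon", "lat"), all = FALSE)

# Give every matched row a weight of 1, then sum the weights per combination
m$Count <- 1L
counts <- aggregate(Count ~ shipName + lon + lat, data = m, FUN = sum)

print(counts)
```

Positions that rec2 logged twice (lon 4 and lon 5 in this test set) show up with a Count of 2, because merge returns one row per matching pair.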

answered Oct 23 '22 06:10

Kevin M

You didn't construct a very useful test case, but here is an approach:

> both <- rbind(data.frame(grp="A", rec1[, 2:3]), data.frame(grp="B", rec2[, 2:3]))
> with(both, table( duplicated(both[,2:3]), grp))
       grp
         A  B
  FALSE 20 30
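
To see what this approach reports when positions really do overlap, here is a small sketch on made-up data (the values below are invented for illustration and are not from the question). It also includes the name column in the duplicate check, which the call above leaves out, so that a position only counts as shared when the same ship reported it.

```r
# Hypothetical receiver data with a few deliberate overlaps
rec1 <- data.frame(name = c("Nina", "Nina", "Doug"),
                   lon = c(1, 2, 3),
                   lat = c(11, 12, 13))
rec2 <- data.frame(name = c("Nina", "Doug", "Doug", "Alli"),
                   lon = c(2, 3, 9, 1),
                   lat = c(12, 13, 19, 11))

# Stack both receivers and label each row with its source
both <- rbind(data.frame(grp = "A", rec1), data.frame(grp = "B", rec2))

# duplicated() flags a row when an identical name/lon/lat row appeared earlier
dup <- duplicated(both[, c("name", "lon", "lat")])
tab <- table(dup, grp = both$grp)

print(tab)
```

Because the rows from receiver A come first, the TRUE counts land in column B. Note that duplicated() also flags repeats within a single receiver, so the TRUE row only equals the overlap when each receiver's rows are unique.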

answered Oct 23 '22 04:10

IRTFM

The simplest way to make this comparison in base R is with merge.

Try this:

# Set the RNG so sample() produces the same output and this example is reproducible
set.seed(720) 

rec1 <- data.frame(name = sample (c("Nina", "Doug", "Alli", "Steve"), 20, replace = TRUE), 
            lon = sample (1:20), 
            lat = sample (1:10)
            )    
rec2 <- data.frame(name = sample (c("Nina", "Doug", "Alli", "Steve"), 30, replace = TRUE), 
            lon = sample (1:30),
            lat = sample (1:30)
            )

merged <- merge(x = rec1,
                y = rec2,
                by = c("name","lat","lon"))

print(merged)

The merged data frame will contain all of the cases where all three columns match (in this case, one). You could then do something like table(merged$name) to count the number of times each name appears in the merged data.
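
Since the question asks for a percentage, one way to turn those counts into a share of each ship's positions is sketched below. The data are invented so the result is easy to check by hand, and the names ships, matched, n_rec1 and pct_of_rec1 are illustrative. It assumes that "overlap" means the fraction of a ship's distinct rec1 positions that rec2 also recorded.

```r
# Hypothetical data: small enough to verify the percentages by eye
rec1 <- data.frame(name = c("Nina", "Nina", "Doug", "Doug", "Alli"),
                   lon  = c(1, 2, 3, 4, 5),
                   lat  = c(11, 12, 13, 14, 15))
rec2 <- data.frame(name = c("Nina", "Doug", "Doug", "Alli", "Steve"),
                   lon  = c(2, 3, 4, 9, 7),
                   lat  = c(12, 13, 14, 19, 17))

# unique() guards against double counting when a receiver logs a position twice
merged <- merge(unique(rec1), unique(rec2), by = c("name", "lon", "lat"))

# Use one shared set of ship names so ships without matches still get a 0
ships   <- sort(union(rec1$name, rec2$name))
matched <- table(factor(merged$name, levels = ships))
n_rec1  <- table(factor(unique(rec1)$name, levels = ships))

# Percentage of each ship's rec1 positions that rec2 also picked up
pct_of_rec1 <- 100 * as.vector(matched) / as.vector(n_rec1)
names(pct_of_rec1) <- ships

print(pct_of_rec1)
```

A ship that never appears in rec1 (Steve here) comes out as NaN, because there is nothing to divide by. Swapping rec1 and rec2 in the denominator gives the percentage from the other receiver's point of view.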

Though, your question leaves me wondering... there must be some sort of time element here, yes? If you include the measurement time in your data, you could merge by name and time, then calculate the measured lat and lon differences.
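
A minimal sketch of that idea follows. The time column, its values and the result name paired are all assumptions made for illustration, since the question's data contain no timestamps.

```r
# Hypothetical data with a shared time column (not present in the question)
rec1 <- data.frame(name = c("Nina", "Nina", "Doug"),
                   time = c(1, 2, 1),
                   lon  = c(10.0, 10.5, 20.0),
                   lat  = c(50.0, 50.2, 60.0))
rec2 <- data.frame(name = c("Nina", "Nina", "Doug", "Doug"),
                   time = c(1, 2, 1, 2),
                   lon  = c(10.0, 10.75, 20.25, 21.0),
                   lat  = c(50.0, 50.2, 60.0, 61.0))

# Pair up readings of the same ship at the same time; the suffixes keep each
# receiver's coordinates apart in the merged result
paired <- merge(rec1, rec2, by = c("name", "time"),
                suffixes = c(".rec1", ".rec2"))

# Difference between the two receivers for every paired reading
paired$lon_diff <- paired$lon.rec2 - paired$lon.rec1
paired$lat_diff <- paired$lat.rec2 - paired$lat.rec1

print(paired)
```

Readings that only one receiver logged (Doug at time 2 here) drop out of the inner merge; passing all = TRUE to merge would keep them with NA in the other receiver's columns.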

Edit:

I feel I would be remiss if I didn't mention the fabulous dplyr package, which makes analysis like this extremely simple. The above merge and count of unique name values is achieved with this simple one-liner:

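library(dplyr)

# Naming the join columns explicitly avoids the "Joining, by = ..." message
inner_join(rec1, rec2, by = c("name", "lon", "lat")) %>% count(name)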

answered Oct 23 '22 06:10

transcom
