R - Compare two data frames of different length for same values in two columns

Tags: r, compare

This is a question about how to compare several columns of two different data frames with varying length.

I have two data frames (data from receiver1 (rec1) and receiver2 (rec2)) of different lengths containing positions for 4 different ships:

rec1 <- data.frame(name = sample(c("Nina", "Doug", "Alli", "Steve"), 20, replace = TRUE),
                   lon = sample(1:20),
                   lat = sample(1:10))
rec2 <- data.frame(name = sample(c("Nina", "Doug", "Alli", "Steve"), 30, replace = TRUE),
                   lon = sample(1:30),
                   lat = sample(1:30))

Both contain ship names (the same four ships in each) along with longitude (lon) and latitude (lat) coordinates.

I am trying to compare the two data frames to see how many rows match in both "lon" AND "lat" for each vessel (i.e. how often the two receivers picked up the same location).

Basically, I want to find out how good each receiver is and how many of the data points overlap (e.g. as a percentage).

I am not sure how this is best done and am open to any suggestions. Thanks a lot!

Kristina asked May 01 '15 00:05



3 Answers

Here is a modified and reproducible test case together with my answer. I designed the test set to include combinations that will match and some that will not match.

rec1 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 5), 
                lon = rep.int(c(1:5), 4), 
                lat = rep.int(c(11:15), 4)
                )    
rec2 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 7), 
                lon = rep.int(c(2, 3, 4, 4, 5, 5, 6), 4),
                lat = rep.int(c(12, 13, 14, 14, 15, 15, 16), 4)
                )

print(rec1)
print(rec2)

#Merge the two data frames together, keeping only those combinations that match
m <- merge(rec1, rec2, by = c("shipName", "lon", "lat"), all = FALSE)

print(m)

If you want to count how many times each combination appears, try the following. There are several ways to aggregate; below is my preferred method, which requires the data.table package. It's a great tool, so you may want to install it if you haven't yet.

library(data.table)

#Convert to a data table and optionally set the sort key for faster processing
m <- data.table(m)
setkey(m, shipName, lon, lat)

#Aggregate and create a new column called "Count" with the number of
    #observations in each group (.N)
m <- m[, j = list("Count" = .N), by = list(shipName, lon, lat)]

print(m)

#If you want to return to a standard data frame rather than a data table:
m <- data.frame(m)
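
If installing data.table is not an option, the same counts can be obtained in base R. The following is a minimal sketch, assuming the rec1 and rec2 test frames defined above (rebuilt here so the snippet runs on its own). It uses aggregate() with a helper column of ones, which is only one of several possible approaches.

```r
# Same test data as above
rec1 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 5),
                   lon = rep.int(1:5, 4),
                   lat = rep.int(11:15, 4))
rec2 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 7),
                   lon = rep.int(c(2, 3, 4, 4, 5, 5, 6), 4),
                   lat = rep.int(c(12, 13, 14, 14, 15, 15, 16), 4))

# Keep only the rows whose ship, lon and lat appear in both data frames
m <- merge(rec1, rec2, by = c("shipName", "lon", "lat"))

# Add a helper column of ones, then sum it within each group
m$Count <- 1L
counts <- aggregate(Count ~ shipName + lon + lat, data = m, FUN = sum)

print(counts)
```

Because merge() pairs every matching row of rec1 with every matching row of rec2, a position that rec2 reports twice shows up here with a count of 2.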
Kevin M answered Oct 23 '22 06:10


You didn't construct a very useful test case, but here is an approach:

> both <- rbind(data.frame(grp="A", rec1[, 2:3]), data.frame(grp="B", rec2[, 2:3]))
> with(both, table( duplicated(both[,2:3]), grp))
       grp
         A  B
  FALSE 20 30
IRTFM answered Oct 23 '22 04:10


The simplest way to make this comparison in base R is with merge.

Try this:

# Set the RNG so sample() produces the same output and this example is reproducible
set.seed(720) 

rec1 <- data.frame(name = sample(c("Nina", "Doug", "Alli", "Steve"), 20, replace = TRUE),
                   lon = sample(1:20),
                   lat = sample(1:10))
rec2 <- data.frame(name = sample(c("Nina", "Doug", "Alli", "Steve"), 30, replace = TRUE),
                   lon = sample(1:30),
                   lat = sample(1:30))

merged <- merge(x = rec1,
                y = rec2,
                by = c("name","lat","lon"))

print(merged)

The merged data frame will contain all of the cases where all three columns match (in this case, one). You could then do something like table(merged$name) to count the number of times each name appears in the merged data.
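
To go from those counts to the overlap percentage the question asks about, something along these lines might work. This is only a sketch: the small data frames are hand-made for illustration (not the random data above), and the helper count_by_name and the column names n_rec1, n_rec2, n_shared and pct_of_* are made up here. It also assumes that repeated fixes at the same position should be counted once, which may or may not suit your data.

```r
# Small hand-made example (hypothetical values, not the random data above)
rec1 <- data.frame(name = c("Nina", "Nina", "Doug", "Doug", "Alli"),
                   lon  = c(1, 2, 3, 4, 5),
                   lat  = c(11, 12, 13, 14, 15))
rec2 <- data.frame(name = c("Nina", "Nina", "Doug", "Alli", "Alli", "Steve"),
                   lon  = c(1, 9, 3, 5, 5, 7),
                   lat  = c(11, 19, 13, 15, 15, 17))

# Count each distinct (name, lon, lat) position once per receiver
u1 <- unique(rec1)
u2 <- unique(rec2)
shared <- merge(u1, u2, by = c("name", "lat", "lon"))

# Fix the set of names so every count vector has the same order,
# including names with zero rows in one of the data frames
ships <- sort(unique(c(as.character(u1$name), as.character(u2$name))))
count_by_name <- function(df) {
  as.vector(table(factor(as.character(df$name), levels = ships)))
}

overlap <- data.frame(name = ships,
                      n_rec1 = count_by_name(u1),
                      n_rec2 = count_by_name(u2),
                      n_shared = count_by_name(shared))

# Share of each receiver's distinct positions that the other receiver also has
overlap$pct_of_rec1 <- 100 * overlap$n_shared / overlap$n_rec1
overlap$pct_of_rec2 <- 100 * overlap$n_shared / overlap$n_rec2

print(overlap)
```

A ship that one receiver never saw (Steve in rec1 here) gives 0/0, so its percentage for that receiver comes out as NaN rather than 0.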

Though, your question leaves me wondering... there must be some sort of time element here, yes? If you include the measurement time in your data, you could merge by name and time, then calculate the measured lat and lon differences.
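
A rough sketch of that idea follows. The time column and every value in it are invented for illustration, since your data as posted has no timestamps; the names paired, lon_diff and lat_diff are likewise just placeholders.

```r
# Hypothetical data: the `time` column and all values are invented for illustration
rec1 <- data.frame(name = c("Nina", "Nina", "Doug"),
                   time = c("2015-05-01 00:00", "2015-05-01 00:10", "2015-05-01 00:00"),
                   lon  = c(10.0, 10.5, 20.0),
                   lat  = c(50.0, 50.2, 60.0))
rec2 <- data.frame(name = c("Nina", "Nina", "Doug"),
                   time = c("2015-05-01 00:00", "2015-05-01 00:10", "2015-05-01 00:05"),
                   lon  = c(10.0, 10.75, 20.0),
                   lat  = c(50.0, 50.2, 60.0))

# Pair fixes for the same ship at the same time; the non-key columns
# (lon, lat) get a suffix showing which receiver they came from
paired <- merge(rec1, rec2, by = c("name", "time"),
                suffixes = c(".rec1", ".rec2"))

# Difference between the two receivers' measurements for each paired fix
paired$lon_diff <- paired$lon.rec2 - paired$lon.rec1
paired$lat_diff <- paired$lat.rec2 - paired$lat.rec1

print(paired)
```

Only fixes with identical name and time are paired, so Doug drops out here because his timestamps differ between receivers. If your receivers do not log at exactly the same instants, you would probably need to round the times or match to the nearest timestamp first.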

Edit:

I feel I would be remiss if I didn't mention the fabulous dplyr package, which makes analysis like this extremely simple. Once the package is loaded, the above merge and the count of rows per name can be done in a single line:

library(dplyr)

inner_join(rec1, rec2, by = c("name", "lon", "lat")) %>% count(name)
transcom answered Oct 23 '22 06:10