This is a question about how to compare several columns of two different data frames with varying length.
I have two data frames (data from receiver1 (rec1) and receiver2 (rec2)) of different lengths containing positions for 4 different ships:
rec1 <- data.frame(name = sample (c("Nina", "Doug", "Alli", "Steve"), 20, replace = TRUE),
lon = sample (1:20),
lat = sample (1:10)
)
rec2 <- data.frame(name = sample (c("Nina", "Doug", "Alli", "Steve"), 30, replace = TRUE),
lon = sample (1:30),
lat = sample (1:30)
)
They contain varying names (ship names, same names for both) and longitude (lon) as well as latitude (lat) coordinates.
I am attempting to compare the two data frames to see how many values in both "lon" AND "lat" match for each vessel (i.e. how often the two receivers picked up the same locations).
Basically I am trying to find out how good each receiver is and how many of the data points overlap (e.g. as a percentage).
I am not sure how this is best done and am open to any suggestions. Thanks a lot!
Use the full_join function to merge two R data frames with different numbers of rows. full_join is part of the dplyr package and keeps every row from both data frames, filling unmatched columns with NA; use inner_join instead to keep only the rows that match.
We can compare two columns in R by using ifelse(). It tests a condition element by element and returns one value where the condition is TRUE and another where it is FALSE.
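As a minimal sketch of the ifelse() idea, assume the two receivers' positions are already aligned row by row in one data frame. The column names (lon_rec1, lat_rec1, lon_rec2, lat_rec2) and the values are made up for illustration and are not from the question.

```r
# Hypothetical, already-aligned positions from two receivers (illustrative values)
positions <- data.frame(
  lon_rec1 = c(1, 2, 3), lat_rec1 = c(11, 12, 13),
  lon_rec2 = c(1, 2, 4), lat_rec2 = c(11, 15, 13)
)

# A row counts as a match only when BOTH lon and lat agree
positions$status <- ifelse(
  positions$lon_rec1 == positions$lon_rec2 &
    positions$lat_rec1 == positions$lat_rec2,
  "match", "no match"
)

print(positions)
```

This only works when the rows of both receivers line up one to one. The data frames in the question have different lengths, so a merge on the key columns is the better fit there.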
We can also use a dedicated comparison package. The function comparedf() from the arsenal package takes two data frames, compares them, and gives a summary of how far they differ. The compare package offers a similar function, compare().
Here is a modified and reproducible test case together with my answer. I designed the test set to include combinations that will match and some that will not match.
rec1 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 5),
lon = rep.int(c(1:5), 4),
lat = rep.int(c(11:15), 4)
)
rec2 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 7),
lon = rep.int(c(2, 3, 4, 4, 5, 5, 6), 4),
lat = rep.int(c(12, 13, 14, 14, 15, 15, 16), 4)
)
print(rec1)
print(rec2)
#Merge the two data frames together, keeping only those combinations that match
m <- merge(rec1, rec2, by = c("shipName", "lon", "lat"), all = FALSE)
print(m)
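To turn the matches into the percentage the question asks about, one possible sketch in base R follows. It is an assumption about what "overlap" should mean: each distinct shipName/lon/lat combination is counted once per receiver, so repeated fixes in rec2 do not inflate the result. The names keys, ships, count_by_ship and overlap are introduced here for illustration.

```r
# Same test data as in the answer above
rec1 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 5),
                   lon = rep.int(1:5, 4),
                   lat = rep.int(11:15, 4))
rec2 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 7),
                   lon = rep.int(c(2, 3, 4, 4, 5, 5, 6), 4),
                   lat = rep.int(c(12, 13, 14, 14, 15, 15, 16), 4))

keys <- c("shipName", "lon", "lat")

# Distinct positions per receiver, then the positions both receivers saw
u1 <- unique(rec1[, keys])
u2 <- unique(rec2[, keys])
common <- merge(u1, u2, by = keys)

# Count rows per ship, using one shared set of ship names so the vectors align
ships <- sort(unique(c(as.character(u1$shipName), as.character(u2$shipName))))
count_by_ship <- function(df) {
  as.vector(table(factor(as.character(df$shipName), levels = ships)))
}

overlap <- data.frame(
  shipName    = ships,
  pct_of_rec1 = 100 * count_by_ship(common) / count_by_ship(u1),
  pct_of_rec2 = 100 * count_by_ship(common) / count_by_ship(u2)
)
print(overlap)
```

pct_of_rec1 is the share of rec1's distinct positions that rec2 also recorded, and pct_of_rec2 is the reverse. If repeated fixes should count separately, drop the unique() calls and decide how duplicates ought to be paired first.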
If you want to count how many times each combination appears, try the following. (There are different ways to aggregate. Below is my preferred method, which requires you to have data.table installed. It's a great tool, so you may want to install it if you haven't yet.)
library(data.table)
#Convert to a data table and optionally set the sort key for faster processing
m <- data.table(m)
setkey(m, shipName, lon, lat)
#Aggregate and create a new column called "Count" with the number of
#observations in each group (.N)
m <- m[, j = list("Count" = .N), by = list(shipName, lon, lat)]
print(m)
#If you want to return to a standard data frame rather than a data table:
m <- data.frame(m)
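If data.table is not available, the same count can be sketched in base R with aggregate(). This is an alternative technique, not the method used above; the helper column Count and the name counts are introduced here for illustration.

```r
# Same test data and merge as in the answer above
rec1 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 5),
                   lon = rep.int(1:5, 4),
                   lat = rep.int(11:15, 4))
rec2 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 7),
                   lon = rep.int(c(2, 3, 4, 4, 5, 5, 6), 4),
                   lat = rep.int(c(12, 13, 14, 14, 15, 15, 16), 4))
m <- merge(rec1, rec2, by = c("shipName", "lon", "lat"), all = FALSE)

# Give every matched row a count of 1, then sum within each combination
counts <- aggregate(Count ~ shipName + lon + lat,
                    data = transform(m, Count = 1L),
                    FUN = sum)
print(counts)
```

Combinations that rec2 recorded twice, such as lon 4 / lat 14, get a Count of 2 because merge() returns one row per matching pair.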
You didn't construct a very useful test case, but here is an approach:
> both <- rbind(data.frame(grp="A", rec1[, 2:3]), data.frame(grp="B", rec2[, 2:3]))
> with(both, table( duplicated(both[,2:3]), grp))
       grp
         A  B
  FALSE 20 30
The simplest way to make this comparison in base R is with merge.
Try this:
# Set the RNG so sample() produces the same output and this example is reproducible
set.seed(720)
rec1 <- data.frame(name = sample (c("Nina", "Doug", "Alli", "Steve"), 20, replace = TRUE),
lon = sample (1:20),
lat = sample (1:10)
)
rec2 <- data.frame(name = sample (c("Nina", "Doug", "Alli", "Steve"), 30, replace = TRUE),
lon = sample (1:30),
lat = sample (1:30)
)
merged <- merge(x = rec1,
y = rec2,
by = c("name","lat","lon"))
print(merged)
The merged data frame will contain all of the cases where all three columns match (in this case, one). You could then do something like table(merged$name) to count the number of times each name appears in the merged data.
Though, your question leaves me wondering... there must be some sort of time element here, yes? If you include the measurement time in your data, you could merge by name and time, then calculate the measured lat and lon differences.
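A rough sketch of that time-based idea, assuming each receiver also logs a time column. That column, the sample values, and the names paired, lon_diff and lat_diff are hypothetical and not part of the question's data.

```r
# Hypothetical data with a measurement time per fix (illustrative values)
rec1 <- data.frame(name = c("Nina", "Nina", "Doug"),
                   time = c(1, 2, 1),
                   lon  = c(10, 11, 20),
                   lat  = c(50, 51, 60))
rec2 <- data.frame(name = c("Nina", "Nina", "Doug"),
                   time = c(1, 2, 2),
                   lon  = c(10, 11.5, 20),
                   lat  = c(50, 51, 60))

# Pair up fixes of the same ship at the same time; suffixes label the source
paired <- merge(rec1, rec2, by = c("name", "time"),
                suffixes = c(".rec1", ".rec2"))

# Difference between the two receivers for each paired fix
paired$lon_diff <- paired$lon.rec1 - paired$lon.rec2
paired$lat_diff <- paired$lat.rec1 - paired$lat.rec2
print(paired)
```

Fixes with no counterpart at the same time (Doug in this made-up data) drop out of the merge. Real timestamps rarely match exactly, so some rounding or tolerance on time would likely be needed first.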
Edit:
I feel I would be remiss if I didn't mention the fabulous dplyr package, which makes analysis like this extremely simple. Once dplyr is loaded, the above merge and count of unique name values is achieved with this simple one-liner:
library(dplyr)
inner_join(rec1, rec2, by = c("name", "lon", "lat")) %>% count(name)