This image of a left outer join depicts exactly what I would like: removing rows of a data.table based on two columns lat, lon that exactly match the lat, lon columns of another data.table.
Suppose I have the following data.table "dt.master" with over 1 million of rows containing an id and coordinates of a certain location lat, lon:
id lat lon
1 43.23 5.43
2 43.56 4.12
3 52.14 -9.85
4 43.56 4.12
5 43.83 9.43
... ... ...
What I would like to do is remove the rows that match a certain pair of coordinates. You could think of that pair of coordinates as being put on the following blacklist (again a data.table named "dt.blacklist"):
lat lon
43.56 4.12
11.14 -5.85
In this case, when applying the blacklist, the answer would have to be:
id lat lon
1 43.23 5.43
3 52.14 -9.85
5 43.83 9.43
... ... ...
As straightforward as it seems, I cannot get it right.
Using merge, like such:
dt.result <- merge(dt.master, dt.blacklist[, c("lat", "lon")], by.x=c("lat", "lon"), by.y=c("lat", "lon"))
But that yields the rows that match and is thus an inner join. I thought about removing rows based on this result by using subset:
subset(dt.master, lat != dt.result$lat & lon != dt.result$lon)
But the issue is that it partially works as then only 1 row in the above example is removed and not 2 rows as I would like. Somehow it only removes the first "hit".
Using a quick and dirty solution by concatenating lat, lon to a new column named "C" in both data tables and then removing it as such:
dt.master[C != dt.blacklist$C]
Yet, the same issue arises where only 1 of the two rows is removed.
I think you are looking for this:
dt.master[!dt.blacklist, on = .(lat,lon)]
The output:
id lat lon
1: 1 43.23 5.43
2: 3 52.14 -9.85
3: 5 43.83 9.43
Thanks to the warning of the green wise guy, that joinng on floating points could have unintended side-effects. By converting to integers you could prevent that. The join will as a result look a bit more complicated:
dt.master[, (2:3) := lapply(.SD,function(x) as.integer(x*100)), .SDcols = 2:3
][!dt.blacklist[, (1:2) := lapply(.SD,function(x) as.integer(x*100))], on = .(lat,lon)
][, (2:3) := lapply(.SD, `/`, 100), .SDcols = 2:3][]
The output is the same:
id lat lon
1: 1 43.23 5.43
2: 3 52.14 -9.85
3: 5 43.83 9.43
We can use fsetdiff from data.table
fsetdiff(df1[,-1], df2)
or can use anti_join from dplyr
library(dplyr)
anti_join(df1, df2)
# id lat lon
#1 1 43.23 5.43
#2 3 52.14 -9.85
#3 5 43.83 9.43
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With