I have a data.table
named dtA
:
My actual dtA
has 62871932 rows and 3 columns:
date company value
198101 A 1
198101 A 2
198101 B 5
198102 A 2
198102 B 5
198102 B 6
data.table
dtB
have some columns I want to remove from dtA
, so dtB
is like the rules:
Actual dtB
has 19615280 rows and 3 columns:
date company value
198101 A 2
198102 B 5
End result is:
date company value
198101 A 1
198101 B 5
198102 A 2
198102 B 6
It is not as simple as:
dtA=dtA[!(dtB$company %in% dtA$company)]
because it also depends on date and value.
I tried to merge two tables together and chain it with not in statement:
dtA=dtA[dtB, on=date][!(company %in% comapny) & !(value %in% value)]
I get this message:
Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join.
Any ideas?
Use an anti-join:
dtA[!dtB, on=.(date, company, value)]
This matches all records in dtA
that are not found in dtB
using the columns in on
.
I think I know how to solve this:
in dtB I add a pointer using data.table syntax:
dtB[, pointer := 1]
dtB will looks like this
date company value pointer
198101 A 2 1
198102 B 5 1
Then I use LEFT OUTER JOIN method from here: https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
setkey(dtA, date, company, value)
setkey(dtB, date, company, value)
dtA=merge(dtA, dtB, all.x)
This means on pointer column, if dtB's row exist in dtA, it will give 1. If dtB's row do not exist in dtA's, then it will be given NA
Result will be:
date company value pointer
198101 A 1 NA
198101 A 2 1
198101 B 5 NA
198102 A 2 NA
198102 B 5 1
198102 B 6 NA
I then select those rows with NA and remove pointer column:
dtA=dtA[!(pointer %in% "1")][,-c("pointer")]
I get my result:
date company value
198101 A 1
198101 B 5
198102 A 2
198102 B 6
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With