Remove rows in data.table according to another data.table

Question

I have a data.table named dtA:

My actual dtA has 62871932 rows and 3 columns:

  date    company    value
198101          A        1
198101          A        2
198101          B        5
198102          A        2
198102          B        5
198102          B        6

data.table dtB have some columns I want to remove from dtA, so dtB is like the rules:

Actual dtB has 19615280 rows and 3 columns:

  date    company    value
198101          A        2
198102          B        5

End result is:

  date    company    value
198101          A        1
198101          B        5
198102          A        2
198102          B        6

It is not as simple as:

dtA=dtA[!(dtB$company %in% dtA$company)]

because it also depends on date and value.

I tried to merge two tables together and chain it with not in statement:

dtA=dtA[dtB, on=date][!(company %in% comapny) & !(value %in% value)]

I get this message:

Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join.

Any ideas?

ruaridhw · Accepted Answer

Use an anti-join:

dtA[!dtB, on=.(date, company, value)]

This matches all records in dtA that are not found in dtB using the columns in on.

Gabriel · Answer

I think I know how to solve this:

in dtB I add a pointer using data.table syntax:

dtB[, pointer := 1]

dtB will looks like this

  date    company    value    pointer
198101          A        2          1
198102          B        5          1

Then I use LEFT OUTER JOIN method from here: https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html

setkey(dtA, date, company, value)
setkey(dtB, date, company, value)
dtA=merge(dtA, dtB, all.x)

This means on pointer column, if dtB's row exist in dtA, it will give 1. If dtB's row do not exist in dtA's, then it will be given NA

Result will be:

  date    company    value    pointer
198101          A        1         NA
198101          A        2          1
198101          B        5         NA
198102          A        2         NA
198102          B        5          1
198102          B        6         NA

I then select those rows with NA and remove pointer column:

dtA=dtA[!(pointer %in% "1")][,-c("pointer")]

I get my result:

  date    company    value
198101          A        1
198101          B        5
198102          A        2
198102          B        6

Remove rows in data.table according to another data.table

Tags:

r

data.table

Gabriel

2 Answers

ruaridhw

Gabriel

Recent Activity

Donate For Us

Remove rows in data.table according to another data.table

Tags:

r

data.table

Gabriel

2 Answers

ruaridhw

Gabriel

Related questions

Recent Activity

Donate For Us