Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove exact rows and frequency of rows of a data.frame that are in another data.frame in r

Tags:

merge

dataframe

r

Consider the following two data.frames:

a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)])

I would like to remove the exact rows of a1 that are in a2 so that the result should be:

A  B
4  d
5  e
4  d
2  b

Note that one row with 2 b in a1 is retained in the final result. Currently, I use a looping statement, which becomes extremely slow as I have many variables and thousands of rows in my data.frames. Is there any built-in function to get this result?

like image 637
RBL Avatar asked Oct 10 '17 01:10

RBL


2 Answers

The idea is, add a counter for duplicates to each file, so you can get a unique match for each occurrence of a row. Data table is nice because it is easy to count the duplicates (with .N), and it also gives the necessary function (fsetdiff) for set operations.

library(data.table)

a1 <- data.table(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.table(A = c(1:3,2), B = letters[c(1:3,2)])

# add counter for duplicates
a1[, i := 1:.N, .(A,B)]
a2[, i := 1:.N, .(A,B)]

# setdiff gets the exception
# "all = T" allows duplicate rows to be returned
fsetdiff(a1, a2, all = T)

#    A B i
# 1: 4 d 1
# 2: 5 e 1
# 3: 4 d 2
# 4: 2 b 3
like image 113
DWal Avatar answered Oct 01 '22 07:10

DWal


You could use dplyr to do this. I set stringsAsFactors = FALSE to get rid of warnings about factor mismatches.

library(dplyr)

a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)], stringsAsFactors = FALSE)
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)], stringsAsFactors = FALSE)

## Make temp variables to join on then delete later.
# Create a row number
a1_tmp <- 
    a1 %>%
    group_by(A, B) %>%
    mutate(tmp_id = row_number()) %>%
    ungroup()
# Create a count
a2_tmp <-
    a2 %>%
     group_by(A, B) %>%
     summarise(count = n()) %>%
     ungroup()

## Keep all that have no entry int a2 or the id > the count (i.e. used up a2 entries).
left_join(a1_tmp, a2_tmp, by = c('A', 'B')) %>%
    ungroup() %>% filter(is.na(count) | tmp_id > count) %>%
    select(-tmp_id, -count)

## # A tibble: 4 x 2
##       A     B
##   <dbl> <chr>
## 1     4     d
## 2     5     e
## 3     4     d
## 4     2     b

EDIT

Here is a similar solution that is a little shorter. This does the following: (1) add a column for row number to join both data.frame items (2) a temporary column in a2 (2nd data.frame) that will show up as null in the join to a1 (i.e. indicates it's unique to a1).

library(dplyr)

left_join(a1 %>% group_by(A,B) %>% mutate(rn = row_number())             %>% ungroup(),
          a2 %>% group_by(A,B) %>% mutate(rn = row_number(), tmpcol = 0) %>% ungroup(),
          by = c('A', 'B', 'rn')) %>%
filter(is.na(tmpcol)) %>%
select(-tmpcol, -rn)

## # A tibble: 4 x 2
##       A     B
##   <dbl> <chr>
## 1     4     d
## 2     5     e
## 3     4     d
## 4     2     b

I think this solution is a little simpler (perhaps very little) than the first.

like image 42
steveb Avatar answered Oct 01 '22 07:10

steveb