
Return df with a column's values that occur more than once [duplicate]

I have a data frame df, and I am trying to subset all rows whose value in column B occurs more than once in the dataset.

I tried using table to do it, but am having trouble subsetting from the table:

t<-table(df$B)

Then I try subsetting it using:

subset(df, table(df$B)>1)

And I get the error

"Error in x[subset & !is.na(subset)] : object of type 'closure' is not subsettable"

How can I subset my data frame using table counts?

Chris Robles asked Jul 01 '14 05:07



2 Answers

Here is a dplyr solution (using mrFlick's data.frame)

library(dplyr)
newd <- dd %>% group_by(b) %>% filter(n() > 1)
newd
#    a b 
# 1  1 1 
# 2  2 1 
# 3  5 4 
# 4  6 4 
# 5  7 4 
# 6  9 6 
# 7 10 6 
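The data frame dd comes from another answer and is not defined here. As a minimal sketch, the hypothetical dd below is constructed only to be consistent with the output shown above (its three non-repeated b values, 2, 3 and 5, are assumptions), so the pipeline can be run end to end:

```r
library(dplyr)

# Hypothetical stand-in for dd: rows 3, 4 and 8 carry b values that occur only once.
dd <- data.frame(a = 1:10, b = c(1, 1, 2, 3, 4, 4, 4, 5, 6, 6))

newd <- dd %>%
  group_by(b) %>%      # one group per distinct value of b
  filter(n() > 1) %>%  # keep only groups with more than one row
  ungroup()            # drop the grouping so later steps see a plain tibble

newd$a
```

With this assumed input, the rows kept should be those with a equal to 1, 2, 5, 6, 7, 9 and 10. The ungroup() call is an optional addition; without it the result stays grouped by b.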

Or, using data.table

library(data.table)
setDT(dd)[, if (.N > 1) .SD, by = b]

Or using base R

dd[dd$b %in% unique(dd$b[duplicated(dd$b)]),]
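The table() idea from the question can also be made to work. table(dd$b) > 1 has one element per distinct value of b, not one per row, so it cannot serve as a row filter directly; the counts have to be mapped back to rows first. (The 'closure' wording in the question's error typically means that df resolved to a function, such as stats::df, rather than to a data frame.) A sketch on a hypothetical data frame, where the data and the names counts, repeated, by_table and by_dup are assumptions for illustration:

```r
# Hypothetical data: b values 2, 3 and 5 occur once, the rest repeat.
dd <- data.frame(a = 1:10, b = c(1, 1, 2, 3, 4, 4, 4, 5, 6, 6))

counts <- table(dd$b)                  # one count per distinct value of b
repeated <- names(counts)[counts > 1]  # values seen more than once, as character strings

# %in% compares after coercing b to character, which is fine for integers
# and strings; treat non-integer doubles with care.
by_table <- dd[dd$b %in% repeated, ]

# Same rows as the duplicated() one-liner above.
by_dup <- dd[dd$b %in% unique(dd$b[duplicated(dd$b)]), ]
identical(by_table, by_dup)
```

The duplicated() version avoids the round trip through character names, which is one reason to prefer it when the counts themselves are not needed.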
mnel answered Oct 31 '22 03:10


May I suggest an alternative, faster way to do this with data.table?

require(data.table) ## 1.9.2
setDT(df)[, .N, by=B][N > 1L]$B

(or) you can combine .I (another special variable, see ?data.table), which gives the corresponding row numbers in df, with .N as follows:

setDT(df)[df[, .I[.N > 1L], by=B]$V1]

(or) have a look at @mnel's answer for another variation (using yet another special variable .SD).
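To make the difference between the first two snippets concrete: the .N version returns the repeated values of B, while the .I version returns the matching rows. A sketch on a hypothetical table, where the data and the names repeated_values, idx and repeated_rows are assumptions:

```r
library(data.table)

# Hypothetical data: B values 2, 3 and 5 occur once, the rest repeat.
df <- data.table(A = 1:10, B = c(1, 1, 2, 3, 4, 4, 4, 5, 6, 6))

# Count rows per value of B, keep counts above 1, and pull out the values.
repeated_values <- df[, .N, by = B][N > 1L]$B

# .I holds each group's row numbers in df; .I[.N > 1L] keeps them only for
# groups with more than one row. The result column is auto-named V1.
idx <- df[, .I[.N > 1L], by = B]$V1
repeated_rows <- df[idx]
```

Rows come back grouped by B in order of first appearance, which can differ from the original row order when equal values are not adjacent; df[sort(idx)] would restore the original order.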

Mike.Gahan answered Oct 31 '22 02:10