Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pull subset of rows of dataframe based on conditions from other columns

I have a dataframe like the one below:

x <- data.table(Tickers=c("A","A","A","B","B","B","B","D","D","D","D"),
                Type=c("put","call","put","call","call","put","call","put","call","put","call"),
                Strike=c(35,37.5,37.5,10,11,11,12,40,40,42,42),
                Other=sample(20,11))

    Tickers Type Strike Other
 1:       A  put   35.0     6
 2:       A call   37.5     5
 3:       A  put   37.5    13
 4:       B call   10.0    15
 5:       B call   11.0    12
 6:       B  put   11.0     4
 7:       B call   12.0    20
 8:       D  put   40.0     7
 9:       D call   40.0    11
10:       D  put   42.0    10
11:       D call   42.0     1

I am trying to analyze a subset of the data. The subset I would like to take is data where the ticker and strike are the same. But I also only want to grab this data if both a put and a call exists under type. With the data above for example, I would like to return the following result:

x[c(2,3,5,6,8:11),]

   Tickers Type Strike Other
1:       A call   37.5     5
2:       A  put   37.5    13
3:       B call   11.0    12
4:       B  put   11.0     4
5:       D  put   40.0     7
6:       D call   40.0    11
7:       D  put   42.0    10
8:       D call   42.0     1

I'm not sure what the best way to go about doing this. My thought process is that I should create another column vector like

x$id <- paste(x$Tickers,x$Strike,sep="_")

Then use this vector to only pull values where there are multiple ids.

x[x$id %in% x$id[duplicated(x$id)],]

   Tickers Type Strike Other     id
1:       A call   37.5     5 A_37.5
2:       A  put   37.5    13 A_37.5
3:       B call   11.0    12   B_11
4:       B  put   11.0     4   B_11
5:       D  put   40.0     7   D_40
6:       D call   40.0    11   D_40
7:       D  put   42.0    10   D_42
8:       D call   42.0     1   D_42

I'm not sure how efficient this is, as my actual data consists of a lot more rows. Also, this solution does not check for the type condition of there being one put and one call.

also the wording of the title could be a lot better, I apologize

EDIT::: having checked out this post Finding ALL duplicate rows, including "elements with smaller subscripts"

I could also use this solution:

x$id <- paste(x$Tickers,x$Strike,sep="_")
x[duplicated(x$id) | duplicated(x$id,fromLast=T),]
like image 295
road_to_quantdom Avatar asked May 11 '18 14:05

road_to_quantdom


People also ask

How do I select rows from a DataFrame based on multiple column values?

You can select the Rows from Pandas DataFrame based on column values or based on multiple conditions either using DataFrame. loc[] attribute, DataFrame. query() or DataFrame. apply() method to use lambda function.

How to select a subset of columns and rows from a Dataframe?

In this article, we are going to discuss how to select a subset of columns and rows from a DataFrame. We are going to use the nba.csv dataset to perform all operations. Below are various operations by using which we can select a subset for a given dataframe: To select a single column, we can use a square bracket [ ]:

How to select rows in pandas Dataframe based on conditions?

Selecting rows in pandas DataFrame based on conditions Selecting rows based on particular column value using '>', '=', '=', '<=', '!=' operator. Selecting those rows whose column value is present in the list using isin() method of the dataframe. Selecting rows based on multiple column conditions using '&' operator.

How to select a subset of a Dataframe using the indexing operator?

Select a Subset of a Dataframe using the Indexing Operator 1 Selecting Only Columns#N#To select a column using indexing operator use the following line of code.#N#housing... 2 Selecting Rows More ...

What is subsetting a data frame?

Subsetting a data frame is the process of selecting a set of desired rows and columns from the data frame. You can select: all rows and limited columns. all columns and limited rows. limited rows and limited columns. Subsetting a data frame is important as it allows you to access only a certain part of the data frame.


2 Answers

You could try something like:

x[, select := (.N >= 2 & all(c("put", "call") %in% unique(Type))), by = .(Tickers, Strike)][which(select)]

#   Tickers Type Strike Other select
#1:       A call   37.5    17   TRUE
#2:       A  put   37.5    16   TRUE
#3:       B call   11.0    11   TRUE
#4:       B  put   11.0    20   TRUE
#5:       D  put   40.0     1   TRUE
#6:       D call   40.0    12   TRUE
#7:       D  put   42.0     6   TRUE
#8:       D call   42.0     2   TRUE

Another idea might be a merge:

x[x, on = .(Tickers, Strike), select := (length(Type) >= 2 & all(c("put", "call") %in% Type)),by = .EACHI][which(select)]

I'm not entirely sure how to get around the group-by operations since you want to make sure for each group they have both "call" and "put". I was thinking about using keys, but haven't been able to incorporate the "call"/"put" aspect.

like image 67
Mike H. Avatar answered Oct 18 '22 13:10

Mike H.


An edit to your data to give a case where both put and call does not exist (I changed the very last "call" to "put"):

x <- data.table(Tickers=c("A","A","A","B","B","B","B","D","D","D","D"),
            Type=c("put","call","put","call","call","put","call","put","call","put","put"),
            Strike=c(35,37.5,37.5,10,11,11,12,40,40,42,42),
            Other=sample(20,11))

Since you are using data.table, you can use the built in counter .N along with by variables to count groups and subset with that. If by counting Type you can reliably determine there is both put and call, this could work:

x[, `:=`(n = .N, types = uniqueN(Type)), by = c('Tickers', 'Strike')][n > 1 & types == 2]

The part enclosed in the first set of [] does the counting, and then the [n > 1 & types == 2] does the subsetting.

like image 29
Brian Stamper Avatar answered Oct 18 '22 15:10

Brian Stamper