Pull subset of rows of dataframe based on conditions from other columns

Q: What is subsetting a data frame?

Subsetting a data frame is the process of selecting a set of desired rows and columns from the data frame. You can select: all rows and limited columns. all columns and limited rows. limited rows and limited columns. Subsetting a data frame is important as it allows you to access only a certain part of the data frame.

Tags:

dataframe

r

datatable

subset

I have a dataframe like the one below:

x <- data.table(Tickers=c("A","A","A","B","B","B","B","D","D","D","D"),
                Type=c("put","call","put","call","call","put","call","put","call","put","call"),
                Strike=c(35,37.5,37.5,10,11,11,12,40,40,42,42),
                Other=sample(20,11))

    Tickers Type Strike Other
 1:       A  put   35.0     6
 2:       A call   37.5     5
 3:       A  put   37.5    13
 4:       B call   10.0    15
 5:       B call   11.0    12
 6:       B  put   11.0     4
 7:       B call   12.0    20
 8:       D  put   40.0     7
 9:       D call   40.0    11
10:       D  put   42.0    10
11:       D call   42.0     1

I am trying to analyze a subset of the data. The subset I would like to take is data where the ticker and strike are the same. But I also only want to grab this data if both a put and a call exists under type. With the data above for example, I would like to return the following result:

x[c(2,3,5,6,8:11),]

   Tickers Type Strike Other
1:       A call   37.5     5
2:       A  put   37.5    13
3:       B call   11.0    12
4:       B  put   11.0     4
5:       D  put   40.0     7
6:       D call   40.0    11
7:       D  put   42.0    10
8:       D call   42.0     1

I'm not sure what the best way to go about doing this. My thought process is that I should create another column vector like

x$id <- paste(x$Tickers,x$Strike,sep="_")

Then use this vector to only pull values where there are multiple ids.

x[x$id %in% x$id[duplicated(x$id)],]

   Tickers Type Strike Other     id
1:       A call   37.5     5 A_37.5
2:       A  put   37.5    13 A_37.5
3:       B call   11.0    12   B_11
4:       B  put   11.0     4   B_11
5:       D  put   40.0     7   D_40
6:       D call   40.0    11   D_40
7:       D  put   42.0    10   D_42
8:       D call   42.0     1   D_42

I'm not sure how efficient this is, as my actual data consists of a lot more rows. Also, this solution does not check for the type condition of there being one put and one call.

also the wording of the title could be a lot better, I apologize

EDIT::: having checked out this post Finding ALL duplicate rows, including "elements with smaller subscripts"

I could also use this solution:

x$id <- paste(x$Tickers,x$Strike,sep="_")
x[duplicated(x$id) | duplicated(x$id,fromLast=T),]

295

asked May 11 '18 14:05

road_to_quantdom

2 Answers

You could try something like:

x[, select := (.N >= 2 & all(c("put", "call") %in% unique(Type))), by = .(Tickers, Strike)][which(select)]

#   Tickers Type Strike Other select
#1:       A call   37.5    17   TRUE
#2:       A  put   37.5    16   TRUE
#3:       B call   11.0    11   TRUE
#4:       B  put   11.0    20   TRUE
#5:       D  put   40.0     1   TRUE
#6:       D call   40.0    12   TRUE
#7:       D  put   42.0     6   TRUE
#8:       D call   42.0     2   TRUE

Another idea might be a merge:

x[x, on = .(Tickers, Strike), select := (length(Type) >= 2 & all(c("put", "call") %in% Type)),by = .EACHI][which(select)]

I'm not entirely sure how to get around the group-by operations since you want to make sure for each group they have both "call" and "put". I was thinking about using keys, but haven't been able to incorporate the "call"/"put" aspect.

answered Oct 18 '22 13:10

Mike H.

An edit to your data to give a case where both put and call does not exist (I changed the very last "call" to "put"):

x <- data.table(Tickers=c("A","A","A","B","B","B","B","D","D","D","D"),
            Type=c("put","call","put","call","call","put","call","put","call","put","put"),
            Strike=c(35,37.5,37.5,10,11,11,12,40,40,42,42),
            Other=sample(20,11))

Since you are using data.table, you can use the built in counter .N along with by variables to count groups and subset with that. If by counting Type you can reliably determine there is both put and call, this could work:

x[, `:=`(n = .N, types = uniqueN(Type)), by = c('Tickers', 'Strike')][n > 1 & types == 2]

The part enclosed in the first set of [] does the counting, and then the [n > 1 & types == 2] does the subsetting.

answered Oct 18 '22 15:10

Brian Stamper

Related questions
                            
                                Add Cook's distance levels to ggplot2
                            
                                Keep formatting when exporting table with DT (DataTables buttons extension)
                            
                                ggplot2: change break points of discrete scale to be between two break points
                            
                                Select raster in ggplot near coastline
                            
                                R - arules apriori Error in length(obj) : Method length not implemented for class rules
                            
                                How to pass by argument to dplyr join function within a function?
                            
                                Do the values returned by rgeos::gCentroid() and sf::st_centroid() differ?
                            
                                iterate through a list of data frames and mutate columns
                            
                                Fill NA values with sequence by group
                            
                                Rolling conditional count in R
                            
                                Attach a knitted tempfile to email in R shiny
                            
                                Get attribute of element in rendered UI with JavaScript in Shiny
                            
                                Replace letters in a template string in R
                            
                                Plotly, add border around points created with add_markers
                            
                                Open hyperlink on click on an ggplot/plotly chart
                            
                                Catch warnings from functions in R and still get their return-value?
                            
                                In dplyr, is it possible to specify where to add a new column using mutate?
                            
                                Skipping specific rows and columns in R
                            
                                Adding labels to Cluster
                            
                                Print first few lines of a text file in R

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With