How to subset dataframe based on a "not equal to" criteria applied to a large number of columns?

I'm new to R and am trying to subset my data according to my predefined exclusion criteria for analysis. Specifically, I want to remove all cases that have dementia, as coded by the ICD-10. The problem is that each individual's disease status is spread across many variables (~70). Because they are all coded in the same way, the same condition can be applied to every one of them.

Some simulated data:

#Create dataframe containing simulated data
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005,1006,1007,1008,1009,1010,1011),
                    disease_code_1 = c('I802','H356','G560','D235','B178','F011','F023','C761','H653','A049','J679'),
                    disease_code_2 = c('A071','NA','G20','NA','NA','A049','NA','NA','G300','G308','A045'),
                    disease_code_3 = c('H250','NA','NA','I802','NA','A481','NA','NA','NA','NA','D352'))

#data is structured as below:

     ID disease_code_1 disease_code_2 disease_code_3
1  1001           I802           A071           H250
2  1002           H356             NA             NA
3  1003           G560            G20             NA
4  1004           D235             NA           I802
5  1005           B178             NA             NA
6  1006           F011           A049           A481
7  1007           F023             NA             NA
8  1008           C761             NA             NA
9  1009           H653           G300             NA
10 1010           A049           G308             NA
11 1011           J679           A045           D352

Here, I'm trying to remove any case that has a 'dementia code' across any of the "disease_code" variables.

#Remove cases with dementia from dataframe (e.g. F023, G20)
Newdata_df <- subset(df, (2:4 != "F023"|"G20"|"F009"|"F002"|"F001"|"F000"|"F00"|    
                    "G309"| "G308"|"G301"|"G300"|"G30"| "F01"|"F018"|"F013"|
                    "F012"| "F011"| "F010"|"F01"))

The error that I receive is:

Error in 2:4 != "F023" | "G20" : 
  operations are possible only for numeric, logical or complex types

Ideally, the subsetted dataframe would look like this:

     ID disease_code_1 disease_code_2 disease_code_3
1  1001           I802           A071           H250
2  1002           H356             NA             NA
4  1004           D235             NA           I802
5  1005           B178             NA             NA
8  1008           C761             NA             NA
11 1011           J679           A045           D352

I know that there is an error in my code, but I'm not sure exactly how to fix it. I've tried a few other approaches (using dplyr) but haven't had any luck so far.

Any help is greatly appreciated!

asked Mar 29 '19 12:03

M_Oxford

2 Answers

We can create a vector of the codes to be removed and use rowSums to drop every row in which any of the disease_code columns matches one of them, i.e.

codes_to_remove <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00", "G309", "G308",
                "G301", "G300", "G30", "F01", "F018", "F013", "F012", "F011", "F010", "F01")

df[rowSums(sapply(df[-1], `%in%`, codes_to_remove)) == 0,]

which gives,

     ID disease_code_1 disease_code_2 disease_code_3
1  1001           I802           A071           H250
2  1002           H356             NA             NA
4  1004           D235             NA           I802
5  1005           B178             NA             NA
8  1008           C761             NA             NA
11 1011           J679           A045           D352
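To make the one-liner easier to follow, here is the same idea split into steps. This is only a sketch on the simulated data from the question; the intermediate names `hits` and `n_hits` are made up for illustration. It also hints at why the original attempt failed: `|` only works on logical (or numeric) values, so chaining strings with `|` raises that error, whereas `%in%` tests each value against a whole vector of codes.

```r
# Simulated data as in the question (the string 'NA' stands in for "no code")
df <- data.frame(
  ID = 1001:1011,
  disease_code_1 = c('I802','H356','G560','D235','B178','F011','F023','C761','H653','A049','J679'),
  disease_code_2 = c('A071','NA','G20','NA','NA','A049','NA','NA','G300','G308','A045'),
  disease_code_3 = c('H250','NA','NA','I802','NA','A481','NA','NA','NA','NA','D352')
)

codes_to_remove <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00",
                     "G309", "G308", "G301", "G300", "G30",
                     "F018", "F013", "F012", "F011", "F010", "F01")

# Step 1: logical matrix, one column per disease_code variable;
# TRUE where the cell holds one of the codes to remove
hits <- sapply(df[-1], `%in%`, codes_to_remove)

# Step 2: count the matching codes in each row
n_hits <- rowSums(hits)

# Step 3: keep only the rows without any match
kept <- df[n_hits == 0, ]
kept
```

Because `df[-1]` simply drops the ID column, the same three steps should carry over unchanged to ~70 disease columns, assuming ID is the only non-disease column in the data.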

answered Oct 20 '22 17:10

Sotos

One dplyr possibility could be:

library(dplyr)

#Keep rows where none of columns 2:4 holds one of the codes
df %>%
 filter_at(vars(2:4), all_vars(! . %in% c("F023","G20","F009","F002","F001","F000","F00",
            "G309", "G308","G301","G300","G30", "F01","F018","F013",
            "F012", "F011", "F010","F01")))

    ID disease_code_1 disease_code_2 disease_code_3
1 1001           I802           A071           H250
2 1002           H356             NA             NA
3 1004           D235             NA           I802
4 1005           B178             NA             NA
5 1008           C761             NA             NA
6 1011           J679           A045           D352

In this case, it keeps only the rows in which none of the columns 2:4 contains any of the given codes.

Or:

df %>%
 filter_at(vars(contains("disease_code")), all_vars(! . %in% c("F023","G20","F009","F002","F001","F000","F00",    
            "G309", "G308","G301","G300","G30", "F01","F018","F013",
            "F012", "F011", "F010","F01")))

In this case, it keeps only the rows in which none of the columns whose names contain disease_code holds any of the given codes.
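In more recent dplyr versions, `filter_at()` and `all_vars()` are marked as superseded. A possible equivalent, assuming dplyr 1.0.4 or later (where `if_any()` is available), could look like the sketch below. The vector name `dementia_codes` is made up for illustration.

```r
library(dplyr)

# Simulated data as in the question (the string 'NA' stands in for "no code")
df <- data.frame(
  ID = 1001:1011,
  disease_code_1 = c('I802','H356','G560','D235','B178','F011','F023','C761','H653','A049','J679'),
  disease_code_2 = c('A071','NA','G20','NA','NA','A049','NA','NA','G300','G308','A045'),
  disease_code_3 = c('H250','NA','NA','I802','NA','A481','NA','NA','NA','NA','D352')
)

dementia_codes <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00",
                    "G309", "G308", "G301", "G300", "G30",
                    "F018", "F013", "F012", "F011", "F010", "F01")

# if_any() is TRUE for a row when at least one selected column matches;
# negating it keeps the rows with no dementia code in any disease_code column
kept <- df %>%
  filter(!if_any(contains("disease_code"), ~ .x %in% dementia_codes))
kept
```

Storing the codes in a vector first also avoids repeating the long list in every call.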

answered Oct 20 '22 16:10

tmfmnk

Tags: dataframe, r, filter, subset
