
How to subset dataframe based on a "not equal to" criteria applied to a large number of columns?

I'm new to R and am trying to subset my data according to predefined exclusion criteria for analysis. Specifically, I want to remove all cases that have dementia, as coded by ICD-10. The problem is that each individual's disease status is spread across multiple variables (~70). Because they are all coded in the same way, the same condition can be applied to every one of them.

Some simulated data:

#Create dataframe containing simulated data
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005,1006,1007,1008,1009,1010,1011),
                    disease_code_1 = c('I802','H356','G560','D235','B178','F011','F023','C761','H653','A049','J679'),
                    disease_code_2 = c('A071','NA','G20','NA','NA','A049','NA','NA','G300','G308','A045'),
                    disease_code_3 = c('H250','NA','NA','I802','NA','A481','NA','NA','NA','NA','D352'))

#data is structured as below:

     ID disease_code_1 disease_code_2 disease_code_3
1  1001           I802           A071           H250
2  1002           H356             NA             NA
3  1003           G560            G20             NA
4  1004           D235             NA           I802
5  1005           B178             NA             NA
6  1006           F011           A049           A481
7  1007           F023             NA             NA
8  1008           C761             NA             NA
9  1009           H653           G300             NA
10 1010           A049           G308             NA
11 1011           J679           A045           D352


Here, I'm trying to remove any case that has a 'dementia code' across any of the "disease_code" variables.

#Remove cases with dementia from dataframe (e.g. F023, G20)
Newdata_df <- subset(df, (2:4 != "F023"|"G20"|"F009"|"F002"|"F001"|"F000"|"F00"|    
                    "G309"| "G308"|"G301"|"G300"|"G30"| "F01"|"F018"|"F013"|
                    "F012"| "F011"| "F010"|"F01"))

The error that I receive is:

Error in 2:4 != "F023" | "G20" : 
  operations are possible only for numeric, logical or complex types

Ideally, the subsetted dataframe would look like this:

     ID disease_code_1 disease_code_2 disease_code_3
1  1001           I802           A071           H250
2  1002           H356             NA             NA
4  1004           D235             NA           I802
5  1005           B178             NA             NA
8  1008           C761             NA             NA
11 1011           J679           A045           D352

I know that there is an error in my code, but I'm not sure exactly how to fix it. I've tried a few other approaches (using dplyr) but haven't had any luck so far.

Any help is greatly appreciated!

M_Oxford asked Mar 29 '19 12:03



2 Answers

We can create a vector of the codes to exclude, then use rowSums to keep only the rows in which none of the disease columns matches any of them, i.e.

codes_to_remove <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00", "G309", "G308",
                "G301", "G300", "G30", "F01", "F018", "F013", "F012", "F011", "F010", "F01")

df[rowSums(sapply(df[-1], `%in%`, codes_to_remove)) == 0,]

which gives,

     ID disease_code_1 disease_code_2 disease_code_3
1  1001           I802           A071           H250
2  1002           H356             NA             NA
4  1004           D235             NA           I802
5  1005           B178             NA             NA
8  1008           C761             NA             NA
11 1011           J679           A045           D352
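
To make the one-liner above easier to follow, here is a minimal sketch that splits it into its intermediate steps. The names `is_excluded`, `n_hits` and `kept` are introduced only for illustration, and the data is the simulated data from the question (where missing codes are the string 'NA' rather than a true NA).

```r
# Simulated data from the question; missing codes are the literal string 'NA'
df <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011),
  disease_code_1 = c('I802','H356','G560','D235','B178','F011','F023','C761','H653','A049','J679'),
  disease_code_2 = c('A071','NA','G20','NA','NA','A049','NA','NA','G300','G308','A045'),
  disease_code_3 = c('H250','NA','NA','I802','NA','A481','NA','NA','NA','NA','D352'),
  stringsAsFactors = FALSE
)

codes_to_remove <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00",
                     "G309", "G308", "G301", "G300", "G30", "F01", "F018",
                     "F013", "F012", "F011", "F010")

# Step 1: for every disease column, flag the cells holding an excluded code.
# sapply() returns a logical matrix with one column per disease_code variable.
is_excluded <- sapply(df[-1], `%in%`, codes_to_remove)

# Step 2: count the flagged cells in each row (TRUE counts as 1).
n_hits <- rowSums(is_excluded)

# Step 3: keep only the rows with no flagged cells.
kept <- df[n_hits == 0, ]

kept$ID
# 1001 1002 1004 1005 1008 1011
```

Because `%in%` never returns NA, the same approach should also work if the missing codes are stored as true NA values instead of the string 'NA'.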
Sotos answered Oct 20 '22 17:10


One dplyr possibility could be:

library(dplyr)

df %>%
 filter_at(vars(2:4), all_vars(! . %in% c("F023","G20","F009","F002","F001","F000","F00",
            "G309", "G308","G301","G300","G30", "F01","F018","F013",
            "F012", "F011", "F010","F01")))

    ID disease_code_1 disease_code_2 disease_code_3
1 1001           I802           A071           H250
2 1002           H356             NA             NA
3 1004           D235             NA           I802
4 1005           B178             NA             NA
5 1008           C761             NA             NA
6 1011           J679           A045           D352

In this case, it keeps only the rows in which none of the columns 2:4 contains any of the given codes.

Or:

df %>%
 filter_at(vars(contains("disease_code")), all_vars(! . %in% c("F023","G20","F009","F002","F001","F000","F00",    
            "G309", "G308","G301","G300","G30", "F01","F018","F013",
            "F012", "F011", "F010","F01")))

In this case, it keeps only the rows in which none of the columns whose names contain disease_code contains any of the given codes.
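
The scoped verbs such as filter_at() are superseded in recent dplyr releases. As a sketch only, assuming dplyr 1.0.4 or later is installed, the same filter can be written with if_all(). The names `codes_to_remove` and `kept` are used for illustration, and the data is the simulated data from the question.

```r
library(dplyr)

# Simulated data from the question; missing codes are the literal string 'NA'
df <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011),
  disease_code_1 = c('I802','H356','G560','D235','B178','F011','F023','C761','H653','A049','J679'),
  disease_code_2 = c('A071','NA','G20','NA','NA','A049','NA','NA','G300','G308','A045'),
  disease_code_3 = c('H250','NA','NA','I802','NA','A481','NA','NA','NA','NA','D352'),
  stringsAsFactors = FALSE
)

codes_to_remove <- c("F023", "G20", "F009", "F002", "F001", "F000", "F00",
                     "G309", "G308", "G301", "G300", "G30", "F01", "F018",
                     "F013", "F012", "F011", "F010")

# Keep a row only if every disease_code column is free of the excluded codes.
# `!.x %in% codes_to_remove` parses as !(.x %in% codes_to_remove).
kept <- df %>%
  filter(if_all(starts_with("disease_code"), ~ !.x %in% codes_to_remove))

kept$ID
# 1001 1002 1004 1005 1008 1011
```

Selecting the columns with starts_with() rather than by position (2:4) means the filter keeps working when the real data has around 70 disease columns, as long as they share the prefix.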

tmfmnk answered Oct 20 '22 16:10