
Remove columns of dataframe based on conditions in R

Tags:

dataframe

r

multiple-columns

data.table

I have to remove columns from my dataframe, which has over 4000 columns and 180 rows. The conditions under which I want to remove a column are: (i) Remove the column if there are fewer than two values/entries in that column. (ii) Remove the column if it has no two consecutive (one after the other) values. (iii) Remove the column if all of its values are NA. Note that I have given conditions under which a column is to be deleted; the aim here is not just to delete a column by its name, as in "How do you delete a column in data.table?". I illustrate as follows:

A       B    C   D  E
0.018  NA    NA  NA NA
0.017  NA    NA  NA NA
0.019  NA    NA  NA NA
0.018  0.034 NA  NA NA
0.018  NA    NA  NA NA
0.015  NA    NA  NA 0.037
0.016  NA    NA  NA 0.031
0.019  NA    0.4 NA 0.025
0.016  0.03  NA  NA 0.035
0.018  NA    NA  NA 0.035
0.017  NA    NA  NA 0.043
0.023  NA    NA  NA 0.040
0.022  NA    NA  NA 0.042

Desired dataframe:

A       E
0.018   NA
0.017   NA
0.019   NA
0.018   NA
0.018   NA
0.015   0.037
0.016   0.031
0.019   0.025
0.016   0.035
0.018   0.035
0.017   0.043
0.023   0.040
0.022   0.042

How can I incorporate these three conditions in one piece of code? I would appreciate your help in this regard. Reproducible example:

structure(list(Month = c("Jan-2000", "Feb-2000", "Mar-2000", 
"Apr-2000", "May-2000", "Jun-2000"), A.G.L.SJ.INVS...LON..DEAD...13.08.15 = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), ABACUS.GROUP.DEAD...18.02.09 = c(0.00829384766220866, 
0.00332213653674028, 0, 0, NA, NA), ABB.R..IRS. = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), .Names = c("Month", 
"A.G.L.SJ.INVS...LON..DEAD...13.08.15", "ABACUS.GROUP.DEAD...18.02.09", 
"ABB.R..IRS."), class = c("data.table", "data.frame"), row.names = c(NA, 
-6L))


asked Jan 20 '16 14:01

Aquarius

1 Answer

I feel like this is all over-complicated. Condition (ii) already covers the other two: if a column has at least two consecutive non-NA values, then it obviously contains more than one value, and it obviously isn't all NAs. So instead of three conditions, this all boils down to a single one (I prefer not to run many functions per column, but rather to run diff per column and then vectorize the rest):

cond <- colSums(is.na(sapply(df, diff))) < nrow(df) - 1

This works because diff returns NA wherever either of two neighbouring entries is NA, so if a column has no two consecutive values, the whole column becomes NAs after diff.
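To make that reasoning concrete, here is a small base R sketch of what diff does to a vector with isolated values versus one with a run of consecutive values. The vectors and the helper name `has_consecutive` are made up for illustration and are not part of the original answer.

```r
# Two hypothetical columns: one with only isolated values, one with a run.
x_isolated <- c(NA, 0.034, NA, NA, 0.03, NA)
x_run      <- c(NA, NA, 0.037, 0.031, NA, NA)

# diff() subtracts neighbours; any pair that involves an NA gives NA.
d_isolated <- diff(x_isolated)  # every difference involves an NA
d_run      <- diff(x_run)       # exactly one difference is computable

# Illustrative helper: TRUE if at least one pair of neighbours is non-NA.
has_consecutive <- function(x) any(!is.na(diff(x)))

has_consecutive(x_isolated)  # FALSE
has_consecutive(x_run)       # TRUE
```

Counting the NAs in `diff(x)` and comparing against `length(x) - 1`, as the one-liner above does, expresses the same test in a form that `colSums` can apply to all columns at once.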

Then, just

df[, cond, drop = FALSE]
#        A     E
# 1  0.018    NA
# 2  0.017    NA
# 3  0.019    NA
# 4  0.018    NA
# 5  0.018    NA
# 6  0.015 0.037
# 7  0.016 0.031
# 8  0.019 0.025
# 9  0.016 0.035
# 10 0.018 0.035
# 11 0.017 0.043
# 12 0.023 0.040
# 13 0.022 0.042
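For anyone who wants to reproduce the output above, the following sketch rebuilds the illustration data from the question as a plain data.frame and applies the same condition. Only the object name `kept` is introduced here; the values are copied from the question's table.

```r
# Rebuild the 13-row illustration from the question.
df <- data.frame(
  A = c(0.018, 0.017, 0.019, 0.018, 0.018, 0.015, 0.016,
        0.019, 0.016, 0.018, 0.017, 0.023, 0.022),
  B = c(NA, NA, NA, 0.034, NA, NA, NA, NA, 0.03, NA, NA, NA, NA),
  C = c(rep(NA, 7), 0.4, rep(NA, 5)),
  D = rep(NA_real_, 13),
  E = c(rep(NA, 5), 0.037, 0.031, 0.025, 0.035, 0.035, 0.043, 0.040, 0.042)
)

# A column is kept if diff() leaves at least one non-NA value,
# i.e. fewer than nrow(df) - 1 of its differences are NA.
cond <- colSums(is.na(sapply(df, diff))) < nrow(df) - 1

kept <- df[, cond, drop = FALSE]
names(kept)  # columns A and E remain
```

`drop = FALSE` matters when only one column survives: without it, a data.frame subset would collapse to a bare vector.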

Per your edit, it seems you have a data.table object that also has a date column (Month), so the code needs some modifications.

cond <- df[, lapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1, .SDcols = -1] 
df[, c(TRUE, cond), with = FALSE]

Some explanations:

We want to ignore the first column in our calculations, so we specify .SDcols = -1 when operating on .SD (which stands for Subset of Data in data.table).
.N is just the row count (similar to nrow(df)).
The next step is to subset by the condition. We must not forget to grab the first column too, so we start with c(TRUE, ...
Finally, data.table uses non-standard evaluation by default, so if you want to select columns as you would in a data.frame, you need to specify with = FALSE.
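The points above can be put together in a self-contained sketch. It assumes the data.table package is installed, shortens the question's column names to the made-up `all_na`, `abacus` and `also_all_na` for readability, and uses `sapply` rather than `lapply` so that the condition is a plain named vector; otherwise it follows the same logic.

```r
library(data.table)

# The question's reproducible example, with shortened (hypothetical) names.
df <- data.table(
  Month       = c("Jan-2000", "Feb-2000", "Mar-2000",
                  "Apr-2000", "May-2000", "Jun-2000"),
  all_na      = rep(NA_real_, 6),
  abacus      = c(0.00829384766220866, 0.00332213653674028, 0, 0, NA, NA),
  also_all_na = rep(NA_real_, 6)
)

# Count NA differences per column, skipping Month via .SDcols = -1;
# .N is the number of rows, so .N - 1 is the length of each diff().
cond <- df[, sapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1,
           .SDcols = -1]

# Prepend TRUE so Month is kept; with = FALSE makes j behave like
# ordinary data.frame column indexing.
result <- df[, c(TRUE, cond), with = FALSE]
names(result)  # Month and abacus remain
```

This version returns a new data.table and leaves `df` untouched.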

A better way, though, would be to remove the columns by reference using := NULL:

cond <- c(FALSE, df[, lapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1, .SDcols = -1])
df[, which(cond) := NULL]
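As a variation on the by-reference approach, the sketch below drops the columns by name rather than by position, which some may find easier to read. It assumes data.table is installed; the shortened column names and the helper objects `value_cols`, `no_run` and `drop_cols` are illustrative, not from the original answer.

```r
library(data.table)

# Same toy data as in the question, with shortened (hypothetical) names.
df <- data.table(
  Month       = c("Jan-2000", "Feb-2000", "Mar-2000",
                  "Apr-2000", "May-2000", "Jun-2000"),
  all_na      = rep(NA_real_, 6),
  abacus      = c(0.00829384766220866, 0.00332213653674028, 0, 0, NA, NA),
  also_all_na = rep(NA_real_, 6)
)

# Every column except Month is a candidate for removal.
value_cols <- setdiff(names(df), "Month")

# TRUE for columns with no two consecutive non-NA values.
no_run <- df[, sapply(.SD, function(x) all(is.na(diff(x)))),
             .SDcols = value_cols]

# Delete those columns in place; the parentheses tell data.table to
# evaluate drop_cols instead of treating it as a new column name.
drop_cols <- value_cols[no_run]
if (length(drop_cols) > 0) df[, (drop_cols) := NULL]

names(df)  # Month and abacus remain
```

Because `:=` modifies `df` in place, no copy of the 4000-column table is made, which is the main reason to prefer this form on wide data.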


answered Sep 21 '22 23:09

David Arenburg
