I have a data frame in which some of the values are NULL or empty. I would like to remove the columns in which all values are NULL or empty. The columns should actually be removed from the data frame, not just hidden.
My head(df) looks like this:
VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7
1 2R+ 52 1.05 0 0 30
2 2R+ 169 1.02 0 0 40
3 2R+ 83 NA 0 0 40
4 2R+ 98 1.16 0 0 40
5 2R+ 154 1.11 0 0 40
6 2R+ 111 NA 0 0 15
The data frame contains more than 200 variables, and the empty and all-zero columns are not adjacent to each other.
I tried to compute a per-column summary and select the columns that are NULL or empty, by analogy with the removal of "NA" columns (see here), but it does not work.
df <- df[,colSums(is.na(df))<nrow(df)]
I got an error: 'x' must be an array of at least two dimensions
Can anyone give me some help? Thanks!
In pandas, to drop columns that contain NA, pass axis='columns' to DataFrame.dropna(). By default (how='any'), this removes every column in which one or more values are missing; how='all' removes only the columns in which every value is missing.
DataFrame.dropna() drops rows or columns with NaN / None values. Python has no NULL, so missing data is represented as None or NaN.
To drop duplicate columns from a pandas DataFrame, use df.T.drop_duplicates().T, which removes all columns that hold the same data regardless of column names.
To drop all rows having at least one null value, DataFrame.dropna() is the method to use. Called on the whole DataFrame without any arguments (i.e. with the default behaviour), it drops every row with at least one missing value.
In R, we can use Filter, which keeps only the columns for which the function returns TRUE:
Filter(function(x) !(all(x=="")), df)
# Var1 Var3
#1 2R+ 52
#2 2R+ 169
#3 2R+ 83
#4 2R+ 98
#5 2R+ NA
#6 2R+ 111
#7 2R+ 94
#8 2R+ 116
#9 2R+ 86
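NOTE: It also works if all the elements of a particular column are NA. In that case all(x=="") evaluates to NA rather than TRUE or FALSE, and Filter drops elements for which the function returns NA.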
df$Var3 <- NA
Filter(function(x) !(all(x=="")), df)
# Var1
#1 2R+
#2 2R+
#3 2R+
#4 2R+
#5 2R+
#6 2R+
#7 2R+
#8 2R+
#9 2R+
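The NA case above relies on Filter silently discarding NA results. A minimal sketch that makes the NA test explicit is shown here; the toy data frame and the helper name is_blank are assumptions made for illustration, not part of the original data. Assigning the result back to df is what actually removes the columns.

```r
# Toy data (hypothetical): Var2 is all empty strings, Var3 is all NA,
# Var4 mixes empty, NA and a real value and should therefore be kept.
df <- data.frame(
  Var1 = c("2R+", "2R+", "2R+"),
  Var2 = c("", "", ""),
  Var3 = c(NA, NA, NA),
  Var4 = c("", NA, "x"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when every value in the column is NA or "".
is_blank <- function(x) all(is.na(x) | x == "")

# Keep only the columns that are not entirely blank, and overwrite df
# so the blank columns are really gone.
df <- Filter(Negate(is_blank), df)

names(df)
# [1] "Var1" "Var4"
```

Because is.na(x) is TRUE wherever x == "" would be NA, the predicate always returns TRUE or FALSE and never NA.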
Based on the updated dataset, if we also need to remove the columns that contain only 0 values, then change the code to
Filter(function(x) !(all(x==""|x==0)), df2)
# VAR1 VAR3 VAR4 VAR7
#1 2R+ 52 1.05 30
#2 2R+ 169 1.02 40
#3 2R+ 83 NA 40
#4 2R+ 98 1.16 40
#5 2R+ 154 1.11 40
#6 2R+ 111 NA 15
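The same selection can also be sketched with colSums(), which is closer to the attempt in the question. colSums() raises "'x' must be an array of at least two dimensions" when its argument is not a data frame or an array with at least two dimensions, so it may be worth checking is.data.frame() first. The data frame is rebuilt here only so that the snippet runs on its own; the names blank, keep and cleaned are assumptions for illustration.

```r
# Same values as the updated example dataset.
df2 <- data.frame(
  VAR1 = rep("2R+", 6),
  VAR2 = rep("", 6),
  VAR3 = c(52L, 169L, 83L, 98L, 154L, 111L),
  VAR4 = c(1.05, 1.02, NA, 1.16, 1.11, NA),
  VAR5 = rep(0L, 6),
  VAR6 = rep(0L, 6),
  VAR7 = c(30L, 40L, 40L, 40L, 40L, 15L),
  stringsAsFactors = FALSE
)

# colSums() needs a two-dimensional object, so check this first.
stopifnot(is.data.frame(df2))

# Logical matrix: TRUE where a cell is NA, an empty string, or zero.
blank <- is.na(df2) | df2 == "" | df2 == 0

# Keep the columns where fewer than nrow(df2) cells are blank.
keep <- colSums(blank) < nrow(df2)

# drop = FALSE keeps a data frame even if only one column survives.
cleaned <- df2[, keep, drop = FALSE]

names(cleaned)
# [1] "VAR1" "VAR3" "VAR4" "VAR7"
```

Assigning the result back (for example df2 <- cleaned) removes the columns from the data frame itself rather than just hiding them.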
df2 <- structure(list(VAR1 = c("2R+", "2R+", "2R+", "2R+", "2R+", "2R+"
), VAR2 = c("", "", "", "", "", ""), VAR3 = c(52L, 169L, 83L,
98L, 154L, 111L), VAR4 = c(1.05, 1.02, NA, 1.16, 1.11, NA), VAR5 = c(0L,
0L, 0L, 0L, 0L, 0L), VAR6 = c(0L, 0L, 0L, 0L, 0L, 0L), VAR7 = c(30L,
40L, 40L, 40L, 40L, 15L)), .Names = c("VAR1", "VAR2", "VAR3",
"VAR4", "VAR5", "VAR6", "VAR7"), row.names = c("1", "2", "3",
"4", "5", "6"), class = "data.frame")