
Transform each level of a factor column into a column containing just `0` or `1`

Tags: dataframe, r, dplyr

I'm trying to turn each level of a factor column into its own column containing just 0 or 1. There is probably a function for that, or someone has already asked, but I couldn't find it. Here is a simple example to show what I need:

test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                  measure1 = c(1:9))

# desired result:
#   group_A group_B group_C measure1
# 1       1       0       0        1
# 2       1       0       0        2
# 3       1       0       0        3
# 4       0       1       0        4
# 5       0       1       0        5
# 6       0       0       1        6
# 7       0       0       1        7
# 8       0       0       1        8
# 9       0       0       1        9

Any hint on how I can do that?

DR15 asked Dec 03 '21 19:12


4 Answers

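We may use dummy_cols() from the fastDummies package. Renaming my_groups to group first gives the group_ prefix, and remove_selected_columns = TRUE drops the original column.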

library(fastDummies)
library(dplyr)
test %>% 
    rename(group = 'my_groups') %>%
    dummy_cols('group', remove_selected_columns = TRUE) %>%    
    select(starts_with('group'), measure1)

Output:

 group_A group_B group_C measure1
1       1       0       0        1
2       1       0       0        2
3       1       0       0        3
4       0       1       0        4
5       0       1       0        5
6       0       0       1        6
7       0       0       1        7
8       0       0       1        8
9       0       0       1        9

akrun answered Nov 15 '22 05:11


Fortunately, there's a one-function Base R solution.

This type of problem happens a lot, and model.matrix() is built exactly for this.

# the "+ 0" is to avoid adding a column for the intercept.

model.matrix(~ my_groups + measure1 + 0, data=test)

Output:

  my_groupsA my_groupsB my_groupsC measure1
1          1          0          0        1
2          1          0          0        2
3          1          0          0        3
4          0          1          0        4
5          0          1          0        5
6          0          0          1        6
7          0          0          1        7
8          0          0          1        8
9          0          0          1        9
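
Note that model.matrix() returns a numeric matrix and prefixes each indicator column with the variable name. If a data frame with the group_ names from the question is wanted, a minimal sketch could look like this (the names res and mm are only illustrative, and the test data is repeated so the snippet runs on its own):

```r
test <- data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                   measure1 = 1:9)

# With "+ 0" every level of my_groups gets its own indicator column.
mm <- model.matrix(~ my_groups + measure1 + 0, data = test)

# model.matrix() returns a matrix; convert it when a data frame is needed.
res <- as.data.frame(mm)

# Replace the "my_groups" prefix added by model.matrix() with "group_".
names(res) <- sub("^my_groups", "group_", names(res))
res

# Without "+ 0" the first level (A) is absorbed into an "(Intercept)" column,
# so only my_groupsB and my_groupsC remain as indicators.
colnames(model.matrix(~ my_groups + measure1, data = test))
```

The second call is there only to show why the "+ 0" matters: dropping it silently loses the indicator for the first group.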

Jason answered Nov 15 '22 07:11


Here's a base R solution, constructing the matrix using expand.grid, then adding the required names.

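# Compare every row's group with every row's group (an n x n comparison),
# keep the unique 0/1 patterns (one per group), and transpose so that
# each group becomes a column.
res <- data.frame( t( unique( matrix( as.numeric( do.call("==", expand.grid(
   test$my_groups, test$my_groups) ) ), dim(test)[1] ) ) ), test$measure1 )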

colnames(res) <- c( paste0( "group_", unique(test$my_groups) ), colnames(test)[2] )

res
  group_A group_B group_C measure1
1       1       0       0        1
2       1       0       0        2
3       1       0       0        3
4       0       1       0        4
5       0       1       0        5
6       0       0       1        6
7       0       0       1        7
8       0       0       1        8
9       0       0       1        9
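
Because expand.grid compares every row with every other row, the intermediate matrix grows with the square of the number of rows. For larger data, a lighter base R variant is to compare the column against each distinct group once. This is only a sketch of that alternative, not the approach above; groups, lv, ind and res2 are illustrative names:

```r
test <- data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                   measure1 = 1:9)

# Work on a character vector so the code behaves the same for factors.
groups <- as.character(test$my_groups)
lv <- unique(groups)

# One 0/1 column per distinct group: 1 where the row belongs to that group.
ind <- sapply(lv, function(g) as.integer(groups == g))
colnames(ind) <- paste0("group_", lv)

res2 <- data.frame(ind, measure1 = test$measure1)
res2
```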

Andre Wildberg answered Nov 15 '22 05:11


We can try this using dplyr or purrr.

library(tidyverse)

test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                  measure1 = c(1:9))

dummyfy <- 
as_mapper(~{
  len_row <- vector('numeric', nrow(test))
  len_row[.] <- c(1)
  len_row}
)

data <- pivot_wider(test, names_from =  my_groups, values_from = measure1)
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates

map(data, ~reduce(., c)) %>%
  map_dfr(dummyfy) %>% 
  bind_cols(test[-1])
#> # A tibble: 9 × 4
#>       A     B     C measure1
#>   <dbl> <dbl> <dbl>    <int>
#> 1     1     0     0        1
#> 2     1     0     0        2
#> 3     1     0     0        3
#> 4     0     1     0        4
#> 5     0     1     0        5
#> 6     0     0     1        6
#> 7     0     0     1        7
#> 8     0     0     1        8
#> 9     0     0     1        9

#equivalent using across:

data %>% summarise(across(everything(), ~reduce(., c) %>% dummyfy)) %>% bind_cols(test[-1])
#> # A tibble: 9 × 4
#>       A     B     C measure1
#>   <dbl> <dbl> <dbl>    <int>
#> 1     1     0     0        1
#> 2     1     0     0        2
#> 3     1     0     0        3
#> 4     0     1     0        4
#> 5     0     1     0        5
#> 6     0     0     1        6
#> 7     0     0     1        7
#> 8     0     0     1        8
#> 9     0     0     1        9

Created on 2021-12-03 by the reprex package (v2.0.1)
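
For comparison, the list-column warning can be avoided by pivoting a constant indicator column instead of measure1. The following is a hedged sketch rather than a replacement for the code above; it assumes a tidyr version that accepts a scalar values_fill (1.1.0 or later), and row_id, value and res3 are illustrative names:

```r
library(dplyr)
library(tidyr)

test <- data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                   measure1 = 1:9)

res3 <- test %>%
  # row_id keeps every row distinct even if measure1 had repeated values;
  # value is the indicator that gets spread across the new columns.
  mutate(row_id = row_number(), value = 1L) %>%
  pivot_wider(names_from = my_groups, values_from = value,
              names_prefix = "group_", values_fill = 0L) %>%
  # Drop the helper column and put the indicators first.
  select(starts_with("group_"), measure1)

res3
```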

jpdugo17 answered Nov 15 '22 05:11