I'm trying to turn each level of a factor column into its own column containing just 0 or 1. There is probably a function for that, or someone has already asked, but I couldn't find it. Here is a simple example to show what I need:
test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
measure1 = c(1:9))
# as result:
#   group_A group_B group_C measure1
# 1       1       0       0        1
# 2       1       0       0        2
# 3       1       0       0        3
# 4       0       1       0        4
# 5       0       1       0        5
# 6       0       0       1        6
# 7       0       0       1        7
# 8       0       0       1        8
# 9       0       0       1        9
Any hint on how I can do that?
We may use dummy_cols() from the fastDummies package:
library(fastDummies)
library(dplyr)
test %>%
  rename(group = 'my_groups') %>%
  dummy_cols('group', remove_selected_columns = TRUE) %>%
  select(starts_with('group'), measure1)
Output:
  group_A group_B group_C measure1
1       1       0       0        1
2       1       0       0        2
3       1       0       0        3
4       0       1       0        4
5       0       1       0        5
6       0       0       1        6
7       0       0       1        7
8       0       0       1        8
9       0       0       1        9
Fortunately, there's a one-function base R solution. This type of problem comes up a lot, and model.matrix() is built exactly for this.
# the "+ 0" is to avoid adding a column for the intercept.
model.matrix(~ my_groups + measure1 + 0, data=test)
Output:
  my_groupsA my_groupsB my_groupsC measure1
1          1          0          0        1
2          1          0          0        2
3          1          0          0        3
4          0          1          0        4
5          0          1          0        5
6          0          0          1        6
7          0          0          1        7
8          0          0          1        8
9          0          0          1        9
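Note that model.matrix() returns a numeric matrix rather than a data frame, and it names the columns by pasting the variable name and the level together. If the group_A style names from the question are wanted, a minimal sketch along these lines should work (res and the sub() pattern are my own choices, not part of the answer above):

```r
# Same example data as in the question
test <- data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                   measure1 = 1:9)

# Build the indicator matrix; "+ 0" drops the intercept column
mm <- model.matrix(~ my_groups + measure1 + 0, data = test)

# Convert to a data frame and swap the "my_groups" prefix for "group_"
res <- as.data.frame(mm)
names(res) <- sub("^my_groups", "group_", names(res))
res
```

One caveat: with "+ 0", only the first factor in the formula gets a column for every level. Any further factors are coded with the default contrasts and lose their first level, so this shortcut fits the single-factor case best.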
Here's a base R solution, constructing the matrix using expand.grid, then adding the required names.
res <- data.frame( t( unique( matrix( as.numeric( do.call("==", expand.grid(
test$my_groups, test$my_groups) ) ), dim(test)[1] ) ) ), test$measure1 )
colnames(res) <- c( paste0( "group_", unique(test$my_groups) ), colnames(test)[2] )
res
  group_A group_B group_C measure1
1       1       0       0        1
2       1       0       0        2
3       1       0       0        3
4       0       1       0        4
5       0       1       0        5
6       0       0       1        6
7       0       0       1        7
8       0       0       1        8
9       0       0       1        9
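The same row-by-level comparison can be written a little more directly with outer(). This is a sketch of an alternative rather than the code above rewritten; lv, ind and res are illustrative names:

```r
# Same example data as in the question
test <- data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                   measure1 = 1:9)

# Distinct group labels, in order of first appearance
lv <- unique(test$my_groups)

# Compare every row's group against every label: a 9 x 3 logical matrix.
# Multiplying by 1L turns TRUE/FALSE into 1/0 while keeping the dimensions.
ind <- outer(test$my_groups, lv, "==") * 1L
colnames(ind) <- paste0("group_", lv)

# Attach the remaining column
res <- data.frame(ind, measure1 = test$measure1)
res
```

Here each column of ind answers "does this row belong to that group?", which is what the unique() and t() steps recover from the full 9 x 9 comparison in the expand.grid version.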
We can also try this with the tidyverse (tidyr, purrr and dplyr).
library(tidyverse)
test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
measure1 = c(1:9))
dummyfy <-
  as_mapper(~{
    len_row <- vector('numeric', nrow(test))
    len_row[.] <- c(1)
    len_row
  })
data <- pivot_wider(test, names_from = my_groups, values_from = measure1)
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates
map(data, ~reduce(., c)) %>%
map_dfr(dummyfy) %>%
bind_cols(test[-1])
#> # A tibble: 9 × 4
#>       A     B     C measure1
#>   <dbl> <dbl> <dbl>    <int>
#> 1     1     0     0        1
#> 2     1     0     0        2
#> 3     1     0     0        3
#> 4     0     1     0        4
#> 5     0     1     0        5
#> 6     0     0     1        6
#> 7     0     0     1        7
#> 8     0     0     1        8
#> 9     0     0     1        9
#equivalent using across:
data %>% summarise(across(everything(), ~reduce(., c) %>% dummyfy)) %>% bind_cols(test[-1])
#> # A tibble: 9 × 4
#>       A     B     C measure1
#>   <dbl> <dbl> <dbl>    <int>
#> 1     1     0     0        1
#> 2     1     0     0        2
#> 3     1     0     0        3
#> 4     0     1     0        4
#> 5     0     1     0        5
#> 6     0     0     1        6
#> 7     0     0     1        7
#> 8     0     0     1        8
#> 9     0     0     1        9
Created on 2021-12-03 by the reprex package (v2.0.1)
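One thing to be aware of: dummyfy() uses the values of measure1 as row positions, so the approach appears to rely on measure1 being exactly 1:9. Below is a sketch of a tidyverse variant that does not depend on that, assuming a reasonably recent tidyr that accepts a scalar values_fill. The helper columns row_id and value are names I introduce here for illustration:

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
test <- data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                   measure1 = 1:9)

res <- test %>%
  # row_id keeps every row distinct; value is the 1 that gets spread out
  mutate(row_id = row_number(), value = 1L) %>%
  # one column per group, filling the groups a row does not belong to with 0
  pivot_wider(names_from = my_groups, values_from = value,
              names_prefix = "group_", values_fill = 0L) %>%
  # drop the helper column and put the indicator columns first
  select(starts_with("group_"), measure1)

res
```

Because each row carries its own row_id, pivot_wider() has nothing to collapse, so the list-column warning shown above should not appear.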