R how to create columns/features based on existing data

Tags:

r

count

dplyr

strsplit

I have a dataframe df:

userID Score  Task_Alpha Task_Beta Task_Charlie Task_Delta 
3108  -8.00   Easy       Easy      Easy         Easy    
3207   3.00   Hard       Easy      Match        Match
3350   5.78   Hard       Easy      Hard         Hard
3961   10.00  Easy       NA        Hard         Hard
4021   10.00  Easy       Easy      NA           Hard


1. userID is a factor variable
2. Score is numeric
3. All the 'Task_' features are factor variables with possible values 'Hard', 'Easy', 'Match' or NA

I want to create new columns that contain, for each userID, the number of occurrences of each possible state of the Task_ features. For the toy example above, the required output would be three new columns appended to the end of df, as below:

userID Hard Match Easy
3108   0    0     4
3207   1    2     1
3350   3    0     1
3961   2    0     1
4021   1    0     2

Update: This question is not a duplicate; an associated part of the original question has been moved to: R How to counting the factors in ordered sequence

asked Nov 07 '19 09:11

Sandy

3 Answers

You can compare the dataframe df to each value in a map* or *apply function, compute the row-wise sums of the resulting boolean matrix, then combine the output with the original dataframe:

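library(dplyr)
library(purrr)

# The states to count
facs <- c("Easy", "Match", "Hard")

# For each state, count the matching cells per row, then append the counts to df
bind_cols(df, set_names(map_dfc(facs, ~ rowSums(df == ., na.rm = TRUE)), facs))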

#### OUTPUT ####

  userID Score Task_Alpha Task_Beta Task_Charlie Task_Delta Easy Match Hard
1   3108 -8.00       Easy      Easy         Easy       Easy    4     0    0
2   3207  3.00       Hard      Easy        Match      Match    1     2    1
3   3350  5.78       Hard      Easy         Hard       Hard    1     0    3
4   3961 10.00       Easy      <NA>         Hard       Hard    1     0    2
5   4021 10.00       Easy      Easy         <NA>       Hard    2     0    1
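
The answer mentions *apply as an alternative but only shows the purrr version. A minimal base R sketch of the same idea follows, assuming the toy data from the question (rebuilt here so the block runs on its own). The names task_cols, states, counts and result are illustrative and not part of the original answer. Restricting the comparison to the Task_ columns is an added precaution so that values in other columns can never be counted.

```r
# Toy data from the question, rebuilt so this block is self-contained
df <- data.frame(
  userID       = factor(c(3108, 3207, 3350, 3961, 4021)),
  Score        = c(-8, 3, 5.78, 10, 10),
  Task_Alpha   = factor(c("Easy", "Hard", "Hard", "Easy", "Easy")),
  Task_Beta    = factor(c("Easy", "Easy", "Easy", NA, "Easy")),
  Task_Charlie = factor(c("Easy", "Match", "Hard", "Hard", NA)),
  Task_Delta   = factor(c("Easy", "Match", "Hard", "Hard", "Hard"))
)

# Assumption: every column to be counted starts with "Task_"
task_cols <- grep("^Task_", names(df), value = TRUE)
states    <- c("Hard", "Match", "Easy")

# For each state, compare the task columns to it and sum the TRUEs per row;
# na.rm = TRUE makes NA cells count as "no match"
counts <- sapply(states, function(s) rowSums(df[task_cols] == s, na.rm = TRUE))

# Append the count matrix (one column per state) to the original data
result <- cbind(df, counts)
```

Because sapply keeps the elements of states as column names, the new columns appear in the order Hard, Match, Easy, matching the layout requested in the question.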

answered Dec 23 '22 23:12

RoCh

library(data.table)
DT <- fread("userID Score  Task_Alpha Task_Beta Task_Charlie Task_Delta 
3108  -8.00   Easy       Easy      Easy         Easy    
3207   3.00   Hard       Easy      Match        Match
3350   5.78   Hard       Easy      Hard         Hard
3961   10.00  Easy       NA        Hard         Hard
4021   10.00  Easy       Easy      NA           Hard
")

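# Reshape to long format: one row per userID and Task_ column
DT.melt <- melt( DT, id.vars = "userID", measure.vars = patterns( task = "^Task_") )
# Cast back to wide format, counting how often each value occurs per userID
# (NA values end up in a column of their own, as the output shows)
dcast( DT.melt, userID ~ value, fun.aggregate = length )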

#    userID NA Easy Hard Match
# 1:   3108  0    4    0     0
# 2:   3207  0    1    1     2
# 3:   3350  0    1    3     0
# 4:   3961  1    1    2     0
# 5:   4021  1    2    1     0
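
The NA column in the output above is not part of the output the question asks for. One possible way to leave it out, sketched here under the assumption that the Task_ columns are character vectors, is to pass na.rm = TRUE to melt so that missing values are dropped before counting. The table is built with data.table() rather than fread only to keep the block self-contained; long and counts are illustrative names.

```r
library(data.table)

# Toy data from the question, rebuilt so this block is self-contained
DT <- data.table(
  userID       = c(3108L, 3207L, 3350L, 3961L, 4021L),
  Score        = c(-8, 3, 5.78, 10, 10),
  Task_Alpha   = c("Easy", "Hard", "Hard", "Easy", "Easy"),
  Task_Beta    = c("Easy", "Easy", "Easy", NA, "Easy"),
  Task_Charlie = c("Easy", "Match", "Hard", "Hard", NA),
  Task_Delta   = c("Easy", "Match", "Hard", "Hard", "Hard")
)

# Long format; na.rm = TRUE drops the rows whose task value is NA
long <- melt(DT, id.vars = "userID",
             measure.vars = patterns("^Task_"), na.rm = TRUE)

# Wide format again: one column per remaining state, each cell a count
counts <- dcast(long, userID ~ value, fun.aggregate = length)
```

With the missing values removed before the cast, counts should contain only userID plus one column per observed state.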

answered Dec 24 '22 00:12

Wimpel

The first part can be answered by using apply row-wise and counting the occurrences of each factor level in every row with table:

cbind(df[1], t(apply(df[-c(1, 2)], 1, function(x) 
           table(factor(x, levels = c("Easy", "Hard", "Match"))))))


#  userID Easy Hard Match
#1   3108    4    0     0
#2   3207    1    1     2
#3   3350    1    3     0
#4   3961    1    2     0
#5   4021    2    1     0

In the tidyverse, we can convert the data to long format, drop the NA values, count the occurrences of each userID and value combination, and then reshape the counts back to wide format.

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(cols = starts_with("Task"), values_drop_na = TRUE) %>%
  count(userID, value) %>%
  pivot_wider(names_from = value, values_from = n, values_fill = list(n = 0))
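
The pipeline above returns only userID and the counts, whereas the question asks for the counts to be appended to df. A sketch of one way to do that is to join the counts back on userID; this assumes dplyr and tidyr are installed and that userID is unique per row. The data is rebuilt with character columns so the block runs on its own, and counts and result are illustrative names.

```r
library(dplyr)
library(tidyr)

# Toy data from the question, rebuilt so this block is self-contained
df <- data.frame(
  userID       = c(3108L, 3207L, 3350L, 3961L, 4021L),
  Score        = c(-8, 3, 5.78, 10, 10),
  Task_Alpha   = c("Easy", "Hard", "Hard", "Easy", "Easy"),
  Task_Beta    = c("Easy", "Easy", "Easy", NA, "Easy"),
  Task_Charlie = c("Easy", "Match", "Hard", "Hard", NA),
  Task_Delta   = c("Easy", "Match", "Hard", "Hard", "Hard")
)

# Same long -> count -> wide pipeline as in the answer
counts <- df %>%
  pivot_longer(cols = starts_with("Task"), values_drop_na = TRUE) %>%
  count(userID, value) %>%
  pivot_wider(names_from = value, values_from = n, values_fill = 0)

# Append the count columns to the original data by matching on userID
result <- left_join(df, counts, by = "userID")
```

A user whose Task_ values are all NA would have no row in counts, so left_join would leave NA in the new columns for that user.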

Data:

df <- structure(list(userID = c(3108L, 3207L, 3350L, 3961L, 4021L), 
Score = c(-8, 3, 5.78, 10, 10), Task_Alpha = structure(c(1L, 
2L, 2L, 1L, 1L), .Label = c("Easy", "Hard"), class = "factor"), 
Task_Beta = structure(c(1L, 1L, 1L, NA, 1L), .Label = "Easy", class = "factor"), 
Task_Charlie = structure(c(1L, 3L, 2L, 2L, NA), .Label = c("Easy", 
"Hard", "Match"), class = "factor"), Task_Delta = structure(c(1L, 
3L, 2L, 2L, 2L), .Label = c("Easy", "Hard", "Match"), class = "factor")), 
class = "data.frame", row.names = c(NA, -5L))

answered Dec 24 '22 00:12

Ronak Shah
