Handling factor variables in dplyr

Q: What does factor () do in R?

What is Factor in R? Factor in R is a variable used to categorize and store the data, having a limited number of different values. It stores the data as a vector of integer values. Factor in R is also known as a categorical variable that stores both string and integer data values as levels.

Q: How do you set a variable as a factor in R?

In R, you can convert multiple numeric variables to factor using lapply function. The lapply function is a part of apply family of functions. They perform multiple iterations (loops) in R. In R, categorical variables need to be set as factor variables.

Q: What does FCT mean in R?

fct: means a factor variable, also known as a categorical variable. int: means a quantitative variable that takes only integer or whole number values.

Tags:

r

dplyr

I have a data frame which contains an event history and I want to check its integrity by checking whether the last event for each ID number matches the current value in the system for that ID number. The data are coded as factors. The following toy data frame is a minimal example:

df <-data.frame(ID=c(1,1,1,1,2,2,2,3,3),
                 current.grade=as.factor(c("Senior","Senior","Senior","Senior",
                                         "Junior","Junior","Junior",
                                         "Sophomore","Sophomore")),
                 grade.history=as.factor(c("Freshman","Sophomore","Junior","Senior",
                                   "Freshman","Sophomore","Junior",
                                   "Freshman","Sophomore")))

which gives output

> df
  ID current.grade grade.history
1  1        Senior      Freshman
2  1        Senior     Sophomore
3  1        Senior        Junior
4  1        Senior        Senior
5  2        Junior      Freshman
6  2        Junior     Sophomore
7  2        Junior        Junior
8  3     Sophomore      Freshman
9  3     Sophomore     Sophomore
> str(df)
'data.frame':   9 obs. of  3 variables:
 $ ID           : num  1 1 1 1 2 2 2 3 3
 $ current.grade: Factor w/ 3 levels "Junior","Senior",..: 2 2 2 2 1 1 1 3 3
 $ grade.history: Factor w/ 4 levels "Freshman","Junior",..: 1 4 2 3 1 4 2 1 4

I want to use dplyr to extract the last value in grade.history and check it against current.grade:

df.summary <- df %>%
  group_by(ID) %>%
  summarize(current.grade.last=last(current.grade),
            grade.history.last=last(grade.history))

However, dplyr seems to convert the factors to integers, so I get this:

> df.summary
Source: local data frame [3 x 3]

  ID current.grade.last grade.history.last
1  1                  2                  3
2  2                  1                  2
3  3                  3                  4
> str(df.summary)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   3 obs. of  3 variables:
 $ ID                : num  1 2 3
 $ current.grade.last: int  2 1 3
 $ grade.history.last: int  3 2 4

Note that the values don't line up because the original factors had different level sets. What's the right way to do this with dplyr?

I'm using R version 3.1.1 and dplyr version 0.3.0.2

229

asked Jan 10 '15 17:01

tcquinn

1 Answers

Another way to approach this is to put your factor levels in their natural order, in this case Freshman, Sophomore, Junior, Senior, and then select the highest value for each ID using the which.max function for indexing. If you do it this way, you won't have to worry about whether your columns are ordered from lowest to highest grade for each ID (as you do if you use the last function).

library(dplyr)

df <-data.frame(ID=c(1,1,1,1,2,2,2,3,3),
                current.grade=as.factor(c("Senior","Senior","Senior","Senior",
                                          "Junior","Junior","Junior",
                                          "Sophomore","Sophomore")),
                grade.history=as.factor(c("Freshman","Sophomore","Junior","Senior",
                                          "Freshman","Sophomore","Junior",
                                          "Freshman","Sophomore")))


# Ordered vector of grades
gradeLookup = c("Freshman", "Sophomore", "Junior", "Senior")

# Reset the values in the grade columns to the ordering in gradeLookup
df[,-1] = lapply(df[,-1], function(x) {
  factor(x, levels=gradeLookup)
})

# For each ID, select the values of current.grade and grade.history at the maximum
# value of grade.history
df %>% group_by(ID) %>%
  summarise(current.grade.last = current.grade[which.max(grade.history)],
            grade.history.last = grade.history[which.max(grade.history)])

  ID current.grade.last grade.history.last
1  1             Senior             Senior
2  2             Junior             Junior
3  3          Sophomore          Sophomore

UPDATE 2: Since you want to sort and capture the last value (rather than the maximum value) by column, rather than whole rows, try this:

df %>% group_by(ID) %>%
  summarise(current.grade.last = current.grade[length(grade.history)],
            grade.history.last = grade.history[length(grade.history)])

END UPDATE 2

Does your data include a time variable, like year, term, or academic year? If so, you can dispense with current.grade and direclty select the value of grade.history at the most recent year of attendance. This will give you each student's last grade level. For example (assuming your time variable is called year):

df %>% group_by(ID) %>%
  summarise(last.grade = grade.history[which.max(year)])

UPDATE 1: I'm not sure what's causing your code to return the numerical code for each level, rather than the level label. It's not just an issue with the last function (you can see this if you do last(df$grade.history)). However, if you want to sort by time-stamp and then return the last row, the code below will keep the level labels. slice returns the rows you specify within each value of ID. In this case we specify the last row by using n(), which returns the total number of rows for each value of ID.

df.summary <- df %>%
  group_by(ID) %>%
  slice(n())

answered Oct 16 '22 09:10

eipi10

Related questions
                            
                                data.table .. notation with functions in j
                            
                                Efficient way to plot data on an irregular grid
                            
                                How to upload Jupyter notebooks using blogdown on netlify?
                            
                                How to display values in a "numericInput" in shiny as percentage?
                            
                                How to make bullets incrementally appear in flexdashboard storyboard for RMarkdown?
                            
                                Why did R's sorting change data imported with load() after an upgrade from 3.5.2 to 4.0.0?
                            
                                R performance with data reshaping
                            
                                R + ggplot : how to use a custom smoother (Gaussian Process)
                            
                                Save R plot to web server
                            
                                Different NA actions for coefficients and summary of linear model in R
                            
                                using Reduce/do.call with ifelse
                            
                                Makefile with multiple inputs and outputs
                            
                                Show two measurement units on axis ticks in ggplot2
                            
                                RStudio, knitr, pandoc, Word: how do I get a 'new page' in my docx?
                            
                                call a SQL Server Stored Procedure with Parameter in R
                            
                                Can I make partial plots for DecisionTreeClassifier in scikit-learn (and R)
                            
                                is it possible to use renjin within rstudio
                            
                                Load, detach, re-load anomaly
                            
                                Shiny & RStudio Server: "Error during WebSocket handshake: Unexpected response code: 404"
                            
                                NOTE in R CRAN check ---No repository set, so cyclic dependency check skipped [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With