
What are levels in R?

Tags:

r

levels

I understand this is a very basic question but I don't understand what levels mean in R.

For reference, I have written a simple script that reads a CSV table, filters on one of the fields, assigns the result to a new variable, and clears the memory allocated for the first variable. If I call unique() on the field I filtered on, I see that the results were indeed filtered, but there is one additional line showing 'Levels' that corresponds to data in the original dataset.

Example:

df = read.csv(path, sep=",", header=TRUE)
df_intrate = df[df$AssetClass == "ASSET CLASS A", ]

rm(df)
gc()

unique(df_intrate$AssetClass)

Results:

[1] ASSET CLASS A
Levels: ASSET CLASS E ASSET CLASS D ASSET CLASS C ASSET CLASS B ASSET CLASS A

Is the structural information from df somehow preserved in df_intrate, even though RStudio shows that df_intrate has the expected number of rows for ASSET CLASS A?

asked Oct 19 '17 13:10

ApplePie

1 Answer

Is the structural information from df somehow preserved in df_intrate, even though RStudio shows that df_intrate has the expected number of rows for ASSET CLASS A?

Yes. This is how categorical variables, called factors, are stored in R: a factor keeps both its levels (a vector of all possible values) and the actual values taken, and subsetting does not change the levels:

x = factor(c('a', 'b', 'c', 'a', 'b', 'b'))
x
# [1] a b c a b b
# Levels: a b c

y = x[1]
y
# [1] a
# Levels: a b c
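
To see why the levels survive subsetting, it can help to look at what a factor holds internally. A minimal sketch, reusing the same x as above: the values are kept as integer codes that index into the levels attribute, and subsetting selects codes without touching that attribute.

```r
x = factor(c("a", "b", "c", "a", "b", "b"))

# The set of possible values lives in the levels attribute
x_levels = levels(x)    # "a" "b" "c"

# The values themselves are integer codes pointing into levels(x)
x_codes = as.integer(x) # 1 2 3 1 2 2

# Subsetting picks codes but leaves the levels attribute as it was
y = x[1]
y_levels = levels(y)    # still "a" "b" "c"
n_levels = nlevels(y)   # 3, although only one level is in use

print(x_levels)
print(x_codes)
print(y_levels)
print(n_levels)
```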

You can get rid of unused levels with droplevels(), or by re-applying the factor function, creating a new factor out of only what is present:

droplevels(y)
# [1] a
# Levels: a

factor(y)
# [1] a
# Levels: a
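
Unused levels are not only cosmetic: functions that work per level still see them. The sketch below, again on the toy x and y from above, shows table() reporting zero counts for unused levels until they are dropped. It also uses the drop argument of factor subsetting as a third way to discard unused levels at the moment of subsetting.

```r
x = factor(c("a", "b", "c", "a", "b", "b"))
y = x[1]

# Unused levels still show up, with a count of 0
counts_before = table(y)

# After dropping, only the level actually present remains
counts_after = table(droplevels(y))

# Subsetting with drop = TRUE discards unused levels in one step
z = x[1, drop = TRUE]

print(counts_before)
print(counts_after)
print(levels(z))
```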

You can also use droplevels on a data frame to drop all unused levels from all factor columns:

dat = data.frame(x = x)
str(dat)
# 'data.frame': 6 obs. of  1 variable:
#  $ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 2

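# drop = FALSE keeps the one-column result a data frame instead of a bare factor
str(dat[1, , drop = FALSE])
# 'data.frame': 1 obs. of  1 variable:
#  $ x: Factor w/ 3 levels "a","b","c": 1

str(droplevels(dat[1, , drop = FALSE]))
# 'data.frame': 1 obs. of  1 variable:
#  $ x: Factor w/ 1 level "a": 1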
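
Applied to the situation in the question, this could look like the sketch below. The data frame is made up for illustration: only the AssetClass column comes from the question, while the Value column and all contents are assumptions. stringsAsFactors = TRUE is set explicitly because, since R 4.0.0, data.frame() and read.csv() no longer convert strings to factors by default, so on a recent R the original script would produce a character column with no levels at all.

```r
# Hypothetical stand-in for the CSV in the question
df = data.frame(
  AssetClass = c("ASSET CLASS A", "ASSET CLASS B", "ASSET CLASS A", "ASSET CLASS C"),
  Value = c(1.5, 2.0, 0.7, 3.1),
  stringsAsFactors = TRUE
)

# Filter, then drop the levels that no longer occur in any factor column
df_intrate = droplevels(df[df$AssetClass == "ASSET CLASS A", ])

unique(df_intrate$AssetClass)
# [1] ASSET CLASS A
# Levels: ASSET CLASS A
```

Wrapping the filter in droplevels() keeps the clean-up next to the subsetting, so later steps such as table() or plotting only see the asset classes that are actually present.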

Though unrelated to your current issue, we should also mention that factor has an optional levels argument which can be used to specify the levels of a factor and the order in which they should go. This can be useful if you want a specific order (perhaps for plotting or modeling), or if there are more possible levels than are actually present and you want to include them. If you don't specify the levels, the default is the sorted unique values of the input, which for character data means alphabetical order.

x = c("agree", "disagree", "agree", "neutral", "strongly agree")
factor(x)
# [1] agree          disagree       agree          neutral        strongly agree
# Levels: agree disagree neutral strongly agree
## not a good order

factor(x, levels = c("disagree", "neutral", "agree", "strongly agree"))
# [1] agree          disagree       agree          neutral        strongly agree
# Levels: disagree neutral agree strongly agree
## better order

factor(x, levels = c("strongly disagree", "disagree", "neutral", "agree", "strongly agree"))
# [1] agree          disagree       agree          neutral        strongly agree
# Levels: strongly disagree disagree neutral agree strongly agree
## good order, more levels than are actually present

You can use reorder() and relevel() (see ?reorder and ?relevel), or just factor again, to change the order of levels for an already created factor.
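
As a rough sketch of those three options on the survey example: the score vector is a made-up numeric variable added only to give reorder() something to sort by, and the variable names are illustrative.

```r
x = factor(c("agree", "disagree", "agree", "neutral", "strongly agree"))

# relevel() moves one level to the front (the reference level in models)
lv_relevel = levels(relevel(x, ref = "neutral"))
# neutral, agree, disagree, strongly agree

# factor() again, with an explicit levels argument, sets the full order
lv_factor = levels(factor(x, levels = c("disagree", "neutral", "agree", "strongly agree")))
# disagree, neutral, agree, strongly agree

# reorder() sorts levels by a summary (the mean by default) of a second variable
score = c(4, 1, 5, 3, 2)  # hypothetical values, one per element of x
lv_reorder = levels(reorder(x, score))
# disagree (1), strongly agree (2), neutral (3), agree (4.5)

print(lv_relevel)
print(lv_factor)
print(lv_reorder)
```

relevel() is the lightest option when only the reference level matters, while calling factor again is the most explicit way to fix a complete order.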

answered Oct 04 '22 02:10

Gregor Thomas
