Setting levels when creating a factor vs. `levels()<-`

Tags: r, factors

Let's create some factors first:

F1 <- factor(c(1,2,20,10,25,3))
F2 <- factor(paste0(F1, " years"))
F3 <- F2
levels(F3) <- paste0(sort(F1), " years")
F4 <- factor(paste0(F1, " years"), levels=paste0(sort(F1), " years"))

then take a look at them:

> F1
[1] 1  2  20 10 25 3 
Levels: 1 2 3 10 20 25

> F2
[1] 1 years  2 years  20 years 10 years 25 years 3 years 
Levels: 1 years 10 years 2 years 20 years 25 years 3 years

> F3
[1] 1 years  3 years  10 years 2 years  20 years 25 years
Levels: 1 years 2 years 3 years 10 years 20 years 25 years

> F4
[1] 1 years  2 years  20 years 10 years 25 years 3 years 
Levels: 1 years 2 years 3 years 10 years 20 years 25 years

First, I note that the order of the levels in F2 is not the one I expected from F1. The factor documentation reveals why: the levels are created by first sorting the input. In the case of F2, the inputs are strings, where sorting seems to take length into account (?).

What is harder for me to understand is the difference in setting the levels between F3 and F4. In F3 I set the levels after the factor is created, while in F4 I set them explicitly when creating the factor. In F3, the use of levels()<- doesn't seem to be purely a relabel of the levels, but neither does it reorder them the way I expected.

Can someone explain the difference?

asked Jul 20 '12 21:07

mindless.panda

1 Answer

F1 uses numeric sorting, as you figured out yourself.

F2 uses lexicographic sorting, first comparing the first character, breaking ties using the second, and so on, which is why "10 years" is between "1 years" and "2 years".
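As a rough illustration of the two sort orders, the sketch below sorts the same values once as numbers and once as strings. The variable names are made up for this example. `method = "radix"` is used only so that the character sort follows the C locale on every system; `factor()` itself sorts with the collation of your session's locale, so the default order may differ slightly on some setups.

```r
yrs  <- c(1, 2, 20, 10, 25, 3)
labs <- paste0(yrs, " years")

# Numeric sort compares the values themselves.
num_sorted <- sort(yrs)
print(num_sorted)
# 1 2 3 10 20 25

# Character sort compares character by character, so "10 years"
# lands between "1 years" and "2 years".
chr_sorted <- sort(labs, method = "radix")
print(chr_sorted)
# "1 years" "10 years" "2 years" "20 years" "25 years" "3 years"
```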

F4 is created from a character vector, but with an explicit list of possible levels. That list is taken as given (without sorting), and its entries are identified with the numbers 1 through 6. Then every item of your input is matched against the set of possible levels, and the associated number is stored. After all, a factor is simply a vector of integer codes (as.numeric will show them to you) associated with a list of levels used for printing. So the values of F4 print just like those of F2, but its levels are in a different order.
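A minimal sketch of that representation, rebuilding F4 from a plain numeric vector (the helper names `yrs`, `lev` and `codes` are only for illustration):

```r
yrs <- c(1, 2, 20, 10, 25, 3)
lev <- paste0(sort(yrs), " years")   # levels in numeric order

F4 <- factor(paste0(yrs, " years"), levels = lev)

# The integer codes are positions in the level list.
codes <- as.integer(F4)
print(codes)
# 1 2 5 4 6 3

# Indexing the levels by the codes gives back the printed values.
print(lev[codes])
print(identical(lev[codes], as.character(F4)))
# TRUE
```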

F3 was created from F2, so it starts out with F2's lexicographically sorted levels. The assignment only replaces the level names, not the numbers stored in the vector, so you can think of it as renaming the existing levels position by position. If you look at the numbers, they match those of F2, whereas the level names, and their order in particular, match those of F4.
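The renaming can be checked directly. In the sketch below, F2 is given explicit lexicographically sorted levels (via `method = "radix"`) so that the result does not depend on the locale; this is an assumption made for reproducibility, not how the question built F2.

```r
yrs  <- c(1, 2, 20, 10, 25, 3)
labs <- paste0(yrs, " years")
lev  <- paste0(sort(yrs), " years")

F2 <- factor(labs, levels = sort(unique(labs), method = "radix"))
F3 <- F2
levels(F3) <- lev   # renames levels by position, codes stay untouched

# The integer codes are the same before and after the assignment.
print(identical(as.integer(F2), as.integer(F3)))
# TRUE

# Old level name next to the name that replaced it.
renames <- data.frame(old = levels(F2), new = levels(F3))
print(renames)
```

The `renames` table lists which old name each position received, which is why the values of F3 no longer correspond to the original data.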

Your question claims that this is not purely a relabel, but it is: you obtain F3 from F2 by making the following substitutions (in both rows of the printout):

10 → 2
2 → 3
20 → 10
25 → 20
3 → 25

The str function is also a good tool to look at the internal representation of a factor.
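If the goal is to reorder the levels of an existing factor without changing its values, one common approach is to pass the factor back through `factor()` with the desired `levels`. The sketch below assumes the same data as in the question; `F2_reordered` is a name chosen for this example, and F2 is built with explicit C-locale levels so the result is reproducible.

```r
yrs  <- c(1, 2, 20, 10, 25, 3)
labs <- paste0(yrs, " years")
lev  <- paste0(sort(yrs), " years")

F2 <- factor(labs, levels = sort(unique(labs), method = "radix"))
str(F2)   # shows the level list and the integer codes

# Values are matched against the new level list by name, so they are
# preserved while the level order changes.
F2_reordered <- factor(F2, levels = lev)
str(F2_reordered)

print(identical(as.character(F2_reordered), as.character(F2)))
# TRUE
print(levels(F2_reordered))
```

This gives the same result as F4, whereas `levels(x) <- value` renames levels in place.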

answered Sep 19 '22 05:09

MvG
