Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Reorder factor levels with data table (for use with Plotly)

tl;dr

Is is possible to reorder the factor levels of a column in a data table by reference? As far as I can tell, not in existing answers:

  • This works, but not by reference (link): table$x <- factor(table$x, levels = c("giraffes", "orangutans", "monkeys"))
  • This works by reference, but had unanticipated effects on the other columns of the table (link): setattr(mydt$value,"levels",c(...))

The accepted answer provides a way that works without side-effects, using := notation.

The background for the question was the need to reorder a bar chart in Plotly, which prescribes reordering factor levels (link).


Original question

I need to reorder the factor levels of a column in a data table frame and I would like to do it in a native data table way, if there is one. I know I can do it the data frame way, but this requires hard coded column names. I would like to use dynamic column names, and if possible, updating by reference. I found a way of reordering a column by reference (link), but it does not reorder the other columns accordingly, nor am I able to use a dynamic column reference.

I have the following data (CO2 emissions per capita):

library(data.table)
print(data)
        indicator        country year     value
1: EN.ATM.CO2E.PC         Canada 2011 15.639760
2: EN.ATM.CO2E.PC          China 2011  7.241515
3: EN.ATM.CO2E.PC European Union 2011  7.079374
4: EN.ATM.CO2E.PC          India 2011  1.476686
5: EN.ATM.CO2E.PC   Saudi Arabia 2011 17.702307
6: EN.ATM.CO2E.PC  United States 2011 16.972417

class(data)
[1] "data.table" "data.frame"

print(str(lapply(data, class)))
List of 8
 $ indicator     : chr "factor"
 $ country       : chr "factor"
 $ year          : chr "factor"
 $ value         : chr "numeric"

So if I want to reverse sort by the country column, I can do (2):

col <- "country"
column.levels <- levels(data[[col]])
column.levels <- sort(column.levels, T)
data$country <- factor(data$country, levels = column.levels)

But here, country is hard coded, and sorting is not by reference. I anticipate using this code in a function that should work with different data sets, with other column names. So how do I get it to work with a column name stored in a variable, and possibly by reference?

For reference, the data is intended to be presented in a bar chart using Plotly. where I would like to manipulate the order. As discussed here (2), the way to do that is by reordering factor levels.

Help is very much appreciated!


Clarification re: factor() vs. setattr()

Thanks @Uwe for the feedback. Let me try to be clearer wrt. what I think is the problem. I managed to select columns dynamically, that was just a simple matter of using data[["column"]] instead of data$country, but I must have done something wrong the first time I tried. That leaves the problem of reordering levels by reference.

Here's the dput of data:

dput(data)
structure(list(indicator = structure(c(1L, 1L, 1L, 1L, 1L, 1L
), .Label = "EN.ATM.CO2E.PC", class = "factor"), country = structure(1:6, .Label = c("Canada", 
"China", "European Union", "India", "Saudi Arabia", "United States"
), class = "factor"), year = structure(c(1L, 1L, 1L, 1L, 1L, 
1L), class = "factor", .Label = "2011"), value = c(15.6397596234201, 
7.24151541889549, 7.07937396032502, 1.47668634979755, 17.7023072439215, 
16.9724170879273)), .Names = c("indicator", "country", "year", 
"value"), sorted = c("indicator", "country", "year"), class = c("data.table", 
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x00000000001f0788>)

Here's what happens when I use factor to reorder the country column:

column.levels <- levels(data[["country"]])
column.levels <- sort(column.levels, T)
data[["country"]] <- factor(data[["country"]], levels = column.levels)

levels(data[["country"]]) #just as desired
[1] "United States"  "Saudi Arabia"   "India"          "European Union" "China"          "Canada" 

print(data) #nominal order unchanged
        indicator        country year     value
1: EN.ATM.CO2E.PC         Canada 2011 15.639760
2: EN.ATM.CO2E.PC          China 2011  7.241515
3: EN.ATM.CO2E.PC European Union 2011  7.079374
4: EN.ATM.CO2E.PC          India 2011  1.476686
5: EN.ATM.CO2E.PC   Saudi Arabia 2011 17.702307
6: EN.ATM.CO2E.PC  United States 2011 16.972417

The table is unmolested, but the levels are reordered as required.

But here's what happens using setattr():

setattr(data[["country"]], "levels", column.levels)
#Also tried this, same result: data[,setattr(country, "levels", column.levels)]

levels(data[["country"]]) #well this looks good...
[1] "United States"  "Saudi Arabia"   "India"          "European Union" "China"          "Canada" 

print(data) #but this is absurd...
        indicator        country year     value
1: EN.ATM.CO2E.PC  United States 2011 15.639760
2: EN.ATM.CO2E.PC   Saudi Arabia 2011  7.241515
3: EN.ATM.CO2E.PC          India 2011  7.079374
4: EN.ATM.CO2E.PC European Union 2011  1.476686
5: EN.ATM.CO2E.PC          China 2011 17.702307
6: EN.ATM.CO2E.PC         Canada 2011 16.972417

So factor() leaves the nominal order of the data unchanged. But setattr() actually changes the nominal order of the country column, causing havoc. So what's going wrong here? I'm perplexed by the difference in behaviour. Is it possible to use setattr() or another method that reorders column factor levels by reference? Hope I'm being clear now!

like image 360
salient.salamander Avatar asked Oct 16 '17 17:10

salient.salamander


1 Answers

To modify a column of a data.table object by reference, i.e., without copying the whole object, the := operator can be used as follows:

col <- "country"
DT[, (col) := factor(get(col), levels = rev(levels(get(col))))]
str(DT)
Classes ‘data.table’ and 'data.frame':    6 obs. of  4 variables:
 $ indicator: Factor w/ 1 level "EN.ATM.CO2E.PC": 1 1 1 1 1 1
 $ country  : Factor w/ 6 levels "United States",..: 6 5 4 3 2 1
 $ year     : Factor w/ 1 level "2011": 1 1 1 1 1 1
 $ value    : num  15.64 7.24 7.08 1.48 17.7 ...
DT
        indicator        country year     value
1: EN.ATM.CO2E.PC         Canada 2011 15.639760
2: EN.ATM.CO2E.PC          China 2011  7.241515
3: EN.ATM.CO2E.PC European Union 2011  7.079374
4: EN.ATM.CO2E.PC          India 2011  1.476686
5: EN.ATM.CO2E.PC   Saudi Arabia 2011 17.702307
6: EN.ATM.CO2E.PC  United States 2011 16.972417

Note that DT is used as name of the data.table object to avoid name conflicts with the data() function.

As factor() sorts the levels alphabetically by default, rev() is used to reverse the order of the existing factor levels.

The column name is given in variable col. Therefore, get() is used to access the columns. Alternatively, this could be written as

DT[, (col) := lapply(.SD, factor, levels = rev(levels(DT[[col]]))), .SDcols = col]

using the special symbol .SD and the .SDcols parameter.

To verify that DT is updated by reference, address(DT) can be used.


Why does setattr() not work as expected?

setattr() seems only to change the labels of the levels but not the numbering of the levels as the OP wants.

DT
        indicator        country year     value
1: EN.ATM.CO2E.PC         Canada 2011 15.639760
2: EN.ATM.CO2E.PC          China 2011  7.241515
3: EN.ATM.CO2E.PC European Union 2011  7.079374
4: EN.ATM.CO2E.PC          India 2011  1.476686
5: EN.ATM.CO2E.PC   Saudi Arabia 2011 17.702307
6: EN.ATM.CO2E.PC  United States 2011 16.972417
DT[, as.integer(country)]
[1] 1 2 3 4 5 6
setattr(DT[[col]], "levels", rev(levels(DT[[col]])))
DT
        indicator        country year     value
1: EN.ATM.CO2E.PC  United States 2011 15.639760
2: EN.ATM.CO2E.PC   Saudi Arabia 2011  7.241515
3: EN.ATM.CO2E.PC          India 2011  7.079374
4: EN.ATM.CO2E.PC European Union 2011  1.476686
5: EN.ATM.CO2E.PC          China 2011 17.702307
6: EN.ATM.CO2E.PC         Canada 2011 16.972417
DT[, as.integer(country)]
[1] 1 2 3 4 5 6

If the above code is used, the numbering of the levels is changed accordingly:

DT[, (col) := factor(get(col), levels = rev(levels(get(col))))]
DT[, as.integer(country)]
[1] 6 5 4 3 2 1

(As DT is modified in place, please, use always a fresh copy of DT)

like image 54
Uwe Avatar answered Oct 12 '22 23:10

Uwe