Is is possible to reorder the factor levels of a column in a data table by reference? As far as I can tell, not in existing answers:
table$x <- factor(table$x, levels = c("giraffes", "orangutans", "monkeys"))
setattr(mydt$value,"levels",c(...))
The accepted answer provides a way that works without side-effects, using :=
notation.
The background for the question was the need to reorder a bar chart in Plotly, which prescribes reordering factor levels (link).
I need to reorder the factor levels of a column in a data table frame and I would like to do it in a native data table way, if there is one. I know I can do it the data frame way, but this requires hard coded column names. I would like to use dynamic column names, and if possible, updating by reference. I found a way of reordering a column by reference (link), but it does not reorder the other columns accordingly, nor am I able to use a dynamic column reference.
I have the following data (CO2 emissions per capita):
library(data.table)
print(data)
indicator country year value
1: EN.ATM.CO2E.PC Canada 2011 15.639760
2: EN.ATM.CO2E.PC China 2011 7.241515
3: EN.ATM.CO2E.PC European Union 2011 7.079374
4: EN.ATM.CO2E.PC India 2011 1.476686
5: EN.ATM.CO2E.PC Saudi Arabia 2011 17.702307
6: EN.ATM.CO2E.PC United States 2011 16.972417
class(data)
[1] "data.table" "data.frame"
print(str(lapply(data, class)))
List of 8
$ indicator : chr "factor"
$ country : chr "factor"
$ year : chr "factor"
$ value : chr "numeric"
So if I want to reverse sort by the country column, I can do (2):
col <- "country"
column.levels <- levels(data[[col]])
column.levels <- sort(column.levels, T)
data$country <- factor(data$country, levels = column.levels)
But here, country is hard coded, and sorting is not by reference. I anticipate using this code in a function that should work with different data sets, with other column names. So how do I get it to work with a column name stored in a variable, and possibly by reference?
For reference, the data is intended to be presented in a bar chart using Plotly. where I would like to manipulate the order. As discussed here (2), the way to do that is by reordering factor levels.
Help is very much appreciated!
factor()
vs. setattr()
Thanks @Uwe for the feedback. Let me try to be clearer wrt. what I think is the problem. I managed to select columns dynamically, that was just a simple matter of using data[["column"]]
instead of data$country
, but I must have done something wrong the first time I tried. That leaves the problem of reordering levels by reference.
Here's the dput of data:
dput(data)
structure(list(indicator = structure(c(1L, 1L, 1L, 1L, 1L, 1L
), .Label = "EN.ATM.CO2E.PC", class = "factor"), country = structure(1:6, .Label = c("Canada",
"China", "European Union", "India", "Saudi Arabia", "United States"
), class = "factor"), year = structure(c(1L, 1L, 1L, 1L, 1L,
1L), class = "factor", .Label = "2011"), value = c(15.6397596234201,
7.24151541889549, 7.07937396032502, 1.47668634979755, 17.7023072439215,
16.9724170879273)), .Names = c("indicator", "country", "year",
"value"), sorted = c("indicator", "country", "year"), class = c("data.table",
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x00000000001f0788>)
Here's what happens when I use factor to reorder the country column:
column.levels <- levels(data[["country"]])
column.levels <- sort(column.levels, T)
data[["country"]] <- factor(data[["country"]], levels = column.levels)
levels(data[["country"]]) #just as desired
[1] "United States" "Saudi Arabia" "India" "European Union" "China" "Canada"
print(data) #nominal order unchanged
indicator country year value
1: EN.ATM.CO2E.PC Canada 2011 15.639760
2: EN.ATM.CO2E.PC China 2011 7.241515
3: EN.ATM.CO2E.PC European Union 2011 7.079374
4: EN.ATM.CO2E.PC India 2011 1.476686
5: EN.ATM.CO2E.PC Saudi Arabia 2011 17.702307
6: EN.ATM.CO2E.PC United States 2011 16.972417
The table is unmolested, but the levels are reordered as required.
But here's what happens using setattr():
setattr(data[["country"]], "levels", column.levels)
#Also tried this, same result: data[,setattr(country, "levels", column.levels)]
levels(data[["country"]]) #well this looks good...
[1] "United States" "Saudi Arabia" "India" "European Union" "China" "Canada"
print(data) #but this is absurd...
indicator country year value
1: EN.ATM.CO2E.PC United States 2011 15.639760
2: EN.ATM.CO2E.PC Saudi Arabia 2011 7.241515
3: EN.ATM.CO2E.PC India 2011 7.079374
4: EN.ATM.CO2E.PC European Union 2011 1.476686
5: EN.ATM.CO2E.PC China 2011 17.702307
6: EN.ATM.CO2E.PC Canada 2011 16.972417
So factor() leaves the nominal order of the data unchanged. But setattr() actually changes the nominal order of the country column, causing havoc. So what's going wrong here? I'm perplexed by the difference in behaviour. Is it possible to use setattr() or another method that reorders column factor levels by reference? Hope I'm being clear now!
To modify a column of a data.table object by reference, i.e., without copying the whole object, the :=
operator can be used as follows:
col <- "country"
DT[, (col) := factor(get(col), levels = rev(levels(get(col))))]
str(DT)
Classes ‘data.table’ and 'data.frame': 6 obs. of 4 variables: $ indicator: Factor w/ 1 level "EN.ATM.CO2E.PC": 1 1 1 1 1 1 $ country : Factor w/ 6 levels "United States",..: 6 5 4 3 2 1 $ year : Factor w/ 1 level "2011": 1 1 1 1 1 1 $ value : num 15.64 7.24 7.08 1.48 17.7 ...
DT
indicator country year value 1: EN.ATM.CO2E.PC Canada 2011 15.639760 2: EN.ATM.CO2E.PC China 2011 7.241515 3: EN.ATM.CO2E.PC European Union 2011 7.079374 4: EN.ATM.CO2E.PC India 2011 1.476686 5: EN.ATM.CO2E.PC Saudi Arabia 2011 17.702307 6: EN.ATM.CO2E.PC United States 2011 16.972417
Note that DT
is used as name of the data.table object to avoid name conflicts with the data()
function.
As factor()
sorts the levels alphabetically by default, rev()
is used to reverse the order of the existing factor levels.
The column name is given in variable col
. Therefore, get()
is used to access the columns. Alternatively, this could be written as
DT[, (col) := lapply(.SD, factor, levels = rev(levels(DT[[col]]))), .SDcols = col]
using the special symbol .SD
and the .SDcols
parameter.
To verify that DT
is updated by reference, address(DT)
can be used.
setattr()
not work as expected?setattr()
seems only to change the labels of the levels but not the numbering of the levels as the OP wants.
DT
indicator country year value 1: EN.ATM.CO2E.PC Canada 2011 15.639760 2: EN.ATM.CO2E.PC China 2011 7.241515 3: EN.ATM.CO2E.PC European Union 2011 7.079374 4: EN.ATM.CO2E.PC India 2011 1.476686 5: EN.ATM.CO2E.PC Saudi Arabia 2011 17.702307 6: EN.ATM.CO2E.PC United States 2011 16.972417
DT[, as.integer(country)]
[1] 1 2 3 4 5 6
setattr(DT[[col]], "levels", rev(levels(DT[[col]])))
DT
indicator country year value 1: EN.ATM.CO2E.PC United States 2011 15.639760 2: EN.ATM.CO2E.PC Saudi Arabia 2011 7.241515 3: EN.ATM.CO2E.PC India 2011 7.079374 4: EN.ATM.CO2E.PC European Union 2011 1.476686 5: EN.ATM.CO2E.PC China 2011 17.702307 6: EN.ATM.CO2E.PC Canada 2011 16.972417
DT[, as.integer(country)]
[1] 1 2 3 4 5 6
If the above code is used, the numbering of the levels is changed accordingly:
DT[, (col) := factor(get(col), levels = rev(levels(get(col))))]
DT[, as.integer(country)]
[1] 6 5 4 3 2 1
(As DT
is modified in place, please, use always a fresh copy of DT
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With