I have a huge 1000 x 100000 data frame, like the following, that I need to recode to numeric values.
myd <- data.frame(v1 = sample(c("AA", "AB", "BB", NA), 10, replace = TRUE),
                  v2 = sample(c("CC", "CG", "GG", NA), 10, replace = TRUE),
                  v3 = sample(c("AA", "AT", "TT", NA), 10, replace = TRUE),
                  v4 = sample(c("AA", "AT", "TT", NA), 10, replace = TRUE),
                  v5 = sample(c("CC", "CA", "AA", NA), 10, replace = TRUE),
                  # needed in R >= 4.0.0, where strings are no longer converted to factors by default
                  stringsAsFactors = TRUE
)
myd
v1 v2 v3 v4 v5
1 AB CC <NA> <NA> AA
2 AB CG TT TT AA
3 AA GG AT AT CA
4 <NA> <NA> <NA> AT <NA>
5 AA <NA> AA <NA> CA
6 BB <NA> TT TT CC
7 AA GG AA AT CA
8 <NA> GG <NA> AT CA
9 AA <NA> AT <NA> CC
10 AA GG TT AA CC
Each variable has potentially four unique values (including NA).
unique(myd$v1)
[1] AB AA <NA> BB
Levels: AA AB BB
unique(myd$v2)
[1] CC CG GG <NA>
Levels: CC CG GG
These values can be any combination, but each consists of two letters (except NA). For example, the letters "A" and "B" in the first case give the combinations "AA", "AB", "BB", and the numerical codes for these would be 1, 0, -1 respectively. Similarly, in the second case the letters "C" and "G" give "CC", "CG", "GG", with numerical codes 1, 0, -1 respectively. Thus the above myd needs to be recoded to:
myd
v1 v2 v3 v4 v5
1 0 1 <NA> <NA> 1
2 0 0 -1 -1 1
3 1 -1 0 0 0
4 <NA> <NA> <NA> 0 <NA>
5 1 <NA> 1 <NA> 0
6 -1 <NA> -1 -1 -1
7 1 -1 1 0 0
8 <NA> -1 <NA> 0 0
9 1 <NA> 0 <NA> -1
10 1 -1 -1 1 -1
I will post a different solution (skip to data.table for the superfast approach!).
If you want to recode AA, AB, BB to 1, 0, -1, etc., you can use indexing (along with the factor-to-numeric solution). This also lets you use a different recoding if you wish!
simple_recode <- function(.x, new_codes){
new_codes[as.numeric(.x)]
}
as.data.frame(lapply( myd, simple_recode, new_codes = 1:-1))
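As a quick check of the indexing idea, here is a minimal sketch on a small hand-made factor. The vector g and the alternative 2/1/0 coding are made up for illustration and are not part of the question's data; the sketch relies on the levels sorting alphabetically as AA, AB, BB.

```r
# simple_recode() as defined above: index the new codes by the factor's integer codes
simple_recode <- function(.x, new_codes) {
  new_codes[as.numeric(.x)]
}

# Illustrative factor; levels sort alphabetically as AA, AB, BB
g <- factor(c("AB", "AA", NA, "BB"))

res_default <- simple_recode(g, new_codes = 1:-1)        # AA -> 1, AB -> 0, BB -> -1
res_counts  <- simple_recode(g, new_codes = c(2, 1, 0))  # an alternative 2/1/0 coding

print(res_default)  # 0 1 NA -1
print(res_counts)   # 1 2 NA 0
```

NA entries stay NA because indexing a vector with NA returns NA.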
factor
You can simply relabel the letters by calling factor with the new levels as labels:
as.data.frame(lapply(myd, factor, labels = 1:-1))
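Note that factor() with labels still returns factors, whose labels are the strings "1", "0" and "-1" rather than numbers. If real numeric columns are needed, one option is to go through as.character() first. A minimal sketch on a made-up vector g (not the question's data):

```r
# Illustrative factor; levels sort alphabetically as AA, AB, BB
g <- factor(c("AB", "AA", NA, "BB"))

# Relabel the levels: AA -> "1", AB -> "0", BB -> "-1"
relabelled <- factor(g, labels = 1:-1)

is.factor(relabelled)                        # TRUE: still a factor

# Convert the labels themselves to numbers
as_numbers <- as.numeric(as.character(relabelled))
print(as_numbers)                            # 0 1 NA -1

# Calling as.numeric() directly returns the internal level codes instead
internal_codes <- as.numeric(relabelled)
print(internal_codes)                        # 2 1 NA 3
```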
data.table for efficiency
If your data is big, then I suggest a data.table approach, which will be memory and time efficient.
library(data.table)
DT <- as.data.table(myd)
as.data.table(DT[, lapply(.SD, simple_recode, new_codes = 1:-1)])
Or, more efficiently
as.data.table(DT[, lapply(.SD, setattr, 'levels', 1:-1)])
Or, even more efficiently (modifying the levels in place, and avoiding the as.data.table call)
for(name in names(DT)){
setattr(DT[[name]],'levels',1:-1)
}
setattr modifies by reference, so nothing is copied.
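The setattr approach leaves the columns as factors with relabelled levels. If numeric columns are wanted instead, a possible variant is to overwrite each column by reference with data.table's set(). This is a sketch of a different technique, not the setattr method itself. The table DT_demo is made up for illustration, and the code assumes all three genotypes are present in each column, so that the levels sort as first homozygote, heterozygote, second homozygote.

```r
library(data.table)

# Made-up example table; factor levels sort alphabetically within each column
DT_demo <- data.table(
  v1 = factor(c("AB", "AA", NA, "BB")),
  v2 = factor(c("CC", "CG", "GG", NA))
)

codes <- c(1, 0, -1)  # codes for level 1, level 2, level 3

for (name in names(DT_demo)) {
  # set() replaces the whole column in place, without copying the table
  set(DT_demo, j = name, value = codes[as.integer(DT_demo[[name]])])
}

print(DT_demo)
```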
As demonstrated on this big dataset:
# some big data (100 columns, 1e6 rows)
big <- replicate(100, factor(sample(c('AA','AB','BB', NA), 1e6, TRUE)), simplify = FALSE)
bigDT <- as.data.table(big)
system.time({
  # loop over the data.table's columns (the bare list `big` has no names)
  for(name in names(bigDT)){
    setattr(bigDT[[name]],'levels',1:-1)
  }
})
## user system elapsed
## 0 0 0
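One caveat with recoding by level position: it assumes every column contains all three genotypes. If a column happens to contain only AB and BB, for instance, AB becomes the first level and is coded 1 rather than 0. A possible workaround, sketched here in base R with a hypothetical helper recode_genotype() (not part of any package), is to derive the code from the two letters themselves. It will likely be slower than the positional approaches because it works on strings.

```r
# Hypothetical helper: code heterozygotes as 0, the homozygote of the
# alphabetically first allele as 1, and the other homozygote as -1.
recode_genotype <- function(x) {
  x  <- as.character(x)
  a1 <- substr(x, 1, 1)                # first allele letter (NA stays NA)
  a2 <- substr(x, 2, 2)                # second allele letter
  alleles <- sort(unique(c(a1, a2)))   # sort() drops NA by default
  ifelse(a1 != a2, 0, ifelse(a1 == alleles[1], 1, -1))
}

full    <- recode_genotype(c("AB", "AA", NA, "BB"))
partial <- recode_genotype(c("AB", "BB", NA, "AB"))  # no AA observed

print(full)     # 0 1 NA -1
print(partial)  # 0 -1 NA 0
```

It could be applied to a whole data frame with as.data.frame(lapply(myd, recode_genotype)). If a column contains only a single homozygote and no heterozygotes, the second allele cannot be inferred, and that genotype is coded 1.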
You can take advantage of the fact that your data are factors, which have numeric indices underneath them.
For example:
> as.numeric(myd$v1)
[1] 2 2 1 NA 1 3 1 NA 1 1
The numeric values correspond to the levels() of the factor:
> levels(myd$v1)
[1] "AA" "AB" "BB"
So 1 == AA, 2 == AB, 3 == BB, and so on.
So you can simply convert your data to numeric and apply the necessary maths to scale it how you want. Here we can subtract 2 and then multiply by -1 to get your results:
(sapply(myd, as.numeric) - 2) * -1
#-----
v1 v2 v3 v4 v5
[1,] 0 1 NA NA 1
[2,] 0 0 -1 -1 1
[3,] 1 -1 0 0 0
[4,] NA NA NA 0 NA
[5,] 1 NA 1 NA 0
[6,] -1 NA -1 -1 -1
[7,] 1 -1 1 0 0
[8,] NA -1 NA 0 0
[9,] 1 NA 0 NA -1
[10,] 1 -1 -1 1 -1
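Since (x - 2) * -1 is the same as 2 - x, the conversion can also be written so that the result stays an integer matrix, which takes half the memory of a double matrix and may matter for a 1000 x 100000 table. A minimal sketch on a made-up two-column data frame (demo is not the question's data, and it assumes all three levels are present in each column):

```r
# Made-up data; each column is a factor with three alphabetically sorted levels
demo <- data.frame(
  v1 = factor(c("AB", "AA", NA, "BB")),
  v2 = factor(c("CC", "CG", "GG", NA))
)

# Approach from the answer: double-precision matrix
m_num <- (sapply(demo, as.numeric) - 2) * -1

# Same arithmetic kept as integers: 2 - code maps 1, 2, 3 to 1, 0, -1
m_int <- vapply(demo, function(col) 2L - as.integer(col), integer(nrow(demo)))

print(m_int)
storage.mode(m_int)  # "integer"
```

vapply() also checks that every column yields an integer vector of the expected length, which can catch malformed columns early.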