recoding data in r

Tags:

dataframe

r

data.table

I have a huge 1000 x 100000 data frame, like the following, that I need to recode to numeric values.

myd <- data.frame (v1 = sample (c("AA", "AB", "BB", NA), 10, replace = TRUE),
                   v2 = sample (c("CC", "CG", "GG", NA), 10, replace = TRUE),
                   v3 = sample (c("AA", "AT", "TT", NA), 10, replace = TRUE),
                   v4 = sample (c("AA", "AT", "TT", NA), 10, replace = TRUE),
                   v5 = sample (c("CC", "CA", "AA", NA), 10, replace = TRUE),
                   # needed in R >= 4.0 so that the columns are factors, as in the output below
                   stringsAsFactors = TRUE
                   )
myd
     v1   v2   v3   v4   v5
1    AB   CC <NA> <NA>   AA
2    AB   CG   TT   TT   AA
3    AA   GG   AT   AT   CA
4  <NA> <NA> <NA>   AT <NA>
5    AA <NA>   AA <NA>   CA
6    BB <NA>   TT   TT   CC
7    AA   GG   AA   AT   CA
8  <NA>   GG <NA>   AT   CA
9    AA <NA>   AT <NA>   CC
10   AA   GG   TT   AA   CC

Each variable has potentially four unique values (including NA).

unique(myd$v1)

[1] AB   AA   <NA> BB  
Levels: AA AB BB

unique(myd$v2)

[1] CC   CG   GG   <NA>
Levels: CC CG GG

These unique values can be any combination, but each consists of two letters (except NA). For example, "A" and "B" in the first case form the combinations "AA", "AB", "BB". The numerical codes for these would be 1, 0, -1 respectively. Similarly, in the second case the letters "C" and "G" form "CC", "CG", "GG", so the numerical codes would again be 1, 0, -1 respectively. Thus the above myd needs to be recoded to:

myd
     v1   v2   v3   v4   v5
1     0    1 <NA> <NA>    1
2     0    0   -1   -1    1
3     1   -1    0    0    0
4  <NA> <NA> <NA>    0 <NA>
5     1 <NA>    1 <NA>    0
6    -1 <NA>   -1   -1   -1
7     1   -1    1    0    0
8  <NA>   -1 <NA>    0    0
9     1 <NA>    0 <NA>   -1
10    1   -1   -1    1   -1

asked Sep 17 '12 15:09

fprd

2 Answers

I will post a different solution -- (skip to data.table for the superfast approach!)

If you want to recode AA, AB, BB to 1, 0, -1 etc., you can use indexing (along with the factor-to-numeric solution). This lets you use a different recoding if you wish!

self made recode function

simple_recode <- function(.x, new_codes){
  new_codes[as.numeric(.x)]
 }

as.data.frame(lapply( myd, simple_recode, new_codes = 1:-1))
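To make the indexing idea concrete, here is a minimal sketch on a single toy column. The vector `g` and the alternative coding `c(2, 1, 0)` are made up for illustration; the function is the `simple_recode` defined above.

```r
# Hypothetical toy column; factor levels sort alphabetically: AA, AB, BB
g <- factor(c("AB", "AA", NA, "BB"))

simple_recode <- function(.x, new_codes) {
  # as.numeric() on a factor returns the level index (1, 2, 3 or NA),
  # which is then used to look up the new code
  new_codes[as.numeric(.x)]
}

idx   <- as.integer(g)                           # 2 1 NA 3
codes <- simple_recode(g, new_codes = 1:-1)      # 0 1 NA -1
alt   <- simple_recode(g, new_codes = c(2, 1, 0)) # 1 2 NA 0
```

Because the lookup goes through the level index, any vector of three codes works, and NA stays NA.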

use `factor`

You can simply relabel the letters by calling factor with the new codes as labels. Note that the result is still a factor (with labels "1", "0", "-1"), not a numeric vector.

as.data.frame(lapply(myd, factor, labels = 1:-1))
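If numeric values are needed after relabelling, the factor has to be converted via its labels rather than its level indices. A small sketch, using a made-up toy column `g`:

```r
# Hypothetical toy column
g <- factor(c("AB", "AA", NA, "BB"))

relabelled <- factor(g, labels = 1:-1)

lev     <- levels(relabelled)                   # "1" "0" "-1"
indices <- as.numeric(relabelled)               # 2 1 NA 3  (level indices, not the codes)
values  <- as.numeric(as.character(relabelled)) # 0 1 NA -1 (the codes themselves)
```

Calling `as.numeric()` directly on the relabelled factor returns the level positions, so the `as.character()` step is what recovers the intended codes.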

`data.table` for efficiency

If your data is big, then I suggest a data.table approach which will be memory and time efficient.

library(data.table)
DT <- as.data.table(myd)
as.data.table(DT[, lapply(.SD, simple_recode, new_codes = 1:-1)])

Or, more efficiently

as.data.table(DT[, lapply(.SD, setattr, 'levels', 1:-1)])

Or, even more efficiently (modifying the levels in place, and avoiding the as.data.table call)

 for(name in names(DT)){
    setattr(DT[[name]],'levels',1:-1)
     }

setattr modifies by reference, so nothing is copied. Note that the columns remain factors; only their level labels change.
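If integer columns are wanted rather than relabelled factors, one possible alternative (not part of the approach above, and only a sketch) is to overwrite the columns by reference with `:=`. The toy table and the name `cols` are assumptions for illustration, and the arithmetic assumes every column has exactly three levels.

```r
library(data.table)

# Hypothetical toy table with two three-level factor columns
DT <- data.table(
  v1 = factor(c("AB", "AA", NA, "BB")),
  v2 = factor(c("CC", "CG", "GG", NA))
)

cols <- names(DT)

# Level indices 1, 2, 3 map to 1, 0, -1 via 2 - index;
# := replaces each column in place without copying the whole table
DT[, (cols) := lapply(.SD, function(x) 2L - as.integer(x)), .SDcols = cols]
```

This does more work than `setattr`, since new vectors are allocated for each column, but the result is plain integer data.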

Virtually Instantaneous approach using data.table and setattr

As demonstrated on this big dataset

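# some big data (100 columns, 1e6 rows)
big  <- replicate(100, factor(sample(c('AA','AB','BB', NA), 1e6, TRUE)), simplify = FALSE)
bigDT <- as.data.table(big)  # the unnamed list gets column names V1, V2, ...

system.time({
  # loop over the data.table's column names (the plain list `big` has none)
  for(name in names(bigDT)){
    setattr(bigDT[[name]],'levels',1:-1)
  }
})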

##  user  system elapsed 
##    0        0       0

answered Oct 19 '22 22:10

mnel

You can take advantage of the fact that your data are factors, which have numeric indices underneath them.

For example:

> as.numeric(myd$v1)
 [1]  2  2  1 NA  1  3  1 NA  1  1

The numeric values correspond to the levels() of the factor:

> levels(myd$v1)
[1] "AA" "AB" "BB"

So 1 == AA, 2 == AB, 3 == BB...and so on.

So you can simply convert your data to numeric and apply the necessary arithmetic to scale it the way you want. Here we subtract 2 and then multiply by -1 to get your results (note that sapply returns a matrix rather than a data frame):

(sapply(myd, as.numeric) - 2) * -1
#-----
      v1 v2 v3 v4 v5
 [1,]  0  1 NA NA  1
 [2,]  0  0 -1 -1  1
 [3,]  1 -1  0  0  0
 [4,] NA NA NA  0 NA
 [5,]  1 NA  1 NA  0
 [6,] -1 NA -1 -1 -1
 [7,]  1 -1  1  0  0
 [8,] NA -1 NA  0  0
 [9,]  1 NA  0 NA -1
[10,]  1 -1 -1  1 -1
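The arithmetic above relies on every column having exactly three levels. A hedged sketch of the same idea that checks this first and keeps the data frame structure (the toy data frame `myd_small` and the name `recoded` are made up for illustration):

```r
# Hypothetical toy data with two three-level factor columns
myd_small <- data.frame(
  v1 = factor(c("AB", "AA", NA, "BB")),
  v2 = factor(c("CC", "CG", "GG", NA))
)

# Guard: a column with a missing genotype would have fewer levels
# and the index arithmetic would give wrong codes
stopifnot(all(sapply(myd_small, nlevels) == 3))

recoded <- myd_small
# Assigning into recoded[] keeps the data.frame class and column names;
# 2 - index is the same as (index - 2) * -1
recoded[] <- lapply(myd_small, function(x) 2 - as.numeric(x))
```

If a column happens to lack one of its three genotypes, the check fails instead of silently producing shifted codes.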

answered Oct 20 '22 00:10

Chase
