
How to convert from category to numeric in R

Tags: r

Here is my problem:

I have a table with categories and I want to rank them:

category
dog
cat
fish
dog
dog

What I want is to add a column and to rank them:

category   rank
dog        1
cat        2
fish       3
dog        1
dog        1

  • Sorry for the terrible table (help with writing proper tables on Stack Overflow would be great, too)
  • Any ideas about how to add the rank column?

Thanks!

Oshrat asked Dec 26 '13 09:12



2 Answers

Just for the sake of completeness and because the solution I posted in a comment is an inefficient (and pretty ugly) fix, I'll post an answer too.

It turned out that the OP's starting point was something like the following:

x = c("cat", "dog", "fish", "dog", "dog", "cat", "fish", "catfish")
x = factor(x)

In the end, what was wanted was a manually specified numeric coding of x. As an example, let's suppose that the following mapping is wanted:

cat -> 1, dog -> 2, fish -> 3, catfish -> 4

So, some alternatives:

sapply(as.character(x), switch,
       "cat" = 1, "dog" = 2, "fish" = 3, "catfish" = 4,
       USE.NAMES = FALSE)
#[1] 1 2 3 2 2 1 3 4

# match() coerces a `factor` to `character` internally ('do_match' calls
# 'match_transform'; see http://svn.r-project.org/R/trunk/src/main/unique.c),
# so there is no need for 'as.character(x)' here
match(x, c("cat", "dog", "fish", "catfish"))
#[1] 1 2 3 2 2 1 3 4

local({    # local() so that the global 'x' is not changed
  levels(x) = list("cat" = 1, "dog" = 2, "fish" = 3, "catfish" = 4)
  as.numeric(x)
})
#[1] 1 2 3 2 2 1 3 4

library(fastmatch)
fmatch(x, c("cat", "dog", "fish", "catfish"))  #a faster alternative to 'match'
#[1] 1 2 3 2 2 1 3 4
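
Two further base R variants, not part of the original list, may read more directly. This is only a sketch under the same assumed mapping (cat -> 1, dog -> 2, fish -> 3, catfish -> 4); the names `lv`, `lookup`, `codes_a` and `codes_b` are made up for illustration. The first rebuilds the factor with the levels in the wanted order and takes its integer codes; the second indexes a named lookup vector by label.

```r
# Same example vector as above
x <- factor(c("cat", "dog", "fish", "dog", "dog", "cat", "fish", "catfish"))

# Wanted order of the categories: the position in 'lv' is the numeric code
lv <- c("cat", "dog", "fish", "catfish")

# Variant A: re-create the factor with explicit levels, then take the codes
codes_a <- as.integer(factor(x, levels = lv))

# Variant B: named lookup vector, indexed by the labels of 'x'
lookup <- c(cat = 1L, dog = 2L, fish = 3L, catfish = 4L)
codes_b <- unname(lookup[as.character(x)])

print(codes_a)
# [1] 1 2 3 2 2 1 3 4
print(codes_b)
# [1] 1 2 3 2 2 1 3 4
```

Variant B also works when the codes are not simply 1..n (for example 10, 20, 30), since the values of `lookup` can be anything.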

And a benchmark on a larger vector:

X = rep(as.character(x), 1e5)
X = factor(X)
f1 = function() sapply(as.character(X), switch,
                       "cat" = 1, "dog" = 2, "fish" = 3, "catfish" = 4,
                       USE.NAMES = FALSE)
f2 = function() match(X, c("cat", "dog", "fish", "catfish"))
f3 = function() {
  levels(X) = list("cat" = 1, "dog" = 2, "fish" = 3, "catfish" = 4)
  as.numeric(X)
}
library(fastmatch)
f4 = function() fmatch(X, c("cat", "dog", "fish", "catfish"))

library(microbenchmark)
microbenchmark(f1(), f2(), f3(), f4(), times = 10)
#Unit: milliseconds
# expr         min          lq      median         uq       max neval
# f1() 1745.111666 1816.675337 1961.809102 2107.98236 2896.0291    10
# f2()   22.043657   22.786647   23.987263   31.45057  111.9600    10
# f3()   32.704779   32.919150   38.865853   47.67281  134.2988    10
# f4()    8.814958    8.823309    9.856188   19.66435  104.2827    10
sum(f1() != f2())
#[1] 0
sum(f2() != f3())
#[1] 0
sum(f3() != f4())
#[1] 0
alexis_laz answered Sep 27 '22 16:09


I assume that if you write "ranks" you mean ranks. I further assume you want to rank according to the number of occurrences.

cats <- factor(c("dog", "cat", "fish", "dog", "dog"))

#see help("rank") for other possibilities to break ties
ranks <- rank(-table(cats), ties.method="first")

DF <- data.frame(category=cats, rank=ranks[as.character(cats)])

print(DF)
#   category rank
# 1      dog    1
# 2      cat    2
# 3     fish    3
# 4      dog    1
# 5      dog    1
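
If the numbering is instead meant to follow the order in which each category first appears (an alternative reading of the question, which happens to give the same table for this data), a minimal sketch could look like this. The names `first_seen` and `DF2` are illustrative only.

```r
cats <- factor(c("dog", "cat", "fish", "dog", "dog"))

# unique() keeps values in order of first appearance, so match() returns
# 1 for the first category seen, 2 for the second, and so on
labels <- as.character(cats)
first_seen <- match(labels, unique(labels))
# first_seen is 1 2 3 1 1

DF2 <- data.frame(category = cats, rank = first_seen)
print(DF2)
```

Unlike the frequency-based ranking, this does not depend on how often a category occurs, so the two approaches will differ on data where the most frequent category is not the first one seen.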
Roland answered Sep 27 '22 18:09