Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

"NAs introduced by coercion" during Cluster Analysis in R

Guys I'm new to this language ,I'm running cluster analysis on a data frame but when I calculate the distance I get this warning "NAs introduced by coercion". What does this mean?

d <- dist(as.matrix(mydata1))

  Warning message:
In dist(as.matrix(mydata1)) : NAs introduced by coercion

My data sample is

Metafamily     Total         July cpc      July cse_pla    July offline   July organic  
xerox 8560     275.829417    0.20943223    0.032628862     0.169210813    0.1130048 
office-supplie  246.9125664  0.057833047   0.020209909     0.535358617    0.136165617

In this apart from Metafamily column all columns are numeric in class.

Guys please help me out from this issue.

like image 791
Ravee Avatar asked Oct 08 '13 10:10

Ravee


Video Answer


1 Answers

It's that first column that creates the issue:

> a <- c("1", "2",letters[1:5], "3")
> as.numeric(a)
[1]  1  2 NA NA NA NA NA  3
Warning message:
NAs introduced by coercion 

Inside dist there must be a coercion to numeric, which generates the NA as above.

I'd suggestion to apply dist without the first column or better move that to rownames if possible, because the result will be different:

> dist(df)
          1         2         3         4
2 1.8842186                              
3 1.9262360 1.2856110                    
4 3.2137871 1.7322788 2.9838920          
5 1.3299455 0.9872963 1.9158079 1.8889050
Warning message:
In dist(df) : NAs introduced by coercion
> dist(df[-1])
         1        2        3        4
2 1.538458                           
3 1.572765 1.049697                  
4 2.624046 1.414400 2.436338         
5 1.085896 0.806124 1.564251 1.542284

btw: you don't need as.matrix when calling dist. It'll do that anyway internally.

EDIT: using rownames

rownames(df) <- df$id

> df
  id       var1       var2
A  A -0.6264538 -0.8204684
B  B  0.1836433  0.4874291
C  C -0.8356286  0.7383247
D  D  1.5952808  0.5757814
E  E  0.3295078 -0.3053884

> dist(df[-1]) # you colud also remove the 1st col at all, using df$id <- NULL.
         A        B        C        D
B 1.538458                           
C 1.572765 1.049697                  
D 2.624046 1.414400 2.436338         
E 1.085896 0.806124 1.564251 1.542284
like image 196
Michele Avatar answered Nov 07 '22 08:11

Michele