
Why does str() show incorrect info for factor levels after subsetting a data frame in R?

Tags: dataframe, r

I have the following data frame in R with 274569 rows and 15 columns:

> str(x2)
'data.frame':   274569 obs. of  15 variables:
 $ ykod : int  99 99 99 99 99 99 99 99 99 99 ...
 $ yad  : Factor w/ 43 levels "BAKUGAN","BARBIE",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ per  : Factor w/ 3 levels "2 AYLIK","3 AYLIK",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ donem: int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ sayi : int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ mkod : int  359 361 362 363 366 847 849 850 1505 1506 ...
 $ mad  : Factor w/ 11045 levels "    Hilal Gida           ",..: 5163 3833 10840 8284 10839 2633 10758 10293 6986 6984 ...
 $ mtip : Factor w/ 30 levels "Abone Bürosu                                      ",..: 20 20 20 20 20 2 2 2 11 11 ...
 $ kanal: Factor w/ 2 levels "OB","SS": 2 2 2 2 2 2 2 2 1 1 ...
 $ bkod : int  110006 110006 110006 110006 110006 110006 110006 110006 110006 110006 ...
 $ bad  : Factor w/ 213 levels "4. Levent","500 Evler",..: 25 25 25 25 25 25 25 25 25 25 ...
 $ bolge: Factor w/ 12 levels "Adana Şehiriçi",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ sevk : int  5 2 2 2 10 0 4 3 13 32 ...
 $ iade : int  0 2 1 2 4 0 3 2 0 8 ...
 $ satis: int  5 0 1 0 6 0 1 1 13 24 ...

I create a subset of the data frame and display its structure:

> msub <- x2[x2$ykod == 99,]
> str(msub)
'data.frame':   14367 obs. of  15 variables:
 $ ykod : int  99 99 99 99 99 99 99 99 99 99 ...
 $ yad  : Factor w/ 43 levels "BAKUGAN","BARBIE",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ per  : Factor w/ 3 levels "2 AYLIK","3 AYLIK",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ donem: int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ sayi : int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ mkod : int  359 361 362 363 366 847 849 850 1505 1506 ...
 $ mad  : Factor w/ 11045 levels "    Hilal Gida           ",..: 5163 3833 10840 8284 10839 2633 10758 10293 6986 6984 ...
 $ mtip : Factor w/ 30 levels "Abone Bürosu                                      ",..: 20 20 20 20 20 2 2 2 11 11 ...
 $ kanal: Factor w/ 2 levels "OB","SS": 2 2 2 2 2 2 2 2 1 1 ...
 $ bkod : int  110006 110006 110006 110006 110006 110006 110006 110006 110006 110006 ...
 $ bad  : Factor w/ 213 levels "4. Levent","500 Evler",..: 25 25 25 25 25 25 25 25 25 25 ...
 $ bolge: Factor w/ 12 levels "Adana Şehiriçi",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ sevk : int  5 2 2 2 10 0 4 3 13 32 ...
 $ iade : int  0 2 1 2 4 0 3 2 0 8 ...
 $ satis: int  5 0 1 0 6 0 1 1 13 24 ...

Now I have a subset with 14367 rows and 15 columns, but str() still reports all of the original factor levels. I expected the level counts to shrink: for example, yad should have only one level.

How can I easily make str(msub) report only the factor levels that are actually present?

Mehper C. Palavuzlar asked Nov 30 '22 07:11



2 Answers

This is expected behavior. Factor levels that have no representation in your subset do not "disappear" until you tell them to. Since R 2.12.0, you can use droplevels() to remove them.
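A minimal sketch of how droplevels() could be applied to a subset. The small data frame here is made-up toy data that only borrows a few column names from the question; it is not the asker's real x2.

```r
# Hypothetical toy data standing in for the much larger x2
x2 <- data.frame(
  ykod = c(99L, 99L, 12L, 12L),
  yad  = factor(c("BARBIE", "BARBIE", "BAKUGAN", "BAKUGAN")),
  sevk = c(5L, 2L, 7L, 1L)
)

msub <- x2[x2$ykod == 99, ]
nlevels(msub$yad)         # 2: the unused level "BAKUGAN" is still attached

msub <- droplevels(msub)  # drops unused levels from every factor column
nlevels(msub$yad)         # 1
str(msub)                 # yad is now reported with a single level
```

Called on a data frame, droplevels() leaves non-factor columns untouched, so it can be applied to the whole subset in one step.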

Roman Luštrik answered Dec 04 '22 03:12



In fact, str() is showing you the correct structural information: a factor's levels are the values it is allowed to take, not just the values that currently occur in it. Imagine concatenating two of your subsets, where one contained some of the levels and the other a different set: merging them would be something of a hassle if each had lost its unused levels. This is simply how factors work in R.

If you want to know which levels are actually present in your data, one option is to use table() to count the occurrences.
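As a rough illustration of the table() approach, the sketch below counts occurrences per level and keeps the names of levels with a non-zero count. The factor f and the extra level "XX" are invented for the example.

```r
# Hypothetical factor: "XX" is a declared level that never occurs
f <- factor(c("OB", "SS", "SS"), levels = c("OB", "SS", "XX"))

counts <- table(f)   # one count per level, including unused ones
counts
# f
# OB SS XX 
#  1  2  0 

# Names of the levels that actually occur in the data
present <- names(counts)[as.vector(counts) > 0]
present
# [1] "OB" "SS"
```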

If you want your factor reduced, so it only contains the levels that are actually present, you can reapply factor to it:

myfact <- factor(rep(1:2, 5), levels = 1:3, labels = letters[1:3])
myfact
#  [1] a b a b a b a b a b
# Levels: a b c
factor(myfact)
#  [1] a b a b a b a b a b
# Levels: a b

You can apply this to every factor column of your data.frame to drop the unused levels throughout.
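One possible way to do that is sketched below. The helper name drop_unused() and the toy data frame are assumptions made for illustration; only factor(), lapply() and vapply() are base R functions.

```r
# drop_unused() is a hypothetical helper name, not a base R function
drop_unused <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))  # which columns are factors
  df[is_fac] <- lapply(df[is_fac], factor)     # rebuild each from its observed values
  df
}

# Hypothetical toy data borrowing column names from the question
toy <- data.frame(
  kanal = factor(c("OB", "SS", "SS", "OB")),
  per   = factor(c("2 AYLIK", "3 AYLIK", "3 AYLIK", "2 AYLIK")),
  sevk  = c(5L, 2L, 10L, 0L)
)

sub <- drop_unused(toy[toy$kanal == "SS", ])
sapply(sub[c("kanal", "per")], nlevels)  # both factors now have one level
```

Reapplying factor() keeps the values and the row order as they are; only the set of declared levels shrinks.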

Nick Sabbe answered Dec 04 '22 02:12
