How does R know what kind of data from a given data-frame to consider as a factor? [closed]

Tags:

r

Using the Titanic dataset available on Kaggle (https://www.kaggle.com/c/titanic/data), I am trying to find out what the data type of each column is in R. It returns a factor data type for the passenger name, sex, and ticket number, and a numeric data type for age. Why doesn't it consider the ages to be integers, or even a factor? The ages do repeat in the data set. Can't they be considered different levels?

I used the str() function to return the datatypes in R.

str(test.survived)
 $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
 $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...

str(test.survived)

Output:

'data.frame':   418 obs. of  12 variables:
 $ survived   : Factor w/ 1 level "None": 1 1 1 1 1 1 1 1 1 1 ...
 $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
 $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
 $ Name       : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
 $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
 $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
 $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
 $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
 $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
 $ Cabin      : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...

From what I understand, factors are used for data that contain duplicate values, which are then categorized into levels. Just like the ticket number and the cabin, age also has duplicates. But R doesn't consider age to be a factor and assigns it a numeric data type instead. I understand it can't be an integer type, since some of the values are floating-point. But why not a factor?

user11599101 asked Jun 04 '19 13:06


1 Answer

The type each column is read as depends on the function you use to read the data and on the arguments you pass to it.

If you used something like read.csv(), then that uses the function type.convert() to set the data type for each column. From its documentation (?type.convert):

Given a vector, the function attempts to convert it to logical, integer, numeric or complex, and failing that converts a character vector to factor unless as.is = TRUE. The first type that can accept all the non-missing values is chosen.

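The function goes through the types in that order to work out what the column should be, so a factor is only used as a last resort, when the values cannot be read as logical, integer, numeric or complex. Whether values repeat plays no part in the decision. Here, every Age value parses as a number, and because some of them (such as 34.5) are not whole numbers, integer is ruled out and the column becomes numeric. Ticket and Cabin contain values that are not numbers at all, so they fall through to factor.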
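As a minimal sketch of that order of attempts, type.convert() can be called directly on character vectors. The three vectors below are made-up stand-ins for raw CSV columns, not the actual Kaggle data, and as.is = FALSE is passed explicitly so the factor fallback is requested regardless of the R version's default.

```r
# Illustrative character vectors, as a CSV reader would first see them
# (hypothetical values, not taken from the real file)
ages    <- c("34.5", "47", "62", "27", "22")
pclass  <- c("3", "3", "2", "3", "3")
tickets <- c("110469", "110489", "PC 17599", "110469")

# as.is = FALSE requests the fall-back-to-factor behaviour explicitly
age_col    <- type.convert(ages,    as.is = FALSE)
pclass_col <- type.convert(pclass,  as.is = FALSE)
ticket_col <- type.convert(tickets, as.is = FALSE)

class(age_col)     # "numeric": 34.5 rules out integer
class(pclass_col)  # "integer": every value is a whole number
class(ticket_col)  # "factor":  "PC 17599" is not a number
```

Note that the repeated "110469" has nothing to do with the ticket column becoming a factor; the single non-numeric value is what forces it.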

More info

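Often, people don't want their character columns read in as factors. To avoid this, use stringsAsFactors = FALSE when reading in the csv. (Since R 4.0.0 this is the default, so character columns stay as character unless you ask for factors.)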
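A small sketch of the difference, reading the same inline CSV twice. The two rows are invented for illustration, and the text argument is used only so the example needs no file on disk.

```r
# Hypothetical two-row CSV with the same kinds of columns as the question
csv_text <- 'Name,Sex,Age
"Doe, Mr. John",male,34.5
"Roe, Mrs. Jane",female,47'

as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

sapply(as_factors, class)  # Name and Sex are "factor",    Age is "numeric"
sapply(as_strings, class)  # Name and Sex are "character", Age is "numeric"
```

In both cases Age is numeric: stringsAsFactors only affects columns that could not be converted to a number in the first place.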

If you want a numeric column to be a factor, you can convert it explicitly, for example:

test.survived$Age <- as.factor(test.survived$Age)
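To see what that conversion does, here is a short sketch on a standalone vector. The values are the first few ages from the str() output in the question plus one repeat added for illustration; the variable names are hypothetical.

```r
# A few ages from the question's output, with 47 repeated on purpose
age <- c(34.5, 47, 62, 27, 22, 47)

age_factor <- as.factor(age)

levels(age_factor)   # "22" "27" "34.5" "47" "62" (sorted unique values, as text)
nlevels(age_factor)  # 5: the repeated 47 shares one level

# Pitfall: as.numeric() on a factor returns the level codes, not the ages
codes <- as.numeric(age_factor)                    # 3 4 5 2 1 4

# Going through character recovers the original numbers
restored <- as.numeric(as.character(age_factor))   # 34.5 47 62 27 22 47
```

Once Age is a factor, arithmetic such as mean() no longer works on it directly, which is one reason the numeric default is usually what you want for a column like this.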

Jaccar answered Oct 22 '22 11:10