Considering the Titanic dataset available on Kaggle (https://www.kaggle.com/c/titanic/data), I am trying to find out the data type of each column in R. It returns a factor data type for the passenger name, sex and ticket number, and a numeric data type for age. Why doesn't it consider the ages to be integers, or even a factor? The ages do repeat in the data set. Can't they be considered different levels?
I used the str() function to return the data types in R.
str(test.survived)
$ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
$ Ticket : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
str(test.survived)
Output:
'data.frame': 418 obs. of 12 variables:
$ survived : Factor w/ 1 level "None": 1 1 1 1 1 1 1 1 1 1 ...
$ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
$ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
$ Name : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
$ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
$ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
$ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
$ Ticket : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
$ Fare : num 7.83 7 9.69 8.66 12.29 ...
$ Cabin : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
From what I understand, factors are used for data that have duplicate values, categorizing them into levels. Like the ticket number and the cabin, age also has duplicates. But R doesn't consider age to be a factor and assigns it a numeric data type. I understand it can't be an integer type since there are some fractional values in there. But why not a factor?
There are several ways to check a data type in R: the typeof() function, the class() function, and the str() function, which reports the type of every column of a data frame at once.
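As a rough illustration of how these three functions differ, here is a minimal sketch. The data frame `df` and its values are made up for the example and only mimic a few Titanic columns.

```r
# Hypothetical three-row data frame standing in for the Titanic data
df <- data.frame(
  Age    = c(34.5, 47, 62),
  Pclass = c(3L, 3L, 2L),
  Sex    = factor(c("male", "female", "male"))
)

class(df$Age)     # "numeric"
typeof(df$Age)    # "double"
class(df$Pclass)  # "integer"
class(df$Sex)     # "factor"
typeof(df$Sex)    # "integer" (a factor is stored as integer codes plus levels)

str(df)            # one line per column, as in the question
sapply(df, class)  # class of every column as a named vector
```

Note that class() reports how R treats the column, while typeof() reports how it is stored, which is why a factor shows up as "integer" under typeof().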
is.factor() function: the is.factor() function in R checks whether the object passed to it is a factor.
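A short sketch of that check, using two made-up vectors (`ages` and `sex` are illustrative names, not columns read from the Kaggle file):

```r
# Illustrative vectors, not the real data
ages <- c(34.5, 47, 62, 27, 22)
sex  <- factor(c("male", "female", "male", "male", "female"))

is.factor(ages)   # FALSE
is.factor(sex)    # TRUE
is.numeric(ages)  # TRUE
is.numeric(sex)   # FALSE, even though a factor is stored as integer codes
```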
In R, factors are used to work with categorical variables, that is, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.
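To make the point about ordering concrete, here is a small sketch. The vector `embarked` is made up, and the level order "S", "C", "Q" is an arbitrary choice for the example.

```r
# Made-up embarkation codes
embarked <- c("Q", "S", "Q", "S", "C")

# By default the levels are the sorted unique values
f_default <- factor(embarked)
levels(f_default)   # "C" "Q" "S"

# Supplying levels fixes both the allowed values and their order
f_custom <- factor(embarked, levels = c("S", "C", "Q"))
levels(f_custom)    # "S" "C" "Q"

table(f_custom)     # counts are reported in the chosen level order
```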
A data frame is the most common way of storing data in R and, generally, is the data structure most often used for data analyses. Under the hood, a data frame is a list of equal-length vectors. Each element of the list can be thought of as a column and the length of each element of the list is the number of rows.
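The "list of equal-length vectors" description can be checked directly. This is a minimal sketch on a made-up data frame (`df` and its values are assumptions for the example):

```r
# Hypothetical data frame with two columns and three rows
df <- data.frame(Age = c(34.5, 47, 62), Pclass = c(3L, 3L, 2L))

is.list(df)    # TRUE
typeof(df)     # "list"
length(df)     # 2, the number of columns (list elements)
nrow(df)       # 3, the length of each column
lengths(df)    # every column has the same length

df[["Age"]]    # list-style extraction returns the column vector
```

Because each column is its own vector, each column can have its own type, which is why str() reports a type per column.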
What the data is read as will depend on the function you use to do so as well as any arguments you specify.
If you used something like read.csv(), then that uses the function type.convert() to set the data type for each column. From the notes:
Given a vector, the function attempts to convert it to logical, integer, numeric or complex, and failing that converts a character vector to factor unless as.is = TRUE. The first type that can accept all the non-missing values is chosen.
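The function goes through the types in that order to work out what the column should be, so factor is only used when no logical or numeric type fits. Every non-missing Age value can be parsed as a number (some with a fractional part, such as 34.5), so the column is numeric rather than integer or factor. Ticket and Cabin contain values that cannot be parsed as numbers, so they fall through to factor.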
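The order of attempts can be seen by calling type.convert() directly on character vectors. This is only a sketch: the input strings are made up (the value "PC 1" is a hypothetical non-numeric ticket), and as.is is passed explicitly to show both behaviours.

```r
# All values parse as whole numbers -> integer
class(type.convert(c("892", "893", "894"), as.is = FALSE))   # "integer"

# At least one value has a fractional part -> numeric
class(type.convert(c("34.5", "47", "62"), as.is = FALSE))    # "numeric"

# One value is not a number at all -> falls through to factor
class(type.convert(c("110469", "PC 1"), as.is = FALSE))      # "factor"

# Same input with as.is = TRUE stays character instead of becoming a factor
class(type.convert(c("110469", "PC 1"), as.is = TRUE))       # "character"

# Missing values do not prevent a numeric result
class(type.convert(c("34.5", NA, "62"), as.is = FALSE))      # "numeric"
```

Repeated values play no part in the decision; only whether every non-missing value fits the candidate type.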
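Often, people don't want their character columns read in as factors. To avoid this, use stringsAsFactors = FALSE when reading in the csv. (Since R 4.0.0 this is the default, so factors like the ones above only appear if stringsAsFactors = TRUE is set.)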
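A small sketch of the difference, reading a made-up CSV from a string via the text argument of read.csv() rather than from the Kaggle file (the rows and the name `csv_text` are assumptions for the example):

```r
# Hypothetical CSV content; "PC 1" is a made-up non-numeric ticket
csv_text <- "Age,Ticket\n34.5,110469\n47,PC 1\n62,110489"

with_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(with_factors$Age)     # "numeric"
class(with_factors$Ticket)  # "factor"

with_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(with_strings$Age)     # "numeric"
class(with_strings$Ticket)  # "character"
```

Age is numeric in both cases because stringsAsFactors only affects columns that end up as character.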
If you want your numeric column to be a factor, you can convert it yourself, for example:
test.survived$Age <- as.factor(test.survived$Age)
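One thing to keep in mind with that conversion is what it does to the values. The sketch below uses a made-up vector `age` instead of the real column:

```r
# Made-up ages with one repeated value
age <- c(34.5, 47, 62, 47)

age_f <- as.factor(age)
levels(age_f)    # "34.5" "47" "62"
nlevels(age_f)   # 3, one level per distinct age

# The factor stores level codes, not the ages themselves
as.integer(age_f)                # 1 2 3 2

# To get the original numbers back, go through character first
as.numeric(as.character(age_f))  # 34.5 47.0 62.0 47.0
```

Once Age is a factor it is treated as a set of labels, so arithmetic summaries such as mean() no longer work on it directly. That is a practical reason to leave it numeric unless you really want age categories.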