How does R know what kind of data from a given data-frame to consider as a factor? [closed]

Tags:

r

Using the Titanic dataset available on Kaggle (https://www.kaggle.com/c/titanic/data), I am trying to find out the data type of each column in R. It returns a factor data type for the passenger name, sex, and ticket number, and a numeric data type for age. Why doesn't it treat the ages as an integer or even a factor? The ages do repeat in the data set. Can't they be considered different levels?

I used the str() function to return the datatypes in R.

str(test.survived)
 $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
 $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...

str(test.survived)

Output:

'data.frame':   418 obs. of  12 variables:
 $ survived   : Factor w/ 1 level "None": 1 1 1 1 1 1 1 1 1 1 ...
 $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
 $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
 $ Name       : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
 $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
 $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
 $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
 $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
 $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
 $ Cabin      : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...

From what I understand, factors are used for data that contain duplicate values, which are then categorized into levels. Just like the ticket number and the cabin type, age also has duplicates. But R doesn't treat age as a factor and assigns it a numeric data type instead. I understand it can't be an integer type since there are some floating-point values in there. But why not a factor?

asked Jun 04 '19 13:06

user11599101

1 Answer

The type each column is read as depends on the function you use to read the data, as well as any arguments you specify.

If you used something like read.csv(), then that uses the function type.convert() to set the data type for each column. From its documentation:

Given a vector, the function attempts to convert it to logical, integer, numeric or complex, and failing that converts a character vector to factor unless as.is = TRUE. The first type that can accept all the non-missing values is chosen.

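The function tries the types in that order to work out what the column should be, so a factor is only used when none of the numeric types can hold every value. Here, Age contains values such as 34.5, so it cannot be an integer, but every value is a valid number, so the column is read as numeric. Ticket and Cabin contain text that cannot be parsed as numbers, so they fall through to factor. Whether values repeat plays no part in the decision.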
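The order of attempts can be checked directly by calling type.convert() on small character vectors. This is only a sketch: the vectors below are made-up stand-ins for CSV columns, not values read from the actual Kaggle file, and as.is = FALSE is passed explicitly so that text becomes a factor regardless of R version.

```r
# Made-up character vectors standing in for columns as they arrive from a CSV.
pclass <- c("3", "3", "2")                     # whole numbers only
age    <- c("34.5", "47", "62")                # one fractional value
ticket <- c("110469", "PC 17599", "110469")    # one entry that is not a number

# as.is = FALSE asks type.convert() to turn leftover text into a factor.
pclass_conv <- type.convert(pclass, as.is = FALSE)
age_conv    <- type.convert(age,    as.is = FALSE)
ticket_conv <- type.convert(ticket, as.is = FALSE)

class(pclass_conv)  # "integer": integer accepts every value
class(age_conv)     # "numeric": integer fails on 34.5, numeric accepts all
class(ticket_conv)  # "factor":  no numeric type fits, so it falls through
```

Note that ticket has a repeated value and age does not, yet repetition is not what separates them; only whether the text parses as a number matters.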

More info

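Often, people don't want their character columns read in as factors. To avoid this, pass stringsAsFactors = FALSE when reading in the CSV. (Since R 4.0.0 this is the default, so character columns stay as character unless you ask for factors.)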
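A small illustration of the effect of stringsAsFactors, assuming a tiny inline CSV rather than the real Titanic file (the rows and the csv_text name are invented for the example). The argument is set explicitly in both calls so the result does not depend on the default of the installed R version.

```r
# Invented three-column CSV supplied as a string via the text argument.
csv_text <- "Name,Sex,Age\nKelly,male,34.5\nWilkes,female,47\n"

as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factors$Sex)  # "factor"
class(as_strings$Sex)  # "character"

# Numeric columns are unaffected by the setting.
class(as_factors$Age)  # "numeric"
class(as_strings$Age)  # "numeric"
```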

If you want your numeric column to be factors, then you can use

test.survived$Age <- as.factor(test.survived$Age)

for example.
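Before converting, it may help to see what is given up. The sketch below uses a short made-up age vector (not the real column) to show that a factor stores level codes, so numeric operations need a conversion back through character first.

```r
# Made-up ages; 47 appears twice.
age   <- c(34.5, 47, 62, 47)
age_f <- as.factor(age)

levels(age_f)   # "34.5" "47" "62": one level per distinct value
nlevels(age_f)  # 3

# as.numeric() on a factor returns the internal level codes, not the ages.
codes <- as.numeric(age_f)                   # 1 2 3 2

# Going through as.character() recovers the original numbers.
restored <- as.numeric(as.character(age_f))  # 34.5 47 62 47
```

This is one reason a numeric default is usually the more convenient choice for a column like Age: means, ranges, and comparisons work directly, and a factor can still be created later if the analysis calls for it.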

answered Oct 22 '22 11:10

Jaccar
