If I have a file with many columns, and the data are all numbers, how can I know whether a specific column holds categorical or quantitative data? Is there an area of study for this kind of problem? If not, what are some heuristics that can be used to decide?
Some heuristics that I can think of:
some_threshold, there is a higher chance of it being categorical data.
I am using R, but the question doesn't need to be R-specific.
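One possible reading of the threshold heuristic can be sketched in R as follows. This assumes the heuristic compares the number of distinct values in a column against `some_threshold`. The helper name `guess.categorical`, the default threshold of 10, and the toy data are all made up for illustration. The result is only a hint and not a reliable classification.

```r
# Hypothetical helper: flag columns with few distinct values as "possibly categorical".
# The default some_threshold = 10 is an arbitrary assumption, not a recommended value.
guess.categorical <- function(df, some_threshold = 10) {
  vapply(
    df,
    function(x) length(unique(x[!is.na(x)])) < some_threshold,
    logical(1)
  )
}

# Toy data: 'group' holds numeric codes 1 to 3, 'height' is a continuous measurement
toy <- data.frame(
  group  = sample(1:3, 100, replace = TRUE),
  height = rnorm(100, mean = 170, sd = 10)
)

guess.categorical(toy)
```

A small quantitative variable (for example, number of children) would also be flagged, so the output still has to be checked against what the column actually means.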
This assumes someone coded the data correctly.
Perhaps you are suggesting the data were not coded or labeled correctly, that it was all entered as numeric and some of it really is categorical. In that case, I do not know how one could tell with any certainty. Categorical data can have decimal places and can be negative.
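A small illustration of that last point, using made-up codes (the values and their meaning are hypothetical): a vector can be stored as numeric, contain negative and decimal values, and still be nothing more than a set of category labels.

```r
# Hypothetical category codes that happen to be negative or have decimals
codes <- c(-1, 1.5, 2.5, 1.5, -1)

is.numeric(codes)   # TRUE: R stores them as numbers

# Treating them as categories works just as well
f <- factor(codes)
levels(f)           # "-1" "1.5" "2.5"
table(f)            # counts per category
```

Nothing in the values themselves says which treatment is the right one.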
The question I would ask myself in such a situation is what difference does it make how I treat the data?
If you are interested in the second scenario, perhaps you should ask your question on Stack Exchange.
my.data <- read.table(text = '
aa bb cc dd
10 100 1000 1
20 200 2000 2
30 300 3000 3
40 400 4000 4
50 500 5000 5
60 600 6000 6
', header = TRUE, colClasses = c('numeric', 'character', 'numeric', 'character'))
my.data
# one way
str(my.data)
'data.frame': 6 obs. of 4 variables:
$ aa: num 10 20 30 40 50 60
$ bb: chr "100" "200" "300" "400" ...
$ cc: num 1000 2000 3000 4000 5000 6000
$ dd: chr "1" "2" "3" "4" ...
Here is a way to record the information:
my.class <- rep('empty', ncol(my.data))
for (i in seq_len(ncol(my.data))) {
  my.class[i] <- class(my.data[, i])
}
> my.class
[1] "numeric" "character" "numeric" "character"
EDIT
Here is a way to record the class of each column without using a for-loop:
my.class <- sapply(my.data, class)
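As a possible follow-up, the recorded classes can be used to convert the columns read as character into factors. This is only a sketch under the assumption that the data were coded correctly on input. The name `categorical.cols` is introduced here for illustration, and `vapply` with `class(x)[1]` is used in place of `sapply` because `class()` can return more than one value for some objects.

```r
# Shortened copy of the example data so the snippet runs on its own
my.data <- read.table(text = '
aa bb cc dd
10 100 1000 1
20 200 2000 2
30 300 3000 3
', header = TRUE, colClasses = c('numeric', 'character', 'numeric', 'character'))

# Record the first class of each column; always returns one string per column
my.class <- vapply(my.data, function(x) class(x)[1], character(1))

# Columns read as character (or already factor) are treated as categorical
categorical.cols <- names(my.class)[my.class %in% c('character', 'factor')]
my.data[categorical.cols] <- lapply(my.data[categorical.cols], factor)

str(my.data)
```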