How to determine whether a column holds quantitative or categorical data?

Question

If I have a file with many columns and the data are all numbers, how can I tell whether a specific column is categorical or quantitative? Is there an area of study for this kind of problem? If not, what heuristics can be used to decide?

Some heuristics that I can think of:

Likely to be categorical data

if the number of unique values is below some threshold, the column is more likely to be categorical
if the data are highly concentrated (low standard deviation)
if the unique values are sequential and start from 1
if all the values in the column have a fixed length (may be an ID or a date)
if it has a very small p-value in a test against Benford's Law
if it has a very small p-value in a chi-square test against the result column

Likely to be quantitative data

if the column contains floating-point numbers
if the column has sparse values
if the column has negative values

Other

Maybe quantitative columns are more likely to sit next to other quantitative columns (and vice versa).
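A few of the heuristics above can be computed directly. The following is only a rough sketch: `column_hints`, the `max_unique` threshold of 10, and the `example` data frame are all made up for illustration, and the resulting guess is a hint, not a reliable classification.

```r
# Hypothetical helper: per-column hints for an all-numeric data frame.
# The max_unique threshold is an arbitrary assumption.
column_hints <- function(df, max_unique = 10) {
  hints <- data.frame(
    column       = names(df),
    # number of distinct values in the column
    n_unique     = vapply(df, function(x) length(unique(x)), integer(1)),
    # TRUE if any value has a fractional part
    has_decimal  = vapply(df, function(x) any(x != round(x), na.rm = TRUE), logical(1)),
    # TRUE if any value is below zero
    has_negative = vapply(df, function(x) any(x < 0, na.rm = TRUE), logical(1)),
    # TRUE if the distinct values are exactly 1, 2, ..., k
    sequential_from_1 = vapply(df, function(x) {
      u <- sort(unique(x))
      length(u) > 0 && identical(as.numeric(u), as.numeric(seq_along(u)))
    }, logical(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
  # Crude guess: few distinct values and no decimals suggests categorical
  hints$guess <- ifelse(hints$n_unique <= max_unique & !hints$has_decimal,
                        "maybe categorical", "maybe quantitative")
  hints
}

# Made-up example data (deterministic, no random numbers)
example <- data.frame(
  group  = rep(1:3, length.out = 50),
  height = seq(150.5, 199.5, length.out = 50),
  change = seq(-2.45, 2.45, length.out = 50)
)

hints <- column_hints(example)
hints
```

Here `group` is flagged as possibly categorical because it has three distinct integer values starting at 1, while `height` and `change` are flagged as possibly quantitative. A column such as a rating from 1 to 5 would be flagged the same way as `group`, so these hints still need human judgment.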

I am using R, but the question doesn't need to be R specific.

Mark Miller · Accepted Answer

This assumes someone coded the data correctly.

Perhaps you are suggesting the data were not coded or labeled correctly: that everything was entered as numeric and some of it really is categorical. In that case, I do not know how one could tell with any certainty. Categorical data can have decimal places and can be negative.

The question I would ask myself in such a situation is: what difference does it make how I treat the data?
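As one illustration of the difference it can make, here is a small sketch with made-up data (the `toy` data frame and its values are assumptions, not from the question). The same column `x` is fitted once as a number and once as a factor:

```r
# Made-up data: x takes the values 1, 2, 3; y does not change linearly with x
toy <- data.frame(
  x = rep(1:3, each = 4),
  y = c(1.0, 1.2, 0.9, 1.1,
        5.0, 5.1, 4.9, 5.2,
        2.0, 2.1, 1.9, 2.2)
)

# Treated as quantitative: a single slope, assuming a linear trend in x
fit_numeric <- lm(y ~ x, data = toy)

# Treated as categorical: a separate coefficient for each level after the first
fit_factor <- lm(y ~ factor(x), data = toy)

length(coef(fit_numeric))  # 2: intercept and slope
length(coef(fit_factor))   # 3: intercept and two level contrasts
```

If both treatments lead to the same conclusions for the analysis at hand, the distinction may not matter much. If they differ, that is a reason to find out how the column was really coded.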

If you are interested in the second scenario, perhaps you should ask your question on Stack Exchange.

my.data <- read.table(text = '
    aa     bb      cc     dd
    10    100    1000      1
    20    200    2000      2
    30    300    3000      3
    40    400    4000      4
    50    500    5000      5
    60    600    6000      6
', header = TRUE, colClasses = c('numeric', 'character', 'numeric', 'character'))

my.data

# one way
str(my.data)

'data.frame':   6 obs. of  4 variables:
 $ aa: num  10 20 30 40 50 60
 $ bb: chr  "100" "200" "300" "400" ...
 $ cc: num  1000 2000 3000 4000 5000 6000
 $ dd: chr  "1" "2" "3" "4" ...

Here is a way to record the information:

my.class <- rep('empty', ncol(my.data))

# seq_len() is safe even if the data frame has zero columns
for(i in seq_len(ncol(my.data))) {
    my.class[i] <- class(my.data[[i]])
}

> my.class
[1] "numeric"   "character" "numeric"   "character"

EDIT

Here is a way to record the class of each column without using a for-loop:

my.class <- sapply(my.data, class)
