
How to determine whether a column holds quantitative or categorical data?

If I have a file with many columns, all of them containing numbers, how can I tell whether a specific column is categorical or quantitative? Is there an area of study for this kind of problem? If not, what heuristics can be used to decide?

Some heuristics that I can think of:

Likely to be categorical data

  • summarize the unique values: if their count is below some threshold, the column is more likely to be categorical
  • if the data are highly concentrated (low standard deviation)
  • if the unique values are highly sequential and start from 1
  • if all the values in the column have a fixed length (may be an ID or a date)
  • if it has a very small p-value in a test against Benford's Law
  • if it has a very small p-value in a chi-square test against the result column
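A few of the heuristics above can be sketched in R. This is only a rough illustration: the helper names `categorical_signals` and `benford_p_value`, and the cut-off of 10 distinct values, are assumptions made here and not established rules.

```r
# Hypothetical helper: simple "looks categorical" signals for a numeric vector.
# max_unique = 10 is an arbitrary cut-off chosen for illustration.
categorical_signals <- function(x, max_unique = 10) {
    x <- x[!is.na(x)]
    u <- sort(unique(x))
    whole <- length(x) > 0 && all(x == round(x))

    c(
        # few distinct values
        few_unique = length(u) <= max_unique,
        # distinct values are exactly 1, 2, ..., k
        sequential_from_1 = length(u) > 0 && all(u == seq_along(u)),
        # whole numbers that all print with the same number of digits
        fixed_width = whole &&
            length(unique(nchar(sprintf("%.0f", abs(x))))) == 1
    )
}

# Hypothetical helper: p-value of a chi-square goodness-of-fit test of the
# first significant digits against Benford's Law.
benford_p_value <- function(x) {
    x <- abs(x[!is.na(x) & x != 0])
    # first significant digit, read from scientific notation such as "3.1e+02"
    first_digit <- as.integer(substr(sprintf("%e", x), 1, 1))
    observed <- tabulate(first_digit, nbins = 9)
    expected <- log10(1 + 1 / (1:9))
    chisq.test(observed, p = expected)$p.value
}

categorical_signals(c(1, 2, 3, 1, 2, 3, 2))
benford_p_value(2^(1:200))      # powers of two follow Benford's Law closely
benford_p_value(rep(1:9, 100))  # evenly spread first digits do not
```

None of these signals is conclusive on its own; they are probably best read together, as hints.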

Likely to be quantitative data

  • if the column has floating-point numbers
  • if the column has sparse values
  • if the column has negative values
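These checks could be sketched as below. The helper name `quantitative_signals` is made up for illustration, and reading "sparse" as "a high share of distinct values" is an assumption about what is meant.

```r
# Hypothetical helper: simple "looks quantitative" signals for a numeric vector.
quantitative_signals <- function(x) {
    x <- x[!is.na(x)]
    list(
        # at least one value with a fractional part
        has_fraction = any(x != round(x)),
        # at least one value below zero
        has_negative = any(x < 0),
        # share of distinct values; close to 1 means few repeats
        # (assumed interpretation of "sparse")
        unique_ratio = length(unique(x)) / length(x)
    )
}

str(quantitative_signals(c(1.5, -2, 3)))
str(quantitative_signals(c(1, 1, 2, 2)))
```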

Other

  • Maybe quantitative columns are more likely to sit next to other quantitative columns (and vice versa)

I am using R, but the question doesn't need to be R specific.

muyueh asked Feb 16 '14 08:02


1 Answer

The approach below assumes someone coded the data correctly.

Perhaps you are suggesting the data were not coded or labeled correctly: that everything was entered as numeric and some of it really is categorical. In that case, I do not know how one could tell with any certainty. Categorical data can have decimal places and can be negative.
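As a small illustration of that last point, here is a made-up vector of category codes in which -9 stands for "not answered" and values such as 1.1 are sub-category labels. The codes themselves are invented for this example.

```r
# Invented category codes: negative and fractional, yet not quantities
codes <- c(-9, 1.1, 1.2, 2.1, -9, 1.1)

# Treating the column as a factor keeps each code as a label
code.factor <- factor(codes)
levels(code.factor)
table(code.factor)

# Arithmetic on the raw numbers runs without complaint but means nothing here
mean(codes)
```

Checks such as "has decimals" or "has negatives" would label this column quantitative, which is why they can only ever be hints.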

The question I would ask myself in such a situation is: what difference does it make how I treat the data?

If you are interested in the second scenario, where numeric-looking data may really be categorical, perhaps you should ask your question on Stack Exchange.

my.data <- read.table(text = '
    aa     bb      cc     dd
    10    100    1000      1
    20    200    2000      2
    30    300    3000      3
    40    400    4000      4
    50    500    5000      5
    60    600    6000      6
', header = TRUE, colClasses = c('numeric', 'character', 'numeric', 'character'))

my.data

# one way
str(my.data)

'data.frame':   6 obs. of  4 variables:
 $ aa: num  10 20 30 40 50 60
 $ bb: chr  "100" "200" "300" "400" ...
 $ cc: num  1000 2000 3000 4000 5000 6000
 $ dd: chr  "1" "2" "3" "4" ...

Here is a way to record the information:

# start with a placeholder for each column, then fill in its class
my.class <- rep('empty', ncol(my.data))

for(i in seq_len(ncol(my.data))) {
    my.class[i] <- class(my.data[,i])
}

> my.class
[1] "numeric"   "character" "numeric"   "character"

EDIT

Here is a way to record class for each column without using a for-loop:

my.class <- sapply(my.data, class)
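
One caveat worth keeping in mind: `sapply` only returns a character vector when every column has a single class. A column with several classes (a `POSIXct` date-time, for example) makes it return a list instead. A minimal sketch of a stricter variant, using a made-up data frame `df` and keeping only the first class of each column:

```r
# Made-up data frame; the date-time column has two classes
df <- data.frame(
    a = c(1, 2),
    b = c("x", "y"),
    t = as.POSIXct(c("2014-02-16", "2014-02-17"), tz = "UTC"),
    stringsAsFactors = FALSE
)

# sapply cannot simplify classes of different lengths, so this is a list
loose.class <- sapply(df, class)
is.list(loose.class)

# vapply guarantees exactly one string per column
first.class <- vapply(df, function(col) class(col)[1], character(1))
first.class
```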
Mark Miller answered Sep 27 '22 16:09