Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Which data mining algorithm would you suggest for this particular scenario?

This is not a directly programming related question, but it's about selecting the right data mining algorithm.

I want to infer the age of people from their first names, from the region they live, and if they have an internet product or not. The idea behind it is that:

  • there are names that are old-fashioned or popular in a particular decade (celebrities, politicians etc.) (this may not hold in the USA, but in the country of interest that's true),
  • young people tend to live in highly populated regions whereas old people prefer countrysides, and
  • Internet is used more by young people than by old people.

I am not sure if those assumptions hold, but I want to test that. So what I have is 100K observations from our customer database with

  • approx. 500 different names (nominal input variable with too many classes)
  • 20 different regions (nominal input variable)
  • Internet Yes/No (binary input variable)
  • 91 distinct birthyears (numerical target variable with range: 1910-1992)

Because I have so many nominal inputs, I don't think regression is a good candidate. Because the target is numerical, I don't think decision tree is a good option either. Can anyone suggest me a method that is applicable for such a scenario?

like image 214
ercan Avatar asked Mar 01 '10 16:03

ercan


2 Answers

I think you could design discrete variables that reflect the split you are trying to determine. It doesn't seem like you need a regression on their exact age.

One possibility is to cluster the ages, and then treat the clusters as discrete variables. Should this not be appropriate, another possibility is to divide the ages into bins of equal distribution.

One technique that could work very well for your purposes is, instead of clustering or partitioning the ages directly, cluster or partition the average age per name. That is to say, generate a list of all of the average ages, and work with this instead. (There may be some statistical problems in the classifier if you the discrete categories here are too fine-grained, though).

However, the best case is if you have a clear notion of what age range you consider appropriate for 'young' and 'old'. Then, use these directly.

like image 82
John with waffle Avatar answered Oct 19 '22 04:10

John with waffle


New answer

I would try using regression, but in the manner that I specify. I would try binarizing each variable (if this is the correct term). The Internet variable is binary, but I would make it into two separate binary values. I will illustrate with an example because I feel it will be more illuminating. For my example, I will just use three names (Gertrude, Jennifer, and Mary) and the internet variable.

I have 4 women. Here are their data:

Gertrude, Internet, 57
Jennifer, Internet, 23
Gertrude, No Internet, 60
Mary, No Internet, 35

I would generate a matrix, A, like this (each row represents a respective woman in my list):

[[1,0,0,1,0], 
 [0,1,0,1,0],
 [1,0,0,0,1],
 [0,0,1,0,1]]

The first three columns represent the names and the latter two Internet/No Internet. Thus, the columns represent

[Gertrude, Jennifer, Mary, Internet, No Internet]

You can keep doing this with more names (500 columns for the names), and for the regions (20 columns for those). Then you will just be solving the standard linear algebra problem A*x=b where b for the above example is

b=[[57],
   [23],
   [60],
   [35]]

You may be worried that A will now be a huge matrix, but it is a huge, extremely sparse matrix and thus can be stored very efficiently in a sparse matrix form. Each row has 3 1's in it and the rest are 0. You can then just solve this with a sparse matrix solver. You will want to do some sort of correlation test on the resulting predicting ages to see how effective it is.

like image 33
Justin Peel Avatar answered Oct 19 '22 03:10

Justin Peel