Has there been any research in the field of data mining on classifying data that has a one-to-many relationship?
As an example of such a problem, say I am trying to predict which students are going to drop out of university based on their class grades and personal information. There is a one-to-many relationship between a student's personal information and the grades they achieved in their classes.
Obvious approaches include:
Aggregate - The multiple records could be aggregated in some way, reducing the problem to a basic classification problem. In the case of the student classification, the average of their grades could be combined with their personal data. While this solution is simple, key information is often lost. For example, what if most students who take organic chemistry and get below a C- end up dropping out, even if their average is above a B+?
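To make the aggregation idea concrete, here is a minimal sketch in plain Python. The grade-point scale, the field names, and the `aggregate_student` helper are all made up for illustration; they are not from any particular library.

```python
from statistics import mean

# Hypothetical grade-point scale (an assumption for illustration only).
GRADE_POINTS = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7,
                "C+": 2.3, "C": 2.0, "C-": 1.7, "D": 1.0, "F": 0.0}

def aggregate_student(personal, course_records):
    """Collapse many course records into one flat feature row."""
    points = [GRADE_POINTS[r["grade"]] for r in course_records]
    row = dict(personal)                # copy the "one" side of the relationship
    row["mean_grade"] = mean(points)    # summary of the "many" side
    row["min_grade"] = min(points)
    row["num_courses"] = len(points)
    return row

personal = {"age": 23, "gender": "male"}
courses = [
    {"course": "organic_chemistry", "grade": "D"},
    {"course": "classical_physics", "grade": "A"},
    {"course": "calculus", "grade": "A"},
]
row = aggregate_student(personal, courses)
print(row)
```

Note how the mean grade still looks respectable even though one course went badly. Adding the minimum helps a little, but the row no longer says which course the low grade came from, which is exactly the kind of information loss described above.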
Voting - Create multiple classifiers (often weak ones) and have them cast votes to determine the overall class of the data in question. For example, two classifiers could be built, one for the student's course data and one for their personal data. Each course record would be passed to the course classifier, which would predict from the grade and the course name whether the student would drop out, using that course record alone. The personal data record would be classified using the personal data classifier. Then all the course record predictions, along with the personal information record prediction, would be voted together. This voting could be done in a number of different ways, but it would most likely take into account how accurate the classifiers are and how certain each classifier was of its vote. This scheme allows for more complicated classification patterns than aggregation, but it involves a lot of extra complexity. Also, if the voting is not performed well, accuracy can easily suffer.
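One way the voting step could look, as a rough sketch: each prediction is weighted by the classifier's certainty times its overall accuracy. The `weighted_vote` function, the weighting rule, and all the numbers below are assumptions for illustration, and the per-record classifiers themselves are left out.

```python
def weighted_vote(votes):
    """Combine per-record predictions into one decision.

    votes: iterable of (predicts_dropout, confidence, accuracy), where
    confidence is how certain the classifier was about this record and
    accuracy is how reliable that classifier is overall (both in [0, 1]).
    """
    score = 0.0
    for predicts_dropout, confidence, accuracy in votes:
        weight = confidence * accuracy
        # Votes for "drop out" push the score up, votes against push it down.
        score += weight if predicts_dropout else -weight
    return score > 0, score

# Hypothetical predictions for one student (made-up numbers).
votes = [
    (True, 0.95, 0.8),   # course classifier on the organic chemistry record
    (False, 0.55, 0.8),  # course classifier on the classical physics record
    (False, 0.40, 0.7),  # personal data classifier
]
will_drop_out, score = weighted_vote(votes)
print(will_drop_out, score)
```

In this made-up case a plain majority vote would say the student stays (two votes to one), while the weighted vote says the opposite because the single "drop out" vote is much more certain. That also shows how sensitive the result is to the weighting scheme.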
So I am looking for other possible solutions to the classification of data with a one to many relationship.
Why wouldn't you treat each grade as a separate feature of the same model?
student = {}
student['age'] = 23
student['gender'] = 'male'
...
student['grade_in_organic_chemistry'] = 'B+'
student['grade_in_classical_physics'] = 'A-'
I guess I'm not seeing why you would want to "aggregate" or join together multiple classifiers when the grades can just be distinct features?
(Please excuse the lame pseudocode above; I'm just trying to demonstrate my point.)
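For what it's worth, turning the one-to-many course records into one row per student could be sketched like this. The `pivot_grades` helper, the record layout, and the sample data are hypothetical, and it is plain Python so nothing beyond the standard library is assumed.

```python
def pivot_grades(students, course_records):
    """Turn long-format course records into one wide feature row per student."""
    courses = sorted({r["course"] for r in course_records})
    rows = {}
    for student_id, personal in students.items():
        row = dict(personal)
        for course in courses:
            # None marks a course the student never took.
            row[f"grade_in_{course}"] = None
        rows[student_id] = row
    for r in course_records:
        rows[r["student_id"]][f"grade_in_{r['course']}"] = r["grade"]
    return rows

# Hypothetical sample data.
students = {
    1: {"age": 23, "gender": "male"},
    2: {"age": 21, "gender": "female"},
}
course_records = [
    {"student_id": 1, "course": "organic_chemistry", "grade": "B+"},
    {"student_id": 1, "course": "classical_physics", "grade": "A-"},
    {"student_id": 2, "course": "organic_chemistry", "grade": "C"},
]
rows = pivot_grades(students, course_records)
print(rows[2])
```

One thing to keep in mind with this layout is that every course becomes a column, so students who took only a few courses end up with mostly empty rows, and whatever classifier is used has to cope with those missing values.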
While this is probably sub-optimal compared to specialized methods, you could use an SVM with a correction for unbalanced classes, as in the following example (using the Python library scikit-learn):
http://scikit-learn.sourceforge.net/auto_examples/svm/plot_weighted_classes.html
In practice, I have had good results with fairly unbalanced classes.
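To give an idea of what such a correction does, here is a small sketch that computes per-class weights inversely proportional to class frequency. The `balanced_class_weights` helper and the label counts are made up for illustration. As far as I know, this is the same heuristic scikit-learn applies when `class_weight="balanced"` is set, and its `SVC` also accepts a dict like this one through the `class_weight` parameter.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Hypothetical, heavily unbalanced labels: few students actually drop out.
labels = ["stay"] * 90 + ["drop_out"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

The rare class gets a much larger weight, so misclassifying a student who drops out costs the model more than misclassifying one who stays.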