Supervised Learning on Coding Style - Feature Selection (Scikit Learn)

Question

I am researching whether it is possible to automate the scoring of students' code based on coding style. This includes things like avoiding duplicate code, commented-out code, poorly named variables, and more.

We are trying to learn from past semesters' composition scores (ranging from 1 to 3), which lends itself nicely to supervised learning. The basic idea is that we extract features from a student's submissions to build a feature_vector, then run it through logistic regression using scikit-learn. We have also tried various other things, including running PCA on the feature vectors to reduce dimensionality.
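For concreteness, the setup described above might look roughly like the following. This is only a sketch: the data is random stand-in data, and the step names, the 95% variance threshold, and the 5-fold cross-validation are assumptions made for illustration, not our actual configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 200 submissions, 8 style features, scores 1-3.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = rng.choice([1, 2, 3], size=200, p=[0.2, 0.6, 0.2])

pipeline = Pipeline([
    ("scale", StandardScaler()),  # PCA is sensitive to feature scale
    ("pca", PCA(n_components=0.95, svd_solver="full")),  # keep 95% of variance
    ("clf", LogisticRegression(max_iter=1000)),
])

# Cross-validated accuracy of the whole pipeline (scaling and PCA are refit per fold).
scores = cross_val_score(pipeline, X, y, cv=5)
print("accuracy per fold:", np.round(scores, 3))

pipeline.fit(X, y)
n_kept = pipeline.named_steps["pca"].n_components_
print("PCA components kept:", n_kept)
```

Putting the scaler and PCA inside the pipeline keeps them from seeing the held-out fold during cross-validation.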

Our classifier is simply guessing the most frequent class, which is a score of 2. I believe that it's because our features are simply NOT predictive in any way. Is there any other possible reason for a supervised learning algorithm to only guess the dominant class? Is there any way to prevent this?
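One way to check this behaviour is to compare the model against an explicit majority-class baseline and to look at the confusion matrix and balanced accuracy rather than plain accuracy. The sketch below uses synthetic data with a dominant class 2 and a deliberately weak signal (both assumptions); `class_weight="balanced"` is one option that may discourage always predicting the dominant class, though it cannot help if the features carry no signal.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): class 2 dominates, feature 0 is weakly informative.
rng = np.random.default_rng(0)
y = rng.choice([1, 2, 3], size=300, p=[0.2, 0.6, 0.2])
X = rng.normal(size=(300, 8))
X[:, 0] += 0.5 * (y - 2)

models = {
    "majority baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "balanced logistic regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
}

results = {}
for name, model in models.items():
    # Out-of-fold predictions, so every submission is scored by a model that never saw it.
    pred = cross_val_predict(model, X, y, cv=5)
    results[name] = {
        "balanced_accuracy": balanced_accuracy_score(y, pred),
        "confusion": confusion_matrix(y, pred, labels=[1, 2, 3]),
    }
    print(name, round(results[name]["balanced_accuracy"], 3))
    print(results[name]["confusion"])
```

If the real model's confusion matrix has a single non-zero column, like the baseline's, it is effectively the baseline.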

As I believe it's due to the features not being predictive, is there a way to determine what a "good" feature would be? (By good, I mean discriminative or predictive.)
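One possible starting point is to score each feature against the labels individually, for example with an ANOVA F-test (`f_classif`) and mutual information (`mutual_info_classif`) from scikit-learn. In the sketch below, the feature names are hypothetical and the data is synthetic, with only the first feature made informative on purpose.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

# Hypothetical feature names, loosely based on the features listed in this question.
feature_names = [
    "duplicate_lines",
    "avg_function_length",
    "one_char_variables",
    "commented_out_lines",
]

# Synthetic stand-in data (assumption): only the first feature depends on the score.
rng = np.random.default_rng(0)
y = rng.choice([1, 2, 3], size=300, p=[0.2, 0.6, 0.2])
X = rng.normal(size=(300, 4))
X[:, 0] += 1.5 * (y - 2)

# Univariate scores: F-test picks up differences in class means,
# mutual information can also pick up non-linear dependence.
f_scores, p_values = f_classif(X, y)
mi_scores = mutual_info_classif(X, y, random_state=0)

ranking = sorted(
    zip(feature_names, f_scores, p_values, mi_scores),
    key=lambda row: row[1],
    reverse=True,
)
for name, f, p, mi in ranking:
    print(f"{name:22s} F={f:8.2f}  p={p:.3g}  MI={mi:.3f}")
```

Univariate scores like these ignore interactions between features, so a low score does not prove a feature is useless in combination with others.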

Note: As a side experiment, we tested how consistent the past grades were by having readers grade assignments that had already been graded. Only 55% of them gave the same composition score (1-3) for the projects. This might mean the dataset is simply not classifiable, because even humans can't grade it consistently. Is that in fact the case? Any tips or other ideas?
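Raw agreement can be hard to interpret when one score dominates, so a chance-corrected measure such as Cohen's kappa may be more informative. The sketch below is purely illustrative: the score distribution and the simulated regrader are made up, and are tuned only so that raw agreement lands somewhere near 55%.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
p = [0.2, 0.6, 0.2]  # made-up score distribution with 2 as the dominant score
n = 500

original = rng.choice([1, 2, 3], size=n, p=p)

# Hypothetical regrader: repeats the original score 20% of the time,
# otherwise draws a score independently from the same distribution.
keep = rng.random(n) < 0.2
independent = rng.choice([1, 2, 3], size=n, p=p)
regrade = np.where(keep, original, independent)

raw_agreement = float(np.mean(original == regrade))
kappa = cohen_kappa_score(original, regrade)
# Quadratic weights penalize a 1-vs-3 disagreement more than a 1-vs-2 one,
# which suits ordinal scores.
weighted_kappa = cohen_kappa_score(original, regrade, weights="quadratic")

print(f"raw agreement:  {raw_agreement:.3f}")
print(f"Cohen's kappa:  {kappa:.3f}")
print(f"weighted kappa: {weighted_kappa:.3f}")
```

Kappa is always below raw agreement when graders are not in perfect agreement, and the gap grows as one class becomes more dominant.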

Features include: number of lines of duplicate code, average function length, number of 1-character variables, number of lines that include commented-out code, maximum line length, and counts of unused imports, unused variables, and unused parameters. A few more... We visualized all of our features and found that while the average is correlated with the score, the variation is really large (not promising).

Edit: Scope of our project: we are only trying to learn from one particular project (with skeleton code given) in one class. We don't need to generalize as of yet.

mattnedrich · Accepted Answer

Just a thought - Andrew Ng teaches a Machine Learning course on Coursera (https://www.coursera.org/course/ml). Students submit several programming assignments throughout the class. I remember reading (though unfortunately I can't find the article now) about ongoing research that attempted to cluster student-submitted programming assignments from the class, with the intuition that students tend to make common mistakes on the assignments.

Not sure if this helps you, but perhaps treating this as an unsupervised learning problem might make more sense (e.g., just looking for similarities in different code samples with the intuition that the code samples that are similar should receive a similar score).
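A minimal sketch of that idea, assuming the same kind of style feature vectors and using k-means purely as an example clustering method (the data, the number of clusters, and the choice of algorithm are all assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 submissions, 6 style features,
# with the first feature loosely tied to the past score.
rng = np.random.default_rng(0)
scores = rng.choice([1, 2, 3], size=200, p=[0.2, 0.6, 0.2])
features = rng.normal(size=(200, 6))
features[:, 0] += 1.0 * (scores - 2)

# Cluster without looking at the scores at all.
X = StandardScaler().fit_transform(features)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Afterwards, check whether clusters line up with the past scores.
ari = adjusted_rand_score(scores, clusters)
print(f"adjusted Rand index vs. past scores: {ari:.3f}")
for c in range(3):
    members = scores[clusters == c]
    counts = {s: int(np.sum(members == s)) for s in (1, 2, 3)}
    print(f"cluster {c}: {counts}")
```

An adjusted Rand index near zero would suggest the clusters are unrelated to the past scores, which could mean either that the features miss what graders look at or that the grades themselves are noisy.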

Tags: machine-learning, statistics, supervised-learning, feature-selection, feature-extraction

stogers
