This is a question about linear regression with ngrams, using Tf-IDF (term frequency - inverse document frequency). To do this, I am using numpy sparse matrices and sklearn for linear regression.
I have 53 cases and over 6000 features when using unigrams. The predictions are based on cross validation using LeaveOneOut.
When I create a tf-idf sparse matrix of only unigram scores, I get slightly better predictions than when I create a tf-idf sparse matrix of unigram+bigram scores. The more columns I add to the matrix (columns for trigram, quadgram, quintgrams, etc.), the less accurate the regression prediction.
Is this common? How is this possible? I would have thought that the more features, the better.
It's not common for bigrams to perform worse than unigrams, but there are situations where it may happen. In particular, adding extra features may lead to overfitting. Tf-idf is unlikely to alleviate this, as longer n-grams will be rarer, leading to higher idf values.
I'm not sure what kind of variable you're trying to predict, and I've never done regression on text, but here's some comparable results from literature to get you thinking:
Exactly what is happening depends on your training set; it might simply be too small.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With