
Classifying text documents with random forests

I have a set of 4k text documents belonging to 10 different classes, and I'm trying to see how well a random forest classifies them. The issue is that my feature extraction class extracts 200k features (a combination of words, bigrams, collocations, etc.). The data is highly sparse, and the random forest implementation in sklearn does not work with sparse input.

Q. What are my options here? Should I reduce the number of features, and if so, how?
Q. Is there any implementation of random forest that works with sparse arrays?

My relevant code is as follows:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from special_analyzer import *  # my own module, provides SpecialAnalyzer

# RAW_DATA_SRC_TR / RAW_DATA_SRC_TS: training and test data directories (defined elsewhere)
data_train = load_files(RAW_DATA_SRC_TR)
data_test = load_files(RAW_DATA_SRC_TS)
# class labels for the training set and the test set
y_train, y_test = data_train.target, data_test.target

# SpecialAnalyzer is my class extracting features from text
vectorizer = CountVectorizer(analyzer=SpecialAnalyzer())
X_train = vectorizer.fit_transform(data_train.data)  # sparse matrix, ~200k columns

rf = RandomForestClassifier(max_depth=10, max_features=10)
rf.fit(X_train, y_train)

asked Feb 10 '14 22:02 by Yantra


1 Answer

Several options: keep only the 10000 most frequent features by passing max_features=10000 to CountVectorizer, then convert the result to a dense numpy array with the toarray method:

X_train_array = X_train.toarray()
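
As a minimal end-to-end sketch of this first option, the snippet below caps the vocabulary, densifies the matrix, and fits the forest. The toy corpus, the labels, and the small max_features=5 are made up for illustration, and CountVectorizer's default word analyzer stands in for the asker's SpecialAnalyzer.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the 4k documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
    "stocks fell on the market today",
    "the market rallied as stocks rose",
    "investors sold stocks and bonds",
]
labels = [0, 0, 0, 1, 1, 1]

# Keep only the most frequent terms (5 here; 10000 in the answer above).
vectorizer = CountVectorizer(max_features=5)
X_sparse = vectorizer.fit_transform(docs)

# Convert the scipy sparse matrix to a dense numpy array.
X_dense = X_sparse.toarray()

rf = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
rf.fit(X_dense, labels)
predictions = rf.predict(X_dense)
```

Capping the vocabulary matters mostly for memory: with CountVectorizer's default 64-bit integer counts, a dense 4,000 x 10,000 array takes roughly 320 MB, whereas the full 4,000 x 200,000 matrix would need about 6.4 GB.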

Otherwise, reduce the dimensionality to 100 or 300 dimensions with TruncatedSVD, which accepts sparse input directly:

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=300)
X_reduced_train = svd.fit_transform(X_train)
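
A small sketch of how the reduced features could feed the random forest, including the test-time step. The corpus, labels, and n_components=2 are hypothetical stand-ins chosen so the example runs on a handful of documents; the point is only that the decomposition is fitted on the training data and then reused with transform on new data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the real training set.
train_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
    "stocks fell on the market today",
    "the market rallied as stocks rose",
    "investors sold stocks and bonds",
]
train_labels = [0, 0, 0, 1, 1, 1]
new_docs = ["the cat chased the dog"]

vectorizer = CountVectorizer()
X_train_sparse = vectorizer.fit_transform(train_docs)

# TruncatedSVD works on the sparse matrix and returns a dense array.
# n_components must be smaller than the number of features.
svd = TruncatedSVD(n_components=2, random_state=0)
X_train_reduced = svd.fit_transform(X_train_sparse)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_train_reduced, train_labels)

# At prediction time, reuse the fitted vectorizer and SVD (transform only).
X_new_reduced = svd.transform(vectorizer.transform(new_docs))
new_predictions = rf.predict(X_new_reduced)
```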

However, in my experience I could never make a RF work better than a well-tuned linear model (such as logistic regression with a grid-searched regularization parameter) on the original sparse data (possibly with TF-IDF normalization).
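
One way such a linear baseline might look is sketched below: TF-IDF features kept sparse, logistic regression on top, and a grid search over the regularization strength C. The toy corpus, the step names "tfidf" and "clf", the candidate values of C, and cv=2 are all assumptions made for the example, not values from the answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; a real run would use the 4k documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
    "my cat sleeps all day",
    "stocks fell on the market today",
    "the market rallied as stocks rose",
    "investors sold stocks and bonds",
    "bond yields rose as the market fell",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The pipeline keeps the features sparse from vectorizer to classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid search over the inverse regularization strength C.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)

best_C = search.best_params_["clf__C"]
predictions = search.predict(["the dog sat on the mat"])
```

Putting the vectorizer inside the pipeline means it is refitted on each training fold, so the cross-validation scores are not influenced by vocabulary from the held-out fold.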

answered Oct 03 '22 11:10 by ogrisel