I have a set of 4k text documents belonging to 10 different classes. I am trying to see how the random forest method performs classification. The issue is that my feature extraction class extracts 200k features (a combination of words, bigrams, collocations, etc.). This is highly sparse data, and the random forest implementation in sklearn does not work with sparse data inputs.
Q. What are my options here? Reduce the number of features? How?
Q. Is there any implementation of random forest out there that works with sparse arrays?
My relevant code is as follows:
import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
#import pylab as pl
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from special_analyzer import *
data_train = load_files(RAW_DATA_SRC_TR)  # directory of training documents, one subfolder per class
data_test = load_files(RAW_DATA_SRC_TS)   # directory of test documents
# class labels for the training and test sets
y_train, y_test = data_train.target, data_test.target
vectorizer = CountVectorizer(analyzer=SpecialAnalyzer())  # SpecialAnalyzer is my class extracting features from text
X_train = vectorizer.fit_transform(data_train.data)  # sparse matrix with ~200k feature columns
rf = RandomForestClassifier(max_depth=10, max_features=10)
rf.fit(X_train, y_train)  # this is where the sparse input fails
Several options: take only the 10,000 most frequent features by passing max_features=10000 to CountVectorizer, then convert the result to a dense numpy array with the toarray method:
X_train_array = X_train.toarray()
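Put together, a rough sketch of that first option (the 10000 cap and n_estimators=100 are just illustrative values, not tuned):

vectorizer = CountVectorizer(analyzer=SpecialAnalyzer(), max_features=10000)
X_train_array = vectorizer.fit_transform(data_train.data).toarray()  # 4000 x 10000 dense array, fits in memory
X_test_array = vectorizer.transform(data_test.data).toarray()        # reuse the same fitted vocabulary
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train_array, y_train)
print(rf.score(X_test_array, y_test))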
Otherwise reduce the dimensionality to 100 or 300 dimensions with TruncatedSVD:

from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=300)
X_reduced_train = svd.fit_transform(X_train)
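The same fitted svd then has to be reused to project the test documents before prediction; for example (300 components and n_estimators=100 are rough choices, not tuned values):

X_reduced_test = svd.transform(vectorizer.transform(data_test.data))  # project test data with the fitted SVD
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_reduced_train, y_train)
print(rf.score(X_reduced_test, y_test))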
However, in my experience I could never make a random forest work better than a well-tuned linear model (such as logistic regression with a grid-searched regularization parameter) on the original sparse data (possibly with TF-IDF normalization).
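As a point of comparison, a minimal sketch of that kind of baseline (the use of TfidfVectorizer instead of CountVectorizer and the particular C grid are assumptions for illustration, not something from the question):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# TF-IDF features can be fed to a linear model directly as a sparse matrix
tfidf = TfidfVectorizer(analyzer=SpecialAnalyzer())
X_train_tfidf = tfidf.fit_transform(data_train.data)

# grid-search the regularization strength C
grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X_train_tfidf, y_train)
print(grid.best_params_, grid.score(tfidf.transform(data_test.data), y_test))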