I have a set of 4k text documents belonging to 10 different classes. I am trying to see how the random forest method performs classification. The issue is that my feature extraction class extracts 200k features (a combination of words, bigrams, collocations, etc.). This is highly sparse data, and the random forest implementation in sklearn does not work with sparse data inputs.
Q. What are my options here? Reduce the number of features? How?
Q. Is there any implementation of random forest out there that works with sparse arrays?
My relevant code is as follows:
import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
#import pylab as pl
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from special_analyzer import *
data_train = load_files(RAW_DATA_SRC_TR)  # directory of training documents, one subfolder per class
data_test = load_files(RAW_DATA_SRC_TS)   # directory of test documents
# class labels for the training and test sets
y_train, y_test = data_train.target, data_test.target
vectorizer = CountVectorizer(analyzer=SpecialAnalyzer())  # SpecialAnalyzer is my class extracting features from text
X_train = vectorizer.fit_transform(data_train.data)  # sparse matrix with ~200k feature columns
rf = RandomForestClassifier(max_depth=10, max_features=10)
rf.fit(X_train, y_train)  # this is where the sparse input fails
Several options: take only the 10,000 most frequent features by passing max_features=10000 to CountVectorizer, then convert the result to a dense numpy array with the toarray method:
X_train_array = X_train.toarray()
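Put together, a rough sketch of that first option (the 10000 cap and n_estimators=100 are just illustrative values, not tuned):

vectorizer = CountVectorizer(analyzer=SpecialAnalyzer(), max_features=10000)
X_train_array = vectorizer.fit_transform(data_train.data).toarray()  # 4000 x 10000 dense array, fits in memory
X_test_array = vectorizer.transform(data_test.data).toarray()        # reuse the same fitted vocabulary
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train_array, y_train)
print(rf.score(X_test_array, y_test))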
Otherwise reduce the dimensionality to 100 or 300 dimensions with TruncatedSVD:

from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=300)
X_reduced_train = svd.fit_transform(X_train)
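The same fitted svd then has to be reused to project the test documents before prediction; for example (300 components and n_estimators=100 are rough choices, not tuned values):

X_reduced_test = svd.transform(vectorizer.transform(data_test.data))  # project test data with the fitted SVD
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_reduced_train, y_train)
print(rf.score(X_reduced_test, y_test))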
However, in my experience I could never make a random forest work better than a well-tuned linear model (such as logistic regression with a grid-searched regularization parameter) on the original sparse data (possibly with TF-IDF normalization).
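As a point of comparison, a minimal sketch of that kind of baseline (the use of TfidfVectorizer instead of CountVectorizer and the particular C grid are assumptions for illustration, not something from the question):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# TF-IDF features can be fed to a linear model directly as a sparse matrix
tfidf = TfidfVectorizer(analyzer=SpecialAnalyzer())
X_train_tfidf = tfidf.fit_transform(data_train.data)

# grid-search the regularization strength C
grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X_train_tfidf, y_train)
print(grid.best_params_, grid.score(tfidf.transform(data_test.data), y_test))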