
Combining bag of words and other features in one model using sklearn and pandas


I am trying to model the score that a post receives, based on both the text of the post and other features (time of day, length of post, etc.).

I am wondering how to best combine these different types of features into one model. Right now, I have something like the following (stolen from here and here).

    import pandas as pd
    ...

    def features(p):
        terms = vectorizer(p[0])
        d = {'feature_1': p[1], 'feature_2': p[2]}
        for t in terms:
            d[t] = d.get(t, 0) + 1
        return d

    posts = pd.read_csv('path/to/csv')

    # Create vectorizer for function to use
    vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2)).build_tokenizer()
    y = posts["score"].values.astype(np.float32)

    vect = DictVectorizer()

    # This is the part I want to fix
    temp = zip(list(posts.message), list(posts.feature_1), list(posts.feature_2))
    tokenized = map(lambda x: features(x), temp)
    X = vect.fit_transform(tokenized)

It seems very silly to extract all of the features I want out of the pandas dataframe, just to zip them all back together. Is there a better way of doing this step?

The CSV looks something like the following:

    ID,message,feature_1,feature_2
    1,'This is the text',4,7
    2,'This is more text',3,2
    ...
Jeremy asked Jun 04 '15 20:06



1 Answer

You could do everything with your map and lambda:

    tokenized = map(lambda msg, ft1, ft2: features([msg, ft1, ft2]),
                    posts.message, posts.feature_1, posts.feature_2)

This skips the intermediate temp step: map walks the three columns in parallel and passes one value from each to the lambda.
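As a minimal, standard-library-only sketch of the multi-iterable map call: the three lists below are hypothetical stand-ins for posts.message, posts.feature_1 and posts.feature_2, and TOKEN_RE is an assumed replacement for the tokenizer built from CountVectorizer (it mirrors the default token pattern of two or more word characters). Note that on Python 3 map() returns a lazy iterator, so it is wrapped in list() here.

```python
import re

# Assumption: stand-in for CountVectorizer(...).build_tokenizer(), which by
# default splits text into tokens of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def features(p):
    """Build one feature dict from a (message, feature_1, feature_2) triple."""
    d = {'feature_1': p[1], 'feature_2': p[2]}
    for t in TOKEN_RE.findall(p[0]):
        d[t] = d.get(t, 0) + 1
    return d


# Hypothetical stand-ins for the three DataFrame columns.
messages = ["This is the text", "This is more text"]
feature_1 = [4, 3]
feature_2 = [7, 2]

# map() with several iterables passes one item from each to the lambda.
tokenized = list(map(lambda msg, ft1, ft2: features([msg, ft1, ft2]),
                     messages, feature_1, feature_2))

# The same rows built with zip() and a list comprehension.
tokenized_zip = [features(row) for row in zip(messages, feature_1, feature_2)]
```

Either list of dicts can then be handed to DictVectorizer.fit_transform as in the question.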

Another solution would be to convert the messages into their CountVectorizer sparse matrix and join this matrix with the feature values from the posts dataframe. This skips having to construct a dict and produces a sparse matrix similar to what you would get with DictVectorizer:

    import numpy as np
    import pandas as pd
    import scipy as sp
    import scipy.sparse  # makes sp.sparse available
    from sklearn.feature_extraction.text import CountVectorizer

    posts = pd.read_csv('post.csv')

    # Create vectorizer for function to use
    vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
    y = posts["score"].values.astype(np.float32)

    # Stack the bag-of-words matrix and the numeric columns side by side
    X = sp.sparse.hstack((vectorizer.fit_transform(posts.message),
                          posts[['feature_1', 'feature_2']].values),
                         format='csr')
    # On scikit-learn 1.0+, use vectorizer.get_feature_names_out().tolist() instead
    X_columns = vectorizer.get_feature_names() + posts[['feature_1', 'feature_2']].columns.tolist()

    posts
    Out[38]:
       ID              message  feature_1  feature_2  score
    0   1   'This is the text'          4          7     10
    1   2  'This is more text'          3          2      9
    2   3   'More random text'          3          2      9

    X_columns
    Out[39]:
    [u'is',
     u'is more',
     u'is the',
     u'more',
     u'more random',
     u'more text',
     u'random',
     u'random text',
     u'text',
     u'the',
     u'the text',
     u'this',
     u'this is',
     'feature_1',
     'feature_2']

    X.toarray()
    Out[40]:
    array([[1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 4, 7],
           [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 3, 2],
           [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 3, 2]])

Additionally, sklearn-pandas has a DataFrameMapper that does what you're looking for too:

    from sklearn_pandas import DataFrameMapper

    mapper = DataFrameMapper([
        (['feature_1', 'feature_2'], None),
        ('message', CountVectorizer(binary=True, ngram_range=(1, 2)))
    ])
    X = mapper.fit_transform(posts)

    X
    Out[71]:
    array([[4, 7, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
           [3, 2, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1],
           [3, 2, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0]])

Note: X is not sparse when using this last method.

    # On scikit-learn 1.0+, use get_feature_names_out().tolist() instead
    X_columns = mapper.features[0][0] + mapper.features[1][1].get_feature_names()

    X_columns
    Out[76]:
    ['feature_1',
     'feature_2',
     u'is',
     u'is more',
     u'is the',
     u'more',
     u'more random',
     u'more text',
     u'random',
     u'random text',
     u'text',
     u'the',
     u'the text',
     u'this',
     u'this is']
khammel answered Sep 21 '22 14:09
