I am trying to model the score that a post receives, based on both the text of the post and other features (time of day, length of post, etc.).
I am wondering how best to combine these different types of features into one model. Right now, I have something like the following (stolen from here and here).
    import pandas as pd
    ...

    def features(p):
        terms = vectorizer(p[0])
        d = {'feature_1': p[1], 'feature_2': p[2]}
        for t in terms:
            d[t] = d.get(t, 0) + 1
        return d

    posts = pd.read_csv('path/to/csv')

    # Create vectorizer for function to use
    vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2)).build_tokenizer()
    y = posts["score"].values.astype(np.float32)
    vect = DictVectorizer()

    # This is the part I want to fix
    temp = zip(list(posts.message), list(posts.feature_1), list(posts.feature_2))
    tokenized = map(lambda x: features(x), temp)
    X = vect.fit_transform(tokenized)
It seems very silly to extract all of the features I want out of the pandas dataframe, just to zip them all back together. Is there a better way of doing this step?
The CSV looks something like the following:
    ID,message,feature_1,feature_2
    1,'This is the text',4,7
    2,'This is more text',3,2
    ...
You could do everything with your map and lambda:
    tokenized = map(lambda msg, ft1, ft2: features([msg, ft1, ft2]),
                    posts.message, posts.feature_1, posts.feature_2)
This skips the intermediate temp step: map pulls one value from each of the three columns per call and passes them to the lambda together.
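To see how `map` walks several iterables in parallel, here is a minimal standard-library sketch. The plain lists are stand-ins for the three DataFrame columns, and the `tokenize` helper is a hypothetical substitute for the `CountVectorizer` tokenizer (a regex modeled on its default token pattern), so treat it as an illustration rather than the exact pipeline.

```python
import re

# Hypothetical stand-in for CountVectorizer(...).build_tokenizer():
# keeps runs of two or more word characters, without lowercasing.
tokenize = re.compile(r"\b\w\w+\b").findall


def features(p):
    """Build one feature dict from a (message, feature_1, feature_2) triple."""
    d = {'feature_1': p[1], 'feature_2': p[2]}
    for t in tokenize(p[0]):
        d[t] = d.get(t, 0) + 1
    return d


# Plain lists standing in for posts.message, posts.feature_1, posts.feature_2.
messages = ['This is the text', 'This is more text']
feature_1 = [4, 3]
feature_2 = [7, 2]

# map() takes one item from each iterable per call, so no zip() is needed.
# In Python 3 it returns a lazy iterator; list() materializes it here.
tokenized = list(map(lambda msg, ft1, ft2: features([msg, ft1, ft2]),
                     messages, feature_1, feature_2))
```

Each element of `tokenized` is a dict of the kind `DictVectorizer` expects, so the result could be passed to `vect.fit_transform` as in the question.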
Another solution would be to convert the messages into their CountVectorizer sparse matrix and horizontally stack that matrix with the feature values from the posts dataframe. This avoids building a dict for every row and produces a sparse matrix similar to what you would get from DictVectorizer:
    import scipy as sp

    posts = pd.read_csv('post.csv')

    # Create vectorizer for function to use
    vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
    y = posts["score"].values.astype(np.float32)

    # Stack the sparse text matrix and the numeric columns side by side
    X = sp.sparse.hstack((vectorizer.fit_transform(posts.message),
                          posts[['feature_1', 'feature_2']].values), format='csr')

    # On scikit-learn 1.0+, use list(vectorizer.get_feature_names_out()) instead
    X_columns = vectorizer.get_feature_names() + posts[['feature_1', 'feature_2']].columns.tolist()

    posts
    Out[38]:
       ID              message  feature_1  feature_2  score
    0   1   'This is the text'          4          7     10
    1   2  'This is more text'          3          2      9
    2   3   'More random text'          3          2      9

    X_columns
    Out[39]:
    [u'is', u'is more', u'is the', u'more', u'more random', u'more text',
     u'random', u'random text', u'text', u'the', u'the text', u'this',
     u'this is', 'feature_1', 'feature_2']

    X.toarray()
    Out[40]:
    array([[1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 4, 7],
           [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 3, 2],
           [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 3, 2]])
Additionally, sklearn-pandas has DataFrameMapper, which also does what you're looking for:
    from sklearn_pandas import DataFrameMapper

    mapper = DataFrameMapper([
        (['feature_1', 'feature_2'], None),
        ('message', CountVectorizer(binary=True, ngram_range=(1, 2)))
    ])
    X = mapper.fit_transform(posts)

    X
    Out[71]:
    array([[4, 7, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
           [3, 2, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1],
           [3, 2, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0]])
Note: X is not sparse when using this last method.
    X_columns = mapper.features[0][0] + mapper.features[1][1].get_feature_names()

    X_columns
    Out[76]:
    ['feature_1', 'feature_2', u'is', u'is more', u'is the', u'more',
     u'more random', u'more text', u'random', u'random text', u'text',
     u'the', u'the text', u'this', u'this is']