I am working on a project that has user reviews on products. I am using TfidfVectorizer to extract features from my dataset apart from some other features that I have extracted manually.
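import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('reviews.csv', header=0)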
FEATURES = ['feature1', 'feature2']
reviews = df['review']
reviews = reviews.values.flatten()
vectorizer = TfidfVectorizer(min_df=1, decode_error='ignore', ngram_range=(1, 3), stop_words='english', max_features=45)
X = vectorizer.fit_transform(reviews)
idf = vectorizer.idf_
features = vectorizer.get_feature_names()
FEATURES += features
inverse = vectorizer.inverse_transform(X)
for i, row in df.iterrows():
    for f in features:
        df.set_value(i, f, False)
    for inv in inverse[i]:
        df.set_value(i, inv, True)
train_df, test_df = train_test_split(df, test_size = 0.2, random_state=700)
The above code works fine. But when I change max_features from 45 to anything higher, I get an error on the train_test_split line.
Traceback as follows:
Traceback (most recent call last):
  File "analysis.py", line 120, in <module>
    train_df, test_df = train_test_split(df, test_size = 0.2, random_state=700)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1906, in train_test_split
    arrays = indexable(*arrays)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 201, in indexable
    check_consistent_length(*result)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 173, in check_consistent_length
    uniques = np.unique([_num_samples(X) for X in arrays if X is not None])
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 112, in _num_samples
    'estimator %s' % x)
TypeError: Expected sequence or array-like, got estimator
I am not sure what exactly changes when I increase max_features.
Let me know if you need more data or if I have missed something.
I know this is old, but I had the same issue and while the answer from @shahins works, I wanted something that would keep the dataframe object so I can have my indexing in the train/test splits.
Rename the dataframe column fit to something (anything) else:
df = df.rename(columns = {'fit': 'fit_feature'})
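If the vocabulary isn't known ahead of time, the same rename can be built from the feature list instead of being hard-coded. This is only a sketch: build_rename_map, its reserved tuple, and the '_feature' suffix are hypothetical names chosen for illustration, and 'fit' is the only colliding name this answer has actually identified.

```python
def build_rename_map(feature_names, reserved=('fit',), suffix='_feature'):
    """Return {old_name: new_name} for feature names that collide with reserved names.

    `reserved` defaults to ('fit',) because that is the only collision
    described here; any other entries would be an assumption.
    """
    taken = set(feature_names)
    mapping = {}
    for name in feature_names:
        if name in reserved:
            new_name = name + suffix
            # Avoid clashing with a feature that already uses the new name.
            while new_name in taken:
                new_name += '_'
            mapping[name] = new_name
            taken.add(new_name)
    return mapping


# Example with a made-up feature list (not taken from the question's data).
example_features = ['fit', 'great', 'small']
rename_map = build_rename_map(example_features)
print(rename_map)
```

The resulting dictionary can be passed straight to df.rename(columns=rename_map), which leaves every other column untouched.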
The number of features isn't actually the issue; one feature in particular is causing the problem. I'm guessing you are getting the word "fit" as one of your text features (and it didn't show up with the lower max_features threshold).
Looking at the sklearn source code, it checks that you are not passing an sklearn estimator by testing whether any of your objects has a "fit" attribute. The code is checking for the fit method of an sklearn estimator, but it will also raise an exception when the dataframe has a fit column (remember that df.fit and df['fit'] both select the "fit" column).
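The mechanism can be reproduced without sklearn or pandas. The sketch below is not the library code: Frame is a toy stand-in for a dataframe whose columns are reachable as attributes, and check_not_estimator is a hypothetical function that only imitates the "has a fit attribute" test described above.

```python
class Frame(object):
    """Toy stand-in for a dataframe: columns can be read as attributes."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so a column
        # named 'fit' becomes visible as frame.fit.
        columns = self.__dict__.get('_columns', {})
        if name in columns:
            return columns[name]
        raise AttributeError(name)

    def rename(self, mapping):
        # Return a new Frame with columns renamed according to `mapping`.
        return Frame((mapping.get(k, k), v) for k, v in self._columns.items())


def check_not_estimator(x):
    """Imitates the check: anything with a 'fit' attribute is rejected."""
    if hasattr(x, 'fit'):
        raise TypeError('Expected sequence or array-like, got estimator')
    return True


frame = Frame({'fit': [True, False], 'good': [False, True]})

try:
    check_not_estimator(frame)
    rejected = False
except TypeError:
    rejected = True

print(rejected)  # the 'fit' column makes the frame look like an estimator
print(check_not_estimator(frame.rename({'fit': 'fit_feature'})))
```

The check never looks at what the attribute is, only whether it exists, which is why a boolean column named fit is enough to trigger it.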
I had this issue and I tried something like this and it worked for me:
train_test_split(df.as_matrix(), test_size = 0.2, random_state=700)