
TypeError: Expected sequence or array-like, got estimator

I am working on a project that has user reviews on products. I am using TfidfVectorizer to extract text features from my dataset, in addition to some other features that I extracted manually.

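import pandas as pd
# train_test_split moved to sklearn.model_selection in scikit-learn 0.18
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('reviews.csv', header=0)

FEATURES = ['feature1', 'feature2']
reviews = df['review'].values.flatten()

vectorizer = TfidfVectorizer(min_df=1, decode_error='ignore', ngram_range=(1, 3), stop_words='english', max_features=45)

X = vectorizer.fit_transform(reviews)
idf = vectorizer.idf_
features = vectorizer.get_feature_names()
FEATURES += features
# For each review, the terms that have a non-zero TF-IDF weight
inverse = vectorizer.inverse_transform(X)

# Add one boolean column per extracted term: True where the review contains it
for i, row in df.iterrows():
    for f in features:
        df.set_value(i, f, False)
    for inv in inverse[i]:
        df.set_value(i, inv, True)

train_df, test_df = train_test_split(df, test_size = 0.2, random_state=700)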

The above code works fine. But when I change max_features from 45 to anything higher, I get an error on the train_test_split line.

Traceback as follows:

Traceback (most recent call last):
  File "analysis.py", line 120, in <module>
    train_df, test_df = train_test_split(df, test_size = 0.2, random_state=700)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1906, in train_test_split
    arrays = indexable(*arrays)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 201, in indexable
    check_consistent_length(*result)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 173, in check_consistent_length
    uniques = np.unique([_num_samples(X) for X in arrays if X is not None])
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 112, in _num_samples
    'estimator %s' % x)
TypeError: Expected sequence or array-like, got estimator

I am not sure what exactly changes when I increase max_features.

Let me know if you need more data or if I have missed something.

Deepak Puthraya asked Sep 28 '16 11:09


2 Answers

I know this is old, but I had the same issue and while the answer from @shahins works, I wanted something that would keep the dataframe object so I can have my indexing in the train/test splits.

Solution:

Rename the dataframe column fit as something (anything) else:

df = df.rename(columns = {'fit': 'fit_feature'})
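
If more than one extracted term could collide with an attribute name, a variation on the same idea is to prefix every term column at once. This is only a minimal sketch: the helper name prefix_term_columns, the tfidf_ prefix, and the toy dataframe are hypothetical, and terms is assumed to hold the names returned by the vectorizer.

```python
import pandas as pd

def prefix_term_columns(df, terms, prefix="tfidf_"):
    """Return a copy of df with each term column renamed to prefix + term."""
    # Only rename columns that actually exist in the dataframe.
    mapping = {term: prefix + term for term in terms if term in df.columns}
    return df.rename(columns=mapping)

# Toy stand-in for the question's dataframe (made-up data).
df = pd.DataFrame({"feature1": [1, 2], "fit": [True, False], "great": [False, True]})
terms = ["fit", "great"]

safe_df = prefix_term_columns(df, terms)
print(list(safe_df.columns))     # ['feature1', 'tfidf_fit', 'tfidf_great']
print(hasattr(safe_df, "fit"))   # False
```

The manually extracted columns are left untouched because they are not in terms.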

Why it works:

It isn't actually the number of features that is the issue; one feature in particular is causing the problem. I'm guessing you are getting the word "fit" as one of your text features (and it didn't show up with the lower max_features threshold).

Looking at the sklearn source code, it guards against being passed an sklearn estimator by testing whether any of your objects has a "fit" attribute. The check is meant to catch the fit method of an estimator, but it also raises the exception when the dataframe has a fit column (remember that df.fit and df['fit'] both select the "fit" column).
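
The collision can be reproduced without sklearn. The sketch below is an illustration only: looks_like_estimator is a hypothetical, simplified stand-in for the attribute test described above, not sklearn's actual function, and the dataframes are made-up data.

```python
import pandas as pd

def looks_like_estimator(x):
    # Hypothetical stand-in for the check: anything exposing a "fit"
    # attribute is treated as an estimator.
    return hasattr(x, "fit")

plain = pd.DataFrame({"good": [True, False], "bad": [False, True]})
with_fit = plain.assign(fit=[True, True])                    # adds a column named "fit"
renamed = with_fit.rename(columns={"fit": "fit_feature"})    # the fix from above

results = {
    "plain": looks_like_estimator(plain),
    "with_fit": looks_like_estimator(with_fit),
    "renamed": looks_like_estimator(renamed),
}
print(results)  # {'plain': False, 'with_fit': True, 'renamed': False}
```

Only the dataframe that has a column literally named "fit" trips the check, which is why raising max_features can trigger the error.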

elz answered Oct 25 '22 12:10


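I had this issue, and passing the underlying NumPy array instead of the dataframe worked for me. The array has no "fit" attribute, but you lose the dataframe's index and column names. (as_matrix() was removed in pandas 1.0; df.values returns the same array.)

train_test_split(df.as_matrix(), test_size = 0.2, random_state=700)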
happyhuman answered Oct 25 '22 13:10