Using multiple features with scikit-learn

I'm working on text classification using scikit-learn. Things work well with a single feature, but introducing multiple features is giving me errors. I think the problem is that I'm not formatting the data in the way that the classifier expects.

For example, this works fine:

data = np.array(df['feature1'])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)

classifier = Pipeline(...)

classifier.fit(X_train, Y_train)

But this:

data = np.array(df[['feature1', 'feature2']])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)

classifier = Pipeline(...)

classifier.fit(X_train, Y_train)

dies with

Traceback (most recent call last):
  File "/Users/jed/Dropbox/LegalMetric/LegalMetricML/motion_classifier.py", line 157, in <module>
    classifier.fit(X_train, Y_train)
  File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    Xt, fit_params = self._pre_transform(X, y, **fit_params)
  File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 120, in _pre_transform
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 715, in _count_vocab
    for feature in analyze(doc):
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 229, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 195, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

during the preprocessing stage after classifier.fit() is called. I think the problem is the way I'm formatting the data, but I can't figure out how to get it right.

feature1 and feature2 are both English text strings, as is the target. I'm using LabelEncoder() to encode target, which seems to work fine.

Here's an example of what print data returns, to give you a sense of how it's formatted right now.

[['some short english text'
  'a paragraph of english text']
 ['some more short english text'
  'a second paragraph of english text']
 ['some more short english text'
  'a third paragraph of english text']]
James Daily asked Feb 05 '14 21:02


1 Answer

The particular error message makes it seem like your code somewhere expects something to be a str (so that .lower may be called) but instead it is receiving a whole array (probably a whole array of strs).
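To illustrate where an ndarray could come from, here is a minimal sketch that uses only NumPy. The `lowercase` function is a hypothetical stand-in for a text preprocessor that expects one str per document; it is not scikit-learn's actual code. The sample strings are taken from the question.

```python
import numpy as np

# Two text columns per row, shaped like the data printed in the question.
data = np.array([
    ["some short english text", "a paragraph of english text"],
    ["some more short english text", "a second paragraph of english text"],
])

def lowercase(doc):
    # Hypothetical stand-in for a preprocessor that expects a single str.
    return doc.lower()

# Iterating over a 2-D array yields one row at a time, and each row is an ndarray.
first_doc = next(iter(data))
print(type(first_doc).__name__, first_doc.shape)

try:
    lowercase(first_doc)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# A single column is 1-D, so each document is a string and .lower() works.
single_column = data[:, 0]
lowered = [lowercase(doc) for doc in single_column]
print(lowered)
```

If this matches what happens inside the pipeline, each "document" handed to the text step would be a whole row of two strings rather than one string.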

Can you edit the question to better describe the data and also post the full traceback, not just the small part with the named error?

In the meantime, you can also try

data = df[['feature1', 'feature2']].values

and

df['target'].values

instead of explicitly casting to np.ndarray yourself.
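A quick way to check what either form produces is to look at the shape and at the type of the first element. The sketch below assumes a small DataFrame built from the strings in the question; the `target` labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: feature text from the question, invented target labels.
df = pd.DataFrame({
    "feature1": ["some short english text", "some more short english text"],
    "feature2": ["a paragraph of english text", "a second paragraph of english text"],
    "target": ["granted", "denied"],
})

via_values = df[["feature1", "feature2"]].values
via_np_array = np.array(df[["feature1", "feature2"]])

# Both forms are 2-D: one row per sample, one column per feature.
print(via_values.shape, via_np_array.shape)

# The first element of a 2-D array is a row, not a string.
print(type(via_values[0]).__name__)

# A single column is 1-D, which is the shape a text step normally consumes.
target_values = df["target"].values
print(target_values.shape)
```

Printing these before calling fit should show whether the classifier is being handed strings or rows of strings.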

It looks to me like a 1x1 array is being created whose single element is itself an ndarray.
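If it turns out that each text column needs its own vectorizer, one possible approach (not something the question's code uses, and only available in newer scikit-learn releases than the one in the traceback) is `ColumnTransformer`. This is a sketch under assumptions: the question's `Pipeline(...)` steps are not shown, so the `CountVectorizer` and `LogisticRegression` steps, the step names and the target labels below are all placeholders.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: feature text modeled on the question, invented labels.
df = pd.DataFrame({
    "feature1": [
        "some short english text",
        "some more short english text",
        "another short english text",
        "yet more short english text",
    ],
    "feature2": [
        "a paragraph of english text",
        "a second paragraph of english text",
        "a third paragraph of english text",
        "a fourth paragraph of english text",
    ],
})
targets = ["granted", "denied", "granted", "denied"]

label_encoder = LabelEncoder()
classes = label_encoder.fit_transform(targets)

# A scalar column name (not a list) hands each vectorizer a 1-D column of strings.
features = ColumnTransformer([
    ("f1", CountVectorizer(), "feature1"),
    ("f2", CountVectorizer(), "feature2"),
])

# Placeholder pipeline; the real steps from the question are unknown.
classifier = Pipeline([
    ("features", features),
    ("model", LogisticRegression()),
])

classifier.fit(df, classes)
predictions = classifier.predict(df)
print(label_encoder.inverse_transform(predictions))
```

The per-column bag-of-words matrices are stacked side by side before reaching the model, so the classifier sees one combined feature matrix while each vectorizer only ever sees plain strings.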

ely answered Oct 05 '22 00:10