I'm working on text classification using scikit-learn. Things work well with a single feature, but introducing multiple features is giving me errors. I think the problem is that I'm not formatting the data in the way that the classifier expects.
For example, this works fine:
data = np.array(df['feature1'])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)
classifier = Pipeline(...)
classifier.fit(X_train, Y_train)
But this:
data = np.array(df[['feature1', 'feature2']])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)
classifier = Pipeline(...)
classifier.fit(X_train, Y_train)
dies with
Traceback (most recent call last):
File "/Users/jed/Dropbox/LegalMetric/LegalMetricML/motion_classifier.py", line 157, in <module>
classifier.fit(X_train, Y_train)
File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 130, in fit
Xt, fit_params = self._pre_transform(X, y, **fit_params)
File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 120, in _pre_transform
Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 715, in _count_vocab
for feature in analyze(doc):
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 229, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 195, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
during the preprocessing stage after classifier.fit() is called. I think the problem is the way I'm formatting the data, but I can't figure out how to get it right.
feature1 and feature2 are both English text strings, as is the target. I'm using LabelEncoder() to encode target, which seems to work fine.
Here's an example of what print data returns, to give you a sense of how it's formatted right now:
[['some short english text'
'a paragraph of english text']
['some more short english text'
'a second paragraph of english text']
['some more short english text'
'a third paragraph of english text']]
The particular error message makes it seem like your code somewhere expects something to be a str (so that .lower() can be called on it), but it is instead receiving a whole array (probably a whole array of strings).
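For instance, the same kind of error can be reproduced with a bare text vectorizer (a minimal sketch, not your actual pipeline):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()

# A 1-D array of strings works: each element is a single document.
docs_1d = np.array(['some short english text',
                    'some more short english text'])
vec.fit_transform(docs_1d)

# A 2-D array does not: each "document" handed to the analyzer is now an
# ndarray of two strings, and the preprocessor tries to call .lower() on it.
docs_2d = np.array([['some short english text', 'a paragraph of english text'],
                    ['some more short english text', 'a second paragraph of english text']])
vec.fit_transform(docs_2d)   # AttributeError: 'numpy.ndarray' object has no attribute 'lower'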
Can you edit the question to better describe the data and also post the full traceback, not just the small part with the named error?
In the meantime, you can also try

data = df[['feature1', 'feature2']].values

and

df['target'].values

instead of explicitly casting to np.ndarray yourself.
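For example (a rough sketch with made-up rows, assuming a DataFrame with the columns described in the question):

import pandas as pd

df = pd.DataFrame({
    'feature1': ['some short english text', 'some more short english text'],
    'feature2': ['a paragraph of english text', 'a second paragraph of english text'],
    'target':   ['label_a', 'label_b'],   # hypothetical label strings
})

data = df[['feature1', 'feature2']].values   # 2-D object array, shape (n_samples, 2)
classes = df['target'].values                # 1-D array of label strings
print(data.shape)     # (2, 2)
print(classes.shape)  # (2,)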
It looks to me like each "document" in the array being fed to the vectorizer is itself an ndarray (holding both text fields) rather than a single string.
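You can check that quickly on the data array printed above (a small sketch, assuming data was built as in the question):

print(type(data[0]))    # <class 'numpy.ndarray'> -- each "document" is an array, not a string
print(data[0])          # ['some short english text' 'a paragraph of english text']

# A text vectorizer expects a 1-D sequence of plain strings, e.g. one column at a time:
print(data[:, 0])                        # 1-D array of strings
print(isinstance(data[:, 0][0], str))    # True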