
AttributeError: lower not found; using a Pipeline with a CountVectorizer in scikit-learn

I have a corpus as such:

X_train = [ ['this is an dummy example'],
      ['in reality this line is very long'],
      ...
      ['here is a last text in the training set']
    ]

and some labels:

y_train = [1, 5, ... , 3]

I would like to use Pipeline and GridSearch as follows:

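from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('reg', SGDRegressor())
])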


parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__use_idf': (True, False),
    'reg__alpha': (0.00001, 0.000001),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)

grid_search.fit(X_train, y_train)

When I run this, I get an error saying AttributeError: lower not found.

I searched and found a question about this error here, which led me to believe that the problem was my text not being tokenized. That sounded like it hit the nail on the head, since I was using a list of lists as input data, where each inner list contained one single unbroken string.

I cooked up a quick and dirty tokenizer to test this theory:

def my_tokenizer(X):
    newlist = []
    for alist in X:
        newlist.append(alist[0].split(' '))
    return newlist

which does what it is supposed to, but when I use it in the arguments to the CountVectorizer:

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=my_tokenizer)),

...I still get the same error as if nothing happened.

I did notice that I can circumvent the error by commenting out the CountVectorizer in my Pipeline, which is strange: I didn't think you could use TfidfTransformer() without first having a data structure to transform (in this case, the matrix of counts).

Why do I keep getting this error? Actually, it would be nice to know what this error means! (Was lower called to convert the text to lowercase or something? I can't tell from reading the stack trace). Am I misusing the Pipeline...or is the problem really an issue with the arguments to the CountVectorizer alone?

Any advice would be greatly appreciated.

tumultous_rooster asked Nov 09 '15 09:11



1 Answer

It's because your dataset is in the wrong format. You should pass "An iterable which yields either str, unicode or file objects" to CountVectorizer's fit function (or to the pipeline; it doesn't matter), not an iterable over other iterables that contain the texts (as in your code). In your case the list is the iterable, and you should pass a flat list whose members are strings (not other lists).

That is, your dataset should look like:

X_train = ['this is an dummy example',
      'in reality this line is very long',
      ...
      'here is a last text in the training set'
    ]
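If the data already exists in the nested form from the question, one way to get the flat form is a list comprehension. This is only a minimal sketch: the names `nested_X_train` and `flat_X_train` are made up for illustration, and it assumes every inner list holds exactly one string. It also shows why a nested document cannot be lowercased: a list has no `lower` method, while a str does.

```python
# Nested corpus as in the question: each document is wrapped in its own list.
# (Hypothetical variable names, used only for this illustration.)
nested_X_train = [
    ['this is an dummy example'],
    ['in reality this line is very long'],
    ['here is a last text in the training set'],
]

# A list has no .lower() method, but the string inside it does.
list_has_lower = hasattr(nested_X_train[0], 'lower')
str_has_lower = hasattr(nested_X_train[0][0], 'lower')

# Flatten: take the single string out of each inner list.
# Assumes each inner list contains exactly one string.
flat_X_train = [doc[0] for doc in nested_X_train]
```

The resulting `flat_X_train` is the kind of input that can be passed to the pipeline's fit in place of the nested list.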

This example is also very useful: Sample pipeline for text feature extraction and evaluation
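As for why the custom tokenizer did not help: as far as I understand it, with the default settings the vectorizer lowercases each document before it hands it to the tokenizer, and the tokenizer is called on one document (a single string) at a time, not on the whole corpus. The sketch below is a simplified pure-Python stand-in for that order of operations, not scikit-learn's actual code; `per_document_tokenizer` and `simplified_analyze` are hypothetical names.

```python
def per_document_tokenizer(doc):
    # Hypothetical tokenizer: receives ONE document (a str) and returns tokens.
    return doc.split(' ')


def simplified_analyze(doc, tokenizer):
    # Simplified stand-in for the per-document steps (an assumption, not the
    # library's real implementation): lowercase first, then tokenize.
    return tokenizer(doc.lower())


# A plain string works: it is lowercased, then split into tokens.
tokens = simplified_analyze('This is an dummy example', per_document_tokenizer)

# A list-wrapped document fails at .lower(), before the tokenizer ever runs.
try:
    simplified_analyze(['this is an dummy example'], per_document_tokenizer)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
```

In this sketch the failure happens before the tokenizer is reached, which is consistent with the same error appearing whether or not a tokenizer is supplied.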

Ibraim Ganiev answered Sep 22 '22 05:09
