
unable to use FeatureUnion in scikit-learn due to different dimensions

I'm trying to use FeatureUnion to extract different features from a data structure, but it fails because of mismatched dimensions: ValueError: blocks[0,:] has incompatible row dimensions


Implementation

My FeatureUnion is built the following way:

    features = FeatureUnion([
        ('f1', Pipeline([
            ('get', GetItemTransformer('f1')),
            ('transform', vectorizer_f1)
        ])),
        ('f2', Pipeline([
            ('get', GetItemTransformer('f2')),
            ('transform', vectorizer_f1)
        ]))
    ])

GetItemTransformer is used to pull different parts of the data out of the same structure. The idea is described here in the scikit-learn issue tracker.

The structure itself is stored as {'f1': data_f1, 'f2': data_f2}, where data_f1 and data_f2 are lists of different lengths.
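For readers without the linked issue at hand, here is a minimal sketch of what such a transformer might look like. The class body, the constructor argument name `field`, and the sample data are assumptions for illustration, not the original implementation.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class GetItemTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: select one entry from a dict-like input."""

    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None):
        # Nothing to learn; the transformer only selects data.
        return self

    def transform(self, X):
        # Return the part of the structure stored under the given key.
        return X[self.field]


# Made-up data in the {'f1': ..., 'f2': ...} layout described above.
data = {'f1': ['a b', 'b c', 'c d'], 'f2': ['x', 'y z', 'z']}
f1_docs = GetItemTransformer('f1').fit(data).transform(data)
f2_docs = GetItemTransformer('f2').fit(data).transform(data)
```

Note that every branch of a FeatureUnion has to yield one row per sample, so the entries selected this way need the same number of samples.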


Question

I assume the error occurs because the y vector has a different length from the data fields, but how can I scale the vector so that it fits in both cases?

jwacalex asked Sep 11 '14 19:09


2 Answers

Here's what worked for me:

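    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import FeatureUnion, Pipeline

    class ArrayCaster(BaseEstimator, TransformerMixin):
        def fit(self, x, y=None):
            return self

        def transform(self, data):
            # Cast the 1-D selector output of shape (n,) to an (n, 1) column so
            # FeatureUnion can stack it next to the tf-idf matrix row by row.
            column = np.transpose(np.matrix(data))
            print(data.shape, column.shape)  # debug: shapes before and after
            return column

    # ItemSelector is not defined in this snippet; it is assumed to be a
    # custom transformer that returns data[key].
    FeatureUnion([
        ('text', Pipeline([
            ('selector', ItemSelector(key='text')),
            ('vect', CountVectorizer(ngram_range=(1, 1), binary=True, min_df=3)),
            ('tfidf', TfidfTransformer())
        ])),
        ('other data', Pipeline([
            ('selector', ItemSelector(key='has_foriegn_char')),
            ('caster', ArrayCaster())
        ]))
    ])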

Josh answered Sep 27 '22 18:09


I don't know if this applies to your question, but we ran into the same error in a slightly different situation and just solved it.

Our f1 entries were each lists of 15 numeric values and we needed to do tf-idf on f2. This generated the same error about incompatible row dimensions.

After running it through the debugger, we found that the shapes of our matrices were subtly different going into the hstack() call in FeatureUnion: (2659,) and (2659, 706).

Once we cast f1 to a 2D NumPy array, its shape became (2659, 15) and the hstack call worked.

The cast was something like this: f1 = np.array(list(f1)).
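A small sketch of the effect of that cast, using made-up stand-in data (4 samples instead of 2659, 3 values per entry instead of 15, and a dummy matrix in place of the tf-idf output). The variable names are assumptions for illustration only.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for f1: a 1-D object array whose entries are lists.
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
f1 = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    f1[i] = row

# Hypothetical stand-in for the tf-idf matrix built from f2: 4 samples, 5 terms.
f2_tfidf = sparse.csr_matrix(np.ones((4, 5)))

print(f1.shape)        # (4,)   one dimension only, no columns

# The cast from the answer: a list of equal-length lists becomes a 2-D array.
f1_2d = np.array(list(f1))
print(f1_2d.shape)     # (4, 3) one row per sample

# With matching row counts, the blocks can be stacked side by side.
combined = sparse.hstack([sparse.csr_matrix(f1_2d), f2_tfidf]).tocsr()
print(combined.shape)  # (4, 8)
```

The cast only yields a 2-D array when every entry has the same length, as was the case with our 15-value lists.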

Jim K. answered Sep 27 '22 19:09