How to combine features with different output dimensions using scikit-learn

I am using scikit-learn's Pipeline and FeatureUnion to extract features from different inputs. Each sample (instance) in my dataset refers to a document, and the documents have different lengths. My goal is to compute the top TF-IDF terms for each document independently, but I keep getting this error message:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 2000.

2000 is the size of the training data. This is the main code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

book_summary = Pipeline([
    ('selector', ItemSelector(key='book')),
    ('tfidf', TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True)),
])

book_contents = Pipeline([('selector3', book_content_count())])

ppl = Pipeline([
    ('feats', FeatureUnion([
        ('book_summary', book_summary),
        ('book_contents', book_contents),
    ])),
    ('clf', SVC(kernel='linear', class_weight='balanced')),  # classifier, evaluated with 5-fold cross-validation
])
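
ItemSelector is not shown in the question. A minimal sketch, assuming it behaves like the helper of the same name from scikit-learn's older heterogeneous-data example (an assumption, not necessarily the code used here), simply picks one field out of a dict-like input:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class ItemSelector(BaseEstimator, TransformerMixin):
    """Assumed helper: select one field from a dict-like container."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        # Nothing to learn; the selector is stateless.
        return self

    def transform(self, data_dict):
        # Works for a dict of lists or for a pandas DataFrame column.
        return data_dict[self.key]


# Toy input with made-up values, only to show the call.
sample = {'book': ['first summary', 'second summary'], 'bookid': [10, 11]}
selected = ItemSelector(key='book').transform(sample)
```

The selected field keeps one entry per sample, which is what the downstream TfidfVectorizer needs.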

I wrote two classes to handle the pipeline steps. My problem is with the book_contents pipeline, which processes each sample and returns a TF-IDF matrix for each book independently.

import pandas as pd

class book_content_count():
    def count_contents2(self, bookid):
        book = open('C:/TheCorpus/' + str(int(bookid)) + '_book.csv', 'r')
        book_data = pd.read_csv(book, header=0, delimiter=',', encoding='latin1', error_bad_lines=False, dtype=str)
        return corpus  # the code that builds corpus from book_data is omitted here

    def transform(self, data_dict, y=None):
        data_dict['bookid']  # from here take the name
        # text (the contents of one book) is not defined in this excerpt
        vec_pipe = Pipeline([('vec', TfidfVectorizer(min_df=1, lowercase=False, ngram_range=(1, 1), use_idf=True, stop_words='english'))])
        Xtr = vec_pipe.fit_transform(text)
        return Xtr

    def fit(self, x, y=None):
        return self
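
For context, scikit-learn's FeatureUnion stacks the outputs of its transformers side by side, so every transformer has to return one row per input sample. Here is a minimal sketch of the mismatch, using made-up shapes (2000 rows from one branch, a single row from the other) and calling scipy.sparse.hstack directly rather than running the full pipeline. The exact wording of the message may differ between SciPy versions:

```python
import numpy as np
from scipy import sparse

# Stand-ins for the two FeatureUnion branches (shapes are illustrative).
summary_features = sparse.csr_matrix(np.ones((2000, 5)))  # one row per sample
book_features = sparse.csr_matrix(np.ones((1, 3)))        # a single row

try:
    sparse.hstack([summary_features, book_features])
    error_message = None
except ValueError as exc:
    # Horizontal stacking needs the same number of rows in every block.
    error_message = str(exc)

print(error_message)
```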

Sample of data (example):

title                         Summary                          bookid
The beauty and the beast      is a traditional fairy tale...    10
ocean at the end of the lane  is a 2013 novel by British        11

Each id then refers to a text file with the actual contents of that book.

I have tried the toarray and reshape functions, but with no luck. Any idea how to solve this issue? Thanks.

1 Answer

You can use Neuraxle's FeatureUnion with a custom joiner that you write yourself. The joiner is a class passed to Neuraxle's FeatureUnion that merges the results together the way you expect.

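1. Import Neuraxle's classes, along with the NumPy and scikit-learn names used below.

import numpy as np
from neuraxle.base import NonFittableMixin, BaseStep
from neuraxle.pipeline import Pipeline
from neuraxle.steps.sklearn import SKLearnWrapper
from neuraxle.union import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC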

2. Define your custom class by inheriting from BaseStep:

class BookContentCount(BaseStep): 

    def transform(self, data_dict, y=None):
        transformed = do_things(...)  # be sure to use SKLearnWrapper if you wrap sklearn items.
        return transformed

    def fit(self, x, y=None):
        return self
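
As a rough illustration of what do_things might involve when each book should end up as one fixed-width row, here is a minimal sketch. It is written against plain scikit-learn base classes so that it makes no assumptions about Neuraxle's API. BookTextTfidf and load_text are hypothetical names, and the in-memory dict stands in for the per-book files. The point is that the vocabulary is learned once in fit, not refitted inside transform:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer


class BookTextTfidf(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one TF-IDF row per book id."""

    def __init__(self, load_text):
        # load_text is an assumed callable mapping a book id to its full text.
        self.load_text = load_text

    def fit(self, book_ids, y=None):
        # Learn the vocabulary once, on the training books only.
        self.vectorizer_ = TfidfVectorizer(stop_words='english')
        self.vectorizer_.fit([self.load_text(b) for b in book_ids])
        return self

    def transform(self, book_ids):
        # Reuse the fitted vocabulary so every call yields the same columns.
        return self.vectorizer_.transform([self.load_text(b) for b in book_ids])


# Toy in-memory corpus standing in for the per-book files.
texts = {10: "a merchant meets a beast in a castle",
         11: "a man remembers a pond by the lane"}
features = BookTextTfidf(texts.get).fit_transform([10, 11])
print(features.shape[0])  # one row per book id
```

With one row per sample, the output can be stacked next to the summary features without a row mismatch.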

3. Create a joiner to join the results of the feature union the way you wish:

class CustomJoiner(NonFittableMixin, BaseStep):
    def __init__(self):
        BaseStep.__init__(self)
        NonFittableMixin.__init__(self)

    # def fit: is inherited from `NonFittableMixin` and simply returns self.

    def transform(self, data_inputs):
        # TODO: insert your own concatenation method here.
        result = np.concatenate(data_inputs, axis=-1)
        return result
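
np.concatenate expects dense arrays, while TfidfVectorizer returns SciPy sparse matrices. As one possible concatenation method for the TODO above, here is a minimal sketch of a hypothetical helper (join_feature_blocks is an illustrative name, not part of Neuraxle). It assumes the joiner receives a list with one 2-D output per branch, and it checks the row counts before stacking:

```python
import numpy as np
from scipy import sparse


def join_feature_blocks(blocks):
    """Hypothetical helper: concatenate per-branch outputs column-wise."""
    row_counts = {block.shape[0] for block in blocks}
    if len(row_counts) != 1:
        # Fail early with a readable message instead of a stacking error.
        raise ValueError(f"row counts differ: {sorted(row_counts)}")
    if any(sparse.issparse(block) for block in blocks):
        # Convert every block to CSR so dense and sparse inputs can be mixed.
        return sparse.hstack([sparse.csr_matrix(b) for b in blocks]).tocsr()
    return np.concatenate(blocks, axis=-1)


dense = np.zeros((4, 2))
tfidf_like = sparse.csr_matrix(np.ones((4, 3)))
joined = join_feature_blocks([dense, tfidf_like])
print(joined.shape)  # (4, 5)
```

Inside CustomJoiner.transform, such a helper could be called in place of the np.concatenate line.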

4. Finally, create your pipeline by passing the joiner to the FeatureUnion:

book_summary = Pipeline([
    TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True)
])

p = Pipeline([
    FeatureUnion([book_summary, BookContentCount()], joiner=CustomJoiner()),
    SVC(kernel='linear', class_weight='balanced')
])

Note: if you want to turn your Neuraxle pipeline back into a scikit-learn pipeline, you can do p = p.tosklearn().

To learn more about Neuraxle: https://github.com/Neuraxio/Neuraxle

More examples from the documentation: https://www.neuraxle.org/stable/examples/index.html

