How to combine features with different dimensions output using scikit-learn

I am using scikit-learn with Pipeline and FeatureUnion to extract features from different inputs. Each sample (instance) in my dataset refers to a document, and the documents have different lengths. My goal is to compute the top TF-IDF features for each document independently, but I keep getting this error message:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 2000.

2000 is the size of the training data. This is the main code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# ItemSelector and my_stopword_list are not shown here.
book_summary = Pipeline([
    ('selector', ItemSelector(key='book')),
    ('tfidf', TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True))
])

book_contents = Pipeline([('selector3', book_content_count())])

ppl = Pipeline([
    ('feats', FeatureUnion([
        ('book_summary', book_summary),
        ('book_contents', book_contents)])),
    ('clf', SVC(kernel='linear', class_weight='balanced'))  # classifier, evaluated with 5-fold cross-validation
])

I wrote two classes, one for each pipeline. My problem is with the book_contents pipeline, which processes each sample and should return a TF-IDF matrix for each book independently.

import pandas as pd

class book_content_count():
    def count_contents2(self, bookid):
        # Read the CSV file that holds the full text of this book.
        book = open('C:/TheCorpus/' + str(int(bookid)) + '_book.csv', 'r')
        book_data = pd.read_csv(book, header=0, delimiter=',', encoding='latin1', error_bad_lines=False, dtype=str)
        # This line read user_data['text'] in the original post, but user_data is never defined;
        # book_data is assumed to be the intended name.
        corpus = str([book_data['text']]).strip('[]')
        return corpus

    def transform(self, data_dict, y=None):
        # The ids in data_dict['bookid'] give the names of the files to read.
        text = data_dict['bookid'].apply(self.count_contents2)
        vec_pipe = Pipeline([('vec', TfidfVectorizer(min_df=1, lowercase=False, ngram_range=(1, 1), use_idf=True, stop_words='english'))])
        Xtr = vec_pipe.fit_transform(text)
        return Xtr

    def fit(self, x, y=None):
        return self

Sample of data (example):

title                         Summary                          bookid
The beauty and the beast      is a traditional fairy tale...    10
ocean at the end of the lane  is a 2013 novel by British        11

Each bookid then refers to a file holding the actual contents of that book.

I have tried the toarray and reshape functions, but with no luck. Any idea how to solve this issue? Thanks.

asked May 20 '18 12:05

Abrial

1 Answer

You can use Neuraxle's FeatureUnion with a custom joiner that you would need to write yourself. The joiner is a class passed to Neuraxle's FeatureUnion that merges the results of the branches in the way you need.
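
For background on the error in the question: scikit-learn's FeatureUnion places the outputs of its branches side by side, using scipy.sparse.hstack when any of them is sparse, so every branch has to return the same number of rows (one per sample). Here is a minimal sketch of that constraint. The matrices are made-up stand-ins, not the question's data:

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins for two branch outputs: one row per sample
# versus a single row for the whole batch.
summary_features = sparse.csr_matrix(np.ones((4, 3)))
content_features = sparse.csr_matrix(np.ones((1, 2)))

try:
    sparse.hstack([summary_features, content_features])
    stacking_failed = False
except ValueError as exc:
    # Row counts differ (4 vs 1), so horizontal stacking is rejected.
    stacking_failed = True
    print(exc)

# With matching row counts the same call succeeds.
aligned_features = sparse.csr_matrix(np.ones((4, 2)))
combined = sparse.hstack([summary_features, aligned_features])
print(combined.shape)  # (4, 5)
```

Whichever joiner you write, it still needs branch outputs that agree on the number of rows, or its own rule for reconciling them.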

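1. Import Neuraxle's classes, along with the NumPy and scikit-learn names used below.

import numpy as np
from neuraxle.base import NonFittableMixin, BaseStep
from neuraxle.pipeline import Pipeline
from neuraxle.steps.sklearn import SKLearnWrapper
from neuraxle.union import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC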

2. Define your custom class by inheriting from BaseStep:

class BookContentCount(BaseStep): 

    def transform(self, data_dict, y=None):
        transformed = do_things(...)  # be sure to use SKLearnWrapper if you wrap sklearn items.
        return transformed

    def fit(self, x, y=None):
        return self
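
One thing to watch when filling in the transform above: the question's transform calls fit_transform on a fresh vectorizer every time, so the vocabulary (and with it the number of columns) can differ between training and prediction. Below is a minimal sketch of the usual split, where the vocabulary is learned in fit and reused in transform. It is written as a plain scikit-learn transformer for brevity; `BookTextTfidf` and `load_text` are hypothetical names, and the toy dictionary stands in for reading the per-book files:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer


class BookTextTfidf(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: learn the vocabulary in fit, reuse it in transform."""

    def __init__(self, load_text):
        # load_text: callable mapping a book id to that book's full text.
        self.load_text = load_text

    def fit(self, book_ids, y=None):
        texts = [self.load_text(book_id) for book_id in book_ids]
        self.vectorizer_ = TfidfVectorizer().fit(texts)
        return self

    def transform(self, book_ids):
        texts = [self.load_text(book_id) for book_id in book_ids]
        # One row per book id; columns are fixed by the fitted vocabulary.
        return self.vectorizer_.transform(texts)


# Toy stand-in for the files on disk (assumption, for illustration only).
toy_books = {
    10: "beauty beast castle rose",
    11: "ocean lane pond memory",
    12: "castle pond rose",
}

step = BookTextTfidf(load_text=toy_books.get)
train = step.fit_transform([10, 11])
test = step.transform([12])
print(train.shape, test.shape)
```

The same idea should carry over to a Neuraxle step: keep the fitted vectorizer on the instance and only call its transform inside the step's transform.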

3. Create a joiner to join the results of the feature union the way you wish:

class CustomJoiner(NonFittableMixin, BaseStep):
    def __init__(self):
        BaseStep.__init__(self)
        NonFittableMixin.__init__(self)

    # def fit: is inherited from `NonFittableMixin` and simply returns self.

    def transform(self, data_inputs):
        # TODO: insert your own concatenation method here.
        result = np.concatenate(data_inputs, axis=-1)
        return result
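
Keep in mind that TfidfVectorizer returns SciPy sparse matrices, which np.concatenate is not designed to stack. As a sketch of one alternative, the joiner's transform could delegate to a helper like the one below. `join_features` is a hypothetical name, and the helper assumes every branch output is 2-D with one row per sample:

```python
import numpy as np
from scipy import sparse


def join_features(data_inputs):
    """Stack per-branch feature matrices side by side (hypothetical helper)."""
    row_counts = {block.shape[0] for block in data_inputs}
    if len(row_counts) != 1:
        raise ValueError(f"branches disagree on row count: {sorted(row_counts)}")
    if any(sparse.issparse(block) for block in data_inputs):
        # Convert every block to CSR so sparse and dense outputs stack together.
        return sparse.hstack([sparse.csr_matrix(block) for block in data_inputs]).tocsr()
    return np.hstack(data_inputs)


tfidf_like = sparse.csr_matrix(np.eye(3))  # 3 samples, 3 sparse features
counts_like = np.arange(6).reshape(3, 2)   # 3 samples, 2 dense features
joined = join_features([tfidf_like, counts_like])
print(joined.shape)  # (3, 5)
```

The explicit row-count check is a design choice: it reports a mismatch in terms of branches rather than letting the stacking call fail with a less direct message.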

4. Finally create your pipeline by passing the joiner to the FeatureUnion:

book_summary = Pipeline([
    ItemSelector(key='book'),
    TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True)
])

p = Pipeline([
    FeatureUnion([
        book_summary,
        BookContentCount()
    ], 
        joiner=CustomJoiner()
    ),
    SVC(kernel='linear', class_weight='balanced')
])

Note: to convert your Neuraxle pipeline back into a scikit-learn pipeline, you can call p = p.tosklearn().

To learn more about Neuraxle: https://github.com/Neuraxio/Neuraxle

More examples from the documentation: https://www.neuraxle.org/stable/examples/index.html

answered Sep 20 '22 00:09

Guillaume Chevalier

Tags: python-3.x, numpy, scikit-learn, pipeline, neuraxle