Custom transformer for sklearn Pipeline that alters both X and y

I want to create my own transformer for use with the sklearn Pipeline.

I am creating a class that implements both fit and transform methods. The purpose of the transformer will be to remove rows from the matrix that have more than a specified number of NaNs.

My problem is how to change both the X and y matrices that are passed to the transformer.

I believe this has to be done in the fit method, since it has access to both X and y. Because Python passes arguments by assignment, once I reassign X to a new matrix with fewer rows, the reference to the original X is lost (and of course the same is true for y). Is it possible to maintain this reference?

I'm using a pandas DataFrame to easily drop the rows that have too many NaNs, though this may not be the right way to do it for my use case. The current code looks like this:

    class Dropna():

        # thresh is max number of NaNs allowed in a row
        def __init__(self, thresh=0):
            self.thresh = thresh

        def fit(self, X, y):
            total = X.shape[1]
            # +1 to account for 'y' being added to the dframe
            new_thresh = total + 1 - self.thresh
            df = pd.DataFrame(X)
            df['y'] = y
            df.dropna(thresh=new_thresh, inplace=True)
            X = df.drop('y', axis=1).values
            y = df['y'].values
            return self

        def transform(self, X):
            return X

asked Aug 28 '14 01:08

MarkAWard

1 Answer

Modifying the sample axis, e.g. removing samples, does not (yet?) comply with the scikit-learn transformer API. So if you need to do this, do it outside any calls to scikit-learn, as preprocessing.
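As a minimal sketch of that preprocessing step, the rows can be filtered with a boolean mask so that X and y stay aligned before either one reaches scikit-learn. The helper name drop_sparse_rows and its max_nans parameter are made up for illustration and are not part of any library; the sketch also assumes X is a 2-D float array.

```python
import numpy as np


def drop_sparse_rows(X, y, max_nans=0):
    """Return X and y without the rows of X that hold more than max_nans NaNs.

    drop_sparse_rows and max_nans are hypothetical names, not scikit-learn API.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # One boolean per row: True when the row has few enough NaNs to keep.
    keep = np.isnan(X).sum(axis=1) <= max_nans
    # The same mask indexes both arrays, so samples and targets stay paired.
    return X[keep], y[keep]


X = np.array([[1.0, 2.0, 3.0],
              [np.nan, np.nan, 6.0],
              [7.0, np.nan, 9.0]])
y = np.array([0, 1, 0])

X_clean, y_clean = drop_sparse_rows(X, y, max_nans=1)
```

The cleaned arrays can then be handed to the pipeline's fit method as usual. The same filtering has to be applied by hand to any data used later for prediction, which is the cost of keeping it outside the Pipeline.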

As it is now, the transformer API is used to transform the features of a given sample into something new. This can implicitly contain information from other samples, but samples are never deleted.

Another option is to impute the missing values. But again, if you need to delete samples, treat it as preprocessing before using scikit-learn.
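For the imputation route, a rough sketch could look like the following. It assumes a scikit-learn version that provides sklearn.impute.SimpleImputer (0.20 or later); the mean strategy, the toy data, and the LogisticRegression classifier are arbitrary choices for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data with missing entries; every row is kept.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [8.0, 9.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
    # Replace each NaN with the mean of its column, learned during fit.
    ("impute", SimpleImputer(strategy="mean")),
    # Illustrative final estimator; any classifier or regressor would do.
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)

# Inspect what the imputation step produces for the training data.
X_imputed = pipe.named_steps["impute"].transform(X)
```

Because imputation changes feature values and never the number of rows, it fits the transformer API and can live inside the Pipeline, unlike row removal.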

answered Oct 04 '22 17:10

eickenberg
