 

Getting 'ValueError: shapes not aligned' on SciKit Linear Regression

Quite new to SciKit and linear algebra/machine learning with Python in general, so I can't seem to solve the following:

I have a training set and a test set of data, containing both continuous and discrete/categorical values. The CSV files are loaded into Pandas DataFrames and match in shape, being (1460, 81) and (1459, 81). However, after using Pandas' get_dummies, the shapes of the DataFrames change to (1460, 306) and (1459, 294). So, when I do linear regression with the SciKit Linear Regression module, it fits a model with 306 variables and then tries to predict on data with only 294. This then, naturally, leads to the following error:

ValueError: shapes (1459,294) and (306,1) not aligned: 294 (dim 1) != 306 (dim 0)

How could I tackle such a problem? Could I somehow reshape the (1459, 294) to match the other one?

Thanks and I hope I've made myself clear :)

asked Dec 21 '16 by Koen


1 Answer

This is an extremely common problem when dealing with categorical data. There are differing opinions on how to best handle this.

One possible approach is to apply a function to categorical features that limits the set of possible options. For example, if your feature contained the letters of the alphabet, you could encode features only for A, B, C, D, and 'Other/Unknown'. That way, you can apply the same function at test time and sidestep the issue entirely. A clear downside, of course, is that by reducing the feature space you may lose meaningful information.
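As a rough sketch of that first approach (the function name and the example data here are hypothetical, not from the question):

```python
import pandas as pd

# Collapse any category outside an allowed set into 'Other', so that
# train and test always produce the same dummy columns.
def limit_categories(series, allowed):
    return series.where(series.isin(allowed), other='Other')

allowed = {'A', 'B', 'C', 'D'}
train = pd.Series(['A', 'B', 'C', 'D'])
test = pd.Series(['A', 'AA', 'B'])  # 'AA' never appeared in training

print(limit_categories(test, allowed).tolist())  # ['A', 'Other', 'B']
```

Applying `limit_categories` to both sets before `get_dummies` guarantees matching column counts, at the cost of lumping all unseen values together.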

Another approach is to build a model on your training data, with whichever dummies are naturally created, and treat that as the baseline for your model. When you predict at test time, you transform your test data in the same way your training data was transformed. For example, if your training set had the letters of the alphabet in a feature, and the same feature in the test set contained a value of 'AA', you would ignore that value when making a prediction. Your current situation is the mirror image of this (your test set is missing columns rather than containing extra ones), but the premise is the same: you need to create the missing features on the fly. This approach also has downsides, of course.

The second approach is what you mention in your question, so I'll go through it with pandas.

By using get_dummies you're encoding the categorical features into multiple one-hot encoded features. What you could do is force your test data to match your training data by using reindex, like this:

test_encoded = pd.get_dummies(test_data, columns=['your columns'])
test_encoded_for_model = test_encoded.reindex(columns = training_encoded.columns, 
    fill_value=0)

This will encode the test data in the same way as your training data, filling in 0 for dummy features that weren't created by encoding the test data but were created during the training process.

You could just wrap this into a function, and apply it to your test data on the fly. You don't need the encoded training data in memory (which I access with training_encoded.columns) if you create an array or list of the column names.
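A minimal sketch of such a wrapper (the helper name `encode_like_training` and the toy data are illustrative, not from the question):

```python
import pandas as pd

# Encode a DataFrame and align its columns to the training encoding,
# filling missing dummies with 0 and dropping unseen ones.
def encode_like_training(df, categorical_cols, training_columns):
    encoded = pd.get_dummies(df, columns=categorical_cols)
    return encoded.reindex(columns=training_columns, fill_value=0)

train = pd.DataFrame({'letter': ['A', 'B', 'C'], 'x': [1, 2, 3]})
train_encoded = pd.get_dummies(train, columns=['letter'])
training_columns = list(train_encoded.columns)  # save this list, not the data

test = pd.DataFrame({'letter': ['A', 'AA'], 'x': [4, 5]})
test_encoded = encode_like_training(test, ['letter'], training_columns)

print(list(test_encoded.columns))
# ['x', 'letter_A', 'letter_B', 'letter_C'] -- matches training
```

Note that 'AA' simply disappears (its `letter_AA` dummy is dropped by the reindex), while `letter_B` and `letter_C` are created and filled with 0, which is exactly the behavior described above.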

answered Oct 31 '22 by Nick Becker