
Pandas for Python: Exception: Data must be 1-dimensional

Here's the code I got from a tutorial:

# Data Preprocessing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding categorical data
# Encoding the Independent Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

This is the X matrix with the encoded dummy variables:

1.000000000000000000e+00    0.000000000000000000e+00    0.000000000000000000e+00    4.400000000000000000e+01    7.200000000000000000e+04
0.000000000000000000e+00    0.000000000000000000e+00    1.000000000000000000e+00    2.700000000000000000e+01    4.800000000000000000e+04
0.000000000000000000e+00    1.000000000000000000e+00    0.000000000000000000e+00    3.000000000000000000e+01    5.400000000000000000e+04
0.000000000000000000e+00    0.000000000000000000e+00    1.000000000000000000e+00    3.800000000000000000e+01    6.100000000000000000e+04
0.000000000000000000e+00    1.000000000000000000e+00    0.000000000000000000e+00    4.000000000000000000e+01    6.377777777777778101e+04
1.000000000000000000e+00    0.000000000000000000e+00    0.000000000000000000e+00    3.500000000000000000e+01    5.800000000000000000e+04
0.000000000000000000e+00    0.000000000000000000e+00    1.000000000000000000e+00    3.877777777777777857e+01    5.200000000000000000e+04
1.000000000000000000e+00    0.000000000000000000e+00    0.000000000000000000e+00    4.800000000000000000e+01    7.900000000000000000e+04
0.000000000000000000e+00    1.000000000000000000e+00    0.000000000000000000e+00    5.000000000000000000e+01    8.300000000000000000e+04
1.000000000000000000e+00    0.000000000000000000e+00    0.000000000000000000e+00    3.700000000000000000e+01    6.700000000000000000e+04

The problem is that there are no column labels. I tried:

something = pd.get_dummies(X)

But I get the following exception:

Exception: Data must be 1-dimensional
Tyler L asked Aug 22 '17 23:08



1 Answer

Most sklearn methods don't care about column names, as they're mainly concerned with the math behind the ML algorithms they implement. You can add column names back onto the OneHotEncoder output after fit_transform(), if you can figure out the label encoding ahead of time.

First, grab the column names of your predictors from the original dataset, excluding the first one (which we reserve for LabelEncoder):

X_cols = dataset.columns[1:-1]
X_cols
# Index(['Age', 'Salary'], dtype='object')

Now get the order of the encoded labels. In this particular case, it looks like LabelEncoder() organizes its integer mapping alphabetically:

labels = labelencoder_X.fit(X[:, 0]).classes_ 
labels
# ['France' 'Germany' 'Spain']
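
The alphabetical order is not a coincidence: LabelEncoder keeps its classes_ in sorted order. As a rough illustration, the NumPy-only sketch below reproduces the same kind of mapping with np.unique. The sample values are hypothetical stand-ins for the first column of X, and the snippet only mirrors the behaviour; it is not sklearn's implementation.

```python
import numpy as np

# Hypothetical sample values standing in for the first column of X.
countries = np.array(['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France'])

# np.unique returns the distinct values in sorted order, plus the position
# of each original value within that sorted array (its integer code).
labels, codes = np.unique(countries, return_inverse=True)

print(labels)  # ['France' 'Germany' 'Spain']
print(codes)   # [0 2 1 2 1 0]
```

Because the codes index into the sorted labels, the one-hot columns produced later follow the same France, Germany, Spain order.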

Combine these column names, and then add them to X when you convert to DataFrame:

# X gets re-used, so make sure to define encoded_cols after this line
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
encoded_cols = np.append(labels, X_cols)
# ...
X = onehotencoder.fit_transform(X).toarray()
encoded_df = pd.DataFrame(X, columns=encoded_cols)

encoded_df
   France  Germany  Spain        Age        Salary
0     1.0      0.0    0.0  44.000000  72000.000000
1     0.0      0.0    1.0  27.000000  48000.000000
2     0.0      1.0    0.0  30.000000  54000.000000
3     0.0      0.0    1.0  38.000000  61000.000000
4     0.0      1.0    0.0  40.000000  63777.777778
5     1.0      0.0    0.0  35.000000  58000.000000
6     0.0      0.0    1.0  38.777778  52000.000000
7     1.0      0.0    0.0  48.000000  79000.000000
8     0.0      1.0    0.0  50.000000  83000.000000
9     1.0      0.0    0.0  37.000000  67000.000000

NB: For example data, I'm using this dataset, which appears to be very similar or identical to the one used by the OP. Note that the output matches the OP's X matrix.
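
As an aside on the original attempt: pd.get_dummies works on a Series or a DataFrame, so handing it the 2-D NumPy array X is the likely trigger for the "Data must be 1-dimensional" exception. If labeled columns are the goal, one possible alternative is to run get_dummies on the DataFrame itself, before converting to arrays. The sketch below uses a few hypothetical rows shaped like the data above, and 'Country' is an assumed name for the first column, since the real name is not shown here.

```python
import pandas as pd

# Hypothetical rows shaped like the question's data.
# 'Country' is an assumed column name, not taken from Data.csv.
dataset = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany'],
    'Age': [44.0, 27.0, 30.0],
    'Salary': [72000.0, 48000.0, 54000.0],
})

# Encode only the categorical column; each dummy column is named
# after its category, prefixed with the source column name.
encoded = pd.get_dummies(dataset, columns=['Country'], dtype=float)

print(encoded.columns.tolist())
# ['Age', 'Salary', 'Country_France', 'Country_Germany', 'Country_Spain']
```

This keeps the labels attached throughout, at the cost of the dummy columns landing after the numeric ones instead of before them.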

andrew_reece answered Oct 20 '22 20:10