How to perform StandardScaler on pandas dataframe with a column/columns containing numpy.ndarrays?

Question

I have a pandas dataframe that has some columns with numpy.ndarrays:

  col1         col2           col3         col4
0  4    array([34, 56, 234])   7     array([765, 654])
1  3    array([11, 598, 1])    89    array([34, 90])

And I would like to perform some type of scaling on it.

I have done the pretty standard thing of:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)


from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

and I run into the pretty expected error of:

ValueError: setting an array element with a sequence.

I need help standardizing these numpy arrays along with everything else!

Mohsin hasan · Accepted Answer

StandardScaler expects every column to hold numeric scalars, but col2 and col4 hold sequences, hence the error.
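To see the failure in isolation, here is a minimal sketch that rebuilds the two example rows from the question (the frame is retyped from the question's table, not the asker's real data) and passes them straight to StandardScaler. The exact exception type and message may differ between library versions, so the snippet only records whatever is raised.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame mirroring the question: col2 and col4 hold arrays, not scalars.
df = pd.DataFrame({
    "col1": [4, 3],
    "col2": [np.array([34, 56, 234]), np.array([11, 598, 1])],
    "col3": [7, 89],
    "col4": [np.array([765, 654]), np.array([34, 90])],
})

# The array-valued columns are stored with dtype object.
print(df.dtypes)

try:
    StandardScaler().fit_transform(df)
    error_message = None
except (ValueError, TypeError) as exc:
    # scikit-learn cannot turn a cell that contains an array into one float.
    error_message = str(exc)
    print(type(exc).__name__, error_message)
```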

I think it is best to treat the columns with sequences separately and then combine them with the rest of the data.

For now, I will assume that within a given column every row's sequence has the same number of elements, e.g. every row of col2 holds a 3-element array.
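That assumption is cheap to verify before scaling. The sketch below uses a hypothetical helper, sequence_lengths, that is not part of pandas or scikit-learn; it collects the distinct sequence lengths in a column and stops if a column turns out to be ragged.

```python
import numpy as np
import pandas as pd

def sequence_lengths(column: pd.Series) -> set:
    """Return the set of distinct sequence lengths found in a column."""
    return set(column.map(len))

# Only the sequence columns from the question's example.
df = pd.DataFrame({
    "col2": [np.array([34, 56, 234]), np.array([11, 598, 1])],
    "col4": [np.array([765, 654]), np.array([34, 90])],
})

for name in ["col2", "col4"]:
    lengths = sequence_lengths(df[name])
    if len(lengths) != 1:
        # np.vstack would fail later on ragged rows, so fail early instead.
        raise ValueError(f"{name} has ragged sequences: lengths {sorted(lengths)}")
    print(name, "has fixed length", lengths.pop())
```

If a column does turn out to be ragged, the rest of this answer does not apply as written; padding or truncating would be a separate modelling decision.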

StandardScaler calculates the mean and std of each column individually, so there are two approaches for sequence columns:

Approach 1: Elements at all positions of the sequence come from the same distribution.

In this case, you should compute a single mean and std over all values. Fit StandardScaler on the flattened array, transform it, and then reshape the result back to the original shape.

Approach 2: Elements at different positions of the sequence come from different distributions.

In this scenario, a single sequence column can be converted to a 2D numpy array with one array column per position. You can fit StandardScaler on that 2D array (the mean and std of each position are calculated separately) and convert it back to a single column after transformation.
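To make the difference concrete, the following numpy-only sketch applies both formulas by hand to a small stacked column. The first two rows come from the question's col2; the third row is made up so that each position has more than two values. np.std defaults to the population standard deviation (ddof=0), which is also what StandardScaler uses.

```python
import numpy as np

# Stacked sequence column: 3 rows, 3 positions each (third row is invented).
seq = np.array([[34.0, 56.0, 234.0],
                [11.0, 598.0, 1.0],
                [20.0, 300.0, 100.0]])

# Approach 1: one mean and one std pooled over every value in the column.
pooled = (seq - seq.mean()) / seq.std()

# Approach 2: one mean and one std per position (per column of the 2D array).
per_position = (seq - seq.mean(axis=0)) / seq.std(axis=0)

# Pooled scaling standardizes the column as a whole ...
print(pooled.mean(), pooled.std())
# ... while per-position scaling standardizes each position on its own.
print(per_position.mean(axis=0), per_position.std(axis=0))
```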

Below is code for both approaches:

import numpy as np
from sklearn.preprocessing import StandardScaler

# numeric columns should work as expected
X_train_1 = X_train[['col1', 'col3']]
X_test_1 = X_test[['col1', 'col3']]

sc = StandardScaler()
X_train_1 = sc.fit_transform(X_train_1)
X_test_1 = sc.transform(X_test_1)

# first convert the sequence column to a 2D array of shape (n_rows, seq_len)
X_train_col2 = np.vstack(X_train['col2'].values).astype(float)
X_test_col2 = np.vstack(X_test['col2'].values).astype(float)

# for sequence columns, there are two approaches; run only one of them,
# because both write their result to X_train_2 / X_test_2

# Approach 1: flatten to a single feature, scale, then restore the shape
sc_col2 = StandardScaler()
X_train_2 = sc_col2.fit_transform(X_train_col2.flatten().reshape(-1, 1))
X_train_2 = X_train_2.reshape(X_train_col2.shape)

X_test_2 = sc_col2.transform(X_test_col2.flatten().reshape(-1, 1))
X_test_2 = X_test_2.reshape(X_test_col2.shape)


# Approach 2: scale each position of the sequence as its own feature
sc_col2 = StandardScaler()
X_train_2 = sc_col2.fit_transform(X_train_col2)

X_test_2 = sc_col2.transform(X_test_col2)

# To assign back to the dataframe, you can do the following
# (work on a copy to avoid pandas' SettingWithCopyWarning):
X_test = X_test.copy()
X_test["col2_scaled"] = X_test_2.tolist()

# To stack with the other numpy arrays (uses whichever approach ran last)
X_train_scaled = np.hstack((X_train_1, X_train_2))

In approach 2, it is also possible to stack all columns first and then apply StandardScaler to all of them in one shot.
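A minimal sketch of that one-shot variant follows. The helper to_wide_array is hypothetical (it is not a pandas or scikit-learn function), and the train and test frames are made-up data shaped like the question's example. Each sequence column is expanded to one array column per position, everything is stacked side by side, and a single StandardScaler is fitted on the result.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def to_wide_array(frame, numeric_cols, seq_cols):
    """Stack numeric columns next to the expanded sequence columns."""
    parts = [frame[numeric_cols].to_numpy(dtype=float)]
    # Each sequence column becomes a (n_rows, seq_len) block.
    parts += [np.vstack(list(frame[c])).astype(float) for c in seq_cols]
    return np.hstack(parts)

# Made-up train/test frames in the layout of the question.
X_train = pd.DataFrame({
    "col1": [4, 3, 5],
    "col2": [np.array([34, 56, 234]), np.array([11, 598, 1]), np.array([20, 300, 100])],
    "col3": [7, 89, 40],
    "col4": [np.array([765, 654]), np.array([34, 90]), np.array([400, 300])],
})
X_test = pd.DataFrame({
    "col1": [2],
    "col2": [np.array([30, 60, 90])],
    "col3": [10],
    "col4": [np.array([100, 200])],
})

numeric_cols, seq_cols = ["col1", "col3"], ["col2", "col4"]
train_wide = to_wide_array(X_train, numeric_cols, seq_cols)  # 2 + 3 + 2 = 7 columns
test_wide = to_wide_array(X_test, numeric_cols, seq_cols)

# One scaler; every array column gets its own mean and std (Approach 2).
sc = StandardScaler()
X_train_scaled = sc.fit_transform(train_wide)
X_test_scaled = sc.transform(test_wide)
```

The column order of the wide array is the numeric columns first, then the positions of each sequence column in the order given by seq_cols, so the same lists must be used for train and test.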

Tags:

python-3.x

pandas

numpy

scikit-learn

raceee
