 

How to perform StandardScaler on pandas dataframe with a column/columns containing numpy.ndarrays?

I have a pandas dataframe that has some columns with numpy.ndarrays:

  col1         col2           col3         col4
0  4    array([34, 56, 234])   7     array([765, 654])
1  3    array([11, 598, 1])    89    array([34, 90])

And I would like to perform some type of scaling on it.

I have done the pretty standard thing of:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)


from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

and I run into the pretty expected error of:

ValueError: setting an array element with a sequence.

I need help standardizing these numpy arrays along with everything else!

raceee asked Nov 17 '25 03:11


1 Answer

StandardScaler expects every column to hold numeric values, but col2 and col4 hold sequences, hence the error.

I think it would be best to treat the columns with sequences separately and then combine them back with the rest of the data.

For now, I will assume that, within a given column, every row's sequence has the same number of elements, e.g. every row of col2 holds a 3-element array.
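That assumption is easy to check before scaling. The snippet below is a minimal sketch: the toy dataframe mirrors the one in the question, and `sequence_lengths` is a hypothetical helper name, not part of pandas or scikit-learn.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the layout in the question (values are illustrative)
df = pd.DataFrame({
    "col1": [4, 3],
    "col2": [np.array([34, 56, 234]), np.array([11, 598, 1])],
    "col3": [7, 89],
    "col4": [np.array([765, 654]), np.array([34, 90])],
})

def sequence_lengths(frame, column):
    """Return the set of distinct sequence lengths found in one column."""
    return set(frame[column].map(len))

lengths_col2 = sequence_lengths(df, "col2")
lengths_col4 = sequence_lengths(df, "col4")

# A set with more than one entry means the rows are ragged, and
# np.vstack on that column would fail.
print(lengths_col2, lengths_col4)
```

If a column turns out to be ragged, you would need to pad or truncate it first; how to do that depends on what the sequences represent.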

StandardScaler calculates the mean and std of each column individually, so there are two approaches for sequence columns:

Approach 1: Elements at all positions of the sequence come from the same distribution.

In this case, you should compute the mean and std over all values. Fit StandardScaler on the flattened array, transform it, and reshape the result back to the original shape.

Approach 2: Elements at different positions of the sequence come from different distributions.

In this scenario, a single column can be converted to a 2D numpy array. You can fit StandardScaler on that 2D array (the mean and std of each column are calculated separately) and convert it back to a single column after transformation.
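To make the difference concrete, here is a small sketch that fits both ways on the two col2 rows from the question and compares the learned means. The variable names `pooled` and `per_position` are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The two col2 rows from the question, stacked into a (2, 3) array
col2 = np.array([[34, 56, 234],
                 [11, 598, 1]], dtype=float)

# Approach 1: treat every element as a sample of one feature,
# so a single mean/std is learned for the whole column
pooled = StandardScaler().fit(col2.reshape(-1, 1))

# Approach 2: treat each position as its own feature,
# so one mean/std is learned per position
per_position = StandardScaler().fit(col2)

print(pooled.mean_)        # one value
print(per_position.mean_)  # three values, one per position
```

Which one is appropriate depends on the data: pooled statistics suit sequences whose positions are interchangeable, while per-position statistics suit sequences where each slot measures something different.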

Below is code for both approaches:

import numpy as np
from sklearn.preprocessing import StandardScaler

# numeric columns work as expected
X_train_1 = X_train[['col1', 'col3']]
X_test_1 = X_test[['col1', 'col3']]

sc = StandardScaler()
X_train_1 = sc.fit_transform(X_train_1)
X_test_1 = sc.transform(X_test_1)

# first convert seq column to a 2d array
X_train_col2 = np.vstack(X_train['col2'].values).astype(float)
X_test_col2 = np.vstack(X_test['col2'].values).astype(float)

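# for sequence columns, there are two approaches; run only one of them,
# since both write to sc_col2, X_train_2 and X_test_2
# Approach 1: one mean/std pooled over every element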
sc_col2 = StandardScaler()
X_train_2 = sc_col2.fit_transform(X_train_col2.flatten().reshape(-1, 1))
X_train_2 = X_train_2.reshape(X_train_col2.shape)

X_test_2 = sc_col2.transform(X_test_col2.flatten().reshape(-1, 1))
X_test_2 = X_test_2.reshape(X_test_col2.shape)


# Approach 2: a separate mean/std for each position in the sequence
sc_col2 = StandardScaler()
X_train_2 = sc_col2.fit_transform(X_train_col2)

X_test_2 = sc_col2.transform(X_test_col2)

# To assign back to the dataframe, you can do the following:
X_test["col2_scaled"] = X_test_2.tolist()

# To stack with other numpy arrays
X_train_scaled = np.hstack((X_train_1, X_train_2))


In approach 2, it is possible to stack all columns first and then run StandardScaler on all of them in one shot.
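A minimal sketch of that one-shot variant follows. The helper `to_wide_array` and the toy dataframes are hypothetical and only for illustration. Note that the output columns are ordered as the scalar columns first, followed by the expanded sequence columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def to_wide_array(frame, scalar_cols, seq_cols):
    """Stack scalar columns and expanded sequence columns into one 2D float array."""
    parts = [frame[scalar_cols].to_numpy(dtype=float)]
    for col in seq_cols:
        # each sequence column becomes (n_rows, seq_len)
        parts.append(np.vstack(list(frame[col])).astype(float))
    return np.hstack(parts)

# Illustrative data only
X_train_demo = pd.DataFrame({
    "col1": [4, 3, 5],
    "col2": [np.array([34, 56, 234]), np.array([11, 598, 1]), np.array([7, 8, 9])],
    "col3": [7, 89, 12],
    "col4": [np.array([765, 654]), np.array([34, 90]), np.array([1, 2])],
})
X_test_demo = pd.DataFrame({
    "col1": [2],
    "col2": [np.array([20, 300, 100])],
    "col3": [50],
    "col4": [np.array([400, 300])],
})

scalar_cols = ["col1", "col3"]
seq_cols = ["col2", "col4"]

# One scaler learns a mean/std for each of the 2 + 3 + 2 = 7 wide columns
sc_all = StandardScaler()
X_train_scaled = sc_all.fit_transform(to_wide_array(X_train_demo, scalar_cols, seq_cols))
X_test_scaled = sc_all.transform(to_wide_array(X_test_demo, scalar_cols, seq_cols))
```

This only reproduces approach 2. Approach 1 still needs its own scaler fitted on the flattened column, because a single StandardScaler always keeps separate statistics per column.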

Mohsin hasan answered Nov 20 '25 18:11


