I'm using a pipeline to preprocess data. Here is my code. I want to convert a string column to datetime and, for some other columns, replace empty strings (' ') and "N.A" with np.nan. I'm trying to use FunctionTransformer in my pipeline steps.
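import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

df = pd.DataFrame({'categoric1': ['Apple', ' ', 'Cherry', 'Apple', 'Cherry', 'Cherry', 'Orange'],
                   'numeric1': [1, 2, 3, 4, 5, 6, 7],
                   'numeric2': [7, 8, 9, "N.A", np.nan, ' ', 12],
                   'date1': ['20001103', '20011109', '19910929', '19920929', '20051107', '20081103', '20101105']})
cat_features = ['categoric1']
num_features = ['numeric1', 'numeric2']
date_features = ['date1']
print(df.head(7))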

def replace_with_nan(X):
    X_copy = X.copy()
    X_copy[X_copy == ' '] = np.nan
    X_copy[X_copy == 'N.A'] = np.nan
    return X_copy.values

def square_values(X):
    return X**2

def convert_to_datetime(df):
    df['date1'] = pd.to_datetime(df['date1'], errors='raise') #df['date1'].astype(str) + "Z"
    return df

cat_transformer = Pipeline(steps=[
    ('ft_replace_nan', FunctionTransformer(replace_with_nan, validate=False)),
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('encoder', OneHotEncoder(categories=[['Apple', 'Orange', 'Cherry']], handle_unknown='error'))
])
num_transformer = Pipeline(steps=[
    ('ft_replace_nan', FunctionTransformer(replace_with_nan, validate=False)),
    # ('ft_square_values', FunctionTransformer(square_values, validate=False)), #Another FunctionTransformer -----1
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='median')),
    ('scaler', StandardScaler())
])
date_transformer = Pipeline(steps=[
    ('convert_to_datetime', FunctionTransformer(convert_to_datetime, validate=False))
])
preprocessor = ColumnTransformer(remainder='passthrough', transformers=[
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features),
    ('date', date_transformer, date_features)
])
# ft_fill_nan = FunctionTransformer(replace_with_nan, validate=False)
# transformed_data = ft_fill_nan.fit_transform(df)
# print(transformed_data)
# ft_convert_datetime = FunctionTransformer(convert_to_datetime, validate=False)
# transformed_data = ft_convert_datetime.fit_transform(df)
# print(transformed_data)
transformed_data = preprocessor.fit_transform(df)
print(transformed_data)
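Questions:
1. When I call preprocessor.fit_transform(df), I'm getting errors. Can you please help me fix this?
2. I would like to add a second FunctionTransformer to the same pipeline (the commented-out step marked #Another FunctionTransformer -----1). Is it possible? If so, how?
3. The convert_to_datetime(df) method above accesses the date1 column by name. I would also like to make it generic, without referring to the actual date1 column. How can I achieve this?

You are getting an invalid type promotion error because of heterogeneous data types: the numeric and one-hot encoded outputs are floats, while the date step returns datetime values. Sklearn concatenates the outputs of the transformers into a single NumPy array internally, and NumPy cannot promote floats and datetimes to a common dtype. The solution is to extract the features you need from the date, for example the month of the given date. All you need to change is the convert_to_datetime function: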
def convert_to_datetime(data):
    return data.apply(lambda x: [pd.to_datetime(date, format="%Y%m%d").month for date in x])
This way, you don't have to hard-code the column name inside the function.
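For reference, here is a minimal end-to-end sketch of the pipeline with that change applied. It is one possible arrangement, not the only fix. The helper name to_month, the use of DataFrame.mask with isin in replace_with_nan, the vectorized .dt.month accessor, and sparse_threshold=0 (to force a dense result) are choices made for this illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "categoric1": ["Apple", " ", "Cherry", "Apple", "Cherry", "Cherry", "Orange"],
    "numeric1": [1, 2, 3, 4, 5, 6, 7],
    "numeric2": [7, 8, 9, "N.A", np.nan, " ", 12],
    "date1": ["20001103", "20011109", "19910929", "19920929",
              "20051107", "20081103", "20101105"],
})

def replace_with_nan(X):
    # Treat the placeholder strings ' ' and 'N.A' as missing values.
    return X.mask(X.isin([" ", "N.A"])).values

def to_month(X):
    # Hypothetical helper: parse every column it receives as YYYYMMDD and keep
    # only the month, so the output is numeric and no column name is hard-coded.
    return X.apply(lambda col: pd.to_datetime(col, format="%Y%m%d").dt.month).to_numpy()

num_transformer = Pipeline(steps=[
    ("ft_replace_nan", FunctionTransformer(replace_with_nan, validate=False)),
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="median")),
    ("scaler", StandardScaler()),
])
cat_transformer = Pipeline(steps=[
    ("ft_replace_nan", FunctionTransformer(replace_with_nan, validate=False)),
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ("encoder", OneHotEncoder(categories=[["Apple", "Orange", "Cherry"]])),
])
date_transformer = Pipeline(steps=[
    ("to_month", FunctionTransformer(to_month, validate=False)),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_transformer, ["numeric1", "numeric2"]),
        ("cat", cat_transformer, ["categoric1"]),
        ("date", date_transformer, ["date1"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # assumption for this sketch: always return a dense array
)

transformed = preprocessor.fit_transform(df)
print(transformed.shape)
```

Because every branch now returns numeric data, the outputs can be stacked into one float array: two scaled numeric columns, three one-hot columns, and one month column.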
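For the second question: yes, you can add another FunctionTransformer as an extra step in num_transformer, for example:
    ('ft_square_values', FunctionTransformer(lambda x: x*2, validate=False)), #Another FunctionTransformer -----1
Note that x*2 doubles the values; use x**2 (or the square_values function from the question) if you want squares.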
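As a small illustration of chaining two FunctionTransformer steps, the sketch below uses named functions instead of a lambda. A named module-level function keeps the pipeline picklable with the standard pickle module, which a lambda does not. The one-column sample frame and the cast to float inside replace_with_nan are assumptions made for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def replace_with_nan(X):
    # Replace the placeholder strings with NaN, then cast to float
    # (the cast is an assumption for this sketch so that ** works on floats).
    return X.mask(X.isin([" ", "N.A"])).astype(float).values

def square_values(X):
    # NaN stays NaN after squaring, so imputation can still run afterwards.
    return X ** 2

num_transformer = Pipeline(steps=[
    ("ft_replace_nan", FunctionTransformer(replace_with_nan, validate=False)),
    ("ft_square_values", FunctionTransformer(square_values, validate=False)),
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="median")),
])

sample = pd.DataFrame({"numeric2": [1, 2, "N.A", 4]})  # hypothetical sample data
result = num_transformer.fit_transform(sample)
print(result.ravel())
```

The steps run in the order listed, so the missing value is imputed with the median of the squared values rather than the median of the raw values. Move the imputer earlier if you want the opposite.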