
Using FunctionTransformer with sklearn Pipeline and ColumnTransformer - error: invalid type promotion

I'm using a pipeline to preprocess data; my code is below. I want to convert a string column to datetime and, in some other columns, replace blank strings ('  ') and "N.A" with np.nan. I'm trying to do this with FunctionTransformer steps in my pipeline.

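import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

df = pd.DataFrame({'categoric1':['Apple', '  ', 'Cherry', 'Apple', 'Cherry', 'Cherry', 'Orange'],
                   'numeric1':[1, 2, 3, 4, 5, 6, 7],
                   'numeric2':[7,8,9,"N.A", np.nan, '  ', 12],
                   'date1': ['20001103','20011109', '19910929', '19920929', '20051107', '20081103', '20101105']})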
cat_features = ['categoric1']
num_features = ['numeric1', 'numeric2']
date_features = ['date1']

print(df.head(7))

def replace_with_nan(X):
    X_copy = X.copy()       
    X_copy[X_copy == '  '] = np.nan
    X_copy[X_copy == 'N.A'] = np.nan
    return X_copy.values

def square_values(X):
    return X**2

def convert_to_datetime(df):
    df['date1'] = pd.to_datetime(df['date1'], errors='raise') #df['date1'].astype(str) + "Z"
    return df

cat_transformer = Pipeline(steps=[
    ('ft_replace_nan', FunctionTransformer(replace_with_nan, validate=False)),    
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),   
    ('encoder', OneHotEncoder(categories=[['Apple', 'Orange', 'Cherry']], handle_unknown='error'))     
])

num_transformer = Pipeline(steps=[    
    ('ft_replace_nan', FunctionTransformer(replace_with_nan, validate=False)),
#     ('ft_square_values', FunctionTransformer(square_values, validate=False)),    #Another FunctionTransformer -----1
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='median')),
    ('scaler', StandardScaler())
])

date_transformer = Pipeline(steps=[    
    ('convert_to_datetime', FunctionTransformer(convert_to_datetime, validate=False))
])

preprocessor = ColumnTransformer(remainder='passthrough', transformers = [
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features),
    ('date', date_transformer, date_features)
])

# ft_fill_nan = FunctionTransformer(replace_with_nan, validate=False)
# transformed_data = ft_fill_nan.fit_transform(df)
# print(transformed_data)

# ft_convert_datetime = FunctionTransformer(convert_to_datetime, validate=False)
# transformed_data = ft_convert_datetime.fit_transform(df)
# print(transformed_data)

transformed_data = preprocessor.fit_transform(df)
print(transformed_data)


Questions:

  1. When I execute preprocessor.fit_transform(df), I get the "invalid type promotion" error named in the title. How do I fix this?
  2. What if I want to run another FunctionTransformer in the same pipeline to square the values, by uncommenting the line marked #Another FunctionTransformer -----1? Is that possible? If so, how?
  3. I don't want to change the state of the actual data inside the convert_to_datetime(df) function above. I would also like to make it generic, without accessing the date1 column by name. How can I achieve this?


Jyoti Prasad Pal asked Oct 02 '19 13:10



1 Answer

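  1. You are getting the invalid type promotion error because of heterogeneous data types. ColumnTransformer concatenates the outputs of all its branches into a single array, and NumPy has no common dtype for the datetime64 column returned by the date branch and the numeric output of the other branches. The solution is to extract the features you need from the date, for example the month, so that the date branch returns plain numbers.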
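The dtype clash can be reproduced outside scikit-learn. The following is a minimal sketch, not the exact code path ColumnTransformer takes: the two arrays are made-up stand-ins for a numeric branch output and a date branch output, and the exact exception message depends on the NumPy version.

```python
import numpy as np

# Hypothetical stand-ins for the outputs of two ColumnTransformer branches.
numeric_block = np.array([[1.0], [2.0]])
date_block = np.array([["2000-11-03"], ["2001-11-09"]], dtype="datetime64[D]")

try:
    # Stacking column-wise needs one common dtype for both blocks.
    np.hstack([numeric_block, date_block])
    promotion_failed = False
except TypeError as exc:
    # float64 and datetime64 cannot be promoted to a shared dtype.
    promotion_failed = True
    print(type(exc).__name__, exc)
```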

The only change needed is in convert_to_datetime:

def convert_to_datetime(data):
    return data.apply(lambda x: [pd.to_datetime(date,  format="%Y%m%d").month for date in x])

This way, you don't have to hard-code the column name inside the function.
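An equivalent way to write the same idea uses the vectorized .dt accessor. This is a sketch under a few assumptions: extract_month and its fmt parameter are illustrative names, the input is a pandas DataFrame (which is what ColumnTransformer passes when columns are selected by name from a DataFrame), and every selected column holds dates in the same string format.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def extract_month(X, fmt="%Y%m%d"):
    """Return a new frame with the month of every date column in X."""
    # apply() builds a new DataFrame, so the caller's data is not modified,
    # and no column name is referenced inside the function.
    return X.apply(lambda col: pd.to_datetime(col, format=fmt).dt.month)


dates = pd.DataFrame({"date1": ["20001103", "19910929"]})
month_transformer = FunctionTransformer(extract_month, validate=False)
months = month_transformer.fit_transform(dates)
print(months)
```

Because the function only returns integers, the date branch can be concatenated with the numeric and one-hot encoded branches.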


  2. You can easily add one more FunctionTransformer to the same pipeline. Note that squaring is x**2, not x*2:
    ('ft_square_values', FunctionTransformer(lambda x: x**2, validate=False)),    #Another FunctionTransformer -----1
  3. The solution from point 1 covers this as well: the function returns a new object instead of modifying its input, and it never refers to date1 by name.
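To illustrate chaining two FunctionTransformer steps, here is a small self-contained sketch. It is not the asker's exact pipeline: replace_with_nan is rewritten with DataFrame.mask and a float cast (an assumption made so the later steps see numeric data), the scaler is left out, and the sample values are made up.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def replace_with_nan(X):
    # Treat blank strings and 'N.A' as missing, then cast to float.
    X = pd.DataFrame(X)
    return X.mask(X.isin(["  ", "N.A"])).astype(float)


def square_values(X):
    # NaN stays NaN, so squaring before imputation is safe.
    return X ** 2


num_pipeline = Pipeline(steps=[
    ("ft_replace_nan", FunctionTransformer(replace_with_nan, validate=False)),
    ("ft_square_values", FunctionTransformer(square_values, validate=False)),
    ("imputer", SimpleImputer(strategy="median")),
])

sample = pd.DataFrame({"numeric2": [7, 8, "N.A", "  ", 12]})
squared = num_pipeline.fit_transform(sample)
print(squared)
```

Each step receives the output of the previous one, so the order matters: here the median is computed on the squared values, and moving the squaring step after the imputer would change the result.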
Venkatachalam answered Dec 26 '22 12:12
