
Why does `categorical_feature` of lightgbm not work?

I want to use LightGBM to predict the tradeMoney of houses, but I run into trouble when I specify categorical_feature in lgb.Dataset.
The dtypes of my data are as follows:

type(train)
pandas.core.frame.DataFrame

train.dtypes
area                  float64
rentType               object
houseFloor             object
totalFloor              int64
houseToward            object
houseDecoration        object
region                 object
plate                  object
buildYear               int64
saleSecHouseNum         int64
subwayStationNum        int64
busStationNum           int64
interSchoolNum          int64
schoolNum               int64
privateSchoolNum        int64
hospitalNum             int64
drugStoreNum            int64

I train the model with LightGBM as follows:

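import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

# train, test, target and params are defined earlier (not shown)
categorical_feats = ['rentType', 'houseFloor', 'houseToward', 'houseDecoration', 'region', 'plate']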
folds = KFold(n_splits=5, shuffle=True, random_state=2333)

oof_lgb = np.zeros(len(train))
predictions_lgb = np.zeros(len(test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train.values, target.values)):
    print("fold {}".format(fold_))
    trn_data = lgb.Dataset(train.iloc[trn_idx], label=target.iloc[trn_idx], categorical_feature=categorical_feats)
    val_data = lgb.Dataset(train.iloc[val_idx], label=target.iloc[val_idx], categorical_feature=categorical_feats)

    num_round = 10000
    clf = lgb.train(params, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=500, early_stopping_rounds = 200)

    oof_lgb[val_idx] = clf.predict(train.iloc[val_idx], num_iteration=clf.best_iteration)

    predictions_lgb += clf.predict(test, num_iteration=clf.best_iteration) / folds.n_splits

print("CV Score: {:<8.5f}".format(r2_score(target, oof_lgb)))

But I still get the following error, even though I have specified categorical_feature:

ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields rentType, houseFloor, houseToward, houseDecoration, region, plate

Here is my environment:

LightGBM version: 2.2.3
Pandas version: 0.24.2
Python version: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]

Could anyone help me, please?

asked May 10 '19 at 03:05 by Bowen Peng



1 Answer

The problem is that lightgbm can handle categorical features only when their pandas dtype is category, not object. Internally, lightgbm collects the columns of category dtype and encodes them as integers. Nothing happens to object columns, so lightgbm raises this error when it finds that not all features have been converted to numbers.
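To make the difference concrete, here is a small pandas-only sketch. The houseToward values are made up for illustration, and the snippet only mimics the kind of integer encoding described above; it is not lightgbm's actual internal code.

```python
import pandas as pd

# Hypothetical values for one of the string columns from the question.
s_obj = pd.Series(["east", "south", "east"], dtype=object, name="houseToward")
print(s_obj.dtype)  # object: plain Python strings, no integer encoding

# Converting to the category dtype attaches a list of categories
# and stores each value as an integer code into that list.
s_cat = s_obj.astype("category")
print(s_cat.dtype)                  # category
print(list(s_cat.cat.categories))   # the distinct labels
print(s_cat.cat.codes.tolist())     # the integer codes behind each value
```

A column of category dtype therefore already carries a numeric representation, while an object column does not.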

So the solution is to convert these columns to the category dtype before your CV loop:

for c in categorical_feats:
    train[c] = train[c].astype('category')
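Since the question also passes test to clf.predict, test most likely needs the same conversion, otherwise its object columns would probably trigger the same dtype check. Below is a minimal sketch of one way to convert both frames. The toy data is hypothetical, and building a shared category list from both frames is my own assumption about a reasonable approach, not something the original answer prescribes.

```python
import pandas as pd

# Toy stand-ins for the question's `train` and `test` frames (hypothetical data).
train = pd.DataFrame({
    "area": [50.0, 72.5, 33.0],
    "rentType": ["whole", "shared", "whole"],
    "region": ["RG1", "RG2", "RG1"],
})
test = pd.DataFrame({
    "area": [41.0, 60.0],
    "rentType": ["shared", "unknown"],
    "region": ["RG2", "RG1"],
})
categorical_feats = ["rentType", "region"]

for c in categorical_feats:
    # One category list built from both frames, so that a given label
    # maps to the same integer code in train and in test.
    categories = sorted(set(train[c].dropna()) | set(test[c].dropna()))
    shared_dtype = pd.CategoricalDtype(categories=categories)
    train[c] = train[c].astype(shared_dtype)
    test[c] = test[c].astype(shared_dtype)

print(train.dtypes)
print(test.dtypes)
```

With both frames converted like this, the rest of the CV loop from the question should be able to stay unchanged.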

answered Sep 28 '22 at 02:09 by Mischa Lisovyi
