
LightGBM 'Using categorical_feature in Dataset.' Warning?

Tags: lightgbm

From my reading of the LightGBM documentation, one is supposed to define categorical features in the Dataset constructor. So I have the following code:

import lightgbm as lgb

cats = ['C1', 'C2']
d_train = lgb.Dataset(X, label=y, categorical_feature=cats)

However, I received the following warning:

/app/anaconda3/anaconda3/lib/python3.7/site-packages/lightgbm/basic.py:1243: UserWarning: Using categorical_feature in Dataset.
  warnings.warn('Using categorical_feature in Dataset.')

Why do I get this warning?

David293836 asked Mar 07 '20 03:03


People also ask

Can LightGBM handle categorical data?

LightGBM offers good accuracy with integer-encoded categorical features. It applies the method of Fisher (1958) to find the optimal split over categories, which often performs better than one-hot encoding. LightGBM therefore does not need to one-hot encode these features.

How do you specify categorical features in LightGBM?

LightGBM lets us declare categorical features directly and handles them internally. Use the categorical_feature parameter to specify them. Categorical features must be encoded as non-negative integers (int) less than Int32.MaxValue (2147483647).
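As a minimal sketch of the encoding requirement above, the snippet below maps string categories to non-negative integer codes using only the standard library. The helper encode_categories and the sample colors column are hypothetical names chosen for illustration; they are not part of LightGBM.

```python
INT32_MAX = 2147483647  # upper bound quoted above


def encode_categories(values):
    """Map each distinct value to a non-negative integer code (hypothetical helper)."""
    mapping = {}
    codes = []
    for value in values:
        # Assign the next unused code the first time a value is seen.
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


colors = ["red", "green", "red", "blue"]  # illustrative data
codes, mapping = encode_categories(colors)

# Every code must be a non-negative integer below Int32.MaxValue.
codes_in_range = all(0 <= code < INT32_MAX for code in codes)
print(codes, mapping, codes_in_range)
```

The resulting integer column is what would then be named in categorical_feature.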

How does LightGBM handle missing values?

LightGBM uses NA (NaN) to represent missing values by default. To treat zeros as missing instead, set zero_as_missing=true. When zero_as_missing=false (the default), the unrecorded values in sparse matrices (and LibSVM files) are treated as zeros.

Do we need to encode categorical variables for LGBM?

LightGBM can use categorical features directly, without one-hot encoding. Rather than expanding each category into its own column, it applies Fisher's method to find the optimal split over categories.


1 Answer

I presume that you get this warning in a call to lgb.train. This function also has an argument categorical_feature, whose default value is 'auto', which means taking the categorical columns from a pandas.DataFrame. The warning indicates that, although lgb.train has requested that categorical features be identified automatically, LightGBM will use the features specified in the dataset instead.

To avoid the warning, you can give the same argument categorical_feature to both lgb.Dataset and lgb.train. Alternatively, you can construct the dataset with categorical_feature=None and only specify the categorical features in lgb.train.
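The behaviour described above can be illustrated with a toy model. This is not LightGBM's code: ToyDataset, toy_train and warnings_from are hypothetical stand-ins that reproduce only the reconciliation logic this answer describes, so that it runs without the library installed. Whether a given LightGBM version behaves exactly this way is an assumption worth checking against its source.

```python
import warnings


class ToyDataset:
    """Hypothetical stand-in for lgb.Dataset; it only tracks categorical_feature."""

    def __init__(self, categorical_feature="auto"):
        self.categorical_feature = categorical_feature

    def set_categorical_feature(self, requested):
        # Same setting on both sides: nothing to reconcile, no warning.
        if self.categorical_feature == requested:
            return self
        # Dataset built with None: adopt whatever the training call asked for.
        if self.categorical_feature is None:
            self.categorical_feature = requested
            return self
        # Training call asked for 'auto' but the dataset names explicit
        # columns: keep the dataset's columns and warn (the case in the question).
        if requested == "auto":
            warnings.warn("Using categorical_feature in Dataset.")
            return self
        # Other combinations are outside what the answer covers; this toy
        # simply adopts the requested value.
        self.categorical_feature = requested
        return self


def toy_train(dataset, categorical_feature="auto"):
    """Hypothetical stand-in for lgb.train; it only reconciles the two settings."""
    return dataset.set_categorical_feature(categorical_feature)


def warnings_from(dataset, **train_kwargs):
    """Run toy_train and return the text of any warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        toy_train(dataset, **train_kwargs)
    return [str(w.message) for w in caught]


cats = ["C1", "C2"]

# The question's setup: columns given to the dataset only.
case_question = warnings_from(ToyDataset(categorical_feature=cats))
# First remedy: the same argument in both places.
case_same = warnings_from(ToyDataset(categorical_feature=cats), categorical_feature=cats)
# Second remedy: None in the dataset, columns given at training time.
case_train_only = warnings_from(ToyDataset(categorical_feature=None), categorical_feature=cats)

print(case_question, case_same, case_train_only)
```

In this toy model only the first case emits the warning; both remedies run silently.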

Andrey Popov answered Sep 21 '22 13:09