 

Inconsistent labeling in sklearn LabelEncoder?

I have applied a LabelEncoder() on a dataframe, which returns the following:

[image: the label-encoded dataframe; not preserved]

The order/new_carts columns have different label-encoded numbers, like 70, 64, 71, etc.

Is this inconsistent labeling, or did I do something wrong somewhere?

Dawny33 asked Mar 12 '23 16:03



2 Answers

LabelEncoder works on one-dimensional arrays. If you apply it to multiple columns, it will be consistent within columns but not across columns.
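To make the point concrete, here is a minimal sketch with a small made-up frame (the column names "x" and "y" and the values are hypothetical, not taken from the question). The label "b" occurs in both columns but gets a different code in each, because each column is fitted on its own sorted labels:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: the label "b" appears in both columns.
df = pd.DataFrame({"x": ["a", "b", "c"], "y": ["b", "c", "d"]})

# fit_transform is called once per column, so each column gets its own mapping.
per_column = df.apply(LabelEncoder().fit_transform)

code_b_in_x = int(per_column.loc[1, "x"])  # "b" ranked within ["a", "b", "c"]
code_b_in_y = int(per_column.loc[0, "y"])  # "b" ranked within ["b", "c", "d"]
print(code_b_in_x, code_b_in_y)  # 1 0
```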

As a workaround, you can flatten the dataframe into a one-dimensional array, call LabelEncoder on that array, and reshape the result back to the original shape.

Assume this is the dataframe:

df
Out[372]: 
   0  1  2
0  d  d  a
1  c  a  c
2  c  c  b
3  e  e  d
4  d  d  e
5  d  b  e
6  e  e  b
7  a  e  b
8  b  c  c
9  e  a  b

With ravel and then reshaping:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

pd.DataFrame(LabelEncoder().fit_transform(df.values.ravel()).reshape(df.shape), columns = df.columns)
Out[373]: 
   0  1  2
0  3  3  0
1  2  0  2
2  2  2  1
3  4  4  3
4  3  3  4
5  3  1  4
6  4  4  1
7  0  4  1
8  1  2  2
9  4  0  1

Edit:

If you want to store the labels, you need to save the LabelEncoder object.

le = LabelEncoder()
df2 = pd.DataFrame(le.fit_transform(df.values.ravel()).reshape(df.shape), columns = df.columns)

Now, le.classes_ gives you the classes in sorted order. Each label's position in this array is its integer code (starting from 0).

le.classes_
Out[390]: array(['a', 'b', 'c', 'd', 'e'], dtype=object)

If you want to access the integer by label, you can construct a dict:

dict(zip(le.classes_, np.arange(len(le.classes_))))
Out[388]: {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

You can get the same result with the transform method, without building a dict. It expects an array-like, so pass a single label inside a list:

le.transform(['c'])
Out[395]: array([2])
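Going the other way, a saved encoder can also turn codes back into labels with inverse_transform. A minimal sketch, assuming an encoder fitted on the same five labels as above (it is fitted directly on a list here just to keep the snippet self-contained):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["a", "b", "c", "d", "e"])  # same labels as in the example dataframe

codes = le.transform(["c", "e"])       # labels -> integer codes
labels = le.inverse_transform(codes)   # integer codes -> original labels
print(list(codes), list(labels))
```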
ayhan answered Mar 27 '23 04:03



Your LabelEncoder object is being re-fit to each column of your DataFrame.

Because of the way the apply and fit_transform functions work, you are accidentally calling the fit function on each column of your frame. Let's walk through what's happening in the following line:

labeled_df = String_df.apply(LabelEncoder().fit_transform)
  1. Create a new LabelEncoder object.
  2. Call apply, passing in the fit_transform method. For each column in your DataFrame, apply calls fit_transform on your encoder with that column as the argument. This does two things:
    A. It refits your encoder (modifying its state).
    B. It returns the codes for the elements of your column based on your encoder's new fit.

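The codes will not be consistent across columns because each call to fit_transform refits the encoder on that column alone. Codes are assigned from the sorted unique values of that column, so the same value can get a different code in another column.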

If you want your codes to be consistent across columns, you should fit your LabelEncoder to your whole dataset.

Then pass the transform function to your apply function, instead of the fit_transform function. You can try the following:

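from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
all_values = String_df.values.ravel()  # flatten the dataframe into one long 1-D array
encoder.fit(all_values)  # fit once, on every value in the frame
labeled_df = String_df.apply(encoder.transform)  # transform only, no refitting per column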
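As a quick sanity check of this approach, the sketch below runs it on a small made-up String_df (the column names and values are hypothetical stand-ins, not the asker's data) and confirms that the same string gets the same code in both columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for String_df; "x" and "z" occur in both columns.
String_df = pd.DataFrame({"order": ["x", "y", "z"], "new_cart": ["z", "x", "w"]})

encoder = LabelEncoder()
encoder.fit(String_df.values.ravel())            # one fit over all values
labeled_df = String_df.apply(encoder.transform)  # same mapping for every column

x_in_order = int(labeled_df.loc[0, "order"])     # code for "x" in column "order"
x_in_cart = int(labeled_df.loc[1, "new_cart"])   # code for "x" in column "new_cart"
print(x_in_order == x_in_cart)
```

Because the encoder is fitted only once, encoder.classes_ describes the mapping for the whole frame.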
dmlicht answered Mar 27 '23 03:03
