Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Transform multiple categorical columns

In my dataset I have two categorical columns which I would like to numerate. The two columns both contain countries, some overlap (appear in both columns). I would like to give the same number in column1 and column2 for the same country.

My data looks somewhat like:

import pandas as pd

d = {'col1': ['NL', 'BE', 'FR', 'BE'], 'col2': ['BE', 'NL', 'ES', 'ES']}
df = pd.DataFrame(data=d)
df

Currenty I am transforming the data like:

from sklearn.preprocessing import LabelEncoder
df.apply(LabelEncoder().fit_transform)

However this makes no distinction between FR and ES. Is there another simple way to come to the following output?

o = {'col1': [2,0,1,0], 'col2': [0,2,4,4]}
output = pd.DataFrame(data=o)
output
like image 965
Tox Avatar asked Nov 12 '19 15:11

Tox


People also ask

How do you convert 3 categorical variables to dummy variables?

To convert your categorical variables to dummy variables in Python you c an use Pandas get_dummies() method. For example, if you have the categorical variable “Gender” in your dataframe called “df” you can use the following code to make dummy variables: df_dc = pd. get_dummies(df, columns=['Gender']) .

How do you convert multiple categorical variables into dummy variables in R?

To convert category variables to dummy variables in tidyverse, use the spread() method. To do so, use the spread() function with three arguments: key, which is the column to convert into categorical values, in this case, “Reporting Airline”; value, which is the value you want to set the key to (in this case “dummy”);

How do you perform one-hot encoding for multi categorical variables?

For checking values we use the pandas select_dtypes function which is used for selecting the data types of variable. Now we have to drop those numerical columns from the dataset and we will use this categorical variable for our use. We only use 3-4 categorical columns from the dataset for applying one-hot encoding.


2 Answers

Here is one way

df.stack().astype('category').cat.codes.unstack()
Out[190]: 
   col1  col2
0     3     0
1     0     3
2     2     1
3     0     1

Or

s=df.stack()
s[:]=s.factorize()[0]
s.unstack()
Out[196]: 
   col1  col2
0     0     1
1     1     0
2     2     3
3     1     3
like image 113
BENY Avatar answered Oct 21 '22 00:10

BENY


You can fit the LabelEncoder() with the unique values in your dataframe first and then transform.

le = LabelEncoder()
le.fit(pd.concat([df.col1, df.col2]).unique()) # or np.unique(df.values.reshape(-1,1))

df.apply(le.transform)
Out[28]: 
   col1  col2
0     3     0
1     0     3
2     2     1
3     0     1
like image 38
Michael Gardner Avatar answered Oct 20 '22 23:10

Michael Gardner