I'm doing a project based on this Kaggle dataset: https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings/data and I need to feed the data into a kNN model. This can't be done with the data in its current state, because I first need to transform the string values into integers.
get_dummies isn't ideal, as the dataset has many distinct categorical values and it would create thousands of columns. I am looking for a way to transform strings into numeric representations, for example:
Platform || Critic_Score || Publisher || Global_Sales
Wii || 73 || Nintendo || 53
Wii || 86 || Nintendo || 60
PC || 80 || Activision || 30
PS3 || 74 || Activision || 35
Xbox360 || 81 || 2K || 38
I'd like to transform into this:
Platform || Critic_Score || Publisher || Global_Sales
1 || 73 || 1 || 53
1 || 86 || 1 || 60
2 || 80 || 2 || 30
3 || 74 || 2 || 35
4 || 81 || 3 || 38
I'm using Python 3.
Thanks.
I think you need factorize:
import pandas as pd
# factorize returns (codes, uniques); codes start at 0, so add 1 to start at 1
df['Platform'] = pd.factorize(df['Platform'])[0] + 1
df['Publisher'] = pd.factorize(df['Publisher'])[0] + 1
print(df)
   Platform  Critic_Score  Publisher  Global_Sales
0         1            73          1            53
1         1            86          1            60
2         2            80          2            30
3         3            74          2            35
4         4            81          3            38
To encode several columns at once, apply factorize to each of them:
cols = ['Platform', 'Publisher']
df[cols] = df[cols].apply(lambda x: pd.factorize(x)[0] + 1)
print(df)
   Platform  Critic_Score  Publisher  Global_Sales
0         1            73          1            53
1         1            86          1            60
2         2            80          2            30
3         3            74          2            35
4         4            81          3            38
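If you also want to translate the integer codes back to the original strings later, factorize returns the unique labels alongside the codes. Below is a minimal, self-contained sketch using only the five sample rows from the question; the `mappings` dictionary is a name I made up for illustration, not part of pandas.

```python
import pandas as pd

# Sample rows copied from the question (not the full Kaggle dataset)
df = pd.DataFrame({
    "Platform": ["Wii", "Wii", "PC", "PS3", "Xbox360"],
    "Critic_Score": [73, 86, 80, 74, 81],
    "Publisher": ["Nintendo", "Nintendo", "Activision", "Activision", "2K"],
    "Global_Sales": [53, 60, 30, 35, 38],
})

cols = ["Platform", "Publisher"]
mappings = {}  # hypothetical helper: column name -> {code: original label}

for col in cols:
    # codes are 0-based integers in order of first appearance;
    # uniques holds the label that belongs to each code
    codes, uniques = pd.factorize(df[col])
    df[col] = codes + 1
    mappings[col] = {i + 1: label for i, label in enumerate(uniques)}

print(df)
print(mappings["Platform"])
```

One thing to watch on the real dataset: by default factorize assigns -1 to missing values, so after the + 1 shift any NaN ends up encoded as 0.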