
OneHotEncoder on only a single feature, which is a string

I want ONLY ONE of my features to be converted into separate binary features:

df["pattern_id"]
Out[202]: 
0       3
1       3
...
7440    2
7441    2
7442    3
Name: pattern_id, Length: 7443, dtype: int64 

The desired result is:

df["pattern_id"]
Out[202]: 
0       0 0 1
1       0 0 1
...
7440    0 1 0
7441    0 1 0
7442    0 0 1
Name: pattern_id, Length: 7443, dtype: int64 

I want to use OneHotEncoder. The data is already int, so there should be no need to label-encode it first:

onehotencoder = OneHotEncoder(categorical_features=["pattern_id"])
df = onehotencoder.fit_transform(df).toarray()

ValueError: could not convert string to float: 'http://www.zaragoza.es/sedeelectronica/'

Interestingly enough, I receive an error: sklearn tried to encode another column, not the one I wanted.

So I tried encoding pattern_id as an integer value first.

I used this link: Issue with OneHotEncoder for categorical features

#transform the pattern_id feature to int
encoding_feature = ["pattern_id"]
enc = LabelEncoder()
enc.fit(encoding_feature)
working_feature = enc.transform(encoding_feature)
working_feature = working_feature.reshape(-1, 1)
ohe = OneHotEncoder(sparse=False)


#convert the pattern_id feature to separate binary features
onehotencoder = OneHotEncoder(categorical_features=working_feature, sparse=False)
df = onehotencoder.fit_transform(df).toarray()

And I get the same error. What am I doing wrong?

Edit

source: https://github.com/martin-varbanov96/scraper/blob/master/logo_scrape/logo_scrape/analysis.py

df
Out[259]: 
      found_img  is_http                                           link_img  \
0          True        0                                  img/aahoteles.svg   
//www.zaragoza.es/cont/paginas/img/sede/logo_e...   

      pattern_id                                       current_link  site_id  \
0              3             https://www.aa-hoteles.com/es/reservas        3   
6              3      https://www.aa-hoteles.com/es/ofertas-hoteles        3   
7              2           http://about.pressreader.com/contact-us/        4   
8              3           http://about.pressreader.com/contact-us/        4   

      status                                   link_id  
0        200               https://www.aa-hoteles.com/  
1        200               https://www.365travel.asia/  
2        200               https://www.365travel.asia/  
3        200               https://www.365travel.asia/  
4        200               https://www.aa-hoteles.com/  
5        200               https://www.aa-hoteles.com/  
6        200               https://www.aa-hoteles.com/  
7        200              http://about.pressreader.com  
8        200              http://about.pressreader.com  
9        200               https://www.365travel.asia/  
10       200               https://www.365travel.asia/  
11       200               https://www.365travel.asia/  
12       200               https://www.365travel.asia/  
13       200               https://www.365travel.asia/  
14       200               https://www.365travel.asia/  
15       200               https://www.365travel.asia/  
16       200               https://www.365travel.asia/  
17       200               https://www.365travel.asia/  
18       200              http://about.pressreade 

[7443 rows x 8 columns]
Hartun asked Oct 17 '25 17:10


2 Answers

If you take a look at the documentation for OneHotEncoder, you can see that the categorical_features argument expects "all", an array of indices, or a mask, not a string such as a column name. You can make your code work by changing it to the following lines:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Note: categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22
# Create a dataframe of random ints
df = pd.DataFrame(np.random.randint(0, 4, size=(100, 4)),
                  columns=['pattern_id', 'B', 'C', 'D'])
# Pass the positional index of the column, not its name
onehotencoder = OneHotEncoder(categorical_features=[df.columns.tolist().index('pattern_id')])
df = onehotencoder.fit_transform(df)

However, df will no longer be a DataFrame after this (fit_transform returns a sparse matrix by default), so I would suggest working directly with the numpy arrays.
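On newer scikit-learn releases, where categorical_features is no longer available, the same idea (encode one column, leave the rest alone) can be sketched with ColumnTransformer. This is only an illustration under stated assumptions: the toy DataFrame, its example.com URLs, and the transformer name "pattern" are made up here and are not from the question's data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame standing in for the question's data (all values are made up).
df = pd.DataFrame({
    "pattern_id": [3, 3, 2, 1, 2],
    "site_id": [3, 3, 4, 4, 5],
    "link_id": [
        "https://a.example.com/",
        "https://a.example.com/",
        "https://b.example.com/",
        "https://b.example.com/",
        "https://c.example.com/",
    ],
})

# One-hot encode only pattern_id; all other columns, including the
# string column, are passed through untouched.
ct = ColumnTransformer(
    transformers=[("pattern", OneHotEncoder(), ["pattern_id"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = ct.fit_transform(df)

# The encoded columns come first, followed by the passthrough columns
# in their original order.
print(encoded[:2])
```

Because the string column is never handed to the encoder, the "could not convert string to float" error from the question should not occur with this setup.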

piman314 answered Oct 19 '25 06:10


You can also do it like this:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(df.required_column.values.reshape(-1, 1)).toarray()

We need to reshape the column because fit_transform requires a 2-D array. You can then add columns to the resulting numpy array and merge it with your DataFrame.

Seen from this link here
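The merge step mentioned above might look roughly like the following. This is a minimal sketch, not the answerer's own code: the toy DataFrame and the "pattern_id_<value>" column naming are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame (made-up values) with the column to encode plus one other column.
df = pd.DataFrame({
    "pattern_id": [3, 3, 2, 1],
    "status": [200, 200, 200, 404],
})

# Selecting with a list of column names keeps the data 2-D,
# so no reshape is needed here.
enc = OneHotEncoder()
onehot = enc.fit_transform(df[["pattern_id"]]).toarray()

# Name the new columns after the categories the encoder learned.
names = ["pattern_id_%s" % c for c in enc.categories_[0]]
onehot_df = pd.DataFrame(onehot, columns=names, index=df.index)

# Replace the original column with its one-hot columns.
result = pd.concat([df.drop(columns="pattern_id"), onehot_df], axis=1)
print(result)
```

Reusing df.index for the new frame keeps the rows aligned when the two pieces are concatenated.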

Orionis answered Oct 19 '25 05:10


