
understand sklearn QuantileTransformer

I am trying to use QuantileTransformer to transform several columns, but the results don't seem to be consistent. Moreover, they depend on the column order, even for a small dataset.

I understand that I could create an individual transformer for each feature, but as I read the documentation, this transformer should accept an (n_samples, n_features) object.

Here is a Google Colab notebook to reproduce the results.

Is there a way to apply QuantileTransformer and get consistent results (so that identical original values are mapped to identical transformed values instead of one-to-many)?

import pandas as pd
from sklearn.preprocessing import QuantileTransformer

def unique_values(x):
    return x.unique().tolist()

df = pd.read_csv('https://storage.googleapis.com/ml_universities/california_housing_train.csv', usecols=[0, 1])
columns = ['latitude', 'longitude']

qt = QuantileTransformer()
q_features = qt.fit_transform(df)
suffix = '__qt'
qdf = df.join(pd.DataFrame(q_features, columns=columns), rsuffix=suffix)

for col in columns:
    q_col = f'{col}{suffix}'
    print({col: qdf[col].nunique(), q_col: qdf[q_col].nunique()})
    gdf = qdf.groupby(col)[q_col].agg([pd.Series.nunique, unique_values])
    print(gdf.sort_values('nunique', ascending=False).head())

Results:

{'latitude': 840, 'latitude__qt': 827}
          nunique                                      unique_values
latitude                                                            
34.07       102.0  [0.9865865865865866, 0.9719719719719734, 0.963...
34.08       101.0  [0.980980980980981, 0.9474474474474475, 0.9214...
34.06        94.0  [0.9846403596403596, 0.932932932932933, 0.9294...
34.10        88.0  [0.9891329870516945, 0.9882813721745806, 0.987...
34.05        87.0  [0.9719719719719734, 0.9269269269269284, 0.923...
{'longitude': 827, 'longitude__qt': 842}
           nunique                                      unique_values
longitude                                                            
-118.31       50.0  [0.6276276276276276, 0.5721203907954981, 0.511...
-118.32       49.0  [0.5369214480068981, 0.504004004004004, 0.4804...
-118.12       49.0  [0.5418393378488674, 0.5415415415415415, 0.540...
-117.25       48.0  [0.5335335335335335, 0.5327261051927988, 0.452...
-118.15       47.0  [0.5495495495495496, 0.5418393378488674, 0.541...

Different column order:

df = pd.read_csv('https://storage.googleapis.com/ml_universities/california_housing_train.csv', usecols=[0, 1])
columns = ['longitude', 'latitude']

qt = QuantileTransformer()
q_features = qt.fit_transform(df)
suffix = '__qt'
qdf = df.join(pd.DataFrame(q_features, columns=columns), rsuffix=suffix)

for col in columns:
    q_col = f'{col}{suffix}'
    print({col: qdf[col].nunique(), q_col: qdf[q_col].nunique()})
    gdf = qdf.groupby(col)[q_col].agg([pd.Series.nunique, unique_values])
    print(gdf.sort_values('nunique', ascending=False).head())

Results:

{'longitude': 827, 'longitude__qt': 827}
           nunique            unique_values
longitude                                  
-124.35        1.0  [9.999999977795539e-08]
-118.31        1.0     [0.5900900900900901]
-118.41        1.0      [0.531031031031031]
-118.40        1.0     [0.5355355355355356]
-118.39        1.0      [0.542542542542544]
{'latitude': 840, 'latitude__qt': 842}
          nunique                             unique_values
latitude                                                   
37.74         2.0  [0.7602602602602603, 0.7577577577577578]
37.37         2.0  [0.6806806806806807, 0.6816816816816816]
32.54         1.0                   [9.999999977795539e-08]
38.34         1.0                      [0.8848848848848849]
38.36         1.0                      [0.8873873873873874]
Alex Ozerov asked Apr 15 '19 10:04



1 Answer

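The problem is that you didn't change the order of the columns; you only renamed them. QuantileTransformer returns an array whose columns follow the order of the input DataFrame. With usecols=[0, 1] the DataFrame holds longitude first and latitude second, so labelling the result ['latitude', 'longitude'] pairs each latitude with the transformed longitude (and vice versa), which is why one original value appears to map to many. If you reorder the DataFrame itself, as below, the labels line up and you get the correct results. I am also supplying a random_state parameter for good measure.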

import pandas as pd
from sklearn.preprocessing import QuantileTransformer

def unique_values(x):
    return x.unique().tolist()

df = pd.read_csv('https://storage.googleapis.com/ml_universities/california_housing_train.csv', usecols=[0, 1])
columns = ['latitude', 'longitude']
# Change the column order
df = df[columns]

qt = QuantileTransformer(random_state=0)
q_features = qt.fit_transform(df)
suffix = '__qt'
qdf = df.join(pd.DataFrame(q_features, columns=columns), rsuffix=suffix)

for col in columns:
    q_col = f'{col}{suffix}'
    print({col: qdf[col].nunique(), q_col: qdf[q_col].nunique()})
    gdf = qdf.groupby(col)[q_col].agg([pd.Series.nunique, unique_values])
    print(gdf.sort_values('nunique', ascending=False).head())

This produces the same output, just in a different order (which is what you wanted, since the column order was switched), as:

df = pd.read_csv('https://storage.googleapis.com/ml_universities/california_housing_train.csv', usecols=[0, 1])
columns = ['longitude', 'latitude']
# Change the column order
df = df[columns]

# Same random_state as above, so the two runs are directly comparable
qt = QuantileTransformer(random_state=0)
q_features = qt.fit_transform(df)
suffix = '__qt'
qdf = df.join(pd.DataFrame(q_features, columns=columns), rsuffix=suffix)

for col in columns:
    q_col = f'{col}{suffix}'
    print({col: qdf[col].nunique(), q_col: qdf[q_col].nunique()})
    gdf = qdf.groupby(col)[q_col].agg([pd.Series.nunique, unique_values])
    print(gdf.sort_values('nunique', ascending=False).head())
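
To check the explanation above without downloading the CSV, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the data is made up (200 distinct values per column, each repeated three times), n_quantiles=50 is chosen to suit that small sample, and quantile_columns and max_outputs_per_input are hypothetical helper names, not part of sklearn. The key point is that the output labels are taken from frame.columns rather than from a separately maintained list, so they cannot drift out of sync with the data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-in for the housing CSV (an assumption, not the real data):
# 200 distinct values per column, each occurring exactly three times.
lon_grid = np.round(np.linspace(-124.0, -114.0, 200), 2)
lat_grid = np.round(np.linspace(32.0, 42.0, 200), 2)
df = pd.DataFrame({
    "longitude": np.repeat(lon_grid, 3),
    "latitude": np.tile(lat_grid, 3),
})


def quantile_columns(frame, suffix="__qt"):
    """Hypothetical helper: transform all columns, labelling output from frame.columns."""
    qt = QuantileTransformer(n_quantiles=50, random_state=0)
    out = pd.DataFrame(
        qt.fit_transform(frame),
        index=frame.index,
        columns=[f"{c}{suffix}" for c in frame.columns],
    )
    return frame.join(out)


def max_outputs_per_input(frame, col, q_col):
    """Largest number of distinct transformed values seen for one original value."""
    return int(frame.groupby(col)[q_col].nunique().max())


lon_first = quantile_columns(df[["longitude", "latitude"]])
lat_first = quantile_columns(df[["latitude", "longitude"]])

# Correctly labelled: every original value has a single transformed value.
correct = {c: max_outputs_per_input(lon_first, c, f"{c}__qt") for c in df.columns}

# Mislabelled on purpose: latitude grouped against the transformed longitude.
swapped = max_outputs_per_input(lon_first, "latitude", "longitude__qt")

# Reordering the DataFrame itself does not change any column's result.
order_independent = all(
    np.allclose(lon_first[f"{c}__qt"], lat_first[f"{c}__qt"]) for c in df.columns
)

print(correct, swapped, order_independent)
```

In this sketch the one-to-many effect only shows up in the deliberately mislabelled comparison, which mirrors what happened when the columns were renamed without being reordered.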
Corey Levinson answered Sep 20 '22 01:09
