I am trying to use QuantileTransformer to transform several columns, but the results don't seem to be consistent. Moreover, they depend on the column order, even for a small dataset.
I understand that I could create an individual transformer for each feature, but according to the documentation, this transformer should accept an (n_samples, n_features) object.
Here is a Google Colab notebook to reproduce the results.
Is there a way to apply QuantileTransformer and get consistent results (so that the same original values are mapped to the same transformed values instead of one to many)?
import pandas as pd
from sklearn.preprocessing import QuantileTransformer
def unique_values(x):
    return x.unique().tolist()
df = pd.read_csv('https://storage.googleapis.com/ml_universities/california_housing_train.csv', usecols=[0, 1])
columns = ['latitude', 'longitude']
qt = QuantileTransformer()
q_features = qt.fit_transform(df)
suffix = '__qt'
qdf = df.join(pd.DataFrame(q_features, columns=columns), rsuffix=suffix)
for col in columns:
    q_col = f'{col}{suffix}'
    print({col: qdf[col].nunique(), q_col: qdf[q_col].nunique()})
    gdf = qdf.groupby(col)[q_col].agg([pd.Series.nunique, unique_values])
    print(gdf.sort_values('nunique', ascending=False).head())
Results:
{'latitude': 840, 'latitude__qt': 827}
          nunique                                      unique_values
latitude
34.07       102.0  [0.9865865865865866, 0.9719719719719734, 0.963...
34.08       101.0  [0.980980980980981, 0.9474474474474475, 0.9214...
34.06        94.0  [0.9846403596403596, 0.932932932932933, 0.9294...
34.10        88.0  [0.9891329870516945, 0.9882813721745806, 0.987...
34.05        87.0  [0.9719719719719734, 0.9269269269269284, 0.923...
{'longitude': 827, 'longitude__qt': 842}
           nunique                                      unique_values
longitude
-118.31       50.0  [0.6276276276276276, 0.5721203907954981, 0.511...
-118.32       49.0  [0.5369214480068981, 0.504004004004004, 0.4804...
-118.12       49.0  [0.5418393378488674, 0.5415415415415415, 0.540...
-117.25       48.0  [0.5335335335335335, 0.5327261051927988, 0.452...
-118.15       47.0  [0.5495495495495496, 0.5418393378488674, 0.541...
Different column order:
df = pd.read_csv('https://storage.googleapis.com/ml_universities/california_housing_train.csv', usecols=[0, 1])
columns = ['longitude', 'latitude']
qt = QuantileTransformer()
q_features = qt.fit_transform(df)
suffix = '__qt'
qdf = df.join(pd.DataFrame(q_features, columns=columns), rsuffix=suffix)
for col in columns:
    q_col = f'{col}{suffix}'
    print({col: qdf[col].nunique(), q_col: qdf[q_col].nunique()})
    gdf = qdf.groupby(col)[q_col].agg([pd.Series.nunique, unique_values])
    print(gdf.sort_values('nunique', ascending=False).head())
Results:
{'longitude': 827, 'longitude__qt': 827}
           nunique            unique_values
longitude
-124.35        1.0  [9.999999977795539e-08]
-118.31        1.0     [0.5900900900900901]
-118.41        1.0      [0.531031031031031]
-118.40        1.0     [0.5355355355355356]
-118.39        1.0      [0.542542542542544]
{'latitude': 840, 'latitude__qt': 842}
          nunique                             unique_values
latitude
37.74         2.0  [0.7602602602602603, 0.7577577577577578]
37.37         2.0  [0.6806806806806807, 0.6816816816816816]
32.54         1.0                   [9.999999977795539e-08]
38.34         1.0                      [0.8848848848848849]
38.36         1.0                      [0.8873873873873874]
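The problem is that you didn't change the order of the columns; you only renamed them. QuantileTransformer works by position and returns a plain array whose columns follow the order of df, so labelling that array with a column list in a different order attaches the wrong names to the transformed values. If you reorder df itself, as below, you will get the correct results. I am also supplying a random_state parameter for good measure.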
import pandas as pd
from sklearn.preprocessing import QuantileTransformer
def unique_values(x):
    return x.unique().tolist()
df = pd.read_csv('https://storage.googleapis.com/ml_universities/california_housing_train.csv', usecols=[0, 1])
columns = ['latitude', 'longitude']
# Change the column order
df = df[columns]
qt = QuantileTransformer(random_state = 0)
q_features = qt.fit_transform(df)
suffix = '__qt'
qdf = df.join(pd.DataFrame(q_features, columns=columns), rsuffix=suffix)
for col in columns:
    q_col = f'{col}{suffix}'
    print({col: qdf[col].nunique(), q_col: qdf[q_col].nunique()})
    gdf = qdf.groupby(col)[q_col].agg([pd.Series.nunique, unique_values])
    print(gdf.sort_values('nunique', ascending=False).head())
This produces the same output, just in a different order (which is what you wanted, because the column order was switched), as
df = pd.read_csv('https://storage.googleapis.com/ml_universities/california_housing_train.csv', usecols=[0, 1])
columns = ['longitude', 'latitude']
df = df[columns] # Changing the column order
qt = QuantileTransformer()
q_features = qt.fit_transform(df)
suffix = '__qt'
qdf = df.join(pd.DataFrame(q_features, columns=columns), rsuffix=suffix)
for col in columns:
    q_col = f'{col}{suffix}'
    print({col: qdf[col].nunique(), q_col: qdf[q_col].nunique()})
    gdf = qdf.groupby(col)[q_col].agg([pd.Series.nunique, unique_values])
    print(gdf.sort_values('nunique', ascending=False).head())
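One way to avoid this kind of mislabelling altogether is to take the column names from df itself instead of typing them by hand. The following is only a minimal sketch of that idea, not the code from the question: it uses synthetic data (made-up values, no download), and is_monotone is a hypothetical helper introduced for illustration. It relies on the fact that a quantile transform never decreases as its own input increases, so a transformed column that is not monotone in its supposed source column has probably been given the wrong name. It also checks that a column fitted alone gives the same values as in the joint fit, since each column is transformed independently.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Hypothetical stand-in for the housing data: two independent continuous columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "longitude": rng.normal(-119.0, 2.0, size=500),
    "latitude": rng.normal(36.0, 2.0, size=500),
})

qt = QuantileTransformer(n_quantiles=100, random_state=0)
q_features = qt.fit_transform(df)  # plain ndarray, columns in the order of df


def is_monotone(frame, col, q_col):
    """Return True if q_col never decreases when rows are sorted by col."""
    ordered = frame.sort_values(col)[q_col].to_numpy()
    return bool(np.all(np.diff(ordered) >= -1e-12))


# Labels taken from df itself always match the positional output.
good = df.join(
    pd.DataFrame(q_features, columns=df.columns, index=df.index).add_suffix("__qt")
)
# Labels typed by hand in the wrong order silently swap the two columns.
bad = df.join(
    pd.DataFrame(q_features, columns=["latitude", "longitude"], index=df.index)
    .add_suffix("__qt")
)

good_ok = all(is_monotone(good, c, f"{c}__qt") for c in df.columns)
bad_ok = all(is_monotone(bad, c, f"{c}__qt") for c in df.columns)

# Each column is transformed independently, so fitting one column alone
# should give the same values as the matching column of the joint fit.
solo = QuantileTransformer(n_quantiles=100, random_state=0).fit_transform(
    df[["longitude"]]
)
same_as_joint = bool(np.allclose(solo[:, 0], q_features[:, 0]))

print(good_ok, bad_ok, same_as_joint)
```

The synthetic columns are continuous on purpose. With heavily repeated values, as in the real dataset, a handful of inputs may still map to two nearby outputs even when the labels are correct (as the 37.74 and 37.37 rows in the question's second run suggest), so the monotonicity check would need a looser tolerance there.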