I have the following type of dataframe:
   Channel  Region  Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicassen
0        2       3  12669  9656     7561     214              2674        1338
1        2       3   7057  9810     9568    1762              3293        1776
2        2       3   6353  8808     7684    2405              3516        7844
3        1       3  13265  1196     4221    6404               507        1788
4        2       3  22615  5410     7198    3915              1777        5185
I would like to do two things:
1) Rescale only certain columns, not all of them, so that their values lie between 0 and 1. I would like to select those columns by position rather than by name: imagine I want to change 200 columns and don't want to write out all of their names.
The code I tried was:
df /= df.max()
But this scales all of the columns to between 0 and 1, not only the ones I want, and I can't find a way to select just some of them.
2) I would also like to rescale columns independently of each other. What I mean is that I would like one scale only for Milk and another one only for Frozen, for instance.
For example, I might divide one column by 100 because its values are too big, but divide another column by only 10 because 100 is too much. How would I do that?
This process is called scaling. The two most common techniques for scaling the columns of a pandas DataFrame are min-max normalization and standardization.
Python's scikit-learn library has a tool just for this called MinMaxScaler. You can use that to rescale your values as well, if you'd like. Sometimes data spans many powers of 10. A great example is annual income.
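As a rough illustration of the MinMaxScaler route, here is a minimal sketch that assumes scikit-learn is installed and uses a small made-up subset of the question's data. The column list `cols` is just an example choice, not something from the original post.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Small sample in the shape of the question's dataframe (illustrative values)
df = pd.DataFrame({
    "Channel": [2, 2, 1],
    "Fresh": [12669, 7057, 13265],
    "Milk": [9656, 9810, 1196],
    "Frozen": [214, 1762, 6404],
})

cols = ["Milk", "Frozen"]  # example choice of columns to rescale

# fit_transform scales each selected column to [0, 1] independently;
# columns not listed in cols are left untouched
df[cols] = MinMaxScaler().fit_transform(df[cols])
```

Each column is scaled with its own minimum and maximum, so Milk and Frozen do not influence each other.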
For 1, you can select a list of columns like this:
df[['Milk','Frozen','Grocery']]
Therefore, to rescale only those three columns, use:
df[['Milk','Frozen','Grocery']] -= df[['Milk','Frozen','Grocery']].min()
df[['Milk','Frozen','Grocery']] /= df[['Milk','Frozen','Grocery']].max()
This method already scales your columns independently of each other, if this is what your second question means.
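For reference, here is a self-contained sketch of the same min-max idea on the sample data from the question. It computes the result in one expression rather than with the two in-place steps. The variable names `cols` and `sub` are only illustrative.

```python
import pandas as pd

# The sample rows from the question
df = pd.DataFrame({
    "Channel": [2, 2, 2, 1, 2],
    "Region": [3, 3, 3, 3, 3],
    "Fresh": [12669, 7057, 6353, 13265, 22615],
    "Milk": [9656, 9810, 8808, 1196, 5410],
    "Grocery": [7561, 9568, 7684, 4221, 7198],
    "Frozen": [214, 1762, 2405, 6404, 3915],
    "Detergents_Paper": [2674, 3293, 3516, 507, 1777],
    "Delicassen": [1338, 1776, 7844, 1788, 5185],
})

cols = ["Milk", "Frozen", "Grocery"]

# Work on a float copy of the selected columns
sub = df[cols].astype(float)

# min() and max() are computed per column, so each column gets its own scale
df[cols] = (sub - sub.min()) / (sub.max() - sub.min())
```

Note that a column whose values are all identical has a range of zero, so this formula would produce NaN for it.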
EDIT:
If you want to select the first 200 columns of your dataframe, you can use df.columns, which gives you the list of your columns:
df[df.columns[:200]] -= df[df.columns[:200]].min()
df[df.columns[:200]] /= df[df.columns[:200]].max()
The max method on a pandas dataframe returns a Series holding the max of each column. Therefore, if you use the above code, the max value in each of the selected columns will be exactly equal to 1.
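To make the position-based selection concrete, here is a small sketch that rescales only the columns at positions 2 to 4 of a made-up dataframe. The slice `2:5` is an arbitrary example; with 200 columns you would just change the slice bounds.

```python
import pandas as pd

# Small sample in the shape of the question's dataframe (illustrative values)
df = pd.DataFrame({
    "Channel": [2, 2, 1],
    "Region": [3, 3, 3],
    "Fresh": [12669, 7057, 13265],
    "Milk": [9656, 9810, 1196],
    "Grocery": [7561, 9568, 4221],
})

# Pick columns by position: here positions 2, 3 and 4
cols = df.columns[2:5]

# Min-max rescale only those columns, each with its own min and max
sub = df[cols].astype(float)
df[cols] = (sub - sub.min()) / (sub.max() - sub.min())
```

Non-contiguous positions can be picked the same way by indexing df.columns with a list of integers, for example df.columns[[2, 4]].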
If you don't want to divide by the max of each column, but rather the first column by n1, the second column by n2, and so on, you can use the same notation:
df[df.columns[:4]] /= [n1,n2,n3,n4]
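As a variation on the per-column divisors, the sketch below keys the divisors by column name with a pandas Series, so each divisor is matched to its column by label rather than by order. The divisors 100 and 10 are taken from the question's example; the name `divisors` and the sample values are illustrative.

```python
import pandas as pd

# Small sample in the shape of the question's dataframe (illustrative values)
df = pd.DataFrame({
    "Fresh": [12669, 7057],
    "Milk": [9656, 9810],
    "Frozen": [214, 1762],
})

# One divisor per column to rescale; other columns are left alone
divisors = pd.Series({"Milk": 100, "Frozen": 10})

# Dividing a DataFrame by a Series aligns the Series index with the column labels
df[divisors.index] = df[divisors.index] / divisors
```

After this, Milk has been divided by 100 and Frozen by 10, while Fresh is unchanged.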
Here's a solution for a single column which does actually rescale to the range [0, 1]:
import pandas as pd
a = [5,15,25,35,45,50,55,65,75,85,95]
df = pd.DataFrame(data=a, columns=['a'])
# subtract the column minimum, then divide by the range (max - min)
df['rescale'] = (df['a'] - df['a'].min()) / (df['a'].max() - df['a'].min())
Also a numpy method:
import numpy as np
# np.ptp(a) is the "peak to peak" range, i.e. max(a) - min(a)
rescale = (np.asarray(a) - np.min(a)) / np.ptp(a)