I have a data frame with two columns :
state total_sales
AL 16714
AR 6498
AZ 107296
CA 33717
Now I want to map the strings in state column to int from 1 to N(where N is the no of rows,here 4 ) based on increasing order of values in total_sales . Result should be stored in another column (say label). That is, wanted a result like this :
state total_sales label
AL 16714 3
AR 6498 4
AZ 107296 1
CA 33717 2
Please suggest a vectorised implementation .
You can use rank
with cast to int
:
df['label'] = df['total_sales'].rank(method='dense', ascending=False).astype(int)
print (df)
state total_sales label
0 AL 16714 3
1 AR 6498 4
2 AZ 107296 1
3 CA 33717 2
One option for converting a column of values to integers is pandas.Categorical
.
This actually groups identical values, which in a case like this, where all values are unique each "group" has only one value. The resulting object has a codes
attribute, which is a Numpy array of integers indicating which group each input value is in.
Applied to this problem, if you have
In [12]: data = pd.DataFrame({
'state': ['AL', 'AR', 'AZ', 'CA'],
'total_sales': [16714, 6498, 107296, 33717]
})
you can add the label column as described using
In [13]: data['label'] = len(data) - pd.Categorical(data.total_sales, ordered=True).codes
In [14]: print(data)
state total_sales label
0 AL 16714 3
1 AR 6498 4
2 AZ 107296 1
3 CA 33717 2
For this example it is not as fast as jezrael's answer but it has a wide range of applications and it was faster for a larger series I was encoding to integers. It should be noted that if there are two identical values in the total_sales
column it will assign them the same label.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With