I have a data frame with categories and values. I need to find the value in each category closest to a value. I think I'm close but I can't really get the right output when applying the results of argsort to the original dataframe.
For example, if the input was defined in the code below the output should have only (a, 1, True)
, (b, 2, True)
, (c, 2, True)
and all other isClosest Values
should be False.
If multiple values are closest then it should be the first value listed marked.
Here is the code I have which works but I can't get it to reapply to the dataframe correctly. I would love some pointers.
df = pd.DataFrame()
df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
df['values'] = [1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
df['isClosest'] = False
uniqueCategories = df['category'].unique()
for c in uniqueCategories:
filteredCategories = df[df['category']==c]
sortargs = (filteredCategories['value']-2.0).abs().argsort()
#how to use sortargs so that we set column in df isClosest=True if its the closest value in each category to 2.0?
To find the closest value in the Numpy array to a certain value, we use the following function: def find_closest(arr, val): idx = np. abs(arr – val).
Select Data Using Location Index (. This means that you can use dataframe. iloc[0:1, 0:1] to select the cell value at the intersection of the first row and first column of the dataframe. You can expand the range for either the row index or column index to select more data.
loc[source] Access a group of rows and columns by label(s) or a boolean array. .loc[] is primarily label based, but may also be used with a boolean array. Allowed inputs are: A single label, e.g. 5 or 'a' , (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
You can create a column of absolute differences:
df['dif'] = (df['values'] - 2).abs()
df
Out:
category values dif
0 a 1 1
1 b 2 0
2 b 3 1
3 b 4 2
4 c 5 3
5 a 4 2
6 b 3 1
7 c 2 0
8 c 1 1
9 a 0 2
And then use groupby.transform
to check whether the minimum value of each group is equal to the difference you calculated:
df['is_closest'] = df.groupby('category')['dif'].transform('min') == df['dif']
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
df.groupby('category')['dif'].idxmin()
would also give you the indices of the closest values for each category. You can use that for mapping too.
For selection:
df.loc[df.groupby('category')['dif'].idxmin()]
Out:
category values dif
0 a 1 1
1 b 2 0
7 c 2 0
For assignment:
df['is_closest'] = False
df.loc[df.groupby('category')['dif'].idxmin(), 'is_closest'] = True
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
The difference between these approaches is that if you check equality against the difference, you would get True for all rows in case of ties. However, with idxmin
it will return True for the first occurrence (only one for each group).
Solution with DataFrameGroupBy.idxmin
- get indexes of minimal values per group and then assign boolean mask by Index.isin
to column isClosest
:
idx = (df['values'] - 2).abs().groupby([df['category']]).idxmin()
print (idx)
category
a 0
b 1
c 7
Name: values, dtype: int64
df['isClosest'] = df.index.isin(idx)
print (df)
category values isClosest
0 a 1 True
1 b 2 True
2 b 3 False
3 b 4 False
4 c 5 False
5 a 4 False
6 b 3 False
7 c 2 True
8 c 1 False
9 a 0 False
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With