Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Identifying closest value in a column for each filter using Pandas

I have a data frame with categories and values. I need to find the value in each category closest to a value. I think I'm close but I can't really get the right output when applying the results of argsort to the original dataframe.

For example, if the input was defined in the code below the output should have only (a, 1, True), (b, 2, True), (c, 2, True) and all other isClosest Values should be False.

If multiple values are closest then it should be the first value listed marked.

Here is the code I have which works but I can't get it to reapply to the dataframe correctly. I would love some pointers.

df = pd.DataFrame()
df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
df['values'] = [1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
df['isClosest'] = False

uniqueCategories = df['category'].unique()
for c in uniqueCategories:
    filteredCategories = df[df['category']==c]    
    sortargs = (filteredCategories['value']-2.0).abs().argsort()
    #how to use sortargs so that we set column in df isClosest=True if its the closest value in each category to 2.0?
like image 552
Scott Chamberlin Avatar asked Oct 11 '16 04:10

Scott Chamberlin


People also ask

How do I find the closest value in a column in Python?

To find the closest value in the Numpy array to a certain value, we use the following function: def find_closest(arr, val): idx = np. abs(arr – val).

How do you select a specific value from a pandas series?

Select Data Using Location Index (. This means that you can use dataframe. iloc[0:1, 0:1] to select the cell value at the intersection of the first row and first column of the dataframe. You can expand the range for either the row index or column index to select more data.

What is LOC method in pandas?

loc[source] Access a group of rows and columns by label(s) or a boolean array. .loc[] is primarily label based, but may also be used with a boolean array. Allowed inputs are: A single label, e.g. 5 or 'a' , (note that 5 is interpreted as a label of the index, and never as an integer position along the index).


2 Answers

You can create a column of absolute differences:

df['dif'] = (df['values'] - 2).abs()

df
Out: 
  category  values  dif
0        a       1    1
1        b       2    0
2        b       3    1
3        b       4    2
4        c       5    3
5        a       4    2
6        b       3    1
7        c       2    0
8        c       1    1
9        a       0    2

And then use groupby.transform to check whether the minimum value of each group is equal to the difference you calculated:

df['is_closest'] = df.groupby('category')['dif'].transform('min') == df['dif']

df
Out: 
  category  values  dif is_closest
0        a       1    1       True
1        b       2    0       True
2        b       3    1      False
3        b       4    2      False
4        c       5    3      False
5        a       4    2      False
6        b       3    1      False
7        c       2    0       True
8        c       1    1      False
9        a       0    2      False

df.groupby('category')['dif'].idxmin() would also give you the indices of the closest values for each category. You can use that for mapping too.

For selection:

df.loc[df.groupby('category')['dif'].idxmin()]
Out: 
  category  values  dif
0        a       1    1
1        b       2    0
7        c       2    0

For assignment:

df['is_closest'] = False
df.loc[df.groupby('category')['dif'].idxmin(), 'is_closest'] = True
df
Out: 
  category  values  dif is_closest
0        a       1    1       True
1        b       2    0       True
2        b       3    1      False
3        b       4    2      False
4        c       5    3      False
5        a       4    2      False
6        b       3    1      False
7        c       2    0       True
8        c       1    1      False
9        a       0    2      False

The difference between these approaches is that if you check equality against the difference, you would get True for all rows in case of ties. However, with idxmin it will return True for the first occurrence (only one for each group).

like image 112
ayhan Avatar answered Oct 17 '22 17:10

ayhan


Solution with DataFrameGroupBy.idxmin - get indexes of minimal values per group and then assign boolean mask by Index.isin to column isClosest:

idx = (df['values'] - 2).abs().groupby([df['category']]).idxmin()
print (idx)
category
a    0
b    1
c    7
Name: values, dtype: int64

df['isClosest'] = df.index.isin(idx)
print (df)
  category  values isClosest
0        a       1      True
1        b       2      True
2        b       3     False
3        b       4     False
4        c       5     False
5        a       4     False
6        b       3     False
7        c       2      True
8        c       1     False
9        a       0     False
like image 38
jezrael Avatar answered Oct 17 '22 16:10

jezrael