Identifying closest value in a column for each filter using Pandas

Tags: python, pandas, group-by, boolean, closest

I have a data frame with categories and values. For each category, I need to find the value closest to a given target value. I think I'm close, but I can't get the right output when applying the results of argsort to the original dataframe.

For example, given the input defined in the code below, the output should mark only (a, 1, True), (b, 2, True) and (c, 2, True); every other isClosest value should be False.

If several values are equally close, only the first one listed should be marked.

Here is the code I have. It works as far as it goes, but I can't get the result applied back to the dataframe correctly. I would love some pointers.

import pandas as pd

df = pd.DataFrame()
df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
df['values'] = [1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
df['isClosest'] = False

uniqueCategories = df['category'].unique()
for c in uniqueCategories:
    filteredCategories = df[df['category'] == c]
    # positions within this category, ordered by distance from 2.0
    sortargs = (filteredCategories['values'] - 2.0).abs().argsort()
    # how to use sortargs so that we set column isClosest=True in df if it's the closest value in each category to 2.0?


asked Oct 11 '16 04:10

Scott Chamberlin

2 Answers

You can create a column of absolute differences:

df['dif'] = (df['values'] - 2).abs()

df
Out: 
  category  values  dif
0        a       1    1
1        b       2    0
2        b       3    1
3        b       4    2
4        c       5    3
5        a       4    2
6        b       3    1
7        c       2    0
8        c       1    1
9        a       0    2

And then use groupby.transform to check whether each row's difference equals the minimum difference within its group:

df['is_closest'] = df.groupby('category')['dif'].transform('min') == df['dif']

df
Out: 
  category  values  dif is_closest
0        a       1    1       True
1        b       2    0       True
2        b       3    1      False
3        b       4    2      False
4        c       5    3      False
5        a       4    2      False
6        b       3    1      False
7        c       2    0       True
8        c       1    1      False
9        a       0    2      False

df.groupby('category')['dif'].idxmin() also gives you the index label of the closest value in each category. You can use that for selection or assignment.

For selection:

df.loc[df.groupby('category')['dif'].idxmin()]
Out: 
  category  values  dif
0        a       1    1
1        b       2    0
7        c       2    0

For assignment:

df['is_closest'] = False
df.loc[df.groupby('category')['dif'].idxmin(), 'is_closest'] = True
df
Out: 
  category  values  dif is_closest
0        a       1    1       True
1        b       2    0       True
2        b       3    1      False
3        b       4    2      False
4        c       5    3      False
5        a       4    2      False
6        b       3    1      False
7        c       2    0       True
8        c       1    1      False
9        a       0    2      False

The two approaches differ when there are ties. Checking equality against the group minimum gives True for every tied row, whereas idxmin returns only the first occurrence, so exactly one row per group is marked, which is what the question asks for.
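To make the tie behaviour concrete, here is a minimal sketch on made-up data (the `ties` frame and the column names `all_ties` and `first_only` are hypothetical, not from the question). Category 'a' has two values, 1 and 3, that are equally close to 2:

```python
import pandas as pd

# Hypothetical data: in category 'a', both 1 and 3 are at distance 1 from 2;
# in category 'b', both rows are exactly 2.
ties = pd.DataFrame({
    'category': ['a', 'a', 'a', 'b', 'b'],
    'values':   [1, 3, 7, 2, 2],
})
ties['dif'] = (ties['values'] - 2).abs()

# Equality against the group minimum: every tied row is marked.
ties['all_ties'] = ties.groupby('category')['dif'].transform('min') == ties['dif']

# idxmin: only the first row holding the minimum in each group is marked.
ties['first_only'] = False
ties.loc[ties.groupby('category')['dif'].idxmin(), 'first_only'] = True

print(ties)
```

With this input, `all_ties` should be True for rows 0, 1, 3 and 4, while `first_only` should be True only for rows 0 and 3.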


answered Oct 17 '22 17:10

ayhan

A solution with SeriesGroupBy.idxmin: get the index label of the minimal absolute difference in each group, then build a boolean mask with Index.isin and assign it to the column isClosest:

idx = (df['values'] - 2).abs().groupby([df['category']]).idxmin()
print (idx)
category
a    0
b    1
c    7
Name: values, dtype: int64

df['isClosest'] = df.index.isin(idx)
print (df)
  category  values isClosest
0        a       1      True
1        b       2      True
2        b       3     False
3        b       4     False
4        c       5     False
5        a       4     False
6        b       3     False
7        c       2      True
8        c       1     False
9        a       0     False
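For completeness, the loop from the question can probably be finished along these lines. This is only a sketch, not taken from either answer: the names `target`, `filtered`, `order` and `closest_label` are illustrative. The key point is that argsort returns positions within the filtered frame, so the first position has to be translated back into an index label of df. It also passes kind='mergesort', a stable sort, on the assumption that the first of several tied rows should win.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a'],
    'values':   [1, 2, 3, 4, 5, 4, 3, 2, 1, 0],
})
df['isClosest'] = False

target = 2.0  # illustrative name for the value being matched
for c in df['category'].unique():
    filtered = df[df['category'] == c]
    # argsort yields positions within `filtered`, not index labels of `df`.
    # A stable sort keeps the first of several equally close rows in front.
    order = (filtered['values'] - target).abs().argsort(kind='mergesort')
    # First sorted position -> row position in `filtered` -> index label in `df`.
    closest_label = filtered.index[order.iloc[0]]
    df.loc[closest_label, 'isClosest'] = True

print(df)
```

On the sample data this marks the rows with index 0, 1 and 7, the same rows as the idxmin solutions, though the groupby versions avoid the explicit Python loop.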

answered Oct 17 '22 16:10

jezrael


