How do I find the closest values in a Pandas series to an input number?

Tags:

python

pandas

dataframe

ranking

I have seen:

How do I find the closest value to a given number in an array?
How do I find the closest array element to an arbitrary (non-member) number?

These relate to vanilla Python and not pandas.

If I have the series:

    ix   num
    0    1
    1    6
    2    4
    3    5
    4    2

And I input 3, how can I (efficiently) find:

1. The index of 3, if it is found in the series
2. The indexes of the values just below and above 3, if it is not found in the series

I.e., with the above series {1,6,4,5,2} and input 3, I should get values (4,2) with indexes (2,4).


asked May 07 '15 21:05

Steve

2 Answers

You could use argsort(). Say input = 3:

    In [198]: input = 3

    In [199]: df.iloc[(df['num']-input).abs().argsort()[:2]]
    Out[199]:
       num
    2    4
    4    2

df_sort is the dataframe with the 2 closest values.

    In [200]: df_sort = df.iloc[(df['num']-input).abs().argsort()[:2]]

For index,

    In [201]: df_sort.index.tolist()
    Out[201]: [2, 4]

For values,

    In [202]: df_sort['num'].tolist()
    Out[202]: [4, 2]

Detail: for the above solution, df was

    In [197]: df
    Out[197]:
       num
    0    1
    1    6
    2    4
    3    5
    4    2
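
The steps above can be collected into one self-contained sketch. The helper name closest_rows and its parameter k are hypothetical, introduced only for illustration, and the input is called target here to avoid shadowing Python's built-in input. A stable sort is requested so that ties are resolved in row order.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({"num": [1, 6, 4, 5, 2]})

def closest_rows(df, colname, target, k=2):
    """Return the k rows whose colname value is nearest to target (hypothetical helper)."""
    # Absolute distance of every value to the target.
    distance = (df[colname] - target).abs()
    # Positions of the k smallest distances; a stable sort keeps ties in row order.
    positions = distance.to_numpy().argsort(kind="stable")[:k]
    return df.iloc[positions]

result = closest_rows(df, "num", 3)
print(result.index.tolist())   # [2, 4]
print(result["num"].tolist())  # [4, 2]
```

Note that this picks the nearest values by distance only, so both results may lie on the same side of the target.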

answered Oct 01 '22 06:10

Zero

Apart from not completely answering the question, an extra disadvantage of the other algorithms discussed here is that they have to sort the entire list. This results in a complexity of ~N log(N).

However, it is possible to achieve the same results in ~N. This approach separates the dataframe into two subsets, one with values smaller and one with values larger than the desired value. The lower neighbour is then the largest value in the smaller subset, and the upper neighbour is the smallest value in the larger subset.

This gives the following code snippet:

    def find_neighbours(value, df, colname):
        # Rows whose value matches exactly.
        exactmatch = df[df[colname] == value]
        if not exactmatch.empty:
            return exactmatch.index
        else:
            # Largest value below and smallest value above the target.
            lowerneighbour_ind = df[df[colname] < value][colname].idxmax()
            upperneighbour_ind = df[df[colname] > value][colname].idxmin()
            return [lowerneighbour_ind, upperneighbour_ind]
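
The snippet above assumes the value lies strictly between the smallest and largest entries; otherwise one of the two subsets is empty and idxmax() or idxmin() would fail on it. A minimal sketch of one way to handle that case follows. The name find_neighbours_safe and the use of None for a missing neighbour are assumptions made for illustration, not part of the answer.

```python
import pandas as pd

def find_neighbours_safe(value, df, colname):
    """Variant of find_neighbours that tolerates a missing neighbour (hypothetical name)."""
    col = df[colname]
    # Index labels of exact matches, if any.
    exact = col[col == value].index.tolist()
    if exact:
        return exact
    lower = col[col < value]
    upper = col[col > value]
    # None marks a side that has no neighbour (an illustrative convention).
    lower_ind = lower.idxmax() if not lower.empty else None
    upper_ind = upper.idxmin() if not upper.empty else None
    return [lower_ind, upper_ind]

df = pd.DataFrame({"num": [1, 6, 4, 5, 2]})
print(find_neighbours_safe(3, df, "num"))  # lower neighbour at index 4, upper at index 2
print(find_neighbours_safe(4, df, "num"))  # exact match at index 2
print(find_neighbours_safe(0, df, "num"))  # no lower neighbour, upper at index 0
```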

This approach partitions the data instead of sorting it, which can be really useful when dealing with large datasets where complexity becomes an issue.
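
If the k nearest values are wanted rather than one neighbour on each side, a partial sort gives a similar saving. The sketch below swaps in NumPy's argpartition, which is not used in the answer itself; the helper name k_nearest is hypothetical.

```python
import numpy as np
import pandas as pd

def k_nearest(df, colname, target, k=2):
    """Return the k rows nearest to target using a partial sort (hypothetical helper)."""
    distance = np.abs(df[colname].to_numpy() - target)
    k = min(k, len(distance))
    if k == 0:
        return df.iloc[[]]
    # argpartition moves the positions of the k smallest distances into the
    # first k slots (in arbitrary order) without sorting the whole array.
    nearest = np.argpartition(distance, k - 1)[:k]
    # Order only those k positions by distance, then by position for ties.
    nearest = nearest[np.lexsort((nearest, distance[nearest]))]
    return df.iloc[nearest]

df = pd.DataFrame({"num": [1, 6, 4, 5, 2]})
nearest_rows = k_nearest(df, "num", 3)
print(nearest_rows.index.tolist())
print(nearest_rows["num"].tolist())
```

When several values are equally far from the target at the cut-off, which of them is selected is not guaranteed.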

Comparing both strategies shows that for large N, the partitioning strategy is indeed faster. For small N, the sorting strategy will be more efficient, as it is implemented at a much lower level. It is also a one-liner, which might increase code readability.

[Figure: Comparison of partitioning vs sorting]

The code to replicate this plot can be seen below:

    from matplotlib import pyplot as plt
    import pandas
    import numpy
    import timeit

    # Requires find_neighbours(value, df, colname) as defined above.
    value = 3
    sizes = numpy.logspace(2, 5, num=50, dtype=int)

    sort_results, partition_results = [], []
    for size in sizes:
        df = pandas.DataFrame({"num": 100 * numpy.random.random(size)})

        # Both timers share the same namespace.
        timer_globals = {'find_neighbours': find_neighbours, 'df': df, 'value': value}
        sort_results.append(timeit.Timer(
            "df.iloc[(df['num']-value).abs().argsort()[:2]].index",
            globals=timer_globals).autorange())
        # Arguments follow the signature find_neighbours(value, df, colname).
        partition_results.append(timeit.Timer(
            "find_neighbours(value, df, 'num')",
            globals=timer_globals).autorange())

    # autorange() returns (number of loops, total time); divide for time per call.
    sort_time = [time / amount for amount, time in sort_results]
    partition_time = [time / amount for amount, time in partition_results]

    plt.plot(sizes, sort_time)
    plt.plot(sizes, partition_time)
    plt.legend(['Sorting', 'Partitioning'])
    plt.title('Comparison of strategies')
    plt.xlabel('Size of Dataframe')
    plt.ylabel('Time in s')
    plt.savefig('speed_comparison.png')

answered Oct 01 '22 07:10

Ivo Merchiers
