pandas: how to find consecutive values in a Series whose differences are within a certain distance

Tags: python, python-3.x, pandas, numpy

I have a pandas Series composed of ints:

import numpy as np
import pandas as pd

a = np.array([1,2,3,5,7,10,13,16,20])
pd.Series(a)

0  1
1  2
2  3
3  5
4  7
5  10
6  13
7  16
8  20

Now I want to cluster the series into groups such that, within each group, the difference between any two neighbouring values is <= distance. For example, if the distance is 1, we have

[1,2,3], [5], [7], [10], [13], [16], [20]

if the distance is 2, we have

[1,2,3,5,7], [10], [13], [16], [20]

if the distance is 3, we have

[1,2,3,5,7,10,13,16], [20]

How can I do this using pandas/numpy?

asked Nov 08 '17 15:11

daiyue

2 Answers

This is the pandas way, using groupby.

n = 1

s

0     1
1     2
2     3
3     5
4     7
5    10
6    13
7    16
8    20
dtype: int64

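# True where the gap to the previous value exceeds n, i.e. where a new group starts
m = ~s.diff().fillna(0).le(n)
# the running sum of the mask labels each element with its group
v = s.groupby(m.cumsum()).apply(lambda x: x.tolist()).tolist()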

v
[[1, 2, 3], [5], [7], [10], [13], [16], [20]]
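
To see why this works, it may help to look at the intermediate group labels. The sketch below wraps the same steps in a function. The name `pandas_way` is borrowed from the runtime test in the other answer, which does not show its definition, so the body here is an assumption based on the snippet above.

```python
import numpy as np
import pandas as pd

def pandas_way(s, n):
    # True wherever the gap to the previous value exceeds n (a new group starts there).
    # The first diff is NaN; fillna(0) keeps the first element in the first group.
    starts = ~s.diff().fillna(0).le(n)
    # The running count of group starts gives every element a group id.
    group_id = starts.cumsum()
    return s.groupby(group_id).apply(lambda x: x.tolist()).tolist()

s = pd.Series(np.array([1, 2, 3, 5, 7, 10, 13, 16, 20]))
labels = (~s.diff().fillna(0).le(2)).cumsum().tolist()
print(labels)            # [0, 0, 0, 0, 0, 1, 2, 3, 4]
print(pandas_way(s, 2))  # [[1, 2, 3, 5, 7], [10], [13], [16], [20]]
```

Like the original snippet, this assumes the Series is already sorted in ascending order.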

answered Sep 22 '22 02:09

cs95

Here's one approach -

np.split(a,np.flatnonzero(np.diff(a)>d)+1)
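
Broken into steps, the one-liner can be read roughly as follows. This is only an illustrative unpacking with `d = 2`; the names `gaps` and `cut_points` are made up for the example.

```python
import numpy as np

a = np.array([1, 2, 3, 5, 7, 10, 13, 16, 20])
d = 2

# Difference between each value and its predecessor (one element shorter than a).
gaps = np.diff(a)                          # [1 1 2 2 3 3 3 4]
# Positions in gaps where the gap is too large; +1 converts them to
# indices in a at which a new group begins.
cut_points = np.flatnonzero(gaps > d) + 1  # [5 6 7 8]
# np.split cuts the array just before each of those indices.
groups = np.split(a, cut_points)
print([g.tolist() for g in groups])        # [[1, 2, 3, 5, 7], [10], [13], [16], [20]]
```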

As a function to output list of lists -

def splitme(a, d):
    return list(map(list, np.split(a, np.flatnonzero(np.diff(a) > d) + 1)))

For performance, I would suggest building a mask of group boundaries, pairing the start and stop indices with zip, and then slicing, thus avoiding np.split, which might prove to be the bottleneck -

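def splitme_zip(a, d):
    # True at the start of each group, plus a trailing True marking the end of the array
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    idx = np.flatnonzero(m)  # group start indices followed by len(a)
    lst = a.tolist()
    return [lst[i:j] for i, j in zip(idx[:-1], idx[1:])]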

If you need the output as a list of arrays, skip the list conversions: drop map(list, ...) in splitme, and slice a directly instead of a.tolist() in splitme_zip.
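
As a rough sketch of that list-of-arrays variant (the name `splitme_zip_arrays` is hypothetical, not part of the answer):

```python
import numpy as np

def splitme_zip_arrays(a, d):
    # Same boundary mask as splitme_zip: True at each group start,
    # plus a trailing True so the last group gets a stop index.
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    idx = np.flatnonzero(m)
    # Slice the array itself instead of a Python list.
    return [a[i:j] for i, j in zip(idx[:-1], idx[1:])]

a = np.array([1, 2, 3, 5, 7, 10, 13, 16, 20])
groups = splitme_zip_arrays(a, 3)
print(groups)  # two arrays: the first eight values, then [20]
```

One thing to keep in mind: basic slices of a NumPy array are views, so writing into one of the returned groups also changes `a`. Call `.copy()` on each slice if that matters.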

Sample runs -

In [122]: a = np.array([1,2,3,5,7,10,13,16,20])

In [123]: splitme(a,1)
Out[123]: [[1, 2, 3], [5], [7], [10], [13], [16], [20]]

In [124]: splitme(a,2)
Out[124]: [[1, 2, 3, 5, 7], [10], [13], [16], [20]]

In [125]: splitme(a,3)
Out[125]: [[1, 2, 3, 5, 7, 10, 13, 16], [20]]

Runtime test -

In [180]: a = np.sort(np.random.randint(1,10000*2,(10000)))

In [181]: s = pd.Series(a)

In [182]: d = 3

In [183]: %timeit pandas_way(s,d) #@cᴏʟᴅsᴘᴇᴇᴅ's soln
10 loops, best of 3: 55.1 ms per loop

In [184]: %timeit np.split(a,np.flatnonzero(np.diff(a)>d)+1)
     ...: %timeit splitme(a,d)
     ...: %timeit splitme_zip(a,d)
1000 loops, best of 3: 1.47 ms per loop
100 loops, best of 3: 2.87 ms per loop
1000 loops, best of 3: 516 µs per loop

In [185]: a
Out[185]: array([    2,     2,     2, ..., 19992, 19996, 19999])
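
Outside IPython, a comparison along these lines could be reproduced with the standard `timeit` module. The sketch below first checks that both functions agree on random sorted data and then times them. The seed, the `number=100` repeat count, and the use of `np.random.default_rng` are arbitrary choices for this example, and the timings will differ from the figures above depending on machine and library versions.

```python
import timeit
import numpy as np

def splitme(a, d):
    return list(map(list, np.split(a, np.flatnonzero(np.diff(a) > d) + 1)))

def splitme_zip(a, d):
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    idx = np.flatnonzero(m)
    lst = a.tolist()
    return [lst[i:j] for i, j in zip(idx[:-1], idx[1:])]

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
a = np.sort(rng.integers(1, 10000 * 2, size=10000))
d = 3

# Sanity check: both approaches must produce the same groups.
assert splitme(a, d) == splitme_zip(a, d)

for fn in (splitme, splitme_zip):
    seconds = timeit.timeit(lambda: fn(a, d), number=100)
    print(f"{fn.__name__}: {seconds / 100 * 1e3:.3f} ms per call")
```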

answered Sep 22 '22 02:09

Divakar
