
pandas: how to group consecutive values in a Series whose neighbouring differences are within a certain distance

I have a pandas Series of ints:

import numpy as np
import pandas as pd

a = np.array([1,2,3,5,7,10,13,16,20])
pd.Series(a)

0  1
1  2
2  3
3  5
4  7
5  10
6  13
7  16
8  20

Now I want to cluster the Series into groups such that, within each group, the difference between any two neighbouring values is <= distance. For example, if distance is 1, we get

[1,2,3], [5], [7], [10], [13], [16], [20]

If distance is 2, we get

[1,2,3,5,7], [10], [13], [16], [20]

If distance is 3, we get

[1,2,3,5,7,10,13,16], [20]

How can I do this using pandas/numpy?

asked Nov 08 '17 15:11 by daiyue




2 Answers

This is the pandas way, using groupby. Here s is the Series from the question and n is the maximum allowed distance between neighbours.

n = 1

s

0     1
1     2
2     3
3     5
4     7
5    10
6    13
7    16
8    20
dtype: int64

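# True wherever the gap to the previous value exceeds n, i.e. where a new group starts
m = ~s.diff().fillna(0).le(n)
# cumsum turns the start-of-group flags into group labels for groupby
v = s.groupby(m.cumsum()).apply(lambda x: x.tolist()).tolist()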

v
[[1, 2, 3], [5], [7], [10], [13], [16], [20]]
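
The steps above can be wrapped in a function. This is a minimal sketch, not the answerer's own code: the name pandas_way is an assumption borrowed from the timing session further down, which calls it without showing a definition. It uses s.diff().gt(d) instead of the fillna/le/negation chain, relying on the fact that comparing NaN with > yields False, so the first element never opens a new group.

```python
import pandas as pd


def pandas_way(s, d):
    # Hypothetical helper name; not part of pandas.
    # A new group starts wherever the gap to the previous value exceeds d.
    # The first diff is NaN, and NaN > d is False, so the first value
    # simply falls into group 0.
    starts = s.diff().gt(d)
    labels = starts.cumsum()
    return [group.tolist() for _, group in s.groupby(labels)]


s = pd.Series([1, 2, 3, 5, 7, 10, 13, 16, 20])
print(pandas_way(s, 1))  # [[1, 2, 3], [5], [7], [10], [13], [16], [20]]
print(pandas_way(s, 3))  # [[1, 2, 3, 5, 7, 10, 13, 16], [20]]
```

Iterating over the groupby object avoids apply, which is a matter of taste here; both produce the same grouping.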
answered Sep 22 '22 02:09 by cs95


Here's one approach -

np.split(a,np.flatnonzero(np.diff(a)>d)+1)
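
To see what each piece of that one-liner contributes, it may help to break it into steps. The sketch below uses the array from the question with d = 2; the intermediate variable names (gaps, cut_after, cut_points) are only illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 5, 7, 10, 13, 16, 20])
d = 2

# Gap between each value and its successor.
gaps = np.diff(a)                     # [1 1 2 2 3 3 3 4]
# Positions i where a[i+1] - a[i] is too large to stay in the same group.
cut_after = np.flatnonzero(gaps > d)  # [4 5 6 7]
# np.split expects the index at which each new piece begins, hence the +1.
cut_points = cut_after + 1            # [5 6 7 8]
groups = np.split(a, cut_points)

print([g.tolist() for g in groups])
# [[1, 2, 3, 5, 7], [10], [13], [16], [20]]
```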

As a function to output list of lists -

def splitme(a, d):
    return list(map(list, np.split(a, np.flatnonzero(np.diff(a) > d) + 1)))

For performance, I would suggest computing the start and stop indices of each group, pairing them with zip, and then slicing a plain list, thus avoiding np.split, which might prove to be the bottleneck -

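def splitme_zip(a, d):
    # Mark the start of every group, plus a sentinel one past the end
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    idx = np.flatnonzero(m)
    l = a.tolist()
    # Consecutive entries of idx are the start and stop of each group
    return [l[i:j] for i, j in zip(idx[:-1], idx[1:])]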

If you need the output as a list of arrays, skip the list conversions, i.e. the .tolist() call and the map(list, ...) wrapper.
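
As a rough sketch of that list-of-arrays variant, assuming the same start/stop logic as splitme_zip (the name splitme_arrays is hypothetical and does not appear in the original answer):

```python
import numpy as np


def splitme_arrays(a, d):
    # Hypothetical name. Same boundary mask as splitme_zip, but the slices
    # are taken from the NumPy array itself rather than from a list.
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    idx = np.flatnonzero(m)
    return [a[i:j] for i, j in zip(idx[:-1], idx[1:])]


a = np.array([1, 2, 3, 5, 7, 10, 13, 16, 20])
for group in splitme_arrays(a, 2):
    print(group)
```

Basic slicing returns views, so the resulting arrays share memory with a; copy them if they need to be modified independently.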

Sample runs -

In [122]: a = np.array([1,2,3,5,7,10,13,16,20])

In [123]: splitme(a,1)
Out[123]: [[1, 2, 3], [5], [7], [10], [13], [16], [20]]

In [124]: splitme(a,2)
Out[124]: [[1, 2, 3, 5, 7], [10], [13], [16], [20]]

In [125]: splitme(a,3)
Out[125]: [[1, 2, 3, 5, 7, 10, 13, 16], [20]]

Runtime test -

In [180]: a = np.sort(np.random.randint(1,10000*2,(10000)))

In [181]: s = pd.Series(a)

In [182]: d = 3

In [183]: %timeit pandas_way(s,d) #@cᴏʟᴅsᴘᴇᴇᴅ's soln
10 loops, best of 3: 55.1 ms per loop

In [184]: %timeit np.split(a,np.flatnonzero(np.diff(a)>d)+1)
     ...: %timeit splitme(a,d)
     ...: %timeit splitme_zip(a,d)
1000 loops, best of 3: 1.47 ms per loop
100 loops, best of 3: 2.87 ms per loop
1000 loops, best of 3: 516 µs per loop

In [185]: a
Out[185]: array([    2,     2,     2, ..., 19992, 19996, 19999])
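
Outside IPython, a comparable measurement could be run with the standard timeit module. This is only a sketch: the seed, the repeat counts and the use of np.random.default_rng are assumptions, and the timings will differ from those above depending on machine and library versions. It also checks that the two functions agree before timing them.

```python
import timeit

import numpy as np


def splitme(a, d):
    return list(map(list, np.split(a, np.flatnonzero(np.diff(a) > d) + 1)))


def splitme_zip(a, d):
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    idx = np.flatnonzero(m)
    l = a.tolist()
    return [l[i:j] for i, j in zip(idx[:-1], idx[1:])]


rng = np.random.default_rng(0)  # arbitrary seed, chosen only for repeatability
a = np.sort(rng.integers(1, 10000 * 2, size=10000))
d = 3

# Both implementations should produce the same grouping.
assert splitme(a, d) == splitme_zip(a, d)

for name, fn in [("splitme", splitme), ("splitme_zip", splitme_zip)]:
    # Best of 3 repeats, 20 calls each, reported per call.
    best = min(timeit.repeat(lambda: fn(a, d), number=20, repeat=3)) / 20
    print(f"{name}: {best * 1e3:.3f} ms per call")
```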
answered Sep 22 '22 02:09 by Divakar