I have a pandas Series that is composed of ints:
import numpy as np
import pandas as pd
a = np.array([1,2,3,5,7,10,13,16,20])
pd.Series(a)
0 1
1 2
2 3
3 5
4 7
5 10
6 13
7 16
8 20
Now I want to cluster the series into groups such that, within each group, the difference between any two neighbouring values is <= distance. For example, if the distance is 1, we have
[1,2,3], [5], [7], [10], [13], [16], [20]
If the distance is 2, we have
[1,2,3,5,7], [10], [13], [16], [20]
If the distance is 3, we have
[1,2,3,5,7,10,13,16], [20]
How can I do this using pandas/numpy?
Series.diff() computes the first discrete difference of a Series: each element minus the element `periods` rows before it (by default, the previous row). `periods` is an integer and may be negative. The first element has no predecessor, so its difference is NaN.
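As a small illustration of diff() on the series from the question, the sketch below prints the gap between each value and its predecessor. The variable names d1 and d2 are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([1, 2, 3, 5, 7, 10, 13, 16, 20]))

# Default periods=1: each value minus the previous one; the first entry is NaN.
d1 = s.diff()
print(d1.tolist())  # [nan, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0]

# periods=2: each value minus the value two rows earlier.
d2 = s.diff(periods=2)
print(d2.tolist())  # [nan, nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 7.0]
```

These gaps are exactly what gets compared against the distance threshold when deciding where one group ends and the next begins.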
This is the pandas way, using groupby.
n = 1
s
0 1
1 2
2 3
3 5
4 7
5 10
6 13
7 16
8 20
dtype: int64
m = ~s.diff().fillna(0).le(n)
v = s.groupby(m.cumsum()).apply(lambda x: x.tolist()).tolist()
v
[[1, 2, 3], [5], [7], [10], [13], [16], [20]]
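To see why the cumulative sum works as a grouping key, here is a minimal sketch that spells out the intermediate steps for n = 2. The names gaps, starts_new and labels are illustrative, and the list comprehension over groupby is a variation on the apply call above rather than the exact same code.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([1, 2, 3, 5, 7, 10, 13, 16, 20]))
n = 2

gaps = s.diff().fillna(0)      # distance to the previous value (0 for the first row)
starts_new = ~gaps.le(n)       # True where the gap exceeds n, i.e. a new group starts
labels = starts_new.cumsum()   # running count of breaks = one group id per element

print(labels.tolist())  # [0, 0, 0, 0, 0, 1, 2, 3, 4]

# Collect each group's values as a plain Python list.
groups = [g.tolist() for _, g in s.groupby(labels)]
print(groups)  # [[1, 2, 3, 5, 7], [10], [13], [16], [20]]
```

Every True in the mask bumps the running total by one, so consecutive values that stay within the distance share the same label.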
Here's one approach -
np.split(a,np.flatnonzero(np.diff(a)>d)+1)
As a function to output list of lists -
def splitme(a, d):
    return list(map(list, np.split(a, np.flatnonzero(np.diff(a) > d) + 1)))
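The one-liner is easier to follow when broken into steps. The sketch below does that for d = 1; the intermediate names (gaps, breaks, cut_points) are introduced here for illustration, and .tolist() is used so the groups hold plain Python ints.

```python
import numpy as np

a = np.array([1, 2, 3, 5, 7, 10, 13, 16, 20])
d = 1

gaps = np.diff(a)                   # [1 1 2 2 3 3 3 4]
breaks = np.flatnonzero(gaps > d)   # positions in gaps that exceed d: [2 3 4 5 6 7]
cut_points = breaks + 1             # indices in a where a new group starts: [3 4 5 6 7 8]

# np.split cuts the array just before each index in cut_points.
parts = [p.tolist() for p in np.split(a, cut_points)]
print(parts)  # [[1, 2, 3], [5], [7], [10], [13], [16], [20]]
```

The + 1 is needed because gaps[i] describes the step from a[i] to a[i + 1], so a break at position i means the new group begins at index i + 1.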
For performance, I would suggest using zip to get the start and stop indices and then slicing, thus avoiding np.split, which might prove to be the bottleneck -
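def splitme_zip(a, d):
    # True at index 0, at every position where the gap to the previous value exceeds d, and at the end
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    # Group boundaries: each consecutive pair of indices is one (start, stop) slice
    idx = np.flatnonzero(m)
    l = a.tolist()
    return [l[i:j] for i, j in zip(idx[:-1], idx[1:])]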
If you need the output as a list of arrays, skip the list conversion (.tolist / map(list, ...)).
Sample runs -
In [122]: a = np.array([1,2,3,5,7,10,13,16,20])
In [123]: splitme(a,1)
Out[123]: [[1, 2, 3], [5], [7], [10], [13], [16], [20]]
In [124]: splitme(a,2)
Out[124]: [[1, 2, 3, 5, 7], [10], [13], [16], [20]]
In [125]: splitme(a,3)
Out[125]: [[1, 2, 3, 5, 7, 10, 13, 16], [20]]
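The sample runs above exercise splitme only. As a quick sanity check, the sketch below runs splitme_zip on the same input for each distance; the results dictionary is just a convenience for this example.

```python
import numpy as np


def splitme_zip(a, d):
    # Boundary mask: start of array, every gap larger than d, end of array.
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    idx = np.flatnonzero(m)
    l = a.tolist()
    return [l[i:j] for i, j in zip(idx[:-1], idx[1:])]


a = np.array([1, 2, 3, 5, 7, 10, 13, 16, 20])

results = {d: splitme_zip(a, d) for d in (1, 2, 3)}
print(results[1])  # [[1, 2, 3], [5], [7], [10], [13], [16], [20]]
print(results[2])  # [[1, 2, 3, 5, 7], [10], [13], [16], [20]]
print(results[3])  # [[1, 2, 3, 5, 7, 10, 13, 16], [20]]
```

For d = 2 the boundary indices come out as [0, 5, 6, 7, 8, 9], which zip pairs up into the slices 0:5, 5:6, 6:7, 7:8 and 8:9.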
Runtime test -
In [180]: a = np.sort(np.random.randint(1,10000*2,(10000)))
In [181]: s = pd.Series(a)
In [182]: d = 3
In [183]: %timeit pandas_way(s,d) #@cᴏʟᴅsᴘᴇᴇᴅ's soln
10 loops, best of 3: 55.1 ms per loop
In [184]: %timeit np.split(a,np.flatnonzero(np.diff(a)>d)+1)
...: %timeit splitme(a,d)
...: %timeit splitme_zip(a,d)
1000 loops, best of 3: 1.47 ms per loop
100 loops, best of 3: 2.87 ms per loop
1000 loops, best of 3: 516 µs per loop
In [185]: a
Out[185]: array([ 2, 2, 2, ..., 19992, 19996, 19999])
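To rerun a comparison like this outside IPython, here is a minimal sketch using the standard timeit module. The definition of pandas_way is not shown in the timing run, so the version below is an assumption: a hypothetical wrapper around the groupby approach. The benchmark helper and its parameters are likewise illustrative, and timings will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd


def pandas_way(s, d):
    # Assumed wrapper for the groupby approach; only the name comes from the timing run.
    m = ~s.diff().fillna(0).le(d)
    return [g.tolist() for _, g in s.groupby(m.cumsum())]


def splitme(a, d):
    # np.split variant, converting each piece to a list of Python ints.
    return [p.tolist() for p in np.split(a, np.flatnonzero(np.diff(a) > d) + 1)]


def splitme_zip(a, d):
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    idx = np.flatnonzero(m)
    l = a.tolist()
    return [l[i:j] for i, j in zip(idx[:-1], idx[1:])]


def benchmark(size=10000, d=3, number=5, seed=0):
    # Sorted random integers, similar in shape to the data timed above.
    rng = np.random.default_rng(seed)
    a = np.sort(rng.integers(1, 2 * size, size))
    s = pd.Series(a)
    candidates = {
        "pandas_way": lambda: pandas_way(s, d),
        "splitme": lambda: splitme(a, d),
        "splitme_zip": lambda: splitme_zip(a, d),
    }
    # Average seconds per call for each candidate.
    return {name: timeit.timeit(fn, number=number) / number
            for name, fn in candidates.items()}


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

All three functions treat repeated values (gap 0) as belonging to the same group, so they should return identical groupings on this kind of input.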