
Count distinct strings in rolling window using pandas

Tags: python, pandas

How do I count the number of unique strings in a rolling window of a pandas dataframe?

import numpy as np
import pandas as pd

a = pd.DataFrame(['a','b','a','a','b','c','d','e','e','e','e'])
a.rolling(3).apply(lambda x: len(np.unique(x)))

The output is the same as the original DataFrame:

    0
0   a
1   b
2   a
3   a
4   b
5   c
6   d
7   e
8   e
9   e
10  e

Expected:

    0
0   1
1   2
2   2
3   2
4   2
5   3
6   3
7   3
8   2
9   1
10  1
user4446237 asked Sep 14 '17 13:09



1 Answer

I think you first need to convert the values to numeric, using either factorize or rank. The min_periods parameter is also needed to avoid NaN at the start of the column:

a[0] = pd.factorize(a[0])[0]
print(a)
    0
0   0
1   1
2   0
3   0
4   1
5   2
6   3
7   4
8   4
9   4
10  4

b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print(b)
    0
0   1
1   2
2   2
3   2
4   2
5   3
6   3
7   3
8   2
9   1
10  1
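
For reference, the factorize approach above can be put together as one self-contained sketch. The helper name rolling_nunique is hypothetical, and raw=True is an assumption added here so that apply receives a plain NumPy array instead of a Series; it is not part of the original answer.

```python
import numpy as np
import pandas as pd


def rolling_nunique(values, window):
    """Count distinct values in each trailing window (hypothetical helper)."""
    s = pd.Series(values)
    # factorize maps each distinct string to an integer code,
    # which gives rolling() numeric data to work with
    codes = pd.Series(pd.factorize(s)[0], index=s.index)
    # min_periods=1 lets the first, shorter windows produce a count instead of NaN
    counts = codes.rolling(window, min_periods=1).apply(
        lambda x: len(np.unique(x)), raw=True  # raw=True is an assumption, see above
    )
    return counts.astype(int)


result = rolling_nunique(['a','b','a','a','b','c','d','e','e','e','e'], 3)
print(result.tolist())
```

Note that factorize assigns the code -1 to missing values, so in this sketch NaN entries would be counted as one more distinct value.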

Or:

a[0] = a[0].rank(method='dense')
print(a)
      0
0   1.0
1   2.0
2   1.0
3   1.0
4   2.0
5   3.0
6   4.0
7   5.0
8   5.0
9   5.0
10  5.0

b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print(b)
    0
0   1
1   2
2   2
3   2
4   2
5   3
6   3
7   3
8   2
9   1
10  1
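
If converting to numeric is not desirable, a different technique is to slice each window by hand and call nunique on it. This is only a sketch and not part of the answer above: it swaps rolling().apply() for a plain list comprehension, the variable names are illustrative, and it will likely be slower on long columns.

```python
import pandas as pd

s = pd.Series(['a','b','a','a','b','c','d','e','e','e','e'])
window = 3  # illustrative window size, matching the question

# For each position i, take up to `window` trailing rows ending at i
# and count the distinct strings directly, with no numeric conversion.
counts = pd.Series(
    [s.iloc[max(0, i - window + 1): i + 1].nunique() for i in range(len(s))],
    index=s.index,
)
print(counts.tolist())
```

The max(0, ...) bound plays the same role as min_periods=1: the first windows are simply shorter.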
jezrael answered Oct 01 '22 13:10