
Pandas: fill one column with count of # of obs between occurrences in a 2nd column

Tags: python, pandas

Say I have the following DataFrame, which has a 1/0 entry depending on whether something did or did not happen in a given month.

import numpy as np
import pandas as pd

Y = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
X = pd.date_range(start="2010", freq="MS", periods=len(Y))  # month-start dates

df = pd.DataFrame({'R': Y}, index=X)



            R
2010-01-01  0
2010-02-01  0
2010-03-01  1
2010-04-01  1
2010-05-01  0
2010-06-01  0
2010-07-01  0
2010-08-01  0
2010-09-01  1
2010-10-01  1
2010-11-01  1

What I want is to create a second column that gives the number of months until the next occurrence of a 1.

That is, I need:

            R  F
2010-01-01  0  2
2010-02-01  0  1
2010-03-01  1  0
2010-04-01  1  0
2010-05-01  0  4
2010-06-01  0  3
2010-07-01  0  2
2010-08-01  0  1
2010-09-01  1  0
2010-10-01  1  0
2010-11-01  1  0

What I've tried: I haven't gotten far, but I'm able to fill the first stretch, up to the first 1:

A = list(df.index)
T = df[df['R']==1]

a = df.index[0]
b = T.index[0]
c = A.index(b) - A.index(a)

df.loc[a:b, 'F'] = np.linspace(c,0,c+1)

            R    F
2010-01-01  0  2.0
2010-02-01  0  1.0
2010-03-01  1  0.0
2010-04-01  1  NaN
2010-05-01  0  NaN
2010-06-01  0  NaN
2010-07-01  0  NaN
2010-08-01  0  NaN
2010-09-01  1  NaN
2010-10-01  1  NaN
2010-11-01  1  NaN

EDIT: It probably would have been better to provide an original example that spans multiple years:

Y = [0,0,1,1,0,0,0,0,1,1,1,0,0,1,1,1,0,1,1,1]
X = pd.date_range(start = "2010", freq = "MS", periods = len(Y))

df = pd.DataFrame({'R': Y},index = X)

asked Aug 09 '19 13:08 by measure_theory



4 Answers

Here is my way

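s = df.R.cumsum()  # group key: increases at every 1, so each run of 0s falls into one group
df.loc[df.R == 0, 'F'] = s.groupby(s).cumcount(ascending=False) + 1  # count down within each group
df['F'] = df['F'].fillna(0)  # rows with R == 1 were left unset; avoids chained inplace fillna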

df
Out[12]: 
            R    F
2010-01-01  0  2.0
2010-02-01  0  1.0
2010-03-01  1  0.0
2010-04-01  1  0.0
2010-05-01  0  4.0
2010-06-01  0  3.0
2010-07-01  0  2.0
2010-08-01  0  1.0
2010-09-01  1  0.0
2010-10-01  1  0.0
2010-11-01  1  0.0
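
To see why the cumsum works as a group key, here is a minimal sketch that prints the key next to R. The names `key` and `countdown` are only illustrative, and using `where` instead of `loc` plus `fillna` is a variation of mine that keeps the column as integers; it is not part of the snippet above.

```python
import pandas as pd

Y = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
X = pd.date_range(start="2010", freq="MS", periods=len(Y))
df = pd.DataFrame({"R": Y}, index=X)

# The running sum only changes on rows where R == 1, so every run of 0s
# shares one key value (together with the 1 that precedes it, if any).
key = df["R"].cumsum()

# Count down inside each key group, then zero out the rows where R == 1.
countdown = key.groupby(key).cumcount(ascending=False) + 1
df["F"] = countdown.where(df["R"].eq(0), 0)

print(df.assign(key=key))
```

Note that zeros after the last 1 would still get a countdown to the end of the frame, so this assumes the data ends with a 1, as in the example.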

answered Oct 05 '22 13:10 by BENY


Create a series containing your dates, mask this series when your R series is not equal to 1, bfill, and subtract!


u = df.index.to_series()

ii = u.where(df.R.eq(1)).bfill()

12 * (ii.dt.year - u.dt.year) + (ii.dt.month - u.dt.month)

2010-01-01    2
2010-02-01    1
2010-03-01    0
2010-04-01    0
2010-05-01    4
2010-06-01    3
2010-07-01    2
2010-08-01    1
2010-09-01    0
2010-10-01    0
2010-11-01    0
Freq: MS, dtype: int64
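
The year term only matters once a gap crosses a year boundary. As a sketch, the same steps applied to the longer example from the question's EDIT (the variable name `nxt` is mine) give 2 for December 2010, whose next 1 falls in February 2011:

```python
import pandas as pd

Y = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
X = pd.date_range(start="2010", freq="MS", periods=len(Y))
df = pd.DataFrame({"R": Y}, index=X)

u = df.index.to_series()
# Keep only the dates where R == 1, then back-fill so that every row
# carries the date of the next 1 at or after it.
nxt = u.where(df["R"].eq(1)).bfill()

# Month difference between that date and the row's own date.
df["F"] = 12 * (nxt.dt.year - u.dt.year) + (nxt.dt.month - u.dt.month)

print(df.loc["2010-11":"2011-03"])
```

Rows after the last 1 would have no date to back-fill from and would come out as NaN; the example data ends with a 1, so that case does not arise here.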

answered Oct 05 '22 14:10 by user3483203


Here is a way that worked for me. It's not as elegant as @user3483203's, but it does the job.

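df = df.reset_index()  # integer row labels 0..n-1, so j + 1 means "the next row"
df['F'] = 0
for i in df.index:
    j = i
    # walk forward from row i until a 1 is found (or the frame ends)
    while j < len(df) and df.loc[j, 'R'] == 0:
        df.loc[i, 'F'] += 1
        j += 1
df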
Out[39]: 
        index  R  F
0  2010-01-01  0  2
1  2010-02-01  0  1
2  2010-03-01  1  0
3  2010-04-01  1  0
4  2010-05-01  0  4
5  2010-06-01  0  3
6  2010-07-01  0  2
7  2010-08-01  0  1
8  2010-09-01  1  0
9  2010-10-01  1  0
10 2010-11-01  1  0
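
The nested loop rescans each gap for every row in it. A single backward pass gives the same counts while visiting each row once. This is only a sketch: the function name `rows_until_next_one` is hypothetical, and like the loop above it counts rows rather than calendar months, and counts to the end of the data if no later 1 exists.

```python
import pandas as pd

def rows_until_next_one(values):
    """Return, for each position, the number of rows until the next 1."""
    out = [0] * len(values)
    gap = 0
    # Walk from the last row to the first, resetting the gap at every 1.
    for pos in range(len(values) - 1, -1, -1):
        gap = 0 if values[pos] == 1 else gap + 1
        out[pos] = gap
    return out

Y = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
X = pd.date_range(start="2010", freq="MS", periods=len(Y))
df = pd.DataFrame({"R": Y}, index=X)

df["F"] = rows_until_next_one(df["R"].tolist())
print(df)
```

Because it works on a plain list, the original date index can stay in place and no `reset_index` is needed.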

answered Oct 05 '22 12:10 by nidabdella


My take

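s = (df.R.diff().ne(0) | df.R.eq(1)).cumsum()  # one group per run of 0s, one per individual 1
# count down within each group, then zero the rows where R == 1;
# masking on R rather than on group length keeps a lone 0 from being reported as 0
s.groupby(s).transform(lambda g: np.arange(len(g), 0, -1)).where(df.R.eq(0), 0)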

2010-01-01    2
2010-02-01    1
2010-03-01    0
2010-04-01    0
2010-05-01    4
2010-06-01    3
2010-07-01    2
2010-08-01    1
2010-09-01    0
2010-10-01    0
2010-11-01    0
Freq: MS, Name: R, dtype: int64

answered Oct 05 '22 13:10 by rafaelc