
How to count the number of state changes in pandas?

I have the dataframe below, whose columns contain only 0s and 1s, and I want to count the number of 0->1 and 1->0 transitions in every column. In this dataframe, column 'a' has 6 state changes, 'b' has 3, and 'c' has 2. I don't know how to code this in pandas.

number a b c
1      0 0 0
2      1 0 1
3      0 1 1
4      1 1 1
5      0 0 0
6      1 0 0
7      0 1 0

I'm new to pandas because until recently I only used R, but now I must use Python and pandas, so I'm in a bit of a difficult situation. Can anybody help? Thanks in advance!

jerry han asked Nov 29 '22 21:11


2 Answers

Use rolling and compare each value, then count all True values by sum:

df[['a','b','c']].rolling(2).apply(lambda x: x[0] != x[-1], raw=True).sum().astype(int)
a    6
b    3
c    2
dtype: int64
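
For a self-contained check, the sample frame from the question can be rebuilt by hand and run through the same rolling comparison. This is only a minimal sketch: the names `count_changes` and `out` are introduced here for illustration, and the explicit `float(...)` cast is an added precaution rather than part of the one-liner above.

```python
import pandas as pd

# Sample data from the question, typed in by hand.
df = pd.DataFrame(
    {
        "a": [0, 1, 0, 1, 0, 1, 0],
        "b": [0, 0, 1, 1, 0, 0, 1],
        "c": [0, 1, 1, 1, 0, 0, 0],
    },
    index=pd.RangeIndex(1, 8, name="number"),
)

def count_changes(frame):
    # Hypothetical helper name. Each complete window of 2 rows yields 1.0
    # when the two values differ and 0.0 otherwise. The first window is
    # incomplete, so it is NaN, and sum() skips NaN by default.
    changed = frame.rolling(2).apply(lambda x: float(x[0] != x[-1]), raw=True)
    return changed.sum().astype(int)

out = count_changes(df)
print(out)
```

With `raw=True` each window arrives as a NumPy array, so `x[0]` and `x[-1]` are positional lookups rather than index-label lookups.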
jezrael answered Dec 04 '22 03:12


Bitwise XOR (^)

Use the NumPy array df.values and compare each row with the next one using ^.
This is meant to be a fast solution.

XOR is true exactly when one, and only one, of its two operands is true, as shown in this truth table:

A B XOR
T T   F
T F   T
F T   T
F F   F

The same table in 0/1 form:

import numpy as np
import pandas as pd

a = np.array([1, 1, 0, 0])
b = np.array([1, 0, 1, 0])

pd.DataFrame(dict(A=a, B=b, XOR=a ^ b))

   A  B  XOR
0  1  1    0
1  1  0    1
2  0  1    1
3  0  0    0

Demo

v = df.values

pd.Series((v[1:] ^ v[:-1]).sum(0), df.columns)

a    6
b    3
c    2
dtype: int64
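
The demo assumes that `df` holds only the integer columns `a`, `b`, and `c`. A minimal end-to-end sketch under that assumption, rebuilding the question's data by hand (the names `changes` and `xor_counts` are illustrative):

```python
import numpy as np
import pandas as pd

# Sample data from the question, typed in by hand.
df = pd.DataFrame(
    {
        "a": [0, 1, 0, 1, 0, 1, 0],
        "b": [0, 0, 1, 1, 0, 0, 1],
        "c": [0, 1, 1, 1, 0, 0, 0],
    }
)

v = df.to_numpy()

# v[1:] holds rows 2..n and v[:-1] holds rows 1..n-1, so the XOR is 1
# exactly where a value differs from the one in the previous row.
changes = v[1:] ^ v[:-1]

# Summing down each column counts the state changes per column.
xor_counts = pd.Series(changes.sum(axis=0), index=df.columns)
print(xor_counts)
```

Note that `^` is defined for integer and boolean arrays only; if the columns were stored as floats they would need a cast such as `astype(int)` first.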

Time Testing


Functions

def pir_xor(df):
  v = df.values
  return pd.Series((v[1:] ^ v[:-1]).sum(0), df.columns)

def pir_diff1(df):
  v = df.values
  return pd.Series(np.abs(np.diff(v, axis=0)).sum(0), df.columns)

def pir_diff2(df):
  v = df.values
  return pd.Series(np.diff(v.astype(bool), axis=0).sum(0), df.columns)

def cold(df):
  return df.ne(df.shift(-1)).sum(0) - 1

def jez(df):
  return df.rolling(2).apply(lambda x: x[0] != x[-1], raw=True).sum().astype(int)

def naga(df):
  return df.diff().abs().sum().astype(int)
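
Before timing, it seems worth confirming that the candidates agree with each other. The sketch below redefines three of the functions above so that it runs on its own, then compares them on random 0/1 data. The names `check_df`, `results`, `baseline`, and `all_agree`, as well as the seed, are illustrative choices and not part of the original benchmark.

```python
import numpy as np
import pandas as pd

def pir_xor(df):
    v = df.values
    return pd.Series((v[1:] ^ v[:-1]).sum(0), df.columns)

def pir_diff1(df):
    v = df.values
    return pd.Series(np.abs(np.diff(v, axis=0)).sum(0), df.columns)

def naga(df):
    return df.diff().abs().sum().astype(int)

# Random 0/1 integer data with the same three columns as the question.
rng = np.random.default_rng(0)
check_df = pd.DataFrame(rng.integers(0, 2, size=(1000, 3)), columns=list("abc"))

# Run every candidate and compare each result against the XOR version.
results = {f.__name__: f(check_df) for f in (pir_xor, pir_diff1, naga)}
baseline = results["pir_xor"]
all_agree = all(
    (r.to_numpy() == baseline.to_numpy()).all() for r in results.values()
)
print(all_agree)
```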

Testing

import numpy as np
import pandas as pd
from timeit import timeit

np.random.seed([3, 1415])

idx = [10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000, 300000]
col = 'pir_xor pir_diff1 pir_diff2 cold jez naga'.split()
res = pd.DataFrame(np.nan, idx, col)

for i in idx:
  df = pd.DataFrame(np.random.choice([0, 1], size=(i, 3)), columns=[*'abc'])
  for j in col:
    stmt = f"{j}(df)"
    setp = f"from __main__ import {j}, df"
    res.at[i, j] = timeit(stmt, setp, number=100)

Results

res.div(res.min(1), 0)

        pir_xor  pir_diff1  pir_diff2       cold         jez      naga
10      1.06203   1.119769   1.000000  21.217555   16.768532  6.601518
30      1.00000   1.075406   1.115743  23.229013   18.844025  7.212369
100     1.00000   1.134082   1.174973  22.673289   21.478068  7.519898
300     1.00000   1.119153   1.166782  21.725495   26.293712  7.215490
1000    1.00000   1.106267   1.167786  18.394462   37.925160  6.284253
3000    1.00000   1.118554   1.342192  16.053097   64.953310  5.594610
10000   1.00000   1.163557   1.511631  12.008129  106.466636  4.503359
30000   1.00000   1.249835   1.431120   7.826387  118.380227  3.621455
100000  1.00000   1.275272   1.528840   6.690012  131.912349  3.150155
300000  1.00000   1.279373   1.528238   6.301007  140.667427  3.190868

res.plot(loglog=True, figsize=(15, 8))


piRSquared answered Dec 04 '22 03:12