
Take the difference of all elements of a series with the previous ones in python pandas

I have a dataframe with sorted values labeled by ids, and I want to take the difference between the value of the first element of each id and the value of the last element of every previous id. The code below does what I want:

import pandas as pd

a = 'a'; b = 'b'; c = 'c'
df = pd.DataFrame(data=[*zip([a, a, a, b, b, c, a], [1, 2, 3, 5, 6, 7, 8])],
                  columns=['id', 'value'])
print(df)
# # take the last value for a particular id
# last_value_for_id = df.loc[df.id.shift(-1) != df.id, :]
# print(last_value_for_id)
current_id = ''
prev_values = {}  # last value seen so far for each id
diffs = {}
for t in df.itertuples(index=False):
    prev_values[t.id] = t.value
    if current_id == t.id:
        continue  # still inside the same run of ids
    current_id = t.id
    # first row of a new run: diff against the last value of every other id
    for k, v in prev_values.items():
        if k == current_id:
            continue
        diffs[(k, current_id)] = t.value - v
print(pd.DataFrame(data=diffs.values(), columns=['diff'], index=diffs.keys()))

prints:

  id  value
0  a      1
1  a      2
2  a      3
3  b      5
4  b      6
5  c      7
6  a      8
     diff
a b     2
  c     4
b c     1
  a     2
c a     1

However, I want to do this in a vectorized manner. I have found a way to get the last element of each run of consecutive ids:

# take the last value for a particular id
last_value_for_id = df.loc[df.id.shift(-1) != df.id, :]
print(last_value_for_id)

which gives me:

  id  value
2  a      3
4  b      6
5  c      7
6  a      8

but I can't find a way to use this to take the diffs in a vectorized manner.

Mr_and_Mrs_D asked May 14 '19 13:05




1 Answer

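Depending on how many ids you have, this works with a few thousand of them (it builds a num_ids x num_ids matrix). It assumes each id forms a single contiguous block, as in the original data without the trailing a row: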

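import numpy as np

# enumerate ids; be careful, the order must match the sorted index of f
ids = [a, b, c]
num_ids = len(ids)

# compute first and last value per id
f = df.groupby('id').value.agg(['first', 'last'])

# lower triangle mask (diagonal included)
mask = np.array([[i >= j for j in range(num_ids)] for i in range(num_ids)])

# compute diff of first and last, then mask
# (convert to NumPy first: recent pandas rejects 2-D indexing on a Series)
first = f['first'].to_numpy()
last = f['last'].to_numpy()
diff = np.where(mask, None, first[None, :] - last[:, None])
diff = pd.DataFrame(diff,
                    index=ids,
                    columns=ids)
# stack into a Series indexed by (previous id, current id)
diff.stack()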

output:

a  b    2
   c    4
b  c    1
dtype: object
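
To see what the broadcasting step computes, here is a minimal NumPy-only sketch. It assumes the original data, so the first values are 1, 5, 7 and the last values are 3, 6, 7 for a, b, c. The variable names are illustrative, and `np.triu_indices` is swapped in for the boolean mask used above.

```python
import numpy as np

# first and last value per id (a, b, c) for the original data; assumed inputs
first = np.array([1, 5, 7])
last = np.array([3, 6, 7])

# outer difference: entry [i, j] is first[j] - last[i]
outer = first[None, :] - last[:, None]

# keep only pairs where id j comes strictly after id i
# (upper triangle without the diagonal)
rows, cols = np.triu_indices(len(first), k=1)
pair_diffs = outer[rows, cols]
print(pair_diffs)  # [2 4 1]
```

The three values correspond to the pairs (a, b), (a, c) and (b, c), in that order.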

Edit for updated data:

For the updated data, the approach is similar once we build the f table from blocks of consecutive ids:

# create blocks of consecutive id
blocks = df['id'].ne(df['id'].shift()).cumsum()

# groupby
groups = df.groupby(blocks)

# create first and last values
df['fv'] = groups.value.transform('first')
df['lv'] = groups.value.transform('last')

# the above f and ids 
# note the column name change
f = df[['id','fv', 'lv']].drop_duplicates()
ids = f['id'].values
num_ids = len(ids)

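Rerunning the mask and diff steps from the first snippet, with f['fv'] and f['lv'] in place of f['first'] and f['last'], gives: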

a   b     2
    c     4
    a     5
b   c     1
    a     2
c   a     1
dtype: object

If you want to go further and drop the index (a,a), well, I'm so lazy :D.
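
For completeness, the whole thing for the updated data, including dropping the (a, a) pair, could look like the sketch below. This is one possible way to finish it, not necessarily what the answer had in mind: the names `block`, `prev`, `current` and `result` are made up for illustration, and it uses `np.triu_indices` rather than the boolean mask. It pairs every block with every later block, so if an id recurs many times it returns more rows than the loop in the question, which overwrites earlier pairs.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': list('aaabbca'), 'value': [1, 2, 3, 5, 6, 7, 8]})

# label runs of consecutive equal ids: 1, 1, 1, 2, 2, 3, 4
blocks = df['id'].ne(df['id'].shift()).cumsum().rename('block')

# one row per block: its id, its first value and its last value
f = df.groupby(blocks).agg(id=('id', 'first'),
                           fv=('value', 'first'),
                           lv=('value', 'last'))

ids = f['id'].to_numpy()
fv = f['fv'].to_numpy()
lv = f['lv'].to_numpy()

# index pairs (i, j) where block j comes strictly after block i
i, j = np.triu_indices(len(f), k=1)

result = pd.DataFrame({'prev': ids[i],
                       'current': ids[j],
                       'diff': fv[j] - lv[i]})

# drop pairs that compare an id with itself, e.g. (a, a)
result = result[result['prev'] != result['current']]
result = result.set_index(['prev', 'current'])
print(result)
```

On the sample data this leaves the five pairs from the question: (a, b), (a, c), (b, c), (b, a) and (c, a).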

Quang Hoang answered Oct 19 '22 07:10