I have a dataframe that looks like this:
>>> df
                         value
time
2020-01-31 07:59:43.232     -6
2020-01-31 07:59:43.232     -2
2020-01-31 07:59:43.232     -1
2020-01-31 07:59:43.264      1
2020-01-31 07:59:43.389      0
2020-01-31 07:59:43.466      1
2020-01-31 07:59:43.466      5
2020-01-31 07:59:43.466     -1
2020-01-31 07:59:43.467     -1
2020-01-31 07:59:43.467     -1
2020-01-31 07:59:43.467      5
2020-01-31 07:59:43.467      1
I want to add three more columns that show, for each consecutive group whose absolute values sum to a certain number, how much of that sum is negative and how much is positive. For example, if the number is 8:
                         value  neg  pos  total
time
2020-01-31 07:59:43.232     -6
2020-01-31 07:59:43.232     -2    8    0      8
2020-01-31 07:59:43.232     -1
2020-01-31 07:59:43.264      1
2020-01-31 07:59:43.389      0
2020-01-31 07:59:43.466      1
2020-01-31 07:59:43.466      5    1    7      8
2020-01-31 07:59:43.466     -1
2020-01-31 07:59:43.467     -1
2020-01-31 07:59:43.467     -1
2020-01-31 07:59:43.467      5    3    5      8
2020-01-31 07:59:43.467      1
If the number is 5:
                         value  neg  pos  total
time
2020-01-31 07:59:43.232     -6    5    0      5  # take just 5 out of -6; the rest (-1) is used for the next calculation
2020-01-31 07:59:43.232     -2
2020-01-31 07:59:43.232     -1
2020-01-31 07:59:43.264      1    4    1      5  # sum(abs(list(-1, -2, -1, 1)))
2020-01-31 07:59:43.389      0
2020-01-31 07:59:43.466      1
2020-01-31 07:59:43.466      5    0    5      5  # 1 + 5 -> take just 5 (1, 4) out of them; the rest (1) is used for the next calculation
2020-01-31 07:59:43.466     -1
2020-01-31 07:59:43.467     -1
2020-01-31 07:59:43.467     -1
2020-01-31 07:59:43.467      5    3    4      5  # 1, -1, -1, -1, 5 -> take just 5 (1, -1, -1, -1, 1) out of them; the rest (4) is used for the next calculation
2020-01-31 07:59:43.467      1    0    5      5  # 4, 1
I've been doing the calculation with a loop and several conditional statements and it's pretty slow. I wonder if there are more efficient and faster ways to do this.
The code below shows how I've done it when the number is 300 (GROUP_SIZE):
GROUP_SIZE = 300
for DATE in lst_requiredDates:
    df = dic_dtf[DATE]
    lst_groups = []
    lst_group = [0, 0, 0, 0]
    for index, row in df.iterrows():
        date = index
        value = row['value']
        abs_value = abs(value)
        if (lst_group[3] + abs_value) < GROUP_SIZE:
            if value < 0:
                lst_group[0] = date
                lst_group[1] += abs_value
                lst_group[3] += abs_value
            else:
                lst_group[0] = date
                lst_group[2] += abs_value
                lst_group[3] += abs_value
        elif (lst_group[3] + abs_value) == GROUP_SIZE:
            if value < 0:
                lst_group[0] = date
                lst_group[1] += abs_value
                lst_group[3] += abs_value
            else:
                lst_group[0] = date
                lst_group[2] += abs_value
                lst_group[3] += abs_value
            lst_groups.append(lst_group)
            lst_group = [0, 0, 0, 0]
        elif (lst_group[3] + abs_value) > GROUP_SIZE:
            int_left = (lst_group[3] + abs_value) - GROUP_SIZE
            if value < 0:
                lst_group[0] = date
                lst_group[1] += (abs_value - int_left)
                lst_group[3] += (abs_value - int_left)
                lst_groups.append(lst_group)
                lst_group = [0, 0, 0, 0]
                lst_group[0] = date
                lst_group[1] += int_left
                lst_group[3] += int_left
            else:
                lst_group[0] = date
                lst_group[2] += (abs_value - int_left)
                lst_group[3] += (abs_value - int_left)
                lst_groups.append(lst_group)
                lst_group = [0, 0, 0, 0]
                lst_group[0] = date
                lst_group[2] += int_left
                lst_group[3] += int_left
Here's a solution using operations on the whole dataframe at once, which should be pretty efficient.

I'm using cumsum() twice: once on the absolute values, to find where we reach the group size, and once on the values themselves, which we can later use to find neg and pos.

One use of shift() takes care of finding the group boundaries, which gives us the rows we want to update and all the data we need to calculate the sums.

Handling remainders is not too hard: look at the cumulative sum of absolute values and give it the sign of the last value. Here the cumulative sum of signed values comes in handy. After adjusting for the remainder, we can take the difference from the previous row to find what pos - neg is for the current row. Knowing that neg + pos adds up to the group size, we can easily calculate the two separate values and add them to the dataframe.

The code follows, with comments to explain all this further:
import pandas as pd
import numpy as np

def get_pos_neg_ratio(series, group_size):
    df = series.rename('value').to_frame()
    # Calculate the cumulative sum and the cumulative sum
    # of absolute values. The latter will be used to break
    # the series into groups.
    df_aux = df.copy()
    df_aux['cumsum'] = df['value'].cumsum()
    df_aux['cumabs'] = abs(df['value']).cumsum()
    df_aux['group'] = df_aux['cumabs'] // group_size
    # Break it into groups, by locating the boundaries.
    df_aux = df_aux[
        df_aux['group'] != df_aux['group'].shift(fill_value=0)
    ].copy()
    # Calculate the remainder on each boundary row. Give
    # it the sign of the value in that row, since that
    # value is the one that got it over the group size.
    df_aux['remainder'] = (
        (df_aux['cumabs'] % group_size) *
        np.sign(df_aux['value'])
    )
    # Adjust the sums by the remainder.
    df_aux['adjsum'] = df_aux['cumsum'] - df_aux['remainder']
    # Finally, find the individual sums by subtracting the
    # adjusted cumulative sum of the previous group. This
    # is pos - neg for each group.
    df_aux['grpsum'] = (
        df_aux['adjsum'] -
        df_aux['adjsum'].shift(fill_value=0)
    )
    # Now we can calculate positives and negatives. We
    # know that their absolute values sum up to group_size
    # and that their signed sum is `grpsum`, so a little
    # bit of algebra will get us to:
    df['neg'] = (group_size - df_aux['grpsum']) // 2
    df['pos'] = (group_size + df_aux['grpsum']) // 2
    df['total'] = df['neg'] + df['pos']
    return df
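To make the steps concrete, here's a standalone trace of the intermediate columns on the sample data with a group size of 5 (the same logic as the function above, just unrolled):

```python
import numpy as np
import pandas as pd

values = pd.Series([-6, -2, -1, 1, 0, 1, 5, -1, -1, -1, 5, 1])
group_size = 5

cumsum = values.cumsum()          # running signed sum
cumabs = values.abs().cumsum()    # running absolute sum
group = cumabs // group_size      # group number per row

# Boundary rows: where the group number changes.
boundary = group != group.shift(fill_value=0)
remainder = (cumabs % group_size) * np.sign(values)
adjsum = (cumsum - remainder)[boundary]  # cumsum adjusted back to each boundary
grpsum = adjsum - adjsum.shift(fill_value=0)

print(grpsum.tolist())                        # pos - neg per group: [-5, -3, 5, -1, 5]
print(((group_size - grpsum) // 2).tolist())  # neg: [5, 4, 0, 3, 0]
print(((group_size + grpsum) // 2).tolist())  # pos: [0, 1, 5, 2, 5]
```

The boundary rows land at positions 0, 3, 6, 10 and 11, matching the filled rows in the sample run further below.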
Pass the function a Series (i.e., a column) and a group size; it will return a DataFrame with the column (under the name value) and the calculated neg, pos and total.

One more note: this function requires an index without duplicates! Otherwise the final assignment will fail. I suggest you use reset_index() first to convert time into a regular column, then possibly set_index() it back later.
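To illustrate why duplicate index labels are a problem, here's a toy sketch (the labels are made up): assigning a column built from the boundary rows aligns on index labels, and duplicated labels make that alignment ambiguous.

```python
import pandas as pd

df = pd.DataFrame({'value': [-6, -2, -1]}, index=['t0', 't0', 't1'])
boundary = df.iloc[[0, 1]].copy()  # two boundary rows sharing label 't0'
boundary['neg'] = [5, 3]

try:
    df['neg'] = boundary['neg']    # ambiguous alignment on duplicate labels
    failed = False
except ValueError:
    failed = True
print('assignment failed:', failed)

# reset_index() gives a unique RangeIndex, so alignment is unambiguous:
df2 = df.reset_index().rename(columns={'index': 'time'})
boundary2 = df2.iloc[[0, 1]].copy()
boundary2['neg'] = [5, 3]
df2['neg'] = boundary2['neg']      # rows 0 and 1 set; row 2 becomes NaN
```

Afterwards, set_index('time') restores the original index if you need it.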
This code will break if a single value takes us past two group boundaries at once; for the sample data, that happens for group_size ≤ 4. It might be possible to fix that case (we can detect the skipped group at the boundaries), but it's unclear how it should be handled: should we insert a new row, with value NaN and a repeated index, for the additional group(s)? Since you didn't mention this case in the examples you gave, and your sample code uses a large group size of 300, I imagine this is most likely not something you're terribly concerned about, and the current approach is suitable enough.
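Detecting the breaking case is straightforward, though. Here's a small sketch (the helper name is my own) that checks whether any single value crosses more than one group boundary:

```python
import pandas as pd

def crosses_multiple_boundaries(values, group_size):
    # Group number after each row; a jump of 2 or more between
    # consecutive rows means one value spanned an entire extra group,
    # which the vectorized solution cannot represent.
    group = values.abs().cumsum() // group_size
    jumps = group.diff().fillna(group.iloc[0])  # first jump is from group 0
    return bool((jumps > 1).any())

values = pd.Series([-6, -2, -1, 1, 0, 1, 5, -1, -1, -1, 5, 1])
print(crosses_multiple_boundaries(values, 5))  # False: safe
print(crosses_multiple_boundaries(values, 4))  # True: the 5 spans two groups of 4
```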
Another point to consider is that we're not preserving the leftover sum of the last, incomplete group; we can't tell how much was left over in a way that would let us continue the calculation if we were streaming data or concatenating DataFrames. Again, since your example with a group size of 8 doesn't mention the remainder of 1 from the last row, I believe this isn't a concern for you either.
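If you ever did need it, the leftover magnitude of the incomplete final group is easy to recover (a sketch; note that the positive/negative split within that leftover is lost):

```python
import pandas as pd

values = pd.Series([-6, -2, -1, 1, 0, 1, 5, -1, -1, -1, 5, 1])

# Absolute value that never completed a group; with group_size = 8
# this is the remainder of 1 from the last row.
leftover = values.abs().sum() % 8
print(leftover)  # 1
```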
Sample run on your data (after resetting index) with group size of 5:
>>> df = df.reset_index()
>>> print(get_pos_neg_ratio(df['value'], 5))
    value  neg  pos  total
0      -6  5.0  0.0    5.0
1      -2  NaN  NaN    NaN
2      -1  NaN  NaN    NaN
3       1  4.0  1.0    5.0
4       0  NaN  NaN    NaN
5       1  NaN  NaN    NaN
6       5  0.0  5.0    5.0
7      -1  NaN  NaN    NaN
8      -1  NaN  NaN    NaN
9      -1  NaN  NaN    NaN
10      5  3.0  2.0    5.0
11      1  0.0  5.0    5.0
(In the question, you listed pos on row 10 as 4, but it should really be 2.)
Group size of 8:
>>> print(get_pos_neg_ratio(df['value'], 8))
    value  neg  pos  total
0      -6  NaN  NaN    NaN
1      -2  8.0  0.0    8.0
2      -1  NaN  NaN    NaN
3       1  NaN  NaN    NaN
4       0  NaN  NaN    NaN
5       1  NaN  NaN    NaN
6       5  1.0  7.0    8.0
7      -1  NaN  NaN    NaN
8      -1  NaN  NaN    NaN
9      -1  NaN  NaN    NaN
10      5  3.0  5.0    8.0
11      1  NaN  NaN    NaN
Group size of 7:
>>> print(get_pos_neg_ratio(df['value'], 7))
    value  neg  pos  total
0      -6  NaN  NaN    NaN
1      -2  7.0  0.0    7.0
2      -1  NaN  NaN    NaN
3       1  NaN  NaN    NaN
4       0  NaN  NaN    NaN
5       1  NaN  NaN    NaN
6       5  2.0  5.0    7.0
7      -1  NaN  NaN    NaN
8      -1  NaN  NaN    NaN
9      -1  NaN  NaN    NaN
10      5  3.0  4.0    7.0
11      1  NaN  NaN    NaN