
How to make a dataframe that shows the ratio of different types of values?

Tags:

python

pandas

I have a dataframe that looks like this:

>>> df
                        value
time
2020-01-31 07:59:43.232    -6
2020-01-31 07:59:43.232    -2
2020-01-31 07:59:43.232    -1
2020-01-31 07:59:43.264     1
2020-01-31 07:59:43.389     0
2020-01-31 07:59:43.466     1
2020-01-31 07:59:43.466     5
2020-01-31 07:59:43.466    -1
2020-01-31 07:59:43.467    -1
2020-01-31 07:59:43.467    -1
2020-01-31 07:59:43.467     5
2020-01-31 07:59:43.467     1

I want to add 3 more columns that show the split of negative and positive values within groups whose absolute values sum to a certain number. For example, if the number is 8:

                        value    neg     pos    total
time
2020-01-31 07:59:43.232    -6
2020-01-31 07:59:43.232    -2      8       0        8
2020-01-31 07:59:43.232    -1
2020-01-31 07:59:43.264     1
2020-01-31 07:59:43.389     0
2020-01-31 07:59:43.466     1
2020-01-31 07:59:43.466     5      1       7        8
2020-01-31 07:59:43.466    -1
2020-01-31 07:59:43.467    -1
2020-01-31 07:59:43.467    -1
2020-01-31 07:59:43.467     5      3       5        8
2020-01-31 07:59:43.467     1

If the number is 5:

                        value    neg     pos    total
time
2020-01-31 07:59:43.232    -6      5       0        5    # take just 5 out of -6 and the rest(-1) is used for the next calculation
2020-01-31 07:59:43.232    -2      
2020-01-31 07:59:43.232    -1
2020-01-31 07:59:43.264     1      4       1        5    # sum(abs(list(-1, -2, -1, 1)))
2020-01-31 07:59:43.389     0
2020-01-31 07:59:43.466     1
2020-01-31 07:59:43.466     5      0       5        5    # 1 + 5 -> take just 5(1, 4) out of them and the rest(1) is used for the next calculation
2020-01-31 07:59:43.466    -1
2020-01-31 07:59:43.467    -1
2020-01-31 07:59:43.467    -1
2020-01-31 07:59:43.467     5      3       4        5    # 1, -1, -1, -1, 5 -> take just 5(1, -1, -1, -1, 1) out of them and the rest(4) is used for the next calculation
2020-01-31 07:59:43.467     1      0       5        5    # 4, 1

I've been doing the calculation with a loop and several conditional statements and it's pretty slow. I wonder if there are more efficient and faster ways to do this.

The code below shows how I've done it when the number is 300 (GROUP_SIZE):

GROUP_SIZE = 300

for DATE in lst_requiredDates:
    df = dic_dtf[DATE]

    lst_groups = []
    # [last date, neg sum, pos sum, total]
    lst_group = [0, 0, 0, 0]

    for index, row in df.iterrows():
        date = index
        value = row['value']
        abs_value = abs(value)
        # Negatives accumulate in slot 1, positives in slot 2.
        col = 1 if value < 0 else 2

        if lst_group[3] + abs_value <= GROUP_SIZE:
            # The whole value fits into the current group.
            lst_group[0] = date
            lst_group[col] += abs_value
            lst_group[3] += abs_value

            if lst_group[3] == GROUP_SIZE:
                lst_groups.append(lst_group)
                lst_group = [0, 0, 0, 0]
        else:
            # Fill up the current group and carry the rest over
            # into a fresh group.
            int_left = (lst_group[3] + abs_value) - GROUP_SIZE
            lst_group[0] = date
            lst_group[col] += abs_value - int_left
            lst_group[3] += abs_value - int_left

            lst_groups.append(lst_group)
            lst_group = [date, 0, 0, 0]
            lst_group[col] += int_left
            lst_group[3] += int_left
Asked by maynull on Oct 15 '22.

1 Answer

Here's a solution using operations on the whole dataframe at once, which should be pretty efficient.

I'm using cumsum() twice: once on the absolute values, to find when we reach the group size, and once on the values themselves, which we can later use to find neg and pos.
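On the first few sample values, the two cumulative sums look like this:

```python
import pandas as pd

# First four values from the sample data.
s = pd.Series([-6, -2, -1, 1])

print(s.cumsum().tolist())        # [-6, -8, -9, -8]
print(s.abs().cumsum().tolist())  # [6, 8, 9, 10]
```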

A single use of shift() takes care of finding the group boundaries, which gives us the rows we want to update and all the data we need to calculate the sums.
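With a group size of 8 on those same first values, the boundary detection works out to:

```python
import pandas as pd

s = pd.Series([-6, -2, -1, 1])
group = s.abs().cumsum() // 8
print(group.tolist())  # [0, 1, 1, 1]

# The row where the group number changes is a boundary row.
boundary = group != group.shift(fill_value=0)
print(boundary.tolist())  # [False, True, False, False]
```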

Handling remainders is not too hard: take the cumulative sum of absolute values modulo the group size, and give it the sign of the last value.
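For example, on the first boundary row of the group-size-5 sample (the -6 row, where the absolute cumulative sum is 6):

```python
import numpy as np

cumabs, value, group_size = 6, -6, 5
remainder = (cumabs % group_size) * np.sign(value)
print(remainder)  # -1: a leftover of 1, carried over with a negative sign
```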

Here the cumulative sum of signed values comes in handy. After adjusting for the remainder, we can take the difference from the previous row to find the value of pos - neg for the current group.

Knowing they add up to the group size, we can easily calculate the two separate values and add them to the dataframe.
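For instance, the second group in the group-size-8 example contains -1, 1, 0, 1, 5, so the net sum is 7 - 1 = 6, and:

```python
group_size, grpsum = 8, 6  # net sum (pos - neg) of the second group

neg = (group_size - grpsum) // 2
pos = (group_size + grpsum) // 2
print(neg, pos)  # 1 7, matching that row in the expected output
```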

Code follows, has comments to further explain this all:

import pandas as pd
import numpy as np

def get_pos_neg_ratio(series, group_size):
    df = series.rename('value').to_frame()

    # Calculate the cumulative sum and the cumulative sum
    # of absolute values. The latter will be used to break
    # the series into groups.
    df_aux = df.copy()
    df_aux['cumsum'] = df['value'].cumsum()
    df_aux['cumabs'] = abs(df['value']).cumsum()
    df_aux['group'] = df_aux['cumabs'] // group_size

    # Break it into groups, by locating the boundaries.
    df_aux = df_aux[
        df_aux['group'] != df_aux['group'].shift(fill_value=0)
    ].copy()

    # Calculate the remainder on each boundary row. Give
    # it the sign of the value in that row, since that
    # value is the one that got it over the group size.
    df_aux['remainder'] = (
        (df_aux['cumabs'] % group_size) *
        np.sign(df_aux['value'])
    )

    # Adjust the sums by the remainder.
    df_aux['adjsum'] = df_aux['cumsum'] - df_aux['remainder']

    # Finally, find the individual sums by subtracting
    # the adjusted cumulative sum of the previous group.
    # This gives the net sum (pos minus neg) for this
    # group.
    df_aux['grpsum'] = (
        df_aux['adjsum'] -
        df_aux['adjsum'].shift(fill_value=0)
    )

    # Now we can calculate positives and negatives. We
    # know that neg + pos = group_size and that
    # pos - neg = `grpsum`, so a little bit of algebra
    # will get us to:
    df['neg'] = (group_size - df_aux['grpsum']) // 2
    df['pos'] = (group_size + df_aux['grpsum']) // 2
    df['total'] = df['neg'] + df['pos']
    return df

Pass the function a Series (i.e., a column) and a group size, and it will return a DataFrame with that column (under the name value) plus the calculated neg, pos and total columns.

One more note: this function requires an index without duplicates, otherwise the final assignment will fail. I suggest you use reset_index() first to turn time into a regular column, then possibly set_index() back to it afterwards.
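To see why duplicates break the final assignment: the boundary rows are assigned back by index label, and pandas cannot reindex from a source whose labels are duplicated. A minimal illustration, with made-up boundary values just for the demo:

```python
import pandas as pd

# Two boundary rows share the same timestamp, as in the sample data
# (both 07:59:43.467 rows are boundaries when group_size is 5).
idx = pd.to_datetime([
    '2020-01-31 07:59:43.466',
    '2020-01-31 07:59:43.467',
    '2020-01-31 07:59:43.467',
])
df = pd.DataFrame({'value': [5, 5, 1]}, index=idx)

# Results computed on the two duplicated boundary rows.
boundary = pd.Series([3.0, 0.0], index=idx[[1, 2]])

try:
    df['neg'] = boundary  # alignment by a duplicated label is ambiguous
except ValueError as e:
    print('assignment failed:', e)

# After reset_index() the RangeIndex is unique and assignment works.
df2 = df.reset_index()
df2['neg'] = pd.Series([3.0, 0.0], index=[1, 2])
```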

This code will break if a single value gets us past two group boundaries at once. For the sample data, that means it breaks for group_size ≤ 4. It might be possible to fix it for that case (we can detect skipping a group at the boundaries), but it's unclear how those cases should be handled: should we insert a new row, with value NaN and a repeated index, for the additional group(s)?
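One way to at least detect that case (a sketch, with illustrative names not taken from the function above): a jump of more than one in the group counter between consecutive rows means a single value skipped an entire group.

```python
import pandas as pd

values = pd.Series([-6, 1, 2])  # -6 alone spans two groups when group_size <= 4
group_size = 3

group = values.abs().cumsum() // group_size
# Treat the implicit group before the first row as group 0.
jumps = group.diff().fillna(group.iloc[0])
skipped = jumps.gt(1)
print(skipped.any())  # True: the -6 row crossed more than one boundary
```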

Since you didn't mention this case in your examples, and your sample code uses a large group size of 300, I imagine this is most likely not something you're terribly concerned about, and the current approach is suitable enough.

Another point to consider is that we're not preserving the final sum of the last incomplete group: we can't tell how much was left over in a way that would let us continue the calculation if we were streaming data or concatenating DataFrames.

Again, since your example with a group size of 8 doesn't mention the remainder of 1 from the last row, I believe this is also not really a concern for you.
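If you ever did need it, the leftover of the last incomplete group can at least be recovered after the fact (a hypothetical helper, not part of the function above):

```python
import pandas as pd

def leftover(series, group_size):
    # Whatever part of the total absolute sum doesn't fill a
    # complete group is carried by the last, incomplete group.
    return series.abs().sum() % group_size

s = pd.Series([-6, -2, -1, 1, 0, 1, 5, -1, -1, -1, 5, 1])
print(leftover(s, 8))  # 1, the unused remainder from the last row
print(leftover(s, 5))  # 0, all groups complete
```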

Sample run on your data (after resetting index) with group size of 5:

>>> df = df.reset_index()
>>> print(get_pos_neg_ratio(df['value'], 5))
    value  neg  pos  total
0      -6  5.0  0.0    5.0
1      -2  NaN  NaN    NaN
2      -1  NaN  NaN    NaN
3       1  4.0  1.0    5.0
4       0  NaN  NaN    NaN
5       1  NaN  NaN    NaN
6       5  0.0  5.0    5.0
7      -1  NaN  NaN    NaN
8      -1  NaN  NaN    NaN
9      -1  NaN  NaN    NaN
10      5  3.0  2.0    5.0
11      1  0.0  5.0    5.0

(In the question, you listed pos on row 10 to be 4, but it should really be 2.)

Group size of 8:

>>> print(get_pos_neg_ratio(df['value'], 8)) 
    value  neg  pos  total
0      -6  NaN  NaN    NaN
1      -2  8.0  0.0    8.0
2      -1  NaN  NaN    NaN
3       1  NaN  NaN    NaN
4       0  NaN  NaN    NaN
5       1  NaN  NaN    NaN
6       5  1.0  7.0    8.0
7      -1  NaN  NaN    NaN
8      -1  NaN  NaN    NaN
9      -1  NaN  NaN    NaN
10      5  3.0  5.0    8.0
11      1  NaN  NaN    NaN

Group size of 7:

>>> print(get_pos_neg_ratio(df['value'], 7))
    value  neg  pos  total
0      -6  NaN  NaN    NaN
1      -2  7.0  0.0    7.0
2      -1  NaN  NaN    NaN
3       1  NaN  NaN    NaN
4       0  NaN  NaN    NaN
5       1  NaN  NaN    NaN
6       5  2.0  5.0    7.0
7      -1  NaN  NaN    NaN
8      -1  NaN  NaN    NaN
9      -1  NaN  NaN    NaN
10      5  3.0  4.0    7.0
11      1  NaN  NaN    NaN
Answered by filbranden on Oct 18 '22.