I have the following working code. I need to add a percentage column to monitor changes. I dont know much on how to do it in pandas. I need ideas on what part needs to be modified.
import pandas as pd
dl = []
with open('sampledata.txt') as f:
    for line in f:
        parts = line.split()
        # Cleaning data here.. Conversions to int/float etc,
        if not parts[3][:2].startswith('($'):
            parts.insert(3,'0')
        if len(parts) > 5:
            temp = ' '.join(parts[4:])
            parts = parts[:4] + [temp]
        parts[1] = int(parts[1])
        parts[2] = float(parts[2].replace(',', ''))
        parts[3] = float(parts[3].strip('($)'))
        dl.append(parts)
headers = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.DataFrame(dl,columns=headers)
df = df.groupby(['col1','col5']).sum().reset_index()
df = df.sort_values('col2',ascending=False)
df['col4'] =  '($' + df['col4'].astype(str) + ')'
df = df[headers]
print(df)
sampledata.txt #-- Sample Data Source file
alpha   1   54,00.01                    ABC DSW2S
bravo   3   500,000.00                  ACDEF
charlie 1   27,722.29 ($250.45)         DGAS-CAS
delta   2   11 ($10)                    SWSDSASS-CCSSW
echo    5   143,299.00 ($101)           ACS34S1
lima    6   45.00181 ($38.9)            FGF5GGD-DDD
falcon  3   0.1234                      DSS2SFS3
echo    8   145,300 ($125.01)           ACS34S1
charlie 10  252,336,733.383 ($492.06)   DGAS-CAS
romeo   12  980                         ASDS SSSS SDSD
falcon  5   9.19                        DSS2SFS3
Current Output: #-- working result
      col1  col2          col3       col4            col5
4     echo    13  2.885990e+05  ($226.01)         ACS34S1
7    romeo    12  9.800000e+02     ($0.0)  ASDS SSSS SDSD
2  charlie    11  2.523645e+08  ($742.51)        DGAS-CAS
5   falcon     8  9.313400e+00     ($0.0)        DSS2SFS3
6     lima     6  4.500181e+01    ($38.9)     FGF5GGD-DDD
1    bravo     3  5.000000e+05     ($0.0)           ACDEF
3    delta     2  1.100000e+01    ($10.0)  SWSDSASS-CCSSW
0    alpha     1  5.400010e+03     ($0.0)       ABC DSW2S
Improved Output: #-- with Additional Column for %
      col1  col2          col3       col4            col5   col6
4     echo    13  2.885990e+05  ($226.01)         ACS34S1   60%     #-- (5 + 8) = 13
7    romeo    12  9.800000e+02     ($0.0)  ASDS SSSS SDSD   0%
2  charlie    11  2.523645e+08  ($742.51)        DGAS-CAS   900%  #-- (1 + 10) = 11
5   falcon     8  9.313400e+00     ($0.0)        DSS2SFS3   66.67%  #-- (3 + 5) = 8
6     lima     6  4.500181e+01    ($38.9)     FGF5GGD-DDD   0%
1    bravo     3  5.000000e+05     ($0.0)           ACDEF   0%
3    delta     2  1.100000e+01    ($10.0)  SWSDSASS-CCSSW   0%
0    alpha     1  5.400010e+03     ($0.0)       ABC DSW2S   0%
                We can add percentage symbol to the results from pct_change() using style method and specify the format that we would like to have. Note that the row with nan values also have percentage sign and that does not make sense. We can change the values of nan using “na_rep” to format() function's argument.
To calculate a percentage in Python, use the division operator (/) to get the quotient from two numbers and then multiply this quotient by 100 using the multiplication operator (*) to get the percentage. This is a simple equation in mathematics to get the percentage.
pct_change() function calculates the percentage change between the current and a prior element. This function by default calculates the percentage change from the immediately previous row. Note : This function is mostly useful in the time-series data.
Setup:
#sampledata.txt
df = pd.DataFrame(data={'col1': ['alpha', 'bravo', 'charlie', 'delta', 'echo','lima', 'falcon', 'echo', 'charlie', 'romeo', 'falcon'],
                        'col2': [1, 3, 1, 2, 5, 6, 3, 8, 10, 12, 5],
                        'col3': ['54,00.01', '500,000.00', '27,722.29 ($250.45)', '11 ($10)', '143,299.00 ($101)', '45.00181 ($38.9)', '0.1234', '145,300 ($125.01)', '252,336,733.383 ($492.06)', '980', '9.19'],
                        'col4': ['ABC DSW2S', 'ACDEF', 'DGAS-CAS', 'SWSDSASS-CCSSW', 'ACS34S1', 'FGF5GGD-DDD', 'DSS2SFS3', 'ACS34S1', 'DGAS-CAS', 'ASDS SSSS SDSD', 'DSS2SFS3']})
Code:
df['within_brackets'] = df['col3'].str.extract('.*\((.*)\).*') #Extract whats inside the brackets.
df['within_brackets'].replace('\$', '', regex=True, inplace=True)
df['col3'] = df['col3'].str.replace(r"(\s*\(.*\))|,", "", regex=True) #Extract whats outside the brackets
df.rename(columns={'col4': 'col5', 'within_brackets': 'col4'}, inplace=True)
df[['col3', 'col4']] = df[['col3', 'col4']].astype(float)
df = df.groupby(['col1', 'col5']).agg(col2 = pd.NamedAgg(column="col2", aggfunc="sum"),
                                      col3 = pd.NamedAgg(column="col3", aggfunc="sum"),
                                      col4 = pd.NamedAgg(column="col4", aggfunc="sum"),
                                      col6 = pd.NamedAgg(column="col2", aggfunc=pd.Series.pct_change)).reset_index()
df['col6'].fillna(0, inplace=True)
#print df here and you will get to know what output looks like till now.
df['col6'].fillna(0, inplace=True)
df['col6'] = df['col6'].apply(lambda x: f"{str(round(x[-1], 4) * 100)}%" if isinstance(x, np.ndarray) else f"{round(x, 4) * 100}%")
df = df[['col1', 'col2', 'col3', 'col4', 'col5', 'col6']]
df.sort_values(by=['col2'], ascending=False, inplace=True)
print(df)
Output:
      col1  col2          col3    col4            col5    col6
4     echo    13  2.885990e+05  226.01         ACS34S1   60.0%
7    romeo    12  9.800000e+02    0.00  ASDS SSSS SDSD      0%
2  charlie    11  2.523645e+08  742.51        DGAS-CAS  900.0%
5   falcon     8  9.313400e+00    0.00        DSS2SFS3  66.67%
6     lima     6  4.500181e+01   38.90     FGF5GGD-DDD      0%
1    bravo     3  5.000000e+05    0.00           ACDEF      0%
3    delta     2  1.100000e+01   10.00  SWSDSASS-CCSSW      0%
0    alpha     1  5.400010e+03    0.00       ABC DSW2S      0%
Update
Use this with your existing code: (Use the following code after          df = pd.DataFrame(dl,columns=headers) this line in your code.)
df = df.groupby(['col1', 'col5']).agg(col2 = pd.NamedAgg(column="col2", aggfunc="sum"),
                                      col3 = pd.NamedAgg(column="col3", aggfunc="sum"),
                                      col4 = pd.NamedAgg(column="col4", aggfunc="sum"),
                                      col6 = pd.NamedAgg(column="col2", aggfunc=pd.Series.pct_change)).reset_index()
df['col6'].fillna(0, inplace=True)
#print df here and you will get to know what output looks like till now.
df['col6'].fillna(0, inplace=True)
df['col6'] = df['col6'].apply(lambda x: f"{str(round(x[-1], 4) * 100)}%" if isinstance(x, np.ndarray) else f"{round(x, 4) * 100}%")
df['col4'] =  '($' + df['col4'].astype(str) + ')'
df = df[['col1', 'col2', 'col3', 'col4', 'col5', 'col6']]
                        You can add the following lines just after your code: The function compute_percentage() is using the list variable dl.
def compute_percentage(row):
    vl = [float(parts[1]) for parts in dl if parts[0] == row['col1']]
    i = round(100. * (vl[-1]-vl[0])/vl[0] if vl[0] != 0 else 0, 2)
    if float(int(i)) == i:
        i = int(i)
    return str(i) + '%'
df['col6'] = df.apply(compute_percentage, axis=1)
Output:
      col1  col2          col3       col4            col5    col6
4     echo    13  2.885990e+05  ($226.01)         ACS34S1     60%
7    romeo    12  9.800000e+02     ($0.0)  ASDS SSSS SDSD      0%
2  charlie    11  2.523645e+08  ($742.51)        DGAS-CAS    900%
5   falcon     8  9.313400e+00     ($0.0)        DSS2SFS3  66.67%
6     lima     6  4.500181e+01    ($38.9)     FGF5GGD-DDD      0%
1    bravo     3  5.000000e+05     ($0.0)           ACDEF      0%
3    delta     2  1.100000e+01    ($10.0)  SWSDSASS-CCSSW      0%
0    alpha     1  5.400010e+03     ($0.0)       ABC DSW2S      0%
                        If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With