I have the following working code. I need to add a percentage column to monitor changes. I dont know much on how to do it in pandas. I need ideas on what part needs to be modified.
import pandas as pd
dl = []
with open('sampledata.txt') as f:
for line in f:
parts = line.split()
# Cleaning data here.. Conversions to int/float etc,
if not parts[3][:2].startswith('($'):
parts.insert(3,'0')
if len(parts) > 5:
temp = ' '.join(parts[4:])
parts = parts[:4] + [temp]
parts[1] = int(parts[1])
parts[2] = float(parts[2].replace(',', ''))
parts[3] = float(parts[3].strip('($)'))
dl.append(parts)
headers = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.DataFrame(dl,columns=headers)
df = df.groupby(['col1','col5']).sum().reset_index()
df = df.sort_values('col2',ascending=False)
df['col4'] = '($' + df['col4'].astype(str) + ')'
df = df[headers]
print(df)
sampledata.txt #-- Sample Data Source file
alpha 1 54,00.01 ABC DSW2S
bravo 3 500,000.00 ACDEF
charlie 1 27,722.29 ($250.45) DGAS-CAS
delta 2 11 ($10) SWSDSASS-CCSSW
echo 5 143,299.00 ($101) ACS34S1
lima 6 45.00181 ($38.9) FGF5GGD-DDD
falcon 3 0.1234 DSS2SFS3
echo 8 145,300 ($125.01) ACS34S1
charlie 10 252,336,733.383 ($492.06) DGAS-CAS
romeo 12 980 ASDS SSSS SDSD
falcon 5 9.19 DSS2SFS3
Current Output: #-- working result
col1 col2 col3 col4 col5
4 echo 13 2.885990e+05 ($226.01) ACS34S1
7 romeo 12 9.800000e+02 ($0.0) ASDS SSSS SDSD
2 charlie 11 2.523645e+08 ($742.51) DGAS-CAS
5 falcon 8 9.313400e+00 ($0.0) DSS2SFS3
6 lima 6 4.500181e+01 ($38.9) FGF5GGD-DDD
1 bravo 3 5.000000e+05 ($0.0) ACDEF
3 delta 2 1.100000e+01 ($10.0) SWSDSASS-CCSSW
0 alpha 1 5.400010e+03 ($0.0) ABC DSW2S
Improved Output: #-- with Additional Column for %
col1 col2 col3 col4 col5 col6
4 echo 13 2.885990e+05 ($226.01) ACS34S1 60% #-- (5 + 8) = 13
7 romeo 12 9.800000e+02 ($0.0) ASDS SSSS SDSD 0%
2 charlie 11 2.523645e+08 ($742.51) DGAS-CAS 900% #-- (1 + 10) = 11
5 falcon 8 9.313400e+00 ($0.0) DSS2SFS3 66.67% #-- (3 + 5) = 8
6 lima 6 4.500181e+01 ($38.9) FGF5GGD-DDD 0%
1 bravo 3 5.000000e+05 ($0.0) ACDEF 0%
3 delta 2 1.100000e+01 ($10.0) SWSDSASS-CCSSW 0%
0 alpha 1 5.400010e+03 ($0.0) ABC DSW2S 0%
We can add percentage symbol to the results from pct_change() using style method and specify the format that we would like to have. Note that the row with nan values also have percentage sign and that does not make sense. We can change the values of nan using “na_rep” to format() function's argument.
To calculate a percentage in Python, use the division operator (/) to get the quotient from two numbers and then multiply this quotient by 100 using the multiplication operator (*) to get the percentage. This is a simple equation in mathematics to get the percentage.
pct_change() function calculates the percentage change between the current and a prior element. This function by default calculates the percentage change from the immediately previous row. Note : This function is mostly useful in the time-series data.
Setup:
#sampledata.txt
df = pd.DataFrame(data={'col1': ['alpha', 'bravo', 'charlie', 'delta', 'echo','lima', 'falcon', 'echo', 'charlie', 'romeo', 'falcon'],
'col2': [1, 3, 1, 2, 5, 6, 3, 8, 10, 12, 5],
'col3': ['54,00.01', '500,000.00', '27,722.29 ($250.45)', '11 ($10)', '143,299.00 ($101)', '45.00181 ($38.9)', '0.1234', '145,300 ($125.01)', '252,336,733.383 ($492.06)', '980', '9.19'],
'col4': ['ABC DSW2S', 'ACDEF', 'DGAS-CAS', 'SWSDSASS-CCSSW', 'ACS34S1', 'FGF5GGD-DDD', 'DSS2SFS3', 'ACS34S1', 'DGAS-CAS', 'ASDS SSSS SDSD', 'DSS2SFS3']})
Code:
df['within_brackets'] = df['col3'].str.extract('.*\((.*)\).*') #Extract whats inside the brackets.
df['within_brackets'].replace('\$', '', regex=True, inplace=True)
df['col3'] = df['col3'].str.replace(r"(\s*\(.*\))|,", "", regex=True) #Extract whats outside the brackets
df.rename(columns={'col4': 'col5', 'within_brackets': 'col4'}, inplace=True)
df[['col3', 'col4']] = df[['col3', 'col4']].astype(float)
df = df.groupby(['col1', 'col5']).agg(col2 = pd.NamedAgg(column="col2", aggfunc="sum"),
col3 = pd.NamedAgg(column="col3", aggfunc="sum"),
col4 = pd.NamedAgg(column="col4", aggfunc="sum"),
col6 = pd.NamedAgg(column="col2", aggfunc=pd.Series.pct_change)).reset_index()
df['col6'].fillna(0, inplace=True)
#print df here and you will get to know what output looks like till now.
df['col6'].fillna(0, inplace=True)
df['col6'] = df['col6'].apply(lambda x: f"{str(round(x[-1], 4) * 100)}%" if isinstance(x, np.ndarray) else f"{round(x, 4) * 100}%")
df = df[['col1', 'col2', 'col3', 'col4', 'col5', 'col6']]
df.sort_values(by=['col2'], ascending=False, inplace=True)
print(df)
Output:
col1 col2 col3 col4 col5 col6
4 echo 13 2.885990e+05 226.01 ACS34S1 60.0%
7 romeo 12 9.800000e+02 0.00 ASDS SSSS SDSD 0%
2 charlie 11 2.523645e+08 742.51 DGAS-CAS 900.0%
5 falcon 8 9.313400e+00 0.00 DSS2SFS3 66.67%
6 lima 6 4.500181e+01 38.90 FGF5GGD-DDD 0%
1 bravo 3 5.000000e+05 0.00 ACDEF 0%
3 delta 2 1.100000e+01 10.00 SWSDSASS-CCSSW 0%
0 alpha 1 5.400010e+03 0.00 ABC DSW2S 0%
Update
Use this with your existing code: (Use the following code after df = pd.DataFrame(dl,columns=headers)
this line in your code.)
df = df.groupby(['col1', 'col5']).agg(col2 = pd.NamedAgg(column="col2", aggfunc="sum"),
col3 = pd.NamedAgg(column="col3", aggfunc="sum"),
col4 = pd.NamedAgg(column="col4", aggfunc="sum"),
col6 = pd.NamedAgg(column="col2", aggfunc=pd.Series.pct_change)).reset_index()
df['col6'].fillna(0, inplace=True)
#print df here and you will get to know what output looks like till now.
df['col6'].fillna(0, inplace=True)
df['col6'] = df['col6'].apply(lambda x: f"{str(round(x[-1], 4) * 100)}%" if isinstance(x, np.ndarray) else f"{round(x, 4) * 100}%")
df['col4'] = '($' + df['col4'].astype(str) + ')'
df = df[['col1', 'col2', 'col3', 'col4', 'col5', 'col6']]
You can add the following lines just after your code: The function compute_percentage() is using the list variable dl.
def compute_percentage(row):
vl = [float(parts[1]) for parts in dl if parts[0] == row['col1']]
i = round(100. * (vl[-1]-vl[0])/vl[0] if vl[0] != 0 else 0, 2)
if float(int(i)) == i:
i = int(i)
return str(i) + '%'
df['col6'] = df.apply(compute_percentage, axis=1)
Output:
col1 col2 col3 col4 col5 col6
4 echo 13 2.885990e+05 ($226.01) ACS34S1 60%
7 romeo 12 9.800000e+02 ($0.0) ASDS SSSS SDSD 0%
2 charlie 11 2.523645e+08 ($742.51) DGAS-CAS 900%
5 falcon 8 9.313400e+00 ($0.0) DSS2SFS3 66.67%
6 lima 6 4.500181e+01 ($38.9) FGF5GGD-DDD 0%
1 bravo 3 5.000000e+05 ($0.0) ACDEF 0%
3 delta 2 1.100000e+01 ($10.0) SWSDSASS-CCSSW 0%
0 alpha 1 5.400010e+03 ($0.0) ABC DSW2S 0%
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With