I have a dataset that consists of compositional data. Each column represents the percentage (decimal value) of a component in the mixture whole. Each row sums to 1.
If one component in the mixture changes, the rest must change accordingly so as to satisfy the sum constraint.
I am performing a multiple linear regression with this data and it requires some transformations such that the regression coefficients are meaningful and interpretable. The dataset contains zero values and this is an issue for the particular type of transformation I am trying to implement.
Before I can perform this transformation, the recommended action is to replace all zero values with a small number and to adjust the remaining components such that the sum constraint is still satisfied.
You can see in the dummy df below that there are rows with more than one zero value.
import pandas as pd

data = {'X1': [0.21, 0.08, 0.57, 0.03],
        'X2': [0.27, 0.56, 0.0, 0.02],
        'X3': [0.0, 0.14, 0.0, 0.45],
        'X4': [0.13, 0.02, 0.26, 0.37],
        'X5': [0.39, 0.2, 0.17, 0.13]}
df = pd.DataFrame(data)
print(df)
Let's consider just one row. The formula is as follows.
Let the original value of the component being changed be r_i. For a change of delta_i in that component, we get the new value x_i:
x_i = r_i + delta_i
To maintain the relative proportions between the remaining components, let r_j be the original value of each remaining component. The new value x_j of the jth component is then
x_j = r_j - (r_j / (1 - r_i)) * delta_i, for j != i
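As a minimal sketch of the formula above, applied to the first row of the dummy df (which has a single zero in X3): the helper name `shift_component` is hypothetical and only used here for illustration, and plain Python lists are assumed rather than a DataFrame.

```python
def shift_component(row, i, delta):
    """Add delta to component i and rescale the others (hypothetical helper)."""
    r_i = row[i]
    # x_j = r_j - (r_j / (1 - r_i)) * delta for every j != i
    new_row = [r_j - (r_j / (1 - r_i)) * delta for r_j in row]
    # x_i = r_i + delta
    new_row[i] = r_i + delta
    return new_row


row = [0.21, 0.27, 0.0, 0.13, 0.39]  # first row of the dummy df
new_row = shift_component(row, i=2, delta=1e-6)

print(new_row)
print(sum(new_row))  # stays at 1 up to floating-point error
print(row[0] / row[1], new_row[0] / new_row[1])  # ratio X1/X2 before and after
```

Because every remaining component is multiplied by the same factor, 1 - delta / (1 - r_i), the ratios between them do not change.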
I am struggling to write an appropriate loop for this problem that will search the dataset for zero values, then add a small number to the index and column containing the zero value and then go on to adjust the entire row with the formula I described above.
EDIT:
Sorry for the bad representation of the maths formula.
For the first row in the dummy df the application of the formula is straightforward, as there is only one zero in the row:

It's important that the relative proportions between the remaining components remain the same; you can see that here as I update the zero value to a small number.

For the third row in the dummy df things get a bit more complicated. I update the first (X2) zero value by adding a small number. The second (X3) zero value remains zero because the formula multiplies and divides zero. So I make a second update such that X2 and X3 are now both small non-zero values, and this is shown on the third row of the table below.

The relative proportions between the remaining components are likewise maintained when a row contains more than one zero.
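The two-step update described for the third row can be sketched as a loop over the positions that were originally zero, applying the same formula once per zero. This is only an illustration under the assumption that the row is a plain Python list; the variable names are not from the original post.

```python
row = [0.57, 0.0, 0.0, 0.26, 0.17]  # third row of the dummy df
delta = 1e-6

# Remember which positions were zero before any update is made.
zero_positions = [i for i, value in enumerate(row) if value == 0]

for i in zero_positions:
    r_i = row[i]
    # Rescale every component, including zeros already replaced earlier.
    row = [r_j - (r_j / (1 - r_i)) * delta for r_j in row]
    # Then set the component being updated to r_i + delta.
    row[i] = r_i + delta

print(row)
print(sum(row))  # stays at 1 up to floating-point error
```

Note that the zero replaced first (X2) is itself rescaled during the second update, so it ends up at delta * (1 - delta) rather than exactly delta. That is what keeps the row sum at 1.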

I can't think of a loop for the first problem, never mind the second! Also, don't worry about the large numbers created by dividing by a small number in the relative proportions tables, I'll deal with that later.
Here is an answer following the edit:
import pandas as pd

# Show 10 decimal places.
pd.options.display.float_format = '{:.10f}'.format

data = {'X1': [0.21, 0.08, 0.57, 0.03],
        'X2': [0.27, 0.56, 0.0, 0.02],
        'X3': [0.0, 0.14, 0.0, 0.45],
        'X4': [0.13, 0.02, 0.26, 0.37],
        'X5': [0.39, 0.2, 0.17, 0.13]}
df = pd.DataFrame(data)

delta_i = 0.000001
r_i = 0.0  # original value of the component being replaced (a zero)

# Provided formula.
def adjust_proportion(r_j, r_i, delta_i):
    return r_j - ((r_j / (1 - r_i)) * delta_i)

# For row-wise application.
def adjust_row(row, r_i, delta_i):
    row = row.copy()  # work on a copy rather than the row pandas passes in
    # Find the zeros in the row and count them.
    zero_mask = (row == 0)
    zero_count = int(zero_mask.sum())
    # For every zero, adjust the proportions of the "cells" not in the mask.
    for _ in range(zero_count):
        row[~zero_mask] = row[~zero_mask].apply(lambda x: adjust_proportion(x, r_i, delta_i))
    # Increase every masked (zero) cell by delta_i.
    row[zero_mask] += delta_i
    return row

# Apply ROW-WISE using axis=1. apply returns a new DataFrame, so assign it back.
df = df.apply(lambda x: adjust_row(x, r_i, delta_i), axis=1)
print(df)

# Check sums. With k zeros in a row the sum is (1 - delta_i)**k + k * delta_i,
# i.e. 1 up to terms of order delta_i**2.
print(df.sum(axis=1))
This gives the following result:

There are more optimal ways but this should take care of the general logic.
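One possible vectorized variant is sketched below. It is not the loop above rewritten line for line: it scales the non-zero entries of a row once by (1 - k * delta), where k is the number of zeros in that row, instead of applying (1 - delta) k times. The two differ only by terms of order delta squared, and this form makes each row sum to 1 exactly, up to floating-point error. The function name `replace_zeros` is an assumption made for illustration.

```python
import pandas as pd


def replace_zeros(df, delta=1e-6):
    """Replace zeros with delta and rescale the other entries row by row."""
    zero_mask = df.eq(0)
    k = zero_mask.sum(axis=1)  # number of zeros in each row
    # Shrink every entry of a row by the total amount handed to its zeros.
    scaled = df.mul(1 - k * delta, axis=0)
    # Zeros are still zero after scaling; overwrite them with delta.
    return scaled.mask(zero_mask, delta)


data = {'X1': [0.21, 0.08, 0.57, 0.03],
        'X2': [0.27, 0.56, 0.0, 0.02],
        'X3': [0.0, 0.14, 0.0, 0.45],
        'X4': [0.13, 0.02, 0.26, 0.37],
        'X5': [0.39, 0.2, 0.17, 0.13]}
df = pd.DataFrame(data)

result = replace_zeros(df)
print(result)
print(result.sum(axis=1))  # each row sums to 1 up to floating-point error
```

Rows without zeros have k = 0 and are left unchanged.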
You can use:
import pandas as pd

def exclude_zero(e, delta_i):
    """Replace zeros with delta_i, scaling the non-zero values so that their relative proportions are kept and the total still sums to 1."""
    zero_count = e.count(0)
    # Total amount handed to the zeros; it is taken proportionally from the non-zero values.
    extra_amount = zero_count * delta_i
    for index, value in enumerate(e):
        if value == 0:
            e[index] = delta_i
        else:
            e[index] = value * (1 - extra_amount)
    return e

data = {'X1': [0.21, 0.08, 0.57, 0.03],
        'X2': [0.27, 0.56, 0.0, 0.02],
        'X3': [0.0, 0.14, 0.0, 0.45],
        'X4': [0.13, 0.02, 0.26, 0.37],
        'X5': [0.39, 0.2, 0.17, 0.13]}
df = pd.DataFrame(data)

# Process each row as a plain list and write it back.
for index in range(len(df)):
    df.iloc[index] = exclude_zero(df.iloc[index].values.tolist(), 0.000001)

pd.options.display.precision = 8
print(df)
