Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: Subtract row mean from each element in row

I have a dataframe with rows indexed by chemical element type and columns representing different samples. The values are floats representing the degree of presence of the row element in each sample.

I want to compute the mean of each row and subtract it from each value in that specific row to normalize the data, and make a new dataframe of that dataset.

I tried using mean(1), which give me a Series object with the mean for each chemical element, which is good, but then I tried using subtract, which didn't work.

like image 942
jeremy radcliff Avatar asked Sep 28 '14 03:09

jeremy radcliff


People also ask

How do you subtract a value from a row?

Right-click the row number of row 24. Select 'Paste Special...' from the context menu. Select Subtract, then click OK.

How do you subtract values in pandas?

Pandas DataFrame sub() MethodThe sub() method subtracts each value in the DataFrame with a specified value. The specified value must be an object that can be subtracted from the values in the DataFrame.

How do you tell the difference between consecutive rows in pandas?

During data analysis, one might need to compute the difference between two rows for comparison purposes. This can be done using pandas. DataFrame. diff() function.


2 Answers

You could use DataFrame's sub method and specify that the subtraction should happen row-wise (axis=0) as opposed to the default column-wise behaviour:

df.sub(df.mean(axis=1), axis=0)

Here's an example:

>>> df = pd.DataFrame({'a': [1.5, 2.5], 'b': [0.25, 2.75], 'c': [1.25, 0.75]})
>>> df
     a     b     c
0  1.5  0.25  1.25
1  2.5  2.75  0.75

The mean of each row is straightforward to calculate:

>>> df.mean(axis=1)
0    1
1    2
dtype: float64

To de-mean the rows of the DataFrame, just subtract the mean values of rows from df like this:

>>> df.sub(df.mean(axis=1), axis=0)
     a     b     c
0  0.5 -0.75  0.25
1  0.5  0.75 -1.25
like image 102
Alex Riley Avatar answered Oct 06 '22 08:10

Alex Riley


Additionally to @ajcr's excellent answer, you might want to consider rearranging how you store your data.

The way you're doing it at the moment, with different samples in different columns, is the way it would be represented if you were using a spreadsheet, but this might not be the most helpful way to represent your data.

Normally, each column represents a unique piece of information about a single real-world entity. The typical example of this kind of data is a person:

id  name  hair_colour  Age
1   Bob   Brown        25

Really, your different samples are different real-world entities.

I would therefore suggest having a two-level index to describe each single piece of information. This makes manipulating your data in the way you want far more convenient.

Thus:

>>> df = pd.DataFrame([['Sn',1,2,3],['Pb',2,4,6]],
                      columns=['element', 'A', 'B', 'C']).set_index('element')
>>> df.columns.name = 'sample'
>>> df # This is how your DataFrame looks at the moment
sample   A  B  C
element         
Sn       1  2  3
Pb       2  4  6
>>> # Now make those columns into a second level of index
>>> df = df.stack()
>>> df
element  sample
Sn       A         1
         B         2
         C         3
Pb       A         2
         B         4
         C         6

We now have all the delicious functionality of groupby at our disposal:

>>> demean = lambda x: x - x.mean()
>>> df.groupby(level='element').transform(demean)
element  sample
Sn       A        -1
         B         0
         C         1
Pb       A        -2
         B         0
         C         2

When you view your data in this way, you'll find that many, many use cases which used to be multi-column DataFrames are in fact MultiIndexed Series, and you have much more power over how the data is represented and transformed.

like image 40
LondonRob Avatar answered Oct 06 '22 08:10

LondonRob