I have a dataframe with rows indexed by chemical element type and columns representing different samples. The values are floats representing the degree of presence of the row element in each sample. I want to compute the mean of each row and subtract it from each value in that row to normalize the data, and make a new dataframe from the result. I tried using mean(1), which gives me a Series object with the mean for each chemical element, which is good, but then I tried using subtract, which didn't work.


Pandas: Subtract row mean from each element in row

2 Answers

You could use DataFrame's sub method with axis=0, which aligns the Series of row means with the DataFrame's row index. By default, sub aligns a Series with the column labels instead:

df.sub(df.mean(axis=1), axis=0)

Here's an example:

>>> df = pd.DataFrame({'a': [1.5, 2.5], 'b': [0.25, 2.75], 'c': [1.25, 0.75]})
>>> df
     a     b     c
0  1.5  0.25  1.25
1  2.5  2.75  0.75

The mean of each row is straightforward to calculate:

>>> df.mean(axis=1)
0    1.0
1    2.0
dtype: float64

To de-mean the rows of the DataFrame, just subtract the mean values of rows from df like this:

>>> df.sub(df.mean(axis=1), axis=0)
     a     b     c
0  0.5 -0.75  0.25
1  0.5  0.75 -1.25
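
To see why the axis argument matters, here is a small sketch using the same numbers with element labels on the rows. The labels Sn and Pb and the sample names s1, s2, s3 are made up for illustration and are not from the question. Without axis=0, pandas tries to match the index of the row means against the column labels, finds nothing in common, and returns only NaN, which is probably why a plain subtraction appeared not to work.

```python
import pandas as pd

# Hypothetical data: rows are elements, columns are samples.
df = pd.DataFrame(
    {"s1": [1.5, 2.5], "s2": [0.25, 2.75], "s3": [1.25, 0.75]},
    index=["Sn", "Pb"],
)

row_means = df.mean(axis=1)  # Series indexed by element label

# Default alignment: the Series index ('Sn', 'Pb') is matched against the
# column labels ('s1', 's2', 's3'). Nothing matches, so every value is NaN.
wrong = df - row_means

# axis=0 matches the Series index against the row index instead.
demeaned = df.sub(row_means, axis=0)

print(wrong)
print(demeaned)
```

After the subtraction with axis=0, every row of demeaned should average to zero, which is a quick way to check the result.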


answered Oct 06 '22 08:10

Alex Riley

In addition to @ajcr's excellent answer, you might want to consider rearranging how you store your data.

The way you're doing it at the moment, with different samples in different columns, is the way it would be represented if you were using a spreadsheet, but this might not be the most helpful way to represent your data.

Normally, each column represents a unique piece of information about a single real-world entity. The typical example of this kind of data is a person:

id  name  hair_colour  Age
1   Bob   Brown        25

Really, your different samples are different real-world entities.

I would therefore suggest having a two-level index to describe each single piece of information. This makes manipulating your data in the way you want far more convenient.

Thus:

>>> df = pd.DataFrame([['Sn',1,2,3],['Pb',2,4,6]],
...                   columns=['element', 'A', 'B', 'C']).set_index('element')
>>> df.columns.name = 'sample'
>>> df # This is how your DataFrame looks at the moment
sample   A  B  C
element         
Sn       1  2  3
Pb       2  4  6
>>> # Now make those columns into a second level of index
>>> df = df.stack()
>>> df
element  sample
Sn       A         1
         B         2
         C         3
Pb       A         2
         B         4
         C         6

We now have all the delicious functionality of groupby at our disposal:

>>> demean = lambda x: x - x.mean()
>>> df.groupby(level='element').transform(demean)
element  sample
Sn       A        -1
         B         0
         C         1
Pb       A        -2
         B         0
         C         2

When you view your data in this way, you'll find that many, many use cases which used to be multi-column DataFrames are in fact MultiIndexed Series, and you have much more power over how the data is represented and transformed.
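
If you still need the result in the original wide layout afterwards, one possible round trip is to stack, de-mean per element, and then unstack. This is only a sketch that rebuilds the small Sn/Pb frame used in this answer; the variable names stacked, demeaned_long and demeaned_wide are chosen here for illustration.

```python
import pandas as pd

# Rebuild the small example frame: elements as rows, samples as columns.
df = pd.DataFrame(
    [["Sn", 1, 2, 3], ["Pb", 2, 4, 6]],
    columns=["element", "A", "B", "C"],
).set_index("element")
df.columns.name = "sample"

# Series with a two-level (element, sample) index.
stacked = df.stack()

# Subtract each element's mean from that element's own values.
demeaned_long = stacked.groupby(level="element").transform(lambda x: x - x.mean())

# Pivot the 'sample' level back into columns.
demeaned_wide = demeaned_long.unstack("sample")
print(demeaned_wide)
```

The wide result holds the same values as df.sub(df.mean(axis=1), axis=0), so the two approaches can be checked against each other.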

answered Oct 06 '22 08:10

LondonRob


Tags:

python

pandas

dataframe

jeremy radcliff
