Modifying a subset of rows in a pandas dataframe


Use .loc for label-based indexing:

df.loc[df.A==0, 'B'] = np.nan

The expression df.A==0 creates a boolean Series that selects the rows, and 'B' selects the column. You can also use this to transform a subset of a column, e.g.:

df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2

I don't know enough about pandas internals to know exactly why that works, but the basic issue is that indexing into a DataFrame sometimes returns a copy of the result and sometimes returns a view on the original object. According to the documentation here, this behavior depends on the underlying NumPy behavior. I've found that accessing everything in one operation (rather than chaining two indexers, as in [one][two]) is more likely to work for setting.
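The difference between chained indexing and a single .loc call can be seen in a small sketch. The DataFrame and the variable names (chained_changed, loc_changed) are made up for illustration, and column B is created as float so that assigning NaN does not need a dtype change. Warnings are silenced only because the chained form is deliberately the wrong one.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 0.0, 5.0]})

# Chained indexing: df[mask] builds a new object first, so the
# assignment lands on that temporary copy, not on df itself.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about this pattern
    df[df.A == 0]["B"] = np.nan
chained_changed = bool(df["B"].isna().any())

# A single .loc call selects rows and column in one operation,
# so the assignment is applied to df.
df.loc[df.A == 0, "B"] = np.nan
loc_changed = bool(df["B"].isna().any())

print(chained_changed, loc_changed)
```

With a boolean mask the chained form leaves df untouched, which is why the single-step .loc form is the one to use for setting.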

Here is the relevant part of the pandas docs on advanced indexing:

That section explains exactly what you need. It turns out that df.loc (.ix has been deprecated, as many have pointed out below) can be used for slicing and dicing a DataFrame, and it can also be used to set values.

df.loc[selection criteria, columns I want] = value

So Bren's answer is saying: find all the places where df.A == 0, select column B, and set it to np.nan.
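The same pattern also covers transforming the selected values in place, not just overwriting them with a constant. A minimal sketch, assuming a small made-up DataFrame with a float column B and a helper variable named mask:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 0.0, 5.0]})

# Build the row selection once and reuse it on both sides.
mask = df["A"] == 0

# Left side: where to write. Right side: the current values of the
# same cells, halved.
df.loc[mask, "B"] = df.loc[mask, "B"] / 2

print(df)
```

Storing the mask in a variable avoids evaluating the condition twice and keeps the two sides of the assignment in sync.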

Starting from pandas 0.20, .ix is deprecated (it was removed entirely in pandas 1.0). The right way is to use df.loc.

Here is a working example:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN

Explanation:

As explained in the doc here, .loc is primarily label based, but may also be used with a boolean array.

So, what we are doing above is applying df.loc[row_index, column_index] by:

Exploiting the fact that loc can take a boolean array as a mask in row_index, which tells pandas which subset of rows we want to change
Exploiting the fact that loc is also label based, to select the column using the label 'B' in column_index

We can use a logical expression, a comparison, or any other operation that returns a Series of booleans to construct the boolean array. In the above example, we want every row where A is 0, so we can use df.A == 0. As the example below shows, this returns a Series of booleans.

>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df 
   A  B
0  0  2
1  1  0
2  0  5
>>> df.A == 0 
0     True
1    False
2     True
Name: A, dtype: bool
>>>

Then, we use the above array of booleans to select and modify the necessary rows:

>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN

For more information check the advanced indexing documentation here.
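Because any boolean Series works as the row mask, conditions can be combined before being passed to .loc. The sketch below uses made-up data and mask names. Note that the element-wise operators & and | are needed here (not the keywords and / or), and each comparison has to be wrapped in parentheses.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 0, 2], "B": [2.0, 0.0, 5.0, 7.0]})

# Rows where A is 0 AND B is greater than 3.
both = (df["A"] == 0) & (df["B"] > 3)
df.loc[both, "B"] = np.nan

# Rows where A is one of several values.
in_set = df["A"].isin([1, 2])
df.loc[in_set, "B"] = -1.0

print(df)
```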

For a noticeable speed increase (about 4.5x in the timings below), use NumPy's where function.

Setup

Create a two-column DataFrame with 100,000 rows with some zeros.

df = pd.DataFrame(np.random.randint(0,3, (100000,2)), columns=list('ab'))

Fast solution with `numpy.where`

df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)

Timings

%timeit df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
685 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df['a'] == 0, 'b'] = np.nan
3.11 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

NumPy's where is about 4.5x faster here (685 µs vs. 3.11 ms).
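Outside IPython, where %timeit is not available, the comparison can be reproduced roughly with the standard timeit module. This is only a sketch: the helper names with_where and with_loc are invented here, the data is cast to float so that both approaches produce the same dtype, and each call copies the frame first, so the absolute numbers will differ from the ones above and will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = pd.DataFrame(
    rng.integers(0, 3, size=(100_000, 2)), columns=list("ab")
).astype(float)


def with_where(df):
    """Replace b with NaN where a == 0, using numpy.where."""
    out = df.copy()
    out["b"] = np.where(out["a"].to_numpy() == 0, np.nan, out["b"].to_numpy())
    return out


def with_loc(df):
    """Replace b with NaN where a == 0, using a boolean mask in .loc."""
    out = df.copy()
    out.loc[out["a"] == 0, "b"] = np.nan
    return out


# Both approaches should give the same result before they are timed.
same_result = with_where(base).equals(with_loc(base))

t_where = timeit.timeit(lambda: with_where(base), number=20)
t_loc = timeit.timeit(lambda: with_loc(base), number=20)
print(f"same result: {same_result}")
print(f"np.where: {t_where:.4f} s, .loc: {t_loc:.4f} s (20 runs each)")
```

One design difference to keep in mind: np.where rebuilds the whole column, while .loc only writes the selected cells, so .loc may still be the clearer choice when speed is not the bottleneck.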

To replace multiple columns, convert the right-hand side to a NumPy array using .values:

df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']].values / 2
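A self-contained version of the multi-column case might look like the following. The column C and its values are made up for illustration, and .to_numpy() is used in place of .values, since the pandas documentation recommends it as the accessor for the underlying array.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [0, 1, 0], "B": [2.0, 4.0, 6.0], "C": [10.0, 20.0, 30.0]}
)

mask = df["A"] == 0
cols = ["B", "C"]

# Converting the right-hand side to a plain array makes the assignment
# positional: values are written cell by cell, without pandas trying to
# align row and column labels first.
df.loc[mask, cols] = df.loc[mask, cols].to_numpy() / 2

print(df)
```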
