
Modifying a subset of rows in a pandas dataframe

Tags:

python

pandas


Use .loc for label based indexing:

df.loc[df.A==0, 'B'] = np.nan

The df.A==0 expression creates a boolean series that indexes the rows, 'B' selects the column. You can also use this to transform a subset of a column, e.g.:

df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2

I don't know enough about pandas internals to know exactly why that works, but the basic issue is that indexing into a DataFrame sometimes returns a copy of the result and sometimes returns a view on the original object. According to documentation here, this behavior depends on the underlying NumPy behavior. I've found that selecting rows and columns in one operation (df.loc[rows, cols]) rather than chaining two (df[one][two]) is more likely to work for setting.
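The difference between chained indexing and a single .loc call can be seen in a small experiment. This is only a sketch: the DataFrame and the variable names are made up for illustration, and column B is created as float so that NaN can be stored without changing the column's dtype.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical data; B is float so NaN fits without a dtype change.
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 0.0, 5.0]})

# Chained indexing: df[mask] builds a new object first, and the
# assignment lands on that temporary object, not on df.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about this pattern
    df[df.A == 0]["B"] = np.nan
chained_changed_df = bool(df["B"].isna().any())

# Single .loc call: rows and column are resolved in one operation on df.
df.loc[df.A == 0, "B"] = np.nan
loc_changed_df = bool(df["B"].isna().any())

print(chained_changed_df, loc_changed_df)
```

Boolean-mask selection returns a copy, so the chained assignment should leave df untouched, while the .loc assignment modifies it.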


This is covered in the pandas docs on advanced indexing:

That section explains exactly what you need. It turns out df.loc (.ix has been deprecated, as many have pointed out below) can be used for slicing and dicing a dataframe, and it can also be used to set values.

df.loc[selection criteria, columns I want] = value

So Bren's answer is saying 'find me all the places where df.A == 0, select column B and set it to np.nan'
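As a rough illustration of the df.loc[selection criteria, columns I want] = value pattern, the value on the right can be a scalar or a Series that lines up with the selected rows. The column C and the name rows are assumptions added for this sketch; they are not part of the original question.

```python
import pandas as pd

# Hypothetical data with an extra column C for illustration.
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 0.0, 5.0], "C": [10.0, 20.0, 30.0]})

# Selection criteria: a boolean Series, computed once and reused.
rows = df["A"] == 0

# Value as a scalar: every selected row of B gets the same number.
df.loc[rows, "B"] = -1.0

# Value as a Series: pandas aligns it with the selected rows by index label.
df.loc[rows, "C"] = df.loc[rows, "C"] / 2

print(df)
```

Storing the mask in a variable avoids recomputing the comparison on both sides of the assignment.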


Starting from pandas 0.20, .ix is deprecated (it was removed in pandas 1.0). The right way is to use df.loc.

Here is a working example:

>>> import pandas as pd 
>>> import numpy as np 
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN
>>> 

Explanation:

As explained in the doc here, .loc is primarily label based, but may also be used with a boolean array.

So, what we are doing above is applying df.loc[row_index, column_index] by:

  • Exploiting the fact that loc can take a boolean array as a mask that tells pandas which subset of rows we want to change in row_index
  • Exploiting the fact loc is also label based to select the column using the label 'B' in the column_index

We can use a comparison, a logical expression, or any other operation that returns a series of booleans to construct the mask. In the above example we want the rows where column A is 0, so we can use df.A == 0. As the example below shows, this returns a series of booleans.

>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df 
   A  B
0  0  2
1  1  0
2  0  5
>>> df.A == 0 
0     True
1    False
2     True
Name: A, dtype: bool
>>> 

Then, we use the above array of booleans to select and modify the necessary rows:

>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN
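The mask does not have to come from a single comparison. The following sketch, using made-up data, combines conditions with & and builds another mask with isin. Each comparison needs its own parentheses because & binds more tightly than == and >.

```python
import numpy as np
import pandas as pd

# Hypothetical data; B is float so NaN fits without a dtype change.
df = pd.DataFrame({"A": [0, 1, 2, 0], "B": [2.0, 0.0, 5.0, 7.0]})

# Rows where A is 0 AND B is greater than 3 (only the last row here).
both = (df["A"] == 0) & (df["B"] > 3)

# Rows where A is one of several values (the two middle rows here).
either = df["A"].isin([1, 2])

df.loc[both, "B"] = np.nan
df.loc[either, "B"] = df.loc[either, "B"] + 100

print(df)
```

Negating a mask with ~ works the same way, for example df.loc[~either, "B"].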

For more information check the advanced indexing documentation here.


For a speed increase on large frames, use NumPy's where function.

Setup

Create a two-column DataFrame with 100,000 rows with some zeros.

df = pd.DataFrame(np.random.randint(0,3, (100000,2)), columns=list('ab'))

Fast solution with numpy.where

df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)

Timings

%timeit df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
685 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df['a'] == 0, 'b'] = np.nan
3.11 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

NumPy's where is about 4.5x faster in this benchmark.
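To check that the two approaches give the same column, a comparison along these lines can be used. It is a sketch with assumed names (via_where, via_loc) and a smaller random frame. The .loc variant casts b to float first so that NaN can be stored without a dtype change. Timings will vary with the pandas and NumPy versions, so they are not reproduced here.

```python
import numpy as np
import pandas as pd

# Smaller hypothetical frame than in the benchmark, seeded for repeatability.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 3, size=(1000, 2)), columns=list("ab"))

# np.where builds a whole new array: NaN where a == 0, the old b elsewhere.
via_where = df.copy()
via_where["b"] = np.where(
    via_where["a"].to_numpy() == 0, np.nan, via_where["b"].to_numpy()
)

# .loc writes only into the selected rows of the existing column.
via_loc = df.copy()
via_loc["b"] = via_loc["b"].astype(float)
via_loc.loc[via_loc["a"] == 0, "b"] = np.nan

# Series.equals treats NaNs in the same positions as equal.
same = via_where["b"].equals(via_loc["b"])
print(same)
```

Note the design difference: np.where replaces the entire column, whereas .loc updates selected rows in place.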


To replace multiple columns, convert the right-hand side to a NumPy array using .values:

df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']].values / 2
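The reason raw values help is that pandas aligns a DataFrame on the right-hand side by index and column labels before assigning. The sketch below uses made-up data and to_numpy(), which is the currently recommended way to get the underlying array. The second assignment swaps two columns, a case where label alignment would otherwise put each value back in its original column.

```python
import pandas as pd

# Hypothetical data with float columns B and C.
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 4.0, 6.0], "C": [10.0, 20.0, 30.0]})

rows = df["A"] == 0
cols = ["B", "C"]

# Halve B and C in the selected rows; the array is assigned by position.
df.loc[rows, cols] = df.loc[rows, cols].to_numpy() / 2

# Swap B and C in the selected rows; raw values bypass label alignment.
df.loc[rows, ["B", "C"]] = df.loc[rows, ["C", "B"]].to_numpy()

print(df)
```

The array must have the same shape as the selection (selected rows by selected columns) for the positional assignment to work.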