Using Python iloc() function to create a subset of a dataframe. Python iloc() function enables us to create subset choosing specific values from rows and columns based on indexes.
Using iloc() method to update the value of a row With the Python iloc() method, it is possible to change or update the value of a row/column by providing the index values of the same. In this example, we have updated the value of the rows 0, 1, 3 and 6 with respect to the first column i.e. 'Num' to 100.
Although DataFrames are meant to be populated by reading already organized data from external files, many times you will need to somehow manage and modify already existing columns (and rows) in a DF.
Use .loc
for label based indexing:
df.loc[df.A==0, 'B'] = np.nan
The df.A==0
expression creates a boolean series that indexes the rows, 'B'
selects the column. You can also use this to transform a subset of a column, e.g.:
df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2
I don't know enough about pandas internals to know exactly why that works, but the basic issue is that sometimes indexing into a DataFrame returns a copy of the result, and sometimes it returns a view on the original object. According to documentation here, this behavior depends on the underlying numpy behavior. I've found that accessing everything in one operation (rather than [one][two]) is more likely to work for setting.
Here is from pandas docs on advanced indexing:
The section will explain exactly what you need! Turns out df.loc
(as .ix has been deprecated -- as many have pointed out below) can be used for cool slicing/dicing of a dataframe. And. It can also be used to set things.
df.loc[selection criteria, columns I want] = value
So Bren's answer is saying 'find me all the places where df.A == 0
, select column B
and set it to np.nan
'
Starting from pandas 0.20 ix is deprecated. The right way is to use df.loc
here is a working example
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
A B
0 0 NaN
1 1 0
2 0 NaN
>>>
As explained in the doc here, .loc
is primarily label based, but may also be used with a boolean array.
So, what we are doing above is applying df.loc[row_index, column_index]
by:
loc
can take a boolean array as a mask that tells pandas which subset of rows we want to change in row_index
loc
is also label based to select the column using the label 'B'
in the column_index
We can use logical, condition or any operation that returns a series of booleans to construct the array of booleans. In the above example, we want any rows
that contain a 0
, for that we can use df.A == 0
, as you can see in the example below, this returns a series of booleans.
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df
A B
0 0 2
1 1 0
2 0 5
>>> df.A == 0
0 True
1 False
2 True
Name: A, dtype: bool
>>>
Then, we use the above array of booleans to select and modify the necessary rows:
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
A B
0 0 NaN
1 1 0
2 0 NaN
For more information check the advanced indexing documentation here.
For a massive speed increase, use NumPy's where function.
Create a two-column DataFrame with 100,000 rows with some zeros.
df = pd.DataFrame(np.random.randint(0,3, (100000,2)), columns=list('ab'))
numpy.where
df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
%timeit df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
685 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.loc[df['a'] == 0, 'b'] = np.nan
3.11 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy's where
is about 4x faster
To replace multiples columns convert to numpy array using .values
:
df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']].values / 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With