I found the documentation for pandas.DataFrame.pop, but after trying it and examining the source code, it does not seem to do what I want.
If I make a dataframe like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,6))
# Make a few areas have NaN values
df.iloc[1:3,1] = np.nan
df.iloc[5,3] = np.nan
df.iloc[7:9,5] = np.nan
>>> df
0 1 2 3 4 5
0 0.772762 -0.442657 1.245988 1.102018 -0.740836 1.685598
1 -0.387922 NaN -1.215723 -0.106875 0.499110 0.338759
2 0.567631 NaN -0.353032 -0.099011 -0.698925 -1.348966
3 1.320849 1.084405 -1.296177 0.681111 -1.941855 -0.950346
4 -0.026818 -1.933629 -0.693964 1.116673 0.392217 1.280808
5 -1.249192 -0.035932 -1.330916 NaN -0.135720 -0.506016
6 0.406344 1.416579 0.122019 0.648851 -0.305359 -1.253580
7 -0.092440 -0.243593 0.468463 -1.689485 0.667804 NaN
8 -0.110819 -0.627777 -0.302116 0.630068 2.567923 NaN
9 1.884069 -0.393420 -0.950275 0.151182 -1.122764 0.502117
If I want to remove selected rows and assign them to a separate object in one step, I would want a pop behavior, like this:
# rows in column 5 which have NaN values
>>> df[df[5].isnull()].index
Int64Index([7, 8], dtype='int64')
# remove them from the dataframe, assign them to a separate object
>>> nan_rows = df.pop(df[df[5].isnull()].index)
However, this does not appear to be supported. Instead, it seems like I am forced to do this in two separate steps, which seems a bit inelegant.
# get the NaN rows
>>> nan_rows = df[df[5].isnull()]
>>> nan_rows
0 1 2 3 4 5
7 -0.092440 -0.243593 0.468463 -1.689485 0.667804 NaN
8 -0.110819 -0.627777 -0.302116 0.630068 2.567923 NaN
# remove from original df
>>> df = df.drop(nan_rows.index)
>>> df
0 1 2 3 4 5
0 0.772762 -0.442657 1.245988 1.102018 -0.740836 1.685598
1 -0.387922 NaN -1.215723 -0.106875 0.499110 0.338759
2 0.567631 NaN -0.353032 -0.099011 -0.698925 -1.348966
3 1.320849 1.084405 -1.296177 0.681111 -1.941855 -0.950346
4 -0.026818 -1.933629 -0.693964 1.116673 0.392217 1.280808
5 -1.249192 -0.035932 -1.330916 NaN -0.135720 -0.506016
6 0.406344 1.416579 0.122019 0.648851 -0.305359 -1.253580
9 1.884069 -0.393420 -0.950275 0.151182 -1.122764 0.502117
Is there a one-step method built-in? Or is this the way you're 'supposed' to do it?
You can use the pop() function to quickly remove a column from a pandas DataFrame.
To delete a row from a DataFrame, use the drop() method and pass the index label as the argument.
First, slice df (step 1), and then drop those columns (step 2). This is still a two-step process, but you're doing it in one line. For example, define df and then run the command df2 = df[['c', 'd']].
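The same "two steps in one line" idea, applied to the rows from the question, might look like the sketch below. The name `mask` is an illustrative choice and not part of the original post; the approach assumes a boolean mask is enough to describe the rows to split off.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 6))
df.iloc[7:9, 5] = np.nan

# Boolean mask of the rows to split off (True where column 5 is NaN).
mask = df[5].isnull()

# The whole right-hand side is evaluated before anything is assigned,
# so both selections are taken from the full, original frame.
nan_rows, df = df[mask], df[~mask]
```

This rebinds the name df rather than modifying the frame in place, so any other reference to the original object still sees all ten rows.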
pop source code:
def pop(self, item):
    """
    Return item and drop from frame. Raise KeyError if not found.
    """
    result = self[item]
    del self[item]
    try:
        result._reset_cacher()
    except AttributeError:
        pass
    return result
File: c:\python\lib\site-packages\pandas\core\generic.py
del definitely won't work if item is not a simple column name. Pass a simple column name, or do it in two steps.
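If you want pop-like behavior for rows anyway, the two steps can be wrapped in a small function. The sketch below is one possible way to do that; `pop_rows` is a hypothetical helper name, not a pandas API, and it assumes the index labels are unique.

```python
import numpy as np
import pandas as pd


def pop_rows(df, labels):
    """Hypothetical helper: remove rows by index label in place and return them."""
    # Step 1: copy out the rows that are about to be removed.
    popped = df.loc[labels].copy()
    # Step 2: drop the same labels from the original frame, in place.
    df.drop(index=labels, inplace=True)
    return popped


df = pd.DataFrame(np.random.randn(10, 6))
df.iloc[7:9, 5] = np.nan

# Index labels of the rows where column 5 is NaN.
nan_rows = pop_rows(df, df.index[df[5].isnull()])
```

Because the drop is done in place, the caller's df shrinks without being reassigned, which mirrors how DataFrame.pop behaves for columns.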
Since you can pop columns, you can transpose the dataframe and pop its columns, i.e. the rows of the original df. Here is the original df.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns = ['a', 'b', 'c'])
print(df)
a b c
0 4 9 4
1 5 5 8
2 5 7 4
Then transpose it and pop column 0, which is row 0 of the original df.
df_t = df.T
popped_row = df_t.pop(0)
Now you have the popped row
print(popped_row)
a 4
b 9
c 4
Name: 0, dtype: int32
And then you have the original dataframe without the first row.
df = df_t.T
print(df)
a b c
1 5 5 8
2 5 7 4
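One caveat with the transpose approach: it works cleanly here because every column holds integers. When the columns have different dtypes, transposing has to find a common dtype, and the round trip does not restore the original ones. A small sketch with made-up data (the frame `mixed` is purely illustrative):

```python
import pandas as pd

# One integer column and one float column.
mixed = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})

mixed_t = mixed.T               # values are upcast to a common float dtype
popped_row = mixed_t.pop(0)     # row 0 of the original frame
round_trip = mixed_t.T          # back to the original orientation

print(mixed.dtypes)
print(round_trip.dtypes)
```

Column a starts out as an integer column but comes back as float64, so for frames with mixed column types the slice-and-drop route is probably the safer choice.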