I'm fairly new to Python and Pandas and trying to figure out how to do a simple split-join-apply. The problem I am having is that I am getting an blank row at the top of all the dataframes I'm getting back from Pandas' apply function and I'm not sure why. Can anyone explain?
The following is a minimal example that demonstrates the problem, not my actual code:
sorbet = pd.DataFrame({
'flavour': ['orange', 'orange', 'lemon', 'lemon'],
'niceosity' : [4, 5, 7, 8]})
def calc_vals(df, target) :
return pd.Series({'total' : df[target].count(), 'mean' : df[target].mean()})
sorbet_grouped = sorbet.groupby('flavour')
sorbet_vals = sorbet_grouped.apply(calc_vals, target='niceosity')
if I then do print(sorted_vals)
I get this output:
mean total
flavour <--- Why are there spaces here?
lemon 7.5 2
orange 4.5 2
[2 rows x 2 columns]
Compare this with print(sorbet)
:
flavour niceosity <--- Note how column names line up
0 orange 4
1 orange 5
2 lemon 7
3 lemon 8
[4 rows x 2 columns]
What is causing this discrepancy and how can I fix it?
Use df. dropna() to drop rows with NaN from a Pandas dataframe. Call df. dropna(subset, inplace=True) with inplace set to True and subset set to a list of column names to drop all rows that contain NaN under those columns.
Use apply() function when you wanted to update every row in pandas DataFrame by calling a custom function. In order to apply a function to every row, you should use axis=1 param to apply(). By applying a function to each row, we can create a new column by using the values from the row, updating the row e.t.c.
shape() method returns the number of rows and number of columns as a tuple, you can use this to check if pandas DataFrame is empty. DataFrame. shape[0] return number of rows. If you have no rows then it gives you 0 and comparing it with 0 gives you True .
The groupby/apply operation returns is a new DataFrame, with a named index. The name corresponds to the column name by which the original DataFrame was grouped.
The name shows up above the index. If you reset it to None
, then that row disappears:
In [155]: sorbet_vals.index.name = None
In [156]: sorbet_vals
Out[156]:
mean total
lemon 7.5 2
orange 4.5 2
[2 rows x 2 columns]
Note that the name
is useful -- I don't really recommend removing it. The name allows you to refer to that index by name rather than merely by number.
If you wish the index to be a column, use reset_index
:
In [209]: sorbet_vals.reset_index(inplace=True); sorbet_vals
Out[209]:
flavour mean total
0 lemon 7.5 2
1 orange 4.5 2
[2 rows x 3 columns]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With