I'm fairly new to Python and Pandas and trying to figure out how to do a simple split-apply-combine. The problem is that I get a blank row at the top of every dataframe returned by Pandas' apply function, and I'm not sure why. Can anyone explain?
The following is a minimal example that demonstrates the problem, not my actual code:
import pandas as pd

sorbet = pd.DataFrame({
    'flavour': ['orange', 'orange', 'lemon', 'lemon'],
    'niceosity': [4, 5, 7, 8]})

def calc_vals(df, target):
    # Return the count and mean of the target column for one group.
    return pd.Series({'total': df[target].count(), 'mean': df[target].mean()})

sorbet_grouped = sorbet.groupby('flavour')
sorbet_vals = sorbet_grouped.apply(calc_vals, target='niceosity')
If I then do print(sorbet_vals), I get this output:
         mean  total
flavour                 <--- Why are there spaces here?
lemon     7.5      2
orange    4.5      2
[2 rows x 2 columns]
Compare this with print(sorbet):
  flavour  niceosity     <--- Note how column names line up
0  orange          4
1  orange          5
2   lemon          7
3   lemon          8
[4 rows x 2 columns]
What is causing this discrepancy and how can I fix it?
Use df.dropna() to drop rows containing NaN from a Pandas dataframe. Call df.dropna(subset=cols, inplace=True), where cols is a list of column names, to drop every row that contains NaN in any of those columns; with inplace=True the dataframe is modified in place instead of a new one being returned.
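As a rough illustration of the two calls described above, here is a minimal sketch. The data is made up for the example (the NaN values are not part of the original sorbet frame), and it uses the non-inplace form so both results can be compared side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one NaN in each column.
df = pd.DataFrame({
    'flavour': ['orange', np.nan, 'lemon', 'lemon'],
    'niceosity': [4.0, 5.0, np.nan, 8.0],
})

# Drop rows that have NaN in any column; returns a new dataframe.
no_missing = df.dropna()

# Drop only the rows where 'niceosity' is NaN; a missing flavour is kept.
scored = df.dropna(subset=['niceosity'])

print(no_missing)
print(scored)
```

Neither call changes df itself; passing inplace=True would modify it instead.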
Use the apply() method when you want to run a custom function on every row of a Pandas DataFrame. To apply the function to each row, pass axis=1 to apply(). The function receives one row at a time, so you can, for example, create a new column from the row's values or compute an updated row.
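A small sketch of a row-wise apply, reusing the sorbet data from the question. The describe function and the 'label' column are hypothetical names chosen for the example.

```python
import pandas as pd

sorbet = pd.DataFrame({
    'flavour': ['orange', 'orange', 'lemon', 'lemon'],
    'niceosity': [4, 5, 7, 8],
})

def describe(row):
    # With axis=1, 'row' is a Series holding one row, keyed by column name.
    return f"{row['flavour']}:{row['niceosity']}"

# Build a new column from the values in each row.
sorbet['label'] = sorbet.apply(describe, axis=1)
print(sorbet)
```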
The shape attribute (it is not a method) returns the number of rows and the number of columns as a tuple, which you can use to check whether a Pandas DataFrame is empty. DataFrame.shape[0] is the number of rows. If there are no rows it is 0, so comparing it with 0 gives True.
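For instance, an emptiness check along these lines might look as follows. The filter threshold of 100 is an arbitrary value picked so that no rows match.

```python
import pandas as pd

sorbet = pd.DataFrame({
    'flavour': ['orange', 'orange', 'lemon', 'lemon'],
    'niceosity': [4, 5, 7, 8],
})

# A filter that matches nothing leaves a frame with columns but no rows.
nothing = sorbet[sorbet['niceosity'] > 100]

print(sorbet.shape)    # (4, 2)
print(nothing.shape)   # (0, 2)

# shape[0] is the row count, so comparing it with 0 tests for emptiness.
is_empty = nothing.shape[0] == 0
print(is_empty)        # True
```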
The groupby/apply operation returns a new DataFrame with a named index. The name is that of the column by which the original DataFrame was grouped.
The name shows up above the index. If you reset it to None, then that row disappears:
In [155]: sorbet_vals.index.name = None
In [156]: sorbet_vals
Out[156]: 
        mean  total
lemon    7.5      2
orange   4.5      2
[2 rows x 2 columns]
Note that the name is useful -- I don't really recommend removing it. The name allows you to refer to that index by name rather than merely by number. 
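To make that concrete, here is a hedged sketch of two ways the index name can be used. It builds the grouped frame with a named aggregation via agg rather than the apply call from the question (a substitution for brevity; the resulting index is named the same way).

```python
import pandas as pd

sorbet = pd.DataFrame({
    'flavour': ['orange', 'orange', 'lemon', 'lemon'],
    'niceosity': [4, 5, 7, 8],
})

# Swapped-in technique: named aggregation instead of groupby/apply.
sorbet_vals = sorbet.groupby('flavour')['niceosity'].agg(total='count', mean='mean')

# The index carries the name of the grouping column.
print(sorbet_vals.index.name)   # flavour

# query() can refer to a named index as if it were a column.
lemon = sorbet_vals.query("flavour == 'lemon'")

# The index values can also be fetched by name rather than by level number.
flavours = sorbet_vals.index.get_level_values('flavour')
print(lemon)
print(list(flavours))
```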
If you wish the index to be a column, use reset_index:
In [209]: sorbet_vals.reset_index(inplace=True); sorbet_vals
Out[209]: 
  flavour  mean  total
0   lemon   7.5      2
1  orange   4.5      2
[2 rows x 3 columns]