
Using .iterrows() with Series.nlargest() to get the highest number in each row of a DataFrame

I am trying to create a function that uses df.iterrows() and Series.nlargest. I want to iterate over each row and find the largest number and then mark it as a 1. This is the data frame:

A   B    C
9   6    5
3   7    2

Here is the output I wish to have:

A    B   C
1    0   0
0    1   0

This is the function I wish to use here:

def get_top_n(df, top_n):
    """


    Parameters
    ----------
    df : DataFrame

    top_n : int
        The top number to get
    Returns
    -------
    top_numbers : DataFrame
    Returns the top number marked with a 1

    """
    # Implement Function
    for row in df.iterrows():
        top_numbers = row.nlargest(top_n).sum()

    return top_numbers

I get the following error: AttributeError: 'tuple' object has no attribute 'nlargest'

Help would be appreciated on how to re-write my function in a neater way and to actually work! Thanks in advance

Deepak M asked Aug 02 '18 05:08





2 Answers

Unpack each item into two variables, because iterrows() yields an (index, Series) tuple for each row:

for i, row in df.iterrows():
    top_numbers = row.nlargest(top_n).sum()
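
To see where the AttributeError comes from, here is a minimal sketch that inspects the first item produced by iterrows(). The DataFrame is rebuilt from the question's sample data with a default integer index, which is an assumption; the variable names are only illustrative.

```python
import pandas as pd

# Sample data from the question (default 0, 1 index assumed)
df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})

# Each item from iterrows() is a 2-tuple: (index label, row as a Series)
first = next(df.iterrows())
print(type(first))

# Unpacking gives access to the Series, which does have nlargest()
index_label, row = first
top_label = row.nlargest(1).index[0]
print(index_label, top_label)
```

Calling nlargest on `first` itself would fail, since a tuple has no such method; calling it on `row` works.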

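A general solution uses numpy.argsort twice: the first call gives the descending sort order of each row, the second turns that order into each value's rank. The ranks are then compared with top_n and the boolean array is converted to integers:

import pandas as pd

def get_top_n(df, top_n):
    # Check the type first so a non-integer never reaches the comparison below
    if not isinstance(top_n, int):
        raise ValueError("top_n must be an integer")
    if top_n > len(df.columns):
        raise ValueError("top_n is higher than the number of columns")

    # argsort of the negated values gives the descending sort order;
    # a second argsort converts that order into ranks (0 = largest in the row)
    ranks = (-df.values).argsort(axis=1).argsort(axis=1)
    arr = (ranks < top_n).astype(int)
    return pd.DataFrame(arr, index=df.index, columns=df.columns)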

df1 = get_top_n(df, 2)
print (df1)
   A  B  C
0  1  1  0
1  1  1  0

df1 = get_top_n(df, 1)
print (df1)
   A  B  C
0  1  0  0
1  0  1  0
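
When building a mask from argsort, it helps to keep sort order and rank apart. The sketch below uses a hypothetical row [1, 3, 2] (not from the question) where the two differ: comparing the raw argsort output with top_n marks the wrong column, while comparing the ranks marks the maximum.

```python
import numpy as np

# Hypothetical row chosen so that sort order and rank differ
row = np.array([[1, 3, 2]])

# Positions of the values in descending order: 3 (col 1), 2 (col 2), 1 (col 0)
order = (-row).argsort(axis=1)

# Rank of each column's value, 0 = largest: col 0 -> 2, col 1 -> 0, col 2 -> 1
ranks = order.argsort(axis=1)

mask_from_order = (order < 1).astype(int)  # marks col 2, which holds 2
mask_from_ranks = (ranks < 1).astype(int)  # marks col 1, which holds 3

print(order, ranks, mask_from_order, mask_from_ranks)
```

With the question's sample rows the two arrays happen to coincide, so the difference only shows up on other data.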

EDIT:

A solution with iterrows is possible, but not recommended because it is slow. Note that it overwrites df in place:

top_n = 2
for i, row in df.iterrows():
    top = row.nlargest(top_n).index
    df.loc[i] = 0
    df.loc[i, top] = 1

print (df)
   A  B  C
0  1  1  0
1  1  1  0
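
If a loop-free version that leaves the input untouched is preferred, one option is DataFrame.rank along the rows. This is a sketch of a different technique, not the method used above; the function name get_top_n_rank is hypothetical, and method="first" is an assumed tie-breaking choice (ties go to the leftmost column).

```python
import pandas as pd

def get_top_n_rank(df, top_n):
    # Rank each row in descending order: 1 = largest value in that row.
    # method="first" breaks ties by column position (an assumption).
    ranks = df.rank(axis=1, method="first", ascending=False)
    # Keep the top_n ranks as 1, everything else as 0
    return (ranks <= top_n).astype(int)

df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})
result = get_top_n_rank(df, 1)
print(result)
```

Because a new frame is returned, the original df keeps its values.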
jezrael answered Oct 07 '22 01:10



For context, the DataFrame consists of stock return data for the S&P 500 over approximately 4 years.

def get_top_n(prev_returns, top_n):

    # generate dataframe populated with zeros for merging
    top_stocks = pd.DataFrame(0, columns = prev_returns.columns, index = prev_returns.index)

    # find top_n largest entries by row
    df = prev_returns.apply(lambda x: x.nlargest(top_n), axis=1)

    # merge dataframes
    top_stocks = top_stocks.merge(df, how = 'right').set_index(df.index)

    # return dataframe marking non-null entries (the top_n values) with a 1 and the rest with a 0
    return (top_stocks.notnull()) * 1
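
To make the intermediate step easier to follow, here is a small sketch of what the row-wise apply with nlargest produces on the question's sample data, followed by a reindex that restores any missing columns. Using reindex in place of the merge is my assumption about a simpler route, not part of the answer above.

```python
import pandas as pd

# Sample data from the question, standing in for prev_returns
prev_returns = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})

# Each row keeps only its top value; all other cells become NaN.
# A column that never holds a row maximum can be missing from the result.
top = prev_returns.apply(lambda row: row.nlargest(1), axis=1)
print(top)

# Restore the original column set and order, then mark non-null cells with 1
marked = top.reindex(columns=prev_returns.columns).notnull().astype(int)
print(marked)
```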
Josmoor98 answered Oct 07 '22 00:10
