I am trying to create a function that uses df.iterrows() and Series.nlargest. I want to iterate over each row, find the largest number, and mark it as a 1. This is the data frame:
A B C
9 6 5
3 7 2
Here is the output I wish to have:
A B C
1 0 0
0 1 0
This is the function I wish to use here:
def get_top_n(df, top_n):
    """
    Parameters
    ----------
    df : DataFrame
    top_n : int
        The top number to get
    Returns
    -------
    top_numbers : DataFrame
        Returns the top number marked with a 1
    """
    # Implement Function
    for row in df.iterrows():
        top_numbers = row.nlargest(top_n).sum()
    return top_numbers
I get the following error: AttributeError: 'tuple' object has no attribute 'nlargest'
Help would be appreciated on how to rewrite my function in a neater way so that it actually works! Thanks in advance.
Pandas DataFrame iterrows() method: iterrows() returns a generator over the rows of the DataFrame. Each iteration yields an (index, data) pair: index is the row's label (a tuple for a MultiIndex) and data is the row's contents as a pandas Series, which is a one-dimensional ndarray with axis labels.
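To see why the loop in the question fails, here is a minimal sketch built on the question's data frame. It only illustrates that each item produced by iterrows() is a tuple that has to be unpacked before nlargest can be called on the row; the variable names are illustrative.

```python
import pandas as pd

# the data frame from the question
df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})

# take the first item produced by iterrows() without unpacking it
first = next(df.iterrows())
print(type(first))  # a tuple, which has no nlargest attribute

# unpack the pair: idx is the row label, row is a Series
idx, row = first
print(idx)
print(row.nlargest(1))
```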
DataFrame.nlargest can take more than one column to order the top rows. Passing a list of columns returns the first n rows ordered by those columns in descending order, for example the top 3 rows with the largest values in column “lifeExp” and then “gdpPercap”.
The nlargest() method returns the specified number of rows, taken from the top after sorting the DataFrame in descending order by the specified column. The keep parameter is an optional keyword argument, default 'first', specifying what to do with duplicate values.
Please note that nlargest() only works on a column or Series with numeric values. Passing a column made up of strings, such as a “Name” column, raises an error.
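As a rough illustration of the keep parameter on a single row, the sketch below uses a hypothetical Series with a tie between two labels (the values are made up and are not taken from the question).

```python
import pandas as pd

# hypothetical row with a tie between A and B
row = pd.Series({"A": 9, "B": 9, "C": 5})

top_first = row.nlargest(1)               # keep="first" is the default
top_last = row.nlargest(1, keep="last")   # prefer the last tied entry
top_all = row.nlargest(1, keep="all")     # keep every tied entry, may exceed n

print(list(top_first.index))
print(list(top_last.index))
print(list(top_all.index))
```

With keep="all", more than n entries can come back, which matters if exactly top_n columns per row must be marked.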
Add an i variable, because iterrows returns the index together with a Series for each row:
for i, row in df.iterrows():
    top_numbers = row.nlargest(top_n).sum()
General solution with numpy.argsort for positions in descending order, then compare and convert the boolean array to integers:
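import pandas as pd

def get_top_n(df, top_n):
    # check the type first, so a non-integer fails with a clear message
    if not isinstance(top_n, int):
        raise ValueError("Value is not an integer")
    if top_n > len(df.columns):
        raise ValueError("Value is higher than the number of columns")
    # argsort gives column positions in descending order of value;
    # a second argsort turns those positions into a rank per column
    ranks = (-df.values).argsort(axis=1).argsort(axis=1)
    arr = (ranks < top_n).astype(int)
    return pd.DataFrame(arr, index=df.index, columns=df.columns)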
df1 = get_top_n(df, 2)
print(df1)
   A  B  C
0  1  1  0
1  1  1  0
df1 = get_top_n(df, 1)
print(df1)
   A  B  C
0  1  0  0
1  0  1  0
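An alternative worth considering is DataFrame.rank, which ranks the values within each row directly. This is a swapped-in technique, not the argsort approach above, and the function name get_top_n_rank and the third row of the test frame are hypothetical additions for illustration.

```python
import pandas as pd

def get_top_n_rank(df, top_n):
    # rank 1 is the largest value in each row; method="first" breaks
    # ties by column order so exactly top_n cells are marked per row
    ranks = df.rank(axis=1, ascending=False, method="first")
    return (ranks <= top_n).astype(int)

# the question's two rows plus one hypothetical row (5, 9, 6)
df = pd.DataFrame({"A": [9, 3, 5], "B": [6, 7, 9], "C": [5, 2, 6]})
result = get_top_n_rank(df, 1)
print(result)
```

The extra row is there because its largest value is not in the first column and its values are not in sorted order, which makes it a useful check for any ranking-based version.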
EDIT:
A solution with iterrows is possible, but not recommended, because it is slow:
top_n = 2
for i, row in df.iterrows():
    top = row.nlargest(top_n).index
    df.loc[i] = 0
    df.loc[i, top] = 1
print(df)
   A  B  C
0  1  1  0
1  1  1  0
For context, the dataframe consists of stock return data for the S&P 500 over approximately 4 years:
def get_top_n(prev_returns, top_n):
    # generate dataframe populated with zeros for merging
    top_stocks = pd.DataFrame(0, columns=prev_returns.columns, index=prev_returns.index)
    # find top_n largest entries by row
    df = prev_returns.apply(lambda x: x.nlargest(top_n), axis=1)
    # merge dataframes
    top_stocks = top_stocks.merge(df, how='right').set_index(df.index)
    # return dataframe replacing non-null answers with a 1
    return (top_stocks.notnull()) * 1
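A possibly simpler variant of the same idea, sketched under the assumption that the merge is only there to restore columns that never appear in any row's top_n: reindex on the original columns instead. The name get_top_n_apply is hypothetical, and the small frame stands in for the real return data.

```python
import pandas as pd

def get_top_n_apply(prev_returns, top_n):
    # per row, keep only the top_n entries; other cells become NaN and
    # columns that are never selected are dropped from the result
    top = prev_returns.apply(lambda row: row.nlargest(top_n), axis=1)
    # mark selected cells, then restore the original columns and order
    marked = top.notna().reindex(columns=prev_returns.columns, fill_value=False)
    return marked.astype(int)

# small stand-in for the return data, using the question's numbers
df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})
top1 = get_top_n_apply(df, 1)
print(top1)
```

Like the version above, this still calls nlargest once per row, so it is likely to be slower than a vectorized approach on a large frame.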