I am trying to use a loop function to create a matrix of whether a product was seen in a particular week.
Each row in the df (representing a product) has a close_date (the date the product closed) and a week_diff (the number of weeks the product was listed).
import pandas

mydata = [{'subid' : 'A', 'Close_date_wk': 25, 'week_diff':3},
          {'subid' : 'B', 'Close_date_wk': 26, 'week_diff':2},
          {'subid' : 'C', 'Close_date_wk': 27, 'week_diff':2},]
df = pandas.DataFrame(mydata)

My goal is to see how many alternative products were listed for each product in each date_range.
I have set up the following loop:
for index, row in df.iterrows():
    i = 0
    max_range = row['Close_date_wk']
    min_range = int(row['Close_date_wk'] - row['week_diff'])
    for i in range(min_range,max_range):
        col_head = 'job_week_' + str(i)
        row[col_head] = 1
Can you please help explain why the "row[col_head] = 1" line neither adds a column nor sets a value in that column for that row?
For example, if:
row A has date range 1,2,3
row B has date range 2,3
row C has date range 3,4,5
then ideally I would like to end up with
row A has
0 alternative products in week 1
1 alternative products in week 2
2 alternative products in week 3
row B has
1 alternative products in week 2
2 alternative products in week 3
&c.
If you want to add a column to a DataFrame by calling a function on another column, a for loop over iterrows() is not the preferred way to go. Instead, you'll want to use apply().
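As a minimal sketch of that approach on the question's data, apply() with axis=1 calls a function once per row and collects the return values into a new column. The helper name first_listed_week and the column name first_week are made up for illustration and are not part of the original question.

```python
import pandas as pd

df = pd.DataFrame([
    {'subid': 'A', 'Close_date_wk': 25, 'week_diff': 3},
    {'subid': 'B', 'Close_date_wk': 26, 'week_diff': 2},
    {'subid': 'C', 'Close_date_wk': 27, 'week_diff': 2},
])

def first_listed_week(row):
    # Hypothetical helper: the first week in which the product was listed.
    return int(row['Close_date_wk'] - row['week_diff'])

# axis=1 passes each row to the function as a Series;
# assigning the result to df[...] creates the new column.
df['first_week'] = df.apply(first_listed_week, axis=1)
```

For plain arithmetic like this, df['Close_date_wk'] - df['week_diff'] would do the same job without any row-wise call; apply() is mainly useful when the per-row logic is harder to vectorize.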
The iterrows() method generates an iterator over the DataFrame, allowing us to iterate over each row. Each iteration produces an index value and a row object (a pandas Series) containing the data in that row.
iterrows() - used for iterating over the rows as (index, Series) pairs.
items() - used for iterating over the columns as (column name, Series) pairs; older pandas versions called this iteritems(), which was removed in pandas 2.0.
itertuples() - used for iterating over the rows as namedtuples.
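To make the difference between these iterators concrete, here is a small sketch on a made-up two-row frame (the data and variable names are illustrative only). It uses items(), the current name for what older pandas called iteritems().

```python
import pandas as pd

df = pd.DataFrame({'subid': ['A', 'B'], 'week_diff': [3, 2]})

# iterrows(): (index, Series) pairs, one Series per row
row_pairs = [(index, row['subid']) for index, row in df.iterrows()]

# items(): (column name, Series) pairs, one Series per column
column_names = [name for name, column in df.items()]

# itertuples(): one namedtuple per row, fields read as attributes
tuples = [(t.subid, t.week_diff) for t in df.itertuples(index=False)]
```

All three are meant for reading values; none of them is a reliable way to write back into the DataFrame.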
apply() still loops over the rows internally, but it is generally faster than an explicit iterrows() loop.
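You can't add a new column to the df by assigning to row here. Instead, write to the original df through an indexer such as .loc (or .iloc for existing positions; .ix, which older answers mention, was removed in pandas 1.0). For example, assuming import pandas as pd and import numpy as np: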
In [29]: df = pd.DataFrame(columns=list('abc'), data = np.random.randn(5,3))
         df
Out[29]:
          a         b         c
0 -1.525011  0.778190 -1.010391
1  0.619824  0.790439 -0.692568
2  1.272323  1.620728  0.192169
3  0.193523  0.070921  1.067544
4  0.057110 -1.007442  1.706704

In [30]: for index,row in df.iterrows():
             df.loc[index,'d'] = np.random.randint(0, 10)
         df
Out[30]:
          a         b         c  d
0 -1.525011  0.778190 -1.010391  9
1  0.619824  0.790439 -0.692568  9
2  1.272323  1.620728  0.192169  1
3  0.193523  0.070921  1.067544  0
4  0.057110 -1.007442  1.706704  9
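Assigning to an existing column through row took effect in this session, although the pandas documentation warns that iterrows() may return a copy rather than a view, so this is not guaranteed to work:

In [31]: # reset the df by slicing
         df = df[list('abc')]
         for index,row in df.iterrows():
             row['b'] = np.random.randint(0, 10)
         df
Out[31]:
          a  b         c
0 -1.525011  8 -1.010391
1  0.619824  2 -0.692568
2  1.272323  8  0.192169
3  0.193523  2  1.067544
4  0.057110  3  1.706704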
But adding a new column using row won't work:
In [35]: df = df[list('abc')]
         for index,row in df.iterrows():
             row['d'] = np.random.randint(0,10)
         df
Out[35]:
          a  b         c
0 -1.525011  8 -1.010391
1  0.619824  2 -0.692568
2  1.272323  8  0.192169
3  0.193523  2  1.067544
4  0.057110  3  1.706704
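Applied to the loop from the question, the same fix might look like the sketch below: the only change to the loop body is writing through df.loc instead of row. The final step, which counts alternative products per week, is one possible reading of the stated goal rather than something the question spells out, and the names week_cols, flags and alternatives are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([
    {'subid': 'A', 'Close_date_wk': 25, 'week_diff': 3},
    {'subid': 'B', 'Close_date_wk': 26, 'week_diff': 2},
    {'subid': 'C', 'Close_date_wk': 27, 'week_diff': 2},
])

for index, row in df.iterrows():
    max_range = int(row['Close_date_wk'])
    min_range = int(row['Close_date_wk'] - row['week_diff'])
    for week in range(min_range, max_range):
        # Write to the DataFrame itself; assigning to `row` would be lost.
        df.loc[index, 'job_week_' + str(week)] = 1

# Cells that were never set are NaN; turn the flags into 0/1 integers.
week_cols = [c for c in df.columns if c.startswith('job_week_')]
df[week_cols] = df[week_cols].fillna(0).astype(int)

# Assumed interpretation of "alternative products": other products listed
# in the same week. Weeks in which a product was not listed stay NaN.
flags = df[week_cols]
alternatives = flags.mul(flags.sum() - 1, axis='columns').where(flags == 1)
```

With this data, product A is listed in weeks 22 to 24, B in weeks 24 and 25, and C in weeks 25 and 26, so the only overlaps are A with B in week 24 and B with C in week 25.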