I have a relatively large dataframe. I'm trying to iterate over each row and update a column based on the value of another column (basically repeating a lookup until nothing more can be updated).
I have the following:
# df is the huge dataframe (1K to 10K+ rows x 51 cols)
has_update = True
while has_update:
    has_update = False
    for_procdf = df.loc[df['Incident Group ID'] == '-']
    for i, row in for_procdf.iterrows():
        # Check if the row's parent ticket id is an existing ticket id in the bigger df
        resultRow = df.loc[df['Ticket ID'] == row['Parent Ticket ID']]
        resultCount = len(resultRow.index)
        if resultCount == 1:
            IncidentGroupID = resultRow.iloc[0]['Incident Group ID']
            if IncidentGroupID != '-':
                df.at[i, "Incident Group ID"] = IncidentGroupID
                has_update = True
When I execute the script, an error occurs with the following traceback:
Traceback (most recent call last):
  File "./sdm.etl.py", line 76, in <module>
    main()
  File "./sdm.etl.py", line 28, in main
    fillIncidentGroupID(sdmdf.df)
  File "./sdm.etl.py", line 47, in fillIncidentGroupID
    df.at[i, "Incident Group ID"] = IncidentGroupID
  File "/usr/local/lib/python3.6/site-packages/pandas/core/indexing.py", line 2159, in __setitem__
    self.obj._set_value(*key, takeable=self._takeable)
  File "/usr/local/lib/python3.6/site-packages/pandas/core/frame.py", line 2580, in _set_value
    series = self._get_item_cache(col)
  File "/usr/local/lib/python3.6/site-packages/pandas/core/generic.py", line 2490, in _get_item_cache
    res = self._box_item_values(item, values)
  File "/usr/local/lib/python3.6/site-packages/pandas/core/frame.py", line 3096, in _box_item_values
    return self._constructor(values.T, columns=items, index=self.index)
AttributeError: 'BlockManager' object has no attribute 'T'
However, reproducing a similar scenario on a small dataframe raises no error:
>>> qdf = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30], [10, 13, 17]], index=[0,1,2,3], columns=['Ab 1', 'Bc 2', 'Cd 3'])
>>> qdf
   Ab 1  Bc 2  Cd 3
0     0     2     3
1     0     4     1
2    10    20    30
3    10    13    17
>>>
>>> qdf1 = qdf.loc[qdf['Ab 1'] == 0]
>>> qdf1
   Ab 1  Bc 2  Cd 3
0     0     2     3
1     0     4     1
>>>
>>> for i, row in qdf1.iterrows():
...     qdf.at[i, 'Ab 1'] = 10
...
>>>
>>> qdf
   Ab 1  Bc 2  Cd 3
0    10     2     3
1    10     4     1
2    10    20    30
3    10    13    17
What seems to be the problem with my implementation?
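As an aside, the lookup loop in the question can also be sketched without iterrows. This is only a minimal sketch, not the asker's method: it assumes unique column names, a unique row index, and the '-' placeholder used above. The function name fill_incident_group_id and the sample tickets frame are hypothetical.

```python
import pandas as pd

def fill_incident_group_id(df: pd.DataFrame) -> pd.DataFrame:
    """Propagate 'Incident Group ID' from parent tickets until nothing changes."""
    col = "Incident Group ID"
    out = df.copy()  # leave the caller's frame untouched
    while True:
        # Only tickets whose ID appears exactly once are valid lookup targets,
        # mirroring the `resultCount == 1` check in the original loop.
        lookup = (
            out.drop_duplicates("Ticket ID", keep=False)
               .set_index("Ticket ID")[col]
        )
        # Group ID of each row's parent ticket (NaN when the parent is unknown).
        parent_group = out["Parent Ticket ID"].map(lookup)
        # Rows still unassigned whose parent already has a real group ID.
        mask = (out[col] == "-") & parent_group.notna() & (parent_group != "-")
        if not mask.any():
            break
        out.loc[mask, col] = parent_group[mask]
    return out

# Hypothetical sample data: T3 depends on T2, which depends on T1.
tickets = pd.DataFrame({
    "Ticket ID": ["T1", "T2", "T3", "T4"],
    "Parent Ticket ID": ["-", "T1", "T2", "T9"],
    "Incident Group ID": ["G1", "-", "-", "-"],
})
filled = fill_incident_group_id(tickets)
```

Each pass fills at least one more '-' row or stops, so the loop terminates. T4 keeps '-' because its parent T9 is not in the frame.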
It turns out Nihal is right: the error is caused by a duplicate column name. My dataframe was so large that I accidentally ended up with a duplicate column name. Everything works fine now. A little time away from the code, some rest, and a meal helped me spot the duplicate column. Cheers!
Below are the columns of my dataframe. "RCA Group ID" has a duplicate near the end.
['Incident Group ID', 'RCA Group ID', 'Parent Ticket ID', 'Ticket ID', ..., 'RCA Group ID', 'Is Sector Down', 'Relationship Type']
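To illustrate what a duplicated column name does, here is a small sketch on a hypothetical miniature frame (the demo data and variable names are made up, only the column names come from the list above). It does not reproduce the exact traceback, which depends on the pandas version, but it shows how selection changes and how the duplicate can be detected and dropped.

```python
import pandas as pd

# Hypothetical miniature of the real frame: 'RCA Group ID' appears twice.
demo = pd.DataFrame(
    [["-", "r1", "T1", "r1b"], ["G2", "r2", "T2", "r2b"]],
    columns=["Incident Group ID", "RCA Group ID", "Ticket ID", "RCA Group ID"],
)

# A unique label selects a Series; a duplicated label selects a DataFrame.
unique_selection = demo["Ticket ID"]
duplicated_selection = demo["RCA Group ID"]

# Boolean mask marking the second and later occurrences of each column name.
dup_mask = demo.columns.duplicated()
duplicate_names = demo.columns[dup_mask].tolist()

# One possible repair: keep only the first occurrence of each name.
deduped = demo.loc[:, ~dup_mask]
```

Dropping the later occurrence is only one option. If both columns hold meaningful data, renaming one of them is the safer fix.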
"the error is caused by a duplicate column name"
That was true in my case.
You can use the following function to quickly determine which column names are duplicates.
import pandas as pd

def get_duplicate_cols(df: pd.DataFrame) -> pd.Series:
    return pd.Series(df.columns).value_counts()[lambda x: x>1]
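A usage sketch, assuming a hypothetical frame with one repeated name. The function is repeated here so the snippet runs on its own, and the fail-fast check at the end is just one way to use the result.

```python
import pandas as pd

def get_duplicate_cols(df: pd.DataFrame) -> pd.Series:
    # Count each column name and keep only names that occur more than once.
    return pd.Series(df.columns).value_counts()[lambda x: x > 1]

# Hypothetical frame: 'RCA Group ID' is present twice.
sample = pd.DataFrame(
    [[1, 2, 3]],
    columns=["RCA Group ID", "Ticket ID", "RCA Group ID"],
)

dups = get_duplicate_cols(sample)  # index = duplicated names, values = counts

# Fail fast right after loading, before any row-wise updates are attempted.
try:
    if not dups.empty:
        raise ValueError(f"Duplicate column names: {dups.index.tolist()}")
    message = ""
except ValueError as exc:
    message = str(exc)
```

A frame with unique column names yields an empty Series, so the check passes silently.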