I have a pandas dataframe that looks like this:
COL data
line1 [A,B,C]
where the items in the data column could be either a list or just comma-separated elements. Is there an easy way of getting:
COL data
line1 A
line1 B
line1 C
I could iterate over the list and manually duplicate the rows via python, but is there some magic pandas trick for doing this? The key point is how to automatically duplicate the rows.
Thanks!
You could write a simple cleaning function to turn each string into a list (assuming the items aren't commas themselves, and you can't simply use ast.literal_eval):
def clean_string_to_list(s):
    return [c for c in s if c not in '[,]']  # you might need to catch errors

df['data'] = df['data'].apply(clean_string_to_list)
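As a quick sanity check, here is a minimal sketch of that cleaning step on a hypothetical frame whose data column holds strings like "[A,B,C]" (note this character-by-character approach only works when each item is a single character):

```python
import pandas as pd

# Hypothetical frame: 'data' holds a string, not a list
df = pd.DataFrame({'COL': ['line1'], 'data': ['[A,B,C]']})

def clean_string_to_list(s):
    # Drop the brackets and commas, keeping each item character
    return [c for c in s if c not in '[,]']  # you might need to catch errors

df['data'] = df['data'].apply(clean_string_to_list)
print(df['data'].iloc[0])  # ['A', 'B', 'C']
```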
Iterating through the rows seems like a reasonable choice:
In [11]: pd.DataFrame([(row['COL'], d)
                       for _, row in df.iterrows()
                       for d in row['data']],
                      columns=df.columns)
Out[11]:
COL data
0 line1 A
1 line1 B
2 line1 C
I'm afraid I don't think pandas caters specifically for this kind of manipulation.
You can use df.explode(). Refer to the documentation; I believe this is exactly the functionality you need.
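A minimal sketch of explode() on the example from the question (explode was added in pandas 0.25, so it needs a reasonably recent version; the 'data' cells must already be lists):

```python
import pandas as pd

# Reproduce the example frame: one row whose 'data' cell holds a list
df = pd.DataFrame({'COL': ['line1'], 'data': [['A', 'B', 'C']]})

# explode() gives each list element its own row, duplicating 'COL'
out = df.explode('data').reset_index(drop=True)
print(out)
#      COL data
# 0  line1    A
# 1  line1    B
# 2  line1    C
```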