 

Split pandas dataframe column list values to duplicate rows [duplicate]

I have a dataframe that looks like the following:

publication_title    authors                             type ...
title 1              ['author1', 'author2', 'author3']   proceedings
title 2              ['author4', 'author5']              collections
title 3              ['author6', 'author7']              books
.
.
. 

I want to take the 'authors' column and split each list into separate rows, duplicating all the other columns. I also want to store the result in a new column named 'author' and keep the original column.

The following depicts exactly what I want to achieve:

publication_title    authors                             author          type ...
title 1              ['author1', 'author2', 'author3']   author1         proceedings
title 1              ['author1', 'author2', 'author3']   author2         proceedings
title 1              ['author1', 'author2', 'author3']   author3         proceedings
title 2              ['author4', 'author5']              author4         collections
title 2              ['author4', 'author5']              author5         collections
title 3              ['author6', 'author7']              author6         books
title 3              ['author6', 'author7']              author7         books
.
.
. 

I have tried to achieve this using the pandas DataFrame.explode method, but I cannot find a way to store the results in a new column.

Thanks for the help.

Aniss Chohra asked Aug 22 '19 21:08



1 Answer

Since pandas 0.25.0 there is the explode method. First duplicate the authors column under the new name author using assign, then explode that new column into rows, which repeats the values of the other columns:

df.assign(author=df['authors']).explode('author')

Output

  publication_title                      authors         type   author
0           title_1  [author1, author2, author3]  proceedings  author1
0           title_1  [author1, author2, author3]  proceedings  author2
0           title_1  [author1, author2, author3]  proceedings  author3
1           title_2           [author4, author5]  collections  author4
1           title_2           [author4, author5]  collections  author5
2           title_3           [author6, author7]        books  author6
2           title_3           [author6, author7]        books  author7
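
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The sample frame is rebuilt from the table in the question, and it assumes that authors holds real Python lists rather than their string representation.

```python
import pandas as pd

# Sample data rebuilt from the question (an assumption about how it is stored:
# each 'authors' cell is a real Python list, not a string).
df = pd.DataFrame({
    "publication_title": ["title_1", "title_2", "title_3"],
    "authors": [
        ["author1", "author2", "author3"],
        ["author4", "author5"],
        ["author6", "author7"],
    ],
    "type": ["proceedings", "collections", "books"],
})

# Copy 'authors' into a new 'author' column, then explode only the copy.
# The original 'authors' column keeps its full list on every row.
exploded = df.assign(author=df["authors"]).explode("author")
print(exploded)
```

Neither assign nor explode modifies df in place, so the result has to be assigned to a name if you want to keep it.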

If you want to remove the duplicated index, use reset_index:

df.assign(author=df['authors']).explode('author').reset_index(drop=True)

Output

  publication_title                      authors         type   author
0           title_1  [author1, author2, author3]  proceedings  author1
1           title_1  [author1, author2, author3]  proceedings  author2
2           title_1  [author1, author2, author3]  proceedings  author3
3           title_2           [author4, author5]  collections  author4
4           title_2           [author4, author5]  collections  author5
5           title_3           [author6, author7]        books  author6
6           title_3           [author6, author7]        books  author7
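
As a possible alternative on newer versions, explode accepts an ignore_index parameter (added in pandas 1.1.0), which renumbers the rows without a separate reset_index call. The sketch below assumes pandas 1.1.0 or later and uses sample data rebuilt from the question.

```python
import pandas as pd

# Sample data rebuilt from the question (assumes real lists in 'authors').
df = pd.DataFrame({
    "publication_title": ["title_1", "title_2", "title_3"],
    "authors": [
        ["author1", "author2", "author3"],
        ["author4", "author5"],
        ["author6", "author7"],
    ],
    "type": ["proceedings", "collections", "books"],
})

# ignore_index=True (assumes pandas >= 1.1.0) labels the rows 0..n-1,
# so no reset_index(drop=True) is needed afterwards.
flat = df.assign(author=df["authors"]).explode("author", ignore_index=True)
print(flat)
```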
Erfan answered Sep 18 '22 14:09
