I have a dataframe containing orders data. Each order has multiple packages, stored as comma-separated strings in the package and package_code columns.
I want to split the packages data and create a row for each package, including its order details.
Here is a sample input dataframe:
import pandas as pd

df = pd.DataFrame({"order_id": [1, 3, 7],
                   "order_date": ["20/5/2018", "22/5/2018", "23/5/2018"],
                   "package": ["p1,p2,p3", "p4", "p5,p6"],
                   "package_code": ["#111,#222,#333", "#444", "#555,#666"]})
And this is what I am trying to achieve as output (one row per package, with the order details repeated):

   order_id order_date package package_code
0         1  20/5/2018      p1         #111
1         1  20/5/2018      p2         #222
2         1  20/5/2018      p3         #333
3         3  22/5/2018      p4         #444
4         7  23/5/2018      p5         #555
5         7  23/5/2018      p6         #666
How can I do that with pandas?
To split a cell into multiple rows in a pandas dataframe, we can call apply with a lambda function that uses str.split to split each string value on the delimiter, and then call explode to turn the resulting lists into new rows.
Series and DataFrame define an .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column. Since you have columns of comma-separated strings, split each string on the comma to get a list of elements, then call explode on that column.
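As a minimal sketch of that mechanic (assuming pandas 0.25+, where explode is available, and the df defined above): split one column on commas so every cell holds a list, then explode it. Exploding only one column leaves package_code untouched, which is why the answer below applies the same treatment to every splittable column.

out = df.copy()
out['package'] = out['package'].str.split(',')  # each cell becomes a list, e.g. ['p1', 'p2', 'p3']
out = out.explode('package')                    # one row per list element; the other columns repeat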
Assuming all splittable columns have the same number of comma-separated items, you can split on the comma and then use Series.explode on each column:
(df.set_index(['order_id', 'order_date'])
   .apply(lambda x: x.str.split(',').explode())
   .reset_index())

   order_id order_date package package_code
0         1  20/5/2018      p1         #111
1         1  20/5/2018      p2         #222
2         1  20/5/2018      p3         #333
3         3  22/5/2018      p4         #444
4         7  23/5/2018      p5         #555
5         7  23/5/2018      p6         #666
Details
Set the columns not to be touched as the index,
df.set_index(['order_id', 'order_date'])

                      package    package_code
order_id order_date
1        20/5/2018   p1,p2,p3  #111,#222,#333
3        22/5/2018         p4            #444
7        23/5/2018      p5,p6       #555,#666
The next step is a 2-step process: split on the comma to get a column of lists, then call explode to explode the list values into their own rows.
_.apply(lambda x: x.str.split(',').explode())

                    package package_code
order_id order_date
1        20/5/2018       p1         #111
         20/5/2018       p2         #222
         20/5/2018       p3         #333
3        22/5/2018       p4         #444
7        23/5/2018       p5         #555
         23/5/2018       p6         #666
Finally, reset the index.
_.reset_index()

   order_id order_date package package_code
0         1  20/5/2018      p1         #111
1         1  20/5/2018      p2         #222
2         1  20/5/2018      p3         #333
3         3  22/5/2018      p4         #444
4         7  23/5/2018      p5         #555
5         7  23/5/2018      p6         #666
This should work for any number of columns like this. The essence is a little stack-unstacking magic with str.split.
(df.set_index(['order_date', 'order_id'])
   .stack()
   .str.split(',', expand=True)
   .stack()
   .unstack(-2)
   .reset_index(-1, drop=True)
   .reset_index())

  order_date  order_id package package_code
0  20/5/2018         1      p1         #111
1  20/5/2018         1      p2         #222
2  20/5/2018         1      p3         #333
3  22/5/2018         3      p4         #444
4  23/5/2018         7      p5         #555
5  23/5/2018         7      p6         #666
There is another performant alternative involving chain (shown in the numpy.repeat / itertools.chain answer further down), but you'd need to explicitly chain and repeat every column (a bit of a problem with a lot of columns). Choose whatever fits the description of your problem best, as there's no single answer.
Details
First, set the columns that are not to be touched as the index.
df.set_index(['order_date', 'order_id'])

                      package    package_code
order_date order_id
20/5/2018  1         p1,p2,p3  #111,#222,#333
22/5/2018  3               p4            #444
23/5/2018  7            p5,p6       #555,#666
Next, stack the rows.
_.stack()

order_date  order_id
20/5/2018   1         package               p1,p2,p3
                      package_code    #111,#222,#333
22/5/2018   3         package                     p4
                      package_code              #444
23/5/2018   7         package                  p5,p6
                      package_code         #555,#666
dtype: object
We have a series now, so call str.split on the comma.
_.str.split(',', expand=True)

                                    0     1     2
order_date order_id
20/5/2018  1        package        p1    p2    p3
                    package_code  #111  #222  #333
22/5/2018  3        package        p4  None  None
                    package_code  #444  None  None
23/5/2018  7        package        p5    p6  None
                    package_code  #555  #666  None
We need to get rid of the NULL values, so call stack again.
_.stack()

order_date  order_id
20/5/2018   1         package       0      p1
                                    1      p2
                                    2      p3
                      package_code  0    #111
                                    1    #222
                                    2    #333
22/5/2018   3         package       0      p4
                      package_code  0    #444
23/5/2018   7         package       0      p5
                                    1      p6
                      package_code  0    #555
                                    1    #666
dtype: object
We're almost there. Now we want the second-to-last level of the index to become our columns, so unstack using unstack(-2) (unstack on the second-to-last level).
_.unstack(-2)

                             package package_code
order_date order_id
20/5/2018  1        0             p1         #111
                    1             p2         #222
                    2             p3         #333
22/5/2018  3        0             p4         #444
23/5/2018  7        0             p5         #555
                    1             p6         #666
Get rid of the superfluous last level using reset_index:
_.reset_index(-1, drop=True)

                    package package_code
order_date order_id
20/5/2018  1             p1         #111
           1             p2         #222
           1             p3         #333
22/5/2018  3             p4         #444
23/5/2018  7             p5         #555
           7             p6         #666
And finally,
_.reset_index()

  order_date  order_id package package_code
0  20/5/2018         1      p1         #111
1  20/5/2018         1      p2         #222
2  20/5/2018         1      p3         #333
3  22/5/2018         3      p4         #444
4  23/5/2018         7      p5         #555
5  23/5/2018         7      p6         #666
Here's one way using numpy.repeat and itertools.chain. Conceptually, this is exactly what you want to do: repeat some values, chain others. Recommended for small numbers of columns, otherwise stack-based methods may fare better.
import numpy as np
from itertools import chain

# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split(',')))

# calculate lengths of splits
lens = df['package'].str.split(',').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({'order_id': np.repeat(df['order_id'], lens),
                    'order_date': np.repeat(df['order_date'], lens),
                    'package': chainer(df['package']),
                    'package_code': chainer(df['package_code'])})

print(res)

   order_id order_date package package_code
0         1  20/5/2018      p1         #111
0         1  20/5/2018      p2         #222
0         1  20/5/2018      p3         #333
1         3  22/5/2018      p4         #444
2         7  23/5/2018      p5         #555
2         7  23/5/2018      p6         #666
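Note that np.repeat carries over the original index, which is why the result shows the repeated labels 0, 0, 0, 1, 2, 2. If a clean sequential index is wanted, a reset_index(drop=True) at the end is a small optional tweak (not part of the answer above):

res = res.reset_index(drop=True)  # relabel rows 0..5 instead of the repeated 0, 0, 0, 1, 2, 2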