I have a dataframe containing orders data. Each order has multiple packages, stored as comma-separated strings in the package and package_code columns.
I want to split the packages data and create a row for each package, including its order details.
Here is a sample input dataframe:
import pandas as pd

df = pd.DataFrame({"order_id": [1, 3, 7],
                   "order_date": ["20/5/2018", "22/5/2018", "23/5/2018"],
                   "package": ["p1,p2,p3", "p4", "p5,p6"],
                   "package_code": ["#111,#222,#333", "#444", "#555,#666"]})
And this is what I am trying to achieve as output (one row per package, with the order details repeated):

   order_id order_date package package_code
0         1  20/5/2018      p1         #111
1         1  20/5/2018      p2         #222
2         1  20/5/2018      p3         #333
3         3  22/5/2018      p4         #444
4         7  23/5/2018      p5         #555
5         7  23/5/2018      p6         #666
How can I do that with pandas?
To split a cell into multiple rows in a pandas dataframe, we can call apply with a lambda function that uses str.split to split each string value on the delimiter, and then call explode to turn the resulting lists into new rows.
Series and DataFrame define an .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column. Since you have columns of comma-separated strings, split each string on the comma to get a list of elements, then call explode on that column.
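As a minimal sketch of that mechanic (assuming pandas 0.25+, where explode is available, and the df defined above): split one column on commas so every cell holds a list, then explode it. Exploding only one column leaves package_code untouched, which is why the answer below applies the same treatment to every splittable column.

out = df.copy()
out['package'] = out['package'].str.split(',')  # each cell becomes a list, e.g. ['p1', 'p2', 'p3']
out = out.explode('package')                    # one row per list element; the other columns repeat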
Assuming all splittable columns have the same number of comma-separated items, you can split on the comma and then use Series.explode on each column:
(df.set_index(['order_id', 'order_date'])
   .apply(lambda x: x.str.split(',').explode())
   .reset_index())

   order_id order_date package package_code
0         1  20/5/2018      p1         #111
1         1  20/5/2018      p2         #222
2         1  20/5/2018      p3         #333
3         3  22/5/2018      p4         #444
4         7  23/5/2018      p5         #555
5         7  23/5/2018      p6         #666
Details
Set the columns not to be touched as the index,
df.set_index(['order_id', 'order_date'])

                      package    package_code
order_id order_date
1        20/5/2018   p1,p2,p3  #111,#222,#333
3        22/5/2018         p4            #444
7        23/5/2018      p5,p6       #555,#666
The next step is a 2-step process: split on the comma to get a column of lists, then call explode to explode the list values into their own rows.
_.apply(lambda x: x.str.split(',').explode())

                    package package_code
order_id order_date
1        20/5/2018       p1         #111
         20/5/2018       p2         #222
         20/5/2018       p3         #333
3        22/5/2018       p4         #444
7        23/5/2018       p5         #555
         23/5/2018       p6         #666
Finally, reset the index.
_.reset_index()

   order_id order_date package package_code
0         1  20/5/2018      p1         #111
1         1  20/5/2018      p2         #222
2         1  20/5/2018      p3         #333
3         3  22/5/2018      p4         #444
4         7  23/5/2018      p5         #555
5         7  23/5/2018      p6         #666
This should work for any number of columns like this. The essence is a little stack-unstacking magic with str.split.
(df.set_index(['order_date', 'order_id'])
   .stack()
   .str.split(',', expand=True)
   .stack()
   .unstack(-2)
   .reset_index(-1, drop=True)
   .reset_index())

  order_date  order_id package package_code
0  20/5/2018         1      p1         #111
1  20/5/2018         1      p2         #222
2  20/5/2018         1      p3         #333
3  22/5/2018         3      p4         #444
4  23/5/2018         7      p5         #555
5  23/5/2018         7      p6         #666
There is another performant alternative involving chain (shown in the numpy.repeat / itertools.chain answer further down), but you'd need to explicitly chain and repeat every column (a bit of a problem with a lot of columns). Choose whatever fits the description of your problem best, as there's no single answer.
Details
First, set the columns that are not to be touched as the index.
df.set_index(['order_date', 'order_id'])

                      package    package_code
order_date order_id
20/5/2018  1         p1,p2,p3  #111,#222,#333
22/5/2018  3               p4            #444
23/5/2018  7            p5,p6       #555,#666
Next, stack the rows.
_.stack()

order_date  order_id
20/5/2018   1         package               p1,p2,p3
                      package_code    #111,#222,#333
22/5/2018   3         package                     p4
                      package_code              #444
23/5/2018   7         package                  p5,p6
                      package_code         #555,#666
dtype: object
We have a series now, so call str.split on the comma.
_.str.split(',', expand=True)

                                    0     1     2
order_date order_id
20/5/2018  1        package        p1    p2    p3
                    package_code  #111  #222  #333
22/5/2018  3        package        p4  None  None
                    package_code  #444  None  None
23/5/2018  7        package        p5    p6  None
                    package_code  #555  #666  None
We need to get rid of the NULL values, so call stack again.
_.stack()

order_date  order_id
20/5/2018   1         package       0      p1
                                    1      p2
                                    2      p3
                      package_code  0    #111
                                    1    #222
                                    2    #333
22/5/2018   3         package       0      p4
                      package_code  0    #444
23/5/2018   7         package       0      p5
                                    1      p6
                      package_code  0    #555
                                    1    #666
dtype: object
We're almost there. Now we want the second-to-last level of the index to become our columns, so unstack using unstack(-2) (unstack on the second-to-last level).
_.unstack(-2)

                             package package_code
order_date order_id
20/5/2018  1        0             p1         #111
                    1             p2         #222
                    2             p3         #333
22/5/2018  3        0             p4         #444
23/5/2018  7        0             p5         #555
                    1             p6         #666
Get rid of the superfluous last level using reset_index:
_.reset_index(-1, drop=True)

                    package package_code
order_date order_id
20/5/2018  1             p1         #111
           1             p2         #222
           1             p3         #333
22/5/2018  3             p4         #444
23/5/2018  7             p5         #555
           7             p6         #666
And finally,
_.reset_index()

  order_date  order_id package package_code
0  20/5/2018         1      p1         #111
1  20/5/2018         1      p2         #222
2  20/5/2018         1      p3         #333
3  22/5/2018         3      p4         #444
4  23/5/2018         7      p5         #555
5  23/5/2018         7      p6         #666
Here's one way using numpy.repeat and itertools.chain. Conceptually, this is exactly what you want to do: repeat some values, chain others. Recommended for small numbers of columns, otherwise stack-based methods may fare better.
import numpy as np
from itertools import chain

# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split(',')))

# calculate lengths of splits
lens = df['package'].str.split(',').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({'order_id': np.repeat(df['order_id'], lens),
                    'order_date': np.repeat(df['order_date'], lens),
                    'package': chainer(df['package']),
                    'package_code': chainer(df['package_code'])})

print(res)

   order_id order_date package package_code
0         1  20/5/2018      p1         #111
0         1  20/5/2018      p2         #222
0         1  20/5/2018      p3         #333
1         3  22/5/2018      p4         #444
2         7  23/5/2018      p5         #555
2         7  23/5/2018      p6         #666
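Note that np.repeat carries over the original index, which is why the result shows the repeated labels 0, 0, 0, 1, 2, 2. If a clean sequential index is wanted, a reset_index(drop=True) at the end is a small optional tweak (not part of the answer above):

res = res.reset_index(drop=True)  # relabel rows 0..5 instead of the repeated 0, 0, 0, 1, 2, 2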