Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split cell into multiple rows in pandas dataframe

I have a dataframe contains orders data, each order has multiple packages stored as comma separated string [package & package_code] columns

I want to split the packages data and create a row for each package including its order details

Here is a sample input dataframe:

import pandas as pd df = pd.DataFrame({"order_id":[1,3,7],"order_date":["20/5/2018","22/5/2018","23/5/2018"], "package":["p1,p2,p3","p4","p5,p6"],"package_code":["#111,#222,#333","#444","#555,#666"]}) 

Input Dataframe

And this is what I am trying to achieve as output: Output

How can I do that with pandas?

like image 236
Nobel Avatar asked Jun 06 '18 23:06

Nobel


People also ask

How do I split a cell into multiple rows in pandas?

To split cell into multiple rows in a Python Pandas dataframe, we can use the apply method. to call apply with a lambda function that calls str. split to split the x string value. And then we call explode to fill new rows with the split values.

How do I split a cell in a Dataframe pandas?

Split column by delimiter into multiple columnsApply the pandas series str. split() function on the “Address” column and pass the delimiter (comma in this case) on which you want to split the column. Also, make sure to pass True to the expand parameter.

How do you split a row in Python?

Series and DataFrame methods define a . explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column. Since you have a list of comma separated strings, split the string on comma to get a list of elements, then call explode on that column.


2 Answers

pandas >= 0.25

Assuming all splittable columns have the same number of comma separated items, you can split on comma and then use Series.explode on each column:

(df.set_index(['order_id', 'order_date'])    .apply(lambda x: x.str.split(',').explode())    .reset_index())                                                        order_id order_date package package_code 0         1  20/5/2018      p1         #111 1         1  20/5/2018      p2         #222 2         1  20/5/2018      p3         #333 3         3  22/5/2018      p4         #444 4         7  23/5/2018      p5         #555 5         7  23/5/2018      p6         #666 

Details

Set the columns not to be touched as the index,

df.set_index(['order_id', 'order_date'])                        package    package_code order_id order_date                           1        20/5/2018   p1,p2,p3  #111,#222,#333 3        22/5/2018         p4            #444 7        23/5/2018      p5,p6       #555,#666 

The next step is a 2-step process: Split on comma to get a column of lists, then call explode to explode the list values into their own rows.

_.apply(lambda x: x.str.split(',').explode())                      package package_code order_id order_date                      1        20/5/2018       p1         #111          20/5/2018       p2         #222          20/5/2018       p3         #333 3        22/5/2018       p4         #444 7        23/5/2018       p5         #555          23/5/2018       p6         #666 

Finally, reset the index.

_.reset_index()     order_id order_date package package_code 0         1  20/5/2018      p1         #111 1         1  20/5/2018      p2         #222 2         1  20/5/2018      p3         #333 3         3  22/5/2018      p4         #444 4         7  23/5/2018      p5         #555 5         7  23/5/2018      p6         #666 

pandas <= 0.24

This should work for any number of columns like this. The essence is a little stack-unstacking magic with str.split.

(df.set_index(['order_date', 'order_id'])    .stack()    .str.split(',', expand=True)    .stack()    .unstack(-2)    .reset_index(-1, drop=True)    .reset_index() )    order_date  order_id package package_code 0  20/5/2018         1      p1         #111 1  20/5/2018         1      p2         #222 2  20/5/2018         1      p3         #333 3  22/5/2018         3      p4         #444 4  23/5/2018         7      p5         #555 5  23/5/2018         7      p6         #666 

There is another performant alternative involving chain, but you'd need to explicitly chain and repeat every column (a bit of a problem with a lot of columns). Choose whatever fits the description of your problem best, as there's no single answer.

Details

First, set the columns that are not to be touched as the index.

df.set_index(['order_date', 'order_id'])                         package    package_code order_date order_id                           20/5/2018  1         p1,p2,p3  #111,#222,#333 22/5/2018  3               p4            #444 23/5/2018  7            p5,p6       #555,#666 

Next, stack the rows.

_.stack()  order_date  order_id               20/5/2018   1         package               p1,p2,p3                       package_code    #111,#222,#333 22/5/2018   3         package                     p4                       package_code              #444 23/5/2018   7         package                  p5,p6                       package_code         #555,#666 dtype: object 

We have a series now. So call str.split on comma.

_.str.split(',', expand=True)                                       0     1     2 order_date order_id                                20/5/2018  1        package         p1    p2    p3                     package_code  #111  #222  #333 22/5/2018  3        package         p4  None  None                     package_code  #444  None  None 23/5/2018  7        package         p5    p6  None                     package_code  #555  #666  None 

We need to get rid of NULL values, so call stack again.

_.stack()  order_date  order_id                  20/5/2018   1         package       0      p1                                     1      p2                                     2      p3                       package_code  0    #111                                     1    #222                                     2    #333 22/5/2018   3         package       0      p4                       package_code  0    #444 23/5/2018   7         package       0      p5                                     1      p6                       package_code  0    #555                                     1    #666 dtype: object 

We're almost there. Now we want the second last level of the index to become our columns, so unstack using unstack(-2) (unstack on the second last level)

_.unstack(-2)                        package package_code order_date order_id                        20/5/2018  1        0      p1         #111                     1      p2         #222                     2      p3         #333 22/5/2018  3        0      p4         #444 23/5/2018  7        0      p5         #555                     1      p6         #666 

Get rid of the superfluous last level using reset_index:

_.reset_index(-1, drop=True)                      package package_code order_date order_id                      20/5/2018  1             p1         #111            1             p2         #222            1             p3         #333 22/5/2018  3             p4         #444 23/5/2018  7             p5         #555            7             p6         #666 

And finally,

_.reset_index()    order_date  order_id package package_code 0  20/5/2018         1      p1         #111 1  20/5/2018         1      p2         #222 2  20/5/2018         1      p3         #333 3  22/5/2018         3      p4         #444 4  23/5/2018         7      p5         #555 5  23/5/2018         7      p6         #666 
like image 184
cs95 Avatar answered Sep 22 '22 14:09

cs95


Here's one way using numpy.repeat and itertools.chain. Conceptually, this is exactly what you want to do: repeat some values, chain others. Recommended for small numbers of columns, otherwise stack based methods may fare better.

import numpy as np from itertools import chain  # return list from series of comma-separated strings def chainer(s):     return list(chain.from_iterable(s.str.split(',')))  # calculate lengths of splits lens = df['package'].str.split(',').map(len)  # create new dataframe, repeating or chaining as appropriate res = pd.DataFrame({'order_id': np.repeat(df['order_id'], lens),                     'order_date': np.repeat(df['order_date'], lens),                     'package': chainer(df['package']),                     'package_code': chainer(df['package_code'])})  print(res)     order_id order_date package package_code 0         1  20/5/2018      p1         #111 0         1  20/5/2018      p2         #222 0         1  20/5/2018      p3         #333 1         3  22/5/2018      p4         #444 2         7  23/5/2018      p5         #555 2         7  23/5/2018      p6         #666 
like image 28
jpp Avatar answered Sep 22 '22 14:09

jpp