Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parse a dataframe column by comma and pivot - python

I am trying to parse out a column by the comma (also stripping the white space) and then pivoting all of the origin/destination combinations into new rows. Here is a sample of the data:

Origin     Destination     Weight
PVG        AMS, FRA        10,000
CAN, XMN   LAX, ORD        25,000

I am having trouble reproducing the dataframe above using pd.read_clipboard, so here is the dataframe code:

df = pd.DataFrame({'Origin': ['PVG', 'CAN, XMN'], 
                   'Destination': ['AMS, FRA', 'LAX, ORD'],
                   'Weight': [10000, 25000]})

The desired output would be:

Origin     Destination     Weight
PVG        AMS             10,000
PVG        FRA             10,000
CAN        LAX             25,000   
CAN        ORD             25,000
XMN        LAX             25,000
XMN        ORD             25,000   

I have been trying to use:

df['Origin'].str.split(',', expand = True)

I had tried doing this for both the origin and destination, which works for parsing the strings into separate columns. I am struggling to then create all of the possible combinations into separate rows (I have tried using pivot_table and melt with no luck).

like image 662
Brian Avatar asked Jun 11 '18 02:06

Brian


2 Answers

Use itertools.product with list comprehenion, only first create lists with applymap:

from  itertools import product

df1 = df.applymap(lambda x: x.split(', ') if isinstance (x, str) else [x])
df2 = pd.DataFrame([j for i in df1.values for j in product(*i)], columns=df.columns)
print (df2)
  Origin Destination  Weight
0    PVG         AMS   10000
1    PVG         FRA   10000
2    CAN         LAX   25000
3    CAN         ORD   25000
4    XMN         LAX   25000
5    XMN         ORD   25000
like image 156
jezrael Avatar answered Sep 18 '22 21:09

jezrael


You should be using itertools.product here.

from itertools import product
df_dict = {
    x[-1]: pd.DataFrame(list(
        product(x[0].split(', '), x[1].split(', ')))
    ) for x in df.values.tolist()
}

df = pd.concat(df_dict).reset_index(level=-1, drop=True).reset_index() 
df.columns = ['Weight', 'Destination', 'Origin']

df 
   Weight Destination Origin
0   10000         AMS    PVG
1   10000         FRA    PVG
2   25000         LAX    CAN
3   25000         LAX    XMN
4   25000         ORD    CAN
5   25000         ORD    XMN
like image 28
cs95 Avatar answered Sep 22 '22 21:09

cs95