Pandas column reformatting

Tags: python, pandas

Is there a quick way to produce the output below?

Input:

Code Items
123 eq-hk
456 ca-eu; tp-lbe
789 ca-us
321 go-ch
654 ca-au; go-au
987 go-jp
147 co-ml; go-ml
258 ca-us
369 ca-us; ca-my
741 ca-us
852 ca-eu
963 ca-ml; co-ml; go-ml
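
To reproduce the input above, the frame can be built along these lines. This is only a sketch: it assumes the codes are integers and that items are separated by a semicolon followed by a single space, as the table suggests.

```python
import pandas as pd

# Sample frame mirroring the Input table (assumed dtypes: int Code, str Items).
df = pd.DataFrame({
    "Code": [123, 456, 789, 321, 654, 987, 147, 258, 369, 741, 852, 963],
    "Items": [
        "eq-hk", "ca-eu; tp-lbe", "ca-us", "go-ch", "ca-au; go-au", "go-jp",
        "co-ml; go-ml", "ca-us", "ca-us; ca-my", "ca-us", "ca-eu",
        "ca-ml; co-ml; go-ml",
    ],
})
print(df)
```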

Output:

Code  eq  ca     go  co  tp
123   hk
456       eu             lbe
789       us
321              ch
654       au     au
987              jp
147              ml  ml
258       us
369       us,my
741       us
852       eu
963       ml     ml  ml

I keep ending up with loops and very ugly code to make this work. Is there an elegant way to achieve this?

Thank you!

asked Nov 05 '18 03:11

spiff

2 Answers

This is a little bit complicated:

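(df.set_index('Code')
   .Items.str.split('; ', expand=True)   # items are separated by "; " (semicolon + space)
   .stack()
   .str.split('-', expand=True)
   .set_index(0, append=True)[1]
   .unstack()
   .fillna('')
   .groupby(level=0).sum())              # .sum(level=0) was removed in pandas 2.0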

0       ca  co  eq  go   tp
Code                       
123             hk         
147         ml      ml     
258     us                 
321                 ch     
369   usmy                 
456     eu              lbe
654     au          au     
741     us                 
789     us                 
852     eu                 
963     ml  ml      ml     
987                 jp     


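# Split the Items column on the separator to unnest it, then stack into long form.
# Split each item on '-' and move the key (column 0) into the index.
# Unstacking gives one column per key; summing by Code collapses the rows.
# Note that sum concatenates repeated keys without a separator ('usmy').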
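
The chain above can be easier to follow when broken into named steps. The sketch below does that on a three-row sample and joins repeated keys with a comma instead of summing, which gives "us,my" as in the requested output. The intermediate variable names are illustrative, and the explicit `dropna()` is there because whether `stack()` drops missing entries depends on the pandas version.

```python
import pandas as pd

# Small sample with a repeated key (369 has "ca" twice); "; " separator assumed.
df = pd.DataFrame({
    "Code": [456, 369, 963],
    "Items": ["ca-eu; tp-lbe", "ca-us; ca-my", "ca-ml; co-ml; go-ml"],
})

# Step 1: one column per item position, indexed by Code.
items = df.set_index("Code")["Items"].str.split("; ", expand=True)

# Step 2: long form, one row per (Code, position); drop the padding entries.
long_form = items.stack().dropna()

# Step 3: split "key-value" into two columns (0 = key, 1 = value).
pairs = long_form.str.split("-", expand=True)

# Step 4: move the key into the index and pivot it into columns.
wide = pairs.set_index(0, append=True)[1].unstack().fillna("")

# Step 5: collapse the position level, joining non-empty values with a comma.
result = wide.groupby(level=0).agg(lambda s: ",".join(v for v in s if v))
print(result)
```

Compared with summing, the comma join costs a Python-level function call per group, so it is likely slower on large frames.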

answered Oct 21 '22 20:10

BENY

List comprehensions work better (read: much faster) for string problems like this one, which require multiple levels of splitting.

df2 = pd.DataFrame([
         dict(y.split('-') for y in x.split('; ')) 
           for x in df.Items]).fillna('')
df2.insert(0, 'Code', df.Code)

print(df2)
    Code  ca  co  eq  go   tp
0    123          hk         
1    456  eu              lbe
2    789  us                 
3    321              ch     
4    654  au          au     
5    987              jp     
6    147      ml      ml     
7    258  us                 
8    369  my                    # Should be "us,my"... see below.
9    741  us                 
10   852  eu                 
11   963  ml  ml      ml

This does not handle rows in which the same key appears more than once: the dict keeps only the last value for a repeated key (see row 369 above). For that, a slightly more involved solution is needed.
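
The overwrite is plain `dict` behaviour rather than anything pandas-specific, as this short illustration on the row for code 369 shows:

```python
# A dict holds one value per key, so a repeated key is silently overwritten.
row = "ca-us; ca-my"
pairs = [item.split("-") for item in row.split("; ")]
as_dict = dict(pairs)

print(pairs)    # [['ca', 'us'], ['ca', 'my']]
print(as_dict)  # {'ca': 'my'}
```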

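from itertools import chain

import pandas as pd

v = [x.split('; ') for x in df.Items]
X = pd.Series(df.Code.values.repeat([len(x) for x in v]))
Y = pd.DataFrame([x.split('-') for x in chain.from_iterable(v)])

df2 = pd.concat([X, Y], axis=1, ignore_index=True)
# The snippet indexes on a column 3 that is never defined; assumed here to be a
# counter over repeated (Code, key) pairs, which keeps the index unique for unstack.
df2[3] = df2.groupby([0, 1]).cumcount()

(df2.set_index([0, 1, 3])[2]
    .unstack(1)
    .fillna('')
    .groupby(level=0)
    .agg(lambda x: ','.join(x).strip(',')))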

1       ca  co  eq  go   tp
0                          
123             hk         
147         ml      ml     
258     us                 
321                 ch     
369  us,my                 
456     eu              lbe
654     au          au     
741     us                 
789     us                 
852     eu                 
963     ml  ml      ml     
987                 jp
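
If the original row order matters (the `groupby` above sorts by code), one possible alternative stays with the list-comprehension approach and collects repeated keys in plain Python. This is a sketch, not part of the answer's method: `parse_items` is a hypothetical helper, and it assumes a "; " separator and a single "-" between key and value.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({
    "Code": [123, 456, 369, 963],
    "Items": ["eq-hk", "ca-eu; tp-lbe", "ca-us; ca-my", "ca-ml; co-ml; go-ml"],
})


def parse_items(items):
    """Hypothetical helper: 'ca-us; ca-my' -> {'ca': 'us,my'}."""
    grouped = defaultdict(list)
    for item in items.split("; "):
        key, value = item.split("-", 1)  # split on the first '-' only
        grouped[key].append(value)
    return {key: ",".join(values) for key, values in grouped.items()}


# One dict per row; keys missing from a row become NaN, then ''.
wide = pd.DataFrame([parse_items(x) for x in df["Items"]]).fillna("")
wide.insert(0, "Code", df["Code"])
print(wide)
```

Rows stay in input order, and a repeated key such as `ca` in code 369 becomes "us,my".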

answered Oct 21 '22 22:10

cs95
