Flatten a column with value of type list while duplicating the other column's value accordingly in Pandas

Dear power Pandas experts:

I'm trying to implement a function that flattens a DataFrame column whose elements may be lists. For each row where the designated column holds a list, I want the row duplicated once per list element: every other column keeps its value, and the designated column takes one element of the list.

The following illustrates my requirements:

input = DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})
     A   B
0    1   [a, b]
1    2   c

expected = DataFrame({'A': [1, 1, 2], 'B': ['a', 'b', 'c']}, index=[0, 0, 1])

     A   B
0    1   a
0    1   b
1    2   c

I feel that there might be an elegant solution/concept for it, but I'm struggling.

Here is my attempt, which does not work yet.

def flattenColumn(df, column):
    '''column is a string of the column's name.
    for each value of the column's element (which might be a list), duplicate the rest of the columns at the corresponding row with the (each) value.
    '''
    def duplicate_if_needed(row):
        return concat([concat([row.drop(column, axis = 1), DataFrame({column: each})], axis = 1) for each in row[column][0]])
    return df.groupby(df.index).transform(duplicate_if_needed)

In recognition of alko's help, here is my trivial generalization of the solution to deal with more than 2 columns in a dataframe:

def flattenColumn(input, column):
    '''
    column is a string of the column's name.
    for each value of the column's element (which might be a list),
    duplicate the rest of columns at the corresponding row with the (each) value.
    '''
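    # Build one (index label, element) row per element of each cell.
    column_flat = pandas.DataFrame(
        [
            [i, c_flattened]
            # items() yields (index label, cell) pairs.
            for i, y in input[column].apply(list).items()
            for c_flattened in y
        ],
        columns=['I', column]
    )
    column_flat = column_flat.set_index('I')
    return (
        # Drop the nested column, then join the flattened values back on the index.
        input.drop(columns=column)
             .merge(column_flat, left_index=True, right_index=True)
    )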

The only limitation at the moment is that the column order changes: the flattened column ends up rightmost instead of in its original position. That should be feasible to fix.
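One way to restore the original column order is to reselect the columns of the merged result using the input's column list. The following is a minimal sketch, not the asker's final code: the name flatten_column_keep_order is hypothetical, and wrapping non-list cells with isinstance (instead of apply(list)) is an assumption made here so that multi-character strings are not split into characters.

```python
import pandas as pd

def flatten_column_keep_order(df, column):
    """Hypothetical variant of flattenColumn that preserves column order."""
    pairs = [
        (i, value)
        for i, cell in df[column].items()
        # Assumption: wrap scalars so a plain string stays in one piece.
        for value in (cell if isinstance(cell, list) else [cell])
    ]
    column_flat = pd.DataFrame(pairs, columns=['I', column]).set_index('I')
    merged = df.drop(columns=column).merge(
        column_flat, left_index=True, right_index=True
    )
    # Reselect the columns in the order they had in the input.
    return merged[list(df.columns)]

# The nested column comes first here, so the reordering is visible.
df = pd.DataFrame({'B': [['a', 'b'], 'c'], 'A': [1, 2]})
result = flatten_column_keep_order(df, 'B')
print(result)
```

Like the original, this sketch still relies on a temporary column named I, so it would break if the frame already had a column with that name.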

asked Jan 16 '14 11:01

Yu Shen

3 Answers

You can use df.explode(), which is available in pandas 0.25 and later.
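As a short sketch of how that applies to the frame from the question (assuming pandas 0.25 or later, and using df rather than input as the variable name):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})

# explode() gives each element of a list-like cell its own row and repeats
# the index label and the other columns; scalar cells such as 'c' are kept as-is.
flat = df.explode('B')
print(flat)
```

The repeated index labels match the expected frame in the question; chaining .reset_index(drop=True) would renumber the rows instead.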

answered Oct 21 '22 13:10

Babak Badkoubeh

I guess the easiest way to flatten a list of lists is pure Python code, as this object type is not well suited to pandas or numpy. You can do it with, for example:

>>> b_flat = pd.DataFrame([[i, x]
...               for i, y in input['B'].apply(list).items()
...                    for x in y], columns=list('IB'))
>>> b_flat = b_flat.set_index('I')

With column B flattened, you can merge it back:

>>> input[['A']].merge(b_flat, left_index=True, right_index=True)
   A  B
0  1  a
0  1  b
1  2  c

[3 rows x 2 columns]

If you want the index to be recreated, as in your expected result, you can add .reset_index(drop=True) to the last command.

answered Oct 21 '22 14:10

alko

It is surprising that there isn't a more "native" solution. Putting the answer from @alko into a function is easy enough:

def unnest(df, col, reset_index=False):
    import pandas as pd
    # One (index label, element) pair per list element.
    col_flat = pd.DataFrame([[i, x]
                       for i, y in df[col].apply(list).items()
                           for x in y], columns=['I', col])
    col_flat = col_flat.set_index('I')
    df = df.drop(columns=col)
    df = df.merge(col_flat, left_index=True, right_index=True)
    if reset_index:
        df = df.reset_index(drop=True)
    return df

Then simply:

input = pd.DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})
expected = unnest(input, 'B')

I guess it would be nice to allow unnesting of multiple columns at once and to handle the possibility of a nested column named I, which would break this code.
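A rough sketch of both ideas follows. It repeats rows by position instead of merging on a helper column, so no temporary column named I is needed. The name unnest_many is hypothetical, and two behaviours are assumptions rather than anything from the answers above: non-list cells are treated as one-element lists, and columns are unnested one after another, which yields every combination of their elements within a row.

```python
import pandas as pd

def unnest_many(df, cols):
    """Hypothetical helper: unnest each column in `cols`, one after another."""
    out = df
    for col in cols:
        # Assumption: treat any non-list cell as a one-element list.
        cells = [c if isinstance(c, list) else [c] for c in out[col]]
        # Repeat each row position once per element of its cell.
        positions = [pos for pos, cell in enumerate(cells) for _ in cell]
        flat_values = [value for cell in cells for value in cell]
        # Positional selection keeps the original index labels on repeated rows.
        out = out.iloc[positions].copy()
        out[col] = flat_values
    return out

# A nested column literally named 'I' is handled like any other column.
df = pd.DataFrame({'A': [1, 2],
                   'I': [['a', 'b'], 'c'],
                   'J': [['x'], ['y', 'z']]})
result = unnest_many(df, ['I', 'J'])
print(result)
```

Rows whose cell is an empty list disappear from the result, which mirrors what the merge-based version does.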

answered Oct 21 '22 13:10

Ian Gow

Tags:

python

pandas

dataframe

data-manipulation