Flatten a column with value of type list while duplicating the other column's value accordingly in Pandas

Dear power Pandas experts:

I'm trying to implement a function that flattens a DataFrame column whose elements may be lists. For each row where the designated column holds a list, the values of all other columns should be duplicated once per list element, and the designated column should hold a single element of the list in each resulting row.

The following illustrates my requirements:

input = DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})
     A   B
0    1   [a, b]
1    2   c

expected = DataFrame({'A': [1, 1, 2], 'B': ['a', 'b', 'c']}, index=[0, 0, 1])

     A   B
0    1   a
0    1   b
1    2   c

I feel that there might be an elegant solution/concept for it, but I'm struggling.

Here is my attempt, which does not work yet.

def flattenColumn(df, column):
    '''column is a string of the column's name.
    for each value of the column's element (which might be a list), duplicate the rest of columns at the corresponding row with the (each) value.
    '''
    def duplicate_if_needed(row):
        return concat([concat([row.drop(column, axis = 1), DataFrame({column: each})], axis = 1) for each in row[column][0]])
    return df.groupby(df.index).transform(duplicate_if_needed)

In recognition of alko's help, here is my trivial generalization of the solution to deal with more than 2 columns in a dataframe:

import pandas

def flattenColumn(input, column):
    '''
    column is a string of the column's name.
    for each value of the column's element (which might be a list),
    duplicate the rest of columns at the corresponding row with the (each) value.
    '''
    column_flat = pandas.DataFrame(
        [
            [i, c_flattened]
            # .items() replaces .iteritems(), which was removed in pandas 2.0
            for i, y in input[column].apply(list).items()
            for c_flattened in y
        ],
        columns=['I', column]
    )
    column_flat = column_flat.set_index('I')
    return (
        # the axis can no longer be passed positionally in pandas 2.0+
        input.drop(columns=column)
             .merge(column_flat, left_index=True, right_index=True)
    )

The only limitation at the moment is that the column order changes: the flattened column ends up rightmost rather than in its original position. This should be feasible to fix.
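
One possible way to restore the original column order is to reselect the columns after the merge. The sketch below is only an illustration: the helper name flatten_column_keep_order and the extra column C are made up for the example, and it wraps non-list values instead of calling list() on them, which is a deliberate deviation so that multi-character strings are not split into characters. It still relies on the temporary column name I being unused.

```python
import pandas as pd

def flatten_column_keep_order(df, column):
    # Hypothetical helper: same merge-based idea as above, plus a final
    # column selection that restores the original column order.
    rows = [
        (i, item)
        for i, value in df[column].items()
        # Wrap non-list values so a string such as 'cd' stays whole.
        for item in (value if isinstance(value, list) else [value])
    ]
    column_flat = pd.DataFrame(rows, columns=['I', column]).set_index('I')
    merged = df.drop(columns=column).merge(
        column_flat, left_index=True, right_index=True
    )
    # Reselect the columns in the order of the input frame.
    return merged[list(df.columns)]

df = pd.DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c'], 'C': [10, 20]})
result = flatten_column_keep_order(df, 'B')
```

Here result should have its columns in the order A, B, C, with B flattened in place.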

Yu Shen asked Jan 16 '14 11:01


3 Answers

You can use df.explode() (available since pandas 0.25.0). See the pandas documentation for DataFrame.explode.
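
A minimal sketch of that approach, using the frame from the question and assuming pandas 0.25 or later; the variable names are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})

# explode() gives each list element its own row and repeats the
# other columns; scalars such as 'c' are left unchanged.
flat = df.explode('B')

# The original index labels are repeated (0, 0, 1). Call
# reset_index(drop=True) if a fresh 0..n-1 index is wanted.
renumbered = flat.reset_index(drop=True)
```

The flattened column keeps its original position, so no column reordering is needed with this approach.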

Babak Badkoubeh answered Oct 21 '22 13:10


I guess the easiest way to flatten a list of lists is pure Python code, as this object type is not well suited to pandas or numpy. You can do it with, for example:

>>> # .items() replaces .iteritems(), which was removed in pandas 2.0
>>> b_flat = pd.DataFrame([[i, x]
...               for i, y in input['B'].apply(list).items()
...                    for x in y], columns=list('IB'))
>>> b_flat = b_flat.set_index('I')

Having B column flattened, you can merge it back:

>>> input[['A']].merge(b_flat, left_index=True, right_index=True)
   A  B
0  1  a
0  1  b
1  2  c

[3 rows x 2 columns]

If you want the index to be recreated, as in your expected result, you can add .reset_index(drop=True) to the last command.

alko answered Oct 21 '22 14:10


It is surprising that there isn't a more "native" solution. Putting the answer from @alko into a function is easy enough:

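import pandas as pd

def unnest(df, col, reset_index=False):
    # Build (index label, element) pairs, one per list element.
    # .items() replaces .iteritems(), which was removed in pandas 2.0.
    col_flat = pd.DataFrame([[i, x]
                       for i, y in df[col].apply(list).items()
                           for x in y], columns=['I', col])
    col_flat = col_flat.set_index('I')
    # The axis can no longer be passed positionally in pandas 2.0+.
    df = df.drop(columns=col)
    df = df.merge(col_flat, left_index=True, right_index=True)
    if reset_index:
        df = df.reset_index(drop=True)
    return df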

Then simply

input = pd.DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})
expected = unnest(input, 'B')

I guess it would be nice to allow unnesting of multiple columns at once and to handle the possibility of a nested column named I, which would break this code.
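
One way the clash with a column named I might be avoided is to skip the temporary column and the merge entirely and repeat rows by position instead. The following is only a sketch under that assumption; the name unnest_no_helper is hypothetical, it handles a single column, and rows whose list is empty are dropped.

```python
import pandas as pd

def unnest_no_helper(df, col, reset_index=False):
    # Hypothetical variant: repeat each row once per list element
    # instead of merging on a temporary 'I' column.
    values = [v if isinstance(v, list) else [v] for v in df[col]]
    # Row positions, each repeated len(list) times, e.g. [0, 0, 1].
    positions = [pos for pos, v in enumerate(values) for _ in v]
    out = df.iloc[positions].copy()
    # Overwrite the nested column with the flattened elements.
    out[col] = [x for v in values for x in v]
    if reset_index:
        out = out.reset_index(drop=True)
    return out

# A frame that really has a column named 'I'.
df = pd.DataFrame({'I': [1, 2], 'B': [['a', 'b'], 'c']})
flat = unnest_no_helper(df, 'B')
```

Because the nested column is overwritten in place, the column order is also preserved.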

Ian Gow answered Oct 21 '22 13:10