Dear power Pandas experts:
I'm trying to implement a function that flattens a DataFrame column whose elements may be lists. For each row where that column holds a list, all the other columns should be duplicated once per list element, and the designated column should hold one value from the list in each resulting row.
The following illustrates my requirements:
input = DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})
   A       B
0  1  [a, b]
1  2       c
expected = DataFrame({'A': [1, 1, 2], 'B': ['a', 'b', 'c']}, index=[0, 0, 1])
   A  B
0  1  a
0  1  b
1  2  c
I feel that there might be an elegant solution/concept for it, but I'm struggling.
Here is my attempt, which does not work yet.
def flattenColumn(df, column):
    '''column is a string of the column's name.
    for each value of the column's element (which might be a list), duplicate the rest of columns at the corresponding row with the (each) value.
    '''
    def duplicate_if_needed(row):
        return concat([concat([row.drop(column, axis = 1), DataFrame({column: each})], axis = 1) for each in row[column][0]])
    return df.groupby(df.index).transform(duplicate_if_needed)
In recognition of alko's help, here is my trivial generalization of the solution to deal with more than 2 columns in a dataframe:
import pandas

def flattenColumn(input, column):
    '''
    column is a string of the column's name.
    for each value of the column's element (which might be a list),
    duplicate the rest of columns at the corresponding row with the (each) value.
    '''
    # Build (original index, single value) pairs, one per list element.
    column_flat = pandas.DataFrame(
        [
            [i, c_flattened]
            for i, y in input[column].apply(list).items()  # iteritems() was removed in pandas 2.0
            for c_flattened in y
        ],
        columns=['I', column]
    )
    column_flat = column_flat.set_index('I')
    return (
        input.drop(columns=column)  # the positional axis argument was removed in pandas 2.0
        .merge(column_flat, left_index=True, right_index=True)
    )
The only limitation at the moment is that the column order changes: the flattened column ends up rightmost instead of in its original position. It should be feasible to fix.
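One possible way to restore the original column order is to reindex the merged result by the input's columns. The sketch below is only an illustration: the function name flatten_column_keep_order and the extra column C are made up for the example, and it assumes the input frame has a unique index and no column already named I.

```python
import pandas as pd

def flatten_column_keep_order(df, column):
    # Hypothetical variant of flattenColumn: same merge-based approach,
    # followed by a column reorder. Assumes a unique index on df.
    column_flat = pd.DataFrame(
        [
            [i, value]
            for i, values in df[column].apply(list).items()
            for value in values
        ],
        columns=['I', column],
    ).set_index('I')
    merged = df.drop(columns=column).merge(
        column_flat, left_index=True, right_index=True
    )
    # Select the columns in the order they had in the input frame.
    return merged[df.columns]

example = pd.DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c'], 'C': [10, 20]})
flat = flatten_column_keep_order(example, 'B')
print(flat)
```

Here B stays between A and C rather than moving to the right edge.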
To flatten MultiIndex column labels (a different problem from flattening a column of lists), use get_level_values() or to_flat_index() on the columns.
Another way to flatten a pandas DataFrame is through NumPy. Convert the DataFrame to an array with to_numpy(), then call the array's flatten() method (numpy.ndarray.flatten). Note that this collapses the whole frame into one dimension; it does not expand a column of lists into rows.
flatten() returns a copy of the array collapsed into one dimension. Its order argument controls whether to flatten in C (row-major) order, Fortran (column-major) order, or to preserve the C/Fortran ordering of the array. The default is 'C'.
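As a small illustration of the two main orderings, here is a sketch on a made-up numeric frame (the frame and the variable names are assumptions for the example, not from the thread):

```python
import pandas as pd

# Hypothetical 2x2 frame, used only to show the element order.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

flat_c = df.to_numpy().flatten()           # row-major: row 0, then row 1
flat_f = df.to_numpy().flatten(order='F')  # column-major: column A, then column B

print(flat_c)  # [1 3 2 4]
print(flat_f)  # [1 2 3 4]
```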
You can use df.explode(), available since pandas 0.25. See the pandas documentation for DataFrame.explode.
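Applied to the frame from the question, a minimal sketch looks like this (the variable names df and result are just for the example):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})

# Each list element gets its own row; scalar entries such as 'c' are kept as-is.
# The original index labels are repeated, and the column order is unchanged.
result = df.explode('B')
print(result)
```

If a fresh index is preferred, explode also accepts ignore_index=True in pandas 1.1 and later.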
I guess the easiest way to flatten a list of lists is pure Python code, as this object type is not well suited to pandas or numpy. So you can do it with, for example,
>>> b_flat = pd.DataFrame([[i, x]
...                        for i, y in input['B'].apply(list).items()
...                        for x in y], columns=list('IB'))
>>> b_flat = b_flat.set_index('I')
Having B column flattened, you can merge it back:
>>> input[['A']].merge(b_flat, left_index=True, right_index=True)
   A  B
0  1  a
0  1  b
1  2  c
[3 rows x 2 columns]
If you want the index to be recreated as a fresh 0, 1, 2 range instead of repeating the original labels, you can add .reset_index(drop=True) to the last command.
It is surprising that there isn't a more "native" solution. Putting the answer from @alko into a function is easy enough:
import pandas as pd

def unnest(df, col, reset_index=False):
    # One (original index, single value) row per list element.
    col_flat = pd.DataFrame([[i, x]
                             for i, y in df[col].apply(list).items()
                             for x in y], columns=['I', col])
    col_flat = col_flat.set_index('I')
    df = df.drop(columns=col)
    df = df.merge(col_flat, left_index=True, right_index=True)
    if reset_index:
        df = df.reset_index(drop=True)
    return df
Then simply
input = pd.DataFrame({'A': [1, 2], 'B': [['a', 'b'], 'c']})
expected = unnest(input, 'B')
I guess it would be nice to allow unnesting of multiple columns at once and to handle the possibility of a nested column named I, which would break this code.
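One way to sidestep the helper column altogether is to build the flattened values as a Series that carries the original index, then join on the index. The following is only a sketch under stated assumptions: the name unnest_no_helper is made up, the input index is assumed to be unique, and it handles a single column (it does not address unnesting several columns at once). It also wraps non-list values in a list explicitly instead of calling list() on them, so a multi-character string is not split into characters.

```python
import pandas as pd

def unnest_no_helper(df, col):
    # Hypothetical alternative to unnest(); assumes df has a unique index.
    # Wrap scalars so every entry is a list.
    values = df[col].apply(lambda v: v if isinstance(v, list) else [v])
    flat = pd.Series(
        [x for y in values for x in y],
        # Repeat each original index label once per list element.
        index=[i for i, y in values.items() for _ in y],
        name=col,
    )
    # Join on the index, then restore the original column order.
    return df.drop(columns=col).join(flat)[df.columns]

# A frame that deliberately contains a column named 'I'.
tricky = pd.DataFrame({'I': [1, 2], 'B': [['a', 'b'], 'cd']})
unnested = unnest_no_helper(tricky, 'B')
print(unnested)
```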