I have the following data:
study_id list_value
1 ['aaa', 'bbb']
1 ['aaa']
1 ['ccc']
2 ['ddd', 'eee', 'aaa']
2 np.NaN
2 ['zzz', 'aaa', 'bbb']
How can I convert it into something like this?
study_id list_value
1 ['aaa', 'bbb', 'ccc']
1 ['aaa', 'bbb', 'ccc']
1 ['aaa', 'bbb', 'ccc']
2 ['aaa', 'bbb', 'ddd', 'eee', 'zzz']
2 ['aaa', 'bbb', 'ddd', 'eee', 'zzz']
2 ['aaa', 'bbb', 'ddd', 'eee', 'zzz'] # order of list item doesn't matter
itertools.chain with GroupBy.transform
First, get rid of NaNs inside your column using a list comprehension (messy, I know, but this is the fastest way to do it).
df['list_value'] = [
    [] if not isinstance(x, list) else x for x in df.list_value
]
Next, group on study_id, flatten your lists inside GroupBy.transform, and extract unique values using a set.
from itertools import chain
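df['list_value'] = df.groupby('study_id').list_value.transform(
    # Repeat the merged list once per row so the result length matches the
    # group; a one-element list may not be broadcast in every pandas version.
    lambda x: [list(set(chain.from_iterable(x)))] * len(x)
)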
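To see what the lambda computes for a single group, here is a minimal pure-Python sketch. `group_rows` is a hypothetical stand-in for the cleaned lists of one study_id, not something taken from the DataFrame.

```python
from itertools import chain

# Hypothetical stand-in for the cleaned list_value entries of one group
# (the empty list is what a NaN row becomes after the first step)
group_rows = [['aaa', 'bbb'], ['aaa'], ['ccc'], []]

# chain.from_iterable flattens one level; set() drops the duplicates
merged = list(set(chain.from_iterable(group_rows)))

# Set order is arbitrary, so sort only if a stable order is needed
print(sorted(merged))  # ['aaa', 'bbb', 'ccc']
```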
As a last step, if you plan to mutate individual list items, you may want to do
df['list_value'] = [x[:] for x in df['list_value']]
Otherwise, every row in a group refers to the same list object, so a change to one list will be reflected across all rows in that group.
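The sharing can be reproduced without pandas. A small sketch, where `shared` and `rows` are illustrative names only:

```python
shared = ['aaa', 'bbb', 'ccc']

# Three references to one list object, like the rows of one group
rows = [shared] * 3
rows[0].append('zzz')
print(rows[1])  # ['aaa', 'bbb', 'ccc', 'zzz']

# x[:] makes a shallow copy per row, so the rows become independent
rows = [x[:] for x in rows]
rows[0].append('yyy')
print(rows[1])  # ['aaa', 'bbb', 'ccc', 'zzz']
```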
df
study_id list_value
0 1 [aaa, ccc, bbb]
1 1 [aaa, ccc, bbb]
2 1 [aaa, ccc, bbb]
3 2 [bbb, ddd, eee, aaa, zzz]
4 2 [bbb, ddd, eee, aaa, zzz]
5 2 [bbb, ddd, eee, aaa, zzz]
defaultdict
from collections import defaultdict
d = defaultdict(set)
for t in df.dropna(subset=['list_value']).itertuples():
    d[t.study_id] |= set(t.list_value)
df.assign(list_value=df.study_id.map(pd.Series(d).apply(sorted)))
study_id list_value
0 1 [a, b, c]
1 1 [a, b, c]
2 1 [a, b, c]
3 2 [a, b, d, e, z]
4 2 [a, b, d, e, z]
5 2 [a, b, d, e, z]
np.unique and other trickiness
Mind you, the results are ndarrays.
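df.assign(
    list_value=df.study_id.map(
        # Series.sum(level=0) was removed in pandas 2.0;
        # groupby(level=0).sum() is the equivalent spelling
        df.set_index('study_id').list_value.dropna()
          .groupby(level=0).sum().apply(np.unique)
    )
)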
study_id list_value
0 1 [a, b, c]
1 1 [a, b, c]
2 1 [a, b, c]
3 2 [a, b, d, e, z]
4 2 [a, b, d, e, z]
5 2 [a, b, d, e, z]
We need to apply sorted as well to get all the way there, since it turns each ndarray back into a list:
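df.assign(
    list_value=df.study_id.map(
        # Series.sum(level=0) was removed in pandas 2.0;
        # groupby(level=0).sum() is the equivalent spelling
        df.set_index('study_id').list_value.dropna()
          .groupby(level=0).sum().apply(np.unique).apply(sorted)
    )
)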
df.assign(
    list_value=df.study_id.map(
        df.list_value.str.join('|').groupby(df.study_id).apply(
            lambda x: sorted(set('|'.join(x.dropna()).split('|')))
        )
    )
)
study_id list_value
0 1 [a, b, c]
1 1 [a, b, c]
2 1 [a, b, c]
3 2 [a, b, d, e, z]
4 2 [a, b, d, e, z]
5 2 [a, b, d, e, z]
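One caveat with the string-joining variant, shown as a pure-Python sketch: it assumes that no value contains the '|' separator. The sample values below are made up for illustration.

```python
# Made-up values; the second one contains the separator itself
values = ['a', 'b|c']

joined = '|'.join(values)
result = sorted(set(joined.split('|')))

# Two input values come back as three, because 'b|c' was split apart
print(result)  # ['a', 'b', 'c']
```

If your data may contain the separator, pick a different one or use one of the set-based approaches instead.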
# Setup used for the short-letter examples above
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    study_id=[1, 1, 1, 2, 2, 2],
    list_value=[['a', 'b'], ['a'], ['c'], ['d', 'e', 'a'], np.nan, ['z', 'a', 'b']]
), columns=['study_id', 'list_value'])