
Pandas: Join if value of df1 column is in list of df2 column

Suppose we have two Pandas DataFrames as follows:

import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'b', 'c']})
df1
    id
0   a
1   b
2   c

df2 = pd.DataFrame({'ids': [['b','c'], ['a', 'b'], ['a', 'z']], 
                    'info': ['asdf', 'zxcv', 'sdfg']})
df2
    ids     info
0   [b, c]  asdf
1   [a, b]  zxcv
2   [a, z]  sdfg

How do I join/merge the rows of df1 with df2 where df1.id is in df2.ids?

In other words, how do I achieve the following:

df3
   id   ids     info
0  a    [a, b]  zxcv
1  a    [a, z]  sdfg
2  b    [b, c]  asdf
3  b    [a, b]  zxcv
4  c    [b, c]  asdf

And also a version of the above aggregated on id, like so:

df3
   id   ids               info
0  a    [[a, b], [a, z]]  [zxcv, sdfg]
1  b    [[b, c], [a, b]]  [asdf, zxcv]
2  c    [[b, c]]          [asdf]

I tried the following:

df1.merge(df2, how = 'left', left_on = 'id', right_on = 'ids')
TypeError: unhashable type: 'list'

df1.id.isin(df2.ids)
TypeError: unhashable type: 'list'
user2205916 asked Dec 27 '18 09:12




1 Answer

Using stack, merge and groupby.agg:

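# Expand each list in ids into one row per element, then join back to df2 and df1.
df = (df2.set_index('info').ids.apply(pd.Series)
         .stack().reset_index(0, name='id')  # long form: one (info, id) pair per row
         .merge(df2)                         # re-attach the ids lists; joins on info, so info must be unique
         .merge(df1, how='right')            # right join: keep every id in df1, drop others (e.g. 'z')
         .sort_values('id')
         .reset_index(drop=True))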

print(df)
   info id     ids
0  zxcv  a  [a, b]
1  sdfg  a  [a, z]
2  asdf  b  [b, c]
3  zxcv  b  [a, b]
4  asdf  c  [b, c]

For aggregation use:

df = df.groupby('id', as_index=False).agg(list)

print(df)
  id          info               ids
0  a  [zxcv, sdfg]  [[a, b], [a, z]]
1  b  [asdf, zxcv]  [[b, c], [a, b]]
2  c        [asdf]          [[b, c]]
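
As an alternative, here is a minimal sketch using DataFrame.explode, assuming pandas 0.25 or newer. It is not the approach used above, just another way to reach the same rows. The names exploded, flat and agg are made up for illustration. Because explode turns each list element into its own row, the join key becomes a plain string and an ordinary merge works. Unlike the chain above, it does not rely on info being unique.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'b', 'c']})
df2 = pd.DataFrame({'ids': [['b', 'c'], ['a', 'b'], ['a', 'z']],
                    'info': ['asdf', 'zxcv', 'sdfg']})

# Copy the list column into a new 'id' column, then explode that copy so
# each list element gets its own row. The original lists stay in 'ids'.
exploded = df2.assign(id=df2['ids']).explode('id')

# 'id' now holds hashable strings, so a regular merge no longer raises.
# A left join keeps every id from df1; ids only in df2 (such as 'z') drop out.
flat = df1.merge(exploded, on='id', how='left')

# Aggregated version: one row per id, with the matches collected into lists.
agg = flat.groupby('id', as_index=False).agg(list)

print(flat)
print(agg)
```

With the sample data, flat should hold the same five id/ids/info combinations as the result above. If an id in df1 matched no list at all, the left join would keep it with NaN in ids and info.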
Space Impact answered Sep 29 '22 03:09