I have a pandas dataframe df:
ID words
1 word1
1 word2
1 word3
2 word4
2 word5
3 word6
3 word7
3 word8
3 word9
I want to produce another dataframe that would generate all pairs of words in each group. So the result for the above would be:
ID wordA wordB
1 word1 word2
1 word1 word3
1 word2 word3
2 word4 word5
3 word6 word7
3 word6 word8
3 word6 word9
3 word7 word8
3 word7 word9
3 word8 word9
I know that I can use df.groupby('ID')['words'] to get the words within each ID.
I also know that I can use
iterable = ['word1','word2','word3']
list(itertools.combinations(iterable, 2))
to get all possible pairwise combinations. However, I'm a little lost as to the best way to generate a resulting dataframe as shown above.
You can group DataFrame rows into a list by calling pandas.DataFrame.groupby() on the column of interest, selecting the column you want as a list from each group, and then using Series.apply(list) to get the list for every group.
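As a minimal sketch of that grouping step, assuming the sample frame from the question (the variable name words_per_id is only illustrative):

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "words": [f"word{i}" for i in range(1, 10)],
})

# One Python list of words per ID, returned as a Series indexed by ID.
words_per_id = df.groupby("ID")["words"].apply(list)
print(words_per_id)
```

Each entry of the resulting Series is an ordinary list, so it can be passed straight to itertools.combinations.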
Use DataFrame.groupby().sum() to group rows by one or more columns and compute the sum for each group. groupby() returns a DataFrameGroupBy object, whose sum() method calculates the sum of a given column for each group.
groupby() returns a groupby object that contains information about the groups. See the pandas user guide for more detailed usage and examples, including splitting an object into groups, iterating through groups, selecting a group, aggregation, and more.
You can use pandas DataFrame.groupby().count() or DataFrame.groupby().size() to group by columns and compute a count aggregate: count() gives the number of non-null values per column and size() the number of rows, for each group combination.
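Group sizes also give a quick sanity check for the pairing problem: a group of n words yields n*(n-1)/2 pairs. A small sketch, assuming the sample frame from the question (the names sizes and pair_counts are illustrative):

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "words": [f"word{i}" for i in range(1, 10)],
})

# Number of rows in each ID group.
sizes = df.groupby("ID")["words"].size()

# Number of unordered pairs each group should contribute: n choose 2.
pair_counts = sizes * (sizes - 1) // 2
print(pair_counts)
print("expected rows in result:", pair_counts.sum())
```

For the sample data the groups have 3, 2 and 4 words, so the result should have 3 + 1 + 6 = 10 rows, which matches the desired output in the question.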
It's simple: use itertools.combinations inside apply, then stack, i.e.:
from itertools import combinations
import pandas as pd
# Build the list of 2-tuples per ID, expand it to columns, then stack back into rows.
ndf = (df.groupby('ID')['words']
         .apply(lambda x: list(combinations(x.values, 2)))
         .apply(pd.Series).stack().reset_index(level=0, name='words'))
ID words
0 1 (word1, word2)
1 1 (word1, word3)
2 1 (word2, word3)
0 2 (word4, word5)
0 3 (word6, word7)
1 3 (word6, word8)
2 3 (word6, word9)
3 3 (word7, word8)
4 3 (word7, word9)
5 3 (word8, word9)
To match your exact output, we further have to split the tuples into two columns:
sdf = pd.concat([ndf['ID'], ndf['words'].apply(pd.Series)], axis=1).set_axis(['ID', 'WordsA', 'WordsB'], axis=1)
ID WordsA WordsB
0 1 word1 word2
1 1 word1 word3
2 1 word2 word3
0 2 word4 word5
0 3 word6 word7
1 3 word6 word8
2 3 word6 word9
3 3 word7 word8
4 3 word7 word9
5 3 word8 word9
To convert it to a one-liner, we can do:
combo = df.groupby('ID')['words'].apply(combinations, 2)\
    .apply(list).apply(pd.Series)\
    .stack().apply(pd.Series)\
    .set_axis(['WordsA', 'WordsB'], axis=1)\
    .reset_index(level=0)
You can use groupby with apply and return a DataFrame for each group. Then call reset_index twice: first to drop the second index level, then to turn the remaining ID index into a column:
from itertools import combinations
import pandas as pd
# For each group, build a two-column DataFrame of all word pairs.
f = lambda x: pd.DataFrame(list(combinations(x.values, 2)),
                           columns=['wordA', 'wordB'])
df = (df.groupby('ID')['words'].apply(f)
        .reset_index(level=1, drop=True)
        .reset_index())
print(df)
ID wordA wordB
0 1 word1 word2
1 1 word1 word3
2 1 word2 word3
3 2 word4 word5
4 3 word6 word7
5 3 word6 word8
6 3 word6 word9
7 3 word7 word8
8 3 word7 word9
9 3 word8 word9
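An alternative, offered only as a sketch and not taken from either answer above, is to skip apply and stack entirely: iterate over the groups, collect the pairs in a plain list, and build the DataFrame once at the end. The names rows and pairs are illustrative, and the sample frame is reconstructed from the question.

```python
from itertools import combinations

import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "words": [f"word{i}" for i in range(1, 10)],
})

# Iterating over the grouped column yields (group key, Series of words).
rows = [
    (group_id, word_a, word_b)
    for group_id, words in df.groupby("ID")["words"]
    for word_a, word_b in combinations(words, 2)
]

# One row per pair, with a fresh 0..n-1 index.
pairs = pd.DataFrame(rows, columns=["ID", "wordA", "wordB"])
print(pairs)
```

Because the frame is constructed in a single step, there is no intermediate wide table padded with missing values for the smaller groups, and groups with a single word simply contribute no rows.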