Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Counting a number of same words between two columns in python pandas

Tags:

python

pandas

Let's assume that I have the following table in python pandas

friend_description  friend_definition
    James is dumb      dumb dude
    Jacob is smart     smart guy
    Jane is pretty     she looks pretty
    Susan is rich      she is rich

here, in the first row, the word 'dumb' is contained in both columns. In the second row, 'smart' is contained in both columns. In the third row, 'pretty' is contained in both columns, and in the final row, 'is' and 'rich' are contained in both columns. I want to create the following columns:

friend_description  friend_definition      word_overlap    overlap_count
    James is dumb      dumb dude              dumb             1
    Jacob is smart     smart guy              smart            1
    Jane is pretty     she looks pretty       pretty           1
    Susan is rich      she is rich            is rich          2

I could use a for loop to manually define a new column with such things, but I was wondering if there's a function in pandas that makes this type of operations much smoother.

like image 311
user98235 Avatar asked Dec 10 '17 22:12

user98235


1 Answers

Simple list comprehension appears to be the fastest method when working with such strings:

In [112]: df['word_overlap'] = [set(x[0].split()) & set(x[1].split()) for x in df.values]

In [113]: df['overlap_count'] = df['word_overlap'].str.len()

In [114]: df
Out[114]:
  friend_description friend_definition word_overlap  overlap_count
0      James is dumb         dumb dude       {dumb}              1
1     Jacob is smart         smart guy      {smart}              1
2     Jane is pretty  she looks pretty     {pretty}              1
3      Susan is rich       she is rich   {rich, is}              2

single apply(..., axis=1):

In [85]: df['word_overlap'] = df.apply(lambda r: set(r['friend_description'].split()) &
    ...:                                          set(r['friend_definition'].split()),
    ...:                                axis=1)
    ...:

In [86]: df['overlap_count'] = df['word_overlap'].str.len()

In [87]: df
Out[87]:
  friend_description friend_definition word_overlap  overlap_count
0      James is dumb         dumb dude       {dumb}              1
1     Jacob is smart         smart guy      {smart}              1
2     Jane is pretty  she looks pretty     {pretty}              1
3      Susan is rich       she is rich   {rich, is}              2

apply().apply(..., axis=1) method:

In [23]: df['word_overlap'] = (df.apply(lambda x: x.str.split(expand=False))
    ...:                         .apply(lambda r: set(r['friend_description']) & set(r['friend_definition']),
    ...:                                axis=1))
    ...:

In [24]: df['overlap_count'] = df['word_overlap'].str.len()

In [25]: df
Out[25]:
  friend_description friend_definition word_overlap  overlap_count
0      James is dumb         dumb dude       {dumb}              1
1     Jacob is smart         smart guy      {smart}              1
2     Jane is pretty  she looks pretty     {pretty}              1
3      Susan is rich       she is rich   {is, rich}              2

Timing against 40.000 rows DF:

In [104]: df = pd.concat([df] * 10**4, ignore_index=True)

In [105]: df.shape
Out[105]: (40000, 2)

In [106]: %timeit [set(x[0].split()) & set(x[1].split()) for x in df.values]
223 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [107]: %timeit df.apply(lambda r: set(r['friend_description'].split()) & set(r['friend_definition'].split()), axis=1)
3.65 s ± 46.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [108]: %timeit df.apply(lambda x: x.str.split(expand=False)).apply(lambda r: set(r['friend_description']) & set(r['friend_definition']),
     ...: axis=1)
4.63 s ± 84.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
like image 138
MaxU - stop WAR against UA Avatar answered Oct 28 '22 01:10

MaxU - stop WAR against UA