
Output all rows with word count in a column greater than 3

I have this dummy df:

import pandas as pd

columns = ['answer', 'some_number']
data = [['hello how are you doing', '1.0'],
        ['hello', '1.0'],
        ['bye bye bye bye', '0.0'],
        ['no', '0.0'],
        ['yes', '1.0'],
        ['Who let the dogs out', '0.0'],
        ['1 + 1 + 1 + 2', '1.0']]
df = pd.DataFrame(columns=columns, data=data)

I want to output the rows with a word count greater than 3. Here that would be the rows 'hello how are you doing', 'bye bye bye bye', 'Who let the dogs out', and '1 + 1 + 1 + 2'.

My approach doesn't work: df[len(df.answer) > 3]

Output: KeyError: True

Exa asked Mar 10 '21 18:03


4 Answers

A couple more options using str.split(), where n is the word-count threshold (n = 3 here):

  • Combine with str.len():

    df[df.answer.str.split().str.len().gt(n)]
    
  • Or combine with apply(len):

    df[df.answer.str.split().apply(len).gt(n)]
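
One practical difference between the two options above shows up with missing values. The following is a minimal sketch, assuming a hypothetical column that contains a None entry (the question's data has none): str.len() propagates the missing value, so the comparison is False for that row, whereas apply(len) calls the built-in len() on the missing value and raises TypeError.

```python
import pandas as pd

# Hypothetical data with one missing answer (not part of the question's df)
df_na = pd.DataFrame({'answer': ['hello how are you doing', None, 'no']})
n = 3

words = df_na['answer'].str.split()   # the missing entry stays missing

# .str.len() gives NaN for the missing row, and NaN > n evaluates to False
mask = words.str.len().gt(n)
result = df_na[mask]

# apply(len) calls len() on the missing value, which raises TypeError
try:
    words.apply(len)
    apply_failed = False
except TypeError:
    apply_failed = True
```

If the column can contain missing values, the str.len() variant is therefore the safer of the two, unless the missing rows are dropped or filled first.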
    

What's fastest?

  • Fastest overall (BENY's list comprehension):

    df[[x.count(' ') >= n for x in df.answer]]
    
  • Fastest pandas-based (anky's first answer):

    df[df.answer.str.count(' ').ge(n)]
    

Timed with ~20 words per sentence:

timing plot
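
The comparison can be reproduced roughly as follows. This is only a sketch under assumptions: the benchmark data (10,000 identical sentences of 20 words separated by single spaces) and the name df_big are made up for illustration, and the actual timings depend on the machine and the pandas version, so no numbers are claimed here.

```python
import timeit
import pandas as pd

# Assumed benchmark data: 10,000 sentences of 20 single-space-separated words
sentence = ' '.join(['word'] * 20)
df_big = pd.DataFrame({'answer': [sentence] * 10_000})
n = 3

candidates = {
    'list comprehension + count': lambda: df_big[[x.count(' ') >= n for x in df_big.answer]],
    'str.count': lambda: df_big[df_big.answer.str.count(' ').ge(n)],
    'str.split + str.len': lambda: df_big[df_big.answer.str.split().str.len().gt(n)],
    'str.split + apply(len)': lambda: df_big[df_big.answer.str.split().apply(len).gt(n)],
}

# Check that every candidate selects the same number of rows before timing
row_counts = {name: len(fn()) for name, fn in candidates.items()}

for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=5)
    print(f'{name}: {seconds / 5:.4f} s per run')
```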


Why doesn't df[len(df.answer) > 3] work?

len(df.answer) returns the length of the answer column itself (7), not the number of words per answer (5, 1, 4, 1, 1, 5, 7).

That means the final expression evaluates to df[7 > 3] or df[True], which breaks because there is no column True:

>>> len(df.answer)
7

>>> len(df.answer) > 3     # 7 > 3
True

>>> df[len(df.answer) > 3] # df[True] doesn't exist
KeyError: True
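
For contrast, here is a small self-contained sketch of the difference between the length of the column and the per-row word counts, using the answers from the question. The per-row counts produce a boolean Series, which is what df[...] needs as a mask.

```python
import pandas as pd

df = pd.DataFrame({'answer': ['hello how are you doing', 'hello',
                              'bye bye bye bye', 'no', 'yes',
                              'Who let the dogs out', '1 + 1 + 1 + 2']})

column_length = len(df.answer)                  # one integer for the whole column
word_counts = df.answer.str.split().str.len()   # one integer per row
mask = word_counts > 3                          # boolean Series, one value per row

print(df[mask])
```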
tdy answered Oct 22 '22 23:10


If the separator is ' ', you can try Series.str.count; otherwise, replace the separator accordingly.

n=3
df[df['answer'].str.count(' ').gt(n-1)]

To handle multiple consecutive spaces (credit: @piRSquared):

df[df['answer'].str.count(r'\s+').gt(n-1)]
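
To see why the separator pattern matters, here is a small sketch on a hypothetical Series (not from the question) in which the first entry contains a double space. Counting single spaces overcounts that entry, while counting whitespace runs or splitting does not.

```python
import pandas as pd

# Hypothetical input: the first entry has three words and a double space
s = pd.Series(['one  two three', 'one two three four'])
n = 3

by_space = s.str.count(' ').gt(n - 1)      # counts the double space as two separators
by_regex = s.str.count(r'\s+').gt(n - 1)   # counts each run of whitespace once
by_split = s.str.split().str.len().gt(n)   # counts the words themselves
```

Note that leading or trailing whitespace still adds to the r'\s+' count, whereas str.split() without arguments ignores it, so splitting is the more robust choice for untidy input.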

Or using list comprehension:

n = 3
df[[len(i.split()) > n for i in df['answer']]]  # should be faster than the above

                    answer some_number
0  hello how are you doing         1.0
2          bye bye bye bye         0.0
5     Who let the dogs out         0.0
6            1 + 1 + 1 + 2         1.0
anky answered Oct 23 '22 00:10


If I understand this correctly, here's one way:

>>> df.loc[df['answer'].str.split().apply(len) > 3, 'answer']
0    hello how are you doing
2            bye bye bye bye
5       Who let the dogs out
6              1 + 1 + 1 + 2
timgeb answered Oct 22 '22 23:10


Try using count, a plain string operation:

n = 3
df[[x.count(' ') > n-1 for x in df.answer]]
Out[31]: 
                    answer some_number
0  hello how are you doing         1.0
2          bye bye bye bye         0.0
5     Who let the dogs out         0.0
6            1 + 1 + 1 + 2         1.0
BENY answered Oct 22 '22 23:10