
Count number of words per row

I'm trying to create a new column in a DataFrame that contains the word count for the respective row. I'm looking for the total number of words, not frequencies of each distinct word. I assumed there would be a simple/quick way to do this common task, but after googling around and reading a handful of SO posts (1, 2, 3, 4) I'm stuck. I've tried the solutions put forward in the linked SO posts, but got lots of attribute errors back.

words = df['col'].split()
df['totalwords'] = len(words)

results in

AttributeError: 'Series' object has no attribute 'split' 

and

f = lambda x: len(x["col"].split()) -1
df['totalwords'] = df.apply(f, axis=1)

results in

AttributeError: ("'list' object has no attribute 'split'", 'occurred at index 0') 
LMGagne asked Apr 23 '18 15:04




2 Answers

str.split + str.len

str.len works nicely for any non-numeric column.

df['totalwords'] = df['col'].str.split().str.len() 
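As a minimal, self-contained sketch of the line above (the two-row DataFrame is made up for illustration), this is how it might look end to end. The second part is an assumption about the question's data: if the cells already hold lists of tokens, which the `'list' object has no attribute 'split'` error hints at, `.str.len()` alone should be enough.

```python
import pandas as pd

# Hypothetical sample data; 'col' matches the column name used in the question.
df = pd.DataFrame({"col": ["This is one sentence", "and another"]})

# Split each string on whitespace, then take the length of each resulting list.
df["totalwords"] = df["col"].str.split().str.len()
print(df)

# Assumed variant: the cells are already lists of tokens, so no split is needed.
tokens = pd.Series([["This", "is", "one", "sentence"], ["and", "another"]])
token_counts = tokens.str.len()
print(token_counts.tolist())
```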

str.count

If your words are single-space separated, you may simply count the spaces plus 1.

df['totalwords'] = df['col'].str.count(' ') + 1 
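The single-space condition matters. The sketch below, on made-up strings, compares counting spaces with splitting on whitespace; the two only agree when words are separated by exactly one space and there is no leading or trailing padding.

```python
import pandas as pd

# Hypothetical inputs: clean, double-spaced, padded, and empty.
s = pd.Series(["one two three", "one  two", " padded ", ""])

# Spaces plus one: every extra space inflates the count.
by_spaces = (s.str.count(" ") + 1).tolist()

# Whitespace split: runs of spaces and padding are ignored.
by_split = s.str.split().str.len().tolist()

print(by_spaces)  # [3, 3, 3, 1]
print(by_split)   # [3, 2, 1, 0]
```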

List Comprehension

This is faster than you think!

df['totalwords'] = [len(x.split()) for x in df['col'].tolist()] 
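One caveat worth sketching: the list comprehension calls `.split()` on every cell, so a missing value (None or NaN) would raise an AttributeError. The guard below is one possible way around that; counting missing cells as 0 words is an assumption, as is the sample data.

```python
import pandas as pd

# Hypothetical data with one missing value in the middle.
df = pd.DataFrame({"col": ["This is one sentence", None, "and another"]})

# Only split actual strings; anything else (None, NaN) is counted as 0 words.
df["totalwords"] = [
    len(x.split()) if isinstance(x, str) else 0
    for x in df["col"].tolist()
]
print(df["totalwords"].tolist())  # [4, 0, 2]
```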
cs95 answered Sep 18 '22 19:09


Here is a way using .apply():

df['number_of_words'] = df.col.apply(lambda x: len(x.split())) 

Example

Given this df:

>>> df
                    col
0  This is one sentence
1           and another

After applying .apply():

df['number_of_words'] = df.col.apply(lambda x: len(x.split()))

>>> df
                    col  number_of_words
0  This is one sentence                4
1           and another                2

Note: As pointed out in the comments and in this answer, .apply is not necessarily the fastest method. If speed is important, go with one of @cᴏʟᴅsᴘᴇᴇᴅ's methods instead.
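If you want to check the speed difference on your own data rather than take it on faith, a rough timing sketch along these lines may help. The DataFrame size and the number of runs are arbitrary assumptions, and the ranking can vary with data and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical benchmark data: 10,000 short strings.
df = pd.DataFrame({"col": ["This is one sentence", "and another"] * 5000})

# The three approaches from the answers, wrapped so timeit can call them.
candidates = {
    "str.split + str.len": lambda: df["col"].str.split().str.len(),
    "apply": lambda: df["col"].apply(lambda x: len(x.split())),
    "list comprehension": lambda: [len(x.split()) for x in df["col"].tolist()],
}

# Time each approach over 5 runs and print them from fastest to slowest.
timings = {name: timeit.timeit(fn, number=5) for name, fn in candidates.items()}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.4f} s for 5 runs")
```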

sacuL answered Sep 19 '22 19:09