I'm trying to create a new column in a DataFrame that contains the word count for the respective row. I'm looking for the total number of words, not frequencies of each distinct word. I assumed there would be a simple/quick way to do this common task, but after googling around and reading a handful of SO posts (1, 2, 3, 4) I'm stuck. I've tried the solutions put forward in the linked SO posts, but got lots of attribute errors back.
words = df['col'].split()
df['totalwords'] = len(words)
results in
AttributeError: 'Series' object has no attribute 'split'
and
f = lambda x: len(x["col"].split()) -1
df['totalwords'] = df.apply(f, axis=1)
results in
AttributeError: ("'list' object has no attribute 'split'", 'occurred at index 0')
Using the count() Function
The "standard" way (no external libraries) to get the count of word occurrences in a list is by using the list object's count() method. count() is a built-in method that takes an element as its only argument and returns the number of times that element appears in the list.
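As a minimal sketch of the difference between counting one word and counting all words in plain Python (the sample sentence and the variable names are made up for illustration):

```python
# Split a sample sentence into a list of words on whitespace.
words = "the quick brown fox jumps over the lazy dog the end".split()

# list.count() returns how many times one specific element occurs.
the_count = words.count("the")

# len() returns the total number of words, which is what the question asks for.
total_words = len(words)

print(the_count, total_words)
```

Note that count() answers "how often does this word appear", so for a total word count len() on the split result is the relevant call.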
R which counts the number of words per sentence in a given text string. For a long text containing several sentences, it counts the words in all of them and outputs the mean number of words per sentence and the total number of words. If you know the words are separated by single spaces, str_count(temp$question1, " ") + 1 is an easy option.
The strsplit() function in R splits a string into substrings wherever the given regex matches and returns them as a vector of words (wrapped in a list, with one element per input string). Each element of this vector is a substring of the original string, so the length of the vector equals the number of words.
str.split + str.len
str.len works nicely for any non-numeric column.
df['totalwords'] = df['col'].str.split().str.len()
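A small sketch of how this behaves on a column that also contains a missing value and an empty string. The sample data is invented, and treating missing rows as zero words is an assumption you may not want:

```python
import pandas as pd

# Hypothetical sample data: two sentences, one missing value, one empty string.
df = pd.DataFrame({"col": ["This is one sentence", "and another", None, ""]})

# .str.split() with no argument splits on any run of whitespace;
# .str.len() then returns the length of each resulting list.
counts = df["col"].str.split().str.len()

# Missing values come back as NaN, so fill them before casting to int.
# Counting a missing row as 0 words is an assumption made for this sketch.
df["totalwords"] = counts.fillna(0).astype(int)

print(df)
```

Without the fillna step the new column would be a float column with NaN in the missing row.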
str.count
If your words are single-space separated, you may simply count the spaces plus 1.
df['totalwords'] = df['col'].str.count(' ') + 1
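The single-space assumption matters. The sketch below (sample strings invented for illustration) compares counting spaces with counting runs of non-whitespace characters via the regex \S+, which is one possible alternative when the spacing is irregular:

```python
import pandas as pd

# Hypothetical strings: clean spacing, a doubled space, and a leading space.
s = pd.Series(["This is one sentence", "two  spaces", " leading space"])

# Counting spaces plus 1 overcounts when spacing is irregular.
by_spaces = s.str.count(" ") + 1

# str.count takes a regex, so counting runs of non-whitespace
# characters gives the number of whitespace-separated tokens.
by_tokens = s.str.count(r"\S+")

print(pd.DataFrame({"text": s, "by_spaces": by_spaces, "by_tokens": by_tokens}))
```

The two columns agree on the first row only; on the other two rows the space-based count is one too high.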
List comprehension
This is faster than you think!
df['totalwords'] = [len(x.split()) for x in df['col'].tolist()]
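The comprehension above raises an AttributeError if a cell is not a string (for example a missing value). A hedged variant with a type guard might look like this; the sample data and the choice to count non-strings as 0 words are assumptions:

```python
import pandas as pd

# Hypothetical sample data with one missing value.
df = pd.DataFrame({"col": ["This is one sentence", "and another", None]})

# Only call .split() on actual strings; anything else counts as 0 words
# (an assumption made for this sketch).
df["totalwords"] = [
    len(x.split()) if isinstance(x, str) else 0
    for x in df["col"].tolist()
]

print(df)
```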
Here is a way using .apply():
df['number_of_words'] = df.col.apply(lambda x: len(x.split()))
Example
Given this df:
>>> df
                    col
0  This is one sentence
1           and another
After applying .apply():
>>> df['number_of_words'] = df.col.apply(lambda x: len(x.split()))
>>> df
                    col  number_of_words
0  This is one sentence                4
1           and another                2
Note: As pointed out in the comments and in this answer, .apply is not necessarily the fastest method. If speed is important, it is better to go with one of @cᴏʟᴅsᴘᴇᴇᴅ's methods.
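If you want to check the speed claim on your own data, a rough timing sketch with the standard timeit module could look like this. The test frame, the repeat count, and the candidates dictionary are all made up for illustration, and the results will vary with data size and pandas version:

```python
import timeit

import pandas as pd

# Hypothetical test frame: 10,000 short, single-space-separated strings.
df = pd.DataFrame({"col": ["This is one sentence", "and another"] * 5000})

# The four approaches discussed in the answers, wrapped as zero-argument callables.
candidates = {
    "str.split + str.len": lambda: df["col"].str.split().str.len(),
    "str.count": lambda: df["col"].str.count(" ") + 1,
    "list comprehension": lambda: [len(x.split()) for x in df["col"].tolist()],
    "apply": lambda: df["col"].apply(lambda x: len(x.split())),
}

# Time each approach over a few runs and print them from fastest to slowest.
timings = {name: timeit.timeit(fn, number=5) for name, fn in candidates.items()}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.4f} s")
```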