
Finding number of occurrences of a word in a file using R functions

Tags: file, r

I am using the following code to count the occurrences of the word "memory" in a file, and I am getting the wrong result. Can you please help me see what I am missing?

NOTE1: The question asks for exact occurrences of the word "memory".

NOTE2: I have since realized that they want exactly "memory"; even something like "memory," is not accepted. I guess that is what caused the confusion. I tried it with the word "action" and the correct answer is 7. You can try it as well.

> # names <- scan("hamlet.txt", what=character())
> names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character())
Read 28230 items
> length(grep("memory", names))
[1] 9

Here's the file

Mona Jalal asked Feb 05 '14 02:02



2 Answers

The problem is really Shakespeare's use of punctuation. There are a lot of apostrophes (') in the text. When the R function scan reads character data, it treats both " and ' as quote characters by default. So when it encounters an apostrophe it assumes it is the start of a quoted string and reads all characters up to the next apostrophe into a single element of your names vector. One of these long elements happens to include two instances of the word "memory", which reduces the total number of matches by one.

You can fix the problem by telling scan to regard all quotation marks as normal characters and not treat them specially:

names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
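To see the effect on a small scale, here is a minimal sketch that runs scan on an in-memory string instead of the Hamlet file. The sample sentence is invented for illustration, and the variable names are not from the original post. Only the match counts are compared, since the exact grouping produced by the default settings is not something this sketch relies on.

```r
# Invented sample text (not from the Hamlet file); two words start with an apostrophe.
txt <- "'tis in my memory, thy memory 'twill fade from memory"

# Default quoting: the text between the two apostrophes can end up in one element.
default_tokens <- scan(text = txt, what = character(), quiet = TRUE)

# Quoting disabled: every whitespace-separated word becomes its own element.
plain_tokens <- scan(text = txt, what = character(), quote = "", quiet = TRUE)

n_default <- length(grep("memory", default_tokens))
n_plain   <- length(grep("memory", plain_tokens))

cat("matches with default quoting:", n_default, "\n")
cat("matches with quoting disabled:", n_plain, "\n")
```

With quoting disabled, each of the three "memory" words sits in its own element, so none of them can be hidden inside a longer quoted element.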

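Be careful when using the R implementation of grep. It does not behave in exactly the same way as the usual GNU/Linux program: it works on the elements of a character vector rather than on the lines of a file. Because scan has already split the text into whitespace-separated words, the way you have used it here WILL count the matching words and not just the matching lines, as some people have suggested. Note that grep matches substrings, so an element such as "memory," is counted as well.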
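The difference between a pattern match and an exact match can be sketched on a small vector. The tokens below are hypothetical stand-ins for the output of scan, not data from the Hamlet file, and the punctuation-stripping step is just one possible way to normalize words.

```r
# Hypothetical tokens standing in for the output of scan().
tokens <- c("memory", "memory,", "Memory", "memorial", "remember", "memory")

# Pattern match: counts every element that contains "memory" as a substring.
n_pattern <- length(grep("memory", tokens))

# Exact match: counts only elements that are exactly "memory".
n_exact <- sum(tokens == "memory")

# Exact match after stripping punctuation and lowercasing each element.
cleaned <- tolower(gsub("[[:punct:]]", "", tokens))
n_cleaned <- sum(cleaned == "memory")

print(c(pattern = n_pattern, exact = n_exact, cleaned = n_cleaned))
```

Which of the three counts is the "right" one depends on how strictly the word is defined; the question's NOTE2 suggests the exact comparison.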

andypea answered Sep 20 '22 17:09


As pointed out by @andrew, my previous answer would give wrong results if a word repeats on the same line. Based on the other answers and comments, this one seems OK:

names = scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
idxs = grep("memory", names, ignore.case = TRUE)

length(idxs)
# [1] 10
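
If the text is held line by line rather than word by word (for example after readLines, which this answer does not use), repeated words on one line can still be counted. The following is only a sketch under that assumption, with made-up lines; it uses gregexpr and regmatches from base R, and the whole-word pattern is one possible choice.

```r
# Made-up lines standing in for text read line by line.
lines <- c("memory and memory again", "nothing here", "Memory holds the door")

# gregexpr() finds every match in each line; \\b restricts matches to whole words.
matches <- gregexpr("\\bmemory\\b", lines, ignore.case = TRUE, perl = TRUE)

# regmatches() returns one character vector of matched strings per line.
per_line <- lengths(regmatches(lines, matches))

total <- sum(per_line)
print(per_line)
print(total)
```

Here the first line contributes two matches, so the total counts occurrences rather than matching lines.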
Fernando answered Sep 18 '22 17:09