I am using the following code for finding number of occurrences of a word memory
in a file and I am getting the wrong result. Can you please help me to know what I am missing?
NOTE1: The question is looking for exact occurrence of word "memory"! NOTE2: What I have realized they are exactly looking for "memory" and even something like "memory," is not accepted! That was the part which has brought up the confusion I guess. I tried it for word "action" and the correct answer is 7! You can try as well.
#names=scan("hamlet.txt", what=character())
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character())
Read 28230 items
> length(grep("memory",names))
[1] 9
Here's the file
You can use the str_count function from the stringr package to get the number of keywords that match a given character vector. The pattern argument of the str_count function accepts a regular expression that can be used to specify the keyword.
The stringr package provides a str_count() method which is used to count the number of occurrences of a certain pattern specified as an argument to the function. The pattern may be a single character or a group of characters. Any instances matching to the expression result in the increment of the count.
To count the number of occurrences of a specific word in a text file, read the content of text file to a string and use String. count() function with the word passed as argument to the count() function.
The problem is really Shakespeare's use of punctuation. There are a lot of apostrophes (') in the text. When the R function scan
encounters an apostrophe it assumes it is the start of a quoted string and reads all characters up until the next apostrophe into a single entry of your names
array. One of these long entries happens to include two instances of the word "memory" and so reduces the total number of matches by one.
You can fix the problem by telling scan
to regard all quotation marks as normal characters and not treat them specially:
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
Be careful when using the R implementation of grep
. It does not behave in exactly the same way as the usual GNU/Linux program. In particular, the way you have used it here WILL find the number of matching words and not just the total number of matching lines as some people have suggested.
As pointed by @andrew, my previous answer would give wrong results if a word repeats on the same line. Based on other answers/comments, this one seems ok:
names = scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
idxs = grep("memory", names, ignore.case = TRUE)
length(idxs)
# [1] 10
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With