Finding number of occurrences of a word in a file using R functions

Tags:

r

I am using the following code to count the occurrences of the word "memory" in a file, but I am getting the wrong result. Can you please help me understand what I am missing?

NOTE 1: The question asks for exact occurrences of the word "memory". NOTE 2: I have since realized that only the exact token "memory" counts; even something like "memory," is not accepted. I guess that is what caused the confusion. I tried it with the word "action" and the correct answer is 7. You can try it as well.

#names=scan("hamlet.txt", what=character())
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character())
Read 28230 items
> length(grep("memory",names))
[1] 9

Here's the file

asked Feb 05 '14 02:02

Mona Jalal

2 Answers

The problem is really Shakespeare's use of punctuation. The text contains a lot of apostrophes ('). When the R function scan encounters an apostrophe, it assumes it is the start of a quoted string and reads all characters up to the next apostrophe into a single entry of your names vector. One of these long entries happens to include two instances of the word "memory", which reduces the total number of matches by one.

You can fix the problem by telling scan to regard all quotation marks as normal characters and not treat them specially:

names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
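To see the effect on a small scale, here is a minimal sketch that runs scan on an in-memory string instead of the Hamlet file. The sample sentence and the variable names are made up for illustration; only the quote argument comes from the fix above.

```r
# Hypothetical sample text: the apostrophe in 'tis opens a "quoted string"
# that runs up to the apostrophe after the last word "memory".
txt <- "sweet memory 'tis but a memory of a memory' gone"

# Default behaviour: apostrophes and double quotes act as quoting characters.
with_quotes <- scan(text = txt, what = character(), quiet = TRUE)

# quote = NULL: quotation marks are treated as ordinary characters.
no_quotes <- scan(text = txt, what = character(), quote = NULL, quiet = TRUE)

# Number of entries that contain "memory" in each case.
n_with <- length(grep("memory", with_quotes))
n_without <- length(grep("memory", no_quotes))

print(with_quotes)
print(no_quotes)
cat("matches with default quoting:", n_with, "\n")
cat("matches with quote = NULL:   ", n_without, "\n")
```

With quote = NULL every whitespace-separated word becomes its own entry, so the three words containing "memory" are counted separately. With the default setting the quoted stretch is read as one entry, so the count comes out lower.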

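Be careful when using the R implementation of grep. It does not behave in exactly the same way as the usual GNU/Linux program: it returns the indices of the matching elements of a vector, and because scan has split the text into whitespace-separated words, each element here is a word rather than a line. In particular, the way you have used it here WILL find the number of matching words and not just the total number of matching lines, as some people have suggested.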
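The difference between counting matching elements and counting occurrences can be sketched on a toy vector. The vector below is invented for illustration and is not taken from the Hamlet file.

```r
# Hypothetical input: the last element contains the word twice.
words <- c("memory", "memory,", "Memory", "memories", "remember", "memory memory")

# grep() returns the indices of elements that contain the pattern,
# so an element with two occurrences is still counted once.
n_elements <- length(grep("memory", words))

# gregexpr() finds every occurrence inside each element;
# regmatches() extracts them and lengths() counts them per element.
hits <- regmatches(words, gregexpr("memory", words, fixed = TRUE))
n_occurrences <- sum(lengths(hits))

cat("matching elements:", n_elements, "\n")
cat("total occurrences:", n_occurrences, "\n")
```

When each element is a single word, as with the output of scan, the two counts agree. They differ only when one element holds the word more than once.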

answered Sep 20 '22 17:09

andypea

As pointed out by @andrew, my previous answer would give wrong results if a word is repeated on the same line. Based on the other answers and comments, this one seems OK:

names = scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
idxs = grep("memory", names, ignore.case = TRUE)

length(idxs)
# [1] 10
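Note that this pattern matches any entry containing "memory", in any case, including entries with attached punctuation. If only the exact token should count, as the notes in the question suggest, a stricter comparison may be closer to what is wanted. The following is a rough sketch on a made-up vector, not on the Hamlet file, and the choice of which variant is "correct" depends on what the exercise expects.

```r
# Hypothetical tokens, as scan(..., quote = NULL) might return them.
tokens <- c("memory", "memory,", "Memory", "memory's", "remembrance")

# Substring match, case-insensitive (the approach used above).
n_substring <- length(grep("memory", tokens, ignore.case = TRUE))

# Exact, case-sensitive match: only the bare token "memory" counts.
n_exact <- sum(tokens == "memory")

# Exact match after lower-casing and stripping punctuation.
cleaned <- tolower(gsub("[[:punct:]]", "", tokens))
n_cleaned <- sum(cleaned == "memory")

cat("substring, ignore case:", n_substring, "\n")
cat("exact token:           ", n_exact, "\n")
cat("exact after cleaning:  ", n_cleaned, "\n")
```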

answered Sep 18 '22 17:09

Fernando
