I am doing extensive work with a variety of word lists and have the following question:
docText={"settlement", "new", "beginnings", "wildwood", "settlement", "book",
"excerpt", "agnes", "leffler", "perry", "my", "mother", "junetta",
"hally", "leffler", "brought", "my", "brother", "frank", "and", "me",
"to", "edmonton", "from", "monmouth", "illinois", "mrs", "matilda",
"groff", "accompanied", "us", "her", "husband", "joseph", "groff",
"my", "father", "george", "leffler", "and", "my", "uncle", "andrew",
"henderson", "were", "already", "in", "edmonton", "they", "came",
"in", "1910", "we", "arrived", "july", "1", "1911", "the", "sun",
"was", "shining", "when", "we", "arrived", "however", "it", "had",
"been", "raining", "for", "days", "and", "it", "was", "very",
"muddy", "especially", "around", "the", "cn", "train"}
searchWords={"the","for","my","and","me","and","we"}
Each of these lists is much longer in practice (say, 250 words in searchWords and about 12,000 words in docText).
Right now, I can find the frequency of a given word by doing something like:
(* tally every word, sorted by descending count *)
docFrequency = Sort[Tally[docText], #1[[2]] > #2[[2]] &];
(* pull out the count for one word *)
Flatten[Cases[docFrequency, {"settlement", _}]][[2]]
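As an aside, in Mathematica 10 and later, Counts builds the same word -> frequency table in a single step (a minimal sketch; docCounts is just an illustrative name):
docCounts = Counts[docText];  (* Association: word -> frequency *)
docCounts["settlement"]       (* 2 *)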
But where I am getting hung up is converting a list of words into a list of the frequencies with which those words appear. I've tried to do this with Do loops but have hit a wall.
I want to go through docText and replace each element with the frequency of its appearance. That is, since "settlement" appears twice, it would be replaced by 2 in the list, and since "my" appears four times, it would become 4. The list would then start 2, 1, 1, 1, 2, and so forth.
I suspect the answer lies somewhere in If[] and Map[]?
This all sounds weird, but I am trying to pre-process a bunch of text for term-frequency analysis…
Addition for Clarity (I hope):
Here is a better example.
searchWords={"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", "A", "about",
"above", "across", "after", "again", "against", "all", "almost",
"alone", "along", "already", "also", "although", "always", "among",
"an", "and", "another", "any", "anyone", "anything", "anywhere",
"are", "around", "as", "at", "b", "B", "back", "be", "became",
"because", "become", "becomes", "been", "before", "behind", "being",
"between", "both", "but", "by", "c", "C", "can", "cannot", "could",
"d", "D", "do", "done", "down", "during", "e", "E", "each", "either",
"enough", "even", "ever", "every", "everyone", "everything",
"everywhere", "f", "F", "few", "find", "first", "for", "four",
"from", "full", "further", "g", "G", "get", "give", "go", "h", "H",
"had", "has", "have", "he", "her", "here", "herself", "him",
"himself", "his", "how", "however", "i", "I", "if", "in", "interest",
"into", "is", "it", "its", "itself", "j", "J", "k", "K", "keep", "l",
"L", "last", "least", "less", "m", "M", "made", "many", "may", "me",
"might", "more", "most", "mostly", "much", "must", "my", "myself",
"n", "N", "never", "next", "no", "nobody", "noone", "not", "nothing",
"now", "nowhere", "o", "O", "of", "off", "often", "on", "once",
"one", "only", "or", "other", "others", "our", "out", "over", "p",
"P", "part", "per", "perhaps", "put", "q", "Q", "r", "R", "rather",
"s", "S", "same", "see", "seem", "seemed", "seeming", "seems",
"several", "she", "should", "show", "side", "since", "so", "some",
"someone", "something", "somewhere", "still", "such", "t", "T",
"take", "than", "that", "the", "their", "them", "then", "there",
"therefore", "these", "they", "this", "those", "though", "three",
"through", "thus", "to", "together", "too", "toward", "two", "u",
"U", "under", "until", "up", "upon", "us", "v", "V", "very", "w",
"W", "was", "we", "well", "were", "what", "when", "where", "whether",
"which", "while", "who", "whole", "whose", "why", "will", "with",
"within", "without", "would", "x", "X", "y", "Y", "yet", "you",
"your", "yours", "z", "Z"}
These are the automatically generated stopwords from WordData[]. I want to compare these words against docText. Since "settlement" is NOT in searchWords, it would appear as 0; but since "my" is in searchWords, it would be replaced by its count (so I could tell how many times each given word appears).
I really do thank you for your help - I'm looking forward to taking some formal courses soon, as I'm bumping up against the edge of my ability to explain what I want to do!
We can replace everything in docText that doesn't appear in searchWords with 0 as follows:
(* each search word maps to itself; the catch-all rule _ -> 0 handles the rest,
   and the level spec {1} keeps that rule from matching the whole list *)
preprocessedDocText =
 Replace[docText,
  Dispatch@Append[Thread[searchWords -> searchWords], _ -> 0], {1}]
Then we can replace the remaining words by their frequencies:
replaceTable = Dispatch[Rule @@@ Tally[docText]];  (* word -> count rules *)
preprocessedDocText /. replaceTable
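The two passes can also be fused into one; here is a sketch (reusing the names above, with freqRules as an illustrative name) that keeps only the tally rules for words in searchWords and lets the default rule send everything else to 0:
freqRules = Dispatch@Append[
   Cases[Rule @@@ Tally[docText], (w_ -> _) /; MemberQ[searchWords, w]],
   _ -> 0];
Replace[docText, freqRules, {1}]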
Dispatch preprocesses a list of rules (->) into an optimized dispatch table, which speeds up subsequent replacements significantly.
I have not benchmarked this on large data, but Dispatch should provide a good speedup.
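If you want to check that on your own data, here is a quick sketch with AbsoluteTiming (plainRules is an illustrative name; replaceTable is defined above):
plainRules = Rule @@@ Tally[docText];  (* the same rules, without a dispatch table *)
First@AbsoluteTiming[preprocessedDocText /. plainRules]    (* linear scan through the rules *)
First@AbsoluteTiming[preprocessedDocText /. replaceTable]  (* hashed lookup via Dispatch *)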
@Szabolcs gave a fine solution, and I'd probably go the same route myself. Here is a slightly different solution, just for fun:
ClearAll[getFreqs];
getFreqs[docText_, searchWords_] :=
  Module[{dwords, dfreqs, inSearchWords, lset},
   (* make the local helpers thread over lists automatically *)
   SetAttributes[{lset, inSearchWords}, Listable];
   lset[args__] := Set[args];
   (* split the tally into parallel lists of words and counts *)
   {dwords, dfreqs} = Transpose@Tally[docText];
   (* mark all search words True in a single listable assignment... *)
   lset[inSearchWords[searchWords], True];
   (* ...and let everything else default to False *)
   inSearchWords[_] = False;
   (* zero out the counts of words that are not search words *)
   dfreqs*Boole[inSearchWords[dwords]]]
This shows how the Listable attribute may be used to replace loops and even Map-ping. We have:
In[120]:= getFreqs[docText,searchWords]
Out[120]= {0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,3,1,1,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,2,
1,0,0,2,0,0,1,0,2,0,2,0,1,1,2,1,1,0,1,0,1,0,0,1,0,0}
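To see the core trick in isolation, here is a minimal sketch with a hypothetical function f: a symbol carrying the Listable attribute threads over any list it is applied to, so a single call does the work of an explicit Map:
ClearAll[f];
SetAttributes[f, Listable];
f[x_] := x^2;
f[{1, 2, 3}]  (* {1, 4, 9}, the same as Map[f, {1, 2, 3}] *)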