
Searching for Number of Term Appearances in Mathematica

I'm trying to search across a large collection of text files (12k+) in Mathematica 8. So far, I've been able to plot the total number of times that a word appears (e.g. the word "love" appears 5,000 times across those 12k files). However, I'm having trouble determining the number of files in which "love" appears once, which might be only 1,000 files, with the word repeating several times in some of them.

I'm finding the documentation WRT FindList, streams, RecordSeparators, etc. a bit murky. Is there a way to set it up so that it finds one occurrence of a term in a file and then moves on to the next file?

Example of filelist:

{"89001.txt", "89002.txt", "89003.txt", "89004.txt", "89005.txt", "89006.txt", "89007.txt", "89008.txt", "89009.txt", "89010.txt", "89011.txt", "89012.txt", "89013.txt", "89014.txt", "89015.txt", "89016.txt", "89017.txt", "89018.txt", "89019.txt", "89020.txt", "89021.txt", "89022.txt", "89023.txt", "89024.txt"}

The following returns all of the lines containing "love" across every file. Is there a way to return only the first occurrence of "love" in each file before moving on to the next one?

FindList[filelist, "love"]

Thanks so much. This is my first post and I'm largely learning Mathematica through peer/supervisory help, online tutorials, and the documentation.

canadian_scholar asked Sep 22 '11 14:09


2 Answers

In addition to Daniel's answer, you also seem to be asking for a list of files where the word only occurs once. To do that, I'd continue to run FindList across all the files

res = FindList[filelist, "love"]

Then, reduce the results to single lines only, via

lines = Select[ res, Length[#]==1& ]

But this doesn't eliminate the cases where there is more than one occurrence in a single line. To do that, you could use StringCount and only accept instances where it is 1. Each element of lines is a one-element list, so First is needed to get at the line itself:

Select[ lines, StringCount[ First[#], RegularExpression[ "\\blove\\b" ] ] == 1& ]

The RegularExpression specifies that "love" must be a distinct word using the word boundary marker (\\b), so that words like "lovely" won't be included.
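As a quick check of what the word-boundary pattern changes, here is a small sketch on a made-up test string (the string is purely an illustration and is not taken from the files in the question):

```mathematica
(* hypothetical test sentence, for illustration only *)
s = "I love lovely things";

(* plain substring match: also counts the "love" inside "lovely" *)
StringCount[s, "love"]                           (* 2 *)

(* word-boundary match: only the standalone word *)
StringCount[s, RegularExpression["\\blove\\b"]]  (* 1 *)
```

Both calls are case-sensitive by default, so "Love" at the start of a sentence would not be counted unless you add IgnoreCase -> True.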

Edit: It appears that FindList when passed a list of files returns a flattened list, so you can't determine which item goes with which file. For instance, if you have 3 files, and they contain the word "love", 0, 1, and 2 times, respectively, you'd get a list that looked like

{, love, love, love }

which is clearly not useful. To overcome this, you'll have to process each file individually, and that is best done via Map (/@), as follows

res = FindList[#, "love"]& /@ filelist

and the rest of the above code works as expected.

But, if you want to associate the results with a file name, you have to change it a little.

res = {#, FindList[#, "love"]}& /@ filelist
lines = Select[res, 
         Length[ #[[2]] ] == 1 &&  (* <-- Note the use of [[2]] *)
         StringCount[ #[[2, 1]], RegularExpression[ "\\blove\\b" ] ] == 1&  (* [[2, 1]] is the single matching line *)
        ]

which returns a list of the form

{ {filename, { "string with love in it" }}, 
  {filename, { "string with love in it" }}, ...}

To extract the file names, you simply type lines[[All, 1]].

Note, in order to Select on the properties you wanted, I used Part ([[ ]]) to specify the second element in each datum, and the same goes for extracting the file names.
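If counting matching lines turns out to be too indirect, a different route (not the FindList approach above) is to count the word over each whole file. This is only a sketch: it assumes the entries of filelist are readable plain-text files, and the names wordPattern, counts, exactlyOnce and atLeastOnce are made up for the example.

```mathematica
(* Assumes filelist holds paths to plain-text files, as in the question. *)
wordPattern = RegularExpression["\\blove\\b"];

(* Number of whole-word matches in each file, in the same order as filelist. *)
counts = StringCount[Import[#, "Text"], wordPattern] & /@ filelist;

exactlyOnce = Pick[filelist, counts, 1];           (* files with exactly one match *)
atLeastOnce = Pick[filelist, Positive /@ counts];  (* files with one or more matches *)

Length /@ {exactlyOnce, atLeastOnce}
```

This reads every file in full, so for 12k+ files it will probably be slower than a search that can stop at the first hit.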

rcollyer answered Oct 20 '22 18:10


Help > Documentation Center > FindList item 4:

"FindList[files,text,n] includes only the first n lines found."

So you could set n to 1.
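Applied one file at a time, that might look like the sketch below. It assumes filelist is the list of file names from the question; filesWithWord is just a name chosen for the example.

```mathematica
(* Assumes filelist is the list of file names from the question. *)
(* Asking for at most one line per file means a single hit is enough. *)
filesWithWord = Select[filelist, FindList[#, "love", 1] =!= {} &];

Length[filesWithWord]   (* number of files with at least one matching line *)
```

By default FindList matches substrings, so a line containing only "lovely" also counts; the option WordSearch -> True restricts the search to whole words.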

Daniel Lichtblau

Daniel Lichtblau answered Oct 20 '22 19:10