
Extracting Frequency Data from Sorted List of Phrases

After plumbing the documentation/past questions on list operations, I've come up blank - many of the cases involve numbers, whereas I'm working with large quantities of text.

I have a sorted list of common three-word phrases (trigrams) that appear in a large body of textual information, generated through Mathematica's Partition[], Tally[], and Sort[] commands. An example of the sort of data that I'm operating on (I have hundreds of these files):

{{{wa, wa, wa}, 66}, {{i, love, you}, 62}, {{la, la, la}, 50}, {{meaning, of, life}, 42}, {{on, come, on}, 40}, {{come, on, come}, 40}, {{yeah, yeah, yeah}, 38}, {{no, no, no}, 36}, {{we, re, gonna}, 36}, {{you, love, me}, 35}, {{in, love, with}, 32}, {{the, way, you}, 30}, {{i, want, to}, 30}, {{back, to, me}, 29}, <<38211>>, {{of, an, xke}, 1}}

I'm hoping to search this file so that if the input is "meaning, of, life" it will return "42". I feel like I must be overlooking something obvious, but after tinkering around I've hit a brick wall. Mathematica's documentation is number-heavy, which is... well, unsurprising.

canadian_scholar asked Oct 02 '11 21:10



1 Answer

Assuming that you can load your data into Mathematica in the form you outlined, one very simple thing to do is to create a hash-table, where your trigrams will be the (compound) keys. Here is your sample (the part of it that you gave):

trigrams = {{{"wa", "wa", "wa"}, 66}, {{"i", "love", "you"}, 62}, 
 {{"la", "la", "la"}, 50}, {{"meaning", "of", "life"}, 42}, 
 {{"on", "come", "on"}, 40}, {{"come", "on", "come"}, 40}, 
 {{"yeah", "yeah", "yeah"}, 38}, {{"no", "no", "no"}, 36}, 
 {{"we", "re", "gonna"}, 36}, {{"you", "love", "me"}, 35}, 
 {{"in", "love", "with"}, 32}, {{"the", "way", "you"}, 30}, 
 {{"i", "want", "to"}, 30}, {{"back", "to", "me"}, 29}, 
 {{"of", "an", "xke"}, 1}};

Here is one possible way to create a hash-table:

Clear[trigramHash];
(trigramHash[Sequence @@ #1] = #2) & @@@ trigrams;

Now we can use it like this:

In[16]:= trigramHash["meaning","of","life"]
Out[16]= 42

This approach will be beneficial if you perform many searches, of course.
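
To accept the comma-separated input from the question ("meaning, of, life"), a small wrapper can split the string before the lookup. The following is only a sketch: lookupPhrase is a hypothetical helper name, the two-entry sample stands in for the real data, and returning 0 for a phrase that was never tallied is an assumption about the desired behavior.

```mathematica
(* Small sample so the sketch runs on its own *)
sample = {{{"meaning", "of", "life"}, 42}, {{"i", "love", "you"}, 62}};

Clear[trigramHash];
(trigramHash[Sequence @@ #1] = #2) & @@@ sample;

(* Assumption: a phrase that was never tallied should count as 0.
   Explicit definitions are tried before this general pattern. *)
trigramHash[_String, _String, _String] := 0;

(* lookupPhrase is a hypothetical helper, not a built-in *)
lookupPhrase[phrase_String] :=
  trigramHash @@ (StringTrim /@ StringSplit[phrase, ","]);

lookupPhrase["meaning, of, life"]  (* 42 *)
lookupPhrase["not, in, list"]      (* 0 *)
```

If the input does not contain exactly three words, no definition matches and the trigramHash expression is returned unevaluated, which makes malformed queries easy to spot.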

EDIT

If you have many files and want to search them efficiently in Mathematica, one thing you could do is to use the above hashing mechanism to convert all your files to .mx binary Mathematica files. These files are optimized for fast loading, and serve as a persistence mechanism for definitions you want to store. Here is how it may work:

In[20]:= DumpSave["C:\\Temp\\trigrams.mx",trigramHash]
Out[20]= {trigramHash}

In[21]:= Quit[]

In[1]:= Get["C:\\Temp\\trigrams.mx"]
In[2]:= trigramHash["meaning","of","life"]
Out[2]= 42

You use DumpSave to create an .mx file. So, the suggested procedure is to load your data into Mathematica file by file, create the hashes (you could use SubValues to index a particular hash-table by the index of your file), and then save those definitions into .mx files. This way you get fast loading and fast searching, and you have the freedom to decide which parts of your data to keep loaded into Mathematica at any given time (largely without the performance hit normally associated with file loading).
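
The SubValues idea could look roughly like the sketch below, where the first bracket selects the file and the second holds the trigram. The names fileHash and buildFileHash, the tiny inline data sets, and the output location under $TemporaryDirectory are all assumptions made for illustration; in practice each list would come from one of your files.

```mathematica
(* fileHash and buildFileHash are hypothetical names *)
Clear[fileHash, buildFileHash];

(* Store each count as fileHash[index][w1, w2, w3] *)
buildFileHash[index_Integer, trigrams_List] :=
  ((fileHash[index][Sequence @@ #1] = #2) & @@@ trigrams;);

(* Stand-ins for the data read from two different files *)
buildFileHash[1, {{{"meaning", "of", "life"}, 42}, {{"i", "love", "you"}, 62}}];
buildFileHash[2, {{{"meaning", "of", "life"}, 7}}];

fileHash[1]["meaning", "of", "life"]  (* 42 *)
fileHash[2]["meaning", "of", "life"]  (* 7 *)

(* Persist every definition attached to fileHash in one .mx file *)
DumpSave[FileNameJoin[{$TemporaryDirectory, "fileHash.mx"}], fileHash];
```

Keeping one symbol per file instead (and one .mx file each) would make it easier to load only part of the data; which layout is preferable depends on how many files are queried together.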

Leonid Shifrin answered Oct 17 '22 01:10
