After plumbing the documentation/past questions on list operations, I've come up blank - many of the cases involve numbers, whereas I'm working with large quantities of text.
I have a sorted list of common three-word phrases (trigrams) that appear in a large body of textual information, generated through Mathematica's Partition[], Tally[], and Sort[] commands. An example of the sort of data that I'm operating on (I have hundreds of these files):
{{{wa, wa, wa}, 66}, {{i, love, you}, 62}, {{la, la, la}, 50}, {{meaning, of, life}, 42}, {{on, come, on}, 40}, {{come, on, come}, 40}, {{yeah, yeah, yeah}, 38}, {{no, no, no}, 36}, {{we, re, gonna}, 36}, {{you, love, me}, 35}, {{in, love, with}, 32}, {{the, way, you}, 30}, {{i, want, to}, 30}, {{back, to, me}, 29}, <<38211>>, {{of, an, xke}, 1}}
I'm hoping to search this file so that if the input is "meaning, of, life" it will return "42". I feel like I must be overlooking something obvious, but after tinkering around I've hit a brick wall. Mathematica's documentation is number-heavy, which is... well, unsurprising.
Assuming that you can load your data into Mathematica in the form you outlined, one very simple thing to do is to create a hash-table, where your trigrams will be the (compound) keys. Here is your sample (the part of it that you gave):
trigrams = {{{"wa", "wa", "wa"}, 66}, {{"i", "love", "you"}, 62},
{{"la", "la", "la"}, 50}, {{"meaning", "of", "life"}, 42},
{{"on", "come", "on"}, 40}, {{"come", "on", "come"}, 40},
{{"yeah", "yeah", "yeah"}, 38}, {{"no", "no", "no"}, 36},
{{"we", "re", "gonna"}, 36}, {{"you", "love", "me"}, 35},
{{"in", "love", "with"}, 32}, {{"the", "way", "you"}, 30},
{{"i", "want", "to"}, 30}, {{"back", "to", "me"}, 29},
{{"of", "an", "xke"}, 1}};
Here is one possible way to create a hash-table:
Clear[trigramHash];
(trigramHash[Sequence @@ #1] = #2) & @@@ trigrams;
Now we can use it like this:
In[16]:= trigramHash["meaning","of","life"]
Out[16]= 42
This approach pays off when you perform many searches, since the table is built only once.
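For comparison, here is a minimal sketch of two related things: a one-off search that scans the list directly with Cases, and a catch-all definition so that the hash-table returns something sensible for trigrams that are not in the data. The names sample and sampleHash are made up for this illustration, and returning 0 for unknown trigrams is an assumption rather than something the question asks for.

```mathematica
(* Small sample in the same {{w1, w2, w3}, count} form as above *)
sample = {{{"i", "love", "you"}, 62}, {{"meaning", "of", "life"}, 42}};

(* One-off search: scan the list with a pattern, no hash-table needed *)
Cases[sample, {{"meaning", "of", "life"}, n_} :> n]
(* {42} *)

(* Hash-table built the same way as trigramHash, using Scan instead of @@@ *)
Clear[sampleHash];
Scan[(sampleHash[Sequence @@ First[#]] = Last[#]) &, sample];

(* Catch-all definition; assumption: 0 is a sensible "not found" value *)
sampleHash[___] = 0;

sampleHash["meaning", "of", "life"]  (* 42 *)
sampleHash["not", "in", "data"]      (* 0 *)
```

The Cases version walks the whole list on every call, so it is fine for an occasional lookup, while the hash-table is the better fit for repeated queries.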
EDIT
If you have many files and want to search them efficiently in Mathematica, one thing you could do is to use the above hashing mechanism to convert all your files to .mx binary Mathematica files. These files are optimized for fast loading, and serve as a persistence mechanism for definitions you want to store. Here is how it may work:
In[20]:= DumpSave["C:\\Temp\\trigrams.mx",trigramHash]
Out[20]= {trigramHash}
In[21]:= Quit[]
In[1]:= Get["C:\\Temp\\trigrams.mx"]
In[2]:= trigramHash["meaning","of","life"]
Out[2]= 42
You use DumpSave to create an .mx file. So, the suggested procedure is to load your data into Mathematica file by file, create hashes (you could use SubValues to index a particular hash-table by the index of your file), and then save those definitions into .mx files. This way you get fast loading and fast search, and you have the freedom to decide which part of your data to keep loaded in Mathematica at any given time (pretty much without the performance hit normally associated with file loading).
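The SubValues idea could look roughly like the sketch below, where the first pair of brackets selects the file and the second holds the trigram. The names fileHash and addFile are hypothetical, and the two tiny data sets are made up to stand in for real files.

```mathematica
Clear[fileHash, addFile];

(* Hypothetical helper: store the tallies of one file under its index *)
addFile[id_, data_List] :=
  Scan[(fileHash[id][Sequence @@ First[#]] = Last[#]) &, data];

(* Made-up miniature data sets standing in for two real files *)
addFile[1, {{{"meaning", "of", "life"}, 42}, {{"i", "love", "you"}, 62}}];
addFile[2, {{{"i", "love", "you"}, 7}}];

fileHash[1]["meaning", "of", "life"]  (* 42 *)
fileHash[2]["i", "love", "you"]       (* 7 *)

(* The definitions attached to fileHash can then be saved as before, *)
(* e.g. DumpSave["C:\\Temp\\trigrams.mx", fileHash]                   *)
```

Keeping one symbol per group of files, rather than one symbol for everything, is one way to keep the choice of what to load at any given time.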