Dear Stack Overflow community:
Given some text, I want to get the top 50 most frequent words in it and build a tag cloud from them, so that the gist of the text is shown graphically.
The text is actually a set of 100 or so comments per item (a picture), and there are about 120 items. I also want to keep the cloud up to date by keeping the comments indexed and running the cloud generation code each time a new web request comes in.
I settled on using Solr to index the text, and am now wondering how to get the top 50 words out of Solr's TermVectorComponent. Here is an example of the results returned by the term vector component after you turn on term frequency with tv.tf=true:
<lst name="doc-5">
<str name="uniqueKey">MA147LL/A</str>
<lst name="includes">
<lst name="cabl"><tf>5</tf></lst>
<lst name="earbud"><tf>3</tf></lst>
<lst name="headphon"><tf>10</tf></lst>
<lst name="usb"><tf>11</tf></lst>
</lst>
</lst>
<lst name="doc-9">
<str name="uniqueKey">3007WFP</str>
<lst name="includes">
<lst name="cabl"><tf>5</tf></lst>
<lst name="usb"><tf>4</tf></lst>
</lst>
</lst>
As you can see, I have 2 problems:
Is there a better way? Or can I tell the Solr term vector component to sort the terms and pick only the top 100 for me? Or is there some other framework I can use? I need to keep new comments indexed as they come in, so the tag cloud is always up to date. As for the cloud generator, it takes a dictionary of weighted words and turns it into a nice image.
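For reference, the client-side aggregation can be sketched in a few lines of Python. This is a minimal sketch, not a Solr feature: it assumes the response has exactly the shape of the excerpt above (nested `<lst>` elements with a `<tf>` child), wrapped in a hypothetical `<response>` root so that it parses as one document. A real TermVectorComponent response may be shaped differently, and `top_terms` is a made-up helper name.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def top_terms(xml_text, limit=50):
    """Sum the <tf> values per term across all documents, keep the largest `limit`."""
    root = ET.fromstring(xml_text)
    totals = Counter()
    # A <lst> with a direct <tf> child is treated as one term entry
    # (assumption based on the excerpt in the question).
    for entry in root.iter("lst"):
        tf = entry.find("tf")
        if tf is not None:
            totals[entry.get("name")] += int(tf.text)
    # most_common() sorts by descending total; dict keeps that order.
    return dict(totals.most_common(limit))


# Shortened copy of the excerpt, wrapped in a hypothetical root element.
sample = """<response>
<lst name="doc-5">
  <lst name="includes">
    <lst name="cabl"><tf>5</tf></lst>
    <lst name="earbud"><tf>3</tf></lst>
    <lst name="usb"><tf>11</tf></lst>
  </lst>
</lst>
<lst name="doc-9">
  <lst name="includes">
    <lst name="cabl"><tf>5</tf></lst>
    <lst name="usb"><tf>4</tf></lst>
  </lst>
</lst>
</response>"""

weights = top_terms(sample, limit=50)
print(weights)
```

The resulting dictionary maps each term to its summed frequency, which is the kind of weighted-word input the cloud generator expects. The cost is that every term of every matching document is transferred and summed on the client.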
This answer does not help.
EDIT - trying out jpountz's and Paige Cook's answers
Here is the result I got for this query:
select?q=Id:d4439543-afd4-42fb-978a-b72eab0c07f9&facet=true
&facet.field=Post_Content&facet.minCount=1&facet.limit=50
<int name="also">1</int>
<int name="ani">1</int>
<int name="anoth">1</int>
<int name="atleast">1</int>
<int name="base">1</int>
<int name="bcd">1</int>
<int name="becaus">1</int>
<int name="better">1</int>
<int name="bigger">1</int>
<int name="bio">1</int>
<int name="boot">1</int>
<int name="bootabl">1</int>
<int name="bootload">1</int>
<int name="bootscreen">1</int>
I got 50 such elements. @jpountz, thanks for helping limit the results, but why do all fifty of the individual <int> elements hold the value 1? My guess is that the number 1 is the count of documents matching my query (which can only be one, since I queried by Id:Guid) and does not represent the frequency of the words in Post_Content.
To check this, I removed the Id:GUID clause from the query, and the result was:
<int name="content">33</int>
<int name="can">17</int>
<int name="on">16</int>
<int name="so">16</int>
<int name="some">16</int>
<int name="all">15</int>
<int name="i">15</int>
<int name="do">14</int>
<int name="have">14</int>
<int name="my">14</int>
My problem is how to get the term frequency within the documents, not the document frequency of each term. For example, I know for a fact that "bootable" is a word I used 6 times in Post_Content, so I want sorted pairs like (6, "bootable"), (5, "disc") for a set of documents.
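The difference between the two counts can be illustrated outside Solr. The sketch below uses hypothetical, already-tokenized posts (made-up data, not taken from my index) to show why a count of documents is 1 for every term when only one document matches, while the term frequency is not.

```python
from collections import Counter

# Hypothetical token lists standing in for analyzed Post_Content values.
docs = {
    "post-1": ["bootabl", "disc", "bootabl", "usb", "bootabl"],
    "post-2": ["disc", "usb", "disc"],
}


def document_frequency(docs):
    """Per term, the number of documents that contain it at least once."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # set(): each document counts once per term
    return df


def term_frequency(docs):
    """Per term, the total number of occurrences across the documents."""
    tf = Counter()
    for tokens in docs.values():
        tf.update(tokens)
    return tf


# Restrict to a single document, like the query by Id.
single = {"post-1": docs["post-1"]}
df = document_frequency(single)
tf = term_frequency(single)

# Sorted (count, term) pairs, highest count first.
pairs = sorted(((count, term) for term, count in tf.items()), reverse=True)
print(df)
print(pairs)
```

With one matching document, every value in `df` is 1, which matches the facet output above, while `pairs` starts with the term that occurs most often.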
Here is an article that describes setting up a Tag Cloud - Creating a Tag Cloud with Solr and PHP. While the PHP portion may not be applicable to you, the actual generation of the tag cloud I believe is...
This article describes a method of creating a text field with a whitespace tokenizer, so that it returns individual words, and then performing a facet search against this field. I know that you can set facet limits, so in your case you can request only the top 100 results.
If each Lucene document is a comment, you could use faceting to do this. For example, the following request http://solr:port/solr/select?q={!lucene}uniqueKey:(MA147LL/A OR 3007WFP)&facet=true&facet.field=includes&facet.minCount=1&facet.limit=50
would help you build a tag cloud for comments MA147LL/A and 3007WFP.
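A request like this can be built and its result turned into a word-to-count dictionary roughly as follows. This is only a sketch under assumptions: `facet_url` and `facet_weights` are hypothetical helper names, the ids are quoted because one of them contains a slash, the `{!lucene}` prefix is left out, `wt=json` and `rows=0` are added, and the facet values are assumed to come back as a flat list alternating term and count. The Solr documentation spells the minimum-count parameter `facet.mincount`, so that spelling is used here. The sample response is hand-written, not real output.

```python
from urllib.parse import urlencode


def facet_url(base, ids, field="includes", limit=50):
    """Build a select URL that facets on `field` for the given document ids."""
    # Quote each id so characters such as "/" are taken literally.
    id_clause = " OR ".join('"%s"' % doc_id for doc_id in ids)
    params = {
        "q": "uniqueKey:(%s)" % id_clause,
        "rows": 0,              # only the facet counts are needed
        "facet": "true",
        "facet.field": field,
        "facet.mincount": 1,
        "facet.limit": limit,
        "wt": "json",
    }
    return base + "/select?" + urlencode(params)


def facet_weights(response, field="includes"):
    """Turn a flat [term, count, term, count, ...] facet list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


url = facet_url("http://solr:port/solr", ["MA147LL/A", "3007WFP"])

# Hand-written sample in the assumed flat JSON layout (not real output).
sample_response = {
    "facet_counts": {
        "facet_fields": {
            "includes": ["cabl", 2, "usb", 2, "earbud", 1, "headphon", 1]
        }
    }
}
weights = facet_weights(sample_response)
print(url)
print(weights)
```

Note that each count here is the number of matching documents containing the term, not the number of times the term occurs.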
However, this approach would :
includes
field, which required memory,