
Building a tag cloud with Solr

Dear Stack Overflow community,

Given some text, I wish to get the TOP 50 most frequent words in the text, and create a tag cloud out of it, and thus show the gist of what the text is about in a graphical way.

The text is actually a set of about 100 comments per item (a picture), and there are about 120 items. I also want to keep the cloud up to date by keeping the comments indexed and running the cloud generation code each time a new web request arrives.

I settled on using Solr to index the text, and I am now wondering how to get the top 50 words out of Solr's TermVectorComponent. Here is an example of the results returned by the term vector component after you turn on term frequency with tv.tf=true:

  <lst name="doc-5">
    <str name="uniqueKey">MA147LL/A</str>    
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="earbud"><tf>3</tf></lst>
      <lst name="headphon"><tf>10</tf></lst>
      <lst name="usb"><tf>11</tf></lst>
    </lst>
  </lst>

  <lst name="doc-9">
    <str name="uniqueKey">3007WFP</str>
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="usb"><tf>4</tf></lst>
    </lst>
  </lst>

As you can see, I have two problems:

  1. I get all the terms within the document for that field, not just the top 100.
  2. They are not sorted by frequency, so I have to fetch the terms and sort them in memory to do what I'm trying to do.
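
For reference, the in-memory fallback from point 2 might look roughly like the sketch below. It assumes the response has exactly the shape quoted above (per-term <tf> elements inside a field list); the <root> wrapper and the top_terms helper are hypothetical names added only so the sample parses as a single XML document.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample shaped like the term vector output quoted above, wrapped in a
# hypothetical <root> element so it parses as one XML document.
SAMPLE = """
<root>
  <lst name="doc-5">
    <str name="uniqueKey">MA147LL/A</str>
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="earbud"><tf>3</tf></lst>
      <lst name="headphon"><tf>10</tf></lst>
      <lst name="usb"><tf>11</tf></lst>
    </lst>
  </lst>
  <lst name="doc-9">
    <str name="uniqueKey">3007WFP</str>
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="usb"><tf>4</tf></lst>
    </lst>
  </lst>
</root>
"""


def top_terms(xml_text, field, n):
    """Sum tf per term across all documents and return the n largest."""
    totals = Counter()
    root = ET.fromstring(xml_text)
    for doc in root.findall("lst"):            # one <lst> per document
        for field_node in doc.findall("lst"):  # one <lst> per field
            if field_node.get("name") != field:
                continue
            for term in field_node.findall("lst"):
                tf = term.find("tf")
                if tf is not None:
                    totals[term.get("name")] += int(tf.text)
    return totals.most_common(n)


# Weighted (term, total tf) pairs for the cloud generator.
print(top_terms(SAMPLE, "includes", 3))
```

This still pulls every term over the wire and sorts on the client, which is exactly the cost I would like to avoid.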

Is there a better way? Can I tell the Solr term vector component to sort the terms and return only the top 100 for me? Or is there some other framework I can use? I need to index new comments as they come in so that the tag cloud is always up to date. As for the cloud generator, it takes a dictionary of weighted words and turns it into a nice image.

This answer does not help.

EDIT - trying out jpountz's and Paige Cook's answers

Here is the result I got for this query:

    select?q=Id:d4439543-afd4-42fb-978a-b72eab0c07f9&facet=true&facet.field=Post_Content&facet.minCount=1&facet.limit=50

<int name="also">1</int>
<int name="ani">1</int>
<int name="anoth">1</int>
<int name="atleast">1</int>
<int name="base">1</int>
<int name="bcd">1</int>
<int name="becaus">1</int>
<int name="better">1</int>
<int name="bigger">1</int>
<int name="bio">1</int>
<int name="boot">1</int>
<int name="bootabl">1</int>
<int name="bootload">1</int>
<int name="bootscreen">1</int>

I got 50 such elements. @jpountz, thanks for helping me limit the results, but why do all fifty of the individual <int> elements hold the value 1? My guess is that the number 1 is the count of documents matching my query (which can only be one, since I queried by Id:Guid) and not the frequency of the word in Post_Content.

To check this, I removed the Id:GUID clause from the query, and the result was:

<int name="content">33</int>
<int name="can">17</int>
<int name="on">16</int>
<int name="so">16</int>
<int name="some">16</int>
<int name="all">15</int>
<int name="i">15</int>
<int name="do">14</int>
<int name="have">14</int>
<int name="my">14</int>

My problem is how to get the term frequency within a document, not the document frequency of each term. For example, I know for a fact that "bootable" is a word I used 6 times in Post_Content, so I want sorted pairs like (6, "bootable"), (5, "disc") for a set of documents.
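
To make the distinction concrete, here is a small sketch in plain Python with no Solr involved. The two toy posts and the helper names are made up for illustration; the point is only that a facet-style count (documents containing the term) and a term frequency (total occurrences) diverge as soon as a word repeats within one document.

```python
from collections import Counter

# Hypothetical toy documents standing in for Post_Content values.
DOCS = {
    "post-1": "bootable disc bootable usb bootable",
    "post-2": "usb disc",
}


def document_frequency(docs):
    """Count how many documents contain each term (what the facet returned)."""
    df = Counter()
    for text in docs.values():
        df.update(set(text.split()))  # set(): each doc counts a term once
    return df


def term_frequency(docs):
    """Count every occurrence of each term (what the tag cloud needs)."""
    tf = Counter()
    for text in docs.values():
        tf.update(text.split())
    return tf


df = document_frequency(DOCS)
tf = term_frequency(DOCS)

# "bootable" appears in one document but three times in total.
print(df["bootable"], tf["bootable"])

# Sorted (count, term) pairs in the shape I am after.
pairs = sorted(((count, term) for term, count in tf.items()), reverse=True)
print(pairs)
```

With a query restricted to a single Id, the document frequency of every term can only be 0 or 1, which matches the all-ones result above.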

Zasz asked Sep 06 '11 10:09


2 Answers

Here is an article that describes setting up a Tag Cloud - Creating a Tag Cloud with Solr and PHP. While the PHP portion may not be applicable to you, the actual generation of the tag cloud I believe is...

This article describes a method of creating a text field with a whitespace tokenizer to return individual words and then performing a facet search against this field. I know that you can set facet limits, so in your case you could request only the top 100 results.

Paige Cook answered Oct 16 '22 12:10


If a Lucene document is a comment, you could use faceting to do so. For example, the following request http://solr:port/solr/select?q={!lucene}uniqueKey:(MA147LL/A OR 3007WFP)&facet=true&facet.field=includes&facet.minCount=1&facet.limit=50 would help you build a tag cloud for comments MA147LL/A and 3007WFP.

However, this approach would:

  • make Solr instantiate an UnInvertedField instance for the includes field, which requires memory,
  • count the number of documents which match a term instead of the total number of occurrences of this term.
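
As a rough sketch of the client side of this approach, the snippet below builds the facet query string and turns a facet response into the weighted dictionary a cloud generator expects. It assumes the JSON response writer (wt=json) with its default flat [term, count, term, count, ...] layout; the helper names and the sample response are illustrative only, and the sample counts are copied from the question. It also uses the all-lowercase spelling facet.mincount, which is how I believe the Solr reference documents that parameter, whereas the request above writes facet.minCount. The resulting weights are still document counts, per the second caveat.

```python
from urllib.parse import urlencode


def facet_query(q, field, limit=50):
    """Build the query string for a facet request on one field."""
    params = {
        "q": q,
        "rows": 0,              # only the facet counts are needed
        "wt": "json",
        "facet": "true",
        "facet.field": field,
        "facet.mincount": 1,    # assumed spelling, see note above
        "facet.limit": limit,
    }
    return urlencode(params)


def facet_weights(response, field):
    """Turn the flat [term, count, ...] facet list into {term: count}."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Illustrative response fragment, not captured from a real server.
SAMPLE_RESPONSE = {
    "facet_counts": {
        "facet_fields": {
            "Post_Content": ["content", 33, "can", 17, "on", 16],
        }
    }
}

print(facet_query("*:*", "Post_Content"))
print(facet_weights(SAMPLE_RESPONSE, "Post_Content"))
```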

jpountz answered Oct 16 '22 13:10