Association mining seems to give good results for retrieving related terms in text corpora. There are several works on this topic including well-known LSA method. The most straightforward way to mine associations is to build co-occurrence matrix of docs X terms
and find terms that occur in the same documents most often. In my previous projects I implemented it directly in Lucene by iteration over TermDocs (I got it by calling IndexReader.termDocs(Term)). But I can't see anything similar in Solr.
So, my needs are:
I will rate answers in the following way:
The Terms Component provides access to the indexed terms in a field and the number of documents that match each term. This can be useful for building an auto-suggest feature or any other feature that operates at the term level instead of the search or document level.
Standard solr queries use the "q" parameter in a request. Filter queries use the "fq" parameter. The primary difference is that filtered queries do not affect relevance scores; the query functions purely as a filter (docset intersection, essentially).
Unlike Lucene, Solr is a web application (WAR) which can be deployed in any servlet container, e.g. Jetty, Tomcat, Resin, etc. Solr can be installed and used by non-programmers. Lucene cannot.
The df stands for default field , while the qf stands for query fields . The field defined by the df parameter is used when no fields are mentioned in the query. For example if you are running a query like q=solr and you have df=title the query itself will actually be title:solr .
You can export a Lucene (or Solr) index to Mahout, and then use Latent Dirichlet Allocation. If LDA is not close enough to LSA for your needs, you can just take the correlation matrix from Mahout, and then use Mahout to take the singular value decomposition.
I don't know of any LSA components for Solr.
Since there are still no answers to my questions, I have to write my own thoughts and accept it. Nevertheless, if someone propose better solution, I'll happily accept it instead of mine.
I'll go with co-occurrence matrix, since it is the most principal part of association mining. In general, Solr provides all needed functions for building this matrix in some way, though they are not as efficient as direct access with Lucene. To construct matrix we need:
Both these tasks may be easily done with standard Solr components.
To retrieve terms TermsComponent or faceted search may be used. We can get only top terms (by default) or all terms (by setting max number of terms to take, see documentation of particular feature for details).
Getting documents with the term in question is simply search for this term. The weak point here is that we need 1 request per term, and there may be thousands of terms. Another weak point is that neither simple, nor faceted search do not provide information about the count of occurrences of the current term in found document.
Having this, it is easy to build co-occurrence matrix. To mine association it is possible to use other software like Weka or write own implementation of, say, Apriori algorithm.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With