
SOLR: Create term vector (like data returned from TermVectorComponent) from raw text

Tags: solr

Using http://wiki.apache.org/solr/TermVectorComponent I can get the indexed terms and their frequencies for any document stored in my index. How can I get the same information for arbitrary text without adding it to the index? I just want SOLR to analyze the text and return the terms and frequencies.

Achim asked Oct 04 '22 05:10



2 Answers

AFAIK this isn't possible without storing data in SOLR.

If you are looking to do text analysis (I understand this is broader than what you asked for), I would recommend the alternatives below:

  1. MAUI - keyphrase and terminology extraction
  2. Gensim - topic modelling
  3. Kea - keyword extraction

I've also come across some Python scripts that do term frequency analysis. Have a look at Mincemeat, particularly its example, which does a term frequency calculation.
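For the term-frequency part on its own, a minimal Python sketch using only the standard library might look like the one below. It is not taken from Mincemeat or from Solr. The function name `term_frequencies` is made up for illustration, and the tokenizer (lowercase, split on anything that is not a letter or digit) is an assumption that only roughly approximates what a Solr analyzer chain does.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Return a Counter mapping each lowercased token to its count in text."""
    # Assumption: a token is a run of ASCII letters/digits. Real Solr analyzers
    # (stemming, stop words, synonyms) would produce different terms.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


if __name__ == "__main__":
    sample = "Solr indexes text. Solr returns term vectors for indexed text."
    # Print terms from most to least frequent.
    for term, count in term_frequencies(sample).most_common():
        print(term, count)
```

Nothing is stored anywhere: the text goes in, the counts come out. The results will differ from Solr's as soon as the field type uses stemming or stop-word filtering.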

Srikanth Venugopalan answered Oct 05 '22 20:10



From what you describe, I conclude that you actually need a search library, not a full search engine (service). That library is Lucene. Perhaps this will help for starters: "How to extract Document Term Vector in Lucene 3.5.0". You could keep the index in RAM just long enough to compute the term vector and then discard it.
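The "build it in memory, read the term vector, throw it away" idea can be sketched without any library. The Python below is a plain stand-in for illustration, not Lucene's API. The function name `term_vector` and the simple regex tokenizer are assumptions. It records, per term, the frequency, token positions, and character offsets, which is the kind of data TermVectorComponent can return.

```python
import re
from collections import defaultdict


def term_vector(text):
    """Map each term to its frequency, token positions, and character offsets."""
    vector = defaultdict(lambda: {"tf": 0, "positions": [], "offsets": []})
    # Assumption: tokens are runs of ASCII letters/digits, lowercased.
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        entry = vector[match.group().lower()]
        entry["tf"] += 1
        entry["positions"].append(position)
        entry["offsets"].append((match.start(), match.end()))
    # Return a plain dict; nothing is persisted once the caller drops it.
    return dict(vector)


if __name__ == "__main__":
    for term, info in term_vector("To be or not to be").items():
        print(term, info)
```

With Lucene itself, the analyzer configured for the field would replace the regex tokenizer, so the terms would match what Solr indexes.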

D_K answered Oct 05 '22 19:10
