
Getting started with Solr

I'm trying to get started with Apache Solr, but some things are not clear to me. Reading through the tutorial, I've set up a running Solr instance. What I find confusing is that all the configuration of Solr (schemas and so on) is in XML format. When they add sample data, the tutorial shows how to add XML documents (java -jar post.jar solr.xml monitor.xml). Is it just a bad choice of sample format? I mean, are they uploading data describing documents, or are the actual documents they're adding .xml files?

I'm trying to add some books in .txt format, so if I use java -jar post.jar mydoc.txt, am I adding it? How could I add this document and metadata (author, title) about it?

That said, I tried to set up a simple HTML page to post documents to Solr:

<html>
  <head></head>
<body>
  <form action="http://localhost:8983/solr/update?commit=true" enctype="multipart/form-data" method="post">
    <input type="file">
    <input type="submit" value="Send">
  </form>
</body>
</html>

When I try to post a file, I get this response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">26</int>
  </lst>
</response>

Is this correct? Does it mean that I've successfully added my file? If so, one of the words in the file is, for example, "montagna" (this is an Italian book; montagna means mountain). If I visit the URL

http://localhost:8983/solr/select/?q=montagna&start=0&rows=10&indent=on

I expect something to be returned (the whole text maybe, or some info about the file), but this is what I get:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">montagna</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>

That doesn't seem like a match to me. Also, according to this answer, I should be able to get back the text surrounding the matches with hl.fragsize. How do I integrate this into the query string? Thank you.

pistacchio asked Feb 06 '12 08:02


2 Answers

The Solr example adds documents to the index through XML messages. Have a look here. The *.xml files you mentioned are simply XML messages stored on the file system. Those messages look like this:

<add>
  <doc>
    <field name="id">UTF8TEST</field>
    <field name="name">Test with some UTF-8 encoded characters</field>
    <field name="manu">Apache Software Foundation</field>
    <field name="cat">software</field>
    <field name="cat">search</field>
    <field name="features">No accents here</field>
    <field name="price">0</field>
    <!-- no popularity, get the default from schema.xml -->
    <field name="inStock">true</field>
  </doc>
</add>

It's just a way to represent any kind of document to index. Every document contains one or more fields. There are different ways to add documents to Solr (for example, it also accepts CSV), but the most common format nowadays is XML.

I think you aren't actually indexing anything. You can check the output of this query: http://localhost:8983/solr/select/?q=*:* which retrieves all the documents you have in your index. A common error is forgetting to commit, but I saw you added the commit=true parameter to your URL, so that's not the problem here.
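
As a rough sketch of that check, the snippet below reads the numFound attribute out of a Solr XML response. The helper name num_found is made up for illustration, and the sample string is a trimmed copy of the response from the question rather than live output; in practice you would fetch the XML from the q=*:* URL above.

```python
import xml.etree.ElementTree as ET


def num_found(response_xml: str) -> int:
    """Return the numFound attribute of the <result> element in a Solr XML response."""
    root = ET.fromstring(response_xml)
    result = root.find("result")
    if result is None:
        raise ValueError("no <result> element in the response")
    return int(result.get("numFound"))


# Trimmed copy of the response shown in the question (not fetched from a server).
sample = """<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>"""

print(num_found(sample))
```

If this still reports zero for q=*:*, nothing was indexed, which points at the upload rather than at the search.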

If you want to index just the content of a text file, you could for example define your schema with two fields:

  • filename
  • content

and use this message to index your document:

<add>
  <doc>
    <field name="filename">test.txt</field>
    <field name="content">Test with some UTF-8 encoded characters</field>
  </doc>
</add>
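
If you would rather generate that message than write it by hand, here is a minimal sketch using only the Python standard library. The function names build_add_message and post_to_solr are invented for this example, the field names assume the two-field schema described above, and the update URL is the one from the question. Building the XML with ElementTree means characters such as & and < in the book text get escaped for you.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_message(fields: dict) -> str:
    """Build an <add><doc>...</doc></add> update message from a name -> value mapping."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


def post_to_solr(message: str,
                 url: str = "http://localhost:8983/solr/update?commit=true") -> int:
    """POST the message as text/xml and return the HTTP status (needs a running Solr)."""
    request = urllib.request.Request(
        url,
        data=message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# In real use the content would come from open("mydoc.txt", encoding="utf-8").read().
message = build_add_message({
    "filename": "test.txt",
    "content": "Una montagna & un lago",
})
print(message)
# post_to_solr(message)  # not called here: requires a Solr instance at the URL above
```

Extra metadata such as author or title would just be more entries in the mapping, provided the schema defines fields with those names.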

javanna answered Oct 20 '22 00:10


First, understand the terminology:

Document in Solr -> Row in an RDBMS
Field of a document -> Column (cell) of a row

A Solr core is, of course, both the database and one gigantic table, populated in a (potentially) sparse manner.

For your particular use case, you would create one document per file, composed of an ID, the file content, and so on.


XML is one way of composing Solr operations: http://wiki.apache.org/solr/UpdateXmlMessages

It has the add, delete, commit and optimize operations. The add operation includes one or more documents.

<add>
  <doc>
    <field name="employeeId">05991</field>
    <field name="office">Bridgewater</field>
    <field name="skills">Perl</field>
    <field name="skills">Java</field>
  </doc>
  [<doc> ... </doc>[<doc> ... </doc>]]
</add>

There are also CSV (add functionality only), JSON (full functionality), and DIH (the DataImportHandler, for scheduled database imports).
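
For comparison, here is a hedged sketch of the same employee document expressed as a JSON update message. It assumes the {"add": {"doc": ...}} shape accepted by Solr's JSON update handler; the helper name build_json_add is made up, and the handler path and content type you post to depend on your Solr version, so check that against your own setup. Multi-valued fields such as skills become JSON arrays.

```python
import json


def build_json_add(doc: dict) -> str:
    """Wrap a single document in a JSON 'add' command."""
    return json.dumps({"add": {"doc": doc}}, ensure_ascii=False)


payload = build_json_add({
    "employeeId": "05991",
    "office": "Bridgewater",
    "skills": ["Perl", "Java"],  # repeated XML fields map to a JSON list
})
print(payload)
```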

There is also the extracting request handler, which can extract content (and metadata) from all kinds of rich documents (DOC, DOCX, PDF). Additionally, there are literal parameters to set your own fields.
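
To illustrate the literal parameters, the sketch below only builds the request URL; it does not upload anything. It assumes the handler is mapped to /update/extract, as in the example configuration, and that your schema has id, author and title fields. Those field names, the sample values, and the helper name extract_url are all hypothetical. The file itself would then be sent as the body of a POST to that URL.

```python
from urllib.parse import urlencode


def extract_url(base: str, literals: dict, commit: bool = True) -> str:
    """Build an extracting-request-handler URL with literal.<field>=value parameters."""
    params = [(f"literal.{name}", value) for name, value in literals.items()]
    if commit:
        params.append(("commit", "true"))
    return f"{base}/update/extract?{urlencode(params)}"


url = extract_url(
    "http://localhost:8983/solr",
    {"id": "book1", "author": "Mario Rossi", "title": "La montagna"},
)
print(url)
```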


The extracting request handler stores its output in the field text. The query parser (q=) and the highlighter both assume a default field of text (yes, this is pertinent to what you did). You can specify the fields for each of them, as well as the fields Solr returns to you in the results.
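
Putting that together for the montagna search, a query URL with explicit fields and highlighting might be assembled as follows. This is a sketch under assumptions: the helper name highlight_query_url is made up, the field defaults to text as described above, and fragsize=100 and fl=*,score are arbitrary illustrative values. The parameters hl, hl.fl, hl.fragsize and fl are standard Solr request parameters.

```python
from urllib.parse import urlencode


def highlight_query_url(base: str, term: str, field: str = "text",
                        fragsize: int = 100, fl: str = "*,score") -> str:
    """Build a select URL that searches one field and highlights matches in it."""
    params = {
        "q": f"{field}:{term}",   # search this field explicitly, not the default one
        "hl": "true",             # turn highlighting on
        "hl.fl": field,           # field to take snippets from
        "hl.fragsize": fragsize,  # approximate snippet length in characters
        "fl": fl,                 # fields returned for each hit
        "start": 0,
        "rows": 10,
        "indent": "on",
    }
    return f"{base}?{urlencode(params)}"


url = highlight_query_url("http://localhost:8983/solr/select/", "montagna")
print(url)
```

Highlighting generally needs the field to be stored, so a field that is only indexed will match but return no snippets.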

Jesvin Jose answered Oct 19 '22 23:10