
Index PDF documents in Solr from a C# client

I'm trying to index Word or PDF documents in Solr and found the ExtractingRequestHandler, but I can't figure out how to write C# code that performs the HTTP POST request shown in the Solr wiki: http://wiki.apache.org/solr/ExtractingRequestHandler.

I've installed Solr 3.4 on Tomcat 7 (7.0.22) using the files from the example/solr directory in the Solr zip and I haven't altered anything. The ExtractingRequestHandler should be configured out of the box in the solrconfig.xml and ready to use, right?

Can someone give a C# (HttpWebRequest) example of how to make the HTTP POST request and upload a PDF file, as is done with curl in the Solr wiki?

I've looked all over this site and many others for an example or a tutorial on how this is done, but haven't found anything.

EDIT:

I finally managed to get it to work using SolrNet!

For it to work, you need to copy the following from the Solr zip into a lib folder in your Solr installation directory:

  • the apache-solr-cell-3.4.0.jar file from the dist folder
  • the contents of the contrib\extraction\lib directory

With SolrNet 0.4.0 beta 2, this code does the job:

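using System.IO;
using Microsoft.Practices.ServiceLocation;
using SolrNet;

// Register the Solr connection once at application startup.
Startup.Init<IndexDocument>("YOUR-SOLR-SERVICE-PATH");
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<IndexDocument>>();

// Stream the file to the ExtractingRequestHandler; "doc1" is the id given to the indexed document.
using (FileStream fileStream = File.OpenRead("FILE-PATH-FOR-THE-FILE-TO-BE-INDEXED"))
{
    var response =
        solr.Extract(
            new ExtractParameters(fileStream, "doc1")
            {
                ExtractFormat = ExtractFormat.Text,
                ExtractOnly = false // false: index the extracted content instead of only returning it
            });
}

// Commit so the new document becomes visible to searches.
solr.Commit();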
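
The IndexDocument type used above is not shown in this post. As a rough sketch only, a SolrNet-mapped class could look like the following; the field names "id" and "title" are assumptions and must match whatever your schema.xml actually defines.

```csharp
using SolrNet.Attributes;

// Hypothetical mapping class for illustration; adjust the field names to your schema.xml.
public class IndexDocument
{
    // Maps to the schema's unique key field (assumed here to be "id").
    [SolrUniqueKey("id")]
    public string Id { get; set; }

    // Maps to an ordinary stored field (assumed here to be "title").
    [SolrField("title")]
    public string Title { get; set; }
}
```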

Sorry for the trouble. I hope, however, that others will find this useful.

jonasm asked Jan 19 '12 23:01



1 Answer

I would recommend using the SolrNet client. It supports the ExtractingRequestHandler.

Here is the deprecated repo on code.google.com.
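
For anyone who still wants the raw HttpWebRequest route the question asks about, here is a minimal sketch. It assumes the handler is mapped at /update/extract (as in the example solrconfig.xml) and sends the file as the raw request body with a matching Content-Type, similar to the wiki's curl --data-binary example. The class name, method names, base URL, and file path are made up for illustration, and error handling is left out.

```csharp
using System;
using System.IO;
using System.Net;

public static class SolrExtractSketch
{
    // Builds the /update/extract URL; literal.id supplies the document's unique key.
    public static string BuildExtractUrl(string solrBaseUrl, string documentId, bool commit)
    {
        return solrBaseUrl.TrimEnd('/') + "/update/extract"
            + "?literal.id=" + Uri.EscapeDataString(documentId)
            + "&commit=" + (commit ? "true" : "false");
    }

    // POSTs the raw file bytes as the request body and returns Solr's response text.
    public static string PostFile(string solrBaseUrl, string documentId,
                                  string filePath, string contentType)
    {
        var request = (HttpWebRequest)WebRequest.Create(
            BuildExtractUrl(solrBaseUrl, documentId, true));
        request.Method = "POST";
        request.ContentType = contentType; // e.g. "application/pdf"

        using (FileStream file = File.OpenRead(filePath))
        {
            request.ContentLength = file.Length;
            using (Stream body = request.GetRequestStream())
            {
                file.CopyTo(body); // Stream.CopyTo needs .NET 4.0 or later
            }
        }

        // GetResponse throws a WebException on non-success status codes.
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A call might look like SolrExtractSketch.PostFile("http://localhost:8080/solr", "doc1", @"C:\docs\sample.pdf", "application/pdf"), where both the URL and the path are placeholders for your own setup.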

Paige Cook answered Sep 17 '22 17:09