Basically I'm trying to index Word or PDF documents in Solr and found the ExtractingRequestHandler, but I can't figure out how to write C# code that performs the HTTP POST request shown in the Solr wiki: http://wiki.apache.org/solr/ExtractingRequestHandler.
I've installed Solr 3.4 on Tomcat 7 (7.0.22) using the files from the example/solr directory in the Solr zip and I haven't altered anything. The ExtractingRequestHandler should be configured out of the box in the solrconfig.xml and ready to use, right?
Can someone give a C# (HttpWebRequest) example of how to make the HTTP POST request and upload a PDF file, as is done with curl in the Solr wiki?
I've looked all over this site and many others for an example or a tutorial on how this is done, but haven't found anything.
EDIT:
I finally managed to get it to work using SolrNet!
In order for it to work you need to copy this to a lib-folder in your Solr installation directory from the Solr zip:
With SolrNet 0.4.0 beta 2, this code does the job:
Startup.Init<IndexDocument>("YOUR-SOLR-SERVICE-PATH");
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<IndexDocument>>();
using (FileStream fileStream = File.OpenRead("FILE-PATH-FOR-THE-FILE-TO-BE-INDEXED"))
{
    var response =
        solr.Extract(
            new ExtractParameters(fileStream, "doc1")
            {
                ExtractFormat = ExtractFormat.Text,
                ExtractOnly = false
            });
}
solr.Commit();
Sorry for the trouble. I hope however that others will find this useful.
A Solr index can accept data from many different sources, including XML files, comma-separated value (CSV) files, data extracted from tables in a database, and files in common file formats such as Microsoft Word or PDF.
I would recommend using the SolrNet client. It supports the ExtractingRequestHandler.
Here is the deprecated repo on code.google.com.
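If you would rather stay with plain HttpWebRequest, as asked in the question, something along these lines might work. It is only a minimal sketch, not tested against Solr 3.4: it assumes the handler is mapped to /update/extract as in the example solrconfig.xml, that Solr is reachable at a base URL such as http://localhost:8080/solr, and that sending the file as the raw request body with a matching Content-Type is acceptable (the wiki shows this with curl --data-binary, in addition to the multipart -F form). The class and method names (SolrExtractClient, BuildExtractUrl, PostFile) are made up for illustration.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, not part of Solr or SolrNet.
public static class SolrExtractClient
{
    // Builds the extract URL. literal.id supplies the unique key for the document;
    // commit=true asks Solr to commit right after indexing.
    public static string BuildExtractUrl(string solrBaseUrl, string documentId, bool commit)
    {
        return solrBaseUrl.TrimEnd('/') + "/update/extract"
            + "?literal.id=" + Uri.EscapeDataString(documentId)
            + "&commit=" + (commit ? "true" : "false");
    }

    // Posts the file as the raw request body and returns Solr's response text.
    // GetResponse() throws a WebException if Solr answers with an error status.
    public static string PostFile(string solrBaseUrl, string documentId,
                                  string filePath, string contentType)
    {
        string url = BuildExtractUrl(solrBaseUrl, documentId, true);
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = contentType; // e.g. "application/pdf"

        using (FileStream file = File.OpenRead(filePath))
        {
            request.ContentLength = file.Length;
            using (Stream body = request.GetRequestStream())
            {
                file.CopyTo(body); // Stream.CopyTo requires .NET 4.0 or later
            }
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A call would then look like SolrExtractClient.PostFile("http://localhost:8080/solr", "doc1", "FILE-PATH-FOR-THE-FILE-TO-BE-INDEXED", "application/pdf"). The URL building is kept separate from the upload so it can be checked without a running Solr instance.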