
How to parse HTML with Nutch and index a specific tag to Solr?

I have installed Nutch and Solr to crawl a website and search it. As you know, we can index the meta tags of web pages into Solr with Nutch's parse-metatags plugin (http://wiki.apache.org/nutch/IndexMetatags). Is there any way (a plugin or anything else) to index another HTML tag, one that isn't a meta tag, into Solr? For example:

<div id=something>
      me specific tag
</div>

That is, I want to add a field (something) to Solr whose value for this page is "me specific tag".

Any ideas?

Amir asked Sep 09 '12 12:09



2 Answers

I wrote my own plugin for something similar to what you want. The config file that maps a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml, and you can add your own fields there. You still have to populate those fields somewhere, though, which is what the plugin is for.

Here are some tips for writing the plugin:

  • Read http://wiki.apache.org/nutch/WritingPluginExample, which shows how to build a plugin in a simple way.
  • In your plugin, extend ParseFilter and IndexingFilter.
  • In YourParseFilter, you can use NodeWalker to find your specific div.
  • Put the parsed information into the page metadata like this:

    page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));

  • In YourIndexingFilter, read the metadata back from the page (page.getMetadata) and add it to the NutchDocument:

    doc.add("your_specific_tag", value);

  • Most important: add your_specific_tag to the fields of:

    • Solr config file schema.xml (and restart Solr)

    <field name="your_specific_tag" type="string" stored="true" indexed="true"/>

    • Nutch config file schema.xml (I don't know if this is really necessary)
    • Nutch config file solrindex-mapping.xml

    <field dest="your_specific_tag" source="your_specific_tag"/>
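
To make the parse and index steps above more concrete, here is a minimal stand-alone sketch that uses only the JDK. It is not Nutch code: plain recursion over org.w3c.dom nodes is swapped in for NodeWalker, a Map stands in for the page metadata, and DivFieldSketch, findText, store, load and extract are hypothetical names made up for illustration. It also assumes well-formed markup (quoted attributes), because it uses the JDK XML parser rather than Nutch's HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical stand-alone sketch; none of these names come from Nutch.
class DivFieldSketch {

    // Depth-first search for the first element with the given tag name and id.
    // In a real parse filter, NodeWalker would replace this recursion.
    static String findText(Node node, String tag, String id) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if (tag.equalsIgnoreCase(el.getTagName()) && id.equals(el.getAttribute("id"))) {
                return el.getTextContent().trim();
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String found = findText(children.item(i), tag, id);
            if (found != null) {
                return found;
            }
        }
        return null; // no matching element below this node
    }

    // Parse-filter side: keep the extracted text as UTF-8 bytes under a key.
    // The Map is an assumed stand-in for the page metadata.
    static void store(Map<String, ByteBuffer> metadata, String key, String value) {
        metadata.put(key, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    // Indexing-filter side: decode the bytes back into a String.
    // duplicate() leaves the stored buffer's position untouched for later reads.
    static String load(Map<String, ByteBuffer> metadata, String key) {
        ByteBuffer buf = metadata.get(key);
        return buf == null ? null : StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    // Parses well-formed markup with the JDK XML parser and runs the search.
    static String extract(String xhtml, String tag, String id) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            return findText(root, tag, id);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"something\">\n"
                + "      me specific tag\n"
                + "</div></body></html>";
        Map<String, ByteBuffer> metadata = new HashMap<>();
        store(metadata, "something", extract(page, "div", "something"));
        // This decoded value is what you would pass to doc.add("your_specific_tag", value).
        System.out.println(load(metadata, "something"));
    }
}
```

Run as-is, this should print the trimmed text of the div. In the actual plugin, findText would work on the DOM that Nutch gives to your parse filter, and the store/load pair corresponds to page.putToMetadata and page.getMetadata from the tips above.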

Babu answered Oct 11 '22 21:10



Just try http://lifelongprogrammer.blogspot.in/2013/08/nutch2-crawl-and-index-extra-tag.html. The tutorial uses the img tag as its example and walks through all the steps involved.

Arul Pandian answered Oct 11 '22 21:10
