I am using Nutch to crawl websites, and I want to parse specific sections of the HTML pages it crawls. For example:
<h><title> title to search </title></h>
<div id="abc">
content to search
</div>
<div class="efg">
other content to search
</div>
I want to extract the content of the div element with id="abc", the div element with class="efg", and so on.
I know that I have to create a plugin for custom parsing, because the HTML parser plugin provided by Nutch removes all HTML tags, CSS and JavaScript and leaves only the text content. I referred to this blog post, http://sujitpal.blogspot.in/2009/07/nutch-custom-plugin-to-parse-and-add.html, but it covers parsing by HTML tag, whereas I want to select HTML tags whose attributes have specific values. I found Jericho mentioned as useful for parsing specific HTML tags, but I could not find any example of a Nutch plugin that uses Jericho.
I need some guidance on how to devise a strategy for parsing HTML pages based on tags whose attributes have specific values.
You can use this plugin to extract data from your pages based on CSS selectors:
https://github.com/BayanGroup/nutch-custom-search
In your example, you can configure it in this way:
<config>
<fields>
<field name="custom_content" />
</fields>
<documents>
<document url=".+" engine="css">
<extract-to field="custom_content">
<text>
<expr value="#abc" />
</text>
<text>
<expr value=".efg" />
</text>
</extract-to>
</document>
</documents>
</config>
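If you would rather write the plugin yourself, one option is to walk the DOM that Nutch hands to an HtmlParseFilter (its filter method receives the parsed page as an org.w3c.dom.DocumentFragment) and collect the text of elements whose id or class matches. The following is only a minimal standalone sketch of that traversal, not the plugin's or Nutch's own code: the class name AttributeTextExtractor and its methods are hypothetical, and the demo builds its DOM from a well-formed snippet with the JDK XML parser purely so it can run without Nutch. It also assumes attribute names arrive in lowercase, which you should check against the HTML parser you have configured.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Nutch or of the nutch-custom-search plugin.
public class AttributeTextExtractor {

    // Returns the trimmed text of every element whose id equals wantedId
    // or whose class attribute contains wantedClass as one of its tokens.
    public static List<String> extract(Node root, String wantedId, String wantedClass) {
        List<String> out = new ArrayList<>();
        walk(root, wantedId, wantedClass, out);
        return out;
    }

    private static void walk(Node node, String wantedId, String wantedClass, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // getAttribute returns "" when the attribute is absent, so no null checks are needed.
            boolean idMatch = wantedId.equals(el.getAttribute("id"));
            boolean classMatch = Arrays.asList(el.getAttribute("class").trim().split("\\s+"))
                    .contains(wantedClass);
            if (idMatch || classMatch) {
                out.add(el.getTextContent().trim());
                return; // text of nested elements is already included, so do not descend
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, wantedId, wantedClass, out);
        }
    }

    // Demo-only helper: parses a well-formed snippet so there is a DOM to walk.
    // Inside a Nutch plugin you would pass the DocumentFragment you are given instead.
    public static Node parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div id=\"abc\">content to search</div>"
                + "<div class=\"efg\">other content to search</div>"
                + "<div class=\"zzz\">ignored</div>"
                + "</body></html>";
        System.out.println(extract(parse(page), "abc", "efg"));
    }
}
```

In an actual plugin you would call something like extract from your filter implementation and store the result in the parse metadata so an indexing filter can add it as a field. The CSS-based plugin above does the equivalent through configuration only.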