I was looking for a framework to crawl articles, and I found Nutch 2.1. Here is my plan, with a question for each step:
1. Add the article list pages into url/seed.txt. Here is one problem: what I actually want indexed is the article pages, not the article list pages. But if I don't allow the list pages to be crawled, Nutch will do nothing, because the list pages are the entry points. So how can I index only the article pages, without the list pages?
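One common approach is to let the list pages be fetched and parsed (so their outlinks are followed) but drop them at indexing time, for example in a custom IndexingFilter that returns null for list-page URLs. A minimal sketch of the URL-classification core such a filter could use; the class name and the URL patterns are made-up examples, not anything from Nutch itself:

```java
import java.util.regex.Pattern;

// Hypothetical helper: decide whether a URL is an article page
// (index it) or a list page (crawl it, but do not index it).
// The patterns below are placeholders for your real site layout.
public class ArticleUrlClassifier {
    // e.g. http://news.example.com/article/12345.html
    private static final Pattern ARTICLE =
        Pattern.compile(".*/article/\\d+\\.html$");
    // e.g. http://news.example.com/list/politics?page=3
    private static final Pattern LIST =
        Pattern.compile(".*/list/.*");

    public static boolean shouldIndex(String url) {
        return ARTICLE.matcher(url).matches() && !LIST.matcher(url).matches();
    }
}
```

Inside an indexing filter, returning null whenever shouldIndex(url) is false keeps the list pages out of Solr while their outlinks are still followed during the crawl. (Filtering in regex-urlfilter.txt would not work here, because that would stop the list pages from being fetched at all.)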
2. Write a plugin to parse out the 'author', 'date', 'article body', 'headline', and maybe other information from the HTML. The 'Parser' plugin interface in Nutch 2.1 is: Parse getParse(String url, WebPage page), and the 'WebPage' class has some predefined attributes:
public class WebPage extends PersistentBase {
// ...
private Utf8 baseUrl;
// ...
private ByteBuffer content; // <== This becomes null in IndexFilter
// ...
private Utf8 title;
private Utf8 text;
// ...
private Map<Utf8,Utf8> headers;
private Map<Utf8,Utf8> outlinks;
private Map<Utf8,Utf8> inlinks;
private Map<Utf8,Utf8> markers;
private Map<Utf8,ByteBuffer> metadata;
// ...
}
So, as you can see, there are five maps I could put my extracted attributes into. But 'headers', 'outlinks', and 'inlinks' don't seem meant for this. Maybe I could put that information into 'markers' or 'metadata'. Are they designed for this purpose?
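Since the metadata map's values are ByteBuffers, custom string fields have to be encoded on the way in and decoded on the way out. A self-contained sketch of that round trip, using a plain Map&lt;String, ByteBuffer&gt; as a stand-in for the WebPage field (which keys by Utf8); the key names are examples:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Sketch of stashing parsed article fields into a metadata-style
// Map<String, ByteBuffer>. A String-keyed map stands in here for
// WebPage.metadata, which uses Utf8 keys.
public class ArticleMetadata {
    public static void put(Map<String, ByteBuffer> meta, String key, String value) {
        meta.put(key, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    public static String get(Map<String, ByteBuffer> meta, String key) {
        ByteBuffer buf = meta.get(key);
        if (buf == null) return null;
        byte[] bytes = new byte[buf.remaining()];
        buf.duplicate().get(bytes); // duplicate() so the buffer can be re-read later
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

In a real parse plugin the write side would go through the page object during parsing, and an indexing filter would read the values back to add fields (author, date, headline, ...) to the document sent to Solr.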
BTW, the Parser interface in trunk looks like 'public ParseResult getParse(Content content)', which seems more reasonable to me.
3. After the articles are indexed into Solr, another application can query them by 'date' and then store the article information into MySQL. My questions here are: can Nutch store the articles directly into MySQL? Or can I write a plugin to customize the indexing behavior?
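For what it's worth, Nutch 2.x persists its web table through Apache Gora, which has pluggable storage backends, including a SQL one, so the crawl data itself can live in MySQL rather than only being indexed into Solr. A sketch of what that could look like in conf/gora.properties; the property names follow the gora-sql store, and the connection values are placeholders to replace with your own:

```
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch
gora.sqlstore.jdbc.user=nutch_user
gora.sqlstore.jdbc.password=secret
```

Note this stores Nutch's own WebPage records (including your metadata fields), not an arbitrary article table; if you need your own MySQL schema, the query-Solr-then-insert approach you describe is the simpler route.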
Is Nutch a good choice for my purpose? If not, can you suggest another good-quality framework/library? Thanks for your help.
If article extraction from a few websites is all that you are looking for, then check out http://www.crawl-anywhere.com/
It comes with an admin UI where you can specify that you want to use the Boilerpipe article extractor (which is great). You can also specify, by URL pattern matching, which pages you want crawled versus which pages you want crawled AND indexed.
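Boilerpipe classifies text blocks with trained shallow-text features to separate article content from navigation and ads. Purely to illustrate the problem it solves, here is a naive tag-stripping sketch; it is nowhere near Boilerpipe's quality, and the regexes are a toy, not a real HTML parser:

```java
import java.util.regex.Pattern;

// Toy illustration of article text extraction: strip script/style
// blocks and tags, then collapse whitespace. Real extractors such as
// Boilerpipe additionally classify text blocks to drop navigation,
// ads, and other boilerplate.
public class NaiveExtractor {
    private static final Pattern SCRIPT_STYLE =
        Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern SPACE = Pattern.compile("\\s+");

    public static String extract(String html) {
        String text = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return SPACE.matcher(text).replaceAll(" ").trim();
    }
}
```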