I have followed the article https://wiki.apache.org/nutch/NutchTutorial and set up Apache Nutch + Solr, but I want to check whether I have understood correctly how the Nutch steps work.
1). Inject: In this step, Nutch reads the URL list from the given seed.txt, compares the URLs against the regex-urlfilter patterns and updates the crawldb with the URLs that pass.
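(For reference, the inject command from the tutorial is along the lines of bin/nutch inject crawl/crawldb urls, where the urls directory contains seed.txt.)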
2). Generate: bin/nutch generate crawl/crawldb crawl/segments. Nutch takes URLs from the crawldb and creates a fetch list of the URLs that are ready to be fetched. It takes options such as -topN and a time gap, then creates a directory named with the current timestamp under segments.
I believe that in the first two steps there is no interaction with the internet; everything happens locally.
Q: Where is the fetch list kept?
3). Fetch: bin/nutch fetch crawl/segments/
Fetch works through the fetch list, downloads the content (and URLs) from the given URLs and keeps it somewhere.
Q: Does fetch read the whole page at each URL (text + other URLs)? Q: Where does Nutch keep the fetched data?
4). Parse: bin/nutch parse crawl/segments/
It parses the entries.
Q: What is meant by parse here? Q: Where can I find the result of this step?
5). bin/nutch updatedb crawl/crawldb crawl/segments/
When this is complete, Nutch updates the database with the results of the fetch.
Q: Does it update the crawldb with parsed data only, or with something else as well?
Please clear my doubts.
Your assumptions for the first and second steps are correct. However, you need to understand how the whole workflow takes place. When Nutch fetches URLs, it downloads data such as web pages or images as binary content and stores it in the segments as crawl data, using a class named Content.
Later, in the parsing step, the stored Content objects are parsed into other data structures: ParseText (the extracted text) and ParseData (the outlinks and metadata, if available). These are written back into the segments to be processed by the next job in the batch. After this step comes the crawldb update job; here the links found in the previous step are put back into the crawldb to update the page scores and link details.
At the indexing step, the information from the parsed data in the segments is structured into fields. Nutch uses a class named "NutchDocument" to hold this structured data; the Nutch documents are put back into the segments to be processed in the next step. Lastly, Nutch sends the Nutch documents to an indexing backend such as Solr or Elasticsearch. This is the last step; at this stage you can remove the segments if you do not want to send them to the indexing backend again. In other words, this is the flow of the data:
seed list -> injected URLs -> crawl items (simply the URLs) -> Content -> parsed data -> Nutch documents.
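For that final step, assuming a fairly recent 1.x release and an example segment name, the indexing command looks something like:

bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/20230101123456

(older releases instead use something like bin/nutch solrindex http://localhost:8983/solr/nutch crawl/crawldb -linkdb crawl/linkdb crawl/segments/20230101123456, with the Solr URL on the command line; in newer ones the Solr URL is set in the index writer configuration). The -linkdb part is optional and assumes you have built a linkdb with bin/nutch invertlinks crawl/linkdb -dir crawl/segments.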
I hope that answers some of your questions.
Your understanding of the first two steps, inject and generate, is right.
I am answering your questions step by step: a fetch list is the list of URLs to be fetched in the current iteration of the crawl. You can limit its overall size with the -topN option of the generate command, and cap the number of URLs per host or domain with the generate.max.count property. The generated fetch list is stored in the crawl_generate directory inside its corresponding segment. You won't, however, be able to read it directly, as it is stored in a binary format.
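If you want to inspect it, you can dump a segment in readable form with the readseg tool. For example (the segment name below is just a placeholder, and the exact -no* flags may differ slightly between versions):

bin/nutch readseg -list -dir crawl/segments
bin/nutch readseg -dump crawl/segments/20230101123456 generate_dump -nocontent -nofetch -noparse -noparsedata -noparsetext

The first command lists the segments, and the second writes a plain-text dump containing only the crawl_generate data, i.e. the generated fetch list.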
After the generate step comes the fetch step, where the URLs in the fetch list are fetched from the web. The raw fetched content is stored in the content directory of the segment, and the fetch status of each URL in its crawl_fetch directory.
Once the URLs are fetched, the content is parsed to extract the text, outlinks, metadata, etc. The output of the parsing step is in the crawl_parse, parse_data and parse_text directories of the segment.
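As above, you can dump the parsed output with readseg if you want to look at it, e.g. (segment name again a placeholder):

bin/nutch readseg -dump crawl/segments/20230101123456 parse_dump -nocontent -nofetch -nogenerate -noparse

which keeps only parse_data and parse_text in the dump, so you can see the extracted text, outlinks and metadata.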
Once parsing is complete, we update the crawldb with the newly found links from the recently fetched URLs. The crawldb only contains URLs and information about them, such as fetch status, score, modified time, etc. You can think of it as a database that stores information about URLs.
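You can inspect the crawldb with the readdb tool, for example:

bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump crawldb_dump

The first prints summary statistics (how many URLs are unfetched, fetched, gone, etc.), and the second writes a plain-text dump (crawldb_dump is just an example output directory) with the status, score and fetch time of every URL.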
When a new iteration of the crawl starts, the newly added URLs from the crawldb are selected into the next fetch list and the process continues.
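Putting it together, one round of the crawl is roughly the following sequence (the paths and the -topN value are only examples; newer Nutch releases also ship a bin/crawl wrapper script that runs these rounds for you, with usage that differs between versions):

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)   # pick the newest segment
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT

Repeat this block for as many rounds as you want; each round picks up the URLs that were added to the crawldb in the previous round.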