 

How can I group data scraped from multiple pages, using Scrapy, into one Item?

Tags: python, scrapy

I'm trying to collect a few pieces of information about a bunch of different web sites. I want to produce one Item per site that summarizes the information I found across that site, regardless of which page(s) I found it on.

I feel like this should be an item pipeline, like the duplicates filter example, except I need the final contents of the Item, not the results from the first page the crawler examined.

So I tried using request.meta to pass a single partially-filled Item through the various Requests for a given site. To make that work, I had to have my parse callback return exactly one new Request per call until it had no more pages to visit, then finally return the finished Item. That's a pain if I find multiple links I want to follow, and it breaks entirely if the scheduler throws away one of the requests due to a link cycle.
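Roughly what that looks like, as a sketch (the item fields, selectors, and spider name here are just placeholders, not my real code):

    import scrapy

    class SiteItem(scrapy.Item):
        # placeholder fields for a per-site summary
        domain = scrapy.Field()
        emails = scrapy.Field()

    class SiteSpider(scrapy.Spider):
        name = "site_summary"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # reuse the partially-filled item if an earlier callback passed one along
            item = response.meta.get("item") or SiteItem(domain="example.com", emails=[])
            item["emails"] = item["emails"] + response.css('a[href^="mailto:"]::attr(href)').getall()

            links = response.css("a::attr(href)").getall()
            if links:
                # forced to follow exactly one link per callback so the item
                # is only yielded once, at the end of the chain
                yield response.follow(links[0], callback=self.parse, meta={"item": item})
            else:
                yield item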

The only other approach I can see is to dump the spider output to json-lines and post-process it with an external tool. But I'd prefer to fold it into the spider, preferably in a middleware or item pipeline. How can I do that?

asked Apr 06 '13 by Jamey Sharp


People also ask

How do I store scraped data in a CSV file with Scrapy?

The first and simplest way to create a CSV file of the data you have scraped is to define an output path when starting your spider on the command line. To save to a CSV file, add the -o flag to the scrapy crawl command, followed by the file path you want to save the file to.
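For example (the spider and file names here are placeholders):

    scrapy crawl myspider -o results.csv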

Which is better, Scrapy or BeautifulSoup?

Scrapy is a more robust, feature-complete, more extensible, and more actively maintained web scraping tool. Scrapy allows you to crawl, extract, and store a full website. BeautifulSoup, on the other hand, only allows you to parse HTML and extract the information you're looking for.

Why is Scrapy faster than BeautifulSoup?

Scrapy is incredibly fast. Its ability to send asynchronous requests makes it hands-down faster than BeautifulSoup. This means that you'll be able to scrape and extract data from many pages at once. BeautifulSoup doesn't have the means to crawl and scrape pages by itself.


1 Answer

How about this ugly solution?

Define a dictionary (a defaultdict(list)) on a pipeline for storing per-site data. In process_item you can simply append dict(item) to the list for that site and raise a DropItem exception. Then, in the close_spider method, you can dump the data wherever you want.
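Something along these lines, as a rough sketch (the "domain" key and the output filename are placeholders for whatever your items actually use):

    import json
    from collections import defaultdict

    from scrapy.exceptions import DropItem

    class PerSitePipeline:
        def open_spider(self, spider):
            # one list of partial items per site
            self.items_by_site = defaultdict(list)

        def process_item(self, item, spider):
            # stash the item under its site key, then drop it so it never
            # reaches the normal feed exports
            self.items_by_site[item["domain"]].append(dict(item))
            raise DropItem("collected for per-site aggregation")

        def close_spider(self, spider):
            # dump the grouped data wherever you want
            with open("sites_summary.json", "w") as f:
                json.dump(self.items_by_site, f, indent=2)

Enable it through ITEM_PIPELINES in your settings as usual.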

Should work in theory, but I'm not sure that this solution is the best one.

answered Nov 06 '22 by alecxe