I'm trying to collect a few pieces of information about a bunch of different web sites. I want to produce one Item
per site that summarizes the information I found across that site, regardless of which page(s) I found it on.
I feel like this should be an item pipeline, like the duplicates filter example, except I need the final contents of the Item, not the results from the first page the crawler examined.
So I tried using request.meta to pass a single partially-filled Item through the various Requests for a given site. To make that work, I had to have my parse callback return exactly one new Request per call until it had no more pages to visit, then finally return the finished Item. That is a pain if I find multiple links I want to follow, and it breaks entirely if the scheduler throws away one of the requests due to a link cycle.
The only other approach I can see is to dump the spider output to json-lines and post-process it with an external tool. But I'd prefer to fold it into the spider, preferably in a middleware or item pipeline. How can I do that?
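(For what it's worth, the external post-processing route is only a few lines. A sketch, assuming each json-lines record carries a hypothetical "site" field identifying which site it came from:)

```python
import json
from collections import defaultdict

def merge_per_site(in_path, out_path, key="site"):
    """Merge all JSON-lines records that share the same site key into one record each."""
    merged = defaultdict(dict)
    with open(in_path) as f:
        for line in f:
            rec = json.loads(line)
            # Later pages' values overwrite earlier ones for the same field.
            merged[rec[key]].update(rec)
    with open(out_path, "w") as f:
        for rec in merged.values():
            f.write(json.dumps(rec) + "\n")
```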
How about this ugly solution?
Define a dictionary (defaultdict(list)) in a pipeline for storing per-site data. In process_item you can append dict(item) to the list for that item's site and raise a DropItem exception. Then, in the close_spider method, you can dump the data wherever you want.
It should work in theory, but I'm not sure this solution is the best one.
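A minimal sketch of that pipeline, assuming each item carries a "site" field identifying which site it came from (that field name, and the sites.jl output path, are made up for illustration). The try/except fallback is only there so the sketch runs even without Scrapy installed:

```python
import json
from collections import defaultdict

try:
    from scrapy.exceptions import DropItem
except ImportError:  # stand-in so this sketch runs without Scrapy installed
    class DropItem(Exception):
        pass

class SiteAggregationPipeline:
    """Collect partial items per site; emit one merged record per site at close."""

    def __init__(self):
        self.sites = defaultdict(list)

    def process_item(self, item, spider):
        d = dict(item)
        self.sites[d["site"]].append(d)
        # Drop the partial item so it never reaches the feed exporter.
        raise DropItem("stored for per-site aggregation")

    def close_spider(self, spider):
        # Merge each site's partial dicts and dump them wherever you want,
        # e.g. as JSON lines.
        with open("sites.jl", "w") as f:
            for site, parts in self.sites.items():
                merged = {}
                for part in parts:
                    merged.update(part)
                f.write(json.dumps(merged) + "\n")
```

Enable it in settings.py via ITEM_PIPELINES as with any other pipeline; because process_item always raises DropItem, the merged output comes only from close_spider.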