 

A web crawler in a self-contained Python file

I have found lots of Scrapy tutorials (such as this good tutorial) that all require the steps listed below. The result is a project with lots of files (a scrapy.cfg file + some .py files + a specific folder structure).

How can I make the steps listed below work as a self-contained Python file that can be run with python mycrawler.py?

(That is, instead of a full project with lots of files, .cfg files, etc., and having to use scrapy crawl myproject -o myproject.json. By the way, it seems that scrapy is a new shell command; is this true?)

Note: there is an existing answer that could address this question, but unfortunately it is deprecated and no longer works.


1) Create a new scrapy project with scrapy startproject myproject

2) Define the data structure with Item like this:

from scrapy.item import Item, Field

class MyItem(Item):
    title = Field()
    link = Field()
    ...

3) Define the crawler with (a typical parse body is sketched after step 4):

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "myproject"
    allowed_domains = ["example.com"] 
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        ...

4) Run with:

scrapy crawl myproject -o myproject.json
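
For reference, the parse body elided in step 3 usually just builds items from selector calls. A minimal sketch using the same (old) HtmlXPathSelector API, with placeholder XPath expressions for a hypothetical page:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    item = MyItem()
    # placeholder XPath; adapt it to the real site's markup
    item['title'] = hxs.select('//title/text()').extract()
    item['link'] = response.url
    yield item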
asked Sep 30 '22 by Basj

2 Answers

You can run a Scrapy spider as a single script, without creating a project, by using the runspider command. Is this what you wanted?

# myscript.py
from scrapy.item import Item, Field
from scrapy import Spider

class MyItem(Item):
    title = Field()
    link = Field()

class MySpider(Spider):
    name = 'samplespider'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        item = MyItem()
        # extract() returns a list of all matching text nodes
        item['title'] = response.xpath('//h1/text()').extract()
        item['link'] = response.url
        yield item

Now you can run this with scrapy runspider myscript.py -o out.json
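
If you want the file to run with plain python myscript.py (no scrapy command at all), you can also drive the crawl from the script itself with Scrapy's CrawlerProcess. A minimal sketch, appended to the same myscript.py; note that the feed settings keys below are the older style and may differ in newer Scrapy versions:

# appended to myscript.py; run with: python myscript.py
from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    process = CrawlerProcess(settings={
        # older-style feed settings; newer Scrapy uses the FEEDS dict instead
        'FEED_FORMAT': 'json',
        'FEED_URI': 'out.json',
    })
    process.crawl(MySpider)
    process.start()  # blocks until the crawl finishes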

answered Oct 02 '22 by pad


scrapy is not a built-in Unix command; it is just an executable installed on your system, like python, javac, gcc, etc.
Because you are using a framework, you have to use the commands that the framework provides. One thing you can do is create a bash script (see the sketch below) and simply execute it whenever you need it, or invoke it from another program.
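
For example, a wrapper script along these lines (assuming the runspider invocation from the first answer):

#!/bin/bash
# run the self-contained spider and write the items to out.json
scrapy runspider myscript.py -o out.json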

Alternatively, you can write a crawler yourself using urllib3; it's simple.
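
A minimal sketch of that approach, fetching a single page with urllib3 and collecting its links with the standard-library HTMLParser (this is just a starting point, not a full crawler; the URL is a placeholder):

import urllib3
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    # collects the href attribute of every <a> tag it sees
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

http = urllib3.PoolManager()
response = http.request('GET', 'http://www.example.com')
parser = LinkParser()
parser.feed(response.data.decode('utf-8', errors='replace'))
print(parser.links)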

answered Oct 02 '22 by aibotnet