I'm a little new to Python and very new to Scrapy.
I've set up a spider to crawl and extract all the information I need. However, I need to pass a .txt file of URLs to the start_urls variable.
For example:
class LinkChecker(BaseSpider):
    name = 'linkchecker'
    start_urls = []  # Here I want the list of URLs to crawl, read from a text file I pass via the command line.
I've done a little bit of research and keep coming up empty-handed. I've seen this type of example (How to pass a user defined argument in scrapy spider), but I don't think that will work for passing a text file.
crawl - crawls data using the spider.
check - checks the items returned by the crawl command.
list - displays the list of available spiders present in the project.
edit - lets you edit a spider using the editor.
start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define rules for it.
CrawlSpider is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules.
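For instance, a minimal CrawlSpider that follows every link it finds could look like the sketch below; the domain, start URL, and callback name are placeholders rather than anything taken from the question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RecursiveLinkChecker(CrawlSpider):
    name = 'recursive_linkchecker'
    allowed_domains = ['example.com']      # placeholder domain
    start_urls = ['http://example.com/']   # placeholder start page

    # Follow every link on each page and hand the response to parse_item.
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Record the URL and HTTP status of each page that was crawled.
        yield {'url': response.url, 'status': response.status}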
Run your spider with the -a option, like:

scrapy crawl myspider -a filename=text.txt

Then read the file in the __init__ method of the spider and define start_urls:
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, filename=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        if filename:
            # One URL per line; strip trailing newlines so the requests are valid.
            with open(filename, 'r') as f:
                self.start_urls = [url.strip() for url in f if url.strip()]
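On newer Scrapy versions, where BaseSpider has been replaced by scrapy.Spider, the same idea can be written with start_requests instead of touching start_urls at all. This is only a sketch under that assumption; the filename argument is still the one passed with -a:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, filename=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.filename = filename

    def start_requests(self):
        # Read one URL per line from the file passed via -a filename=...
        with open(self.filename) as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Placeholder callback: record the URL and status of each response.
        yield {'url': response.url, 'status': response.status}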
Hope that helps.
You could simply read in the .txt file:

with open('your_file.txt') as f:
    start_urls = f.readlines()

If you end up with trailing newline characters, try:

with open('your_file.txt') as f:
    start_urls = [url.strip() for url in f.readlines()]
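A slightly tidier variant, assuming the same one-URL-per-line layout, splits on line boundaries and drops blank lines in one pass:

with open('your_file.txt') as f:
    # splitlines() removes the newline characters for you
    start_urls = [url.strip() for url in f.read().splitlines() if url.strip()]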
Hope this helps
If your URLs are separated by line breaks, this function will give you the list of URLs:

def get_urls(filename):
    # split() breaks the file contents on any whitespace, so one URL per line works.
    with open(filename) as f:
        return f.read().split()
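For example, it could be wired into the spider from the question roughly like this, assuming the filename is passed with -a as in the first answer:

class LinkChecker(BaseSpider):
    name = 'linkchecker'

    def __init__(self, filename=None, *args, **kwargs):
        super(LinkChecker, self).__init__(*args, **kwargs)
        if filename:
            self.start_urls = get_urls(filename)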