I'm a little new to Python and very new to Scrapy.
I've set up a spider to crawl and extract all the information I need. However, I need to pass a .txt file of URLs to the start_urls variable.
For example:
class LinkChecker(BaseSpider):
    name = 'linkchecker'
    start_urls = []  # Here I want the list of URLs to crawl, read from a text file I pass via the command line.
I've done a little bit of research and keep coming up empty-handed. I've seen this type of example (How to pass a user defined argument in scrapy spider), but I don't think that will work for passing a text file.
crawl - crawls data using the spider.
check - checks the items returned by the crawl command.
list - displays the list of available spiders present in the project.
edit - lets you edit a spider using the editor.
start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define rules for it.
CrawlSpider is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules.
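For instance, a minimal CrawlSpider that follows every link it finds could look like the sketch below; the domain, start URL, and callback name are placeholders rather than anything taken from the question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RecursiveLinkChecker(CrawlSpider):
    name = 'recursive_linkchecker'
    allowed_domains = ['example.com']      # placeholder domain
    start_urls = ['http://example.com/']   # placeholder start page

    # Follow every link on each page and hand the response to parse_item.
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Record the URL and HTTP status of each page that was crawled.
        yield {'url': response.url, 'status': response.status}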
Run your spider with the -a option, like:

scrapy crawl myspider -a filename=text.txt

Then read the file in the __init__ method of the spider and define start_urls:
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, filename=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        if filename:
            # One URL per line; strip trailing newlines so the requests are valid.
            with open(filename, 'r') as f:
                self.start_urls = [url.strip() for url in f if url.strip()]
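On newer Scrapy versions, where BaseSpider has been replaced by scrapy.Spider, the same idea can be written with start_requests instead of touching start_urls at all. This is only a sketch under that assumption; the filename argument is still the one passed with -a:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, filename=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.filename = filename

    def start_requests(self):
        # Read one URL per line from the file passed via -a filename=...
        with open(self.filename) as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Placeholder callback: record the URL and status of each response.
        yield {'url': response.url, 'status': response.status}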
Hope that helps.
You could simply read in the .txt file:

with open('your_file.txt') as f:
    start_urls = f.readlines()

If you end up with trailing newline characters, try:

with open('your_file.txt') as f:
    start_urls = [url.strip() for url in f.readlines()]
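A slightly tidier variant, assuming the same one-URL-per-line layout, splits on line boundaries and drops blank lines in one pass:

with open('your_file.txt') as f:
    # splitlines() removes the newline characters for you
    start_urls = [url.strip() for url in f.read().splitlines() if url.strip()]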
Hope this helps
If your URLs are separated by line breaks, this function will give you the list of URLs:

def get_urls(filename):
    # split() breaks the file contents on any whitespace, so one URL per line works.
    with open(filename) as f:
        return f.read().split()
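For example, it could be wired into the spider from the question roughly like this, assuming the filename is passed with -a as in the first answer:

class LinkChecker(BaseSpider):
    name = 'linkchecker'

    def __init__(self, filename=None, *args, **kwargs):
        super(LinkChecker, self).__init__(*args, **kwargs)
        if filename:
            self.start_urls = get_urls(filename)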