I have a 270MB dataset (10000 html files) on my computer. Can I use Scrapy to crawl this dataset locally? How?
Scrapy is fast: it sends requests asynchronously, so it can fetch and extract data from many pages at once. BeautifulSoup, by contrast, is only a parser and has no means of crawling pages on its own.
The bottom line: Scrapy is the better fit for larger projects with complex and ever-changing data collection needs, while Beautiful Soup suits individuals and small teams with very specific needs and limited technical resources.
Scrapy, being one of the most popular web scraping frameworks, is also a great choice if you want to learn how to scrape data from the web.
Scrapy is an open-source framework for extracting data from websites. It is fast, simple, and extensible, and every data scientist should be familiar with it, since gathering data this way is a common need.
If you truly want to host it locally and use Scrapy, you can serve the dataset by navigating to the directory it's stored in and running Python's built-in HTTP server (port 8000 shown below):
python -m SimpleHTTPServer 8000    # Python 2
python3 -m http.server 8000        # Python 3 equivalent
Then just point your spider's start_urls at http://127.0.0.1:8000 (as in the sketch below) and run the spider by its name:
$ scrapy crawl local_html
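For reference, a minimal spider sketch for that setup might look like this; the spider name local_html, the link-following logic, and the title extraction are illustrative assumptions, not part of the original answer:

import scrapy

class LocalHtmlSpider(scrapy.Spider):
    # hypothetical spider name, used with "scrapy crawl local_html"
    name = "local_html"
    # the locally served dataset from the HTTP-server step above
    start_urls = ["http://127.0.0.1:8000"]

    def parse(self, response):
        # the directory listing links to each HTML file; follow every link
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_page)

    def parse_page(self, response):
        # extract whatever you need; the URL and <title> are just examples
        yield {"url": response.url, "title": response.css("title::text").get()}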
An alternative is to point Scrapy at the files directly, using file:// URIs in the spider's start_urls instead of an HTTP address (assuming the files live somewhere like /home/sagi/html_files on a *nix system), as sketched below.
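In that case the HTTP server isn't needed at all, since Scrapy ships with a download handler for the file:// scheme. A sketch of that variant, assuming the files sit in /home/sagi/html_files and that you only want each page's title:

import glob
import scrapy

class LocalFileSpider(scrapy.Spider):
    name = "local_files"
    # build a file:// URI for every HTML file in the dataset directory
    start_urls = ["file://" + path
                  for path in glob.glob("/home/sagi/html_files/*.html")]

    def parse(self, response):
        # each response here is one local HTML file
        yield {"file": response.url, "title": response.css("title::text").get()}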
Once you've set up your scraper (the dirbot example project is a useful reference), just run the crawler by its spider name:
$ scrapy crawl local_html
If the links in the HTML files are absolute rather than relative, though, these approaches may not work well; you would need to adjust the files yourself.
Go to your dataset folder:
import os

# list every file in the current working directory
files = os.listdir(os.getcwd())
for file in files:
    with open(file, "r") as f:
        page_content = f.read()
        # do whatever you want with page_content here,
        # e.g. parse it with lxml or Beautiful Soup (see the sketch below)
No need to go for Scrapy!
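If you go this route, the parsing hinted at in the comment could look something like the following sketch with Beautiful Soup; the directory name html_files and the title/link extraction are placeholders, not part of the original answer:

import os
from bs4 import BeautifulSoup

# hypothetical path to the folder holding the 10000 HTML files
dataset_dir = "html_files"

for filename in os.listdir(dataset_dir):
    if not filename.endswith(".html"):
        continue
    with open(os.path.join(dataset_dir, filename), "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # pull out whatever you need, e.g. the page title and all link targets
    title = soup.title.string if soup.title else None
    links = [a.get("href") for a in soup.find_all("a")]
    print(filename, title, len(links))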