I have a 270MB dataset (10000 html files) on my computer. Can I use Scrapy to crawl this dataset locally? How?
Scrapy is fast: it sends requests asynchronously, so it can fetch and extract data from many pages at once. BeautifulSoup, by contrast, is only a parser and has no means of crawling pages on its own.
The bottom line: Scrapy is the better fit for larger projects with complex and ever-changing data collection needs, while Beautiful Soup suits individuals and small teams with very specific needs and limited technical resources.
Scrapy, being one of the most popular web scraping frameworks, is also a great choice if you want to learn how to scrape data from the web.
Scrapy is an open-source framework for extracting data from websites. It is fast, simple, and extensible, and every data scientist should be familiar with it, since gathering data this way is a common need.
If you truly want to host it locally and use Scrapy, you can serve the dataset by navigating to the directory it's stored in and running Python's built-in HTTP server (port 8000 shown below):
python -m SimpleHTTPServer 8000    # Python 2
python3 -m http.server 8000        # Python 3 equivalent
Then just point your spider's start_urls at http://127.0.0.1:8000 (as in the sketch below) and run the spider by its name:
$ scrapy crawl local_html
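For reference, a minimal spider sketch for that setup might look like this; the spider name local_html, the link-following logic, and the title extraction are illustrative assumptions, not part of the original answer:

import scrapy

class LocalHtmlSpider(scrapy.Spider):
    # hypothetical spider name, used with "scrapy crawl local_html"
    name = "local_html"
    # the locally served dataset from the HTTP-server step above
    start_urls = ["http://127.0.0.1:8000"]

    def parse(self, response):
        # the directory listing links to each HTML file; follow every link
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_page)

    def parse_page(self, response):
        # extract whatever you need; the URL and <title> are just examples
        yield {"url": response.url, "title": response.css("title::text").get()}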
An alternative is to point Scrapy at the files directly, using file:// URIs in the spider's start_urls instead of an HTTP address (assuming the files live somewhere like /home/sagi/html_files on a *nix system), as sketched below.
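In that case the HTTP server isn't needed at all, since Scrapy ships with a download handler for the file:// scheme. A sketch of that variant, assuming the files sit in /home/sagi/html_files and that you only want each page's title:

import glob
import scrapy

class LocalFileSpider(scrapy.Spider):
    name = "local_files"
    # build a file:// URI for every HTML file in the dataset directory
    start_urls = ["file://" + path
                  for path in glob.glob("/home/sagi/html_files/*.html")]

    def parse(self, response):
        # each response here is one local HTML file
        yield {"file": response.url, "title": response.css("title::text").get()}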
Once you've set up your scraper (the dirbot example project is a useful reference), just run the crawler by its spider name:
$ scrapy crawl local_html
If the links in the HTML files are absolute rather than relative, though, these approaches may not work well; you would need to adjust the files yourself.
Go to your dataset folder:
import os

# list every file in the current working directory
files = os.listdir(os.getcwd())
for file in files:
    with open(file, "r") as f:
        page_content = f.read()
        # do whatever you want with page_content here,
        # e.g. parse it with lxml or Beautiful Soup (see the sketch below)
No need to go for Scrapy!
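If you go this route, the parsing hinted at in the comment could look something like the following sketch with Beautiful Soup; the directory name html_files and the title/link extraction are placeholders, not part of the original answer:

import os
from bs4 import BeautifulSoup

# hypothetical path to the folder holding the 10000 HTML files
dataset_dir = "html_files"

for filename in os.listdir(dataset_dir):
    if not filename.endswith(".html"):
        continue
    with open(os.path.join(dataset_dir, filename), "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # pull out whatever you need, e.g. the page title and all link targets
    title = soup.title.string if soup.title else None
    links = [a.get("href") for a in soup.find_all("a")]
    print(filename, title, len(links))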