
Python Scrapy on offline (local) data

I have a 270MB dataset (10000 html files) on my computer. Can I use Scrapy to crawl this dataset locally? How?

Sagi asked Oct 15 '13



2 Answers

SimpleHTTP Server Hosting

If you truly want to host the dataset locally and use Scrapy, you can serve it by navigating to the directory it's stored in and running Python's built-in HTTP server (port 8000 shown below):

$ python -m SimpleHTTPServer 8000   # Python 2
$ python3 -m http.server 8000       # Python 3

Then point your spider's start_urls at http://127.0.0.1:8000 and run it by its spider name:

$ scrapy crawl myspider

file://

An alternative is to point Scrapy at the files directly: Scrapy's downloader supports the file:// scheme, so you can list the files in your spider's start_urls instead of serving them:

start_urls = ['file:///home/sagi/html_files/page.html']  # Assuming you're on a *nix system; each URL must name a file

Note that `scrapy crawl` takes a spider name, not a URL, so the file:// URLs belong in the spider itself.
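Since listing 10000 files by hand is impractical, a helper like the following sketch can build the start_urls list from the dataset directory (the function name and directory argument are assumptions for illustration):

```python
from pathlib import Path


def local_file_urls(directory):
    """Return sorted file:// URLs for every .html file in `directory`,
    suitable for assigning to a spider's start_urls."""
    return sorted(
        p.resolve().as_uri()  # e.g. file:///home/sagi/html_files/page.html
        for p in Path(directory).glob("*.html")
    )
```

In the spider you would then write, for example, `start_urls = local_file_urls("/home/sagi/html_files")`.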

Wrapping up

Once you've set up your spider for Scrapy (see the dirbot example project), just run the crawler by its spider name:

$ scrapy crawl myspider

If links in the HTML files are absolute rather than relative, though, they may not resolve against the local copy. You would need to adjust the files yourself.
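One way to adjust the files is to rewrite absolute links into relative ones before crawling. The sketch below does this with a regular expression; the domain "http://example.com" is a placeholder assumption for whatever site the pages originally came from:

```python
import re


def relativize_links(html, domain="http://example.com"):
    """Rewrite href/src attributes that point at `domain` (an assumed
    placeholder for the original site) into relative paths, so they
    resolve against the local copy instead."""
    return re.sub(r'(href|src)="' + re.escape(domain) + r'/?', r'\1="', html)
```

Run it once over each file and save the result before starting the crawl.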

Kyle Kelley answered Sep 30 '22


Go to your dataset folder and read the files directly:

import os

files = os.listdir(os.getcwd())
for file in files:
    if not file.endswith(".html"):  # skip anything that isn't an HTML file
        continue
    with open(file, "r", encoding="utf-8") as f:
        page_content = f.read()
        # Do whatever you want with page_content here,
        # e.g. parse it with lxml or Beautiful Soup.

No need for Scrapy!
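If neither lxml nor Beautiful Soup is installed, the standard library's html.parser is enough for simple extraction. As a hedged sketch, here is a parser that pulls the <title> text out of page_content (the class and function names are illustrative):

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(page_content):
    parser = TitleParser()
    parser.feed(page_content)
    return parser.title
```

Inside the loop above you would call extract_title(page_content) for each file.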

Ratan Kumar answered Sep 30 '22