
Scraping an HTML file saved on the local system

Tags:

python

scrapy

For example, I have a site, "www.example.com". I want to scrape the HTML of this site from a copy saved on my local system, so for testing I saved that page on my desktop as example.html.

I have written the spider code for this as below:

class ExampleSpider(BaseSpider):
    name = "example"
    start_urls = ["example.html"]

    def parse(self, response):
        print response
        hxs = HtmlXPathSelector(response)

But when I run the above code, I get the following error:

ValueError: Missing scheme in request url: example.html 

My intention is to scrape the example.html file, which contains the HTML of www.example.com saved on my local system.

Can anyone suggest how to reference that example.html file in start_urls?

Thanks in advance

Shiva Krishna Bavandla asked Jun 05 '12 10:06



1 Answer

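Every entry in start_urls must include a URL scheme, which is why a bare file name raises the error. You can crawl a local file using a URL of the following form: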

 file:///path/to/file.html 
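
As a minimal sketch of how such a URL could be built rather than typed by hand, the snippet below uses only the Python standard library. The helper name local_file_url is hypothetical (it is not part of Scrapy), and it assumes Python 3 and that example.html is resolved against the current working directory.

```python
from pathlib import Path
from urllib.parse import urlparse


def local_file_url(path):
    """Return an absolute file:// URL for a path on the local file system.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    # as_uri() only works on absolute paths, so resolve the path first.
    return Path(path).expanduser().resolve().as_uri()


# "example.html" is the file name from the question; a relative name is
# resolved against the current working directory.
start_urls = [local_file_url("example.html")]

# A bare file name has an empty scheme, which is what the ValueError
# in the question complains about.
print(urlparse("example.html").scheme == "")

# The generated URL carries the "file" scheme.
print(urlparse(start_urls[0]).scheme)
```

The resulting list can be assigned to start_urls in the spider in place of ["example.html"].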
iodbh answered Sep 23 '22 21:09