For example, I have a site, "www.example.com".
I want to scrape the HTML of this site from a copy saved on my local system, so for testing I saved the page on my desktop as example.html.
I have written the spider code for this as below:
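    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    class ExampleSpider(BaseSpider):
        name = "example"
        # A bare filename: this is the value that triggers the error below
        start_urls = ["example.html"]

        def parse(self, response):
            print response
            hxs = HtmlXPathSelector(response)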
But when I run the above code, I get the following error:
ValueError: Missing scheme in request url: example.html
In the end, my intention is to scrape the example.html file, which contains the HTML of www.example.com saved on my local system.
Can anyone suggest how to reference that example.html file in start_urls?
Thanks in advance
The BeautifulSoup module in Python lets us scrape data from local HTML files. Web pages are sometimes stored locally (in an offline environment), and data may need to be extracted from them later.
Web scraping is not hard to understand, but you do need a basic grasp of HTML before you start. To find the right pieces of information, right-click the page and choose "Inspect." The HTML you see may be very long, but don't worry: you don't need to know HTML deeply to extract the data.
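As a rough illustration of the BeautifulSoup route, here is a minimal sketch that parses a saved page from disk. It assumes the third-party beautifulsoup4 package is installed. The temporary file and its contents are made up for the example and only stand in for your own example.html.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Stand-in for the saved page (hypothetical contents); in practice,
# point html_path at your own example.html instead.
html_path = Path(tempfile.mkdtemp()) / "example.html"
html_path.write_text(
    "<html><head><title>Example Domain</title></head>"
    "<body><h1>Example Domain</h1>"
    "<a href='https://www.example.com/more'>More</a></body></html>",
    encoding="utf-8",
)

# Parse the file straight from disk; no HTTP request is involved.
with html_path.open(encoding="utf-8") as fh:
    soup = BeautifulSoup(fh, "html.parser")

# Pull out the page title and every link target.
title = soup.title.string
links = [a["href"] for a in soup.find_all("a", href=True)]
print(title, links)
```

The built-in "html.parser" backend is used here so that no extra parser library is needed; other parsers may be a better fit for messy real-world pages.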
You can crawl a local file using a URL of the following form:
file:///path/to/file.html
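The "Missing scheme" error comes from the fact that a bare filename such as example.html has no URL scheme at all. One way to avoid typing the file:// URL by hand is to build it with Python's standard library. The sketch below assumes Python 3 and that example.html sits in the current working directory (a hypothetical location; adjust the path to wherever you saved the page).

```python
from pathlib import Path
from urllib.parse import urlparse

# A bare filename has no scheme, which is what the ValueError complains about.
bare_scheme = urlparse("example.html").scheme  # empty string

# Assumed location: example.html in the current working directory.
html_file = Path("example.html").resolve()

# as_uri() requires an absolute path and returns a file:// URL.
file_url = html_file.as_uri()
print(file_url)

# This list is the kind of value a spider's start_urls attribute expects.
start_urls = [file_url]
```

In the spider from the question, the resulting string would replace "example.html" inside start_urls; the rest of the class can stay as it is.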